![banner.png](banner.png)

---

<h2 style="color:#01386a; background-color:#e72564; padding: 10px; text-align:left; border: 1px solid #e72564;">1. Introduction to Pydantic</h2>


Pydantic is a Python library that uses type hints to define data models and validates incoming data against them at runtime. It helps ensure that your code works with data in the expected format, which reduces bugs and makes the code easier to maintain.

**Key Benefits:**
- **Automatic Data Validation:** Checks if the data conforms to the expected types.
- **Type Conversion:** Converts data into the correct type (e.g., turning `"30"` into `30`).
- **Informative Error Messages:** Provides detailed messages when data is invalid.
- **Improved Readability:** Uses Python type hints, making your code self-documenting.

In the next cells, we will see how to import Pydantic, define a simple data model, and observe both valid and invalid data examples.

In [31]:
# Importing the necessary components from Pydantic:
from pydantic import BaseModel, ValidationError

- `BaseModel`: The core class from Pydantic. When you create a data model, you inherit from `BaseModel`, which provides automatic validation and type conversion.
- `ValidationError`: The exception raised when the data provided does not match the schema defined in your model. It describes what went wrong during validation.

In [32]:
# Let's create a simple Pydantic model to see how it works.

# Define a simple model with two fields: name and age
class Person(BaseModel):
    name: str  # Expected to be a string
    age: int  # Expected to be an integer


# Valid instance: This should work fine because the data matches the expected types.
person = Person(name="Alice", age=30)
print("Valid Person:", person)

Valid Person: name='Alice' age=30


- **Valid Person Output:**  
  The valid instance `Person(name="Alice", age=30)` works correctly because the data matches the model's expectations. 

In [33]:
# Invalid instance: This will raise a validation error because the field names are incorrect.
try:
    # Here, 'name2' and 'age2' are used instead of the correct field names 'name' and 'age'.
    invalid_person = Person(name2=1234, age2="30")
except ValidationError as e:
    print("Validation Error:", e)

Validation Error: 2 validation errors for Person
name
  Field required [type=missing, input_value={'name2': 1234, 'age2': '30'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.10/v/missing
age
  Field required [type=missing, input_value={'name2': 1234, 'age2': '30'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.10/v/missing



- **Validation Error Output:**  
  When we try to create an instance using incorrect field names (`name2` and `age2`), Pydantic cannot find the required fields (`name` and `age`), so it raises a `ValidationError` reporting both as missing.

- Pydantic checks not only types but also the presence of every required field defined in the model. A missing field, including one that is absent because its name was misspelled, results in an error.
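
One detail of the error above is easy to miss: the misspelled names `name2` and `age2` are not reported themselves. By default Pydantic ignores unknown fields, so the failure comes only from the missing required ones. The following is a minimal sketch, assuming you also want unknown fields rejected; it uses `ConfigDict(extra="forbid")`, and the class name `StrictFieldsPerson` is a hypothetical name chosen for illustration.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class Person(BaseModel):
    name: str
    age: int


class StrictFieldsPerson(BaseModel):
    # Hypothetical variant: reject any field not declared on the model.
    model_config = ConfigDict(extra="forbid")

    name: str
    age: int


# Default behaviour: the unknown 'nickname' field is silently ignored.
person = Person(name="Alice", age=30, nickname="Al")
print("Default config:", person)

# With extra="forbid", the same input raises a ValidationError.
try:
    StrictFieldsPerson(name="Alice", age=30, nickname="Al")
    error_types = []
except ValidationError as e:
    error_types = [err["type"] for err in e.errors()]
    print("Error types:", error_types)
```

Whether to forbid extras is a design choice: ignoring them is forgiving toward upstream schema changes, while forbidding them surfaces typos in field names early.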

In [34]:
# Let's create a Person instance using a string for 'age'
person = Person(name="Alice", age="30")

print("Value of age:", person.age)
print("Type of age:", type(person.age))

Value of age: 30
Type of age: <class 'int'>


- Although we provided the age as the string `"30"`, Pydantic automatically converts it to the integer `30` based on the type definition in our Person model.
- This example demonstrates that our model enforces data consistency by converting types as needed.
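
The conversion shown here is Pydantic's default (lax) behaviour. When silent conversion is not wanted, it can be switched off. The sketch below assumes a model-wide `ConfigDict(strict=True)` setting; `StrictPerson` is a hypothetical name used only for this illustration.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class StrictPerson(BaseModel):
    # strict=True disables lax conversions such as "30" -> 30.
    model_config = ConfigDict(strict=True)

    name: str
    age: int


# A string is no longer accepted for an int field.
try:
    StrictPerson(name="Alice", age="30")
    strict_error_types = []
except ValidationError as e:
    strict_error_types = [err["type"] for err in e.errors()]
    print(e)

# A real integer still validates as before.
ok = StrictPerson(name="Alice", age=30)
print("Strict model accepted:", ok)
```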
  
Now that we have seen how a string input is automatically converted to an integer, we will explore further automatic type conversions in Section 1.1.


---

#### 1.1. Demonstrating Automatic Type Conversion

In this section, we'll explore how Pydantic automatically converts input data to match the expected types. Although our previous example showed converting a string to an integer, Pydantic supports several other conversions as well. Let's take a look at these behaviors.


In [35]:
# Example: Converting an integer to a float
class Measurement(BaseModel):
    value: float  # 'value' is expected to be a float


# Passing an integer for 'value'; Pydantic converts it to a float.
measurement = Measurement(value=42)
print("Measurement:", measurement)
print("Type of 'value':", type(measurement.value))

Measurement: value=42.0
Type of 'value': <class 'float'>


- When a field expects a float, providing an integer (like `42`) results in an automatic conversion to `42.0`.
- This conversion is particularly useful in contexts where calculations require a floating-point representation.

In [36]:
# Example: Converting a string to a boolean
class Flag(BaseModel):
    active: bool  # 'active' is expected to be a boolean


# Pydantic converts common string representations of booleans.
flag1 = Flag(active="true")
flag2 = Flag(active="False")
print("Flag conversion flag1:", flag1)
print("Flag conversion flag2:", flag2)

Flag conversion flag1: active=True
Flag conversion flag2: active=False


- Pydantic recognizes strings like `"true"` and `"False"` and converts them into their corresponding boolean values.
- This is especially useful when processing data from sources such as user inputs or external APIs where booleans might be represented as strings.
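
The set of accepted strings is limited, and anything outside it is rejected rather than guessed. As a rough sketch of where the boundary lies (the sample strings are our own choice, not an exhaustive list):

```python
from pydantic import BaseModel, ValidationError


class Flag(BaseModel):
    active: bool


# A few common spellings that Pydantic accepts in its default (lax) mode.
accepted = {
    text: Flag(active=text).active
    for text in ["true", "False", "yes", "no", "on", "off", "1", "0"]
}
print(accepted)

# An arbitrary string is not coerced; validation fails instead.
try:
    Flag(active="maybe")
    rejected_type = None
except ValidationError as e:
    rejected_type = e.errors()[0]["type"]
    print("Rejected with error type:", rejected_type)
```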


In [37]:
# Example: Parsing a date/time string into a datetime object
from datetime import datetime


class Event(BaseModel):
    timestamp: datetime  # 'timestamp' is expected to be a datetime object


# Provide an ISO 8601 formatted date/time string.
event = Event(timestamp="2023-02-22T12:34:56")
print("Event:", event)
print("Type of 'timestamp':", type(event.timestamp))

Event: timestamp=datetime.datetime(2023, 2, 22, 12, 34, 56)
Type of 'timestamp': <class 'datetime.datetime'>


- When a field expects a `datetime` object, Pydantic can automatically parse a well-formatted date/time string (ISO 8601) into a Python `datetime` instance.
- This conversion is essential for applications that handle time-based data, such as event logging, scheduling, or time series analysis.
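
Two practical points follow from this: time zone information in the string is preserved, and free-form date text is rejected. A small sketch, assuming the same `Event` model as above (the sample strings are illustrative):

```python
from datetime import datetime, timedelta

from pydantic import BaseModel, ValidationError


class Event(BaseModel):
    timestamp: datetime


# No offset in the string -> a naive datetime (tzinfo is None).
naive = Event(timestamp="2023-02-22T12:34:56")
print("Naive tzinfo:", naive.timestamp.tzinfo)

# A trailing 'Z' marks UTC -> a timezone-aware datetime.
aware = Event(timestamp="2023-02-22T12:34:56Z")
print("Aware UTC offset:", aware.timestamp.utcoffset())

# Free-form text is not parsed.
try:
    Event(timestamp="22nd of February")
    parsed = True
except ValidationError:
    parsed = False
print("Free-form text parsed:", parsed)
```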

**Overall, these examples illustrate how Pydantic's built-in type conversion capabilities help maintain data consistency and reduce the need for manual data cleaning.**


---

#### 1.2. Informative Error Messages

One of Pydantic's powerful features is its ability to provide clear and detailed error messages when data doesn't match the expected schema.

**Why It Matters:**
- **Clarity:** Detailed error messages help pinpoint which field has an issue and why.
- **Debugging:** They make it easier to diagnose and fix problems quickly.
- **User Feedback:** In user-facing applications, clear errors help guide users to provide correct data.


In [38]:
# Let's see an example of an informative error message.
# We'll deliberately pass an invalid value to trigger a validation error.

try:
    # Here, 'age' is expected to be an integer, but we pass a non-convertible string.
    invalid_person = Person(name="Charlie", age="not_a_number")
except ValidationError as e:
    print("Validation Error:")
    print(e)

Validation Error:
1 validation error for Person
age
  Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='not_a_number', input_type=str]
    For further information visit https://errors.pydantic.dev/2.10/v/int_parsing


- The error output clearly states that the `age` field is problematic.
- It shows the expected type, the provided value, and even a reference link for more details.
- This level of detail assists developers in quickly identifying and resolving data issues.
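
The same details are also available in structured form through `ValidationError.errors()`, which is convenient when errors need to be logged or returned to a caller instead of printed. A minimal sketch, reusing the `Person` model from earlier:

```python
from pydantic import BaseModel, ValidationError


class Person(BaseModel):
    name: str
    age: int


try:
    Person(name="Charlie", age="not_a_number")
    details = []
except ValidationError as e:
    # errors() returns a list of dicts, one per failed field.
    details = e.errors()

for err in details:
    # 'loc' is a tuple describing where the error occurred.
    field = ".".join(str(part) for part in err["loc"])
    print(f"{field}: {err['msg']} (type={err['type']}, input={err['input']!r})")
```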


---

#### 1.3. Improved Code Readability

Pydantic not only validates data but also improves the readability of your code by leveraging Python's type hints and clear model definitions.

**Benefits:**
- **Self-Documenting Code:**  
  Type annotations clearly show the expected data types.
- **Clear Structure:**  
  Models define expected fields, simplifying onboarding and reducing errors.
- **Maintainability:**  
  Centralized models localize changes, making updates easier.


In [39]:
# Example of a clear and concise Pydantic model
class User(BaseModel):
    username: str
    email: str
    age: int


# Creating a valid user instance
user = User(username="johndoe", email="john@example.com", age=25)
print("User:", user)

User: username='johndoe' email='john@example.com' age=25


- **Self-Documenting:**  
  The `User` definition shows at a glance that `username` and `email` are strings and `age` is an integer.
- **Clear Structure:**  
  All expected fields live in one class, so a new reader does not have to trace the data through the code to learn its shape.
- **Maintainability:**  
  Adding or changing a field starts in one place: the model definition.
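
Because the model is declared with type hints, Pydantic can also describe it in machine-readable form. The sketch below uses `model_json_schema()` on the same `User` model to show that the field names, types, and required fields are all derived from the class definition:

```python
import json

from pydantic import BaseModel


class User(BaseModel):
    username: str
    email: str
    age: int


# Generate a JSON Schema description directly from the type hints.
schema = User.model_json_schema()
print(json.dumps(schema, indent=2))
```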


---

### 2. Custom Validation and Advanced Model Configuration

While Pydantic's automatic validation is powerful, there are times when you need custom logic to enforce your data's correctness. In this section, we will:
- Learn how to add custom validators to your models.
- See how to perform cross-field validation.
- Explore how to handle complex data structures with nested models.

#### Understanding Custom Validators (Pydantic V2 Style)

Custom validators let you enforce rules beyond basic type checking. They allow you to implement additional logic—like ensuring a field's value meets specific criteria—during model initialization.

- **Decorator Usage:**  
  Use `@field_validator('field_name')` to attach a custom validation method to a specific field.

- **Execution Timing:**  
  Validators run during model initialization. They check the field's value and, if it doesn't meet the criteria, raise an error.

- **Error Handling:**  
  If the validator raises an error (e.g., via `ValueError`), Pydantic wraps this into a `ValidationError` that explains what went wrong.

In the example below, the `name_must_be_alpha` validator ensures that the `name` field contains only alphabetic characters. If the value contains non-alphabetic characters, a `ValueError` is raised, preventing the model instance from being created with invalid data.


In [40]:
from pydantic import field_validator


class Person(BaseModel):
    name: str
    age: int

    # Custom validator: ensure the name contains only alphabetic characters.
    @field_validator("name")
    @classmethod  # Field validators are class methods in Pydantic V2.
    def name_must_be_alpha(cls, value: str) -> str:
        if not value.isalpha():
            raise ValueError("Name must contain only alphabetic characters.")
        return value

- The `@field_validator('name')` decorator attaches the `name_must_be_alpha` method to the `name` field.
- During model initialization, this validator checks if the `name` value contains only alphabetic characters.
- If not, a `ValueError` is raised, which is then wrapped in a `ValidationError` by Pydantic.
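
A validator's return value is what ends up stored on the model, so the same mechanism can normalize data as well as reject it. The following is a sketch under that assumption; `NormalizedPerson` and `clean_name` are hypothetical names, and the stripping and title-casing are an illustrative rule rather than anything the original model does.

```python
from pydantic import BaseModel, field_validator


class NormalizedPerson(BaseModel):
    name: str
    age: int

    @field_validator("name")
    @classmethod
    def clean_name(cls, value: str) -> str:
        # Remove surrounding whitespace before checking the content.
        cleaned = value.strip()
        if not cleaned.isalpha():
            raise ValueError("Name must contain only alphabetic characters.")
        # The returned value replaces the original input on the model.
        return cleaned.title()


person = NormalizedPerson(name="  vincent ", age=25)
print("Normalized Person:", person)
```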


In [41]:
# Example 1: Valid instance
try:
    person_valid = Person(name="Vincent", age=25)
    print("Valid Person:", person_valid)
except ValidationError as e:
    print("Validation Error:", e)

Valid Person: name='Vincent' age=25


In this example, `"Vincent"` is a valid name (alphabetic characters only), so the instance is created without issues.

In [42]:
# Example 2: Invalid instance (name contains non-alphabetic characters)
try:
    person_invalid = Person(name="Tangol16", age=25)
except ValidationError as e:
    print("Invalid:", e)

Invalid: 1 validation error for Person
name
  Value error, Name must contain only alphabetic characters. [type=value_error, input_value='Tangol16', input_type=str]
    For further information visit https://errors.pydantic.dev/2.10/v/value_error


Here, `"Tangol16"` contains digits, so the custom validator fails. This results in a `ValidationError` with the message "Name must contain only alphabetic characters."

- **Custom Validators in Action:**  
  The examples show how custom validators enforce specific rules. A valid input ("Vincent") passes the check, while an invalid input ("Tangol16") triggers an error.

- **Immediate Feedback:**  
  This approach provides immediate feedback during model initialization, ensuring that only correctly formatted data is accepted.
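
The introduction to this section also mentions cross-field validation and nested models. A minimal sketch of both follows, assuming Pydantic V2's `@model_validator(mode="after")` for the cross-field rule; the `Address` and `Booking` models, their fields, and the `error_summary` helper are hypothetical and exist only for this illustration.

```python
from pydantic import BaseModel, ValidationError, model_validator


class Address(BaseModel):
    city: str
    postal_code: str


class Booking(BaseModel):
    guest: str
    address: Address  # Nested model: a dict input is validated as Address.
    check_in_day: int
    check_out_day: int

    # Runs after all fields have been validated, so both values are available.
    @model_validator(mode="after")
    def check_out_after_check_in(self):
        if self.check_out_day <= self.check_in_day:
            raise ValueError("check_out_day must be after check_in_day")
        return self


def error_summary(**kwargs):
    """Return (location, error type) pairs, or an empty list if valid."""
    try:
        Booking(**kwargs)
        return []
    except ValidationError as e:
        return [(err["loc"], err["type"]) for err in e.errors()]


good_address = {"city": "Paris", "postal_code": "75001"}

booking = Booking(guest="Alice", address=good_address, check_in_day=3, check_out_day=5)
print("Nested type:", type(booking.address).__name__)

# Cross-field rule violated: the error is reported at the model level.
cross_field = error_summary(
    guest="Alice", address=good_address, check_in_day=5, check_out_day=3
)
print(cross_field)

# Nested field missing: the location points into the nested model.
nested = error_summary(
    guest="Alice", address={"city": "Paris"}, check_in_day=3, check_out_day=5
)
print(nested)
```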

---

### 3. Data Ingestion with Pydantic Validation

In this section, we'll integrate Pydantic into a data ingestion workflow. We'll define a simple model to validate raw data, load data into a DataFrame, and validate each row. Any rows that fail validation will be logged separately for review, which is useful for monitoring data quality in production pipelines.

In [43]:
import pandas as pd
from pydantic import BaseModel, ValidationError, field_validator


class RawData(BaseModel):
    name: str
    age: int

    # Custom validator: ensure the name contains only alphabetic characters.
    @field_validator("name")
    @classmethod  # Field validators are class methods in Pydantic V2.
    def name_must_be_alpha(cls, value: str) -> str:
        if not value.isalpha():
            raise ValueError("Name must contain only alphabetic characters.")
        return value

**Before we validate the data:**  
We'll simulate raw CSV data by creating a Pandas DataFrame. Then, we define a function that processes the DataFrame row-by-row:
- Each row is converted to a dictionary.
- We attempt to create a `RawData` instance from the dictionary.
- Valid records are collected, while any record that fails validation is logged along with its error message.

In [44]:
# Mock DataFrame simulating raw CSV data
raw_data_df = pd.DataFrame(
    {
        "name": ["Nica", "Bob123", "Denna", "Samn", "Mich", "3lena"],
        "age": [30, 25, "twenty", 40, 55, 22],
    }
)


def load_and_validate_data(df: pd.DataFrame):
    valid_records = []
    invalid_records = []

    for idx, record in df.iterrows():
        row_dict = record.to_dict()
        try:
            validated = RawData(**row_dict)
            # model_dump() is the Pydantic V2 replacement for the deprecated dict().
            valid_records.append(validated.model_dump())
        except ValidationError as e:
            invalid_records.append({"index": idx, "data": row_dict, "error": str(e)})

    return pd.DataFrame(valid_records), invalid_records


valid_df, invalid_entries = load_and_validate_data(raw_data_df)

print("Valid DataFrame:")
print(valid_df, "\n")

print("Invalid Entries:")
for entry in invalid_entries:
    print(entry)

Valid DataFrame:
   name  age
0  Nica   30
1  Samn   40
2  Mich   55 

Invalid Entries:
{'index': 1, 'data': {'name': 'Bob123', 'age': 25}, 'error': "1 validation error for RawData\nname\n  Value error, Name must contain only alphabetic characters. [type=value_error, input_value='Bob123', input_type=str]\n    For further information visit https://errors.pydantic.dev/2.10/v/value_error"}
{'index': 2, 'data': {'name': 'Denna', 'age': 'twenty'}, 'error': "1 validation error for RawData\nage\n  Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='twenty', input_type=str]\n    For further information visit https://errors.pydantic.dev/2.10/v/int_parsing"}
{'index': 5, 'data': {'name': '3lena', 'age': 22}, 'error': "1 validation error for RawData\nname\n  Value error, Name must contain only alphabetic characters. [type=value_error, input_value='3lena', input_type=str]\n    For further information visit https://errors.pydantic.dev/2.10/v/value_e



**Discussion:**

- **Model Definition:**  
  The `RawData` model enforces that the `name` must consist only of alphabetic characters and that `age` is an integer.

- **Row-by-Row Validation:**  
  Each row in the DataFrame is converted to a dictionary and validated:
  - **Valid Records:** Successfully validated rows are added to the valid DataFrame.
  - **Invalid Records:** Any row that fails validation is captured with its index, data, and error message.

- **Real-World Relevance:**  
  In a production pipeline, you might discard or flag invalid records for manual review, ensuring that only high-quality data is used for further processing.

- **Performance Note:**  
  While row-by-row processing is simple and clear, it might be slow for very large datasets. Consider batch processing or parallel validation for scalability.
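
One possible form of batch validation is to validate the whole list of records in a single call with `TypeAdapter(list[RawData])` and read the failing row positions from the error locations. This is a sketch, not a benchmarked recommendation: whether it is faster than the row-by-row loop depends on the data and should be measured. The `find_invalid_rows` helper and the `raw_data_list` name are hypothetical, and plain dictionaries stand in for DataFrame rows.

```python
from pydantic import BaseModel, TypeAdapter, ValidationError, field_validator


class RawData(BaseModel):
    name: str
    age: int

    @field_validator("name")
    @classmethod
    def name_must_be_alpha(cls, value: str) -> str:
        if not value.isalpha():
            raise ValueError("Name must contain only alphabetic characters.")
        return value


# Built once and reused; validates an entire list in one call.
raw_data_list = TypeAdapter(list[RawData])


def find_invalid_rows(records: list[dict]) -> dict[int, list[str]]:
    """Return {row position: [error messages]} for rows that fail validation."""
    try:
        raw_data_list.validate_python(records)
    except ValidationError as e:
        problems: dict[int, list[str]] = {}
        for err in e.errors():
            # For a list, the first element of 'loc' is the item's position.
            row, *field_path = err["loc"]
            field = ".".join(str(part) for part in field_path)
            problems.setdefault(row, []).append(f"{field}: {err['msg']}")
        return problems
    return {}


records = [
    {"name": "Nica", "age": 30},
    {"name": "Bob123", "age": 25},
    {"name": "Denna", "age": "twenty"},
    {"name": "Samn", "age": 40},
    {"name": "Mich", "age": 55},
    {"name": "3lena", "age": 22},
]

invalid_rows = find_invalid_rows(records)
for row, messages in sorted(invalid_rows.items()):
    print(row, messages)
```

With a DataFrame, the input list could be produced with `df.to_dict("records")`. Note that a failed batch call returns no validated objects, so the rows that passed still need to be collected in a second step.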

In [45]:
# Optional: Read data from a CSV file and validate it.
# Ensure you have created a CSV file named 'raw_data.csv' in your repository.

# Read the CSV file
df_csv = pd.read_csv("raw_data.csv")

# Validate the CSV data using the previously defined load_and_validate_data function
valid_df_csv, invalid_entries_csv = load_and_validate_data(df_csv)

print("Valid CSV Data:")
print(valid_df_csv, "\n")

print("Invalid CSV Entries:")
for entry in invalid_entries_csv:
    print(entry)

Valid CSV Data:
      name  age
0    Alice   30
1  Charlie   28
2      Eve   35
3    Frank   42 

Invalid CSV Entries:
{'index': 1, 'data': {'name': 'Bob123', 'age': '25'}, 'error': "1 validation error for RawData\nname\n  Value error, Name must contain only alphabetic characters. [type=value_error, input_value='Bob123', input_type=str]\n    For further information visit https://errors.pydantic.dev/2.10/v/value_error"}
{'index': 3, 'data': {'name': 'Diana', 'age': 'forty'}, 'error': "1 validation error for RawData\nage\n  Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='forty', input_type=str]\n    For further information visit https://errors.pydantic.dev/2.10/v/int_parsing"}


