
# Week 8: Data Validation

---

## Learning Goals
>
>By the end of this week, you will be able to:
>
> - Understand **why data validation is critical** in data science pipelines
> - Explain the difference between **data cleaning** and **data validation**
> - Use **Pandera** to validate Pandas DataFrames
> - Define **schemas**, **checks**, and **constraints**
> - Integrate validation naturally into Pandas-based workflows
>
> This week focuses on **thinking about data validation like a data professional**.



## Why Data Validation Exists

> In most data science courses, we jump quickly to:
> - EDA
> - feature engineering
> - modeling
>
> But in real systems, **data validation happens before all of that**.
>
> Data validation answers questions such as:
> - Are the columns I expect actually present?
> - Do values have the correct types?
> - Are numeric values within acceptable ranges?
> - Are categories restricted to known values?
>
> Without validation:
> - Errors propagate silently
> - Models learn from corrupted data
> - Bugs appear far from their root cause
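
> As a rough illustration, the four questions above can be answered by hand in plain Pandas.
> The `orders` data, the expected column set, and the allowed status values are hypothetical, invented only for this sketch.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
orders = pd.DataFrame({
    "price": [9.99, -1.0, 20.0],
    "status": ["paid", "refunded", "unknown"],
})

expected_columns = {"price", "status"}
allowed_status = {"paid", "refunded"}

# Are the columns I expect actually present?
missing_columns = expected_columns - set(orders.columns)

# Do values have the correct types?
price_is_numeric = pd.api.types.is_numeric_dtype(orders["price"])

# Are numeric values within acceptable ranges?
negative_prices = orders[orders["price"] < 0]

# Are categories restricted to known values?
unknown_status = orders[~orders["status"].isin(allowed_status)]

print(missing_columns, price_is_numeric, len(negative_prices), len(unknown_status))
```

> Hand-written checks like these work, but they are easy to forget and tend to end up scattered across a notebook. A schema keeps the same rules in one place.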



## Setup

> We will use:
> - `pandas` for data manipulation
> - `pandera` for **DataFrame-level validation**
>
> Pandera was originally designed to work with Pandas (it has since added support for other DataFrame libraries).
> It feels natural if you already know Pandas.


In [None]:
!pip install pandera

In [None]:
import pandas as pd
import pandera.pandas as pa
# Import the schema building blocks from the same pandas-specific module
from pandera.pandas import Column, DataFrameSchema, Check


## 1. A Dataset with Hidden Problems

> Let's start with a dataset that *looks* reasonable,
> but contains several validation issues.


In [None]:

data = {
    "name": ["Alice", "Bob", "", "Diana"],
    "age": [25, -5, 30, "forty"],
    "salary": [3000, 2800, 2500, -100],
    "department": ["IT", "HR", "IT", "Marketing"]
}

df = pd.DataFrame(data)
df



> Pandas will happily accept this data.
> **No errors are raised.**
>
> This is exactly why validation libraries exist.
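
> A quick look with plain Pandas shows how these problems hide. This sketch rebuilds the same DataFrame so that it runs on its own; the helper variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "", "Diana"],
    "age": [25, -5, 30, "forty"],
    "salary": [3000, 2800, 2500, -100],
    "department": ["IT", "HR", "IT", "Marketing"],
})

# Mixing integers and a string makes pandas fall back to the generic object dtype
print(df.dtypes)

# Each problem has to be hunted down by hand
empty_names = (df["name"] == "").sum()
negative_salaries = (df["salary"] < 0).sum()
non_numeric_ages = pd.to_numeric(df["age"], errors="coerce").isna().sum()

print(empty_names, negative_salaries, non_numeric_ages)
```

> Nothing here raised an error: the `age` column silently became `object`, and the empty name and the negative salary look like ordinary values.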



## 2. Introducing Pandera

> **Pandera** is a data validation library for Pandas.
>
> Core ideas:
> - A **schema** describes how data *should* look
> - Validation fails fast if data violates the schema
> - Errors are explicit and informative
>
> Think of Pandera as:
>
> > `scikit-learn` for data correctness



## 2.1. Defining a Basic Schema

> A Pandera schema defines:
> - column names
> - data types
> - constraints (checks)


In [None]:

employee_schema = DataFrameSchema({
    "name": Column(str),
    "age": Column(int),
    "salary": Column(float),
    "department": Column(str),
})



> This schema defines **structure**, but not **business rules** yet.



## 2.2. Validating a DataFrame

> Validation is done explicitly using `.validate()`.


In [None]:

try:
    employee_schema.validate(df)
except pa.errors.SchemaError as e:
    print(e)



## 2.3. Adding Checks (Business Rules)

> Most real validation rules are **semantic**, not just types.
>
> Examples:
> - age must be >= 0
> - salary must not be negative
> - name must not be empty


In [None]:

employee_schema = DataFrameSchema({
    "name": Column(
        str,
        Check.str_length(min_value=1)
    ),
    "age": Column(
        int,
        Check.ge(0)
    ),
    "salary": Column(
        float,
        Check.ge(0)
    ),
    "department": Column(
        str,
        Check.isin(["IT", "HR", "Finance", "Marketing"])
    ),
})



## 3. Validating the Data

> Using the schema above:
>
> 1. Run the validation
> 2. Observe which rows fail
> 3. Read the error message carefully


In [None]:
try:
    employee_schema.validate(df)
except pa.errors.SchemaError as e:
    print(e)



## 3.1. Filtering Bad Rows

> Sometimes we don't want the pipeline to stop.
> Instead, we want to **identify and handle invalid rows**.


In [None]:
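try:
    validated_df = employee_schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as e:
    # lazy=True collects every violation before raising SchemaErrors
    print(e.failure_cases)



> With `lazy=True`, Pandera checks the whole DataFrame and reports every violation at once,
> instead of stopping at the first one.
> Validation still raises an error (`SchemaErrors`); its `failure_cases` table shows
> which column, check, and row index failed, so invalid rows can be identified and handled.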



## 3.2. Count Invalid Rows

> Using the lazy validation result:
>
> 1. Count how many rows are invalid
> 2. Print the number


In [None]:

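try:
    employee_schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as e:
    # failure_cases has one entry per failed check, not one per row;
    # column-level failures (such as a wrong dtype) have no row index
    invalid_rows = e.failure_cases["index"].dropna().nunique()
    print(f"Number of failure cases: {len(e.failure_cases)}")
    print(f"Number of invalid rows: {invalid_rows}")
    print(e)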



## 4. Custom Checks

> Pandera allows **custom validation logic**.
> This is useful for complex rules.


In [None]:
def reasonable_salary(s):
    return s < 10000

employee_schema = DataFrameSchema({
    "name": Column(str, Check.str_length(min_value=1)),
    "age": Column(int, Check.ge(0)),
    "salary": Column(float, [Check.ge(0), Check(reasonable_salary)]),
    "department": Column(str),
})



### Add a Custom Age Rule

> Add a rule so that:
>
> - age must satisfy 0 <= age <= 120


In [None]:

employee_schema = DataFrameSchema({
    "name": Column(str, Check.str_length(min_value=1)),
    "age": Column(int, [Check.ge(0), Check.le(120)]),
    "salary": Column(float, Check.ge(0)),
    "department": Column(str),
})



## Summary

> This week, you learned:
>
> - Why validation is essential in data science
> - Why Pandas alone is not enough
> - How Pandera integrates naturally with Pandas
> - How to define schemas and business rules
>
> **Another prominent validation library is**:
>
> > Pydantic https://docs.pydantic.dev/latest/
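
> For comparison, a minimal Pydantic sketch. Pydantic validates individual records (Python objects) rather than whole DataFrames. The `Employee` model is hypothetical and simply mirrors this week's employee columns.

```python
from pydantic import BaseModel, Field, ValidationError


class Employee(BaseModel):
    # Hypothetical record model mirroring this week's employee columns
    name: str = Field(min_length=1)
    age: int = Field(ge=0, le=120)
    salary: float = Field(ge=0)


try:
    Employee(name="", age=-5, salary=2800)
    error_count = 0
except ValidationError as exc:
    # One entry per field that failed validation
    error_count = len(exc.errors())

print(error_count)
```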
