# Data Validation and Quality Assurance with Pandas
In this notebook, we will explore how to validate data and ensure quality using `pandas` in Python.
We will demonstrate how to apply validation rules, detect data anomalies, and set up quality monitoring.

## Step 1: Defining Validation Rules
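We will start by deciding on basic validation rules to ensure data accuracy. These rules may include checks for ranges, data types, and required fields. The sample dataset below deliberately contains a missing name, a missing salary, and a negative age, so each rule has something to catch.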

In [1]:
import pandas as pd

# Sample dataset
data = {'Name': ['John', 'Alice', 'Bob', 'Clara', None],
        'Age': [25, 22, -5, 30, 28],  # Invalid age '-5'
        'Salary': [50000, 70000, 60000, None, 45000]}  # Missing salary

df = pd.DataFrame(data)
df

    Name  Age   Salary
0   John   25  50000.0
1  Alice   22  70000.0
2    Bob   -5  60000.0
3  Clara   30      NaN
4   None   28  45000.0
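
The rules described in this step can also be written down as data rather than scattered through the code. The following is a minimal sketch, not part of the original notebook: the `RULES` table and the `count_violations` helper are hypothetical names, the 0 to 120 age range is the one this notebook uses for its age check, and the non-negative salary bound is an assumption added for illustration.

```python
import pandas as pd

# Hypothetical rule table: whether a column is required, whether it is numeric,
# and (for numeric columns) the allowed range. The Salary bound is an assumption.
RULES = {
    'Name':   {'required': True, 'numeric': False},
    'Age':    {'required': True, 'numeric': True, 'min': 0, 'max': 120},
    'Salary': {'required': True, 'numeric': True, 'min': 0},
}

def count_violations(df, rules):
    """Return {column: {'missing': n, 'out_of_range': n}} for each rule."""
    report = {}
    for column, rule in rules.items():
        values = df[column]
        missing = int(values.isna().sum()) if rule.get('required') else 0
        out_of_range = 0
        if rule.get('numeric'):
            present = values.dropna()
            too_low = present < rule.get('min', float('-inf'))
            too_high = present > rule.get('max', float('inf'))
            out_of_range = int((too_low | too_high).sum())
        report[column] = {'missing': missing, 'out_of_range': out_of_range}
    return report

# Same values as the sample dataset in this step.
sample = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'Clara', None],
    'Age': [25, 22, -5, 30, 28],
    'Salary': [50000, 70000, 60000, None, 45000],
})
rule_report = count_violations(sample, RULES)
print(rule_report)
```

Keeping the rules in one table makes it easier to review or extend them without touching the checking logic.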


## Step 2: Validating Data
We will check for the following issues:
- Missing values in required fields.
- Invalid data types.
- Values out of the acceptable range.

In [2]:
# Check for missing values
missing_values = df.isnull().sum()
print('Missing Values:')
print(missing_values)

# Check for invalid ages (ages should be between 0 and 120)
invalid_ages = df[(df['Age'] < 0) | (df['Age'] > 120)]
print('\nInvalid Age Entries:')
print(invalid_ages)


Missing Values:
Name      1
Age       0
Salary    1
dtype: int64

Invalid Age Entries:
  Name  Age   Salary
2  Bob   -5  60000.0
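
The cell above covers missing values and out-of-range ages, but not the data type check listed at the start of this step. One possible way to sketch it is shown below. The `raw` DataFrame is hypothetical data invented for illustration, with one age that arrived as text.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical incoming data where one age arrived as text.
raw = pd.DataFrame({'Name': ['Eve', 'Frank', 'Grace'],
                    'Age': [45, 'unknown', 31]})

# A column holding mixed numbers and text is stored with the generic
# object dtype, so the numeric check fails.
age_is_numeric = is_numeric_dtype(raw['Age'])
print('Age column is numeric:', age_is_numeric)

# Coerce to numbers; entries that cannot be parsed become NaN.
age_numeric = pd.to_numeric(raw['Age'], errors='coerce')

# Rows that had a value, but not one that can be read as a number.
bad_type_rows = raw[age_numeric.isna() & raw['Age'].notna()]
print(bad_type_rows)
```

Comparing the coerced column against the original separates values of the wrong type from values that were simply missing.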


## Step 3: Applying Corrections
After detecting the data quality issues, we will apply corrections.
- Replace missing salary values with the mean salary.
- Replace invalid ages with `NaN`.

In [3]:
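# Replacing missing salaries with the mean salary.
# Assigning the result back avoids a chained in-place fillna,
# which newer pandas versions warn about and may not apply to df.
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# Replacing invalid ages with NaN (where() keeps only values within 0-120)
df['Age'] = df['Age'].where(df['Age'].between(0, 120))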

df

    Name   Age   Salary
0   John  25.0  50000.0
1  Alice  22.0  70000.0
2    Bob   NaN  60000.0
3  Clara  30.0  56250.0
4   None  28.0  45000.0
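
Once a missing salary has been filled with the mean, it can no longer be told apart from a real value. A small sketch of one way to keep that information follows. The `people` DataFrame and the `Salary_imputed` column are hypothetical names used only for this illustration, not part of the notebook's dataset.

```python
import pandas as pd

# Hypothetical two-row example with one missing salary.
people = pd.DataFrame({'Name': ['John', 'Clara'],
                       'Salary': [50000.0, None]})

# Record which rows were missing *before* filling them in.
people['Salary_imputed'] = people['Salary'].isna()

# Fill the gaps with the mean of the values that are present.
people['Salary'] = people['Salary'].fillna(people['Salary'].mean())
print(people)
```

The flag column lets later analysis include or exclude imputed values as needed.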


## Step 4: Setting Up Data Quality Monitoring
To continuously ensure data quality, we can implement automated checks. Here, we wrap the checks from Step 2 in a function and run it on a new batch of data to simulate ongoing monitoring.

In [4]:
# Function to validate a batch of incoming data
def validate_data(df):
    missing_values = df.isnull().sum()
    invalid_ages = df[(df['Age'] < 0) | (df['Age'] > 120)]
    is_valid = True
    if missing_values.any():
        print('Data contains missing values!')
        is_valid = False
    if not invalid_ages.empty:
        print('Data contains invalid ages!')
        is_valid = False
    # Report success only when every check passed
    if is_valid:
        print('Data is valid.')

# Simulate validating a new dataset
new_data = {'Name': ['Eve', 'Frank'], 'Age': [45, -10], 'Salary': [52000, 61000]}
df_new = pd.DataFrame(new_data)
validate_data(df_new)


Data contains invalid ages!
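
For automated monitoring, it is often more convenient for a check to return its findings than to print them, so that a scheduler or alerting step can act on the result. The following is a minimal sketch under that assumption. `collect_issues` and its `min_age` / `max_age` parameters are hypothetical names, not part of the notebook's `validate_data` function.

```python
import pandas as pd

def collect_issues(df, min_age=0, max_age=120):
    """Return a list of human-readable issue strings (empty if none found)."""
    issues = []

    # One entry per column that has at least one missing value.
    missing_per_column = df.isnull().sum()
    for column, count in missing_per_column[missing_per_column > 0].items():
        issues.append(f'{column}: {count} missing value(s)')

    # Count ages outside the accepted range.
    out_of_range = (df['Age'] < min_age) | (df['Age'] > max_age)
    invalid_age_count = int(out_of_range.sum())
    if invalid_age_count:
        issues.append(f'Age: {invalid_age_count} value(s) outside {min_age}-{max_age}')
    return issues

# Same values as the new dataset validated above.
batch = pd.DataFrame({'Name': ['Eve', 'Frank'],
                      'Age': [45, -10],
                      'Salary': [52000, 61000]})
issues = collect_issues(batch)
print(issues)
```

An empty list means the batch passed every check, which makes the result easy to test with a plain `if issues:`.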


## Step 5: Logging Data Validation Results
It's important to log data quality issues for auditing and future reference. Let's build a simple in-memory log of validation results, using the counts computed in Step 2 before any corrections were applied.

In [5]:
# Create a log of validation results
validation_log = []

# Append issues to log. missing_values and invalid_ages are the
# pre-correction results from Step 2; int() gives plain Python numbers.
validation_log.append({'Issue': 'Missing Values', 'Count': int(missing_values.sum())})
validation_log.append({'Issue': 'Invalid Ages', 'Count': len(invalid_ages)})

validation_log


[{'Issue': 'Missing Values', 'Count': 2},
 {'Issue': 'Invalid Ages', 'Count': 1}]
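
The log above lives only in memory. To keep it for auditing, it could be appended to a file, for example as one JSON object per line. This is a sketch under stated assumptions: `append_to_log`, the `timestamp` field, and the file name `validation_log.jsonl` are all hypothetical choices, and the file is written to a fresh temporary directory so the example has no side effects on a real project.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def append_to_log(log_path, entries):
    """Append one JSON line per entry, stamped with the current UTC time."""
    timestamp = datetime.now(timezone.utc).isoformat()
    with open(log_path, 'a', encoding='utf-8') as handle:
        for entry in entries:
            handle.write(json.dumps({'timestamp': timestamp, **entry}) + '\n')

# Counts copied from the validation log shown above.
entries = [{'Issue': 'Missing Values', 'Count': 2},
           {'Issue': 'Invalid Ages', 'Count': 1}]

# Hypothetical location: a new temporary directory.
log_path = Path(tempfile.mkdtemp()) / 'validation_log.jsonl'
append_to_log(log_path, entries)
print(log_path.read_text(encoding='utf-8'))
```

Note that `json.dumps` does not accept NumPy integer types, so counts taken directly from pandas (such as the result of `.sum()`) need to be converted with `int()` before being logged this way.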

## Summary
In this notebook, we have demonstrated how to:
- Set up validation rules for data accuracy.
- Detect common data quality issues such as missing values and invalid ranges.
- Apply corrections to the dataset.
- Write a reusable function to validate incoming data.
- Log data validation results for auditing.

In the next notebook, we will focus on integrating these concepts with the overall platform.