# Data Validation

## 1. Data Validation (Missing Values)

This code checks for missing values in critical columns essential for reporting.

It raises an error if missing values are found, alerting you to potential data quality issues.

In [24]:
import pandas as pd

# Read the CSV data
data = pd.read_csv("Basic_data.csv")

# Check for missing values
missing_values = data.isnull().sum()

# Raise an error if critical columns have missing values
critical_columns = ["Name", "Age", "Balance", "AccStatus"]
if missing_values[critical_columns].any():
    raise ValueError("Missing values found in critical columns!")

print("Data validation passed (no missing values in critical columns)")


Data validation passed (no missing values in critical columns)
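When the check fails, it helps to know which columns are affected and by how much. The following is a minimal sketch of one way to report that. The `report_missing` helper and the in-memory `sample` frame are hypothetical stand-ins for illustration, not part of the original notebook or of `Basic_data.csv`.

```python
import pandas as pd


def report_missing(df, columns):
    """Return the missing-value count for each listed column that has gaps."""
    counts = df[columns].isnull().sum()
    # Keep only the columns that actually contain missing values
    return counts[counts > 0]


# Hypothetical sample data standing in for Basic_data.csv
sample = pd.DataFrame({
    "Name": ["Ann", None, "Raj"],
    "Age": [34, 41, None],
    "Balance": [1200.0, 530.5, 88.0],
    "AccStatus": ["Active", "Inactive", "Active"],
})

missing = report_missing(sample, ["Name", "Age", "Balance", "AccStatus"])
if not missing.empty:
    print(f"Columns with missing values: {missing.to_dict()}")
```

The returned Series can be placed directly into an error message, so the person fixing the data sees the affected columns without re-running the check by hand.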


## 2. Data Type Validation

This code defines expected data types for each column.

It attempts to convert the data types and catches errors during conversion.

You can handle errors by raising exceptions, fixing data, or logging them for further investigation.

In [25]:
# Expected data type for each column
data_types = {
    "AccID": str,
    "Name": str,
    "Gender": str,
    "Age": int,
    "AccOpen": pd.to_datetime,  # Checked with pd.to_datetime instead of astype
    "Balance": float,
    "AccStatus": str,
}

# Try converting each column and catch conversion errors
try:
    for col, dtype in data_types.items():
        if dtype is pd.to_datetime:
            pd.to_datetime(data[col])  # Check that dates parse; column is left unchanged
        else:
            data[col] = data[col].astype(dtype)  # Convert to the expected type
except (ValueError, TypeError) as e:
    print(f"Data type validation error in column '{col}': {e}")
    # Handle the error (e.g., raise an exception or fix data)
else:
    # Report success only when every conversion succeeded
    print("Data type validation passed (expected data types found)")


Data type validation passed (expected data types found)
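When a conversion fails, the exception message does not always show which rows are at fault. One possible way to locate them, sketched below, is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into `NaN` instead of raising. The `find_unconvertible` helper and the sample values are hypothetical.

```python
import pandas as pd


def find_unconvertible(series):
    """Return the entries that are present but cannot be parsed as numbers."""
    converted = pd.to_numeric(series, errors="coerce")
    # A value is a problem if it was present originally but became NaN
    mask = converted.isna() & series.notna()
    return series[mask]


# Hypothetical "Age" column as it might arrive from a CSV with dirty entries
ages = pd.Series(["34", "forty", "29", None, "18.5x"])
bad_ages = find_unconvertible(ages)
print(bad_ages)
```

Entries that were already missing are excluded on purpose, because the missing-value check is responsible for those.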


## 3. Data Format Validation

This code validates the format of dates in the AccOpen column (adjust the pattern as needed).

It checks whether all dates match the expected format and prints a warning if inconsistencies are found.

In [26]:
# Validate date format (optional, adjust pattern as needed)
valid_date_format = "%d-%b-%y"  # Adjust for your date format

# Parse 'AccOpen' with the expected format; values that do not match become NaT
parsed_dates = pd.to_datetime(data["AccOpen"], format=valid_date_format, errors="coerce")
if parsed_dates.isna().any():
    print("Warning: Some dates might not be in the expected format.")
else:
    print("Data format validation passed (date format mostly consistent)")


Data format validation passed (date format mostly consistent)
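A warning is more useful when it names the offending values. The sketch below shows one way to list them, assuming the dates are stored as strings. The `find_bad_dates` helper and the sample dates are hypothetical.

```python
import pandas as pd


def find_bad_dates(series, date_format):
    """Return the entries that are present but do not parse with date_format."""
    parsed = pd.to_datetime(series, format=date_format, errors="coerce")
    # Unparseable entries become NaT; ignore entries that were missing to begin with
    return series[parsed.isna() & series.notna()]


# Hypothetical "AccOpen" values: two valid, one in ISO format, one impossible date
dates = pd.Series(["01-Jan-20", "2020-02-15", "15-Mar-21", "31-Feb-22"])
bad_dates = find_bad_dates(dates, "%d-%b-%y")
print(bad_dates)
```

Note that `%b` matches month abbreviations in the active locale, so this sketch assumes English month names.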


## 4. Data Value Validation (Range or Specific Values)

This code checks if the minimum age in the data is above a certain threshold.

It also validates if the account status values are within a defined set of valid options.

You can customize these validations based on your specific data and reporting requirements.

In [27]:
# Minimum age validation (optional, adjust as needed)
min_age = 18
age_ok = data["Age"].min() >= min_age
if not age_ok:
    print(f"Warning: Age data contains values below {min_age}.")

# Validate specific account status values (optional)
valid_statuses = ["Active", "Inactive"]
invalid_statuses = set(data["AccStatus"]) - set(valid_statuses)
if invalid_statuses:
    print(f"Warning: Account status contains invalid values: {invalid_statuses}")

# Report success only when both checks pass
if age_ok and not invalid_statuses:
    print("Data value validation passed (values mostly within expected ranges/formats)")


Data value validation passed (values mostly within expected ranges/formats)
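To go from a yes/no answer to the actual offending rows, boolean masks can be used to filter the frame. This is a minimal sketch on hypothetical sample data. The column names follow the notebook, but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data using the notebook's column names
sample = pd.DataFrame({
    "AccID": ["A1", "A2", "A3", "A4"],
    "Age": [25, 16, 40, 17],
    "AccStatus": ["Active", "Closed", "Inactive", "Active"],
})

min_age = 18
valid_statuses = ["Active", "Inactive"]

# Rows whose age falls below the threshold
underage = sample[sample["Age"] < min_age]

# Rows whose status is not in the allowed set
bad_status = sample[~sample["AccStatus"].isin(valid_statuses)]

print(underage)
print(bad_status)
```

Filtering this way keeps the account identifiers next to the bad values, which makes follow-up with the data owner easier.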


## Remember

Adjust the validation criteria according to your data characteristics and reporting needs.

Consider implementing logging or raising specific exceptions for different validation errors to improve data quality control.
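As one possible shape for this, the sketch below combines Python's standard `logging` module with a small exception hierarchy. The class names (`ValidationError`, `MissingValueError`, `InvalidValueError`), the `validate` function, and the sample data are all hypothetical and should be adapted to your own checks.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("data_validation")


class ValidationError(Exception):
    """Base class for all validation failures (hypothetical)."""


class MissingValueError(ValidationError):
    """Raised when a critical column contains missing values."""


class InvalidValueError(ValidationError):
    """Raised when a column contains values outside the allowed set."""


def validate(df, critical_columns, valid_statuses):
    """Run the checks in order and raise a specific exception on the first failure."""
    missing = df[critical_columns].isnull().sum()
    missing = missing[missing > 0]
    if not missing.empty:
        raise MissingValueError(f"Missing values in: {list(missing.index)}")

    invalid = df.loc[~df["AccStatus"].isin(valid_statuses), "AccStatus"]
    if not invalid.empty:
        raise InvalidValueError(f"Unexpected AccStatus values: {sorted(set(invalid))}")

    logger.info("All validations passed for %d rows", len(df))


# Hypothetical sample data with one invalid status
sample = pd.DataFrame({
    "Name": ["Ann", "Raj"],
    "AccStatus": ["Active", "Closed"],
})

caught = None
try:
    validate(sample, ["Name", "AccStatus"], ["Active", "Inactive"])
except ValidationError as exc:
    # Catching the base class handles every validation failure in one place
    logger.error("Validation failed: %s", exc)
    caught = exc
```

Callers that need to react differently to each problem can catch the subclasses individually instead of the base class.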

By incorporating these validations, you can enhance the reliability and accuracy of your data, leading to more meaningful reports.