# Chapter 21: Validating Data Before Analysis

⚠️ **DO NOT SKIP THIS CELL**

## Run the Next Cell First

Before executing any other cell, run the next cell. It sets up the project folder paths (`PROJECT_ROOT`, `DATA_DIR`, and the directories beneath them) that the rest of the chapter relies on.

In [None]:
from pathlib import Path

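# Detect whether the notebook is running in Google Colab.
try:
    from google.colab import drive
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    # In Colab, the project lives on the mounted Google Drive.
    drive.mount("/content/drive")
    PROJECT_ROOT = Path("/content/drive/MyDrive/DataScience/census-education-analysis")
else:
    # Locally, the project root is taken to be the parent of the working directory.
    PROJECT_ROOT = Path.cwd().parent

# Standard data and output folders under the project root.
DATA_DIR = PROJECT_ROOT / "data"
RAW_DIR = DATA_DIR / "raw"
STAGING_DIR = DATA_DIR / "staging"
PROCESSED_DIR = DATA_DIR / "processed"
OUTPUTS_DIR = PROJECT_ROOT / "outputs"

# Display the resolved root to confirm the setup.
PROJECT_ROOT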


## Problem 1: What Data Are We Validating?

In [None]:
import pandas as pd

edu_path = PROCESSED_DIR / "cleaned" / "west_bengal_cleaned.csv"
edu_df = pd.read_csv(edu_path)

edu_df.head()
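
Before running any checks, it can help to confirm that the columns the later cells rely on are actually present. The sketch below uses a small hand-made frame in place of `edu_df`. The helper name `find_missing_columns` and the `REQUIRED_COLUMNS` list are illustrative and not part of the project; the list simply collects the column names used by the checks in this chapter.

```python
import pandas as pd

# Column names used by the checks later in this chapter.
REQUIRED_COLUMNS = [
    "total_persons", "male_persons", "female_persons",
    "total_illiterate", "male_illiterate", "female_illiterate",
    "total_literate", "male_literate", "female_literate",
]


def find_missing_columns(df, required):
    """Return the required column names absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]


# Toy stand-in for edu_df; the literate columns are left out on purpose.
sample_df = pd.DataFrame({
    "total_persons": [100], "male_persons": [52], "female_persons": [48],
    "total_illiterate": [30], "male_illiterate": [12], "female_illiterate": [18],
})

missing = find_missing_columns(sample_df, REQUIRED_COLUMNS)
print(missing)
```

On the real data, calling `find_missing_columns(edu_df, REQUIRED_COLUMNS)` should return an empty list if the cleaned file has the expected layout.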

## Problem 2: What Does “Validation” Mean in Practice?

## Problem 3: Which Numeric Rules Must Always Hold?

## Problem 4: How Do I Validate Total Population Counts?

In [None]:
# A result of 0 means male + female equals the reported total for that row.
edu_df["persons_check"] = (
    edu_df["male_persons"]
    + edu_df["female_persons"]
    - edu_df["total_persons"]
)

In [None]:
edu_df["persons_check"].value_counts()
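
`value_counts()` shows how many rows differ and by how much, but not which rows they are. Here is a minimal sketch of pulling out the failing rows, using a toy frame with the same column names. The values are invented for illustration.

```python
import pandas as pd

# Toy data: the second row is deliberately inconsistent (60 + 50 != 100).
toy_df = pd.DataFrame({
    "male_persons": [52, 60, 10],
    "female_persons": [48, 50, 15],
    "total_persons": [100, 100, 25],
})

toy_df["persons_check"] = (
    toy_df["male_persons"] + toy_df["female_persons"] - toy_df["total_persons"]
)

# Keep only rows where the parts do not add up to the total.
mismatched = toy_df[toy_df["persons_check"] != 0]
print(mismatched)
```

On the real data, the same filter applied to `edu_df` lists the rows that need a closer look.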

## Problem 5: How Do I Validate Illiterate Counts?

In [None]:
edu_df["illiterate_check"] = (
    edu_df["male_illiterate"]
    + edu_df["female_illiterate"]
    - edu_df["total_illiterate"]
)

edu_df["illiterate_check"].value_counts()

## Problem 6: How Do I Validate Literate Counts?

In [None]:
edu_df["literate_check"] = (
    edu_df["male_literate"]
    + edu_df["female_literate"]
    - edu_df["total_literate"]
)

edu_df["literate_check"].value_counts()
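
The three checks follow the same pattern: male plus female minus total. The sketch below expresses that pattern once, assuming every measure uses the `male_` / `female_` / `total_` prefix convention seen above. `add_sum_check` is a hypothetical helper, not something defined elsewhere in the project, and the data is invented.

```python
import pandas as pd


def add_sum_check(df, measure):
    """Add '<measure>_check' = male + female - total (hypothetical helper)."""
    df[f"{measure}_check"] = (
        df[f"male_{measure}"] + df[f"female_{measure}"] - df[f"total_{measure}"]
    )
    return df


# Toy data with invented values; the literate total in the second row is off by one.
toy_df = pd.DataFrame({
    "male_persons": [52, 10], "female_persons": [48, 15], "total_persons": [100, 25],
    "male_illiterate": [12, 4], "female_illiterate": [18, 6], "total_illiterate": [30, 10],
    "male_literate": [40, 6], "female_literate": [30, 9], "total_literate": [70, 16],
})

for measure in ["persons", "illiterate", "literate"]:
    add_sum_check(toy_df, measure)

print(toy_df[["persons_check", "illiterate_check", "literate_check"]])
```

Writing each check out by hand, as the cells above do, is easier to read for three measures. A helper like this mainly pays off if more measures are added later.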

## Problem 7: How Do I Combine All Checks into One Trust Indicator?

In [None]:
# A row is trusted only if all three differences are exactly zero.
edu_df["is_row_valid"] = (
    (edu_df["persons_check"] == 0) &
    (edu_df["illiterate_check"] == 0) &
    (edu_df["literate_check"] == 0)
)

edu_df["is_row_valid"].value_counts()
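
One behaviour worth knowing: if any of the underlying counts is missing, the check evaluates to `NaN`, and `NaN == 0` is `False`, so the row is marked invalid rather than silently passing. The sketch below shows this on toy data (invented values) and also computes the share of valid rows.

```python
import numpy as np
import pandas as pd

# Toy check columns: row 1 has a nonzero difference, row 2 has a missing value.
toy_df = pd.DataFrame({
    "persons_check": [0.0, 10.0, np.nan, 0.0],
    "illiterate_check": [0.0, 0.0, 0.0, 0.0],
    "literate_check": [0.0, 0.0, 0.0, 0.0],
})

toy_df["is_row_valid"] = (
    (toy_df["persons_check"] == 0)
    & (toy_df["illiterate_check"] == 0)
    & (toy_df["literate_check"] == 0)
)

# The mean of a boolean column is the fraction of True values.
valid_share = toy_df["is_row_valid"].mean()
print(toy_df["is_row_valid"].tolist(), valid_share)
```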

## Problem 8: Why Are We Not Validating Across Rows Yet?

## Problem 9: Should Invalid Rows Be Removed Now?

## Problem 10: How Do I Save the Validated Dataset?

In [None]:
validated_path = PROCESSED_DIR / "validated" / "west_bengal_validated.csv"
validated_path.parent.mkdir(parents=True, exist_ok=True)

edu_df.to_csv(validated_path, index=False)

validated_path
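
A quick way to confirm the file was written as intended is to read it back and compare it with the frame in memory. The sketch below does this with a temporary directory and a toy frame so it can run anywhere. The file name and values are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the validated edu_df.
toy_df = pd.DataFrame({
    "total_persons": [100, 25],
    "persons_check": [0, 3],
    "is_row_valid": [True, False],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Mirror the chapter's pattern: create the parent folder, then write without the index.
    out_path = Path(tmp_dir) / "validated" / "toy_validated.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy_df.to_csv(out_path, index=False)

    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_csv(out_path)

# Compare shape and columns with the in-memory frame, and inspect the dtypes.
same_layout = (
    reloaded.shape == toy_df.shape
    and list(reloaded.columns) == list(toy_df.columns)
)
print(same_layout)
print(reloaded.dtypes)
```

For the real dataset, the same comparison can be made between `edu_df` and `pd.read_csv(validated_path)`.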

## End-of-Chapter Direction