# Chapter 20: Cleaning a State Dataset

⚠️ **DO NOT SKIP THIS CELL**

## Run the Next Cell
### Before executing any other cell, you must run the next cell to set up the project folder environment.

In [None]:
from pathlib import Path

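# Detect Google Colab: the import only succeeds inside a Colab runtime.
try:
    from google.colab import drive
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    # On Colab, the project lives on the mounted Google Drive.
    drive.mount("/content/drive")
    PROJECT_ROOT = Path("/content/drive/MyDrive/DataScience/census-education-analysis")
else:
    # Locally, assume the notebook runs from a folder one level below the project root.
    PROJECT_ROOT = Path.cwd().parent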

DATA_DIR = PROJECT_ROOT / "data"
RAW_DIR = DATA_DIR / "raw"
STAGING_DIR = DATA_DIR / "staging"
PROCESSED_DIR = DATA_DIR / "processed"
OUTPUTS_DIR = PROJECT_ROOT / "outputs"

PROJECT_ROOT


## Problem 1: Why Is the Raw Excel Sheet Not Ready for Analysis?

## Problem 2: How Do I Load the Raw Data Without Interpreting It?

In [None]:
import pandas as pd

edu_path = STAGING_DIR / "education" / "west_bengal.xlsx"
df = pd.read_excel(edu_path, skiprows=7, header=None)

df.head()
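The effect of `skiprows` together with `header=None` can be seen on a tiny stand-in. This is only a sketch: it reads hypothetical CSV text with `pd.read_csv` instead of the real Excel file (so it runs without any file on disk), and the two title lines and the data values are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for a raw sheet: two title lines, then data rows.
raw_text = (
    "Title line one,,\n"
    "Title line two,,\n"
    "19,001,Total\n"
    "19,001,Rural\n"
)

# skiprows drops the title lines; header=None keeps every remaining row as data.
# dtype=str stops pandas from guessing types, so "001" keeps its leading zeros.
toy_raw = pd.read_csv(io.StringIO(raw_text), skiprows=2, header=None, dtype=str)

print(toy_raw.columns.tolist())  # integer positions rather than names
print(toy_raw.shape)
```

Because no row is promoted to a header, the columns are labelled by position only, which is why the later cells select them with `iloc`.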

## Problem 3: Which Columns Do We Actually Need?

## Problem 4: How Do I Project Only the Relevant Columns?

In [None]:
edu_selected = df.iloc[:, [
    2,   # District Code
    4,   # Area Type
    5,   # Age Group
    6,   # Total Persons
    7,   # Male Persons
    8,   # Female Persons
    9,   # Total Illiterate
    10,  # Male Illiterate
    11,  # Female Illiterate
    12,  # Total Literate
    13,  # Male Literate
    14   # Female Literate
]].copy()

edu_selected.head()

## Problem 5: Why Must Columns Be Renamed Before Filtering Rows?

In [None]:
edu_selected.columns = [
    "district_code",
    "area_type",
    "age_group",
    "total_persons",
    "male_persons",
    "female_persons",
    "total_illiterate",
    "male_illiterate",
    "female_illiterate",
    "total_literate",
    "male_literate",
    "female_literate",
]

edu_selected.head()
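Keeping positions and names in two separate lists works, but the two can drift apart if one is edited without the other. As an alternative sketch (not the approach used in this chapter), a single dictionary can hold each position next to its name. The toy row and the `position_to_name` mapping below are hypothetical.

```python
import pandas as pd

# Hypothetical one-row frame with positional column labels 0..4.
toy = pd.DataFrame([[19, "note", "Total", "All ages", 100]])

# Each selected position is paired with the name it should receive.
position_to_name = {
    0: "state_code",
    2: "area_type",
    3: "age_group",
    4: "total_persons",
}

# Select by position, then rename in the same order as the dictionary.
toy_selected = toy.iloc[:, list(position_to_name)].copy()
toy_selected.columns = list(position_to_name.values())

print(toy_selected.columns.tolist())
```

Dictionaries preserve insertion order, so the selected columns and their new names stay aligned.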

## Problem 6: Which Rows Are Relevant for Our Analysis?

## Problem 7: How Do I Filter Rows Using Meaning, Not Guesswork?

In [None]:
edu_clean = edu_selected[
    (edu_selected["age_group"] == "All Ages")
]

edu_clean.head()

## Problem 8: Why Did Filtering by `"All Ages"` Suddenly Return Zero Rows?

In [None]:
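# Remove rows that do not represent real geographic data.
# .copy() gives an independent DataFrame, so later column assignments
# do not trigger SettingWithCopyWarning.
edu_selected = edu_selected[edu_selected["district_code"].notna()].copy()

# Clean column labels (defensive, even if already renamed)
edu_selected.columns = edu_selected.columns.str.strip()

# Replace missing values with 0 in the count columns only;
# the text columns are left alone so they are not filled with a number.
count_cols = edu_selected.columns.drop(["district_code", "area_type", "age_group"])
edu_selected[count_cols] = edu_selected[count_cols].fillna(0)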
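One way to confirm why an exact match fails is to print the distinct values with `repr()`, which exposes stray spaces and letter case. The sketch below uses a hypothetical series; the spelling and padding in the real sheet may differ.

```python
import pandas as pd

# Hypothetical values: padded and differently cased variants of one label.
toy_age = pd.Series(["All ages ", " All ages", "0-6"], name="age_group")

# An exact comparison is sensitive to both case and whitespace.
exact_matches = int((toy_age == "All Ages").sum())

# repr() shows the quotes, so leading and trailing spaces become visible.
for value in toy_age.unique():
    print(repr(value))

# After stripping and lowercasing, both variants collapse to one label.
normalised = toy_age.str.strip().str.lower()
normalised_matches = int((normalised == "all ages").sum())

print(exact_matches, normalised_matches)
```

Here the exact comparison matches nothing, while the normalised comparison matches both padded variants.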

## Problem 9: How Do I Make Text Columns Safe for Filtering?

In [None]:
edu_selected["district_code"] = edu_selected["district_code"].astype(int)

edu_selected["age_group"] = (
    edu_selected["age_group"]
    .astype(str)
    .str.strip()
    .str.lower()
)

edu_selected["area_type"] = (
    edu_selected["area_type"]
    .astype(str)
    .str.strip()
    .str.lower()
)

edu_selected[["age_group", "area_type"]].head()

## Problem 10: How Do I Filter Only “All Ages” While Keeping Total, Urban, and Rural?

In [None]:
edu_age_filtered = edu_selected[
    edu_selected["age_group"] == "all ages"
]

edu_age_filtered.head()
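A quick check that the age filter kept every area type can catch a filter that is accidentally too narrow. This is a minimal sketch on a hypothetical frame; the `expected_area_types` set assumes the normalised labels are `total`, `rural`, and `urban`.

```python
import pandas as pd

# Hypothetical normalised data: three area types across two age groups.
toy = pd.DataFrame({
    "area_type": ["total", "rural", "urban", "total", "rural"],
    "age_group": ["all ages", "all ages", "all ages", "0-6", "0-6"],
})

toy_filtered = toy[toy["age_group"] == "all ages"]

# Assumed labels; adjust to whatever the normalised column actually contains.
expected_area_types = {"total", "rural", "urban"}
missing = expected_area_types - set(toy_filtered["area_type"])

print(toy_filtered["area_type"].value_counts().to_dict())
print("missing area types:", missing)
```

An empty `missing` set means the filter narrowed the age groups without dropping any area type.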

## Problem 11: What Does This Dataset Represent Now?

## Problem 12: Why Are We Not Removing State Rows Yet?

## Problem 13: How Do I Save This Cleaned Dataset for Validation?

In [None]:
processed_path = PROCESSED_DIR / "cleaned" / "west_bengal_cleaned.csv"
processed_path.parent.mkdir(parents=True, exist_ok=True)

edu_age_filtered.to_csv(processed_path, index=False)
processed_path
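Before moving on to validation, it can help to read the saved file back and compare it with the frame in memory. The sketch below does this round trip in a temporary folder with a hypothetical frame, so it does not touch the project directories.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned data with the same kinds of columns as the chapter's output.
toy_clean = pd.DataFrame({
    "district_code": [1, 2],
    "area_type": ["total", "rural"],
    "total_persons": [1000, 600],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "cleaned" / "toy_cleaned.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # index=False keeps the row index out of the file, as in the cell above.
    toy_clean.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# True only if values, column names, and dtypes all survived the round trip.
round_trip_ok = reloaded.equals(toy_clean)
print(round_trip_ok, reloaded.shape)
```

CSV stores no type information, so codes with leading zeros or columns containing missing values may come back with a different dtype; a failed comparison is a prompt to inspect `reloaded.dtypes`.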

## End-of-Chapter Direction