# Chapter 24: Building a National Dataset

⚠️ **DO NOT SKIP THIS CELL**

## Run the Next Cell
### Before executing any other cell, run the next cell to set up the project folder environment.

In [None]:
from pathlib import Path

try:
    from google.colab import drive
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    drive.mount("/content/drive")
    PROJECT_ROOT = Path("/content/drive/MyDrive/DataScience/census-education-analysis")
else:
    PROJECT_ROOT = Path.cwd().parent

DATA_DIR = PROJECT_ROOT / "data"
RAW_DIR = DATA_DIR / "raw"
STAGING_DIR = DATA_DIR / "staging"
PROCESSED_DIR = DATA_DIR / "processed"
OUTPUTS_DIR = PROJECT_ROOT / "outputs"

PROJECT_ROOT


## Problem 1: Why Can’t We Analyze States Separately Forever?

## Problem 2: What Exactly Are We Combining at This Stage?

## Problem 3: Why Did We Keep `district_code == "000"` Until Now?

## Problem 4: Why Must State-Level Rows Be Removed Before National Aggregation?
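
The double-counting risk can be illustrated with a toy table. This is a minimal sketch with made-up numbers and a hypothetical `literates` column; it assumes the `"000"` row of a state holds the state total, that is, the sum of its district rows.

```python
import pandas as pd

# Made-up figures for one hypothetical state. "000" marks the state-level row,
# assumed here to equal the sum of the district rows below it.
toy = pd.DataFrame({
    "district_code": ["000", "001", "002"],
    "literates": [300, 100, 200],
})

# Summing every row counts the state total and its districts together.
naive_total = toy["literates"].sum()

# Dropping the state-level row first leaves each person counted once.
district_total = toy.loc[toy["district_code"] != "000", "literates"].sum()

print(naive_total, district_total)
```

Under that assumption, any national sum that keeps the state-level rows comes out at twice the true figure.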

## Problem 5: How Do I Load and Prepare All State Files?

In [None]:
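from pathlib import Path
import pandas as pd

# Per-state files written by the earlier processing steps.
processed_dir = PROCESSED_DIR / "education"
# Sort so the files load in the same order on every run.
state_files = sorted(processed_dir.glob("*_processed.csv"))

# Number of state files found; 0 means the folder or the pattern is wrong.
len(state_files)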
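
To see what the glob pattern picks up, here is a small self-contained sketch that runs against a temporary folder. The file names are hypothetical and only serve to show that files not ending in `_processed.csv` are ignored.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Hypothetical file names: two state files and one unrelated file.
    for name in ["kerala_processed.csv", "goa_processed.csv", "notes.txt"]:
        (folder / name).touch()

    # Only names matching the pattern are returned; sorting fixes the order.
    found = sorted(p.name for p in folder.glob("*_processed.csv"))

print(found)
```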

## Problem 6: How Do I Add State Identity to Each Row?

In [None]:
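def load_and_label_state(file_path):
    """Read one processed state file and tag every row with its state name."""
    df = pd.read_csv(file_path)
    # A file named "<state>_processed.csv" yields the label "<state>".
    state_name = file_path.stem.replace("_processed", "")
    df["state_name"] = state_name
    return df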
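
The label comes entirely from the file name, so it helps to see that step in isolation. The path below is hypothetical; no file needs to exist for this to run.

```python
from pathlib import Path

# Hypothetical path following the "*_processed.csv" naming pattern.
example_path = Path("data/processed/education/kerala_processed.csv")

# .stem drops the folder and the ".csv" suffix, leaving "kerala_processed".
stem = example_path.stem

# Removing the "_processed" marker leaves the state label.
state_name = stem.replace("_processed", "")

print(state_name)
```

The label is only as clean as the file name: a file called `Kerala_processed.csv` would produce `Kerala`, a different value from `kerala`.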

## Problem 7: How Do I Remove State-Level Rows Safely?

In [None]:
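national_parts = []

for file_path in state_files:
    state_df = load_and_label_state(file_path)
    # State-level rows carry district code "000". read_csv may parse the code
    # as the integer 0, so normalize to a zero-padded string before comparing.
    codes = state_df["district_code"].astype(str).str.zfill(3)
    district_df = state_df[codes != "000"]
    national_parts.append(district_df)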
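
Whether the state-level code appears as `0` or as `"000"` depends on how `pd.read_csv` parsed the column. The sketch below uses a made-up three-row CSV to show both outcomes; the `literates` column is hypothetical.

```python
import io
import pandas as pd

# Made-up CSV text with zero-padded district codes.
csv_text = "district_code,literates\n000,300\n001,100\n002,200\n"

# Default parsing treats the codes as integers and drops the leading zeros.
as_default = pd.read_csv(io.StringIO(csv_text))

# Forcing a string dtype keeps the codes exactly as written in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"district_code": str})

print(as_default["district_code"].tolist())
print(as_text["district_code"].tolist())

# With string codes, the state-level row is matched by "000".
districts = as_text[as_text["district_code"] != "000"]
print(len(districts))
```

A filter written for one representation silently keeps every row under the other, so it is worth checking the column's dtype before relying on the comparison.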

## Problem 8: How Do I Combine All States into One Dataset?

In [None]:
india_df = pd.concat(national_parts, ignore_index=True)

# Only a cell's last expression is displayed, so show both checks together.
india_df.shape, india_df["state_name"].nunique()
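
A few quick checks can confirm that concatenation neither lost nor duplicated rows. The sketch below uses two toy frames with hypothetical state names in place of the real per-state tables.

```python
import pandas as pd

# Two toy per-state frames standing in for the filtered district tables.
parts = [
    pd.DataFrame({"district_code": [1, 2], "state_name": "kerala"}),
    pd.DataFrame({"district_code": [1], "state_name": "goa"}),
]

combined = pd.concat(parts, ignore_index=True)

# Row count should equal the sum of the parts.
assert len(combined) == sum(len(p) for p in parts)
# One distinct state name per input frame.
assert combined["state_name"].nunique() == len(parts)
# ignore_index=True yields a fresh 0..n-1 index with no repeats.
assert combined.index.is_unique

print(combined.shape)
```

On the real data, comparing `india_df["state_name"].nunique()` with `len(state_files)` is the same check.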

## Problem 9: Why Is This Called a “Normalized National Dataset”?

## Problem 10: How Do I Save the National Dataset for Analysis?

In [None]:
national_path = PROCESSED_DIR / "india_national.csv"
india_df.to_csv(national_path, index=False)

national_path
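
Reading the file back is a cheap way to confirm the save worked. This sketch writes a toy frame to a temporary folder instead of the project folder; the file name and values are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with zero-padded codes stored as strings.
toy = pd.DataFrame({
    "state_name": ["goa", "kerala"],
    "district_code": ["001", "001"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_national.csv"
    toy.to_csv(path, index=False)
    # Read the codes back as strings so the leading zeros survive.
    reloaded = pd.read_csv(path, dtype={"district_code": str})

print(reloaded.shape)
print(reloaded["district_code"].tolist())
```

Without the `dtype` argument, the reloaded codes would come back as integers, which matters for any later step that compares them as text.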

## End-of-Chapter Direction