# Chapter 28: Preparing Data for BI and ML

⚠️ **DO NOT SKIP THIS CELL**

## Run the next cell
### Before executing any other cell, run the next cell to set up the project folder environment.

In [None]:
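from pathlib import Path

# Detect whether the notebook is running in Google Colab.
try:
    from google.colab import drive
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    # In Colab, the project lives on the mounted Google Drive.
    drive.mount("/content/drive")
    PROJECT_ROOT = Path("/content/drive/MyDrive/DataScience/census-education-analysis")
else:
    # Locally, the project root is the parent of the current working directory.
    PROJECT_ROOT = Path.cwd().parent

# Project folders derived from the root.
DATA_DIR = PROJECT_ROOT / "data"
RAW_DIR = DATA_DIR / "raw"
STAGING_DIR = DATA_DIR / "staging"
PROCESSED_DIR = DATA_DIR / "processed"
OUTPUTS_DIR = PROJECT_ROOT / "outputs"

# Display the resolved project root as a quick sanity check.
PROJECT_ROOT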


## Problem 1: What Dataset Are We Preparing for BI and ML?

In [None]:
import pandas as pd

india_path = PROCESSED_DIR / "state_ranked_by_literacy.csv"
india_df = pd.read_csv(india_path)

india_df.head()

## Problem 2: What Problem Are We Solving Before BI or ML?

## Problem 3: Which Columns Should Be Kept?

In [None]:
selected_columns = [
    "state_name",
    "area_type",
    "total_persons",
    "male_persons",
    "female_persons",
    "total_literate",
    "male_literate",
    "female_literate",
    "literacy_rate",
    "gender_literacy_gap"
]
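
Before subsetting, it can help to confirm that every column in the list actually exists in the loaded table. The sketch below is not part of the original notebook: `sample_df` and `wanted_columns` are hypothetical stand-ins with made-up values, used only to show the check.

```python
import pandas as pd

# Hypothetical miniature frame standing in for india_df; the values are made up.
sample_df = pd.DataFrame({
    "state_name": ["A", "B"],
    "area_type": ["Total", "Total"],
    "literacy_rate": [90.5, 80.25],
})

# Hypothetical column list; the last name is deliberately absent from sample_df.
wanted_columns = ["state_name", "area_type", "literacy_rate", "gender_literacy_gap"]

# Collect requested columns that the frame does not contain, in request order.
missing_columns = [col for col in wanted_columns if col not in sample_df.columns]

print(missing_columns)
```

An empty list means the selection is safe. A non-empty list points to a renamed or dropped column upstream, which would otherwise surface as a `KeyError` when subsetting.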

## Problem 4: How Do We Create a Focused, Stable Dataset?

In [None]:
final_df = india_df[selected_columns].copy()
final_df.head()
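
The `.copy()` call is what keeps the focused dataset independent of the table it came from. A minimal sketch, assuming a made-up `source_df` (not the chapter's data), shows that editing the subset leaves the source untouched:

```python
import pandas as pd

# Hypothetical source table; values are illustrative only.
source_df = pd.DataFrame({
    "state_name": ["A", "B"],
    "literacy_rate": [90.5, 80.25],
    "unused_column": [1, 2],
})

# Keep two columns and take an explicit, independent copy.
subset_df = source_df[["state_name", "literacy_rate"]].copy()

# Change a value in the subset only.
subset_df.loc[0, "literacy_rate"] = 0.0

# The source still holds its original value.
print(source_df.loc[0, "literacy_rate"])
print(subset_df.loc[0, "literacy_rate"])
```

Making the copy explicit also avoids ambiguity about whether later assignments write to a view or to a new object.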

## Problem 5: How Do We Ensure All Numeric Fields Are Truly Numeric?

In [None]:
# Every column except the two text labels is expected to be numeric.
numeric_cols = final_df.columns.difference(["state_name", "area_type"])

In [None]:
final_df[numeric_cols] = final_df[numeric_cols].apply(
    pd.to_numeric, errors="coerce"
)
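
With `errors="coerce"`, any value that cannot be parsed as a number becomes `NaN` instead of raising an error. The sketch below uses a hypothetical `raw` frame with made-up strings to show the effect, including how to count values that were present before conversion but are missing afterwards.

```python
import pandas as pd

# Hypothetical raw column with one clean value and two unparseable ones.
raw = pd.DataFrame({
    "state_name": ["A", "B", "C"],
    "total_persons": ["1000", "2,500", "n/a"],
})

# Unparseable strings (a thousands separator, a text placeholder) become NaN.
converted = pd.to_numeric(raw["total_persons"], errors="coerce")

# Values that were present before conversion but are missing afterwards.
newly_missing = converted.isna() & raw["total_persons"].notna()

print(converted)
print("Values lost to coercion:", int(newly_missing.sum()))
```

Because coercion is silent, comparing missing-value counts before and after the conversion is a cheap way to catch formatting problems such as thousands separators.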

## Problem 6: How Do We Handle Missing Values at This Stage?

In [None]:
final_df.isna().sum()
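
The per-column counts show how many values are missing, but not which rows are affected. The sketch below uses a hypothetical `sample_df` with made-up values to list the affected rows. The `dropna()` call is shown only as one possible option, not as the chapter's chosen policy.

```python
import pandas as pd

# Hypothetical frame with two missing numeric values; figures are made up.
sample_df = pd.DataFrame({
    "state_name": ["A", "B", "C"],
    "literacy_rate": [90.5, float("nan"), 75.5],
    "gender_literacy_gap": [5.0, 7.5, float("nan")],
})

# Missing values per column, as in the cell above.
missing_counts = sample_df.isna().sum()

# Rows that contain at least one missing value.
rows_with_missing = sample_df[sample_df.isna().any(axis=1)]

# One possible treatment (an assumption, not the source's rule): keep complete rows only.
complete_rows = sample_df.dropna()

print(missing_counts)
print(rows_with_missing)
print(len(complete_rows))
```

Whether to drop, impute, or keep such rows depends on how the downstream BI or ML step treats gaps, so inspect the rows before deciding.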

## Problem 7: Why Is This Dataset Ready for BI and ML?

## Problem 8: How Do We Freeze the Dataset for Reuse?

In [None]:
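output_path = OUTPUTS_DIR / "india_education_bi_ml_ready.csv"
# to_csv does not create missing folders, so make sure the outputs folder exists.
OUTPUTS_DIR.mkdir(parents=True, exist_ok=True)
final_df.to_csv(output_path, index=False)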

output_path
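
A frozen file is only useful if it reads back the same way it was written. The sketch below is a round-trip check on a hypothetical `frozen_df` written to a temporary folder. The file name and values are illustrative and are not the chapter's output.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical frame standing in for the final dataset; values are made up.
frozen_df = pd.DataFrame({
    "state_name": ["A", "B"],
    "total_persons": [1000, 2500],
    "literacy_rate": [90.5, 80.25],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "frozen_sample.csv"
    # Write without the index so the file holds only the data columns.
    frozen_df.to_csv(csv_path, index=False)
    reloaded_df = pd.read_csv(csv_path)

# Raises AssertionError if columns, dtypes, or values differ after the round trip.
pd.testing.assert_frame_equal(frozen_df, reloaded_df)
print("Round trip OK:", reloaded_df.shape)
```

The same comparison can be run against the real output file to confirm that column names and numeric types survive the save.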

## End-of-Chapter Direction