# Data Cleaning

## Purpose
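This notebook performs data quality checks and cleaning on the combined life expectancy dataset, then derives percentage features from the raw counts. Because notebook 03 combined the sources with an **inner join**, county-years missing from any source have already been dropped. Missing values and error codes inside the matched rows can still remain, so this notebook focuses on: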

1. Identifying and handling missing values (NaN/empty fields)
2. Detecting and removing Census API error codes (`-666666666`)
3. Verifying data integrity and consistency
4. Engineering percentage features and saving the final dataset

## Input
- `data_cleaned/processed/combined_by_year/combined_all_years.csv`
- Expected: County-year observations with 19 ACS variables
- Years: 2012-2019

## Output
- `data_cleaned/demographics_final/combined_all_years_cleaned_final.csv`
- Cleaned dataset with engineered percentage features, ready for EDA and modeling

## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

## 2. Load Data

Read the combined dataset from notebook 03.

In [None]:
# Load the combined dataset
df = pd.read_csv('../data_cleaned/processed/combined_by_year/combined_all_years.csv')

print(f"Dataset shape: {df.shape}")
print(f"  - Rows (county-year observations): {df.shape[0]:,}")
print(f"  - Columns (features + identifiers): {df.shape[1]}")
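
One thing worth checking at load time is whether any county identifier column keeps its leading zeros, since `pd.read_csv` infers numeric types by default. The sketch below uses an in-memory CSV with a hypothetical `fips` column and made-up population values; the real dataset's identifier column may be named differently or may not need this at all.

```python
import io

import pandas as pd

# Illustrative CSV text; "fips" is a hypothetical identifier column
# and the population figures are invented for the example.
csv_text = "fips,Year,Total Population\n01001,2012,54590\n06037,2012,9840024\n"

# Default type inference parses the identifier as an integer.
as_default = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str preserves the leading zeros.
as_string = pd.read_csv(io.StringIO(csv_text), dtype={"fips": str})

print(as_default["fips"].tolist())
print(as_string["fips"].tolist())
```

If the identifiers are only used for grouping, either representation works, but string codes are safer when joining against other Census files later.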

## 3. Initial Data Inspection

Review the structure, data types, and first few rows of the dataset.

In [None]:
# Display first few rows
print("First 5 rows of the dataset:")
df.head()

In [None]:
# Display dataset info
print("Dataset information:")
df.info()

In [None]:
# Display all column names
print(f"Total columns: {len(df.columns)}\n")
print("Column names:")
for i, col in enumerate(df.columns, 1):
    print(f"{i:2}. {col}")

## 4. Check for Missing Values

Identify any missing values (NaN or empty fields) in the dataset.

In [None]:
# Count missing values per column
missing_values = df.isnull().sum()

print("Missing values per column:")
print("=" * 60)
for col in df.columns:
    if missing_values[col] > 0:
        pct = (missing_values[col] / len(df)) * 100
        print(f"{col:40s}: {missing_values[col]:5} ({pct:.2f}%)")

total_missing = missing_values.sum()
print("=" * 60)
print(f"Total missing values: {total_missing}")

if total_missing == 0:
    print("\n✓ No missing values detected!")

In [None]:
# If there are missing values, examine the affected rows
if df.isnull().any().any():
    print("\nRows with missing values:")
    print("=" * 60)
    rows_with_missing = df[df.isnull().any(axis=1)]
    print(f"Total rows affected: {len(rows_with_missing)}")
    print("\nSample of affected rows:")
    print(rows_with_missing.head(10))
else:
    print("\n✓ No rows with missing values!")
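
The per-column loop above can also be expressed as a single summary table, which is easier to sort and reuse in later cells. This is a minimal sketch on a small made-up frame; the column names and the helper `missing_summary` are illustrative, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the combined dataset (hypothetical columns).
toy = pd.DataFrame({
    "Year": [2012, 2012, 2013, 2013],
    "Total Population": [1000.0, np.nan, 1500.0, 800.0],
    "Median Income": [np.nan, np.nan, 52000.0, 41000.0],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per affected column."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "pct": counts / len(frame) * 100,
    })
    # Keep only columns with at least one missing value, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


summary = missing_summary(toy)
print(summary)
```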

## 5. Check for Census Error Codes

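The Census API returns `-666666666` in place of an estimate that could not be computed, typically because there were too few sample observations. These values are placeholders rather than real measurements, so identify any occurrences of this error code before computing statistics or ratios.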

In [None]:
# Check for -666666666 error codes
error_code_mask = (df == -666666666).any(axis=1)
rows_with_errors = df[error_code_mask]

print("Census API Error Code Check:")
print("=" * 60)
print(f"Rows with -666666666 error codes: {len(rows_with_errors)}")

if len(rows_with_errors) > 0:
    print(f"Percentage of dataset: {(len(rows_with_errors) / len(df)) * 100:.2f}%")
    print("\nColumns affected:")
    for col in df.columns:
        count = (df[col] == -666666666).sum()
        if count > 0:
            print(f"  {col}: {count} occurrences")
    
    print("\nSample of affected rows:")
    print(rows_with_errors.head(10))
else:
    print("✓ No Census error codes found!")
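
The check above compares every cell, including identifier columns, against a single code. A possible variation restricts the comparison to numeric columns and accepts a list of codes. In the sketch below only `-666666666` comes from this notebook; the other values in `SENTINELS` are assumed to be additional ACS placeholder codes and should be confirmed against the Census documentation before use. The toy columns are illustrative.

```python
import pandas as pd

# -666666666 is the code named in this notebook. The remaining values are
# an assumption about other ACS placeholder codes; verify before relying on them.
SENTINELS = [-666666666, -999999999, -888888888, -222222222]

# Toy frame with hypothetical columns.
toy = pd.DataFrame({
    "County": ["A", "B", "C"],
    "Median Income": [52000, -666666666, 41000],
    "Median Age": [38.1, 40.2, -888888888.0],
})


def sentinel_counts(frame: pd.DataFrame, sentinels) -> pd.Series:
    """Count sentinel values in each numeric column."""
    numeric = frame.select_dtypes(include="number")
    return numeric.isin(sentinels).sum()


counts = sentinel_counts(toy, SENTINELS)

# Boolean mask of rows containing at least one sentinel in a numeric column.
rows_flagged = toy.select_dtypes(include="number").isin(SENTINELS).any(axis=1)

print(counts[counts > 0])
print(f"Rows flagged: {rows_flagged.sum()}")
```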

## 6. Remove Problematic Rows

Remove all rows that contain either missing values or Census error codes. Since we used inner join, these should be minimal.

In [None]:
# Create cleaned dataset by removing problematic rows
print("Cleaning dataset...")
print("=" * 60)

# Starting size
print(f"Original dataset: {len(df):,} rows")

# Step 1: Remove rows with missing values
df_cleaned = df.dropna()
print(f"After removing missing values: {len(df_cleaned):,} rows ({len(df) - len(df_cleaned)} removed)")

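# Step 2: Remove rows with Census error codes
# .copy() makes df_cleaned an independent frame, so the column assignments
# in later sections do not trigger pandas' SettingWithCopyWarning
df_cleaned = df_cleaned[(df_cleaned != -666666666).all(axis=1)].copy()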
print(f"After removing error codes: {len(df_cleaned):,} rows ({len(df) - len(df_cleaned)} total removed)")

print("=" * 60)
print(f"Final cleaned dataset: {len(df_cleaned):,} rows")
print(f"Rows removed: {len(df) - len(df_cleaned)} ({((len(df) - len(df_cleaned)) / len(df)) * 100:.2f}%)")
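
An alternative to the two-step removal is to treat the error code as missing and drop incomplete rows in one pass. This is only a sketch on a toy frame with illustrative column names; it yields the same rows as the two-step approach when the error code is the only placeholder value present.

```python
import numpy as np
import pandas as pd

ERROR_CODE = -666666666

# Toy frame: one row has a NaN, one row has the error code.
toy = pd.DataFrame({
    "Year": [2012, 2012, 2013, 2013],
    "Median Income": [52000.0, np.nan, ERROR_CODE, 41000.0],
    "mean_life_expectancy": [78.2, 79.1, 77.5, 80.0],
})

# Convert the error code to NaN, then drop every incomplete row.
# .copy() keeps the result independent of the intermediate frame.
cleaned = toy.replace(ERROR_CODE, np.nan).dropna().copy()

removed = len(toy) - len(cleaned)
print(f"Kept {len(cleaned)} of {len(toy)} rows ({removed} removed)")
```

The two-step version in the cell above has the advantage of reporting how many rows each problem accounts for, which this one-pass form does not.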

## 7. Verify Cleaned Dataset

Perform final checks to ensure the cleaned dataset has no remaining issues.

In [None]:
# Verification checks
print("Cleaned Dataset Verification:")
print("=" * 60)

# Check for missing values
missing_after = df_cleaned.isnull().sum().sum()
print(f"Missing values: {missing_after}")

# Check for error codes
error_codes_after = (df_cleaned == -666666666).any().any()
print(f"Error codes remaining: {error_codes_after}")

# Check data integrity
print(f"\nShape: {df_cleaned.shape}")
print(f"Years present: {sorted(df_cleaned['Year'].unique())}")
print("Counties per year:")
print(df_cleaned.groupby('Year').size())

print("\n" + "=" * 60)
if missing_after == 0 and not error_codes_after:
    print("✓ Dataset is clean and ready for feature engineering!")
else:
    print("⚠ Warning: Issues remain in the dataset")
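
The counts printed above show how many rows each year has, but not whether a county appears twice in a year or drops out of some years. A sketch of both checks follows, assuming a hypothetical county identifier column named `fips` (only `Year` is known from this notebook).

```python
import pandas as pd

# Toy panel; "fips" is a hypothetical county identifier column.
toy = pd.DataFrame({
    "fips": ["01001", "01001", "01003", "01003", "01003"],
    "Year": [2012, 2013, 2012, 2013, 2013],
    "mean_life_expectancy": [75.1, 75.3, 78.0, 78.2, 78.2],
})

# Each county should appear at most once per year.
dupes = toy[toy.duplicated(subset=["fips", "Year"], keep=False)]
print(f"Duplicate county-year rows: {len(dupes)}")

# Counties observed in fewer years than the dataset covers are incomplete.
years_per_county = (
    toy.drop_duplicates(["fips", "Year"]).groupby("fips")["Year"].nunique()
)
incomplete = years_per_county[years_per_county < toy["Year"].nunique()]
print(f"Counties missing at least one year: {len(incomplete)}")
```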

In [None]:
# Summary statistics for the target variable
print("\nSummary Statistics for Life Expectancy:")
print("=" * 60)
print(df_cleaned['mean_life_expectancy'].describe())

In [None]:
# Display first few rows of cleaned dataset
print("Sample of cleaned dataset:")
df_cleaned.head(10)

## 8. Feature Engineering: Race Percentages

Calculate race percentages from population counts.

In [None]:
print("Calculating race percentages...")
print("=" * 60)

# White Population (%)
df_cleaned['White Population (%)'] = (df_cleaned['White Population'] / df_cleaned['Total Population']) * 100

# Hispanic Population (%)
df_cleaned['Hispanic Population (%)'] = (df_cleaned['Hispanic Population'] / df_cleaned['Total Population']) * 100

# Black Population (%)
df_cleaned['Black Population (%)'] = (df_cleaned['Black Population'] / df_cleaned['Total Population']) * 100

print("✓ Race percentages calculated")
print("\nNew columns created:")
print("  - White Population (%)")
print("  - Hispanic Population (%)")
print("  - Black Population (%)")

# Display summary statistics
print("\nSummary statistics:")
print("\nWhite Population (%):")
print(df_cleaned['White Population (%)'].describe())
print("\nHispanic Population (%):")
print(df_cleaned['Hispanic Population (%)'].describe())
print("\nBlack Population (%):")
print(df_cleaned['Black Population (%)'].describe())
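
The divisions above produce `inf` or `NaN` if any county has a `Total Population` of 0, which would reintroduce the missing values removed earlier. A minimal guard, assuming zero denominators should be treated as missing, is sketched below; `safe_pct` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def safe_pct(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Percentage that yields NaN instead of inf when the denominator is 0."""
    # .where keeps non-zero denominators and replaces zeros with NaN.
    return numerator / denominator.where(denominator != 0) * 100


# Toy frame with one zero-population row.
toy = pd.DataFrame({
    "White Population": [800, 0, 50],
    "Total Population": [1000, 0, 200],
})
toy["White Population (%)"] = safe_pct(
    toy["White Population"], toy["Total Population"]
)
print(toy)
```

Rows that come out as `NaN` can then be counted and dropped, or inspected, with the same tools used in Section 4.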

## 9. Feature Engineering: Housing and Family Percentages

Calculate percentages for housing and family characteristics.

In [None]:
print("Calculating housing and family percentages...")
print("=" * 60)

# Households with No Vehicle (%)
df_cleaned['Households with No Vehicle (%)'] = (
    (df_cleaned['No Vehicle (Owner)'] + df_cleaned['No Vehicle (Renter)']) / 
    df_cleaned['Total Occupied Households']
) * 100

# Rent Burden (+50% of HI)
df_cleaned['Rent Burden (+50% of HI)'] = (
    df_cleaned['Rent Burden Count (+50%)'] / 
    df_cleaned['Rent Denominator']
) * 100

# Single Mother Families (%)
df_cleaned['Single Mother Families (%)'] = (
    df_cleaned['Total Families (Single Mother)'] / 
    df_cleaned['Total Families']
) * 100

print("✓ Housing and family percentages calculated")
print("\nNew columns created:")
print("  - Households with No Vehicle (%)")
print("  - Rent Burden (+50% of HI)")
print("  - Single Mother Families (%)")

# Display summary statistics
print("\nSummary statistics:")
print("\nHouseholds with No Vehicle (%):")
print(df_cleaned['Households with No Vehicle (%)'].describe())
print("\nRent Burden (+50% of HI):")
print(df_cleaned['Rent Burden (+50% of HI)'].describe())
print("\nSingle Mother Families (%):")
print(df_cleaned['Single Mother Families (%)'].describe())
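
The `describe()` output above shows minimum and maximum values, but a direct range check makes problems easier to spot. The sketch below flags percentage values that are non-finite or fall outside 0 to 100, using a toy frame with deliberately bad values.

```python
import numpy as np
import pandas as pd

# Toy frame with one out-of-range value and one infinite value.
toy = pd.DataFrame({
    "Households with No Vehicle (%)": [5.2, 12.8, 104.0],
    "Single Mother Families (%)": [10.0, np.inf, 22.5],
})

# Select percentage columns by their "(%)" suffix.
pct_cols = [c for c in toy.columns if c.endswith("(%)")]
values = toy[pct_cols]

# A value is suspicious if it is NaN/inf or lies outside the 0-100 range.
bad = ~np.isfinite(values) | (values < 0) | (values > 100)
bad_counts = bad.sum()
print(bad_counts)
```

Note that `Rent Burden (+50% of HI)` does not end in `(%)`, so a suffix-based selection like this one would need that column added explicitly.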

## 10. Drop Raw Count Columns

Remove the raw count columns that were used to create percentages, keeping only Total Population.

In [None]:
print("Dropping raw count columns...")
print("=" * 60)

# Columns to drop (all used for percentage calculations except Total Population)
columns_to_drop = [
    'White Population',
    'Hispanic Population',
    'Black Population',
    'No Vehicle (Owner)',
    'No Vehicle (Renter)',
    'Total Occupied Households',
    'Rent Burden Count (+50%)',
    'Rent Denominator',
    'Total Families (Single Mother)',
    'Total Families'
]

df_cleaned = df_cleaned.drop(columns=columns_to_drop)

print(f"✓ Dropped {len(columns_to_drop)} columns")
print("\nColumns dropped:")
for col in columns_to_drop:
    print(f"  - {col}")

print(f"\nDataset shape after dropping: {df_cleaned.shape}")
print(f"  - Rows: {df_cleaned.shape[0]:,}")
print(f"  - Columns: {df_cleaned.shape[1]}")

## 11. Rename Target Variable

Rename the target variable for consistency.

In [None]:
print("Renaming target variable...")
print("=" * 60)

# Rename mean_life_expectancy to Mean Life Expectancy
df_cleaned = df_cleaned.rename(columns={'mean_life_expectancy': 'Mean Life Expectancy'})

print("✓ Renamed 'mean_life_expectancy' to 'Mean Life Expectancy'")
print("\nTarget variable summary:")
print(df_cleaned['Mean Life Expectancy'].describe())

## 12. Final Dataset Verification

Review the final dataset structure with all engineered features.

In [None]:
print("Final Dataset Structure:")
print("=" * 60)
print(f"Shape: {df_cleaned.shape}")
print(f"  - Rows: {df_cleaned.shape[0]:,}")
print(f"  - Columns: {df_cleaned.shape[1]}")

print("\nAll column names:")
for i, col in enumerate(df_cleaned.columns, 1):
    print(f"{i:2}. {col}")

print("\nSample of final dataset:")
df_cleaned.head(10)

## 13. Save Final Dataset

Export the cleaned and engineered dataset for machine learning models.

In [None]:
# Create output directory
output_dir = Path('../data_cleaned/demographics_final')
output_dir.mkdir(parents=True, exist_ok=True)

# Save final cleaned and engineered dataset
output_path = output_dir / 'combined_all_years_cleaned_final.csv'
df_cleaned.to_csv(output_path, index=False)

print("=" * 60)
print("DATASET SAVED")
print("=" * 60)
print(f"✓ Final dataset saved to: {output_path}")
print(f"\n  - Shape: {df_cleaned.shape}")
print(f"  - Rows: {df_cleaned.shape[0]:,}")
print(f"  - Columns: {df_cleaned.shape[1]}")
print(f"  - File size: {output_path.stat().st_size / (1024*1024):.2f} MB")
print("\n  - Engineered features: 6")
print("  - Ready for machine learning!")
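
After saving, a quick round-trip read can confirm that the file on disk matches the frame in memory. This sketch writes a toy frame to a temporary directory rather than the project path, so the file name used here is illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({
    "Year": [2012, 2013],
    "Mean Life Expectancy": [78.2, 79.1],
})

with tempfile.TemporaryDirectory() as tmp:
    # Mirror the notebook's pattern: create the folder, then write the CSV.
    out_path = Path(tmp) / "demographics_final" / "toy_final.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)

    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(f"Shape preserved: {same_shape}, columns preserved: {same_columns}")
```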

## 14. Summary

**Data Cleaning Results:**
- Original observations: See output in Section 6
- Rows removed: See output in Section 6
- Final clean observations: See output in Section 6

**Quality Checks Passed:**
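- ✓ No missing values (NaN) after cleaning (verified in Section 7)
- ✓ No Census error codes (-666666666) after cleaning (verified in Section 7)
- Years present (expected 2012-2019): confirm against the `Years present` output in Section 7
- County counts per year: confirm consistency against the `Counties per year` output in Section 7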

**Feature Engineering Completed:**
- **Race Percentages (3 features):**
  - White Population (%) = (White Population / Total Population) × 100
  - Hispanic Population (%) = (Hispanic Population / Total Population) × 100
  - Black Population (%) = (Black Population / Total Population) × 100

- **Housing & Family Percentages (3 features):**
  - Households with No Vehicle (%) = [(No Vehicle (Owner) + No Vehicle (Renter)) / Total Occupied Households] × 100
  - Rent Burden (+50% of HI) = (Rent Burden Count (+50%) / Rent Denominator) × 100
  - Single Mother Families (%) = (Total Families (Single Mother) / Total Families) × 100

**Columns Dropped:**
- Raw count columns used for percentages (10 columns dropped)
- Kept: Total Population (useful as a feature)

**Target Variable:**
- Renamed `mean_life_expectancy` → `Mean Life Expectancy`

**Final Dataset:**
- **Location:** `data_cleaned/demographics_final/combined_all_years_cleaned_final.csv`
- **Features:** 6 engineered percentages plus the remaining original ACS variables, identifiers, and the target (see the column listing in Section 12 for the exact count)
- **Ready for:** Exploratory Data Analysis (EDA) and Machine Learning models

**Next Steps:**
- Proceed to notebook 05 for EDA and additional analysis
- Then notebooks 06+ for machine learning model development