# Data Transformation: Final Dataset Integration (AQI + Income + Race + Population Density)

This notebook documents the final join in our data pipeline: we merge the county-level population density metric into the combined AQI, income, and race dataset.

**Note on Normalization:** The geographic identifiers must be standardized. The density data includes suffixes like "County" or "Parish", which we strip to match the format of our primary environmental dataset.

In [None]:
import pandas as pd
import os
import re

# File Paths
BASE_DIR = '..'
AQI_INCOME_RACE_PATH = os.path.join(BASE_DIR, 'JOINED-aqi-income-race', 'aqi-income-race-joined.csv')
DENSITY_PATH = os.path.join(BASE_DIR, 'cleaned-datasets', 'populationDensity-by-county', 'cleaned-population-density-by-county.csv')
OUTPUT_PATH = 'aqi_income_race_density_joined.csv'

def load_data():
    main_df = pd.read_csv(AQI_INCOME_RACE_PATH)
    density_df = pd.read_csv(DENSITY_PATH)
    return main_df, density_df

main_df, density_df = load_data()

print(f"Main Dataset Records: {main_df.shape[0]}")
print(f"Density Dataset Records: {density_df.shape[0]}")
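
Before normalizing, it can help to confirm that the columns used later in this notebook are actually present. The following is a minimal sketch, not part of the original pipeline: `missing_columns` is a hypothetical helper, and `example_df` is an empty stand-in for `density_df` that deliberately lacks one column.

```python
import pandas as pd

def missing_columns(df, required):
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]

# Hypothetical stand-in for density_df; the real file is not loaded here.
example_df = pd.DataFrame(columns=["County_Area", "Total_Population", "population_density"])

# Column names referenced by the normalization step in this notebook.
required = ["County_Area", "Total_Population", "Land_Area_SqMi", "population_density"]

print(missing_columns(example_df, required))
```

Running the same helper on `density_df` should return an empty list if the file has the expected layout.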

## 1. Normalizing Density Data

We split `County_Area` into `County` and `State`, then strip common geographic suffixes so that names like "Baldwin County" become "Baldwin", matching the environmental dataset.

In [None]:
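# Split County_Area ("County, State") into two columns.
# n=1 splits on the first ", " only, so the result always has exactly two columns.
density_df[['County', 'State']] = density_df['County_Area'].str.split(', ', n=1, expand=True)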

# Strip whitespace
density_df['County'] = density_df['County'].str.strip()
density_df['State'] = density_df['State'].str.strip()

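# Regex to remove a trailing suffix: County, Parish, Borough, Census Area, Municipality, City and Borough, or City.
# Caution: stripping "City" can give an independent city the same key as a same-named county
# (e.g. "Baltimore city" and "Baltimore County" would both become "Baltimore").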
suffix_regex = r' (County|Parish|Borough|Census Area|Municipality|City and Borough|City)$'
density_df['County'] = density_df['County'].str.replace(suffix_regex, '', regex=True, flags=re.IGNORECASE)

# Select relevant columns
density_join_df = density_df[['State', 'County', 'Total_Population', 'Land_Area_SqMi', 'population_density']].copy()

print("Normalized Density Data Sample (County names stripped):")
display(density_join_df.head())
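
To illustrate what the normalization does, and where it can go wrong, here is a small sketch on hand-written names. The `sample` values are hypothetical and not taken from the density file. The check at the end lists rows whose `(State, County)` key is no longer unique after stripping, which would duplicate rows in a later merge.

```python
import re
import pandas as pd

# Hypothetical sample values, not read from the real density file.
sample = pd.DataFrame({
    "County_Area": [
        "Baldwin County, Alabama",
        "Orleans Parish, Louisiana",
        "Juneau City and Borough, Alaska",
        "Baltimore County, Maryland",
        "Baltimore city, Maryland",
    ]
})

# Same pattern as the notebook cell above.
suffix_regex = r" (County|Parish|Borough|Census Area|Municipality|City and Borough|City)$"

# Split on the first ", " only, then strip the trailing suffix (case-insensitive).
sample[["County", "State"]] = sample["County_Area"].str.split(", ", n=1, expand=True)
sample["County"] = sample["County"].str.strip().str.replace(
    suffix_regex, "", regex=True, flags=re.IGNORECASE
)

# Rows whose join key is shared with at least one other row after stripping.
collisions = sample[sample.duplicated(subset=["State", "County"], keep=False)]

print(sample[["State", "County"]])
print(collisions[["County_Area", "State", "County"]])
```

In this sample the two Maryland rows collapse to the same key. If the same check on `density_join_df` returns any rows, those keys probably need separate handling before the join.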

## 2. Final Triple-Join Integration

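We inner-join the main dataset with the normalized density data on `State` and `County`. Because the join is an inner join, any county that lacks an exact match in the other dataset is dropped from the result.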

In [None]:
# Perform the merge
final_df = pd.merge(main_df, density_join_df, on=['State', 'County'], how='inner')

print(f"Final Integrated Dataset Records: {final_df.shape[0]}")
final_df.head()
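
Since an inner join drops unmatched rows silently, one way to see which counties were lost is a left join with `indicator=True`. The sketch below uses small hypothetical frames (`main_sample`, `density_sample`) with made-up density values in place of `main_df` and `density_join_df`.

```python
import pandas as pd

# Hypothetical stand-ins for main_df and density_join_df; values are made up.
main_sample = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Louisiana"],
    "County": ["Baldwin", "DeKalb", "Orleans"],
})
density_sample = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Louisiana"],
    "County": ["Baldwin", "De Kalb", "Orleans"],
    "population_density": [100.0, 90.0, 2000.0],
})

# A left join keeps every main row; the _merge column records whether a match was found.
audit = pd.merge(
    main_sample, density_sample, on=["State", "County"], how="left", indicator=True
)

# Main-dataset counties with no density match (these would vanish in an inner join).
unmatched = audit.loc[audit["_merge"] == "left_only", ["State", "County"]]
print(unmatched)
```

Here "DeKalb" fails to match "De Kalb" because of a spelling difference. Applying the same pattern to the real frames would show whether the record count printed above reflects genuine gaps in the density data or naming mismatches.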

## 3. Data Export

The consolidated dataset is written to `OUTPUT_PATH` as a CSV file, without the DataFrame index.

In [None]:
final_df.to_csv(OUTPUT_PATH, index=False)
print(f"Final integrated dataset exported to: {OUTPUT_PATH}")
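
A quick way to confirm an export is to read the file back and compare its shape and columns with the frame that was written. This is a sketch on a hypothetical one-row frame written to a temporary directory, so it does not touch `OUTPUT_PATH`.

```python
import os
import tempfile
import pandas as pd

# Hypothetical frame standing in for final_df; values are made up.
example_df = pd.DataFrame({
    "State": ["Alabama"],
    "County": ["Baldwin"],
    "population_density": [100.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv")
    example_df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# With index=False, the reloaded frame should have the same shape and column names.
print(reloaded.shape == example_df.shape)
print(list(reloaded.columns) == list(example_df.columns))
```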