# Data Integration: AQI vs. Racial Demographics

### Objective
This notebook documents the process of joining air quality data with racial demographic data across US counties. This integrated dataset enables analysis of environmental justice by examining whether certain racial groups are disproportionately exposed to higher levels of air pollution.

### Key Steps:
1. **Normalization**: Splitting the combined "County, State" field in the race dataset into separate columns and standardizing county names.
2. **Merge Operation**: Performing an inner join on `State` and `County` so that only counties present in both datasets are kept.
3. **Export**: Saving the refined dataset for further analysis.

In [None]:
import pandas as pd
import os

# File Paths (adjusted to be run from project root or inside the folder)
if os.path.exists('aqi-datasets'):
    AQI_DATA_PATH = 'aqi-datasets/ml_target_dataset.csv'
    RACE_DATA_PATH = 'race-by-county/cleaned_race_by_county.csv'
else:
    AQI_DATA_PATH = '../aqi-datasets/ml_target_dataset.csv'
    RACE_DATA_PATH = '../race-by-county/cleaned_race_by_county.csv'

OUTPUT_PATH = 'aqi_race_joined.csv'

def load_data():
    aqi_df = pd.read_csv(AQI_DATA_PATH)
    race_df = pd.read_csv(RACE_DATA_PATH)
    return aqi_df, race_df

aqi_df, race_df = load_data()

print(f"AQI Dataset: {aqi_df.shape[0]} rows")
print(f"Race Dataset: {race_df.shape[0]} rows")
aqi_df.head(3)

## 1. Data Normalization

The racial demographic dataset stores location information as a single string (e.g., `"Autauga County, Alabama"`). We need to split this into separate `State` and `County` columns to match the structure of the AQI dataset.

Additionally, we must remove suffixes like " County", " Borough", etc., to ensure that "Baldwin" in one dataset matches "Baldwin County" in the other.
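
Suffix removal is easy to get subtly wrong, because removing substrings one after another can mangle multi-word suffixes such as " City and Borough". The sketch below is a minimal illustration on plain strings rather than the notebook's implementation. The helper names `split_and_clean` and `naive_clean` are hypothetical, and the sample names were chosen only to show the difference.

```python
import re

# Same suffixes as discussed above, sorted longest first so that
# " City and Borough" is tried before " Borough" or " City".
SUFFIXES = sorted(
    [" County", " Borough", " Census Area", " Municipality",
     " City and Borough", " Parish", " City", " City and County"],
    key=len, reverse=True,
)
SUFFIX_RE = re.compile("(?:" + "|".join(map(re.escape, SUFFIXES)) + ")$")


def split_and_clean(county_area):
    """Hypothetical helper: 'Autauga County, Alabama' -> ('Autauga', 'Alabama')."""
    county, _, state = county_area.partition(", ")
    # Remove at most one suffix, and only at the end of the name
    return SUFFIX_RE.sub("", county).strip(), state.strip()


def naive_clean(county):
    """Sequential substring removal in a fixed order, shown for comparison only."""
    for suffix in [" County", " Borough", " City and Borough", " City"]:
        county = county.replace(suffix, "")
    return county.strip()


print(split_and_clean("Autauga County, Alabama"))
print(split_and_clean("Juneau City and Borough, Alaska"))
print(naive_clean("Juneau City and Borough"))
```

The anchored, longest-first pattern reduces "Juneau City and Borough" to "Juneau", while the sequential version leaves "Juneau and". Whether a bare " City" should be stripped at all depends on how the AQI dataset spells such counties, which is worth checking against the actual data.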

In [None]:
import re

def normalize_race_data(df):
    # Split 'County_Area' (e.g. "Autauga County, Alabama") into County and State
    split_data = df['County_Area'].str.split(', ', n=1, expand=True)
    df['County'] = split_data[0].str.strip()
    df['State'] = split_data[1].str.strip()

    # Suffixes to remove to ensure consistency with AQI data format
    suffixes = [
        ' County', ' Borough', ' Census Area', ' Municipality',
        ' City and Borough', ' Parish', ' City', ' City and County'
    ]

    # Strip at most one suffix, and only at the end of the name. Longest
    # suffixes are tried first so "Juneau City and Borough" becomes "Juneau"
    # rather than "Juneau and", and "Charles City County" keeps its "City".
    ordered = sorted(suffixes, key=len, reverse=True)
    pattern = '(?:' + '|'.join(re.escape(s) for s in ordered) + ')$'
    df['County'] = df['County'].str.replace(pattern, '', regex=True).str.strip()

    return df

race_df_cleaned = normalize_race_data(race_df.copy())
print("Normalized Race Data Sample:")
race_df_cleaned[['County', 'State', '% White alone', '% Black or African American alone']].head(5)

## 2. Inner Join

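We perform an inner join on `State` and `County`. An inner join keeps only rows whose (`State`, `County`) pair appears in both datasets, so every row in the result has both AQI and demographic values. Counties present in only one dataset, including any whose names still differ after normalization, are dropped from the result.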
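
Because an inner join drops unmatched rows without warning, it can help to audit the match first. The sketch below is one possible check, not part of the original pipeline. It uses an outer join with pandas' `indicator=True` option on two toy frames. The frame names, the `median_aqi` and `pct_white` columns, and all values are placeholders made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the two inputs (placeholder values, not real data)
aqi = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska"],
    "County": ["Baldwin", "Jefferson", "Juneau"],
    "median_aqi": [10, 20, 30],
})
race = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Louisiana"],
    "County": ["Baldwin", "Autauga", "Orleans"],
    "pct_white": [1.0, 2.0, 3.0],
})

# An outer join with indicator=True records each row's origin in `_merge`
audit = pd.merge(aqi, race, on=["State", "County"], how="outer", indicator=True)

matched = audit[audit["_merge"] == "both"]
aqi_only = audit[audit["_merge"] == "left_only"]
race_only = audit[audit["_merge"] == "right_only"]

print(f"matched: {len(matched)}, AQI only: {len(aqi_only)}, race only: {len(race_only)}")
print(aqi_only[["State", "County"]].to_string(index=False))
```

Rows flagged `left_only` or `right_only` are the ones an inner join would discard. Inspecting them is a quick way to spot county names that normalization did not reconcile.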

In [None]:
# Perform inner join
joined_df = pd.merge(
    aqi_df, 
    race_df_cleaned.drop(columns=['County_Area', 'GEO_ID']), # Drop unnecessary columns
    on=['State', 'County'], 
    how='inner'
)

print(f"Joined dataset has {joined_df.shape[0]} rows (AQI input: {aqi_df.shape[0]}, race input: {race_df_cleaned.shape[0]}).")
joined_df.head(10)

## 3. Export

The final step is to save the integrated dataset as a CSV file for use in our models and visualizations.
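
As a quick sanity check on the export step, the following sketch writes a small frame to a temporary CSV and reads it back. It assumes nothing about the real dataset. The file name `joined_sample.csv`, the `value` column, and the values are placeholders.

```python
import os
import tempfile

import pandas as pd

# Placeholder frame standing in for the joined dataset
sample = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "County": ["Baldwin", "Juneau"],
    "value": [1.5, 2.5],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "joined_sample.csv")
    # index=False keeps the row index out of the file, so the columns
    # read back are exactly the ones written
    sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(reloaded.shape)
```

Without `index=False`, the reloaded frame would gain an extra unnamed index column, which is easy to overlook in downstream code.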

In [None]:
joined_df.to_csv(OUTPUT_PATH, index=False)
print(f"Dataset exported successfully to {OUTPUT_PATH}")