# Feature Engineering for AQI Prediction

This notebook documents the feature engineering steps taken to improve the predictive power of our XGBoost model. The core idea is to handle data skewness (via log transformations) and capture regional sensitivities (via interaction terms).

## 1. Goal
Improve upon the baseline R² of ~0.24 by creating features that better represent the underlying relationships between income, density, geography, and air quality.

## 2. Feature Transformations

### Log Transformations (Replacement)
- **Population Density**: Ranges from 0.1 to 71,000. Tree-based models can struggle with such a wide range if the splitting logic doesn't align with the exponential nature of urbanization. Log-transforming compresses the range, so differences are measured in orders of magnitude rather than raw units.
- **Median Household Income**: Wealth often has a non-linear relationship with environmental outcomes. Log-transforming reduces the right skew of the distribution.
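
As a minimal sketch of the effect, assuming a handful of made-up density values that span the range quoted above (illustrative numbers, not rows from the dataset), `np.log1p` compresses the spread while preserving the ordering of the values:

```python
import numpy as np

# Hypothetical densities spanning the quoted range (not real data)
density = np.array([0.1, 10.0, 1_000.0, 71_000.0])

# log1p computes log(1 + x), so it is also defined at x = 0
log_density = np.log1p(density)

print(np.round(log_density, 2))
# The raw values span nearly six orders of magnitude;
# the transformed values fall between about 0.1 and 11.2.
```

Because the transform is monotonic, the rank order of observations is unchanged; only the spacing between them changes.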

### Interaction Terms (Additions)
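We multiply the **Log of Income** and the **Log of Density** by each one-hot encoded **Census Division** indicator. Each interaction column therefore holds the log value for rows in that division and 0 everywhere else.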

#### Why for all Divisions (not just West)?
While the West is distinctive because of its wildfire patterns and mountainous geography, every region has its own characteristics:
- **Pacific/Mountain**: High wildfire impact, varying density.
- **Northeast/New England**: Older infrastructure, high density, different economic-pollution coupling.
- **South**: Different industrial and climatic factors.

By including interactions for all 9 divisions, we allow the model to learn that "high income in the Pacific" might mean something different for AQI than "high income in the South Atlantic".
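
The idea can be sketched on a toy frame. The rows below are hypothetical (only two divisions, made-up log incomes); the column names mirror the ones used later in this notebook, and `dtype=int` is an assumption made here so the indicators print as 0/1:

```python
import pandas as pd

# Hypothetical toy rows; values are made up for illustration
toy = pd.DataFrame({
    "Division": ["Pacific", "South Atlantic", "Pacific"],
    "log_median_income": [11.0, 10.5, 11.5],
})

# One indicator column per division
dummies = pd.get_dummies(toy["Division"], prefix="Division", dtype=int)
toy = pd.concat([toy, dummies], axis=1)

# Each interaction keeps the log income inside its division and is 0 elsewhere
for div in dummies.columns:
    toy[f"interaction_income_{div}"] = toy["log_median_income"] * toy[div]

print(toy.filter(like="interaction_income_"))
```

Each division gets its own income column, so the model can split on income separately within each division.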

In [None]:
import pandas as pd
import numpy as np

# Load the joined data
input_path = '../JOINED-aqi-income-race-populationDensity-region/joined-data-with-region.csv'
df = pd.read_csv(input_path)

print(f"Initial shape: {df.shape}")

## 3. Implementation Logic

In [None]:
# 1. Log Transformations
df['log_population_density'] = np.log1p(df['population_density'])
df['log_median_income'] = np.log1p(df['Median_Household_Income'])

# 2. Aggregated race feature: sum of the Black and Hispanic/Latino percentage columns
df['total_minority_pct'] = df['% Black or African American alone'] + df['% Hispanic or Latino']

# 3. One-hot encode Division (dtype=int gives 0/1 indicators instead of booleans)
division_dummies = pd.get_dummies(df['Division'], prefix='Division', dtype=int)
df = pd.concat([df, division_dummies], axis=1)

# 4. Interaction Terms: log feature times each division indicator
division_cols = list(division_dummies.columns)
for div in division_cols:
    df[f'interaction_income_{div}'] = df['log_median_income'] * df[div]
    df[f'interaction_density_{div}'] = df['log_population_density'] * df[div]

print(f"New shape: {df.shape}")

## 4. Export
We save the result as `feature-engineered-data.csv`.

In [None]:
output_file = 'feature-engineered-data.csv'
df.to_csv(output_file, index=False)
print(f"Exported to {output_file}")