# XGBoost: Predicting AQI from Income, Race, Population Density, and Region

This notebook trains an XGBoost regressor to predict median AQI from **income**, **racial composition** (% Black, % Hispanic or Latino), **population density**, and **Region** (Northeast, Midwest, South, West). The Division column is not used.

- **Target:** median_aqi
- **Features:** Median_Household_Income, % Black, % Hispanic or Latino, population_density, Region (one-hot)
- **Random 80/20 split**
- **Sample weights** to downweight low-quality observations
- **Hyperparameter tuning** via grid search with 5-fold cross-validation

## 1. Load Data and Prepare

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import xgboost as xgb

df = pd.read_csv('../JOINED-aqi-income-race-populationDensity-region/joined-data-with-region.csv')

base_cols = ['Median_Household_Income', '% Black or African American alone', '% Hispanic or Latino', 'population_density']
required_cols = ['median_aqi', 'sample_weight', 'Region'] + base_cols
df = df.dropna(subset=required_cols)

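# One-hot encode Region (ignore Division).
# Reset the index *before* building the dummies so both frames share the same
# row labels; pd.concat(axis=1) aligns on index labels, not on position.
# drop_first=True leaves one region as the implicit baseline category.
df = df.reset_index(drop=True)
region_dummies = pd.get_dummies(df['Region'], prefix='Region', drop_first=True)
df = pd.concat([df, region_dummies], axis=1)

# Take the names from the dummies directly so pre-existing 'Region_*' columns are not picked up
region_cols = region_dummies.columns.tolist()
feature_cols = base_cols + region_cols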

print(f"Data shape: {df.shape}")
print(f"median_aqi range: {df['median_aqi'].min():.0f} - {df['median_aqi'].max():.0f}")
print(f"Regions: {df['Region'].unique().tolist()}")
print(f"Features: {feature_cols}")
df[['median_aqi', 'sample_weight'] + feature_cols].head()
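
The one-hot step joins the dummy columns back onto `df` after rows have been dropped, so it relies on `pd.concat(axis=1)` aligning rows by index label rather than by position. The sketch below uses a small made-up frame (the `toy` name and its values are illustrative only, not taken from the dataset) to show what happens when one side has a reset index and the other does not.

```python
import pandas as pd

# Hypothetical toy frame; row 1 is missing its target and will be dropped.
toy = pd.DataFrame({
    "median_aqi": [40.0, None, 55.0, 38.0],
    "Region": ["West", "South", "South", "Midwest"],
})
toy = toy.dropna(subset=["median_aqi"])  # index labels are now 0, 2, 3

dummies = pd.get_dummies(toy["Region"], prefix="Region", drop_first=True)

# Mismatched labels: the left side is relabelled 0, 1, 2 while the dummies keep 0, 2, 3.
misaligned = pd.concat([toy.reset_index(drop=True), dummies], axis=1)

# Matching labels: both sides share the same index, so rows line up one-to-one.
aligned = pd.concat([toy, dummies], axis=1).reset_index(drop=True)

print("misaligned:", misaligned.shape, "has NaN:", misaligned.isna().any().any())
print("aligned:   ", aligned.shape, "has NaN:", aligned.isna().any().any())
```

In the mismatched case the result gains an extra row and NaN entries, because labels that exist on only one side are filled with missing values. Keeping the index consistent on both sides of the concat avoids this.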

## 2. Random Train/Test Split (80/20)

In [None]:
X_full = df[feature_cols]
y = df['median_aqi']
weights = df['sample_weight']

X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X_full, y, weights, test_size=0.2, random_state=42
)

print(f"Train: {len(X_train)}, Test: {len(X_test)}")

## 3. Full Model: Income + Race + Population Density + Region

In [None]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_splits = list(kf.split(X_train))

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1],
    'min_child_weight': [1, 3],
}

xgb_model = xgb.XGBRegressor(random_state=42, objective='reg:squarederror')

grid_search = GridSearchCV(
    xgb_model, param_grid, cv=cv_splits, scoring='neg_mean_squared_error',
    n_jobs=-1, verbose=1
)

grid_search.fit(X_train, y_train, sample_weight=w_train)

print("Best params:", grid_search.best_params_)
print("Best CV MSE:", -grid_search.best_score_)
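
The cost of the search can be estimated up front: every combination in `param_grid` is fit once per fold. Below is a quick count using only the standard library. The grid is copied from the cell above, and `n_folds` is the 5 passed to `KFold`.

```python
from itertools import product

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
    "min_child_weight": [1, 3],
}
n_folds = 5

# Every combination of values is one candidate configuration.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)
n_cv_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {n_cv_fits} CV fits")
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the winning configuration on the full training set.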

In [None]:
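# With the default refit=True, GridSearchCV has already refit the best
# configuration on all of X_train (using sample_weight=w_train), so no
# additional fit is needed here.
best_model = grid_search.best_estimator_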

y_pred = best_model.predict(X_test)

print("Full model - Test set metrics:")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")
print(f"  MAE:  {mean_absolute_error(y_test, y_pred):.2f}")
print(f"  R2:   {r2_score(y_test, y_pred):.4f}")
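
The metrics above treat every test row equally, although `w_test` is available from the split. If the evaluation should reflect the same downweighting used in training, the scikit-learn metric functions accept a `sample_weight` argument. As a rough sketch of what that changes, here is weighted versus unweighted RMSE in plain NumPy on made-up numbers (the arrays are illustrative, not results from this notebook):

```python
import numpy as np

# Hypothetical values for illustration only.
y_true = np.array([40.0, 55.0, 38.0, 60.0])
y_hat = np.array([42.0, 50.0, 38.0, 70.0])
w = np.array([1.0, 1.0, 1.0, 0.1])  # the last observation is treated as low quality

sq_err = (y_true - y_hat) ** 2  # squared error per row

rmse_unweighted = np.sqrt(sq_err.mean())
rmse_weighted = np.sqrt(np.average(sq_err, weights=w))

print(f"unweighted RMSE: {rmse_unweighted:.2f}")
print(f"weighted RMSE:   {rmse_weighted:.2f}")
```

In the notebook itself this would correspond to passing `sample_weight=w_test` to `mean_squared_error`, `mean_absolute_error`, and `r2_score`. Whether weighted or unweighted test metrics are more appropriate depends on how the results will be used.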

In [None]:
importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': best_model.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature importance:")
print(importance_df)