# XGBoost: Predicting AQI from Income (Geographic Holdout)

This notebook trains an XGBoost model to predict median AQI from median household income, using a **geographic holdout** for the train/test split. The model is trained on every state outside the West and tested on the West (CA, OR, WA, NV, AZ, HI, AK) to check whether it generalizes to an unseen region.

- **Target:** median_aqi
- **Feature:** Median_Household_Income
- **Train:** All states except West (CA, OR, WA, NV, AZ, HI, AK)
- **Test:** West states only
- **Sample weights:** the `sample_weight` column downweights low-quality observations during fitting
- **Hyperparameter tuning:** 5-fold CV on the training set

## 1. Load Data and Prepare

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import xgboost as xgb

df = pd.read_csv('../JOINED-aqi-income/aqi-income-joined.csv')
df = df.dropna(subset=['median_aqi', 'Median_Household_Income', 'sample_weight'])

print(f"Data shape: {df.shape}")
print(f"median_aqi range: {df['median_aqi'].min():.0f} - {df['median_aqi'].max():.0f}")
df.head()

Data shape: (940, 6)
median_aqi range: 3 - 90


|   | State | County | Year | median_aqi | sample_weight | Median_Household_Income |
|---|---|---|---|---|---|---|
| 0 | Alabama | Baldwin | 2025 | 42 | 1.0 | 78775.0 |
| 1 | Alabama | Clay | 2025 | 32 | 1.0 | 55250.0 |
| 2 | Alabama | DeKalb | 2025 | 42 | 1.0 | 51204.0 |
| 3 | Alabama | Elmore | 2025 | 32 | 0.983333 | 78243.0 |
| 4 | Alabama | Etowah | 2025 | 45 | 1.0 | 54563.0 |


## 2. Geographic Holdout: Train (Non-West) / Test (West)

**West states (Option B):** CA, OR, WA, NV, AZ, HI, AK, covering diverse geography (coast, desert, mountain). All remaining states form the training set.

In [2]:
WEST_STATES = ['California', 'Oregon', 'Washington', 'Nevada', 'Arizona', 'Hawaii', 'Alaska']

df['is_west'] = df['State'].isin(WEST_STATES)
df_train = df[~df['is_west']].copy()
df_test = df[df['is_west']].copy()

print(f"Train (non-West): {len(df_train)} counties")
print(f"Test (West): {len(df_test)} counties")
print(f"West states: {WEST_STATES}")
print(f"\nTest set AQI: min={df_test['median_aqi'].min():.0f}, max={df_test['median_aqi'].max():.0f}, mean={df_test['median_aqi'].mean():.1f}")
print(f"Train set AQI: min={df_train['median_aqi'].min():.0f}, max={df_train['median_aqi'].max():.0f}, mean={df_train['median_aqi'].mean():.1f}")

Train (non-West): 811 counties
Test (West): 129 counties
West states: ['California', 'Oregon', 'Washington', 'Nevada', 'Arizona', 'Hawaii', 'Alaska']

Test set AQI: min=6, max=90, mean=35.8
Train set AQI: min=3, max=74, mean=39.1
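The split above can be wrapped in a small helper that also checks its own assumptions. The sketch below is illustrative only: `geographic_split` and the toy frame are hypothetical names and made-up values, not part of the notebook. It verifies that no state appears on both sides and reports any requested holdout state that is absent from the data, which catches spelling mismatches such as abbreviations versus full state names.

```python
import pandas as pd

def geographic_split(df, holdout_states, state_col="State"):
    """Split df into (train, test); test holds every row from holdout_states."""
    is_holdout = df[state_col].isin(holdout_states)
    train, test = df[~is_holdout].copy(), df[is_holdout].copy()
    # No state may appear on both sides of the split
    overlap = set(train[state_col]) & set(test[state_col])
    assert not overlap, f"States in both sets: {overlap}"
    # Holdout states that never matched a row (e.g. misspelled or not in the data)
    missing = set(holdout_states) - set(test[state_col])
    return train, test, missing

# Toy frame with made-up values, for illustration only
toy = pd.DataFrame({
    "State": ["Alabama", "California", "Ohio", "Oregon", "Texas"],
    "median_aqi": [42, 55, 38, 30, 41],
})
train, test, missing = geographic_split(toy, ["California", "Oregon", "Hawaii"])
print(len(train), len(test), missing)
```

In the notebook this would be called as `geographic_split(df, WEST_STATES)`; a non-empty `missing` set would mean one of the seven West states contributed no rows to the test set.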


## 3. Build Feature Matrices and Hyperparameter Tuning

In [3]:
X_train = df_train[['Median_Household_Income']]
y_train = df_train['median_aqi']
w_train = df_train['sample_weight']

X_test = df_test[['Median_Household_Income']]
y_test = df_test['median_aqi']
w_test = df_test['sample_weight']

# Standard shuffled 5-fold CV on the training set. Folds are not grouped by state,
# so the CV error does not mimic the geographic holdout used for testing.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_splits = list(kf.split(X_train))

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1],
    'min_child_weight': [1, 3],
}

xgb_model = xgb.XGBRegressor(random_state=42, objective='reg:squarederror')

grid_search = GridSearchCV(
    xgb_model, param_grid, cv=cv_splits, scoring='neg_mean_squared_error',
    n_jobs=-1, verbose=1
)

grid_search.fit(X_train, y_train, sample_weight=w_train)

print("Best params:", grid_search.best_params_)
print("Best CV MSE:", -grid_search.best_score_)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best params: {'learning_rate': 0.05, 'max_depth': 3, 'min_child_weight': 3, 'n_estimators': 100}
Best CV MSE: 66.83065151630075
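A CV MSE of 66.83 corresponds to an RMSE of about 8.2, well below the 16.40 reported on the West holdout in Section 4. One possible contributor is that shuffled folds place counties from the same state in both the training and validation parts, so CV never has to extrapolate to an unseen state. A minimal sketch of state-grouped folds follows, assuming scikit-learn's `GroupKFold`; the arrays are synthetic stand-ins for `X_train`, `y_train`, and `df_train['State']`.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic stand-ins (made-up values), for illustration only
rng = np.random.default_rng(0)
states = np.repeat(["Alabama", "Ohio", "Texas", "Maine", "Iowa", "Utah"], 10)
X = rng.normal(60_000, 15_000, size=(len(states), 1))
y = rng.normal(40, 8, size=len(states))

gkf = GroupKFold(n_splits=3)
group_splits = list(gkf.split(X, y, groups=states))

for train_idx, val_idx in group_splits:
    # Each state lands entirely in train or entirely in validation
    assert not set(states[train_idx]) & set(states[val_idx])
    print(sorted(set(states[val_idx])))
```

In the notebook, the equivalent would be `list(GroupKFold(n_splits=5).split(X_train, y_train, groups=df_train['State']))`, passed to `GridSearchCV` as `cv=` in place of `cv_splits`. Whether this narrows the gap to the holdout error is an open question that would need to be checked on the real data.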


## 4. Train Final Model and Evaluate on West Test Set

In [4]:
best_model = grid_search.best_estimator_

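# GridSearchCV (refit=True by default) has already refit best_estimator_ on the
# full training set with the sample weights; this explicit refit is redundant
# and is kept only to make the final fit visible.
best_model.fit(X_train, y_train, sample_weight=w_train)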

y_pred = best_model.predict(X_test)

# Metrics below are unweighted: w_test is not applied
print("Test set metrics (West holdout):")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")
print(f"  MAE:  {mean_absolute_error(y_test, y_pred):.2f}")
print(f"  R²:   {r2_score(y_test, y_pred):.4f}")

Test set metrics (West holdout):
  RMSE: 16.40
  MAE:  13.10
  R²:   -0.0762
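A negative R² has a concrete reading. Since $R^2 = 1 - SS_{res}/SS_{tot}$, with $SS_{tot}$ measured around the mean of the test targets, a value of -0.0762 means the model's squared error on the West set is slightly larger than that of predicting the West set's own mean AQI for every county. Any constant prediction gives $R^2 \le 0$, so a useful reference point is the score of a constant such as the training mean. The sketch below writes the three metrics out in NumPy with optional weights; `regression_metrics` is a hypothetical helper and the numbers are made up, not taken from the notebook.

```python
import numpy as np

def regression_metrics(y_true, y_pred, weights=None):
    """Return (RMSE, MAE, R^2); weights=None gives the unweighted versions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    w = np.ones_like(y_true) if weights is None else np.asarray(weights, dtype=float)
    err = y_true - y_pred
    rmse = np.sqrt(np.average(err ** 2, weights=w))
    mae = np.average(np.abs(err), weights=w)
    # R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the (weighted) mean of y_true
    y_bar = np.average(y_true, weights=w)
    r2 = 1.0 - np.sum(w * err ** 2) / np.sum(w * (y_true - y_bar) ** 2)
    return rmse, mae, r2

# Made-up numbers for illustration; not the notebook's data
y_true = np.array([20.0, 30.0, 40.0, 50.0])
const_pred = np.full_like(y_true, 39.0)  # a constant, e.g. near a training mean

rmse, mae, r2 = regression_metrics(y_true, const_pred)
print(f"RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.4f}")
```

In the notebook, `regression_metrics(y_test, y_pred, weights=w_test)` would give weighted counterparts of the figures above, and `regression_metrics(y_test, np.full(len(y_test), y_train.mean()))` would give the constant baseline to compare against.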


## 5. Feature Importance

In [5]:
importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': best_model.feature_importances_
}).sort_values('importance', ascending=False)

print(importance)

                   feature  importance
0  Median_Household_Income         1.0
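With a single input column, the normalized importance is 1.0 whenever the model makes at least one split, so the table above cannot say how AQI varies with income. A more informative view may be the fitted response curve: predictions over an evenly spaced income grid. The sketch below assumes only that the model exposes `.predict`; `response_curve` is a hypothetical helper, and a scikit-learn decision tree fit on synthetic data stands in for `best_model` so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def response_curve(model, lo, hi, n=50, feature="Median_Household_Income"):
    """Predict over an evenly spaced grid of one feature; returns a DataFrame."""
    grid = pd.DataFrame({feature: np.linspace(lo, hi, n)})
    return grid.assign(predicted_aqi=model.predict(grid))

# Stand-in model on synthetic data (made-up values), for illustration only
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"Median_Household_Income": rng.uniform(30_000, 120_000, 200)})
y_toy = 45 - 1e-4 * X_toy["Median_Household_Income"] + rng.normal(0, 3, 200)
toy_model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_toy, y_toy)

curve = response_curve(toy_model, 30_000, 120_000, n=10)
print(curve.round(1))
```

For the notebook's model, the call would be `response_curve(best_model, X_train['Median_Household_Income'].min(), X_train['Median_Household_Income'].max())`. A nearly flat curve would be consistent with the near-zero R² on the holdout, though that would need to be confirmed on the real data.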
