# XGBoost: Predicting AQI from Income

This notebook trains an XGBoost model to predict median AQI (used here as a livability proxy) from median household income. It is part of a larger analysis of how socioeconomic factors correlate with access to a livable climate.

- **Target:** median_aqi
- **Feature:** Median_Household_Income
- **Stratified 80/20 split** by AQI quartiles
- **Sample weights** to downweight low-quality observations
- **Hyperparameter tuning** via stratified 5-fold CV

## 1. Load and Prepare Data

In [42]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import xgboost as xgb

df = pd.read_csv('../JOINED-aqi-income/aqi-income-joined.csv')
df = df.dropna(subset=['median_aqi', 'Median_Household_Income', 'sample_weight'])

print(f"Data shape: {df.shape}")
print(f"median_aqi range: {df['median_aqi'].min():.0f} - {df['median_aqi'].max():.0f}")
df.head()

Data shape: (940, 6)
median_aqi range: 3 - 90


|   | State | County | Year | median_aqi | sample_weight | Median_Household_Income |
|---|---|---|---|---|---|---|
| 0 | Alabama | Baldwin | 2025 | 42 | 1.0 | 78775.0 |
| 1 | Alabama | Clay | 2025 | 32 | 1.0 | 55250.0 |
| 2 | Alabama | DeKalb | 2025 | 42 | 1.0 | 51204.0 |
| 3 | Alabama | Elmore | 2025 | 32 | 0.983333 | 78243.0 |
| 4 | Alabama | Etowah | 2025 | 45 | 1.0 | 54563.0 |


## 2. Create Stratification Bins and Train/Test Split

We stratify by AQI quartiles so train and test sets have similar distributions of livability levels.

In [43]:
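# Quartile-based stratification (most AQI values fall in 0-50, so quartiles give roughly balanced groups).
# Caveat: if duplicates='drop' ever merged bin edges, the four labels would no longer match the bins and qcut would raise.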
df['aqi_stratum'] = pd.qcut(df['median_aqi'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'], duplicates='drop')
print("Stratum distribution:")
print(df['aqi_stratum'].value_counts().sort_index())

X = df[['Median_Household_Income']]
y = df['median_aqi']
weights = df['sample_weight']
strata = df['aqi_stratum']

X_train, X_test, y_train, y_test, w_train, w_test, strata_train, strata_test = train_test_split(
    X, y, weights, strata, test_size=0.2, random_state=42, stratify=strata
)

print(f"\nTrain: {len(X_train)}, Test: {len(X_test)}")

Stratum distribution:
aqi_stratum
Q1    262
Q2    209
Q3    253
Q4    216
Name: count, dtype: int64

Train: 752, Test: 188
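
One way to confirm that the stratified split preserved the quartile mix is to compare the share of each stratum in the two subsets. The sketch below uses a small synthetic frame (`demo`, with made-up AQI values) rather than the notebook's data, so every name and number in it is an illustrative assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data (values are illustrative only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"median_aqi": rng.integers(3, 91, size=400)})
demo["aqi_stratum"] = pd.qcut(
    demo["median_aqi"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)

# Same split settings as the notebook: 80/20, stratified on the quartile bin
train_demo, test_demo = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo["aqi_stratum"]
)

# Share of each quartile bin in the two subsets, side by side
balance = pd.DataFrame({
    "train": train_demo["aqi_stratum"].value_counts(normalize=True).sort_index(),
    "test": test_demo["aqi_stratum"].value_counts(normalize=True).sort_index(),
})
balance["abs_diff"] = (balance["train"] - balance["test"]).abs()
print(balance.round(3))
```

In the notebook itself, the same comparison could be run on `strata_train` and `strata_test`; small values in `abs_diff` indicate the two sets share a similar distribution of AQI levels.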


## 3. Hyperparameter Tuning with Stratified CV

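We use stratified 5-fold CV (folds stratified on the AQI quartile bins) and an exhaustive grid search over four common XGBoost hyperparameters: `n_estimators`, `max_depth`, `learning_rate`, and `min_child_weight`.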

In [44]:
# Precompute stratified CV splits (stratify on AQI quartiles)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_splits = list(skf.split(X_train, strata_train))

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1],
    'min_child_weight': [1, 3],
}

xgb_model = xgb.XGBRegressor(random_state=42, objective='reg:squarederror')

grid_search = GridSearchCV(
    xgb_model, param_grid, cv=cv_splits, scoring='neg_mean_squared_error',
    n_jobs=-1, verbose=1
)

# sample_weight is forwarded to XGBRegressor.fit() (sliced per fold) to downweight low-quality observations;
# the CV score itself (neg_mean_squared_error) is computed without weights
grid_search.fit(X_train, y_train, sample_weight=w_train)

print("Best params:", grid_search.best_params_)
print("Best CV MSE:", -grid_search.best_score_)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best params: {'learning_rate': 0.05, 'max_depth': 3, 'min_child_weight': 3, 'n_estimators': 100}
Best CV MSE: 95.86677004106876
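
Under scikit-learn's default behavior (no metadata routing), the sample weights above influence only model fitting; the CV score is a plain, unweighted MSE. The sketch below shows how a weighted MSE can differ from the unweighted one. The arrays are hypothetical values chosen for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets, predictions, and quality weights (illustration only)
y_true = np.array([40.0, 30.0, 50.0, 45.0])
y_hat = np.array([42.0, 35.0, 40.0, 45.0])
w = np.array([1.0, 1.0, 0.5, 1.0])  # third observation is "low quality"

# Squared errors are 4, 25, 100, 0
unweighted_mse = mean_squared_error(y_true, y_hat)
weighted_mse = mean_squared_error(y_true, y_hat, sample_weight=w)

print(f"Unweighted MSE: {unweighted_mse:.2f}")
print(f"Weighted MSE:   {weighted_mse:.2f}")
```

Here the largest error belongs to the downweighted observation, so the weighted MSE is lower. Whether evaluation should use the weights is a design choice; this notebook reports unweighted metrics.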


## 4. Train Final Model and Evaluate on Test Set

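`GridSearchCV` (with its default `refit=True`) has already refit the best configuration on the full training set with the sample weights, so the explicit `fit` call below is redundant but harmless. We then evaluate on the held-out test data; the reported metrics are unweighted.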

In [45]:
best_model = grid_search.best_estimator_

# Explicit refit on the full training set with sample weights (GridSearchCV's default refit=True has already done this)
best_model.fit(X_train, y_train, sample_weight=w_train)

y_pred = best_model.predict(X_test)

print("Test set metrics:")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")
print(f"  MAE:  {mean_absolute_error(y_test, y_pred):.2f}")
print(f"  R²:   {r2_score(y_test, y_pred):.4f}")

Test set metrics:
  RMSE: 10.42
  MAE:  6.93
  R²:   -0.0375
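
A negative R² means the model's squared error on the test set is larger than that of a constant predictor equal to the test-set mean. A quick sanity check is to compare the model against a constant baseline. The helper below, `mean_baseline_report`, is a hypothetical function written for this sketch, and the demo arrays are made up.

```python
import numpy as np

def mean_baseline_report(y_train, y_test, y_pred):
    """Compare model predictions with a constant predictor (the training mean)."""
    y_train, y_test, y_pred = (
        np.asarray(a, dtype=float) for a in (y_train, y_test, y_pred)
    )
    baseline = np.full(y_test.shape, y_train.mean())

    # R^2 = 1 - SS_res / SS_tot, where SS_tot is taken around the test mean
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    ss_res_model = np.sum((y_test - y_pred) ** 2)
    ss_res_base = np.sum((y_test - baseline) ** 2)

    return {
        "rmse_model": float(np.sqrt(ss_res_model / len(y_test))),
        "rmse_baseline": float(np.sqrt(ss_res_base / len(y_test))),
        "r2_model": float(1.0 - ss_res_model / ss_tot),
        "r2_baseline": float(1.0 - ss_res_base / ss_tot),
    }

# Made-up numbers: the "model" here is deliberately worse than the baseline
report = mean_baseline_report(
    y_train=[30, 40, 50], y_test=[35, 45, 40], y_pred=[50, 30, 40]
)
print(report)
```

In the notebook, the same helper could be called with `y_train`, `y_test`, and `y_pred`. If `rmse_model` is not clearly below `rmse_baseline`, the feature adds little predictive value.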


## 5. Feature Importance

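With a single feature (income), the normalized importance is 1.0 by construction, so it says nothing about how well income predicts AQI. The test-set R² of -0.0375 from Section 4 is the more informative number: income alone does no better than predicting the mean AQI.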

In [46]:
importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': best_model.feature_importances_
}).sort_values('importance', ascending=False)

print(importance)

                   feature  importance
0  Median_Household_Income         1.0
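
The value of 1.0 follows from normalization: importance scores are scaled to sum to one, so a lone feature always receives the full share regardless of how useful it is. The sketch below illustrates this with a hypothetical `normalize_importance` helper and made-up raw scores; it is an assumption-based illustration of sum-to-one scaling, not XGBoost's internal code.

```python
import numpy as np

def normalize_importance(raw_scores):
    """Scale raw importance scores (e.g. split gains) so they sum to 1."""
    raw = np.asarray(raw_scores, dtype=float)
    total = raw.sum()
    # Leave all-zero scores untouched to avoid dividing by zero
    return raw / total if total > 0 else raw

single = normalize_importance([123.4])       # one feature, arbitrary raw score
double = normalize_importance([30.0, 10.0])  # two hypothetical features

print(single)  # [1.]
print(double)  # [0.75 0.25]
```

Importance becomes informative only once the model has more than one feature to compare.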
