# XGBoost: Predicting AQI from Income and Race (Environmental Justice)

This notebook trains an XGBoost model to predict median AQI from **income** and **racial composition** (percent Black, percent Hispanic or Latino). The environmental justice question: do counties with higher concentrations of Black or Latino residents face worse air quality?

- **Target:** median_aqi
- **Features:** Median_Household_Income, % Black or African American alone, % Hispanic or Latino
- **Random 80/20 split**
- **Sample weights** to downweight low-quality observations
- **Hyperparameter tuning** via 5-fold CV

## 1. Load Data and Prepare

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import xgboost as xgb

df = pd.read_csv('../JOINED-aqi-income-race/aqi-income-race-joined.csv')

feature_cols = ['Median_Household_Income', '% Black or African American alone', '% Hispanic or Latino']
required_cols = ['median_aqi', 'sample_weight'] + feature_cols
df = df.dropna(subset=required_cols)

print(f"Data shape: {df.shape}")
print(f"median_aqi range: {df['median_aqi'].min():.0f} - {df['median_aqi'].max():.0f}")
print(f"Features: {feature_cols}")
df[['median_aqi', 'sample_weight'] + feature_cols].head()


## 2. Random Train/Test Split

In [None]:
X = df[feature_cols]
y = df['median_aqi']
weights = df['sample_weight']

X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X, y, weights, test_size=0.2, random_state=42
)

print(f"Train: {len(X_train)}, Test: {len(X_test)}")

Train: 752, Test: 188


## 3. Hyperparameter Tuning with 5-Fold CV

In [None]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_splits = list(kf.split(X_train))

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1],
    'min_child_weight': [1, 3],
}

xgb_model = xgb.XGBRegressor(random_state=42, objective='reg:squarederror')

grid_search = GridSearchCV(
    xgb_model, param_grid, cv=cv_splits, scoring='neg_mean_squared_error',
    n_jobs=-1, verbose=1
)

grid_search.fit(X_train, y_train, sample_weight=w_train)

print("Best params:", grid_search.best_params_)
print("Best CV MSE:", -grid_search.best_score_)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best params: {'learning_rate': 0.05, 'max_depth': 3, 'min_child_weight': 3, 'n_estimators': 100}
Best CV MSE: 78.1312916190725
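
The search log above can be sanity-checked with a little arithmetic. The sketch below recounts the grid (copied from `param_grid`) and converts the reported best CV MSE into an RMSE, so the error is expressed in AQI units. The names `n_candidates`, `n_fits`, and `cv_rmse` are illustrative and not part of the notebook.

```python
import math
from itertools import product

# Grid copied from the tuning cell above.
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1],
    'min_child_weight': [1, 3],
}
n_folds = 5

# Every combination of grid values is one candidate; each is fit once per fold.
n_candidates = len(list(product(*param_grid.values())))
n_fits = n_candidates * n_folds

# Best CV MSE as printed above; its square root is in the same units as AQI.
best_cv_mse = 78.1312916190725
cv_rmse = math.sqrt(best_cv_mse)

print(f"Candidates: {n_candidates}, total fits: {n_fits}")
print(f"CV RMSE: {cv_rmse:.2f}")
```

As far as default scikit-learn behavior goes, `sample_weight` is passed to `fit` only, so this CV score is an unweighted MSE even though training is weighted.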


## 4. Train Final Model and Evaluate on Test Set

In [None]:
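# With the default refit=True, GridSearchCV has already refit best_estimator_
# on the full training set (sample weights included), so no second fit is needed.
best_model = grid_search.best_estimator_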

y_pred = best_model.predict(X_test)

print("Test set metrics:")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")
print(f"  MAE:  {mean_absolute_error(y_test, y_pred):.2f}")
print(f"  R2:   {r2_score(y_test, y_pred):.4f}")

Test set metrics:
  RMSE: 9.78
  MAE:  6.78
  R2:   0.1054
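
To put an R2 of 0.1054 in context: R2 = 1 - MSE_model / MSE_mean, where MSE_mean is the error of always predicting the test-set mean. The reported RMSE and R2 therefore imply the RMSE of that constant baseline. The sketch below does this arithmetic from the rounded values printed above, so the result is approximate; `baseline_rmse` and `improvement` are illustrative names.

```python
import math

# Rounded values copied from the test-set metrics printed above.
test_rmse = 9.78
test_r2 = 0.1054

# R2 = 1 - MSE_model / MSE_mean  =>  RMSE_mean = RMSE_model / sqrt(1 - R2)
baseline_rmse = test_rmse / math.sqrt(1.0 - test_r2)
improvement = baseline_rmse - test_rmse

print(f"Implied mean-baseline RMSE: {baseline_rmse:.2f}")
print(f"RMSE reduction vs. baseline: {improvement:.2f}")
```

Under these assumptions the model reduces RMSE by roughly half an AQI point relative to predicting the mean, which is worth keeping in mind when reading the feature importances.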


## 5. Feature Importance (Environmental Justice Lens)

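Which factors matter most for predicting AQI? The scores below are XGBoost's normalized feature importances: they rank how much each feature contributes to the model's splits, but they are unsigned, so they do not show whether higher feature values correspond to higher AQI (worse air quality). Establishing direction requires a signed method such as partial dependence or SHAP values.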

In [None]:
importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': best_model.feature_importances_
}).sort_values('importance', ascending=False)

print(importance)

                             feature  importance
1  % Black or African American alone    0.452764
2               % Hispanic or Latino    0.313581
0            Median_Household_Income    0.233655
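
The importances printed above are non-negative scores that sum to 1, so they rank features but carry no sign. One way to probe direction is a partial-dependence curve: force one feature to each value on a grid, leave the other features as observed, and average the predictions. The sketch below is a minimal NumPy version under stated assumptions: `partial_dependence_1d` and `ToyModel` are hypothetical names, and the toy model and data are invented for illustration and are not the notebook's model or results.

```python
import numpy as np

def partial_dependence_1d(model, X, col, grid):
    """Average prediction when column `col` is forced to each grid value."""
    X = np.asarray(X, dtype=float)
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, col] = value          # overwrite one feature for every row
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)

class ToyModel:
    """Hypothetical stand-in for a fitted regressor (NOT the notebook's model)."""
    def predict(self, X):
        # Prediction rises with column 0 and falls with column 1.
        return 40.0 + 0.2 * X[:, 0] - 0.1 * X[:, 1]

# Synthetic features, purely for demonstration.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 50, size=(200, 2))
grid = np.linspace(0, 50, 6)

pd_col0 = partial_dependence_1d(ToyModel(), X_toy, col=0, grid=grid)
pd_col1 = partial_dependence_1d(ToyModel(), X_toy, col=1, grid=grid)

print("col 0 curve increasing:", bool(np.all(np.diff(pd_col0) > 0)))
print("col 1 curve decreasing:", bool(np.all(np.diff(pd_col1) < 0)))
```

Applying this to `best_model` would require adapting the column handling to the notebook's DataFrame inputs; scikit-learn's `sklearn.inspection.partial_dependence` is a ready-made alternative.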


## 6. Race-Only Model (Baseline Comparison)

Same setup but **only race features** (no income). How much predictive power does race alone have?

In [None]:
race_cols = ['% Black or African American alone', '% Hispanic or Latino']
X_race = df[race_cols]
y_race = df['median_aqi']
weights_race = df['sample_weight']

X_train_r, X_test_r, y_train_r, y_test_r, w_train_r, w_test_r = train_test_split(
    X_race, y_race, weights_race, test_size=0.2, random_state=42
)

kf_r = KFold(n_splits=5, shuffle=True, random_state=42)
cv_splits_r = list(kf_r.split(X_train_r))

grid_search_r = GridSearchCV(
    xgb.XGBRegressor(random_state=42, objective='reg:squarederror'),
    param_grid, cv=cv_splits_r, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1
)
grid_search_r.fit(X_train_r, y_train_r, sample_weight=w_train_r)

# refit=True (the default) has already refit best_estimator_ on the full training set.
best_race = grid_search_r.best_estimator_
y_pred_r = best_race.predict(X_test_r)

print("Race-only model - Best params:", grid_search_r.best_params_)
print("Test set metrics:")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test_r, y_pred_r)):.2f}")
print(f"  MAE:  {mean_absolute_error(y_test_r, y_pred_r):.2f}")
print(f"  R2:   {r2_score(y_test_r, y_pred_r):.4f}")
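
# --- Sketch (not part of the original notebook) ---------------------------
# Both evaluation cells repeat the same three metric prints and leave
# w_test / w_test_r unused. A shared helper with optional weighting could look
# like the following. `regression_report` is a hypothetical name and the toy
# arrays are made up for illustration. The weighted formulas are intended to
# mirror what scikit-learn's metric functions compute given `sample_weight`.

```python
import numpy as np

def regression_report(y_true, y_pred, sample_weight=None):
    """Return RMSE, MAE and R2, optionally weighted by `sample_weight`."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    w = None if sample_weight is None else np.asarray(sample_weight, dtype=float)

    err = y_true - y_pred
    mse = np.average(err ** 2, weights=w)
    mae = np.average(np.abs(err), weights=w)

    # R2 compares the model's MSE with that of predicting the (weighted) mean.
    y_bar = np.average(y_true, weights=w)
    mse_mean = np.average((y_true - y_bar) ** 2, weights=w)

    return {
        "rmse": float(np.sqrt(mse)),
        "mae": float(mae),
        "r2": float(1.0 - mse / mse_mean),
    }

# Toy values, purely illustrative.
y_true = [40.0, 50.0, 60.0, 70.0]
y_pred = [42.0, 48.0, 63.0, 66.0]
weights = [1.0, 1.0, 0.5, 0.5]

unweighted = regression_report(y_true, y_pred)
weighted = regression_report(y_true, y_pred, sample_weight=weights)
print("unweighted:", unweighted)
print("weighted:  ", weighted)
```

In the notebook this could be called as `regression_report(y_test, y_pred, w_test)` and `regression_report(y_test_r, y_pred_r, w_test_r)` to compare the two models on the same footing.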

In [None]:
importance_race = pd.DataFrame({
    'feature': X_train_r.columns,
    'importance': best_race.feature_importances_
}).sort_values('importance', ascending=False)
print("Race-only feature importance:")
print(importance_race)