# XGBoost: Predicting AQI from Income, Race, and Population Density

This notebook trains XGBoost models to predict median AQI from **income**, **racial composition** (% Black, % Hispanic or Latino), and **population density**. The question: does adding density improve prediction?

- **Target:** median_aqi
- **Full model features:** Median_Household_Income, % Black or African American alone, % Hispanic or Latino, population_density
- **Density-only model features:** population_density
- **Split:** random 80/20 train/test (random_state=42)
- **Sample weights:** the sample_weight column, used in training to downweight low-quality observations
- **Hyperparameter tuning:** grid search with 5-fold CV on the training set

## 1. Load Data and Prepare

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import xgboost as xgb

df = pd.read_csv('../JOINED-aqi-income-race-populationDensity/aqi-income-race-populationDensity-joined.csv')

feature_cols = ['Median_Household_Income', '% Black or African American alone', '% Hispanic or Latino', 'population_density']
required_cols = ['median_aqi', 'sample_weight'] + feature_cols
df = df.dropna(subset=required_cols)

print(f"Data shape: {df.shape}")
print(f"median_aqi range: {df['median_aqi'].min():.0f} - {df['median_aqi'].max():.0f}")
print(f"population_density range: {df['population_density'].min():.2f} - {df['population_density'].max():.2f}")
print(f"Features: {feature_cols}")
df[required_cols].head()

Data shape: (942, 15)
median_aqi range: 3 - 90
population_density range: 0.12 - 71916.19
Features: ['Median_Household_Income', '% Black or African American alone', '% Hispanic or Latino', 'population_density']


|   | median_aqi | sample_weight | Median_Household_Income | % Black or African American alone | % Hispanic or Latino | population_density |
|---|---|---|---|---|---|---|
| 0 | 42 | 1.0 | 78775.0 | 8.0 | 5.8 | 155.352379 |
| 1 | 32 | 1.0 | 55250.0 | 12.8 | 3.3 | 23.498167 |
| 2 | 42 | 1.0 | 51204.0 | 1.3 | 17.1 | 93.00108 |
| 3 | 32 | 0.983333 | 78243.0 | 21.1 | 3.4 | 144.743224 |
| 4 | 45 | 1.0 | 54563.0 | 14.7 | 5.2 | 192.68181 |


## 2. Random Train/Test Split (80/20)

In [10]:
X_full = df[feature_cols]
y = df['median_aqi']
weights = df['sample_weight']

X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X_full, y, weights, test_size=0.2, random_state=42
)

print(f"Train: {len(X_train)}, Test: {len(X_test)}")

Train: 753, Test: 189
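
The sizes printed above follow from simple arithmetic. The sketch below reproduces them, assuming the test set is rounded up (`ceil(test_size * n_samples)`, which matches the printed 753/189) and that `KFold` gives the first `n % k` folds one extra row, as its documentation describes. It also shows how many training rows each validation fold holds in the 5-fold CV used later.

```python
import math

n_rows = 942       # rows remaining after dropna, from the printed data shape
test_size = 0.2    # same fraction passed to train_test_split
n_splits = 5       # same number of folds passed to KFold

# Test set is rounded up; the training set gets the remainder.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

# KFold: the first (n_train % n_splits) folds get one extra row.
base, extra = divmod(n_train, n_splits)
fold_sizes = [base + 1 if i < extra else base for i in range(n_splits)]

print(f"Train: {n_train}, Test: {n_test}")
print(f"Validation fold sizes: {fold_sizes}")
```

Each candidate in the grid search is therefore trained on roughly 600 rows and validated on about 150.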


## 3. Full Model: Income + Race + Population Density

In [11]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_splits = list(kf.split(X_train))

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1],
    'min_child_weight': [1, 3],
}

xgb_model = xgb.XGBRegressor(random_state=42, objective='reg:squarederror')

grid_search = GridSearchCV(
    xgb_model, param_grid, cv=cv_splits, scoring='neg_mean_squared_error',
    n_jobs=-1, verbose=1
)

grid_search.fit(X_train, y_train, sample_weight=w_train)

print("Best params:", grid_search.best_params_)
print("Best CV MSE:", -grid_search.best_score_)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best params: {'learning_rate': 0.05, 'max_depth': 3, 'min_child_weight': 3, 'n_estimators': 100}
Best CV MSE: 78.71603958879263
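
The fit count in the log follows from the grid itself: 3 × 3 × 2 × 2 = 36 candidates, each fit on 5 folds. The sketch below enumerates the same grid with the standard library and flags which selected values sit on an edge of their search range. The grid and best parameters are copied from the cell and output above; nothing is re-fit.

```python
from itertools import product

# Grid and selected parameters copied from the notebook cell and its output.
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
    "min_child_weight": [1, 3],
}
best_params = {
    "learning_rate": 0.05,
    "max_depth": 3,
    "min_child_weight": 3,
    "n_estimators": 100,
}
n_folds = 5

# Every combination of grid values is one candidate.
candidates = list(product(*param_grid.values()))
total_fits = len(candidates) * n_folds

# A selected value on the boundary of its range hints that the
# optimum may lie outside the grid.
on_edge = {
    name: best_params[name] in (min(values), max(values))
    for name, values in param_grid.items()
}

print(f"{len(candidates)} candidates x {n_folds} folds = {total_fits} fits")
for name, flag in on_edge.items():
    print(f"{name}: best={best_params[name]}, on grid edge={flag}")
```

All four selected values lie on an edge of the grid, each in the direction of a simpler or more regularized model (fewest trees, shallowest depth, lowest learning rate, highest `min_child_weight`). Extending the grid in that direction is a possible follow-up; this notebook does not test it.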


In [12]:
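# GridSearchCV (refit=True by default) has already refit the best
# configuration on the full training set with the same sample weights,
# so no additional fit call is needed here.
best_model = grid_search.best_estimator_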

y_pred_full = best_model.predict(X_test)

print("Full model - Test set metrics:")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_full)):.2f}")
print(f"  MAE:  {mean_absolute_error(y_test, y_pred_full):.2f}")
print(f"  R2:   {r2_score(y_test, y_pred_full):.4f}")

Full model - Test set metrics:
  RMSE: 9.51
  MAE:  6.77
  R2:   0.2117
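
The cross-validation score is reported as MSE while the test metrics use RMSE, so the two are not directly comparable as printed. Taking the square root of the CV MSE puts them on the same scale. The numbers below are copied from the outputs above. Note that the square root of a mean MSE is not the same as the mean of per-fold RMSEs, so treat the result as an approximate RMSE equivalent.

```python
import math

cv_mse = 78.71603958879263   # "Best CV MSE" from the grid search output
test_rmse = 9.51             # test RMSE of the full model, as printed

# Convert the CV score to the same units as the test metric (AQI points).
cv_rmse = math.sqrt(cv_mse)
gap = test_rmse - cv_rmse

print(f"CV RMSE (approx.): {cv_rmse:.2f}")
print(f"Test RMSE:         {test_rmse:.2f}")
print(f"Gap:               {gap:.2f}")
```

The test error is somewhat higher than the CV estimate. With a single 189-row test split, a gap of this size could plausibly come from sampling variation alone; that reading is a hedged interpretation, not something the notebook measures.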


In [13]:
importance_full = pd.DataFrame({
    'feature': X_train.columns,
    'importance': best_model.feature_importances_
}).sort_values('importance', ascending=False)

print("Full model - Feature importance:")
print(importance_full)

Full model - Feature importance:
                             feature  importance
1  % Black or African American alone    0.322591
3                 population_density    0.293787
2               % Hispanic or Latino    0.217220
0            Median_Household_Income    0.166402
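
`feature_importances_` is a normalized, model-internal score (the four values above sum to 1), and it says nothing about the direction of an effect. One commonly used cross-check is permutation importance: shuffle one column at a time and measure how much the error grows. The sketch below illustrates that idea in pure Python on a hypothetical toy model and synthetic data; it is not the notebook's model. scikit-learn provides a comparable `sklearn.inspection.permutation_importance` that could be applied to `best_model` on the test set.

```python
import random


def rmse(y_true, y_pred):
    """Root mean squared error for two equal-length sequences."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in RMSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        total_increase = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            total_increase += rmse(y, [predict(r) for r in X_perm]) - baseline
        importances.append(total_increase / n_repeats)
    return importances


# Hypothetical toy model: uses feature 0 and ignores feature 1.
def toy_predict(row):
    return 3.0 * row[0]


data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [3.0 * row[0] for row in X]

importances = permutation_importance(toy_predict, X, y)
print(importances)
```

In this toy setup, shuffling the ignored feature leaves the error unchanged, while shuffling the used feature increases it. Applied to the real model, the same procedure would express each feature's contribution in AQI points of RMSE.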


## 4. Density-Only Model

In [14]:
density_col = ['population_density']
X_density = df[density_col]

# Same random_state and same row order reproduce the full model's train/test
# split, so both models are evaluated on identical test rows.
X_train_d, X_test_d, y_train_d, y_test_d, w_train_d, w_test_d = train_test_split(
    X_density, y, weights, test_size=0.2, random_state=42
)

kf_d = KFold(n_splits=5, shuffle=True, random_state=42)
cv_splits_d = list(kf_d.split(X_train_d))

grid_search_d = GridSearchCV(
    xgb.XGBRegressor(random_state=42, objective='reg:squarederror'),
    param_grid, cv=cv_splits_d, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1
)
grid_search_d.fit(X_train_d, y_train_d, sample_weight=w_train_d)

# As with the full model, GridSearchCV has already refit the best
# configuration on the full training set with the sample weights.
best_density = grid_search_d.best_estimator_
y_pred_density = best_density.predict(X_test_d)

print("Density-only model - Best params:", grid_search_d.best_params_)
print("Test set metrics:")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test_d, y_pred_density)):.2f}")
print(f"  MAE:  {mean_absolute_error(y_test_d, y_pred_density):.2f}")
print(f"  R2:   {r2_score(y_test_d, y_pred_density):.4f}")

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Density-only model - Best params: {'learning_rate': 0.05, 'max_depth': 3, 'min_child_weight': 3, 'n_estimators': 100}
Test set metrics:
  RMSE: 9.88
  MAE:  6.98
  R2:   0.1494
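
Both models are trained with sample weights, but the test metrics above are unweighted (`w_test` is created in the split and not used). If evaluation should also discount low-quality observations, a weighted RMSE is one option. The sketch below shows the calculation in plain Python on hypothetical values; scikit-learn's metric functions accept a `sample_weight` argument that would do the same with `w_test`.

```python
import math


def weighted_rmse(y_true, y_pred, weights):
    """RMSE in which each squared error counts in proportion to its weight."""
    weighted_sq_err = sum(
        w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, weights)
    )
    return math.sqrt(weighted_sq_err / sum(weights))


# Hypothetical values for illustration only (not taken from the dataset).
y_true = [42, 32, 45]
y_pred = [40.0, 35.0, 45.0]
weights = [1.0, 0.5, 1.0]   # the second observation is trusted less

unweighted = weighted_rmse(y_true, y_pred, [1.0] * len(y_true))
weighted = weighted_rmse(y_true, y_pred, weights)

print(f"Unweighted RMSE: {unweighted:.3f}")
print(f"Weighted RMSE:   {weighted:.3f}")
```

Here the largest error falls on the downweighted observation, so the weighted RMSE is lower than the unweighted one. Whether weighted or unweighted evaluation is more appropriate depends on what `sample_weight` encodes, which the notebook does not spell out.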


In [15]:
importance_density = pd.DataFrame({
    'feature': X_train_d.columns,
    'importance': best_density.feature_importances_
}).sort_values('importance', ascending=False)

print("Density-only feature importance:")
print(importance_density)

Density-only feature importance:
              feature  importance
0  population_density         1.0


## 5. Model Comparison Summary

In [16]:
comparison = pd.DataFrame({
    'Model': ['Full (income + race + density)', 'Density-only'],
    'RMSE': [
        np.sqrt(mean_squared_error(y_test, y_pred_full)),
        np.sqrt(mean_squared_error(y_test_d, y_pred_density))
    ],
    'MAE': [
        mean_absolute_error(y_test, y_pred_full),
        mean_absolute_error(y_test_d, y_pred_density)
    ],
    'R2': [
        r2_score(y_test, y_pred_full),
        r2_score(y_test_d, y_pred_density)
    ]
})
print(comparison.to_string(index=False))

                         Model     RMSE      MAE       R2
Full (income + race + density) 9.511872 6.769869 0.211738
                  Density-only 9.881006 6.975402 0.149370
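
A worked reading of the table: because R2 = 1 - MSE / Var(y) on the test set, the RMSE and R2 columns together imply the RMSE of a constant predictor that always outputs the test-set mean. The sketch below recovers that baseline from the printed values and expresses the gap between the two models in relative terms. All inputs are copied from the table above; nothing is re-fit.

```python
import math

# Values copied from the comparison table.
results = {
    "Full (income + race + density)": {"rmse": 9.511872, "r2": 0.211738},
    "Density-only": {"rmse": 9.881006, "r2": 0.149370},
}

# R2 = 1 - MSE / Var(y_test)  =>  Var(y_test) = MSE / (1 - R2).
# The square root of that variance is the RMSE of a mean-only predictor.
baseline_rmse = {
    name: math.sqrt(m["rmse"] ** 2 / (1 - m["r2"]))
    for name, m in results.items()
}

full = results["Full (income + race + density)"]
dens = results["Density-only"]

# Relative RMSE reduction of the full model versus density-only.
rmse_reduction_pct = 100 * (dens["rmse"] - full["rmse"]) / dens["rmse"]
r2_gain = full["r2"] - dens["r2"]

for name, value in baseline_rmse.items():
    print(f"Implied mean-only RMSE ({name}): {value:.2f}")
print(f"RMSE reduction, full vs. density-only: {rmse_reduction_pct:.1f}%")
print(f"R2 gain, full vs. density-only: {r2_gain:.4f}")
```

Both rows imply the same baseline of about 10.7 AQI points, as expected for models that share a test set. Relative to that baseline, the full model lowers RMSE by roughly 11% and the density-only model by roughly 8%; the full model's RMSE is just under 4% below the density-only model's.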
