# 11 Economic Mobility Model

**Project:** NORI  
**Author:** Yuseof J  
**Date:** 26/12/25

### **Purpose**
Load all features and economic mobility outcomes (model targets), train and evaluate several candidate models, then select the best-performing model to make final economic mobility predictions across all NYC tracts.

What this model is doing exactly:

By observing a specific cohort with a fixed starting point (i.e. children born between 1978 and 1983 to parents at the 25th percentile of national income) and measuring their *current* income rank, we can identify which tract-level characteristics of their environment are correlated with higher long-run mobility outcomes (i.e. children who currently have a higher income than their parents), and vice versa. This is meant to identify environmental characteristics that might be worth considering in resource planning and program/intervention design for more equitable economic mobility.

Major assumptions:

**1. Stable residence**: children grew up in, and still live in, the tract where their parents lived, or at least in a similar tract / neighborhood.

**2. Structural stability**: tract-level characteristics (e.g. green space, population density) have not changed substantially over the decades, so present-day measurements approximate the childhood environment.

Of course, these will not always be true. People move, and neighborhoods can change drastically. So we view the model's learned associations as estimates, not as a causal explanation of how environment affects economic mobility outcomes.

**Important limitation**: This model is simply learning correlation, not causation. For example, we may notice that large amounts of green space area are strongly correlated with higher long-run economic mobility outcomes. We don't know if this is because having the green area caused the outcome, or if people with higher incomes than their parents have moved to areas with more green area than the tracts that their parents grew up in. What we *can* do with the model's learned relationships is to identify highly correlated features for further investigation (e.g. causal analysis) to isolate the effect of features on outcomes. 

### **Inputs**
- `data/processed/master_features.csv`
- `data/processed/outcomes_econ_mobility.csv`

### **Outputs**
- `models/xgb_econ_mobility_model.pkl`
- `data/processed/predictions_econ_mobility.csv`
- `data/processed/model_performance_econ_mobility.csv`
- `data/processed/nyc_tracts.gpkg` (layer = `econ_mobility_predictions`)
  
--------------------------------------------------------------------------

### 0. Imports and Setup

In [113]:
# package imports
import os
import joblib
import numpy as np
import pandas as pd
import geopandas as gpd
from pathlib import Path
from xgboost import XGBRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, root_mean_squared_error
from sklearn.model_selection import cross_val_score, GroupKFold

# specify filepaths
path_nyc_tracts = 'data/processed/nyc_tracts.gpkg'
path_model_features = 'data/processed/master_features.csv'
path_model_targets = 'data/processed/outcomes_econ_mobility.csv'
path_performance_metrics = 'data/processed/model_performance_econ_mobility.csv'
path_final_model_pkl = 'models/xgb_econ_mobility_model.pkl'
path_output_tract_preds = 'data/processed/predictions_econ_mobility.csv'
output_gpkg_layer = 'econ_mobility_predictions'

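# ensure cwd is project root for file paths to function properly
project_root = Path(os.getcwd())            # start from the current directory
while not (project_root / "data").exists(): # keep moving up until a "data" folder is found
    if project_root == project_root.parent: # reached the filesystem root: stop instead of looping forever
        raise FileNotFoundError("could not find a parent directory containing 'data'")
    project_root = project_root.parent
os.chdir(project_root)                      # switch to the project root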

### 1. Load Data

In [114]:
# nyc tracts
gdf_nyc_tracts = gpd.read_file(path_nyc_tracts, layer="tracts")

# features (X), targets (y)
X = pd.read_csv(path_model_features)
y = pd.read_csv(path_model_targets)

# keep copies for accounting and assertion statements
X_original = X.copy()
y_original = y.copy()

### 2. Prepare Data

For this first pass, we'll focus on a single target to get the pipeline up and running.

In [115]:
TARGET = "income_rank_children"

y = y[['GEOID', TARGET]]

In [116]:
# take note of feature columns
feature_cols = X.columns.tolist()

# get features and targets together for dropping rows without target value for train/eval
df_model = y.merge(X, how="left", on="GEOID")

# TODO: TEMPORARY: DROPPING NAN FEATURE ROWS UNTIL IMPUTATION METHOD DETERMINED
df_model = df_model.dropna(subset=feature_cols)

# for final model predictions, keep all tracts
df_model_all_tracts = df_model.copy()

# for train/eval, select only tracts with a value for the target
df_model = df_model[df_model[TARGET].notna()]

# re-separate features and target for modeling
X = df_model[feature_cols].copy()
y = df_model[TARGET].copy()
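
The left merge above implicitly assumes one row per `GEOID` in both tables. A minimal sketch of how that assumption could be checked, using toy frames (the values, and the use of `validate` and `indicator`, are illustrative and not part of the original pipeline):

```python
import pandas as pd

# toy stand-ins for the targets and features tables (hypothetical values)
y_toy = pd.DataFrame({"GEOID": [1, 2, 3],
                      "income_rank_children": [0.5, None, 0.4]})
X_toy = pd.DataFrame({"GEOID": [1, 2], "poverty_rate": [0.1, 0.2]})

# validate= raises a MergeError if GEOID is duplicated on either side;
# indicator= adds a _merge column recording where each row came from
merged = y_toy.merge(X_toy, how="left", on="GEOID",
                     validate="one_to_one", indicator=True)

# tracts present in the targets table but absent from the features table
missing_features = merged.loc[merged["_merge"] == "left_only", "GEOID"].tolist()
print(missing_features)
```

Rows flagged `left_only` would end up with NaN features and be removed by the `dropna` step, so listing them first makes that loss visible.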

In [117]:
X.head()

Unnamed: 0,GEOID,distance_to_park_m,park_area_500m_centroid,park_area_1km_centroid,percent_tree_canopy,median_household_income,poverty_rate,unemployment_rate,gini_index,pct_higher_ed,pct_renters,median_gross_rent,pct_rent_burdened,pct_no_vehicle,pop_density_sq_km,pct_age_65_plus
0,36085024402,169.509962,120604.3,685929.1,0.031706,117981.0,0.029496,0.024523,0.4031,0.42727,0.181553,1429.0,0.324759,0.042032,2692.772684,0.19393
1,36085027705,2129.879397,0.0,0.0,0.0,96684.0,0.09977,0.030645,0.4349,0.311325,0.182136,1799.0,0.69863,0.005988,11465.037656,0.191336
2,36085012806,0.0,1370448.0,4764153.0,0.0,61378.0,0.08387,0.037376,0.4349,0.259306,0.648697,1797.0,0.694598,0.004942,4022.827347,0.186134
3,36047024400,622.327717,0.0,349727.3,0.0,67500.0,0.398833,0.069648,0.3851,0.313904,0.621649,1748.0,0.60199,0.106186,23808.91047,0.078171
4,36047023000,972.144627,0.0,1453.035,0.0,51250.0,0.451197,0.177823,0.5221,0.188471,0.733522,1630.0,0.593068,0.10452,32376.888983,0.078985


In [118]:
y.head()

0    0.595816
1    0.597892
2    0.525586
3    0.509934
4    0.426163
Name: income_rank_children, dtype: float64

#### Spatial CV - Borough

Here we'll bring in borough so that we can effectively split the data for spatial cross-validation. It's important to split the data by some meaningful geographic grouping because the tracts are not independent of one another: neighboring tracts share characteristics like socioeconomic conditions, health outcomes, air quality, and green coverage. If we split the spatial data at random, we may end up with the scenario where one tract is used for training, and its neighbor is used in testing. This will lead to overestimated model performance because those two tracts are too similar, meaning the model isn't learning to generalize to new places, but rather memorizing information about similar tract contexts/characteristics.
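
To make the idea concrete, here is a small sketch on toy data (the arrays and borough codes are made up) showing that `GroupKFold` never places the same group on both sides of a split:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# toy data: 10 "tracts" spread over 5 hypothetical borough codes
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.linspace(0.3, 0.7, 10)
groups_toy = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])

cv_toy = GroupKFold(n_splits=5)
held_out = []
for train_idx, test_idx in cv_toy.split(X_toy, y_toy, groups_toy):
    train_groups = set(groups_toy[train_idx])
    test_groups = set(groups_toy[test_idx])
    # no borough appears in both the training and the test side of a fold
    assert train_groups.isdisjoint(test_groups)
    held_out.append(sorted(test_groups))

# with 5 groups and 5 splits, each fold holds out exactly one group
print(held_out)
```

With five boroughs and five splits, each fold amounts to "train on four boroughs, test on the fifth", which is a fairly strict test of generalization to an unseen area.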

In [119]:
# ensure matching dtypes before join
X.GEOID = X.GEOID.astype(int)
gdf_nyc_tracts.GEOID = gdf_nyc_tracts.GEOID.astype(int)

# get borough codes for tracts
X = X.merge(gdf_nyc_tracts[['GEOID', 'BoroCode']], 
           how='left', 
           on='GEOID',
)

In [120]:
# create borough-based CV groups
groups = X['BoroCode']
cv = GroupKFold(n_splits=5)

In [121]:
# drop identifier and spatial label before modeling
X.drop(columns=['GEOID', 'BoroCode'], inplace=True)

#### Handling Missing Values

TODO: for now, rows with NaN feature values are dropped (above) until a more comprehensive method is developed.
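
One possible direction, sketched here under the assumption that median imputation would be acceptable for these features (the notebook has not settled on a method): placing a `SimpleImputer` inside the pipeline means the fill values are learned only from the rows passed to `fit`, i.e. the training fold during cross-validation. The data below is a toy example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# toy feature matrix with one missing value (hypothetical numbers)
X_toy = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y_toy = np.array([0.40, 0.45, 0.50, 0.55])

# imputer -> scaler -> model, all fitted together on the training rows
impute_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
impute_pipe.fit(X_toy, y_toy)

# per-column medians learned from the observed (non-missing) values
print(impute_pipe.named_steps["imputer"].statistics_)
```

XGBoost can also accept NaN inputs directly, so an imputation step like this would matter mainly for the linear regression baseline.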

### 3. Cross-Validated Model Training and Evaluation

Using the folds generated by the borough-based spatial CV above, we train a baseline model (linear regression) and a tree-based model (XGBoost), then compare their performance using RMSE and R2.

In [122]:
# initialize models (currently fixed hyperparameters for xgb)

# linear regression
lr_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression())
])

# xgboost
xgb = XGBRegressor(
    n_estimators=300,
    max_depth=5,
    learning_rate=0.05,
    subsample=0.8,
    random_state=42
)

In [123]:
# run cv training and evaluation loop

scores = {
    "linreg": {"r2":[], "rmse":[]},
    "xgb": {"r2":[], "rmse":[]}
}

for fold, (train_idx, test_idx) in enumerate(cv.split(X, y, groups)):

    print(f"Beginning training and eval for fold {fold}")

    # initialize training and validation sets for this fold
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    # -------- Linear Regression --------------
    lr_pipe.fit(X_train, y_train)
    y_pred_lr = lr_pipe.predict(X_test)

    lr_r2_score = r2_score(y_test, y_pred_lr)
    lr_rmse_score = root_mean_squared_error(y_test, y_pred_lr)

    scores["linreg"]["r2"].append(lr_r2_score)
    scores["linreg"]["rmse"].append(lr_rmse_score)

    # ------------- XGBoost -------------------
    xgb.fit(X_train, y_train)
    y_pred_xgb = xgb.predict(X_test)

    xgb_r2_score = r2_score(y_test, y_pred_xgb)
    xgb_rmse_score = root_mean_squared_error(y_test, y_pred_xgb)

    scores["xgb"]["r2"].append(xgb_r2_score)
    scores["xgb"]["rmse"].append(xgb_rmse_score)

    print("Completed successfully!")
    
print("Training and validation completed successfully!")

Beginning training and eval for fold 0
Completed successfully!
Beginning training and eval for fold 1
Completed successfully!
Beginning training and eval for fold 2
Completed successfully!
Beginning training and eval for fold 3
Completed successfully!
Beginning training and eval for fold 4
Completed successfully!
Training and validation completed successfully!
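
As an alternative to the explicit loop, the same grouped evaluation could probably be expressed with scikit-learn's `cross_validate`. The sketch below uses synthetic data and a plain `LinearRegression`; the variable names are illustrative and this is not the code that produced the results in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, cross_validate

# synthetic regression data with 5 equally sized groups (hypothetical)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=50)
groups_toy = np.repeat([1, 2, 3, 4, 5], 10)

results = cross_validate(
    LinearRegression(), X_toy, y_toy,
    groups=groups_toy,
    cv=GroupKFold(n_splits=5),
    scoring={"r2": "r2", "rmse": "neg_root_mean_squared_error"},
)

# scikit-learn reports errors as negated scores, so flip the sign back
rmse_per_fold = -results["test_rmse"]
print(results["test_r2"].mean(), rmse_per_fold.mean())
```

The explicit loop remains easier to extend (for example, to store per-fold predictions), so this is a trade-off between brevity and flexibility.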


### 4. Performance Comparison

In [124]:
mean_scores = []

for model in scores:

    # calculate mean scores across cv folds
    mean_r2 = np.mean(scores[model]["r2"])
    mean_rmse = np.mean(scores[model]["rmse"])

    # print scores for review
    print(f"\n{model} performance evaluation:")
    print("R2: ", mean_r2)
    print("RMSE: ", mean_rmse)

    # for outputting scores to artifact
    mean_scores.append({"model":model, "r2_cv":mean_r2, "rmse_cv": mean_rmse})


linreg performance evaluation:
R2:  -0.2977826259068722
RMSE:  0.08140891833495763

xgb performance evaluation:
R2:  0.11495030846382434
RMSE:  0.06911113566329115


A look at the scores for each spatial CV split (i.e. each held-out borough):

In [125]:
scores

{'linreg': {'r2': [0.1881949810858642,
   -0.08108691950779878,
   0.45476923224118415,
   -0.2412194008753703,
   -1.8095710224782402],
  'rmse': [0.07678566378953279,
   0.07805971536231746,
   0.04424951461464969,
   0.07887642552564787,
   0.12907327238264035]},
 'xgb': {'r2': [0.17541782850396692,
   0.025689624754657192,
   0.38530791869951786,
   -0.18530295183728684,
   0.1736391221982666],
  'rmse': [0.07738757666643048,
   0.07410462614851292,
   0.04698369219741443,
   0.07707927599155681,
   0.07000050731254101]}}
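
Because there are only five folds, a single poor fold can dominate the mean. The sketch below re-aggregates the per-fold R2 values printed above (copied verbatim) and reports the median and worst fold alongside the mean; the choice of median as a comparison is an illustration, not something the notebook computes.

```python
from statistics import mean, median

# per-fold R2 values copied from the scores output above
r2_linreg = [0.1881949810858642, -0.08108691950779878, 0.45476923224118415,
             -0.2412194008753703, -1.8095710224782402]
r2_xgb = [0.17541782850396692, 0.025689624754657192, 0.38530791869951786,
          -0.18530295183728684, 0.1736391221982666]

for name, vals in [("linreg", r2_linreg), ("xgb", r2_xgb)]:
    worst_fold = vals.index(min(vals))  # index of the lowest-scoring fold
    print(name,
          "mean:", round(mean(vals), 3),
          "median:", round(median(vals), 3),
          "worst fold:", worst_fold)
```

For linear regression, the last fold (R2 of about -1.81) pulls the mean well below the other four folds, whereas XGBoost's fold scores are more tightly clustered.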

### 5. Model Summary

**- Explainability -**

Using spatial cross-validation, our XGBoost regression model explains approximately **11% of the variation in tract-level economic mobility outcomes** (measured as the income rank of children from low-income families). 

This means that the features currently included in the model **(green space features and ACS demographics)** capture *some meaningful, systematic differences* in economic mobility outcomes. However, the remaining **~89% of variation in mobility outcomes between tracts is currently unexplained** by these features. This makes sense: economic mobility is shaped by structural factors (like school quality, discrimination, labor markets) that are not fully reflected in demographics and the built environment. 
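
For reference, R2 compares the model's squared error with that of always predicting the mean, which is why it can be read as "share of variation explained" and why it can go negative. A minimal sketch with made-up values (the helper `r2` below is written out for illustration; the notebook itself uses `sklearn.metrics.r2_score`):

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # model's squared error
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot

y_obs = [0.40, 0.50, 0.60]                        # hypothetical observed ranks
r2_mean_only = r2(y_obs, [0.50, 0.50, 0.50])      # predicting the mean everywhere
r2_reversed = r2(y_obs, [0.60, 0.50, 0.40])       # predictions that reverse the ordering
print(r2_mean_only, r2_reversed)
```

A score near zero means the model does no better than the mean of the held-out tracts, and a negative score (as in some folds above) means it does worse.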

**- Prediction Performance -**

As for the model's predictive capability: the cross-validated RMSE is about 0.07 on the 0-1 income rank scale, so predictions are typically off by about **7 percentage points**. 
For example, if the observed income rank for a given tract is 0.53, the model's prediction would typically fall somewhere in the range 0.46 to 0.60. While this level of error limits the model's use for precise tract-level prediction, it is sufficient for comparative analysis and ranking.  
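
One caveat when reading the RMSE as a "typical miss": it weights large errors more heavily than a plain average of absolute errors (MAE) would. A small sketch with hypothetical errors, followed by the 0.53 example worked out:

```python
import math

# hypothetical prediction errors (predicted minus observed income rank)
errors = [0.02, -0.03, 0.05, -0.01, 0.14]

rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE is never smaller than MAE; a few large misses widen the gap
print(f"RMSE={rmse:.3f}, MAE={mae:.3f}")

# the 0.53 example, using the cross-validated RMSE of roughly 0.07
observed, cv_rmse = 0.53, 0.07
low, high = round(observed - cv_rmse, 2), round(observed + cv_rmse, 2)
print(low, high)
```

So the 0.46 to 0.60 band is better read as a rough one-RMSE range than as a guaranteed bound on any single tract's error.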

Overall, the model **partially** captures the dynamics of economic mobility, but does not provide a comprehensive understanding of them.

### 6. Final Model Predictions

In [126]:
final_model = xgb.fit(X, y)

In [127]:
X.head()

Unnamed: 0,distance_to_park_m,park_area_500m_centroid,park_area_1km_centroid,percent_tree_canopy,median_household_income,poverty_rate,unemployment_rate,gini_index,pct_higher_ed,pct_renters,median_gross_rent,pct_rent_burdened,pct_no_vehicle,pop_density_sq_km,pct_age_65_plus
0,169.509962,120604.3,685929.1,0.031706,117981.0,0.029496,0.024523,0.4031,0.42727,0.181553,1429.0,0.324759,0.042032,2692.772684,0.19393
1,2129.879397,0.0,0.0,0.0,96684.0,0.09977,0.030645,0.4349,0.311325,0.182136,1799.0,0.69863,0.005988,11465.037656,0.191336
2,0.0,1370448.0,4764153.0,0.0,61378.0,0.08387,0.037376,0.4349,0.259306,0.648697,1797.0,0.694598,0.004942,4022.827347,0.186134
3,622.327717,0.0,349727.3,0.0,67500.0,0.398833,0.069648,0.3851,0.313904,0.621649,1748.0,0.60199,0.106186,23808.91047,0.078171
4,972.144627,0.0,1453.035,0.0,51250.0,0.451197,0.177823,0.5221,0.188471,0.733522,1630.0,0.593068,0.10452,32376.888983,0.078985


In [128]:
# make sure to use all tracts (originally, we dropped tracts from X, y that had no value for target y. We want those back for final predictions)
X_final = df_model_all_tracts[feature_cols].copy()
y_final = df_model_all_tracts[TARGET].copy()

# set aside GEOID for predictions 
X_geoids = X_final.GEOID
X_final.drop(columns=['GEOID'], inplace=True)

Make sure all tracts are present for model predictions

In [129]:
# this will fail while imputation is not set up (tract rows with missing feature values are simply dropped)
try:
    assert len(X_final) == len(gdf_nyc_tracts)
    print("All tracts accounted for")

# if this passes, then the check above only failed because of the lack of imputation handling
# if this fails, then tracts are inexplicably unaccounted for, and attention is needed
except AssertionError:
    assert len(X_final) == (len(gdf_nyc_tracts) - X_original[feature_cols].isna().any(axis=1).sum())
    print("All tracts accounted for, imputation-dropped tracts not included in predictions")


All tracts accounted for, imputation-dropped tracts not included in predictions


In [130]:
X_final.head()

Unnamed: 0,distance_to_park_m,park_area_500m_centroid,park_area_1km_centroid,percent_tree_canopy,median_household_income,poverty_rate,unemployment_rate,gini_index,pct_higher_ed,pct_renters,median_gross_rent,pct_rent_burdened,pct_no_vehicle,pop_density_sq_km,pct_age_65_plus
0,169.509962,120604.3,685929.1,0.031706,117981.0,0.029496,0.024523,0.4031,0.42727,0.181553,1429.0,0.324759,0.042032,2692.772684,0.19393
1,2129.879397,0.0,0.0,0.0,96684.0,0.09977,0.030645,0.4349,0.311325,0.182136,1799.0,0.69863,0.005988,11465.037656,0.191336
2,0.0,1370448.0,4764153.0,0.0,61378.0,0.08387,0.037376,0.4349,0.259306,0.648697,1797.0,0.694598,0.004942,4022.827347,0.186134
3,622.327717,0.0,349727.3,0.0,67500.0,0.398833,0.069648,0.3851,0.313904,0.621649,1748.0,0.60199,0.106186,23808.91047,0.078171
4,972.144627,0.0,1453.035,0.0,51250.0,0.451197,0.177823,0.5221,0.188471,0.733522,1630.0,0.593068,0.10452,32376.888983,0.078985


In [131]:
tract_preds = final_model.predict(X_final)

In [132]:
df_tract_preds = pd.DataFrame({
    "GEOID": X_geoids,
    "actual": y_final,
    "predicted": tract_preds,
    "residual": y_final - tract_preds
})

In [133]:
df_tract_preds.head()

Unnamed: 0,GEOID,actual,predicted,residual
0,36085024402,0.595816,0.584794,0.011023
1,36085027705,0.597892,0.560722,0.03717
2,36085012806,0.525586,0.520427,0.005159
3,36047024400,0.509934,0.476076,0.033858
4,36047023000,0.426163,0.412003,0.01416


### 7. Output Modeling Artifacts

A. Performance Metrics

In [134]:
df_perform_metrics = pd.DataFrame(mean_scores)

In [135]:
df_perform_metrics

Unnamed: 0,model,r2_cv,rmse_cv
0,linreg,-0.297783,0.081409
1,xgb,0.11495,0.069111


In [136]:
df_perform_metrics.to_csv(path_performance_metrics, index=False)

B. Final Model

In [137]:
joblib.dump(final_model, path_final_model_pkl)

['models/xgb_econ_mobility_model.pkl']
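
For downstream use, a saved model can be read back with `joblib.load`. The round trip is sketched below with a stand-in `LinearRegression` and a temporary, hypothetical file path, to keep the example light; the notebook itself saves an `XGBRegressor` to `models/xgb_econ_mobility_model.pkl`.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# stand-in model fitted on toy data (hypothetical values)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.40, 0.45, 0.50, 0.55])
model = LinearRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "toy_model.pkl"  # hypothetical path
    joblib.dump(model, pkl_path)            # write the fitted model to disk
    reloaded = joblib.load(pkl_path)        # read it back into memory

# the reloaded model should reproduce the original predictions
same = bool(np.allclose(model.predict(X_toy), reloaded.predict(X_toy)))
print(same)
```

When reloading the real model, the features would need to be supplied in the same columns and order used at training time.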

C. Tract-Level Predictions

In [138]:
# save to csv
df_tract_preds.to_csv(path_output_tract_preds, index=False)

In [139]:
# and additionally save as layer in nyc tracts geopackage
gdf_nyc_tracts.GEOID = gdf_nyc_tracts.GEOID.astype(int)
df_tract_preds.GEOID = df_tract_preds.GEOID.astype(int)

gdf_tracts_preds = gdf_nyc_tracts.merge(
    df_tract_preds,
    on='GEOID',
    how='left'
)
gdf_tracts_preds.to_file(path_nyc_tracts, layer=output_gpkg_layer)