# 04 ACS Feature Engineering

**Project:** NORI  
**Author:** Yuseof J  
**Date:** 23/12/25  

### **Purpose**
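Load the raw ACS CSV and derive tract-level model features: rates and shares for poverty, unemployment, education, housing, mobility, and demographics, plus population density.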

### **Inputs**
- `data/raw/census_acs.csv`
- `data/processed/nyc_tracts.gpkg`

### **Outputs**
- `data/processed/model_features_acs.csv`
  
--------------------------------------------------------------------------

### 0. Imports and Setup

In [1]:
# package imports
import os
import pandas as pd
import geopandas as gpd
from pathlib import Path

# specify filepaths
path_acs = 'data/raw/census_acs.csv'
path_nyc_tracts = 'data/processed/nyc_tracts.gpkg'
path_output_processed_data = 'data/processed/model_features_acs.csv'

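# ensure cwd is project root so the relative file paths above resolve
project_root = Path.cwd()                   # start from the current directory
while not (project_root / "data").exists(): # walk up until a "data" folder is found
    if project_root == project_root.parent: # reached the filesystem root without finding it
        raise FileNotFoundError("could not locate project root (no 'data' directory found)")
    project_root = project_root.parent
os.chdir(project_root)                      # switch to the project root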

### 1. Load Data

In [2]:
# census acs
df_acs = pd.read_csv(path_acs)

# nyc tracts
gdf_tracts_nyc = gpd.read_file(path_nyc_tracts, layer="tracts")

In [3]:
df_acs.columns.tolist()

['total_population',
 'median_household_income',
 'gini_index',
 'population_poverty_universe',
 'population_below_poverty',
 'labor_force',
 'unemployed_population',
 'population_25_plus',
 'bachelors_degree',
 'masters_degree',
 'professional_degree',
 'doctorate_degree',
 'occupied_housing_units',
 'renter_occupied_units',
 'median_gross_rent',
 'rent_burden_universe',
 'rent_30_to_34_pct_income',
 'rent_35_to_39_pct_income',
 'rent_40_to_49_pct_income',
 'rent_50_plus_pct_income',
 'vehicle_availability_universe',
 'households_no_vehicle',
 'renter_crowded_units',
 'male_65_66',
 'male_67_69',
 'male_70_74',
 'male_75_79',
 'male_80_84',
 'male_85_plus',
 'female_65_66',
 'female_67_69',
 'female_70_74',
 'female_75_79',
 'female_80_84',
 'female_85_plus',
 'disability_universe',
 'male_with_disability',
 'male_without_disability',
 'female_with_disability',
 'female_without_disability',
 'GEOID']

### 2. Feature Engineering

### Economic

Poverty Rate : *ratio of tract residents living in poverty*

In [4]:
# note: we don't use all tract residents for the denom, only those for which we have a known poverty status
df_acs['poverty_rate'] = df_acs['population_below_poverty'] / df_acs['population_poverty_universe']

Unemployment Rate : *unemployed residents out of total labor force*

In [5]:
df_acs['unemployment_rate'] = df_acs['unemployed_population'] / df_acs['labor_force']
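Tracts with no residents or no labor force have a zero denominator, and pandas does not raise in that case: `0 / 0` becomes `NaN` and `x / 0` becomes `inf`. The sketch below shows one way to guard the ratio features, assuming such tracts should simply get a missing value. `safe_ratio` is a hypothetical helper and the toy numbers are made up for illustration; neither is part of the notebook.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two columns, returning NaN wherever the denominator is zero."""
    # swap zero denominators for NaN so the division yields NaN rather than inf
    return numerator / denominator.replace(0, np.nan)

# toy data (illustrative only): the last two tracts have no labor force
demo = pd.DataFrame({'unemployed_population': [50, 0, 10],
                     'labor_force': [1000, 0, 0]})
demo['unemployment_rate'] = safe_ratio(demo['unemployed_population'],
                                       demo['labor_force'])
```

The same helper could be applied to the other ratio features if empty tracts turn out to be present in the data.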

### Education

% Higher Education : *share of residents aged 25+ who hold a bachelor's degree or higher*

In [6]:
df_acs['pct_higher_ed'] = df_acs[['bachelors_degree','masters_degree','professional_degree','doctorate_degree']].sum(axis=1) / df_acs['population_25_plus']

### Housing

% Renters : *share of occupied housing units that are renter-occupied*

In [7]:
df_acs['pct_renters'] = df_acs['renter_occupied_units'] / df_acs['occupied_housing_units']

% Cost-Burdened Renters : *ratio of renters whose rent is >= 30% of their income*

In [8]:
df_acs['pct_rent_burdened'] = df_acs[['rent_30_to_34_pct_income',
                                      'rent_35_to_39_pct_income',
                                      'rent_40_to_49_pct_income',
                                      'rent_50_plus_pct_income']].sum(axis=1) / df_acs['rent_burden_universe']

### Mobility

% No Vehicle Access : *share of households with no vehicle access*

In [9]:
df_acs['pct_no_vehicle'] = df_acs['households_no_vehicle'] / df_acs['vehicle_availability_universe']

### Demographics

% Disability Prevalence : *ratio of residents with a disability*

In [12]:
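# the raw data has no combined disability column, so sum the male and female counts
df_acs['pct_disability_prevalence'] = df_acs[['male_with_disability', 'female_with_disability']].sum(axis=1) / df_acs['disability_universe']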

% Age 65+ : *a measure of age vulnerability*

In [13]:
df_acs['pct_age_65_plus'] = df_acs[['male_65_66',
                                   'male_67_69',
                                   'male_70_74',
                                   'male_75_79',
                                   'male_80_84',
                                   'male_85_plus',
                                   'female_65_66',
                                   'female_67_69',
                                   'female_70_74',
                                   'female_75_79',
                                   'female_80_84',
                                   'female_85_plus']].sum(axis=1) / df_acs['total_population']
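The twelve age columns follow a regular `<sex>_<age band>` naming pattern, so the list can also be generated rather than typed out. This is only a sketch of an alternative, assuming the column names stay exactly as listed in the raw file; `age_bands` and `cols_65_plus` are hypothetical names.

```python
# age bands as they appear in the raw ACS column names
age_bands = ['65_66', '67_69', '70_74', '75_79', '80_84', '85_plus']

# build 'male_65_66', ..., 'female_85_plus' in the same order as the raw file
cols_65_plus = [f'{sex}_{band}' for sex in ('male', 'female') for band in age_bands]
```

The resulting list could then replace the hand-written one, e.g. `df_acs[cols_65_plus].sum(axis=1) / df_acs['total_population']`.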

Population Density : *total residents per square kilometer*

In [14]:
# get area per tract (sq_m)
# cast explicitly to int64: 11-digit tract GEOIDs do not fit in a 32-bit integer
df_acs['GEOID'] = df_acs['GEOID'].astype('int64')
gdf_tracts_nyc['GEOID'] = gdf_tracts_nyc['GEOID'].astype('int64')
df_acs = df_acs.merge(gdf_tracts_nyc[['GEOID', 'ALAND']].copy(), how='left', on='GEOID')

# convert area from sq_m to sq_km (since pop density is more intuitive in sq_km)
df_acs['ALAND_SQ_KM'] = df_acs['ALAND'] / 1_000_000

In [15]:
df_acs['pop_density_sq_km'] = df_acs['total_population'] / df_acs['ALAND_SQ_KM']
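Because the tract areas come from a left merge, any ACS row whose `GEOID` is missing from the tract file ends up with a `NaN` area and therefore a `NaN` density. A minimal sketch of how that could be checked, using pandas' `indicator` and `validate` merge options on made-up toy frames (the GEOIDs and values below are illustrative, not taken from the project data):

```python
import pandas as pd

# toy stand-ins for df_acs and gdf_tracts_nyc (values are made up)
acs = pd.DataFrame({'GEOID': [36005000100, 36005000200, 36005000300],
                    'total_population': [1200, 3400, 560]})
tracts = pd.DataFrame({'GEOID': [36005000100, 36005000200],
                       'ALAND': [2_000_000, 500_000]})  # land area in sq m

# validate= raises if a GEOID appears more than once in the tract table;
# indicator= adds a '_merge' column recording where each row matched
merged = acs.merge(tracts, how='left', on='GEOID',
                   validate='many_to_one', indicator=True)

# ACS rows with no matching tract geometry
unmatched = merged.loc[merged['_merge'] == 'left_only', 'GEOID'].tolist()

# same density calculation as in the notebook: residents per sq km
merged['pop_density_sq_km'] = merged['total_population'] / (merged['ALAND'] / 1_000_000)
```

In this toy example the third tract has no area, so it appears in `unmatched` and its density is missing.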

### 3. Select Features of Interest

The overall ACS data contains many useful columns. For the current project sprint, I'll only be using the following:

In [20]:
df_model_features = df_acs[['GEOID',
                            'median_household_income',
                            'poverty_rate',
                            'unemployment_rate',
                            'gini_index', # measure of inequality
                            'pct_higher_ed',
                            'pct_renters',
                            'median_gross_rent',
                            'pct_rent_burdened',
                            'pct_no_vehicle',
                            'pop_density_sq_km',
                            'pct_age_65_plus']]
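Before saving, it may be worth confirming that the ratio features actually fall between 0 and 1, since a value outside that range usually points to a mismatched numerator and denominator. The sketch below is one possible check; `find_out_of_range` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

def find_out_of_range(df: pd.DataFrame, ratio_cols: list[str]) -> dict[str, int]:
    """Count values outside [0, 1] per column; missing values are ignored."""
    counts = {}
    for col in ratio_cols:
        values = df[col].dropna()
        n_bad = int(((values < 0) | (values > 1)).sum())
        if n_bad:
            counts[col] = n_bad  # only report columns that have a problem
    return counts

# toy data (illustrative only): one pct_renters value exceeds 1
demo = pd.DataFrame({'poverty_rate': [0.12, 0.30, None],
                     'pct_renters': [0.65, 1.40, 0.20]})
problems = find_out_of_range(demo, ['poverty_rate', 'pct_renters'])
```

An empty dictionary would mean every ratio column passed the check.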

### 4. Save Data

In [21]:
df_model_features.to_csv(path_output_processed_data, index=False)