# 04 ACS Feature Engineering

**Project:** NORI  
**Author:** Yuseof J  
**Date:** 23/12/25  

### **Purpose**
Load the raw Census ACS csv, filter for NYC tracts, engineer features, and save the processed data.

### **Inputs**
- `data/raw/census_acs.csv`

### **Outputs**
- `data/processed/model_features_acs.csv`
  
--------------------------------------------------------------------------

### 0. Imports and Setup

In [1]:
# package imports
import os
import pandas as pd
import geopandas as gpd
from pathlib import Path

# specify filepaths
path_acs = 'data/raw/census_acs.csv'
path_nyc_tracts = 'data/processed/nyc_tracts.gpkg'
path_output_processed_data = 'data/processed/model_features_acs.csv'

# ensure cwd is project root for file paths to function properly
project_root = Path(os.getcwd())            # get current directory
while not (project_root / "data").exists(): # keep moving up until in parent
    project_root = project_root.parent
os.chdir(project_root)                      # switch to parent directory
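
The loop above assumes a `data` directory exists somewhere up the tree. If it does not, `project_root.parent` at the filesystem root returns the root itself and the loop never ends. A minimal sketch of a guarded variant follows; `find_project_root` and its `marker` parameter are hypothetical names, not part of the project.

```python
from pathlib import Path


def find_project_root(start, marker="data"):
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    # check the starting directory first, then each ancestor in turn
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    # fail loudly instead of looping forever at the filesystem root
    raise FileNotFoundError(f"no '{marker}' directory found at or above {start}")
```

It could replace the loop with `os.chdir(find_project_root(Path.cwd()))`.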

### 1. Load Data

In [2]:
# census acs
df_acs = pd.read_csv(path_acs)

# nyc tracts
gdf_tracts_nyc = gpd.read_file(path_nyc_tracts, layer="tracts")
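
Tract identifiers are 11-digit FIPS codes, and `pd.read_csv` infers them as integers by default, which drops leading zeros for states whose code starts with 0. New York is 36, so NYC tracts are unaffected, but reading the identifier as text keeps later joins robust. The sketch below uses an inline stand-in for the csv; the `GEOID` column name and the values are assumptions, not taken from `census_acs.csv`.

```python
import io

import pandas as pd

# tiny inline stand-in for the ACS csv; column names and values are made up
sample_csv = io.StringIO(
    "GEOID,median_income\n"
    "36061000100,85000\n"
    "01001020100,52000\n"
)

# read the identifier as text so leading zeros survive
df_sample = pd.read_csv(sample_csv, dtype={"GEOID": str})
```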

### 2. Feature Engineering

### Economic

> Median Household Income

> Poverty Rate

> Unemployment Rate

> Gini Index
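
A minimal sketch of how the four economic features above could be derived, assuming the raw file carries the standard ACS 5-year variable codes shown in the comments. The actual column names in `census_acs.csv` are not shown in this notebook, so the frame `raw`, the helper `safe_rate`, and the output column names are all hypothetical. Rates are computed as count over universe, and the ACS sentinel `-666666666` (used when an estimate cannot be computed) is treated as missing.

```python
import numpy as np
import pandas as pd

# hypothetical raw ACS columns; values are made up for illustration
raw = pd.DataFrame({
    "GEOID": ["36061000100", "36061000201", "36061000202"],
    "B19013_001E": [85000, -666666666, 42000],  # median household income
    "B17001_001E": [1000, 0, 2000],             # poverty status universe
    "B17001_002E": [150, 0, 500],               # income below poverty level
    "B23025_003E": [600, 0, 1200],              # civilian labor force
    "B23025_005E": [30, 0, 120],                # unemployed
    "B19083_001E": [0.45, -666666666, 0.52],    # gini index
})


def safe_rate(numerator, denominator):
    """Return numerator / denominator, with NaN where the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)


features = pd.DataFrame({"GEOID": raw["GEOID"]})
# negative values are sentinels, not real estimates, so mask them to NaN
features["median_household_income"] = raw["B19013_001E"].where(raw["B19013_001E"] >= 0)
features["poverty_rate"] = safe_rate(raw["B17001_002E"], raw["B17001_001E"])
features["unemployment_rate"] = safe_rate(raw["B23025_005E"], raw["B23025_003E"])
features["gini_index"] = raw["B19083_001E"].where(raw["B19083_001E"] >= 0)
```

Tracts with an empty universe (the second row here) come out as NaN rather than raising a division error, which leaves the choice of dropping or imputing them to a later step.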

### 3. Select Features of Interest

The overall ACS data contains many useful columns. For the current project sprint, I'll only be using the following:

In [47]:
# NOTE: this cell is carried over from the CDC PLACES notebook. df_cdc is not
# loaded here (the ACS frame is df_acs), so it needs adapting before it will run.

# ensure matching dtypes for filtering
nyc_tracts_fips = gdf_tracts_nyc.GEOID.astype(int)
df_cdc.TractFIPS = df_cdc.TractFIPS.astype(int)

# filter cdc places for nyc tracts
df_cdc_nyc = df_cdc[df_cdc.TractFIPS.isin(nyc_tracts_fips)]

# report number of matched tracts (the tract frame is gdf_tracts_nyc, as loaded above)
n_matched = df_cdc_nyc.TractFIPS.nunique()
n_total = gdf_tracts_nyc.GEOID.nunique()
percent_matched = int(n_matched / n_total * 100)
print(f"Found places data for {n_matched} / {n_total} ({percent_matched}%) of nyc tracts")

Found places data for 2231 / 2327 (95%) of nyc tracts
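
The report above leaves 96 of the 2327 tracts unmatched. One way to list them is a left merge with `indicator=True`, sketched here on stand-in frames. The frame names `tracts` and `table` and every value in them are made up; only the `GEOID` and `TractFIPS` column names come from the cell above.

```python
import pandas as pd

# stand-ins for the tract list and the tabular data; all values are made up
tracts = pd.DataFrame({"GEOID": [36005000100, 36005000200, 36047000300]})
table = pd.DataFrame({
    "TractFIPS": [36005000100, 36047000300],
    "value": [1.2, 3.4],
})

# a left merge keeps every tract; the _merge column records whether a match was found
merged = tracts.merge(
    table, how="left", left_on="GEOID", right_on="TractFIPS", indicator=True
)
unmatched = merged.loc[merged["_merge"] == "left_only", "GEOID"].tolist()
```

Inspecting `unmatched` can show whether the gaps are concentrated in particular boroughs or are tracts with no population.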


### 4. Save Data

In [None]:
# write to the output path defined in setup (df_cdc_nyc is carried over from the CDC PLACES notebook)
df_cdc_nyc.to_csv(path_output_processed_data, index=False)