
Note: This notebook performs no cleaning in Python; all cleaning was done in OpenRefine. It only documents the workflow and loads the cleaned outputs for verification.

This notebook documents the cleaning steps for our IS477 course project.  

We used **OpenRefine** rather than Python because the workflow relied on facet filters, clustering, and custom transforms, which OpenRefine is designed for.

The full cleaning workflow is documented in `data/README.md`.

Additionally, the OpenRefine operation history files have been exported and included in the repository under `openrefine/`.

Anyone can reproduce our cleaning steps by loading the raw data into OpenRefine and applying these JSON files (Undo / Redo → Apply…).
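
To review the recorded steps without opening OpenRefine, the history files can also be summarized from Python. The sketch below is a minimal example that assumes each exported file is a JSON array of operation objects carrying `op` and `description` keys; that layout, and the helper names `summarize_history` and `summarize_directory`, are assumptions for illustration rather than something this repository guarantees.

```python
import json
from pathlib import Path


def summarize_history(history_json: str) -> list[str]:
    """Return one numbered summary line per operation in a history export.

    Assumes the export is a JSON array of objects; falls back to the
    operation id when an entry has no human-readable description.
    """
    operations = json.loads(history_json)
    return [
        f"{i}. {op.get('description', op.get('op', '<unknown operation>'))}"
        for i, op in enumerate(operations, start=1)
    ]


def summarize_directory(directory: str = "openrefine") -> None:
    """Print the recorded operations for every JSON file in the folder."""
    for path in sorted(Path(directory).glob("*.json")):
        print(path.name)
        for line in summarize_history(path.read_text(encoding="utf-8")):
            print("  " + line)
```

Calling `summarize_directory()` from the repository root would list each file under `openrefine/` followed by its steps in order.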


```python
import pandas as pd

# Load the cleaned OpenRefine exports for a quick visual check.
cdc = pd.read_csv("cdc_cleaned.csv")
usda = pd.read_csv("foodatlas_clean.csv")

cdc.head(), usda.head()
```

```
(  State         County  Obesity_Rate  FIPS
 0    AL     Lauderdale          36.5  1077
 1    AK  Kodiak Island          30.3  2150
 2    CA      Riverside          34.0  6065
 3    CO        El Paso          28.1  8041
 4    AZ         Mohave          36.6  4015,
    FIPS State   County  fast_food_density
 0  1001    AL  Autauga           0.801496
 1  1003    AL  Baldwin           0.750152
 2  1005    AL  Barbour           0.976046
 3  1007    AL     Bibb           0.316227
 4  1009    AL   Blount           0.414658)
```
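
The FIPS values above display with four digits because `pd.read_csv` infers the column as an integer, which drops any leading zero. A minimal sketch of one way to keep the codes as five-character strings follows; `read_with_fips` is a hypothetical helper, and the inline sample only mirrors the first two CDC rows shown above rather than reading the real file.

```python
import io

import pandas as pd


def read_with_fips(source) -> pd.DataFrame:
    """Read a cleaned CSV, keeping FIPS as a zero-padded 5-character string."""
    df = pd.read_csv(source, dtype={"FIPS": str})
    # Pad in case the file itself stores the code without its leading zero.
    df["FIPS"] = df["FIPS"].str.zfill(5)
    return df


# In-memory stand-in for the cleaned CDC file (assumed column layout).
sample = io.StringIO(
    "State,County,Obesity_Rate,FIPS\n"
    "AL,Lauderdale,36.5,1077\n"
    "AK,Kodiak Island,30.3,2150\n"
)
cdc_sample = read_with_fips(sample)
print(cdc_sample["FIPS"].tolist())
```

The same helper accepts a path, for example `read_with_fips("cdc_cleaned.csv")`.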

## Summary of Cleaning Steps

### USDA Food Environment Atlas (Fast-food restaurant density)
- Imported raw CSV into OpenRefine  
- Filtered to keep only variable `FFRPTH20` (# of fast-food restaurants per 1,000 population, 2020)  
- Kept relevant columns (State, County, FIPS, Density value)  
- Removed non-county rows (the national “US” row and territories outside the 50 states)
- Validated FIPS codes (all 5 digits)  
- Cleaned density values (removed rows with `-9999`, blank, or non-numeric values)
- Exported cleaned dataset → `data/cleaned/foodatlas_clean.csv`
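
The USDA rules listed above can be spot-checked on the exported file from Python. The following is a sketch under assumptions: it uses the `FIPS` and `fast_food_density` column names from the `head()` output, and `check_foodatlas` is a hypothetical helper name. It does not replace the OpenRefine steps; it only counts rows that would violate them.

```python
import pandas as pd


def check_foodatlas(df: pd.DataFrame) -> dict:
    """Count rows that violate the USDA cleaning rules."""
    # Compare FIPS as zero-padded text so integer-typed codes still match.
    fips = df["FIPS"].astype(str).str.zfill(5)
    # Non-numeric and blank densities become NaN under errors="coerce".
    density = pd.to_numeric(df["fast_food_density"], errors="coerce")
    return {
        "bad_fips": int((~fips.str.fullmatch(r"\d{5}")).sum()),
        "missing_or_non_numeric_density": int(density.isna().sum()),
        "sentinel_density": int((density == -9999).sum()),
    }


# Deliberately dirty sample: one bad FIPS, one -9999, one non-numeric value.
dirty = pd.DataFrame(
    {
        "FIPS": [1001, 1003, 123456],
        "fast_food_density": [0.80, -9999, "n/a"],
    }
)
print(check_foodatlas(dirty))
```

Running `check_foodatlas(usda)` on the frame loaded earlier should report zero for every count if the export matches the steps described.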

### CDC PLACES Obesity Data
- Imported raw CSV into OpenRefine  
- Filtered to keep only Measure = “Obesity among adults”  
- Kept only “Age-adjusted prevalence”  
- Selected relevant columns (State, Location, FIPS, Value)  
- Removed DC and other non-county rows
- Validated FIPS codes  
- Cleaned obesity rate values (kept only numeric, non-blank values within 0–100)
- Exported cleaned dataset → `data/cleaned/cdc_cleaned.csv`
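
A similar spot check can be sketched for the CDC file. It assumes the `State` and `Obesity_Rate` column names from the `head()` output and that DC would appear under the two-letter code `DC`; `find_cdc_problems` is a hypothetical helper name. It returns the offending rows instead of counts so they can be inspected directly.

```python
import pandas as pd


def find_cdc_problems(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows with an invalid obesity rate or a leftover DC entry."""
    rate = pd.to_numeric(df["Obesity_Rate"], errors="coerce")
    # NaN fails between(), so blanks and non-numeric values are flagged too.
    bad_rate = ~rate.between(0, 100)
    is_dc = df["State"].eq("DC")
    return df[bad_rate | is_dc]


# Deliberately dirty sample: a DC row, an out-of-range rate, and a blank.
dirty_cdc = pd.DataFrame(
    {
        "State": ["AL", "DC", "CA", "CO"],
        "Obesity_Rate": [36.5, 24.0, 120.0, None],
    }
)
print(find_cdc_problems(dirty_cdc))
```

An empty result from `find_cdc_problems(cdc)` would be consistent with the cleaning steps described for this dataset.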