# Data Cleaning: Median Household Income (ACS 5-Year Estimates)

This notebook cleans the Census Bureau's Median Household Income dataset (`ACSST5Y2024.S1903`). The goal is a tidy, numeric dataset for analyzing economic trends across counties.

In [None]:
import pandas as pd

# Define the path to the dataset
file_path = 'ACSST5Y2024.S1903_2026-02-07T134855/ACSST5Y2024.S1903-Data.csv'

# Load the data, skipping the second row (index 1) which contains descriptive labels rather than column IDs.
# This allows us to work directly with standardized Census column codes.
df = pd.read_csv(file_path, skiprows=[1])

print(f"Initial dataset shape: {df.shape}")
df.head()
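Skipping the label row discards the human-readable description of each column code. The sketch below shows one way to keep those descriptions as a lookup before dropping them. It uses a small in-memory sample rather than the real export, and the label strings, the sample row, and the names `sample_csv`, `header_rows`, and `label_lookup` are all illustrative assumptions.

```python
import io
import pandas as pd

# Hypothetical excerpt mimicking the export layout:
# row 0 = column codes, row 1 = descriptive labels, row 2+ = data.
sample_csv = io.StringIO(
    "GEO_ID,NAME,S1903_C03_001E\n"
    "Geography,Geographic Area Name,Median income label (illustrative)\n"
    '0500000US00001,"Example County, Example State",50000\n'
)

# Read only the first two rows, treating neither as a header.
header_rows = pd.read_csv(sample_csv, header=None, nrows=2)

# Map each column code (row 0) to its descriptive label (row 1).
label_lookup = dict(zip(header_rows.iloc[0], header_rows.iloc[1]))

print(label_lookup["S1903_C03_001E"])
```

With the real file, the same two-row read could be pointed at `file_path` to look up what any column code means.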

## 1. Column Selection

We keep three columns: `GEO_ID`, `NAME` (Geographic Area), and `S1903_C03_001E`, which holds the **Median Income Estimate** for all households.

In [None]:
# Select the columns needed for analysis
cols_to_keep = ['GEO_ID', 'NAME', 'S1903_C03_001E']
df_cleaned = df[cols_to_keep].copy()

df_cleaned.head()
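Selecting columns by name raises a `KeyError` if a code is absent, for example when a different table vintage is loaded. As a minimal sketch, the check below reports missing columns before selecting. The frame `df_demo` is a hypothetical stand-in for the loaded export, deliberately missing the income column.

```python
import pandas as pd

# Hypothetical frame standing in for the loaded export;
# the income column is intentionally absent.
df_demo = pd.DataFrame({
    "GEO_ID": ["0500000US00001"],
    "NAME": ["Example County, Example State"],
})

cols_to_keep = ["GEO_ID", "NAME", "S1903_C03_001E"]

# List the expected columns that the frame does not contain.
missing_cols = [col for col in cols_to_keep if col not in df_demo.columns]

if missing_cols:
    print(f"Missing expected columns: {missing_cols}")
else:
    df_selected = df_demo[cols_to_keep].copy()
```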

## 2. Handling Missing Values and Special Characters

Census data often contains non-numeric values: top-coded estimates such as `250,000+`, and placeholder symbols that mark suppressed or unavailable data. Before any numeric analysis, we strip the formatting characters and convert the column to a numeric type.

In [None]:
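# Remove commas and plus signs that interfere with numeric conversion.
# regex=False treats '+' as a literal character; as a regex pattern, a bare '+' is invalid.
# Note: a top-coded value such as '250,000+' becomes its lower bound, 250000.
df_cleaned['S1903_C03_001E'] = (
    df_cleaned['S1903_C03_001E']
    .astype(str)
    .str.replace(',', '', regex=False)
    .str.replace('+', '', regex=False)
)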

# Convert to numeric, forcing errors to NaN for suppressed or missing data
df_cleaned['S1903_C03_001E'] = pd.to_numeric(df_cleaned['S1903_C03_001E'], errors='coerce')

print(f"Missing values after conversion: {df_cleaned['S1903_C03_001E'].isna().sum()}")
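Stripping the `+` sign makes a top-coded value indistinguishable from an exact estimate. The following sketch, on made-up values, records which rows were top-coded before cleaning. The sample strings (including the `-` and `N` placeholders) and the names `raw`, `is_top_coded`, and `result` are assumptions for illustration, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical raw values: a plain number, a top-coded value,
# and two non-numeric placeholders.
raw = pd.Series(["54,321", "250,000+", "-", "N"], name="S1903_C03_001E")

# Record which rows carried a trailing "+" before the sign is stripped.
is_top_coded = raw.str.endswith("+")

# Strip thousands separators and the plus sign as literal characters.
stripped = raw.str.replace(",", "", regex=False).str.replace("+", "", regex=False)

# Non-numeric placeholders become NaN.
income = pd.to_numeric(stripped, errors="coerce")

result = pd.DataFrame({"income": income, "is_top_coded": is_top_coded})
print(result)
```

Keeping the flag as its own column lets later analysis decide whether to exclude top-coded rows or treat them as lower bounds.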

## 3. Renaming and Finalization

To make the dataset readable for other team members, we rename the columns to descriptive names. `GEO_ID` keeps its original name.

In [None]:
# Rename columns for clarity
df_cleaned.rename(columns={
    'NAME': 'County_Area',
    'S1903_C03_001E': 'Median_Household_Income'
}, inplace=True)

# Drop rows with no usable income estimate (use imputation instead if the analysis needs every area)
df_cleaned.dropna(subset=['Median_Household_Income'], inplace=True)

df_cleaned.head()

## 4. Exporting the Cleaned Dataset

Finally, we export the cleaned dataset to a CSV file for use in other parts of the project or for sharing with teammates.

In [None]:
# Export the cleaned data to a CSV file
output_file = 'cleaned_median_household_income.csv'
df_cleaned.to_csv(output_file, index=False)

print(f"Cleaned data exported to: {output_file}")
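A quick round-trip read can confirm that the export preserved the data. The sketch below writes a hypothetical cleaned frame to a temporary directory and reads it back. The frame contents and the file name `cleaned_demo.csv` are illustrative assumptions.

```python
import os
import tempfile
import pandas as pd

# Hypothetical cleaned frame using the notebook's column names.
df_demo = pd.DataFrame({
    "GEO_ID": ["0500000US00001", "0500000US00002"],
    "County_Area": ["Example County A", "Example County B"],
    "Median_Household_Income": [54321.0, 250000.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "cleaned_demo.csv")
    df_demo.to_csv(out_path, index=False)

    # Read the identifier back as text so it is never parsed as a number.
    reloaded = pd.read_csv(out_path, dtype={"GEO_ID": str})

same_shape = reloaded.shape == df_demo.shape
ids_match = reloaded["GEO_ID"].tolist() == df_demo["GEO_ID"].tolist()
print(f"Shape preserved: {same_shape}, identifiers preserved: {ids_match}")
```

Reading `GEO_ID` with an explicit string dtype is a defensive choice for anyone who later joins on that column.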

## Conclusion

The income column is now numeric, rows without a usable estimate have been dropped, and the result is saved to `cleaned_median_household_income.csv`. It is ready to be joined with other datasets such as the Livable Planet metrics. Note that top-coded values such as `250,000+` are stored as their lower bound (250000).