# Data Cleaning: Total Population (ACS 5-Year Estimates)

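This notebook cleans the Census Bureau's ACS 5-year Total Population table (`ACSDT5Y2024.B01003`): it loads the raw export, keeps the columns we need, converts the population estimate to an integer, and writes the result to a CSV file.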

In [None]:
import pandas as pd

# Define the path to the dataset
file_path = 'ACSDT5Y2024.B01003-Data.csv'

# Load the data, skipping the second row of the file (index 1), which holds descriptive labels rather than data.
df = pd.read_csv(file_path, skiprows=[1])

print(f"Initial dataset shape: {df.shape}")
df.head()
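To illustrate why the label row is skipped, here is a minimal sketch on an in-memory CSV. The sample rows, place names, and values are made up for illustration and are not taken from the real file; only the two-header-row layout mirrors the description above.

```python
import io
import pandas as pd

# Hypothetical sample: a header row, a descriptive label row, then data rows.
# All names and numbers here are invented for illustration.
sample_text = (
    'GEO_ID,NAME,B01003_001E\n'
    'Geography,Geographic Area Name,Estimate!!Total\n'
    '0500000US00001,"Example County A, State X",1200\n'
    '0500000US00002,"Example County B, State X",34000\n'
)

# Without skiprows, the label row is parsed as data and the column holds text.
with_labels = pd.read_csv(io.StringIO(sample_text))

# With skiprows=[1], only real data rows remain and the column parses as integers.
without_labels = pd.read_csv(io.StringIO(sample_text), skiprows=[1])

print(with_labels['B01003_001E'].dtype)
print(without_labels['B01003_001E'].dtype)
```

Leaving the label row in would force the estimate column to a text dtype, which is one reason to drop it at load time rather than later.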

## 1. Column Selection

We keep the geographic identifiers (`GEO_ID`, `NAME`) and the total population estimate (`B01003_001E`).

In [None]:
# Select the columns needed for analysis
cols_to_keep = ['GEO_ID', 'NAME', 'B01003_001E']
df_cleaned = df[cols_to_keep].copy()

df_cleaned.head()

## 2. Numeric Sanitization

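We strip formatting characters (commas, plus signs, asterisks) from the population estimate and convert it to a numeric type. Values that cannot be parsed become `NaN`; the cast to integer happens in section 3, after rows with missing values are dropped.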

In [None]:
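# Remove commas, plus signs, and asterisks. regex=False treats each pattern as a literal string:
# '+' and '*' are regex metacharacters, and the default for `regex` differs across pandas versions.
pop = df_cleaned['B01003_001E'].astype(str)
for char in (',', '+', '*'):
    pop = pop.str.replace(char, '', regex=False)
df_cleaned['B01003_001E'] = pop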

# Convert to numeric, forcing errors to NaN
df_cleaned['B01003_001E'] = pd.to_numeric(df_cleaned['B01003_001E'], errors='coerce')

print(f"Missing population values after conversion: {df_cleaned['B01003_001E'].isna().sum()}")
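The sanitization step can be checked in isolation on a handful of values. The sketch below wraps the same idea in a hypothetical helper, `sanitize_estimate` (not part of the notebook), and uses a single character-class regex in place of chained literal replacements. The input values are invented for illustration.

```python
import pandas as pd

def sanitize_estimate(values: pd.Series) -> pd.Series:
    """Strip commas, plus signs, and asterisks, then coerce to numeric.

    Hypothetical helper for illustration; unparseable values become NaN.
    """
    # Inside a character class, ',', '+', and '*' are all matched literally.
    cleaned = values.astype(str).str.replace(r'[,+*]', '', regex=True)
    return pd.to_numeric(cleaned, errors='coerce')

# Made-up inputs covering the formatting cases named above, plus a missing value.
raw = pd.Series(['1,234', '5000+', '*', '42', None])
result = sanitize_estimate(raw)

print(result)
print(f"Missing after conversion: {result.isna().sum()}")
```

A value that consists only of formatting characters, or that is missing to begin with, ends up as `NaN` and is counted by the missing-value check.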

## 3. Renaming and Finalization

We rename the columns to the standard names used in our unified analysis pipeline, drop rows with a missing population value, and cast the estimate to an integer.

In [None]:
# Rename columns for clarity
df_cleaned.rename(columns={
    'NAME': 'County_Area',
    'B01003_001E': 'Total_Population'
}, inplace=True)

# Drop rows with missing population if any exist
df_cleaned.dropna(subset=['Total_Population'], inplace=True)

# Cast to integer since population counts are discrete
df_cleaned['Total_Population'] = df_cleaned['Total_Population'].astype(int)

df_cleaned.head()
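Before exporting, it can help to run a few sanity checks on the finalized frame. The following is a sketch only: `validate_population` is a hypothetical helper, the specific checks are assumptions about what the downstream pipeline expects, and the sample rows are invented.

```python
import pandas as pd

def validate_population(frame: pd.DataFrame) -> list:
    """Return a list of human-readable problems found in the cleaned frame.

    Hypothetical helper; the checks below are illustrative assumptions.
    """
    problems = []
    if frame['GEO_ID'].duplicated().any():
        problems.append('duplicate GEO_ID values')
    if (frame['Total_Population'] < 0).any():
        problems.append('negative population values')
    if not pd.api.types.is_integer_dtype(frame['Total_Population']):
        problems.append('Total_Population is not an integer dtype')
    return problems

# Made-up frames: one clean, one with a repeated identifier and a negative count.
good = pd.DataFrame({
    'GEO_ID': ['0500000US00001', '0500000US00002'],
    'County_Area': ['Example County A', 'Example County B'],
    'Total_Population': [1200, 34000],
})
bad = pd.DataFrame({
    'GEO_ID': ['0500000US00001', '0500000US00001'],
    'County_Area': ['Example County A', 'Example County A'],
    'Total_Population': [1200, -5],
})

print(validate_population(good))
print(validate_population(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to see every issue in a single run.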

## 4. Export

We save the cleaned dataset as a CSV file for downstream integration.

In [None]:
output_file = 'cleaned-population-by-county.csv'
df_cleaned.to_csv(output_file, index=False)

print(f"Cleaned population data exported to: {output_file}")
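A quick round-trip check can confirm that the exported CSV reads back unchanged. This sketch writes to an in-memory buffer instead of a file and uses invented rows; the column names match the renamed columns above.

```python
import io
import pandas as pd

# Made-up stand-in for the cleaned frame; names include commas to exercise CSV quoting.
frame = pd.DataFrame({
    'GEO_ID': ['0500000US00001', '0500000US00002'],
    'County_Area': ['Example County A, State X', 'Example County B, State X'],
    'Total_Population': [1200, 34000],
})

# Write to an in-memory buffer, then read it back.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={'GEO_ID': str})

# Raises AssertionError if values, column order, or dtypes differ.
pd.testing.assert_frame_equal(frame, restored)
print("Round trip OK")
```

Reading `GEO_ID` explicitly as a string is a defensive choice so that identifiers are never reinterpreted as numbers by a downstream reader.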