# Data Cleaning: Race by County (ACS 5-Year Estimates)

This notebook performs basic data cleaning for the Census Bureau's Race/Ethnicity dataset (`ACSDP5Y2024.DP05`). The goal is to prepare a clean, numeric dataset with race percentages by county, following the same approach used for the median household income and AQI datasets.

In [None]:
import pandas as pd

# Define the path to the dataset
file_path = 'ACSDP5Y2024.DP05-Data.csv'

# Load the data, skipping the second row (index 1) which contains descriptive labels.
# This matches the median income cleaning approach for Census ACS data.
df = pd.read_csv(file_path, skiprows=[1], low_memory=False)

print(f"Initial dataset shape: {df.shape}")
df.head()

## 1. Column Selection

We keep `GEO_ID`, `NAME` (Geographic Area), and six race/ethnicity percentage columns, listed here with their Census column codes:
- % Hispanic or Latino (`DP05_0090PE`)
- % White alone (`DP05_0096PE`)
- % Black or African American alone (`DP05_0097PE`)
- % American Indian and Alaska Native alone (`DP05_0098PE`)
- % Asian alone (`DP05_0099PE`)
- % Two or More Races (`DP05_0102PE`)
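
Because DP05 column codes can differ between ACS releases, it may be worth checking each code against the descriptive label row that the load step skips. The sketch below assumes that row sits directly under the header, as the loading cell implies. `label_lookup` is a hypothetical helper, and the sample CSV is made up for illustration rather than copied from the real file.

```python
import io
import pandas as pd

def label_lookup(csv_source, codes):
    """Map each column code to the text found in the descriptive label row."""
    # With the default header handling, the first data row is the label row
    labels = pd.read_csv(csv_source, nrows=1, dtype=str).iloc[0]
    # Series.get returns None for a code that is not a column in the file
    return {code: labels.get(code) for code in codes}

# Made-up sample mimicking the two-header-row layout (not real Census content)
sample = io.StringIO(
    "GEO_ID,NAME,CODE_A\n"
    "Geography,Geographic Area Name,Percent!!Example label\n"
    "0500000US00000,Example County,12.3\n"
)
lookup = label_lookup(sample, ["CODE_A", "CODE_MISSING"])
print(lookup)
```

With the real file, `label_lookup(file_path, race_cols)` would return the label text for each of the six codes. A `None` value would mean the code is absent from the file.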

In [None]:
# Census DP05 percent columns for the six race categories (from HISPANIC OR LATINO AND RACE section)
race_cols = {
    'DP05_0090PE': 'Pct_Hispanic_or_Latino',
    'DP05_0096PE': 'Pct_White_alone',
    'DP05_0097PE': 'Pct_Black_or_African_American_alone',
    'DP05_0098PE': 'Pct_American_Indian_and_Alaska_Native_alone',
    'DP05_0099PE': 'Pct_Asian_alone',
    'DP05_0102PE': 'Pct_Two_or_More_Races'
}

cols_to_keep = ['GEO_ID', 'NAME'] + list(race_cols.keys())
df_cleaned = df[cols_to_keep].copy()

# Keep county-level rows only: county GEO_IDs start with the prefix '0500000US'
df_cleaned = df_cleaned[df_cleaned['GEO_ID'].astype(str).str.startswith('0500000US')].copy()

df_cleaned.head()

## 2. Handling Missing and Special Characters

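Census tables use annotation symbols such as `*****`, `(X)`, `-`, and `*` in place of estimates that are suppressed, not applicable, or otherwise unavailable. We strip these symbols and convert each column to a numeric type, so that unavailable values become `NaN`.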
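
The effect of the conversion can be sketched on a small made-up Series. This is a minimal illustration under stated assumptions, not the notebook's own cell: `to_numeric_pct` is a hypothetical helper that removes thousands separators and then lets `pd.to_numeric(..., errors="coerce")` turn anything non-numeric into `NaN`, instead of stripping each symbol individually.

```python
import pandas as pd

def to_numeric_pct(values: pd.Series) -> pd.Series:
    """Convert Census estimate strings to floats; non-numeric entries become NaN."""
    # Remove thousands separators as literal text, then trim whitespace
    text = values.astype(str).str.replace(",", "", regex=False).str.strip()
    # Annotation symbols cannot be parsed as numbers, so coercion yields NaN
    return pd.to_numeric(text, errors="coerce")

# Made-up values covering the annotation symbols mentioned above
sample = pd.Series(["12.5", "*****", "(X)", "-", "1,234.0", None])
result = to_numeric_pct(sample)
print(result)
```

One design difference is that coercion never alters a value that is already numeric, whereas removing every `-` would also strip the sign from a negative number.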

In [None]:
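# Clean each race column: strip Census annotation symbols, then convert to numeric.
# regex=False treats '*****' and '(X)' as literal text; pandas versions before 2.0
# default to regex=True, where '*****' is not a valid pattern.
for col in race_cols:
    values = df_cleaned[col].astype(str).str.replace(',', '', regex=False)
    for symbol in ('*****', '(X)', '-', '*'):
        values = values.str.replace(symbol, '', regex=False)
    # Anything still non-numeric (including empty strings) becomes NaN
    df_cleaned[col] = pd.to_numeric(values.str.strip(), errors='coerce')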

print("Missing values per column after conversion:")
print(df_cleaned[list(race_cols.keys())].isna().sum())

## 3. Renaming and Finalization

Rename columns to human-readable names and drop rows with missing race percentages.
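
Before dropping rows, it can help to see which counties would be lost. The sketch below uses a made-up three-row frame standing in for `df_cleaned`; the county names and values are placeholders, and only two of the six percentage columns are shown.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for df_cleaned after the first rename
demo = pd.DataFrame({
    "County_Area": ["County A", "County B", "County C"],
    "Pct_White_alone": [60.2, np.nan, 45.0],
    "Pct_Asian_alone": [5.1, 3.3, np.nan],
})
pct_cols = ["Pct_White_alone", "Pct_Asian_alone"]

# Rows with at least one missing percentage: these are the rows dropna removes
incomplete = demo[demo[pct_cols].isna().any(axis=1)]
print("Rows to be dropped:", incomplete["County_Area"].tolist())

kept = demo.dropna(subset=pct_cols)
print(f"Row count: {len(demo)} -> {len(kept)}")
```

If the list of dropped counties is long, that may be a sign that a column code or a cleaning step needs another look.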

In [None]:
df_cleaned.rename(columns={
    'NAME': 'County_Area',
    **race_cols
}, inplace=True)

# Drop rows where any race percentage is missing
df_cleaned.dropna(subset=list(race_cols.values()), inplace=True)

# Switch to the "% X" display names used in the exported file
df_cleaned.rename(columns={
    'Pct_Hispanic_or_Latino': '% Hispanic or Latino',
    'Pct_White_alone': '% White alone',
    'Pct_Black_or_African_American_alone': '% Black or African American alone',
    'Pct_Asian_alone': '% Asian alone',
    'Pct_American_Indian_and_Alaska_Native_alone': '% American Indian and Alaska Native alone',
    'Pct_Two_or_More_Races': '% Two or More Races'
}, inplace=True)

df_cleaned.head()

## 4. Exporting the Cleaned Dataset

Export the cleaned dataset to CSV for use in other analyses.

In [None]:
output_file = 'cleaned_race_by_county.csv'
df_cleaned.to_csv(output_file, index=False)

print(f"Cleaned data exported to: {output_file}")
print(f"Final shape: {df_cleaned.shape}")

## Conclusion

The data is now clean, numeric, and exported to `cleaned_race_by_county.csv`. It contains county-level race percentages for the six specified categories and is ready for integration with other datasets such as AQI and median household income.
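
For the integration step, one possible join key is the county FIPS code: a county `GEO_ID` is the `0500000US` prefix followed by a five-digit state-plus-county FIPS code. The sketch below is a minimal example under assumptions: the second frame, its `FIPS` and `value` columns, and all numeric values are hypothetical placeholders, not the actual AQI or income data.

```python
import pandas as pd

GEO_PREFIX = "0500000US"  # county-level prefix used in the filter step

race = pd.DataFrame({
    "GEO_ID": ["0500000US01001", "0500000US06037"],
    "% Asian alone": [1.0, 2.0],  # placeholder values, not real estimates
})
# Hypothetical second dataset keyed by a five-digit FIPS string
other = pd.DataFrame({"FIPS": ["01001", "06037"], "value": [10, 20]})

# Slice off the prefix; keep FIPS as a string so leading zeros survive
race["FIPS"] = race["GEO_ID"].str[len(GEO_PREFIX):]

# validate raises an error if either side has duplicate FIPS keys
merged = race.merge(other, on="FIPS", how="left", validate="one_to_one")
print(merged)
```

Keeping the key as a string matters because codes such as `01001` would lose their leading zero if parsed as integers.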