# Data Transformation: Calculating Population Density

This notebook documents the process of combining total population estimates with geographic land area data to derive **population density** (residents per square mile) at the county level. This metric is a key indicator for environmental impact studies.

In [None]:
import pandas as pd

# File Paths
POPULATION_PATH = 'cleaned-population-by-county.csv'
LAND_AREA_PATH = 'GEOINFO2023.GEOINFO-2026-02-07T233836.csv'
OUTPUT_PATH = 'cleaned-population-density-by-county.csv'

def load_data():
    pop_df = pd.read_csv(POPULATION_PATH)
    area_df = pd.read_csv(LAND_AREA_PATH)
    return pop_df, area_df

pop_df, area_df = load_data()

print(f"Population Records: {pop_df.shape[0]}")
print(f"Land Area Records: {area_df.shape[0]}")
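
Before cleaning, it can help to confirm that both files carry the columns the later cells rely on. The sketch below is one possible check, not part of the original workflow: the helper `missing_columns` and the `demo` frame are hypothetical, while the column names are the ones used elsewhere in this notebook.

```python
import pandas as pd

# Column names taken from the cells in this notebook.
REQUIRED_POP_COLS = {"County_Area", "Total_Population"}
REQUIRED_AREA_COLS = {
    "Geographic Area Name (NAME)",
    "Area (Land, in square miles) (AREALAND_SQMI)",
}

def missing_columns(df: pd.DataFrame, required: set) -> list:
    """Return the required column names that are absent from df, sorted."""
    return sorted(required - set(df.columns))

# Hypothetical frame standing in for pop_df; it lacks Total_Population.
demo = pd.DataFrame({"County_Area": ["A County"], "Population": [1000]})
print(missing_columns(demo, REQUIRED_POP_COLS))
```

In the notebook itself, the same helper could be called on `pop_df` and `area_df`, raising an error if either list is non-empty.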

## 1. Cleaning Land Area Data

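We select the two required columns from the geographic info dataset: the location name (`NAME`) and the land area in square miles (`AREALAND_SQMI`). The land area is then converted to a numeric type: thousands separators are stripped, values that still fail to parse become `NaN`, and rows without a usable area are dropped.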

In [None]:
# Select key columns
area_df_cleaned = area_df[['Geographic Area Name (NAME)', 'Area (Land, in square miles) (AREALAND_SQMI)']].copy()

# Rename for consistency with population dataset
area_df_cleaned.rename(columns={
    'Geographic Area Name (NAME)': 'County_Area',
    'Area (Land, in square miles) (AREALAND_SQMI)': 'Land_Area_SqMi'
}, inplace=True)

# Sanitize numeric values (remove commas)
area_df_cleaned['Land_Area_SqMi'] = area_df_cleaned['Land_Area_SqMi'].astype(str).str.replace(',', '')
area_df_cleaned['Land_Area_SqMi'] = pd.to_numeric(area_df_cleaned['Land_Area_SqMi'], errors='coerce')

# Drop records with missing area data
area_df_cleaned.dropna(subset=['Land_Area_SqMi'], inplace=True)

area_df_cleaned.head()
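
To illustrate what the sanitizing step does, here is a minimal sketch on made-up values. The sample rows are hypothetical and only mirror the kinds of strings the cell above is written to handle (a comma-formatted number, a plain number, and an unparseable entry).

```python
import pandas as pd

# Hypothetical sample values; the real column comes from the GEOINFO file.
sample = pd.DataFrame({
    "County_Area": ["A County", "B County", "C County"],
    "Land_Area_SqMi": ["1,234.5", "87.2", "N/A"],
})

# Strip thousands separators, then coerce anything non-numeric to NaN.
sample["Land_Area_SqMi"] = pd.to_numeric(
    sample["Land_Area_SqMi"].astype(str).str.replace(",", "", regex=False),
    errors="coerce",
)

# Rows whose area could not be parsed are dropped, as in the cell above.
cleaned_sample = sample.dropna(subset=["Land_Area_SqMi"])
print(cleaned_sample)
```

Passing `regex=False` makes the comma a literal match regardless of the pandas version's default for `str.replace`.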

## 2. Merging and Calculating Density

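We perform an inner join on the shared `County_Area` name column and calculate the density as `Total_Population / Land_Area_SqMi`. Because the join key is a name string rather than a numeric code, any county whose name is not spelled identically in both datasets is dropped from the result, so the joined record count printed below is worth comparing against the input counts.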

In [None]:
# Merge datasets
merged_df = pd.merge(pop_df, area_df_cleaned, on='County_Area', how='inner')

# Calculate Density
merged_df['population_density'] = merged_df['Total_Population'] / merged_df['Land_Area_SqMi']

print(f"Successfully joined {merged_df.shape[0]} records.")
merged_df[['County_Area', 'Total_Population', 'Land_Area_SqMi', 'population_density']].head(10)
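
An inner join drops unmatched rows silently, and a land area of zero would produce an infinite density. The sketch below shows one way to audit both cases. It is an optional check under stated assumptions, not the notebook's method: `pop_demo` and `area_demo` are hypothetical stand-ins for `pop_df` and `area_df_cleaned`, and treating zero area as missing is a choice made here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for pop_df and area_df_cleaned.
pop_demo = pd.DataFrame({
    "County_Area": ["A County", "B County", "C County"],
    "Total_Population": [1000, 500, 250],
})
area_demo = pd.DataFrame({
    "County_Area": ["A County", "B County", "D County"],
    "Land_Area_SqMi": [10.0, 0.0, 5.0],
})

# An outer join with indicator=True labels each row's origin, which
# exposes the keys an inner join would silently drop.
audit = pd.merge(pop_demo, area_demo, on="County_Area", how="outer", indicator=True)
unmatched = sorted(audit.loc[audit["_merge"] != "both", "County_Area"])
print(f"Unmatched keys: {unmatched}")

# Keep matched rows only, then treat zero land area as missing so the
# division yields NaN instead of inf.
matched = audit[audit["_merge"] == "both"].drop(columns="_merge").copy()
matched["population_density"] = (
    matched["Total_Population"] / matched["Land_Area_SqMi"].replace(0, np.nan)
)
print(matched)
```

Whether zero-area rows should become `NaN`, be dropped, or be investigated upstream depends on how the downstream environmental analysis handles missing values.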

## 3. Exporting Results

The final dataset including population density is saved for integration with environmental indicators.

In [None]:
merged_df.to_csv(OUTPUT_PATH, index=False)
print(f"Cleaned density data exported to: {OUTPUT_PATH}")