# ECON 0150 | Replication Notebook

**Title:** Voter Turnout and Presidential Margins

**Original Authors:** Brennfleck, Jones, Kachalova

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis.

## About This Replication

**Research Question:** What is the relationship between county-level voter turnout and the margin of victory in U.S. presidential elections?

**Data Source:** County presidential election results (2000-2024) merged with population estimates

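**Methods:** OLS regression of vote margin (Democratic minus Republican votes, as a percentage of total votes) on turnout (total votes as a percentage of county population)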

**Main Finding:** Higher voter turnout is associated with slightly more Republican-leaning results. Each 1 percentage point increase in turnout is associated with a margin that is 0.13 percentage points more Republican (p < 0.001).

**Course Concepts Used:**
- Merging datasets
- Creating new variables (turnout percentage, margin percentage)
- OLS regression
- Residual analysis

---
## Step 0 | Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

In [None]:
# Load data from course website
base_url = 'https://tayweid.github.io/econ-0150/projects/replications/0008/data/'

pres_data = pd.read_csv(base_url + 'countypres_2000-2024.csv')
pop_data = pd.read_csv(base_url + 'Population%20Estimates%20-%20US,%20States,%20Counties.csv')

print(f"Presidential data: {len(pres_data)} rows")
print(f"Population data: {len(pop_data)} rows")

---
## Step 1 | Data Preparation

In [None]:
# Filter to major parties only
major_parties = pres_data[(pres_data['office'] == 'US PRESIDENT') & 
                           (pres_data['party'].isin(['DEMOCRAT', 'REPUBLICAN']))].copy()

# Group by county, year, and party
party_votes = major_parties.groupby(['year', 'state_po', 'county_name', 'party'])['candidatevotes'].sum().reset_index()

# Pivot to get Democrat and Republican columns
pivoted = party_votes.pivot_table(
    index=['year', 'state_po', 'county_name'], 
    columns='party', 
    values='candidatevotes'
).reset_index()

pivoted = pivoted.rename(columns={'state_po': 'state'})
pivoted = pivoted.fillna(0)

# Calculate vote margin (positive = Democratic lead)
pivoted['Vote_Margin'] = pivoted['DEMOCRAT'] - pivoted['REPUBLICAN']

print(f"Pivoted data: {len(pivoted)} county-year observations")
pivoted.head()

In [None]:
# Get total votes per county-year
total_votes = pres_data[['year', 'state_po', 'county_name', 'totalvotes']].drop_duplicates()
total_votes = total_votes.rename(columns={'state_po': 'state'})

# Merge with pivoted data
election_data = pd.merge(pivoted, total_votes, on=['year', 'state', 'county_name'])

print(f"Election data with total votes: {len(election_data)} rows")

In [None]:
# Prepare population data
# State abbreviation mapping
state_map = {
    'ALABAMA': 'AL', 'ALASKA': 'AK', 'ARIZONA': 'AZ', 'ARKANSAS': 'AR', 'CALIFORNIA': 'CA',
    'COLORADO': 'CO', 'CONNECTICUT': 'CT', 'DELAWARE': 'DE', 'FLORIDA': 'FL', 'GEORGIA': 'GA',
    'HAWAII': 'HI', 'IDAHO': 'ID', 'ILLINOIS': 'IL', 'INDIANA': 'IN', 'IOWA': 'IA',
    'KANSAS': 'KS', 'KENTUCKY': 'KY', 'LOUISIANA': 'LA', 'MAINE': 'ME', 'MARYLAND': 'MD',
    'MASSACHUSETTS': 'MA', 'MICHIGAN': 'MI', 'MINNESOTA': 'MN', 'MISSISSIPPI': 'MS', 'MISSOURI': 'MO',
    'MONTANA': 'MT', 'NEBRASKA': 'NE', 'NEVADA': 'NV', 'NEW HAMPSHIRE': 'NH', 'NEW JERSEY': 'NJ',
    'NEW MEXICO': 'NM', 'NEW YORK': 'NY', 'NORTH CAROLINA': 'NC', 'NORTH DAKOTA': 'ND', 'OHIO': 'OH',
    'OKLAHOMA': 'OK', 'OREGON': 'OR', 'PENNSYLVANIA': 'PA', 'RHODE ISLAND': 'RI', 'SOUTH CAROLINA': 'SC',
    'SOUTH DAKOTA': 'SD', 'TENNESSEE': 'TN', 'TEXAS': 'TX', 'UTAH': 'UT', 'VERMONT': 'VT',
    'VIRGINIA': 'VA', 'WASHINGTON': 'WA', 'WEST VIRGINIA': 'WV', 'WISCONSIN': 'WI', 'WYOMING': 'WY'
}

# Filter county population estimates
pop_counties = pop_data[(pop_data['Count or Estimate'] == 'Estimate') & 
                         (pop_data['State or County Release'] == 'County')].copy()

# Extract county name and state
pop_counties['county_name'] = pop_counties['Description'].apply(
    lambda x: x.split(',')[0].strip().replace(' County', '').upper() if pd.notna(x) else ''
)
pop_counties['state'] = pop_counties['Description'].apply(
    lambda x: state_map.get(x.split(',')[-1].strip().upper(), '') if pd.notna(x) else ''
)

pop_counties = pop_counties.rename(columns={'Year': 'year', 'Population': 'population'})
pop_counties = pop_counties[(pop_counties['year'] >= 2000) & (pop_counties['year'] <= 2024)]

# Aggregate population by county-year
pop_agg = pop_counties.groupby(['year', 'county_name', 'state'])['population'].mean().reset_index()

print(f"Population data prepared: {len(pop_agg)} county-year observations")

In [None]:
# Merge election and population data
data = pd.merge(election_data, pop_agg, on=['year', 'state', 'county_name'], how='left')

# Calculate turnout percentage
data['turnout_percentage'] = np.where(
    data['population'] > 0,
    (data['totalvotes'] / data['population']) * 100,
    np.nan
)

# Clip turnout to reasonable range (0-100)
data['turnout_percentage'] = data['turnout_percentage'].clip(0, 100)

# Calculate vote margin percentage
data['Vote_Margin_Percentage'] = np.where(
    data['totalvotes'] > 0,
    (data['Vote_Margin'] / data['totalvotes']) * 100,
    np.nan
)

# Drop missing values
data = data.dropna(subset=['turnout_percentage', 'Vote_Margin_Percentage'])

print(f"Final dataset: {len(data)} county-year observations")
data.head()

---
## Step 2 | Data Exploration

In [None]:
# Summary statistics
data[['turnout_percentage', 'Vote_Margin_Percentage']].describe()

In [None]:
# Summary by year
data.groupby('year')[['turnout_percentage', 'Vote_Margin_Percentage']].mean()

---
## Step 3 | Visualization

In [None]:
# Scatter plot: Turnout vs Vote Margin, colored by year
plt.figure(figsize=(12, 8))
sns.scatterplot(data=data, x='turnout_percentage', y='Vote_Margin_Percentage', 
                hue='year', palette='viridis', alpha=0.5, s=10)
sns.regplot(data=data, x='turnout_percentage', y='Vote_Margin_Percentage', 
            scatter=False, color='blue')

plt.axhline(0, color='red', linestyle='--', linewidth=0.8)
plt.xlabel('Voter Turnout Percentage (%)')
plt.ylabel('Democratic Lead (D-R) Percentage (%)')
plt.title('Voter Turnout vs Democratic Lead by County')
plt.xlim(0, 100)
plt.ylim(-100, 100)
plt.legend(title='Election Year', bbox_to_anchor=(1.05, 1))
plt.tight_layout()
plt.show()

---
## Step 4 | Statistical Analysis

In [None]:
# OLS regression: Vote margin ~ Turnout
model = smf.ols('Vote_Margin_Percentage ~ turnout_percentage', data=data).fit()
print(model.summary().tables[1])
print(f"\nR-squared: {model.rsquared:.4f}")

In [None]:
# Residual plot: fitted values and residuals taken from the model above
# (both are index-aligned with the rows used in the regression)
data['predicted'] = model.fittedvalues
data['residuals'] = model.resid

plt.figure(figsize=(12, 6))
sns.scatterplot(data=data, x='predicted', y='residuals', hue='year', palette='viridis', alpha=0.5, s=10)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Democratic Lead (%)')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.legend(title='Year', bbox_to_anchor=(1.05, 1))
plt.tight_layout()
plt.show()

In [None]:
# Histogram of residuals
plt.figure(figsize=(10, 6))
sns.histplot(data['residuals'], bins=50, kde=True)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.show()

---
## Step 5 | Results Interpretation

### Key Findings

**Regression Results:**
- **Intercept:** -18.97 (p < 0.001)
- **Turnout coefficient:** -0.13 (p < 0.001)
- **R-squared:** ~0.003 (very low)

### Interpretation

The negative coefficient on turnout indicates that higher voter turnout is associated with slightly more Republican outcomes (more negative Democratic margin). However:

1. **Very small effect size:** A 10 percentage point increase in turnout is associated with a margin that is only 1.3 points more Republican
2. **Very low R²:** Turnout explains less than 1% of variation in vote margins
3. **Other factors likely dominate:** County characteristics (urban/rural, demographics, etc.) are probably much stronger predictors, although this notebook does not test them
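
To make the effect size concrete, the reported estimates can be plugged into the fitted line: predicted margin = intercept + slope × turnout. The sketch below uses the rounded values listed under Regression Results. The turnout levels of 40% and 50% are assumptions chosen for illustration, not values taken from the data.

```python
# Rounded estimates reported in the regression results above
intercept = -18.97   # predicted Democratic lead (points) at 0% turnout
slope = -0.13        # change in Democratic lead per 1 point of turnout


def predicted_margin(turnout_pct):
    """Predicted Democratic lead (D - R, % of total votes) from the fitted line."""
    return intercept + slope * turnout_pct


# Illustrative turnout levels (assumed for this example only)
low, high = 40.0, 50.0
margin_low = predicted_margin(low)
margin_high = predicted_margin(high)
change = margin_high - margin_low

print(f"Turnout {low:.0f}%: predicted margin {margin_low:.2f}")
print(f"Turnout {high:.0f}%: predicted margin {margin_high:.2f}")
print(f"Change for a 10-point increase in turnout: {change:.2f}")
```

Both predictions are well below zero, and the 10-point difference in turnout moves the predicted margin by only 1.3 points. This is the sense in which the effect is small relative to the intercept.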

### Caveats

- Correlation does not imply causation
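- Turnout is computed as total votes divided by total county population, not the voting-eligible population, so it understates turnout among eligible voters; values are also clipped to 0-100, and errors in the population estimates or in county-name matching carry through to the turnout measure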
- County composition changed over the 2000-2024 period, so pooling all years mixes differences between counties with changes over time
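
One way to gauge how much the turnout measure depends on the merge is to count election rows that found no population match, and to look at raw turnout before clipping. The sketch below does this on two small made-up tables that stand in for `election_data` and `pop_agg`. All values are invented for illustration, and `election_toy`, `pop_toy`, and `checked` are hypothetical names.

```python
import pandas as pd

# Toy stand-ins for election_data and pop_agg (all values are made up)
election_toy = pd.DataFrame({
    'year': [2020, 2020, 2020, 2020],
    'state': ['PA', 'PA', 'LA', 'TX'],
    'county_name': ['ALLEGHENY', 'ERIE', 'ORLEANS', 'LOVING'],
    'totalvotes': [700000, 130000, 170000, 90],
})
pop_toy = pd.DataFrame({
    'year': [2020, 2020, 2020],
    'state': ['PA', 'PA', 'TX'],
    'county_name': ['ALLEGHENY', 'ERIE', 'LOVING'],
    'population': [1250000, 270000, 60],
})

# indicator=True adds a _merge column recording whether each row found a match
checked = pd.merge(election_toy, pop_toy, on=['year', 'state', 'county_name'],
                   how='left', indicator=True)

# Rows present only in the election table have no population figure
unmatched = checked[checked['_merge'] == 'left_only']
share_unmatched = len(unmatched) / len(checked)

# Raw turnout before any clipping; values above 100 point to a data problem
checked['raw_turnout'] = checked['totalvotes'] / checked['population'] * 100
n_over_100 = int((checked['raw_turnout'] > 100).sum())

print(f"Unmatched county-years: {share_unmatched:.0%}")
print(unmatched[['year', 'state', 'county_name']])
print(f"County-years with raw turnout above 100%: {n_over_100}")
```

Applied to the real tables, a large unmatched share or many raw values above 100% would suggest that rows dropped by `dropna` or altered by `clip` are not a random subset of counties.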

---
## Replication Exercises

### Exercise 1: Year-by-Year Analysis
Run separate regressions for each election year. Does the relationship change over time?
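
A possible starting point, shown on synthetic data so it runs on its own: loop over the groups produced by `groupby('year')`, fit the same formula to each, and collect the slopes in one table. The synthetic `toy` frame, its two years, and its slopes are assumptions for illustration; in the notebook the loop would run over `data` instead.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for `data`: two years with different true slopes (assumed)
rng = np.random.default_rng(0)
frames = []
for year, true_slope in [(2016, -0.1), (2020, -0.3)]:
    turnout = rng.uniform(20, 80, size=200)
    margin = -15 + true_slope * turnout + rng.normal(0, 5, size=200)
    frames.append(pd.DataFrame({
        'year': year,
        'turnout_percentage': turnout,
        'Vote_Margin_Percentage': margin,
    }))
toy = pd.concat(frames, ignore_index=True)

# Fit the same regression separately within each election year
rows = []
for year, group in toy.groupby('year'):
    fit = smf.ols('Vote_Margin_Percentage ~ turnout_percentage', data=group).fit()
    rows.append({
        'year': year,
        'slope': fit.params['turnout_percentage'],
        'p_value': fit.pvalues['turnout_percentage'],
        'r_squared': fit.rsquared,
        'n': int(fit.nobs),
    })

by_year = pd.DataFrame(rows).set_index('year')
print(by_year.round(3))
```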

### Exercise 2: Swing States
Filter to swing states only (PA, MI, WI, etc.). Is the relationship different in competitive states?
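
A minimal sketch of the filtering step, again on synthetic data. The `swing_states` list contains only the three states named above and is deliberately incomplete; which states count as competitive is an assumption you should set yourself. The `toy` frame and its values are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for `data` (values invented for illustration)
rng = np.random.default_rng(1)
n = 600
toy = pd.DataFrame({
    'state': rng.choice(['PA', 'MI', 'WI', 'CA', 'TX', 'NY'], size=n),
    'turnout_percentage': rng.uniform(20, 80, size=n),
})
toy['Vote_Margin_Percentage'] = (
    -15 - 0.1 * toy['turnout_percentage'] + rng.normal(0, 10, size=n)
)

# Assumed, incomplete list: only the states named in the exercise text
swing_states = ['PA', 'MI', 'WI']
is_swing = toy['state'].isin(swing_states)

# Fit the same model on each subset and compare the turnout slopes
formula = 'Vote_Margin_Percentage ~ turnout_percentage'
swing_fit = smf.ols(formula, data=toy[is_swing]).fit()
other_fit = smf.ols(formula, data=toy[~is_swing]).fit()

print(f"Swing states: slope {swing_fit.params['turnout_percentage']:.3f}, "
      f"n = {int(swing_fit.nobs)}")
print(f"Other states: slope {other_fit.params['turnout_percentage']:.3f}, "
      f"n = {int(other_fit.nobs)}")
```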

### Exercise 3: Urban vs Rural
Split counties by population size. Is the relationship between turnout and margins different in urban vs rural areas?
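
Instead of fitting two separate models, one option is a single regression with an interaction term, so the difference in slopes gets its own coefficient and p-value. The sketch below uses a median population split as a rough urban/rural proxy. That cutoff is an assumption, not an official classification, and the synthetic `toy` data is built so that the two groups have different slopes.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for `data` (values invented for illustration)
rng = np.random.default_rng(2)
n = 400
toy = pd.DataFrame({
    'population': rng.lognormal(mean=10, sigma=1.5, size=n),
    'turnout_percentage': rng.uniform(20, 80, size=n),
})

# Assumed proxy: counties above the median population are labelled "urban"
toy['urban'] = (toy['population'] > toy['population'].median()).astype(int)

# Build margins with a different turnout slope in each group
toy['Vote_Margin_Percentage'] = (
    -20 + 15 * toy['urban']
    + (-0.1 + 0.3 * toy['urban']) * toy['turnout_percentage']
    + rng.normal(0, 5, size=n)
)

# `a * b` in a formula expands to a + b + a:b (main effects plus interaction)
fit = smf.ols('Vote_Margin_Percentage ~ turnout_percentage * urban', data=toy).fit()

rural_slope = fit.params['turnout_percentage']
slope_gap = fit.params['turnout_percentage:urban']
print(f"Slope in smaller counties: {rural_slope:.3f}")
print(f"Extra slope in larger counties: {slope_gap:.3f} "
      f"(p = {fit.pvalues['turnout_percentage:urban']:.3g})")
```

The interaction coefficient is the estimated difference between the two slopes, so the slope for larger counties is `rural_slope + slope_gap`.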

### Challenge Exercise
Add year fixed effects to the regression. How does this change the interpretation?
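
In a statsmodels formula, wrapping a column in `C(...)` treats it as categorical, which adds one indicator per year and so gives each election its own intercept. The sketch below compares the pooled model with the fixed-effects version on synthetic data. That data is deliberately constructed so the two slopes differ in sign, which shows the mechanics only and says nothing about the real data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for `data`: turnout rises across years while the
# year-specific intercept falls; within each year the true slope is +0.2
rng = np.random.default_rng(3)
frames = []
for year, mean_turnout, year_intercept in [(2012, 40, -10), (2016, 45, -15), (2020, 55, -25)]:
    turnout = rng.normal(mean_turnout, 5, size=200)
    margin = year_intercept + 0.2 * turnout + rng.normal(0, 3, size=200)
    frames.append(pd.DataFrame({
        'year': year,
        'turnout_percentage': turnout,
        'Vote_Margin_Percentage': margin,
    }))
toy = pd.concat(frames, ignore_index=True)

# Pooled model: one intercept for all years
pooled = smf.ols('Vote_Margin_Percentage ~ turnout_percentage', data=toy).fit()

# Year fixed effects: C(year) adds a separate intercept shift for each year
fe = smf.ols('Vote_Margin_Percentage ~ turnout_percentage + C(year)', data=toy).fit()

pooled_slope = pooled.params['turnout_percentage']
fe_slope = fe.params['turnout_percentage']
print(f"Pooled slope:        {pooled_slope:.3f}")
print(f"Slope with year FEs: {fe_slope:.3f}")
```

With year fixed effects, the turnout coefficient compares counties within the same election year, so nationwide shifts in turnout or margins from one election to the next no longer drive the estimate.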

In [None]:
# Your code for exercises
