# ECON 0150 | Replication Notebook

**Title:** Income and PA Voter Turnout 2024

**Original Authors:** Sophia C

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis.

## About This Replication

**Research Question:** Did income influence voter turnout in Pennsylvania during the 2024 Presidential Election?

**Data Source:** Pennsylvania county-level income and 2024 voter turnout data (66 counties)

**Methods:** OLS regression: Voter_Turnout ~ Income

**Main Finding:** Significant positive relationship between income and voter turnout (p < 0.001, R² = 0.265).
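
Written out, the model in the Methods line is a simple linear regression across counties $i$, and the reported p-value tests whether the slope is zero. The symbols $\beta_0$, $\beta_1$, and $\varepsilon_i$ are notation introduced here for illustration; they do not appear in the original project.

$$
\text{Voter\_Turnout}_i = \beta_0 + \beta_1\,\text{Income}_i + \varepsilon_i,
\qquad H_0:\ \beta_1 = 0 \quad \text{vs.} \quad H_1:\ \beta_1 \neq 0
$$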

**Course Concepts Used:**
- Simple linear regression
- Scatter plots with regression lines
- Hypothesis testing
- Cross-sectional analysis

---
## Step 0 | Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

In [None]:
# Load data from course website (pd.read_excel needs the openpyxl package for .xlsx files)
base_url = 'https://tayweid.github.io/econ-0150/projects/replications/0066/data/'

data = pd.read_excel(base_url + 'final.xlsx')

print(f"Number of PA counties: {len(data)}")
data.head()

---
## Step 1 | Data Preparation

In [None]:
# Check columns
print("Columns:", data.columns.tolist())
print(f"\nData shape: {data.shape}")

In [None]:
# Rename columns for clarity
data = data.rename(columns={
    'income': 'Income',
    'voter turnout': 'Voter_Turnout'
})

# Convert turnout to percentage if needed
if data['Voter_Turnout'].max() <= 1:
    data['Voter_Turnout_Pct'] = data['Voter_Turnout'] * 100
else:
    data['Voter_Turnout_Pct'] = data['Voter_Turnout']

# Convert income to thousands
data['Income_Thousands'] = data['Income'] / 1000

print(f"\nCleaned data: {len(data)} counties")
data.head()

---
## Step 2 | Data Exploration

In [None]:
# Summary statistics
print("Summary Statistics:")
print(data[['Income_Thousands', 'Voter_Turnout_Pct']].describe())

In [None]:
# Correlation
correlation = data['Income'].corr(data['Voter_Turnout'])
print(f"\nCorrelation between income and voter turnout: {correlation:.4f}")
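
With a single predictor, the regression R² equals the squared correlation, so the correlation computed above gives a quick cross-check on the R² reported in Step 4. The sketch below illustrates that identity on synthetic data; the numbers are made up for illustration and are not the course dataset, and `np.polyfit` stands in for the statsmodels fit.

```python
import numpy as np

# Synthetic stand-in for 66 counties (NOT the real data)
rng = np.random.default_rng(0)
income_k = rng.normal(60, 10, size=66)                            # income in $1000s
turnout_pct = 65 + 0.15 * income_k + rng.normal(0, 3, size=66)    # turnout in %

# Pearson correlation between the two variables
r = np.corrcoef(income_k, turnout_pct)[0, 1]

# R-squared from a least-squares line: 1 - SS_residual / SS_total
slope, intercept = np.polyfit(income_k, turnout_pct, deg=1)
fitted = intercept + slope * income_k
ss_res = np.sum((turnout_pct - fitted) ** 2)
ss_tot = np.sum((turnout_pct - turnout_pct.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"r = {r:.4f}, r^2 = {r**2:.4f}, R^2 = {r_squared:.4f}")
```

If the two numbers disagree in the real notebook, the correlation and the regression are probably using different columns or different rows.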

---
## Step 3 | Visualization

In [None]:
# Scatter plot with regression line
plt.figure(figsize=(10, 6))
sns.regplot(x='Income_Thousands', y='Voter_Turnout_Pct', data=data,
            scatter_kws={'s': 50, 'alpha': 0.7}, line_kws={'color': 'green'})
plt.title('Income vs. Voter Turnout in Pennsylvania Counties (2024)')
plt.xlabel('Median Household Income ($1000s)')
plt.ylabel('Voter Turnout (%)')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Distribution of voter turnout
plt.figure(figsize=(10, 6))
sns.histplot(data['Voter_Turnout_Pct'], kde=True, bins=20)
plt.title('Distribution of Voter Turnout Across PA Counties')
plt.xlabel('Voter Turnout (%)')
plt.ylabel('Frequency')
plt.axvline(data['Voter_Turnout_Pct'].mean(), color='red', linestyle='--', label=f"Mean: {data['Voter_Turnout_Pct'].mean():.1f}%")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

---
## Step 4 | Statistical Analysis

In [None]:
# OLS Regression
model = smf.ols('Voter_Turnout ~ Income', data=data).fit()
print("OLS Regression: Voter_Turnout ~ Income")
print(model.summary())

In [None]:
# Key results
print("\n" + "="*50)
print("KEY RESULTS")
print("="*50)
print("\nNull Hypothesis: Income is not associated with voter turnout (beta = 0)")
print(f"\nModel Results:")
print(f"  Intercept: {model.params['Intercept']:.4f} ({model.params['Intercept']*100:.2f}%)")
print(f"  Income coefficient: {model.params['Income']:.2e}")
print(f"  P-value: {model.pvalues['Income']:.6f}")
print(f"  R-squared: {model.rsquared:.3f}")
print(f"\nInterpretation:")
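print(f"  Each $10,000 increase in median income is associated with")
# Assumes Voter_Turnout is stored as a proportion (0-1), so *100 converts to percentage points
print(f"  {model.params['Income']*10000*100:.2f} percentage points higher turnout")
# Base the conclusion on the estimated p-value instead of hard-coding it
alpha = 0.05  # conventional significance level (an assumption; adjust as needed)
if model.pvalues['Income'] < alpha:
    print(f"\nConclusion: REJECT null hypothesis at the {alpha:.0%} level")
    print("  Income IS significantly associated with voter turnout in PA")
else:
    print(f"\nConclusion: FAIL TO REJECT null hypothesis at the {alpha:.0%} level")
    print("  Income is NOT significantly associated with voter turnout in PA")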
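
The regression above uses the raw `Income` and `Voter_Turnout` columns, while the plots use `Income_Thousands` and `Voter_Turnout_Pct`. Rescaling the variables changes the size of the slope by a known factor but leaves the fit unchanged. The following is a minimal sketch on synthetic data (not the course dataset), using `np.polyfit` rather than statsmodels to keep it short; the helper `fit_line` is hypothetical.

```python
import numpy as np

# Synthetic stand-in for 66 counties (NOT the real data)
rng = np.random.default_rng(1)
income = rng.normal(60_000, 10_000, size=66)                        # dollars
turnout = 0.65 + 1.5e-06 * income + rng.normal(0, 0.03, size=66)    # proportion (0-1)

def fit_line(x, y):
    """Return the least-squares slope and R-squared of y on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    resid = y - (intercept + slope * x)
    return slope, 1 - resid.var() / y.var()

# Raw units: proportion of turnout per dollar
slope_raw, r2_raw = fit_line(income, turnout)

# Rescaled units: percentage points of turnout per $1000
slope_scaled, r2_scaled = fit_line(income / 1000, turnout * 100)

print(f"Raw slope:      {slope_raw:.3e} (proportion per $1)")
print(f"Rescaled slope: {slope_scaled:.4f} (pct. points per $1000)")
print(f"R-squared:      {r2_raw:.4f} vs {r2_scaled:.4f}")
```

The rescaled slope is the raw slope times 1000 × 100, which is why a coefficient on the order of 1e-06 can still correspond to a change of a percentage point or more per $10,000.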

---
## Step 5 | Results Interpretation

### Key Findings

| Statistic | Value |
|-----------|-------|
| Income Coefficient (turnout proportion per $1 of income) | 1.45e-06 |
| P-value | < 0.001 |
| R-squared | 0.265 |
| Correlation | 0.51 |

1. **Significant Positive Relationship:** Higher-income counties tend to have higher turnout (p < 0.001)

2. **Moderate Explanatory Power:** Income explains ~27% of the variation in turnout across counties (R² = 0.265)

3. **Practical Significance:** A $10,000 increase in median income is associated with ~1.5 percentage points higher turnout
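
The practical-significance figure follows directly from the coefficient in the table, assuming (as the notebook's own interpretation code does) that turnout is measured as a proportion between 0 and 1. A short worked check:

```python
# Coefficient from the Key Findings table: change in turnout (proportion) per $1 of income
income_coef = 1.45e-06

# Scale up to a $10,000 income difference, then convert proportion to percentage points
income_change = 10_000
effect_pp = income_coef * income_change * 100

print(f"{effect_pp:.2f} percentage points per $10,000")
```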

### Why Does Income Predict Turnout?

**Resource theory:** Higher income may provide:
- Time flexibility to vote
- Transportation to polling places
- Civic engagement opportunities
- Better access to information

**Education correlation:** Income correlates strongly with education, which also predicts turnout, so part of the estimated income effect may reflect education.

### PA 2024 Context

Pennsylvania was a key swing state in 2024:
- High turnout (reaching ~93% in some counties)
- Intense campaign activity
- 66 counties in this dataset, with diverse demographics

### Limitations

- Cross-sectional data (one election)
- County-level analysis masks individual variation
- Omitted variables (education, age, urbanization)

---
## Replication Exercises

### Exercise 1: Add Controls
If education data is available, add it as a control. Does income still matter?
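
One possible starting point is sketched below on synthetic data. The `Education` column is hypothetical (the course dataset may not include one), and the data are constructed so that education drives both income and turnout; this shows how a control can shrink the income coefficient, not what the real data will show.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for 66 counties (NOT the real data)
rng = np.random.default_rng(3)
n = 66
education = rng.normal(30, 8, n)                     # hypothetical: % of adults with a degree
income_k = 40 + 1.0 * education + rng.normal(0, 6, n)
turnout_pct = 50 + 0.5 * education + rng.normal(0, 2, n)   # no direct income effect by construction

demo = pd.DataFrame({
    'Income_Thousands': income_k,
    'Education': education,
    'Voter_Turnout_Pct': turnout_pct,
})

# Same specification as the notebook, then the version with a control added
simple = smf.ols('Voter_Turnout_Pct ~ Income_Thousands', data=demo).fit()
controlled = smf.ols('Voter_Turnout_Pct ~ Income_Thousands + Education', data=demo).fit()

print(f"Income coefficient, no control:   {simple.params['Income_Thousands']:.3f}")
print(f"Income coefficient, with control: {controlled.params['Income_Thousands']:.3f}")
print(f"Income p-value, with control:     {controlled.pvalues['Income_Thousands']:.3f}")
```

With real data, compare the income coefficient and its p-value across the two models before concluding whether income "still matters."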

### Exercise 2: Urban vs. Rural
Split counties by urbanization. Does the relationship differ?
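
A minimal sketch of one way to do the split, again on synthetic data. The `Pop_Density` and `Area` columns are hypothetical, the median cutoff is only one of several reasonable definitions of "urban," and `np.polyfit` stands in for a full regression.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for 66 counties (NOT the real data)
rng = np.random.default_rng(2)
n = 66
demo = pd.DataFrame({
    'Income_Thousands': rng.normal(60, 10, n),
    'Pop_Density': rng.lognormal(5, 1, n),   # hypothetical urbanization measure
})
demo['Voter_Turnout_Pct'] = 65 + 0.15 * demo['Income_Thousands'] + rng.normal(0, 3, n)

# Label counties above the median density as urban, the rest as rural
cutoff = demo['Pop_Density'].median()
demo['Area'] = np.where(demo['Pop_Density'] > cutoff, 'Urban', 'Rural')

# Fit a separate least-squares line within each group and compare slopes
slopes = {}
for area, group in demo.groupby('Area'):
    slope, intercept = np.polyfit(group['Income_Thousands'], group['Voter_Turnout_Pct'], deg=1)
    slopes[area] = slope
    print(f"{area}: n = {len(group)}, slope = {slope:.3f} pct. points per $1000")
```

With only about 33 counties per group, the two slopes will be estimated imprecisely, so a visible gap between them is not by itself evidence of a real difference.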

### Exercise 3: Historical Comparison
Compare to 2020 or 2016 turnout. Is the pattern consistent?

### Challenge Exercise
Research the political science literature on voter turnout. What are the most important predictors?

In [None]:
# Your code for exercises

# Example: Highest turnout counties
# print(data.nlargest(5, 'Voter_Turnout_Pct')[['County', 'Income_Thousands', 'Voter_Turnout_Pct']])