# ECON 0150 | Replication Notebook

**Title:** Regional Salary Differences

**Original Authors:** Cohen; Stiles; Naso

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis.

## About This Replication

**Research Question:** Do average weekly wages differ between the U.S. North and South, and if so, by how much?

**Data Source:** Bureau of Labor Statistics average weekly wage data by state (2024)

**Methods:** OLS regression with binary regional indicator (North = 0, South = 1)

**Main Finding:** No statistically significant difference between North and South wages (coefficient = -$76.12, p = 0.379, R² = 0.016).

**Course Concepts Used:**
- Binary (dummy) variables
- OLS regression
- Hypothesis testing (t-test)
- Interpretation of categorical predictors
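
The dummy-variable regression named in the methods can be written out explicitly. With a single binary predictor, the intercept is the mean of the reference group (North) and the slope is the difference between the two group means. The symbols below are notation introduced here for illustration, and the South mean of 1,417.92 is implied by the reported intercept and coefficient rather than stated separately in the original project.

```latex
\begin{aligned}
\text{Wage}_i &= \beta_0 + \beta_1\,\text{South}_i + \varepsilon_i,
  \qquad \text{South}_i \in \{0, 1\} \\
\hat{\beta}_0 &= \bar{y}_{\text{North}} = 1494.04 \\
\hat{\beta}_1 &= \bar{y}_{\text{South}} - \bar{y}_{\text{North}}
  = 1417.92 - 1494.04 = -76.12
\end{aligned}
```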

---
## Step 0 | Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

---
## Step 1 | Data Preparation

In [None]:
# State-level average weekly wage data (BLS, 2024)
data = {
    "State": [
        "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", 
        "Connecticut", "Delaware", "District of Columbia", "Florida", "Georgia", 
        "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", 
        "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", 
        "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire", 
        "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", 
        "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", 
        "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia", 
        "Washington", "West Virginia", "Wisconsin", "Wyoming"
    ],
    "AverageWeeklyWage": [
        1245, 1452, 1431, 1236, 1905, 1681, 1980, 1535, 2606, 1456, 1491, 
        1363, 1178, 1662, 1323, 1253, 1246, 1213, 1227, 1283, 1634, 2107, 
        1391, 1570, 1018, 1310, 1189, 1231, 1359, 1606, 1815, 1203, 2213, 
        1467, 1302, 1361, 1186, 1420, 1500, 1429, 1247, 1163, 1364, 1587, 
        1365, 1280, 1605, 1935, 1149, 1305, 1216
    ]
}

df = pd.DataFrame(data)
print(f"Number of states: {len(df)}")

In [None]:
# Define South region (the original authors' broad definition, which also
# includes Western states such as California, Nevada, Hawaii, and Utah)
south_states = {
    "California", "Arizona", "New Mexico", "Texas", "Oklahoma", "Arkansas", 
    "Louisiana", "Mississippi", "Alabama", "Georgia", "Florida", "South Carolina", 
    "North Carolina", "Tennessee", "Kentucky", "Virginia", "West Virginia", 
    "Maryland", "Delaware", "District of Columbia", "Missouri", "Kansas", 
    "Nevada", "Hawaii", "Utah"
}

# Binary indicator (South = 1, North = 0) plus a readable label for plots and tables
df["South"] = df["State"].isin(south_states).astype(int)
df["Region"] = np.where(df["South"] == 1, "South", "North")

print(f"\nNorth states: {(df['South'] == 0).sum()}")
print(f"South states: {(df['South'] == 1).sum()}")

---
## Step 2 | Data Exploration

In [None]:
# Summary statistics by region
summary = df.groupby('Region')['AverageWeeklyWage'].agg(['mean', 'std', 'min', 'max', 'count'])
print("Summary Statistics by Region:")
print(summary.round(2))

In [None]:
# Distribution by region
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Histogram
for region in ['North', 'South']:
    axes[0].hist(df[df['Region'] == region]['AverageWeeklyWage'], 
                 alpha=0.6, label=region, bins=10)
axes[0].set_xlabel('Average Weekly Wage ($)')
axes[0].set_ylabel('Count')
axes[0].set_title('Wage Distribution by Region')
axes[0].legend()

# Box plot
df.boxplot(column='AverageWeeklyWage', by='Region', ax=axes[1])
axes[1].set_xlabel('Region')
axes[1].set_ylabel('Average Weekly Wage ($)')
axes[1].set_title('Wage Distribution by Region')
plt.suptitle('')

plt.tight_layout()
plt.show()

---
## Step 3 | Visualization

In [None]:
# Bar chart of regional averages
region_avgs = df.groupby("Region")["AverageWeeklyWage"].mean()

plt.figure(figsize=(8, 5))
ax = region_avgs.plot(kind="bar", color=['steelblue', 'coral'])

# Add value labels
for i, (region, value) in enumerate(region_avgs.items()):
    ax.text(i, value + 30, f'${value:.2f}', ha='center', fontsize=12)

plt.ylabel("Average Weekly Wage ($)")
plt.title("Average Weekly Wage: North vs South")
plt.ylim(0, region_avgs.max() * 1.15)
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Scatter plot with regression line
plt.figure(figsize=(10, 6))

# Add horizontal jitter so overlapping points stay visible (seeded for reproducibility)
rng = np.random.default_rng(0)
df_plot = df.copy()
df_plot['South_jitter'] = df_plot['South'] + rng.uniform(-0.1, 0.1, len(df_plot))

sns.scatterplot(data=df_plot, x='South_jitter', y='AverageWeeklyWage', 
                hue='Region', s=100, alpha=0.7)

# Add regression line
X = sm.add_constant(df['South'])
model_viz = sm.OLS(df['AverageWeeklyWage'], X).fit()
plt.plot([0, 1], [model_viz.params['const'], model_viz.params['const'] + model_viz.params['South']], 
         'r--', linewidth=2, label='Regression Line')

# Add predicted means
plt.scatter([0, 1], [model_viz.params['const'], model_viz.params['const'] + model_viz.params['South']], 
           color='red', s=150, marker='X', zorder=5, label='Predicted Mean')

plt.xticks([0, 1], ['North', 'South'])
plt.xlabel('Region')
plt.ylabel('Average Weekly Wage ($)')
plt.title('Weekly Wages by Region with Regression Line')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---
## Step 4 | Statistical Analysis

In [None]:
# OLS Regression
X = sm.add_constant(df['South'])
y = df['AverageWeeklyWage']

model = sm.OLS(y, X).fit()
print(model.summary())

In [None]:
# Key results
print("\n" + "="*50)
print("KEY RESULTS")
print("="*50)
print(f"North average wage (Intercept): ${model.params['const']:.2f}")
print(f"South difference from North: ${model.params['South']:.2f}")
print(f"South average wage: ${model.params['const'] + model.params['South']:.2f}")
print(f"\nP-value for South coefficient: {model.pvalues['South']:.4f}")
print(f"R-squared: {model.rsquared:.3f}")
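# Derive the conclusion from the fitted model instead of hard-coding it
alpha = 0.05
p_south = model.pvalues['South']
verdict = "NOT statistically significant" if p_south > alpha else "statistically significant"
print(f"\nConclusion: The difference is {verdict} at the {alpha:.0%} level (p = {p_south:.3f})")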

---
## Step 5 | Results Interpretation

### Key Findings

| Metric | Value |
|--------|-------|
| North average (intercept) | $1,494.04 |
| South average (intercept + coefficient) | $1,417.92 |
| South coefficient (South - North) | -$76.12 |
| P-value | 0.379 |
| R-squared | 0.016 |

### Interpretation

The regression estimates that average weekly wages in Southern states are about $76 lower than in Northern states. However, this difference is **not statistically significant** (p = 0.379 > 0.05).

**The null hypothesis of no regional difference cannot be rejected.**
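
Because the only predictor is a 0/1 indicator, the regression result can be checked by hand as a pooled two-sample t-test. The sketch below uses only the Python standard library and the wages from Step 1, split by the notebook's regional definition. The variable names are introduced here for illustration, and the R-squared identity in the last step holds only for a regression with a single predictor.

```python
from math import sqrt
from statistics import mean, variance

# Weekly wages from Step 1, split by the notebook's North/South definition.
north = [1452, 1681, 1980, 1178, 1662, 1323, 1253, 1283, 2107, 1391, 1570,
         1189, 1231, 1606, 1815, 2213, 1302, 1361, 1420, 1500, 1429, 1163,
         1280, 1935, 1305, 1216]
south = [1245, 1431, 1236, 1905, 1535, 2606, 1456, 1491, 1363, 1246, 1213,
         1227, 1634, 1018, 1310, 1359, 1203, 1467, 1186, 1247, 1364, 1587,
         1365, 1605, 1149]

n_n, n_s = len(north), len(south)
diff = mean(south) - mean(north)  # equals the OLS coefficient on South

# Pooled variance: OLS assumes one common error variance for both groups.
dof = n_n + n_s - 2
pooled_var = ((n_n - 1) * variance(north) + (n_s - 1) * variance(south)) / dof
se = sqrt(pooled_var * (1 / n_n + 1 / n_s))  # standard error of the difference
t_stat = diff / se

# With one predictor, R-squared follows directly from t and the degrees of freedom.
r_squared = t_stat**2 / (t_stat**2 + dof)

print(f"difference = {diff:.2f}, SE = {se:.2f}, "
      f"t = {t_stat:.2f}, R^2 = {r_squared:.3f}")
```

The difference should match the reported coefficient of -$76.12. The t-statistic is well inside the two-sided 5% critical value for 49 degrees of freedom (roughly 2.01), which is why the null hypothesis is not rejected.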

### Why No Significant Difference?

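1. **High within-region variance:** The South group alone contains very high-wage areas such as California (classified as South here) and DC alongside low-wage states such as Mississippi and West Virginia, so the spread within a region is far larger than the $76 gap between regions.

2. **Sample size:** With only 51 observations (50 states + DC), the test has limited power, so only a large true difference would be detected reliably.

3. **Regional definition:** The classification of "South" is debatable. The authors' definition includes states not usually considered Southern (e.g., California, Nevada, Hawaii, Utah), which may blur the contrast.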
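
The sample-size point can be made concrete with a rough power calculation. The sketch below uses a normal approximation to the t distribution and a hypothetical helper, `min_detectable_diff`, which is not part of the original project. The pooled standard deviation of about $306 is an approximation derived from the Step 1 data and should be treated as an assumed input.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_diff(pooled_sd, n_a, n_b, alpha=0.05, power=0.80):
    """Smallest true mean difference a two-sided test would detect with the
    requested power, using a normal approximation to the t distribution."""
    se = pooled_sd * sqrt(1 / n_a + 1 / n_b)       # SE of the difference in means
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return (z_alpha + z_power) * se


# Assumed input: within-region SD of roughly $306 (approximated from Step 1 data).
mde = min_detectable_diff(pooled_sd=306, n_a=26, n_b=25)
print(f"Minimum detectable difference at 80% power: about ${mde:.0f}")
```

Under these assumptions the detectable gap is roughly three times the observed $76 difference, which is the sense in which 51 observations give the test little power here.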

### Caveats

- Results may be sensitive to how the regions are defined; this notebook does not test alternative definitions
- Average wages don't account for cost-of-living differences
- Industry composition varies by state and is not controlled for

---
## Replication Exercises

### Exercise 1: Alternative Regional Definition
Use the Census Bureau's regional definitions instead. Do results change?
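
As a possible starting point, the sketch below sets the Census Bureau's South region (16 states plus the District of Columbia) beside the authors' definition from Step 1 and lists the states that would be reclassified. The Census Bureau defines four regions (Northeast, Midwest, South, West), so treating every non-South state as "North" is a simplifying assumption, and the helper `south_indicator` is a hypothetical name used only here.

```python
# South as classified by the notebook's original authors (from Step 1).
authors_south = {
    "California", "Arizona", "New Mexico", "Texas", "Oklahoma", "Arkansas",
    "Louisiana", "Mississippi", "Alabama", "Georgia", "Florida", "South Carolina",
    "North Carolina", "Tennessee", "Kentucky", "Virginia", "West Virginia",
    "Maryland", "Delaware", "District of Columbia", "Missouri", "Kansas",
    "Nevada", "Hawaii", "Utah",
}

# U.S. Census Bureau South region: 16 states plus the District of Columbia.
census_south = {
    "Delaware", "Maryland", "District of Columbia", "Virginia", "West Virginia",
    "North Carolina", "South Carolina", "Georgia", "Florida",   # South Atlantic
    "Kentucky", "Tennessee", "Mississippi", "Alabama",          # East South Central
    "Oklahoma", "Texas", "Arkansas", "Louisiana",               # West South Central
}


def south_indicator(state, south_set):
    """Return 1 if the state belongs to the given South definition, else 0."""
    return 1 if state in south_set else 0


# States the authors count as South that the Census Bureau places elsewhere.
reclassified = sorted(authors_south - census_south)
print(f"Census South size: {len(census_south)}")
print(f"Reclassified to non-South: {reclassified}")
```

In the notebook itself, the same indicator could be built with `df["State"].isin(census_south).astype(int)` before re-running the Step 4 regression.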

### Exercise 2: Cost of Living Adjustment
Find regional price parity data and adjust wages. Are real wages different across regions?
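
The adjustment itself is a single division. Regional price parities (RPPs), published by the Bureau of Economic Analysis, are indexed so that the national average equals 100, and a nominal wage is deflated by dividing by RPP/100. The sketch below shows only the arithmetic: the wages and RPP values in it are placeholders invented for illustration, not real figures, and `real_wage` is a hypothetical helper name.

```python
def real_wage(nominal_wage, rpp):
    """Deflate a nominal wage by a regional price parity (national average = 100)."""
    return nominal_wage / (rpp / 100)


# Placeholder inputs for illustration only; these are NOT real wage or RPP data.
examples = {
    "hypothetical high-cost state": (1900, 110.0),
    "hypothetical low-cost state": (1100, 90.0),
}

for label, (wage, rpp) in examples.items():
    adjusted = real_wage(wage, rpp)
    print(f"{label}: nominal ${wage}, RPP {rpp}, real ${adjusted:.2f}")
```

A high RPP shrinks the wage and a low RPP inflates it, so the nominal gap between the two placeholder states narrows after adjustment. Whether that happens with actual state data is the question the exercise poses.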

### Exercise 3: Multiple Regression
Add other state-level predictors (education, urbanization, industry mix). What explains state wage variation?

### Challenge Exercise
Research the "great divergence" literature on regional inequality. Has the North-South gap changed over time?

In [None]:
# Your code for exercises
