# ECON 0150 | Replication Notebook

**Title:** Income Inequality and Incarceration

**Original Authors:** Tully

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis.

## About This Replication

**Research Question:** Is there a correlation between income inequality and incarceration rates across economically advanced countries?

**Data Source:** Gini coefficients and incarceration rates for 12 developed countries

**Methods:** OLS regression of incarceration rate on Gini coefficient

**Main Finding:** Strong positive relationship: each 0.01 increase in Gini is associated with 27.7 more prisoners per 100k population (p = 0.004, R² = 0.58).

**Course Concepts Used:**
- Simple linear regression
- Cross-country comparison
- Scatter plots with country labels
- Interpreting correlation vs causation

---
## Step 0 | Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [None]:
# Load data from course website
base_url = 'https://tayweid.github.io/econ-0150/projects/replications/0025/data/'

df = pd.read_csv(base_url + 'incarceration_gini.csv')

# Rename columns for easier use
df = df.rename(columns={
    'Gini Coefficient (Disposable Income)': 'Gini',
    'Incarceration Rate (per 100k)': 'Incarceration'
})

print(f"Number of countries: {len(df)}")
df
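
One thing to watch in the cell above: `DataFrame.rename` silently ignores keys that do not match an existing column, so a small change in the CSV header would leave the long original names in place and break later cells. Below is a minimal sketch of a guard, assuming the three column names used elsewhere in this notebook (`Country`, `Gini`, `Incarceration`). The helper `missing_columns` is hypothetical and not part of the original project.

```python
# Hypothetical helper: report which required columns are absent.
REQUIRED_COLUMNS = ["Country", "Gini", "Incarceration"]

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are not present in `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]

# Example using the pre-rename headers from the loading cell:
raw_headers = [
    "Country",
    "Gini Coefficient (Disposable Income)",
    "Incarceration Rate (per 100k)",
]
print(missing_columns(raw_headers))  # ['Gini', 'Incarceration']
```

In the notebook this could be called as `missing_columns(df.columns)` right after the rename, stopping with an error if the returned list is not empty.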

---
## Step 1 | Data Exploration

In [None]:
# Summary statistics
print("Summary Statistics:")
print(df[['Gini', 'Incarceration']].describe())

In [None]:
# Distribution of variables
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(df['Gini'], bins=8, edgecolor='black')
axes[0].set_xlabel('Gini Coefficient')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Income Inequality')

axes[1].hist(df['Incarceration'], bins=8, edgecolor='black')
axes[1].set_xlabel('Incarceration Rate (per 100k)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Incarceration Rates')

plt.tight_layout()
plt.show()

In [None]:
# Correlation
correlation = df['Gini'].corr(df['Incarceration'])
print(f"Correlation between Gini and Incarceration: {correlation:.3f}")
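
In a regression with a single predictor, R² equals the square of this Pearson correlation, so the value printed here should agree with the regression output in Step 3. The sketch below checks that identity in plain Python on made-up numbers. The `x` and `y` lists are illustrative and are not the project data.

```python
import math

# Illustrative values only; NOT the project's Gini/incarceration data.
x = [0.25, 0.27, 0.29, 0.31, 0.33, 0.39]
y = [60.0, 75.0, 70.0, 105.0, 100.0, 140.0]

def mean(values):
    return sum(values) / len(values)

def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ols_r_squared(x, y):
    """R-squared of the least-squares line y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

r = pearson_r(x, y)
print(f"r^2 = {r**2:.6f}, R^2 = {ols_r_squared(x, y):.6f}")  # the two values match
```

Applied to this project, the reported R² of 0.58 implies a correlation of roughly 0.76.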

---
## Step 2 | Visualization

In [None]:
# Scatter plot with country labels and regression line
plt.figure(figsize=(12, 8))

sns.regplot(data=df, x='Gini', y='Incarceration', ci=None, 
            scatter_kws={'s': 100}, line_kws={'color': 'red', 'linestyle': '--'})

# Add country labels, offset slightly up and to the right of each point
for _, row in df.iterrows():
    plt.annotate(row['Country'],
                 (row['Gini'] + 0.005, row['Incarceration'] + 10),
                 fontsize=10)

plt.xlabel('Gini Coefficient (Disposable Income)')
plt.ylabel('Incarceration Rate (per 100,000)')
plt.title('Incarceration Rate vs Income Inequality')
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

In [None]:
# Bar chart comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Sort by Gini
df_sorted = df.sort_values('Gini', ascending=False)

axes[0].barh(df_sorted['Country'], df_sorted['Gini'])
axes[0].set_xlabel('Gini Coefficient')
axes[0].set_title('Income Inequality by Country')

# Sort by incarceration
df_sorted = df.sort_values('Incarceration', ascending=False)

axes[1].barh(df_sorted['Country'], df_sorted['Incarceration'])
axes[1].set_xlabel('Incarceration Rate (per 100k)')
axes[1].set_title('Incarceration Rate by Country')

plt.tight_layout()
plt.show()

---
## Step 3 | Statistical Analysis

In [None]:
# OLS Regression
X = sm.add_constant(df['Gini'])
y = df['Incarceration']

model = sm.OLS(y, X).fit()
print(model.summary())

In [None]:
# Results table
ci = model.conf_int()  # columns 0 and 1 hold the lower and upper 95% bounds
results_df = pd.DataFrame({
    'Coefficient': model.params,
    'Std Error': model.bse,
    't-statistic': model.tvalues,
    'P-value': model.pvalues,
    '95% CI Lower': ci[0],
    '95% CI Upper': ci[1]
}).round(3)

print("\nRegression Results Table:")
print(results_df)

In [None]:
# Key interpretation
print("\n" + "="*50)
print("KEY RESULTS")
print("="*50)
print(f"Intercept: {model.params['const']:.2f}")
print(f"Slope on Gini: {model.params['Gini']:.2f}")
print("\nInterpretation: A 0.01 increase in Gini is associated with")
print(f"               {model.params['Gini'] * 0.01:.2f} more prisoners per 100k")
print(f"\nR-squared: {model.rsquared:.3f}")
print(f"P-value: {model.pvalues['Gini']:.4f}")
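
The cell above multiplies the slope by 0.01 because the Gini coefficient runs from 0 to 1, so a one-unit change is not a realistic comparison between countries. Here is a small sketch of that arithmetic, using the rounded full-sample slope of 2,774 reported in this notebook. The function name `effect_for_change` is an illustrative assumption.

```python
def effect_for_change(slope, delta_gini):
    """Predicted change in prisoners per 100k for a given change in Gini."""
    return slope * delta_gini

full_sample_slope = 2774  # rounded full-sample estimate reported in this notebook

print(f"{effect_for_change(full_sample_slope, 0.01):.1f}")  # 27.7
print(f"{effect_for_change(full_sample_slope, 0.05):.1f}")  # 138.7
```

The second line shows the same slope applied to a larger gap of 0.05 Gini points, which is still a linear extrapolation from the fitted line and carries the same caveats.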

---
## Step 4 | Outlier Analysis

In [None]:
# The United States is a clear outlier
# Let's see what happens if we exclude it

df_no_us = df[df['Country'] != 'United States']

X_no_us = sm.add_constant(df_no_us['Gini'])
y_no_us = df_no_us['Incarceration']

model_no_us = sm.OLS(y_no_us, X_no_us).fit()

print("Regression WITHOUT United States:")
print(model_no_us.summary().tables[1])
print(f"\nR-squared: {model_no_us.rsquared:.3f}")
print(f"P-value: {model_no_us.pvalues['Gini']:.4f}")

In [None]:
# Compare with and without US
print("\nComparison:")
print(f"With US - Coefficient: {model.params['Gini']:.1f}, R²: {model.rsquared:.3f}, p: {model.pvalues['Gini']:.4f}")
print(f"Without US - Coefficient: {model_no_us.params['Gini']:.1f}, R²: {model_no_us.rsquared:.3f}, p: {model_no_us.pvalues['Gini']:.4f}")
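
Dropping one country by hand is a special case of a leave-one-out check: refit the line once per observation and see which omission moves the slope most. The sketch below does this in plain Python on invented data with one extreme point labelled `E`. The labels, values, and the helper `ols_slope` are all illustrative assumptions, not the project's results.

```python
# Invented data: four clustered points plus one extreme point "E".
data = {
    "A": (0.26, 70.0),
    "B": (0.28, 90.0),
    "C": (0.30, 80.0),
    "D": (0.32, 100.0),
    "E": (0.40, 500.0),
}

def ols_slope(points):
    """Least-squares slope for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    return sxy / sxx

full_slope = ols_slope(list(data.values()))

# Refit once per omitted label and record the resulting slope.
loo_slopes = {
    label: ols_slope([pt for name, pt in data.items() if name != label])
    for label in data
}

# The most influential point is the one whose removal shifts the slope most.
most_influential = max(loo_slopes, key=lambda k: abs(loo_slopes[k] - full_slope))

print(f"Full-sample slope: {full_slope:.0f}")
for label, slope in loo_slopes.items():
    print(f"Without {label}: {slope:.0f}")
print(f"Largest shift when dropping: {most_influential}")
```

On the fitted statsmodels results, `model.get_influence()` provides related diagnostics such as Cook's distance, which may be a more convenient route in the notebook itself.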

---
## Step 5 | Results Interpretation

### Key Findings

| Model | Slope on Gini (prisoners per 100k per 1.0 Gini) | R² | P-value |
|-------|------------------|-----|--------|
| Full sample | 2,774 | 0.58 | 0.004 |
| Excluding US | Lower | ~0.2 | >0.05 |
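
The full-sample row can be cross-checked with a standard identity for one-predictor regression: the slope's t-statistic satisfies t² = (n − 2)·R² / (1 − R²). The sketch below plugs in the rounded values reported here (n = 12, R² = 0.58) and compares the result with the usual two-sided critical values of Student's t for 10 degrees of freedom (2.228 at the 5% level, 3.169 at the 1% level). Because R² is rounded, the output is approximate, and the function name `slope_t_stat` is an illustrative assumption.

```python
import math

def slope_t_stat(r_squared, n):
    """|t| for the slope in a one-predictor OLS, from R-squared and sample size."""
    return math.sqrt((n - 2) * r_squared / (1 - r_squared))

n_countries = 12        # sample size stated in this notebook
r_squared_full = 0.58   # rounded full-sample R-squared from the table

t_abs = slope_t_stat(r_squared_full, n_countries)

# Two-sided critical values of Student's t with n - 2 = 10 degrees of freedom.
T_CRIT_5PCT = 2.228
T_CRIT_1PCT = 3.169

print(f"|t| is about {t_abs:.2f}")
print("Significant at 5%:", t_abs > T_CRIT_5PCT)
print("Significant at 1%:", t_abs > T_CRIT_1PCT)
```

A |t| near 3.7 exceeds the 1% critical value, which is consistent with the reported p-value of 0.004.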

### Interpretation

1. **Strong correlation with full sample:** Higher inequality (Gini) is associated with higher incarceration rates

2. **US is a major outlier:** The United States has both the highest inequality AND the highest incarceration rate by far

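3. **Results are driven by the US:** Excluding the US lowers the slope, drops R² to roughly 0.2, and pushes the p-value above 0.05, so the relationship is no longer statistically significant at the 5% level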

### Cautions

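- **Small sample size:** With only 12 countries, statistical power is limited and a single observation can drive the fit
- **Correlation ≠ causation:** Confounding factors such as drug policy and differences in criminal justice systems may explain the association
- **Selection bias:** Only developed countries are included, so the result may not generalize beyond them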
- **US exceptionalism:** The US may have unique factors driving both variables

---
## Replication Exercises

### Exercise 1: More Countries
Add more countries to the dataset. Does the relationship hold with a larger sample?

### Exercise 2: Multiple Regression
Add other predictors (GDP per capita, education, drug policy indicators). What explains incarceration?

### Exercise 3: Time Series
How have inequality and incarceration changed over time in the US? Is there a temporal relationship?

### Challenge Exercise
Research the "spirit level" hypothesis (Wilkinson & Pickett). What other outcomes are correlated with inequality?

In [None]:
# Your code for exercises
