# ECON 0150 | Replication Notebook

**Title:** Corruption and Income Inequality

**Original Authors:** Getgen; Onyango; Koychev

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis.

## About This Replication

**Research Question:** Is there a relationship between corruption perception and income inequality across countries?

**Data Source:** Transparency International Corruption Perceptions Index and World Bank GINI coefficients

**Methods:** OLS regression of GINI coefficient on corruption perception

**Main Finding:** Positive relationship: higher perceived corruption is associated with higher income inequality (coefficient = 0.0018 on the inverted CPI scale, p < 0.001, R² = 0.155).

**Course Concepts Used:**
- Simple linear regression
- Cross-country comparison
- Scatter plots with regression lines
- Variable transformation (inverting CPI scale)

---
## Step 0 | Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

In [None]:
# Load data from course website
base_url = 'https://tayweid.github.io/econ-0150/projects/replications/0042/data/'

# Use the processed data file
# The file name contains a space, so it is percent-encoded (%20) in the URL
data = pd.read_csv(base_url + 'final_data%20processed.csv')

print(f"Number of countries: {len(data)}")
data.head()

---
## Step 1 | Data Preparation

In [None]:
# Check column names
print("Columns:", data.columns.tolist())

In [None]:
# Note: CPI is inverted so higher values = MORE corruption
# Original CPI: 0 = highly corrupt, 100 = very clean
# Transformed: 100 - CPI, so higher = more corrupt

# Find CPI and GINI columns (first column whose name matches, else None)
cpi_col = next((c for c in data.columns if 'corruption' in c.lower() or 'cpi' in c.lower()), None)
gini_col = next((c for c in data.columns if 'gini' in c.lower()), None)

if cpi_col and gini_col:
    print(f"CPI column: {cpi_col}")
    print(f"GINI column: {gini_col}")
    
    # Create Corruption variable (higher = more corrupt)
    data['Corruption'] = 100 - data[cpi_col]
    data['Income_Inequality'] = data[gini_col]
else:
    print("Column names:", data.columns.tolist())
    # Fallback: assume column positions; check them against the names printed above
    data['Corruption'] = 100 - data.iloc[:, 1]  # Assume second column is CPI
    data['Income_Inequality'] = data.iloc[:, 2]  # Assume third is GINI
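
To see what the inversion does to a regression, the sketch below uses synthetic scores (not the project data) and NumPy's `polyfit` rather than the notebook's `smf.ols`. Under those assumptions it illustrates that replacing CPI with `100 - CPI` flips the sign of the slope, shifts the intercept, and leaves R² unchanged.

```python
import numpy as np

# Synthetic illustration only: these are made-up scores, not the project data.
rng = np.random.default_rng(0)
cpi = rng.uniform(10, 90, size=200)                        # original scale: higher = cleaner
gini = 0.55 - 0.002 * cpi + rng.normal(0, 0.05, size=200)  # invented relationship

corruption = 100 - cpi                                     # inverted scale: higher = more corrupt

# np.polyfit with degree 1 returns [slope, intercept]
slope_cpi, intercept_cpi = np.polyfit(cpi, gini, 1)
slope_corr, intercept_corr = np.polyfit(corruption, gini, 1)

# In simple regression, R-squared equals the squared correlation
r2_cpi = np.corrcoef(cpi, gini)[0, 1] ** 2
r2_corr = np.corrcoef(corruption, gini)[0, 1] ** 2

print(f"Slope on CPI:        {slope_cpi:.5f}")
print(f"Slope on Corruption: {slope_corr:.5f}")
print(f"R-squared (CPI, Corruption): {r2_cpi:.3f}, {r2_corr:.3f}")
```

A positive coefficient on `Corruption` therefore carries the same information as a negative coefficient on the raw CPI; the inversion only makes the direction easier to read.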

---
## Step 2 | Data Exploration

In [None]:
# Summary statistics
print("Summary Statistics:")
print(data[['Corruption', 'Income_Inequality']].describe())

In [None]:
# Correlation
correlation = data['Corruption'].corr(data['Income_Inequality'])
print(f"Correlation between corruption and income inequality: {correlation:.3f}")

---
## Step 3 | Visualization

In [None]:
# Scatter plot
plt.figure(figsize=(10, 6))
sns.scatterplot(data=data, x='Corruption', y='Income_Inequality', alpha=0.7)
plt.title('Relationship between Corruption and Income Inequality')
plt.xlabel('Corruption Perception (higher = more corrupt)')
plt.ylabel('GINI Coefficient (Income Inequality)')
plt.grid(True, alpha=0.3)
plt.show()

---
## Step 4 | Statistical Analysis

In [None]:
# Null hypothesis: no linear association between corruption and income inequality (beta = 0)
# Alternative hypothesis: an association exists (beta != 0)

# OLS Regression
model = smf.ols('Income_Inequality ~ Corruption', data=data).fit()
print(model.summary())

In [None]:
# Scatter plot with regression line
plt.figure(figsize=(10, 6))
sns.regplot(data=data, x='Corruption', y='Income_Inequality', ci=None, 
            line_kws={'color': 'red'})
plt.title('Corruption and Income Inequality with Regression Line')
plt.xlabel('Corruption Perception (higher = more corrupt)')
plt.ylabel('GINI Coefficient')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Key results
print("\n" + "="*50)
print("KEY RESULTS")
print("="*50)
print("\nNull Hypothesis: No relationship between corruption and inequality (beta = 0)")
print(f"\nIntercept: {model.params['Intercept']:.4f}")
print(f"Corruption coefficient: {model.params['Corruption']:.4f}")
print("\nInterpretation:")
print("  Each 1-point increase in corruption perception is associated with")
print(f"  a {model.params['Corruption']:.4f} change in the GINI coefficient")
print(f"\nR-squared: {model.rsquared:.3f}")
print(f"P-value: {model.pvalues['Corruption']:.2e}")
print(f"\nSignificant at 0.05? {'Yes' if model.pvalues['Corruption'] < 0.05 else 'No'}")

---
## Step 5 | Results Interpretation

### Key Findings

| Variable | Coefficient | P-value |
|----------|-------------|--------|
| Intercept | 0.454 | < 0.001 |
| Corruption | 0.0018 | < 0.001 |

**R-squared:** 0.155

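1. **Positive Relationship:** More corrupt countries tend to have higher income inequality. A 10-point increase in the corruption score is associated with a 0.018 increase in the GINI coefficient.

2. **Statistically Significant:** The association is highly statistically significant (p < 0.001), so the null hypothesis of no relationship is rejected.

3. **Modest R²:** Corruption perception explains about 15.5% of the cross-country variation in the GINI coefficient, leaving most of the variation unexplained by this model.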
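
As a rough worked example, the fitted line can be evaluated by hand using the rounded values in the table above. The two corruption scores (20 and 80) are hypothetical choices for illustration, not countries in the data, and the results are only as precise as the rounded coefficients.

```python
import math

# Values copied from the Key Findings table (rounded as reported)
intercept = 0.454
slope = 0.0018
r_squared = 0.155

def predicted_gini(corruption):
    """Fitted GINI for a given corruption score (100 - CPI)."""
    return intercept + slope * corruption

low, high = 20, 80  # hypothetical scores chosen for illustration
gap = predicted_gini(high) - predicted_gini(low)

# In simple regression, R-squared is the squared correlation,
# and the correlation takes the sign of the slope.
implied_r = math.copysign(math.sqrt(r_squared), slope)

print(f"Predicted GINI at corruption = {low}: {predicted_gini(low):.3f}")
print(f"Predicted GINI at corruption = {high}: {predicted_gini(high):.3f}")
print(f"Difference: {gap:.3f}")
print(f"Implied correlation: {implied_r:.3f}")
```

The implied correlation should be close to the value printed in Step 2, which gives a quick consistency check between the two steps.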

### Theoretical Interpretation

Why might corruption increase inequality?
- **Rent-seeking:** Elites extract wealth through corruption
- **Weak institutions:** The state cannot enforce progressive redistribution
- **Reduced social mobility:** Connections matter more than merit
- **Public services:** Corruption undermines access to education and healthcare

### Causal Concerns

This regression shows an association, not causation. The relationship could run in either direction or reflect a shared cause:
- **Corruption → Inequality:** As described above
- **Inequality → Corruption:** Elites use wealth to buy political influence
- **Common causes:** Weak institutions cause both
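
The common-cause concern can be illustrated with a small simulation. Everything below is invented: `Institutions` is a hypothetical measure of institutional quality that lowers both corruption and inequality, and corruption is given no direct effect on inequality. A sketch under those assumptions:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data only: every number here is invented for illustration.
rng = np.random.default_rng(0)
n = 2000
institutions = rng.normal(0, 1, n)  # hypothetical institutional quality

sim = pd.DataFrame({
    'Institutions': institutions,
    # Better institutions lower corruption...
    'Corruption': 50 - 15 * institutions + rng.normal(0, 10, n),
    # ...and lower inequality; corruption has NO direct effect here
    'Income_Inequality': 0.40 - 0.03 * institutions + rng.normal(0, 0.05, n),
})

# Regression that omits the common cause
naive = smf.ols('Income_Inequality ~ Corruption', data=sim).fit()
# Regression that includes the common cause as a control
controlled = smf.ols('Income_Inequality ~ Corruption + Institutions', data=sim).fit()

print(f"Naive slope:      {naive.params['Corruption']:.5f}")
print(f"Controlled slope: {controlled.params['Corruption']:.5f}")
```

In this simulated setting the naive regression shows a positive slope even though corruption has no effect by construction, and the slope moves toward zero once the common cause is included. The real data may or may not behave this way.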

---
## Replication Exercises

### Exercise 1: Control for Development
Add GDP per capita as a control. Does the corruption coefficient change?

### Exercise 2: Regional Differences
Does the relationship differ by continent/region?
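
One possible starting point, sketched on simulated data: fit the same regression separately within each region and compare the slopes. The `Region` column and its values `A` and `B` are hypothetical (the processed file may not include a region variable), and the numbers are invented so that the two slopes differ.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data only: two hypothetical regions with different true slopes.
rng = np.random.default_rng(1)
n = 300
frames = []
for region, true_slope in [('A', 0.003), ('B', 0.0005)]:
    corruption = rng.uniform(10, 90, n)
    gini = 0.35 + true_slope * corruption + rng.normal(0, 0.04, n)
    frames.append(pd.DataFrame({
        'Region': region,
        'Corruption': corruption,
        'Income_Inequality': gini,
    }))
sim = pd.concat(frames, ignore_index=True)

# Fit the notebook's regression within each region and collect the slopes
slopes = {}
for region, group in sim.groupby('Region'):
    fit = smf.ols('Income_Inequality ~ Corruption', data=group).fit()
    slopes[region] = fit.params['Corruption']
    print(f"Region {region}: slope = {slopes[region]:.5f}, n = {len(group)}")
```

A more formal comparison would add an interaction term between corruption and region to a single model.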

### Exercise 3: Time Trends
Using the full time series data, has the relationship changed over time?

### Challenge Exercise
Research the institutional economics literature on corruption. What mechanisms link corruption to economic outcomes?

In [None]:
# Your code for exercises

# Example: Identify high-corruption, low-inequality outliers
# outliers = data[(data['Corruption'] > 60) & (data['Income_Inequality'] < 0.35)]
# print(outliers)