# ECON 0150 | Replication Notebook

**Title:** Air Pollution and GDP

**Original Authors:** Cen, Habazin, Zheng

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis.

## About This Replication

**Research Question:** Is there a significant relationship between Air Pollution and GDP per Capita?

**Data Source:** PM2.5 air pollution exposure and GDP per capita data by country and year

**Methods:** OLS regression with log transformation of GDP

**Main Finding:** Higher log GDP per capita is associated with lower air pollution. Depending on the year, a 1-unit increase in log GDP is associated with about 4.8 to 6.3 fewer micrograms per cubic meter of PM2.5 (p < 0.001). The relationship was stronger in 2020 than in 1990.

**Course Concepts Used:**
- OLS regression
- Log transformations
- Panel data analysis across years
- Residual analysis

---
## Step 0 | Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

In [None]:
# Load data from course website
data_url = 'https://tayweid.github.io/econ-0150/projects/replications/0001/data/cleaned_data.csv'
data = pd.read_csv(data_url)

# Preview the data
print(f"Dataset has {len(data)} rows")
data.head()

---
## Step 1 | Data Preparation

In [None]:
# The data have already been cleaned and include a log_gdp column
# Column key:
# - atm: PM2.5 air pollution (micrograms per cubic meter)
# - gdp: GDP per capita (constant 2021 international $)
# - log_gdp: Natural log of GDP per capita

print(f"Years in data: {sorted(data['Year'].unique())}")
print("\nNumber of countries per year (first 10 years shown):")
data.groupby('Year').size().head(10)
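
Since the regressions below rely on `log_gdp` being the natural log of `gdp`, a quick consistency check can be worthwhile before modeling. The following is a minimal sketch on a small made-up frame called `sample` (a hypothetical stand-in, not the course data); with the real notebook, the same two checks could be run on `data` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; in the notebook, use `data` instead of `sample`
sample = pd.DataFrame({"gdp": [1500.0, 12000.0, 48000.0]})
sample["log_gdp"] = np.log(sample["gdp"])  # natural log, as in the column key

# Check that the stored column matches a fresh natural-log transform
matches = np.allclose(sample["log_gdp"], np.log(sample["gdp"]))
print(f"log_gdp matches np.log(gdp): {matches}")

# Missing or non-positive GDP values cannot be logged, so count them
n_bad = int((sample["gdp"].isna() | (sample["gdp"] <= 0)).sum())
print(f"Rows with missing or non-positive gdp: {n_bad}")
```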

---
## Step 2 | Data Exploration

In [None]:
# Keep one year per decade (1990, 2000, 2010, 2020) for comparison
data_decade = data[data['Year'].isin([1990, 2000, 2010, 2020])]

# Boxplots of GDP and air pollution by decade
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.boxplot(data=data_decade, x='gdp', y='Year', orient='h', ax=axes[0])
axes[0].set_title('GDP per Capita by Decade')
axes[0].set_xlabel('GDP per Capita ($)')

sns.boxplot(data=data_decade, x='atm', y='Year', orient='h', ax=axes[1])
axes[1].set_title('Air Pollution (PM2.5) by Decade')
axes[1].set_xlabel('PM2.5 (micrograms/m³)')

plt.tight_layout()
plt.show()

---
## Step 3 | Visualization

In [None]:
# Scatter plots for different years
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for ax, year in zip(axes.flat, [1990, 2000, 2010, 2020]):
    year_data = data[data['Year'] == year]
    sns.regplot(data=year_data, x='log_gdp', y='atm', ax=ax, line_kws={'color': 'red'})
    ax.set_title(f'Air Pollution vs Log GDP ({year})')
    ax.set_xlabel('Log GDP per Capita')
    ax.set_ylabel('PM2.5 (micrograms/m³)')

plt.tight_layout()
plt.show()

---
## Step 4 | Statistical Analysis

In [None]:
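# Run a separate OLS regression for each selected year
for year in [1990, 2000, 2010, 2020]:
    year_data = data[data['Year'] == year]
    model = smf.ols('atm ~ log_gdp', data=year_data).fit()
    print(f"\n=== {year} (n={len(year_data)}) ===")
    print(model.summary().tables[1])
    # R-squared is not part of the coefficient table, so print it separately
    print(f"R-squared: {model.rsquared:.3f}")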
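
The loop above prints one coefficient table per year. For a side-by-side comparison, the same fits could be gathered into a single table. The sketch below is one possible way to do that; it runs on synthetic data (the frame `demo` and its built-in slope of -5 are assumptions made only for illustration), so the numbers it prints are not the replication results. Swapping `demo` for `data` filtered to the four years should give the real table.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for `data` (assumption: true slope of -5 in every year)
rng = np.random.default_rng(0)
frames = []
for year in [1990, 2000, 2010, 2020]:
    log_gdp = rng.uniform(6, 12, size=150)
    atm = 75 - 5 * log_gdp + rng.normal(0, 15, size=150)
    frames.append(pd.DataFrame({"Year": year, "log_gdp": log_gdp, "atm": atm}))
demo = pd.concat(frames, ignore_index=True)

# Fit one regression per year and keep the key statistics from each fit
rows = []
for year, year_data in demo.groupby("Year"):
    model = smf.ols("atm ~ log_gdp", data=year_data).fit()
    rows.append({
        "Year": year,
        "n": int(model.nobs),
        "coef": model.params["log_gdp"],
        "std_err": model.bse["log_gdp"],
        "p_value": model.pvalues["log_gdp"],
        "r_squared": model.rsquared,
    })

results = pd.DataFrame(rows).set_index("Year")
print(results.round(3))
```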

---
## Step 5 | Results Interpretation

### Key Findings

**Across all years:**
- In each of the four years, higher GDP per capita is significantly associated with lower air pollution (p < 0.001)
- The relationship is negative: on average, wealthier countries tend to have cleaner air

**Time trends:**
- 1990: coef = -4.77 (R² = 0.108)
- 2000: coef = -5.24 (R² = 0.136)
- 2010: coef = -4.92 (R² = 0.138)
- 2020: coef = -6.33 (R² = 0.224)

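The relationship was stronger at the end of the period than at the start: R² rose from 0.11 in 1990 to 0.22 in 2020, and the slope steepened from -4.77 to -6.33. The change in slope was not monotonic, since the 2010 estimate (-4.92) is smaller in magnitude than the 2000 estimate (-5.24).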

### Interpretation

A 1-unit increase in log GDP (GDP per capita about 2.7 times higher) is associated with approximately 5 to 6 fewer micrograms per cubic meter of PM2.5. This is an association, not a causal estimate, but it could reflect:
- Wealthier countries can afford cleaner technology
- Wealthier countries have stronger environmental regulations
- A shift from manufacturing to service economies as countries grow richer
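
Because GDP enters the model in logs, the predicted change in PM2.5 when GDP per capita is multiplied by a factor k is the coefficient times ln(k). A 1-unit change in log GDP is the special case k = e ≈ 2.718. The short sketch below applies that arithmetic to the slope estimates listed under Time trends; the helper name `pm25_change` is hypothetical and introduced only for this illustration.

```python
import math

# Slope estimates copied from the Time trends list
coefs = {1990: -4.77, 2000: -5.24, 2010: -4.92, 2020: -6.33}


def pm25_change(coef, gdp_ratio):
    """Predicted PM2.5 change when GDP per capita is multiplied by gdp_ratio."""
    return coef * math.log(gdp_ratio)


for year, coef in coefs.items():
    ten_pct = pm25_change(coef, 1.10)   # GDP per capita 10% higher
    doubling = pm25_change(coef, 2.0)   # GDP per capita twice as high
    print(f"{year}: 10% higher GDP -> {ten_pct:+.2f}, "
          f"doubling -> {doubling:+.2f} micrograms per cubic meter")
```

Expressing the slope per 10% or per doubling of GDP is often easier to communicate than "per log unit", and it does not change the underlying estimate.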

---
## Replication Exercises

### Exercise 1: Environmental Kuznets Curve
Add a quadratic term (log_gdp²) to test for an inverted-U relationship. Does pollution first increase then decrease with development?
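
As a starting point for the syntax only, the sketch below fits a quadratic specification on synthetic data. The frame `demo`, the column `log_gdp_sq`, and the inverted U built into the fake data are all assumptions for illustration; the answer to the exercise depends on what the real data show.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with an inverted U peaking at log_gdp = 9 (demo assumption)
rng = np.random.default_rng(1)
log_gdp = rng.uniform(6, 12, size=300)
atm = 40 - 3 * (log_gdp - 9) ** 2 + rng.normal(0, 5, size=300)
demo = pd.DataFrame({"log_gdp": log_gdp, "atm": atm})

# Add the squared term as its own column, then include it in the formula
demo["log_gdp_sq"] = demo["log_gdp"] ** 2
quad = smf.ols("atm ~ log_gdp + log_gdp_sq", data=demo).fit()

b1 = quad.params["log_gdp"]
b2 = quad.params["log_gdp_sq"]

# An inverted U requires a negative squared term; the curve turns at -b1 / (2 * b2)
turning_point = -b1 / (2 * b2)
print(quad.params.round(3))
print(f"Estimated turning point (log GDP): {turning_point:.2f}")
```

When interpreting the real fit, it is worth checking whether the estimated turning point falls inside the observed range of `log_gdp`; a turning point outside that range does not support an inverted U in the sample.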

### Exercise 2: Regional Analysis
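Add region as a control variable, then interact region with log_gdp. A control alone only shifts each region's intercept; the interaction is what tests whether the slope differs by world region. Does the relationship differ by world region?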

### Exercise 3: Panel Model
Pool all years together and include year fixed effects. How does the pooled estimate compare to the year-by-year estimates?
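
One way to set this up with the formula interface is to wrap the year in `C()`, which adds a separate intercept shift for each year. The sketch below shows the syntax on synthetic data; `demo`, the year-specific shifts, and the common slope of -5 are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic pooled data: common slope, different intercept per year (demo assumption)
rng = np.random.default_rng(2)
frames = []
for year, shift in zip([1990, 2000, 2010, 2020], [0.0, -1.0, -2.0, -3.0]):
    log_gdp = rng.uniform(6, 12, size=150)
    atm = 75 + shift - 5 * log_gdp + rng.normal(0, 15, size=150)
    frames.append(pd.DataFrame({"Year": year, "log_gdp": log_gdp, "atm": atm}))
demo = pd.concat(frames, ignore_index=True)

# C(Year) treats the year as categorical, giving one fixed effect per year
pooled = smf.ols("atm ~ log_gdp + C(Year)", data=demo).fit()
print(pooled.params.round(3))
print(f"Pooled slope on log_gdp: {pooled.params['log_gdp']:.2f}")
```

In the real data, the same countries appear in multiple years, so observations are unlikely to be independent and the default standard errors may be too small; that caveat is worth noting when comparing the pooled estimate with the year-by-year ones.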

### Challenge Exercise
This is a cross-sectional relationship. What are the threats to causal interpretation? What would you need to establish causality?

In [None]:
# Your code for exercises
