# ECON 0150 | Replication Notebook

**Title:** Health Spending and Life Expectancy

**Original Authors:** Karpas, Heimel

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis.

## About This Replication

**Research Question:** Is the amount of health spending in a country associated with life expectancy?

**Data Source:** World Bank life expectancy and health expenditure per capita data (2022), plus democracy index data

**Methods:** OLS regression with log transformation of health expenditure

**Main Finding:** Log health expenditure per capita is strongly associated with life expectancy. A one-unit increase in log health spending (spending roughly 2.7 times higher) is associated with about 4.5 years higher life expectancy (p < 0.001, R² = 0.65).

**Course Concepts Used:**
- Log transformations
- OLS regression
- Merging datasets
- Multiple regression with controls

---
## Step 0 | Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

In [None]:
# Load data from course website
base_url = 'https://tayweid.github.io/econ-0150/projects/replications/0017/data/'

life_data = pd.read_csv(base_url + 'Life_expectancy.csv')
health_data = pd.read_csv(base_url + 'health_expenditure.csv')
dem_data = pd.read_csv(base_url + 'dem_index.csv')

print(f"Life expectancy data: {len(life_data)} countries")
print(f"Health expenditure data: {len(health_data)} countries")
print(f"Democracy index data: {len(dem_data)} observations")

---
## Step 1 | Data Preparation

In [None]:
# Keep the 2022 column from each dataset and drop rows with missing values
life_2022 = life_data[['REF_AREA_LABEL', '2022']].dropna().copy()
life_2022 = life_2022.rename(columns={'2022': 'Life_Expectancy'})

health_2022 = health_data[['REF_AREA_LABEL', '2022']].dropna().copy()
health_2022 = health_2022.rename(columns={'2022': 'Health_Expenditure_per_Capita'})

print(f"Life expectancy 2022: {len(life_2022)} countries")
print(f"Health expenditure 2022: {len(health_2022)} countries")

In [None]:
# Merge life expectancy and health expenditure
data = pd.merge(life_2022, health_2022, on='REF_AREA_LABEL', how='inner')
data = data.dropna()

# Create log of health expenditure
data['log_health'] = np.log(data['Health_Expenditure_per_Capita'])

print(f"Merged data: {len(data)} countries")
data.head()
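
The natural log is only defined for positive spending values: `np.log` returns `-inf` for zero and `NaN` for negative inputs, and either would quietly distort the later regressions. A minimal sketch of a guard, using a small made-up frame (the labels and values are hypothetical, not taken from the World Bank file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the notebook itself uses the merged World Bank frame.
toy = pd.DataFrame({
    'REF_AREA_LABEL': ['A', 'B', 'C'],
    'Health_Expenditure_per_Capita': [50.0, 0.0, 4000.0],
})

# Keep only strictly positive spending before taking the natural log.
positive = toy[toy['Health_Expenditure_per_Capita'] > 0].copy()
positive['log_health'] = np.log(positive['Health_Expenditure_per_Capita'])

dropped = len(toy) - len(positive)
print(f"Dropped {dropped} row(s) with non-positive spending")
print(positive)
```

In the notebook, the same filter could be applied to `data` just before `log_health` is created.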

In [None]:
# Prepare democracy index (2022)
dem_2022 = dem_data[dem_data['Year'] == 2022][['Entity', 'Democracy index']].copy()
dem_2022 = dem_2022.rename(columns={'Entity': 'REF_AREA_LABEL', 'Democracy index': 'Democracy_Index'})

# Merge with main data
data_full = pd.merge(data, dem_2022, on='REF_AREA_LABEL', how='inner')
print(f"Data with democracy index: {len(data_full)} countries")
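
Because the two sources are joined on country names, any spelling difference between them silently drops that country from an inner merge. One way to check, sketched here on made-up frames (the country labels and values are hypothetical), is an outer merge with pandas' `indicator=True`, which adds a `_merge` column recording where each row came from:

```python
import pandas as pd

# Hypothetical toy frames standing in for `data` and `dem_2022`.
left = pd.DataFrame({
    'REF_AREA_LABEL': ['Canada', 'Turkiye', 'Japan'],
    'Life_Expectancy': [81.0, 78.0, 84.0],
})
right = pd.DataFrame({
    'REF_AREA_LABEL': ['Canada', 'Turkey', 'Japan'],
    'Democracy_Index': [8.9, 4.3, 8.3],
})

# An outer merge keeps unmatched rows; `_merge` records their origin.
check = pd.merge(left, right, on='REF_AREA_LABEL', how='outer', indicator=True)

only_left = check.loc[check['_merge'] == 'left_only', 'REF_AREA_LABEL'].tolist()
only_right = check.loc[check['_merge'] == 'right_only', 'REF_AREA_LABEL'].tolist()

print("Only in the left frame:", only_left)
print("Only in the right frame:", only_right)
```

Names that appear in only one list are candidates for renaming before the inner merge.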

---
## Step 2 | Data Exploration

In [None]:
# Summary statistics
data[['Life_Expectancy', 'Health_Expenditure_per_Capita', 'log_health']].describe()

In [None]:
# Histograms
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(data['Life_Expectancy'], bins=20, color='purple', edgecolor='black')
axes[0].set_xlabel('Life Expectancy (years)')
axes[0].set_title('Distribution of Life Expectancy (2022)')

axes[1].hist(data['Health_Expenditure_per_Capita'], bins=20, color='pink', edgecolor='black')
axes[1].set_xlabel('Health Expenditure per Capita (USD)')
axes[1].set_title('Distribution of Health Spending (2022)')

plt.tight_layout()
plt.show()
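
A quick numeric companion to the histograms is the sample skewness, available in pandas as `Series.skew()`. The sketch below uses simulated right-skewed values (an assumption for illustration, not the World Bank data) to show how taking logs pulls a long right tail toward symmetry, which is the motivation for the log transformation used in later steps:

```python
import numpy as np
import pandas as pd

# Simulated right-skewed "spending" values; purely illustrative.
rng = np.random.default_rng(0)
spending = pd.Series(rng.lognormal(mean=6.0, sigma=1.0, size=500))

raw_skew = spending.skew()          # long right tail gives a large positive value
log_skew = np.log(spending).skew()  # close to zero when the logs are roughly symmetric

print(f"Skewness of raw values: {raw_skew:.2f}")
print(f"Skewness of log values: {log_skew:.2f}")
```

In the notebook, the same comparison would use `data['Health_Expenditure_per_Capita']` and `data['log_health']`.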

---
## Step 3 | Visualization

In [None]:
# Scatter plot: Raw health expenditure
plt.figure(figsize=(10, 6))
sns.regplot(data=data, x='Health_Expenditure_per_Capita', y='Life_Expectancy', ci=None)
plt.xlabel('Health Expenditure per Capita (USD)')
plt.ylabel('Life Expectancy (years)')
plt.title('Life Expectancy vs Health Expenditure (Raw)')
plt.show()

In [None]:
# Scatter plot: Log health expenditure
plt.figure(figsize=(10, 6))
sns.regplot(data=data, x='log_health', y='Life_Expectancy', 
            scatter_kws={'color': 'orange'}, line_kws={'color': 'red', 'linestyle': '--'})
plt.xlabel('Log Health Expenditure per Capita')
plt.ylabel('Life Expectancy (years)')
plt.title('Life Expectancy vs Log Health Expenditure')
plt.grid(True)
plt.show()

---
## Step 4 | Statistical Analysis

In [None]:
# Model 1: Raw health expenditure
model_raw = smf.ols('Life_Expectancy ~ Health_Expenditure_per_Capita', data=data).fit()
print("Model with raw health expenditure:")
print(model_raw.summary().tables[1])
print(f"R-squared: {model_raw.rsquared:.3f}")

In [None]:
# Model 2: Log health expenditure
model_log = smf.ols('Life_Expectancy ~ log_health', data=data).fit()
print("\nModel with log health expenditure:")
print(model_log.summary().tables[1])
print(f"R-squared: {model_log.rsquared:.3f}")

In [None]:
# Model 3: With democracy index control
model_full = smf.ols('Life_Expectancy ~ log_health + Democracy_Index', data=data_full).fit()
print("\nModel with log health expenditure + democracy index:")
print(model_full.summary().tables[1])
print(f"R-squared: {model_full.rsquared:.3f}")
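
Model 3 is fit on `data_full`, which contains only countries that matched the democracy index, while Model 2 uses `data`. To compare R² like for like, the simpler model can be refit on the same rows as the larger one. A minimal sketch on simulated data (all values and the missingness pattern are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data; the coefficients and noise levels are arbitrary.
rng = np.random.default_rng(1)
n = 120
sim = pd.DataFrame({'log_health': rng.normal(6.0, 1.5, n)})
sim['Life_Expectancy'] = 45 + 4.5 * sim['log_health'] + rng.normal(0, 4, n)
sim['Democracy_Index'] = rng.uniform(1, 10, n)

# Pretend the control variable is missing for the first 30 rows.
sim.loc[sim.index[:30], 'Democracy_Index'] = np.nan

# Restrict to rows where every variable in the larger model is observed.
common = sim.dropna(subset=['Life_Expectancy', 'log_health', 'Democracy_Index'])

small = smf.ols('Life_Expectancy ~ log_health', data=common).fit()
large = smf.ols('Life_Expectancy ~ log_health + Democracy_Index', data=common).fit()

print(f"n = {int(small.nobs)} for both models")
print(f"R-squared: {small.rsquared:.3f} -> {large.rsquared:.3f}")
```

In the notebook this amounts to fitting `'Life_Expectancy ~ log_health'` with `data=data_full` and comparing that R² with Model 3's.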

In [None]:
# Residual analysis for the log model (Model 2)
residuals = model_log.resid
predictions = model_log.fittedvalues  # in-sample predicted life expectancy

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(residuals, bins=20)
axes[0].set_xlabel('Residuals')
axes[0].set_title('Distribution of Residuals')

axes[1].scatter(predictions, residuals, color='orange')
axes[1].axhline(0, color='red', linestyle='--')
axes[1].set_xlabel('Predicted Life Expectancy')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot')

plt.tight_layout()
plt.show()

---
## Step 5 | Results Interpretation

### Key Findings

**Model Comparison:**

| Model | R² | Coefficient on health spending |
|-------|----|--------------------------------|
| Raw expenditure | 0.43 | 0.002 (p < 0.001) |
| Log expenditure | 0.65 | 4.49 (p < 0.001) |
| Log + Democracy | 0.65 | 4.51 (p < 0.001) |

### Interpretation

1. **Log transformation improves fit:** R² increases from 0.43 to 0.65 with log transformation, suggesting diminishing returns to health spending.

2. **Strong relationship:** Each 1 unit increase in log health expenditure (roughly 2.7x higher spending) is associated with ~4.5 years higher life expectancy.

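3. **Democracy not significant:** The coefficient on the democracy index is not statistically significant (p = 0.43) and R² stays at 0.65, suggesting health spending is the stronger predictor. Model 3 is estimated only on countries that also have a democracy index value, so its sample can differ from Model 2's.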
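
Because the regressor is a natural log, the coefficient converts to the difference associated with any proportional change in spending by multiplying it by the log of that ratio. The arithmetic below uses the log-model coefficient of 4.49 from the model comparison table; the helper name `years_for_ratio` is only illustrative:

```python
import math

beta = 4.49  # coefficient on log_health from the model comparison table

def years_for_ratio(ratio, coef=beta):
    """Difference in predicted life expectancy when spending is multiplied by `ratio`."""
    return coef * math.log(ratio)

cases = [("1% higher", 1.01), ("10% higher", 1.10),
         ("doubling", 2.0), ("e-fold", math.e)]
for label, ratio in cases:
    print(f"{label:>10}: {years_for_ratio(ratio):.2f} years")
```

Under this model, a doubling of health spending is associated with about 3.1 more years of life expectancy, and a 10% increase with about 0.4 years.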

### Caveats

- Cross-sectional data cannot establish causation
- Omitted variables (nutrition, education, sanitation) may drive both health spending and life expectancy
- Outliers (very high spenders such as the US) may influence the results

---
## Replication Exercises

### Exercise 1: Outliers
Identify countries with unusually high or low life expectancy given their health spending. What might explain these outliers?

### Exercise 2: Income Controls
Add GDP per capita as a control. Does health spending remain significant after controlling for income?

### Exercise 3: Regional Analysis
Add region as a categorical variable. Does the relationship differ across regions?

### Challenge Exercise
The US is a notable outlier (high spending, moderate life expectancy). Remove the US and re-run. How do results change?

In [None]:
# Your code for exercises
