# ECON 0150 | Replication Notebook

**Title:** Income and Life Expectancy

**Original Authors:** Taivan, Suess

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis. You can run this notebook yourself to explore the data, reproduce the findings, and try the extension exercises at the end.

## About This Replication

**Research Question:** Is higher household income associated with a longer life expectancy?

**Data Source:** Health Inequality Project data (Chetty et al.) - life expectancy by income percentile, gender, and year

**Methods:** OLS regression with log transformation and interaction term (income x gender)

**Main Finding:** Log household income is positively associated with life expectancy. Women have higher baseline life expectancy, but men show a stronger income-life expectancy gradient (interaction coefficient = 1.10, p < 0.001).

**Course Concepts Used:**
- OLS regression
- Log transformations
- Interaction terms
- Interpreting gender effects
- Residual analysis

---
## Step 0 | Setup

First, we import the necessary libraries and load the data.

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

In [None]:
# Load data from course website
data_url = 'https://tayweid.github.io/econ-0150/projects/replications/0055/data/Health%20Inequality%20Online%20Table%202.csv'
data = pd.read_csv(data_url)

# Preview the data
data.head()

In [None]:
# Check the shape and columns
print(f"Dataset has {len(data)} rows and {len(data.columns)} columns")
print(f"\nColumns: {list(data.columns)}")

---
## Step 1 | Data Preparation

We create a log-transformed version of household income for the analysis.
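
As a quick illustration of why a log scale suits income, the sketch below uses made-up incomes that double at each step (these values are not from the dataset). On the dollar scale the steps keep growing, while on the log scale every doubling is the same distance apart.

```python
import math

# Hypothetical incomes that double at each step (illustrative, not from the dataset)
incomes = [10_000, 20_000, 40_000, 80_000]
log_incomes = [math.log(v) for v in incomes]

# Dollar gaps grow with each doubling...
dollar_steps = [b - a for a, b in zip(incomes, incomes[1:])]
# ...while log gaps are constant, each equal to ln(2)
log_steps = [b - a for a, b in zip(log_incomes, log_incomes[1:])]

print(dollar_steps)                        # [10000, 20000, 40000]
print([round(s, 3) for s in log_steps])    # [0.693, 0.693, 0.693]
```

This is why a coefficient on log income is read in terms of proportional changes in income rather than dollar changes.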

In [None]:
# Create log income variable
data['log_hh_inc'] = np.log(data['hh_inc'])

# Check key variables
print("Key columns:")
print("- gnd: Gender (F/M)")
print("- pctile: Income percentile")
print("- hh_inc: Mean household income for that percentile")
print("- le_agg: Life expectancy")
print(f"\nYears in data: {data['year'].unique()}")

---
## Step 2 | Data Exploration

We explore the distributions of our key variables.

In [None]:
# Summary statistics
data[['hh_inc', 'log_hh_inc', 'le_agg']].describe()

In [None]:
# Distributions of log household income and life expectancy by gender
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

sns.histplot(data=data, x='log_hh_inc', hue='gnd', ax=axes[0])
axes[0].set_xlabel('Log Household Income')
axes[0].set_title('Distribution of Log Household Income by Gender')

sns.histplot(data=data, x='le_agg', hue='gnd', ax=axes[1])
axes[1].set_xlabel('Life Expectancy')
axes[1].set_title('Distribution of Life Expectancy by Gender')

plt.tight_layout()
plt.show()

---
## Step 3 | Visualization

We visualize the relationship between income and life expectancy, by gender.

In [None]:
# Scatter plot of income vs life expectancy
plt.figure(figsize=(10, 6))
sns.scatterplot(data=data, x='hh_inc', y='le_agg', hue='gnd', alpha=0.5)
plt.xlabel('Household Income ($)')
plt.ylabel('Life Expectancy')
plt.title('Life Expectancy vs. Household Income by Gender')
plt.show()

In [None]:
# Scatter plot with log income and regression lines
sns.lmplot(data=data, x='log_hh_inc', y='le_agg', hue='gnd', ci=None)
plt.xlabel('Log Household Income')
plt.ylabel('Life Expectancy')
plt.title('Life Expectancy vs. Log Household Income by Gender')
plt.show()

---
## Step 4 | Statistical Analysis

We run an OLS regression with an interaction term to test whether the income-life expectancy relationship differs by gender.
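
To make the interaction term concrete, here is a minimal sketch of the design matrix such a model estimates, assuming gender is coded as a 0/1 male dummy with female as the reference group. It uses synthetic, noise-free data built with NumPy rather than the real dataset, and the "true" coefficients are borrowed from the results reported later purely for illustration.

```python
import numpy as np

# Synthetic, noise-free data; the coefficients are illustrative stand-ins
b0, b_inc, b_male, b_int = 66.1, 1.79, -15.95, 1.10

log_inc = np.tile(np.linspace(9.0, 13.0, 50), 2)   # same income grid for both groups
male = np.repeat([0.0, 1.0], 50)                    # 0 = female, 1 = male
le = b0 + b_inc * log_inc + b_male * male + b_int * log_inc * male

# Design matrix: intercept, log income, male dummy, and their product
X = np.column_stack([np.ones_like(log_inc), log_inc, male, log_inc * male])
coefs, *_ = np.linalg.lstsq(X, le, rcond=None)

# The interaction coefficient is the difference between the two slopes
female_slope = coefs[1]
male_slope = coefs[1] + coefs[3]
print(np.round(coefs, 2), round(female_slope, 2), round(male_slope, 2))
```

Because the product column is zero for women, the log income coefficient alone is the female slope, and the interaction coefficient measures how much the male slope differs from it.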

In [None]:
# Model with interaction term
model = smf.ols('le_agg ~ log_hh_inc + gnd + log_hh_inc:gnd', data=data).fit()
print(model.summary().tables[1])

In [None]:
# Residual plot
fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(model.fittedvalues, model.resid, alpha=0.5)
ax.axhline(y=0, color='red', linestyle='--', linewidth=2)
ax.set_xlabel('Fitted Values', fontsize=12)
ax.set_ylabel('Residuals', fontsize=12)
ax.set_title('Residuals vs. Fitted Values', fontsize=13)
plt.show()

---
## Step 5 | Results Interpretation

### Key Findings

**Regression Results:**
- **Intercept (Female baseline):** ~66.1 years - the fitted value for women at a log income of 0 (an income of $1, far outside the data)
- **Male indicator:** -15.95 (p < 0.001) - the male-female difference in intercepts, also evaluated at a log income of 0
- **Log income (Female):** +1.79 years per log unit (p < 0.001)
- **Interaction (Male x Log income):** +1.10 (p < 0.001) - males show a steeper income gradient
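
The coefficients can be combined into one fitted line per gender. The sketch below plugs the rounded values reported above into a small helper; `predicted_le` and the example incomes are hypothetical names and values chosen for illustration, and the output is only as precise as the rounded coefficients.

```python
import math

# Coefficients as reported above (rounded)
INTERCEPT = 66.1
B_LOG_INC = 1.79
B_MALE = -15.95
B_INTERACTION = 1.10

def predicted_le(hh_inc, male):
    """Fitted life expectancy for a household income in dollars."""
    m = 1 if male else 0
    x = math.log(hh_inc)
    return INTERCEPT + B_LOG_INC * x + B_MALE * m + B_INTERACTION * x * m

# Hypothetical incomes; compare the female and male fitted values
for inc in (25_000, 50_000, 100_000):
    print(inc, round(predicted_le(inc, False), 1), round(predicted_le(inc, True), 1))
```

Evaluating the helper at an income of $1 returns the intercepts themselves, which shows why those two numbers are not meaningful predictions on their own.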

### Interpretation

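1. **Gender Gap**: At low incomes, women live substantially longer than men. The -15.95 coefficient is the male-female gap at a log income of 0 (an income of $1), which lies far outside the data, so it overstates the gap at any realistic income. The fitted gap at a given income is -15.95 + 1.10 x log income.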

2. **Income Effect for Women**: For each 1-unit increase in log income (roughly 2.7x more income), women's life expectancy is about 1.79 years higher.

3. **Income Effect for Men**: The total effect for men is 1.79 + 1.10 = 2.89 years per log unit of income. Men's life expectancy is more strongly associated with income.

4. **Converging Gap**: Because men have a steeper income gradient, the gender life expectancy gap narrows at higher incomes.
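
The narrowing gap can be checked directly from the two male-specific coefficients. This is a rough sketch using the rounded values reported above; `gender_gap` and the example incomes are hypothetical, and the crossover income is an extrapolation of the fitted lines that should be compared against the range of `hh_inc` in the data before reading anything into it.

```python
import math

B_MALE = -15.95          # reported male indicator
B_INTERACTION = 1.10     # reported male x log income interaction

def gender_gap(hh_inc):
    """Fitted male-minus-female life expectancy gap (years) at a given income."""
    return B_MALE + B_INTERACTION * math.log(hh_inc)

# Hypothetical incomes: the gap stays negative but shrinks as income rises
for inc in (10_000, 50_000, 250_000):
    print(f"${inc:,}: {gender_gap(inc):.2f} years")

# Income at which the two fitted lines would cross (gap = 0)
crossover_income = math.exp(-B_MALE / B_INTERACTION)
print(f"Fitted lines cross near ${crossover_income:,.0f}")
```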

### Practical Interpretation

A 1% increase in household income is associated with:
- For women: 0.018 years (about 6.6 days) higher life expectancy
- For men: 0.029 years (about 10.6 days) higher life expectancy
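
These figures come from multiplying each slope by the change in log income. The sketch below reproduces the arithmetic with the rounded coefficients; `le_change` is a hypothetical helper. For a 1% increase, the exact factor ln(1.01) is close to 0.01, which is why dividing the slope by 100 works as a shortcut, but the shortcut drifts for larger changes such as a doubling of income.

```python
import math

SLOPE_WOMEN = 1.79           # reported log income coefficient
SLOPE_MEN = 1.79 + 1.10      # plus the reported interaction

def le_change(slope, pct_increase):
    """Years of life expectancy associated with a pct_increase% rise in income."""
    return slope * math.log(1 + pct_increase / 100)

for pct in (1, 10, 100):
    w, m = le_change(SLOPE_WOMEN, pct), le_change(SLOPE_MEN, pct)
    print(f"{pct:>3}% more income: women {w:.3f} yrs ({w * 365:.1f} days), "
          f"men {m:.3f} yrs ({m * 365:.1f} days)")
```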

---
## Replication Exercises

Try extending this analysis with the following exercises:

### Exercise 1: Time Trends
The data contains multiple years. Run separate regressions for the earliest and latest years. Has the income-life expectancy gradient changed over time?

### Exercise 2: Non-linearity
Add a quadratic term (log_hh_inc squared) to the model. Is there evidence of diminishing returns to income at higher income levels?
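
As a starting point, the sketch below shows what diminishing returns look like in a quadratic fit. It uses synthetic data with made-up coefficients and `np.polyfit` rather than the notebook's regression, so it illustrates the idea without answering the exercise. In a statsmodels formula, a squared term can be written as `I(log_hh_inc**2)`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with built-in diminishing returns (made-up coefficients)
x = rng.uniform(8.0, 14.0, size=200)                     # stand-in for log income
y = 60 + 5.0 * x - 0.2 * x**2 + rng.normal(0, 0.1, 200)  # stand-in for life expectancy

# np.polyfit returns coefficients from the highest power down
quad_coef, lin_coef, const = np.polyfit(x, y, deg=2)

# With a quadratic term the slope depends on x: dy/dx = lin_coef + 2 * quad_coef * x
slope_low = lin_coef + 2 * quad_coef * 9.0
slope_high = lin_coef + 2 * quad_coef * 13.0
print(round(quad_coef, 2), round(slope_low, 2), round(slope_high, 2))
```

A negative coefficient on the squared term means the slope falls as log income rises, which is the pattern to look for in the real data.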

### Exercise 3: Percentile Analysis
Instead of using log household income, use the income percentile variable `pctile` (1-100) as the regressor. How does the interpretation of the slope change?

### Challenge Exercise
The results suggest that men's life expectancy is more strongly associated with income than women's. Generate some hypotheses for why this might be the case. What additional data would you need to test these hypotheses?

In [None]:
# Your code for Exercise 1: Time Trends


In [None]:
# Your code for Exercise 2: Non-linearity


In [None]:
# Your code for Exercise 3: Percentile Analysis


In [None]:
# Your code for Challenge Exercise
