# ECON 0150 | Replication Notebook

**Title:** Height and Life Expectancy

**Original Authors:** Sampugnaro

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis.

## About This Replication

**Research Question:** Does average height affect life expectancy across countries?

**Data Source:** Our World in Data - Average height and life expectancy by country

**Methods:** OLS regression of life expectancy on average height

**Main Finding:** Positive relationship: taller countries tend to have higher life expectancy, but this is likely confounded by development and nutrition.

**Course Concepts Used:**
- Simple linear regression
- Cross-country comparison
- Scatter plots with regression lines
- Residual analysis

---
## Step 0 | Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [None]:
# Load data from Our World in Data
# Height data
height_url = "https://ourworldindata.org/grapher/average-height-by-year-of-birth.csv"
life_url = "https://ourworldindata.org/grapher/life-expectancy.csv"

print("Downloading height data...")
h = pd.read_csv(height_url)
print(f"Height data columns: {h.columns.tolist()[:5]}...")

print("Downloading life expectancy data...")
le = pd.read_csv(life_url)
print(f"Life expectancy columns: {le.columns.tolist()[:5]}...")

---
## Step 1 | Data Preparation

In [None]:
# Find the relevant columns
def find_column_containing(df, keywords):
    for k in keywords:
        for c in df.columns:
            if k.lower() in c.lower():
                return c
    return None

# Height dataset columns
height_val_col = find_column_containing(h, ['height'])
height_entity_col = find_column_containing(h, ['entity','country'])
height_year_col = find_column_containing(h, ['year','birth'])

# Life expectancy columns
life_val_col = find_column_containing(le, ['life','expectancy'])
life_entity_col = find_column_containing(le, ['entity','country'])
life_year_col = find_column_containing(le, ['year'])

print(f"Height columns: {height_entity_col}, {height_year_col}, {height_val_col}")
print(f"Life columns: {life_entity_col}, {life_year_col}, {life_val_col}")

In [None]:
# Get most recent data for each country
# For height: pick most recent birth cohort per country
h_latest = h.sort_values(by=height_year_col).groupby(height_entity_col, as_index=False).last()
h_latest = h_latest[[height_entity_col, height_val_col]].rename(columns={
    height_entity_col: 'Country',
    height_val_col: 'AvgHeight'
})

# For life expectancy: pick most recent year per country
le_latest = le.sort_values(by=life_year_col).groupby(life_entity_col, as_index=False).last()
le_latest = le_latest[[life_entity_col, life_val_col]].rename(columns={
    life_entity_col: 'Country',
    life_val_col: 'LifeExpectancy'
})

# Merge datasets
df = pd.merge(h_latest, le_latest, on='Country', how='inner')
df = df.dropna(subset=['AvgHeight', 'LifeExpectancy']).reset_index(drop=True)

print(f"Merged dataset: {len(df)} countries")
df.head()

---
## Step 2 | Data Exploration

In [None]:
# Summary statistics
print("Summary Statistics:")
print(df[['AvgHeight', 'LifeExpectancy']].describe().round(2))

In [None]:
# Correlation
correlation = df['AvgHeight'].corr(df['LifeExpectancy'])
print(f"Correlation between height and life expectancy: {correlation:.3f}")

---
## Step 3 | Visualization

In [None]:
# Scatter plot with regression line
plt.figure(figsize=(10, 6))
plt.scatter(df['AvgHeight'], df['LifeExpectancy'], alpha=0.7)

# Add regression line
X = sm.add_constant(df['AvgHeight'])
model = sm.OLS(df['LifeExpectancy'], X).fit()
x_sorted = np.linspace(df['AvgHeight'].min(), df['AvgHeight'].max(), 200)
y_line = model.params['const'] + model.params['AvgHeight'] * x_sorted
plt.plot(x_sorted, y_line, 'r-', linewidth=2, label='Regression Line')

plt.xlabel('Average Height (cm)')
plt.ylabel('Life Expectancy (years)')
plt.title('Life Expectancy vs Average Height Across Countries')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---
## Step 4 | Statistical Analysis

In [None]:
# OLS Regression
X = sm.add_constant(df['AvgHeight'])
y = df['LifeExpectancy']
model = sm.OLS(y, X).fit()

print(model.summary())

In [None]:
# Residual plot
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Residuals vs fitted
axes[0].scatter(model.fittedvalues, model.resid, alpha=0.7)
axes[0].axhline(0, color='k', linewidth=0.8)
axes[0].set_xlabel('Fitted Life Expectancy')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted')

# Histogram of residuals
axes[1].hist(model.resid, bins=20, edgecolor='black')
axes[1].set_xlabel('Residuals')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Residuals')

plt.tight_layout()
plt.show()

In [None]:
# Key results
print("\n" + "="*50)
print("KEY RESULTS")
print("="*50)
print(f"Intercept: {model.params['const']:.2f}")
print(f"Height coefficient: {model.params['AvgHeight']:.4f}")
print(f"\nInterpretation:")
print(f"  Each 1 cm increase in average height is associated with")
print(f"  {model.params['AvgHeight']:.2f} years higher life expectancy")
print(f"\nR-squared: {model.rsquared:.3f}")
print(f"P-value: {model.pvalues['AvgHeight']:.2e}")

---
## Step 5 | Results Interpretation

### Key Findings

1. **Positive Correlation:** Countries with taller populations tend to have higher life expectancy

2. **Effect Size:** Each 1 cm in average height is associated with approximately 0.5-1 year higher life expectancy

3. **Moderate RÂ²:** Height explains a moderate portion of cross-country variation in life expectancy

### Critical Caution: Confounding!

This relationship is almost certainly **NOT causal**. Both height and life expectancy are driven by:

- **Nutrition:** Better nutrition leads to both taller people AND longer lives
- **Healthcare:** Better healthcare improves child development AND longevity
- **Economic Development:** GDP affects nutrition, healthcare, height, and life expectancy
- **Disease burden:** Infectious diseases in childhood stunt growth AND reduce lifespan

### The Real Story

Height is a **proxy** for development and nutrition quality. The correlation captures the fact that developed countries have:
- Better childhood nutrition (taller populations)
- Better healthcare (longer lives)

Height itself likely has minimal causal effect on longevity. In fact, within populations, taller individuals may have *slightly lower* life expectancy!

---
## Replication Exercises

### Exercise 1: Control for GDP
Add GDP per capita as a control variable. Does the height coefficient change?

### Exercise 2: Time Trends
How have global heights and life expectancy changed over time? Plot time series.

### Exercise 3: Regional Analysis
Does the relationship differ by continent?

### Challenge Exercise
Research the relationship between height and longevity within populations. What do individual-level studies find?

In [None]:
# Your code for exercises

# Example: Time trend in global life expectancy
# world_le = le[le[life_entity_col] == 'World']
# plt.plot(world_le[life_year_col], world_le[life_val_col])
# plt.xlabel('Year')
# plt.ylabel('Life Expectancy')
# plt.title('Global Life Expectancy Over Time')
# plt.show()