# ECON 0150 | Replication Notebook

**Title:** Education and Income

**Original Authors:** Evans; Bandi

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis.

## About This Replication

**Research Question:** Is there a relationship between education level and expected income?

**Data Source:** IPUMS American Community Survey (2023) - 613,212 observations

**Methods:** OLS regression of total income on education level

**Main Finding:** Each one-unit increase in the education code (EDUC) is associated with $7,431 higher annual income (p < 0.001, R² = 0.174).

**Course Concepts Used:**
- Simple linear regression
- Outlier removal (IQR method)
- Scatter plots with regression lines
- Interpretation of coefficients

---
## Original Data Structure

The original IPUMS ACS data contained:

| Variable | Description |
|----------|-------------|
| YEAR | Survey year (2023) |
| INCTOT | Total personal income |
| EDUC | Education level code (1-11 scale) |

Data cleaning:
- Filtered to year 2023
- Removed income outliers using IQR method
- Final sample: 613,212 observations
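
The cleaned file is loaded directly in Step 0, so the outlier step itself is not shown in this notebook. The sketch below illustrates one common form of the IQR method on made-up incomes. The 1.5 multiplier, the helper name `remove_iqr_outliers`, and the toy values are all assumptions for illustration; the original project may have used different cutoffs.

```python
import pandas as pd

def remove_iqr_outliers(df, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]

# Synthetic incomes for illustration only (not the ACS data)
toy = pd.DataFrame(
    {'INCTOT': [12_000, 25_000, 31_000, 40_000, 48_000, 55_000, 62_000, 900_000]}
)
trimmed = remove_iqr_outliers(toy, 'INCTOT')

print(f"Rows before: {len(toy)}, after: {len(trimmed)}")
print(f"Largest income kept: {trimmed['INCTOT'].max():,}")
```

In this toy sample only the 900,000 value falls outside the fences. Trimming the top of the income distribution this way also compresses the dependent variable, which is worth keeping in mind when reading the regression results.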

---
## Step 0 | Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

In [None]:
# Load data from course website
base_url = 'https://tayweid.github.io/econ-0150/projects/replications/0036/data/'

data = pd.read_csv(base_url + 'cleaned_data.csv')

# Drop the unnamed index column if present
if 'Unnamed: 0' in data.columns:
    data = data.drop(columns=['Unnamed: 0'])

print(f"Number of observations: {len(data):,}")
print(f"Columns: {data.columns.tolist()}")
data.head()

---
## Step 1 | Data Exploration

In [None]:
# Summary statistics
print("Summary Statistics:")
print(data.describe().round(2))

In [None]:
# Average income by education level
print("\nAverage Income by Education Level:")
print(data.groupby('EDUC')['INCTOT'].mean().round(0))

---
## Step 2 | Visualization

In [None]:
# Scatter plot: Education vs Income
plt.figure(figsize=(10, 6))
sns.scatterplot(data=data, x='EDUC', y='INCTOT', alpha=0.3)
plt.title('Value of Higher Education')
plt.xlabel('Level of Education Attained')
plt.ylabel('Total Yearly Income')
plt.grid(True, alpha=0.3)
plt.show()

---
## Step 3 | Statistical Analysis

In [None]:
# OLS Regression
model = smf.ols('INCTOT ~ EDUC', data=data).fit()
print(model.summary())

In [None]:
# Scatter plot with regression line
plt.figure(figsize=(10, 6))
sns.scatterplot(data=data, x='EDUC', y='INCTOT', alpha=0.3, label='Actual Data')
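# A straight line needs only two points, so predict at the ends of the EDUC range
educ_range = pd.DataFrame({'EDUC': [data['EDUC'].min(), data['EDUC'].max()]})
plt.plot(educ_range['EDUC'], model.predict(educ_range), color='red', label='Regression Line')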
plt.title('OLS Regression of Income on Education')
plt.xlabel('Education Level (EDUC)')
plt.ylabel('Total Yearly Income (INCTOT)')
plt.legend(loc='upper left')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Key results
print("\n" + "="*50)
print("KEY RESULTS")
print("="*50)
print("\nNull Hypothesis: No relationship between education and income (beta = 0)")
print(f"\nIntercept: ${model.params['Intercept']:,.0f}")
print(f"Education coefficient: ${model.params['EDUC']:,.2f}")
print("\nInterpretation:")
print("  Each additional unit of education is associated with")
print(f"  ${model.params['EDUC']:,.0f} higher annual income")
print(f"\nR-squared: {model.rsquared:.3f}")
print(f"P-value: {model.pvalues['EDUC']:.2e}")
print(f"\nSignificant at 0.05? {'Yes' if model.pvalues['EDUC'] < 0.05 else 'No'}")

---
## Original Analysis Results

The original analysis with 613,212 observations found:

| Variable | Coefficient | Std Error | P-value |
|----------|-------------|-----------|--------|
| Intercept | -$11,310 | 159.96 | < 0.001 |
| EDUC | $7,431.29 | 20.70 | < 0.001 |

**R-squared:** 0.174

**Key Finding:** Each additional unit of education is associated with approximately $7,431 higher annual income. This relationship is highly statistically significant.
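
As a rough worked example of what the fitted line implies, the sketch below plugs a few EDUC codes into the reported coefficients. The codes 1, 6, and 10 are chosen only for illustration, and the intercept is the rounded value from the table, so the predictions are approximate.

```python
# Coefficients as reported in the table above (intercept is rounded)
intercept = -11_310.0
slope = 7_431.29

def predicted_income(educ):
    """Fitted value from the reported line: intercept + slope * EDUC."""
    return intercept + slope * educ

for educ in (1, 6, 10):
    print(f"EDUC = {educ:>2}: predicted income = ${predicted_income(educ):,.2f}")

# A four-unit difference in EDUC implies four times the slope
gap = predicted_income(10) - predicted_income(6)
print(f"Predicted gap between EDUC 10 and EDUC 6: ${gap:,.2f}")
```

The prediction at EDUC = 1 is negative because of the negative intercept, which may be a sign that a single straight line fits the lowest education codes poorly.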

---
## Step 4 | Results Interpretation

### Key Findings

1. **Strong Positive Relationship:** Higher education is associated with substantially higher income

2. **Large Effect Size:** Each one-unit increase in EDUC (roughly one level of schooling) is associated with about $7,400 higher annual income

3. **Moderate R²:** Education alone explains about 17% of the variation in income, so most of the variation reflects other factors

### Causal Interpretation?

A simple regression cannot separate causation from correlation. The estimated association is likely a mix of:
- **Causal effect:** Education builds human capital (skills, knowledge)
- **Signaling:** Degrees signal ability to employers
- **Selection:** Higher-ability individuals get more education AND earn more

### Limitations

- **Omitted variables:** Age, occupation, industry, and location are not controlled for
- **Nonlinear returns:** The income premium may differ across education levels, while the model imposes a single slope
- **Quality variation:** All degrees of the same level are treated equally
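
To make the nonlinear-returns point concrete, the sketch below compares a single linear slope with the step in mean income from each education level to the next. It uses synthetic data with a deliberately convex profile, so every number here is made up; the variable names `toy`, `level_means`, and `steps` are illustrative, and nothing in this block describes the actual ACS results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: income rises faster at higher education codes (assumed shape)
educ = rng.integers(1, 12, size=5_000)  # integer codes 1 through 11
income = 5_000 + 600 * educ**2 + rng.normal(0, 8_000, size=5_000)
toy = pd.DataFrame({'EDUC': educ, 'INCTOT': income})

# One slope for all levels, as in a simple linear regression of INCTOT on EDUC
slope, intercept = np.polyfit(toy['EDUC'], toy['INCTOT'], deg=1)

# Mean income at each level, and the change from one level to the next
level_means = toy.groupby('EDUC')['INCTOT'].mean()
steps = level_means.diff().dropna()

print(f"Single linear slope: {slope:,.0f} per unit of EDUC")
print("Step in mean income from the previous level:")
print(steps.round(0))
```

When the steps vary widely around the single slope, as they do in this synthetic example, the linear coefficient is better read as an average over levels than as the premium for any particular level. The same comparison could be run on `data` from Step 0.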

---
## Replication Exercises

### Exercise 1: Age Controls
How might adding age as a control change the education coefficient?

### Exercise 2: Nonlinear Returns
Create education categories and test if returns differ by level.

### Exercise 3: Gender Differences
Does the education premium differ for men and women?

### Challenge Exercise
Research the Mincer earnings equation. How does the standard model incorporate education and experience?

In [None]:
# Your code for exercises

# Example: Education categories
# data['educ_cat'] = pd.cut(data['EDUC'], bins=[0, 6, 9, 11], labels=['HS or less', 'Some College', 'College+'])
# print(data.groupby('educ_cat', observed=True)['INCTOT'].mean())