# ECON 0150 | Replication Notebook

**Title:** Major and Income

**Original Authors:** Iskandarani; Chau

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis.

## About This Replication

**Research Question:** Does undergraduate major impact income?

**Data Source:** IPUMS American Community Survey (2023) - 1 million person sample

**Methods:** OLS regression with categorical major variable and controls for age and education level

**Main Finding:** Major significantly predicts income. Engineering (coef = 3.68) and Physical Sciences (coef = 3.63) have the highest log income premiums relative to no degree, while Fine Arts (coef = 2.86) is among the lowest. Model R² = 0.54.

**Course Concepts Used:**
- Categorical (dummy) variables
- Log transformation of income
- OLS regression with multiple predictors
- Interpreting coefficients relative to baseline
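
These concepts combine into a single regression equation. As a sketch, the specification fitted in Step 4 can be written as follows; the symbols $\beta$, $\gamma_m$, and $\varepsilon_i$ are notation introduced here for illustration and do not come from the original project:

```latex
\log(\text{INCTOT}_i) = \beta_0
  + \sum_{m \neq 0} \gamma_m \, \mathbf{1}[\text{DEGFIELD}_i = m]
  + \beta_1 \, \text{AGE}_i
  + \beta_2 \, \text{EDUC}_i
  + \varepsilon_i
```

Each $\gamma_m$ is the difference in expected log income between major $m$ and the omitted baseline (code 0), holding age and education fixed.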

---
## Step 0 | Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

In [None]:
# Load data from course website
# Note: This is a large file (97MB, 1 million observations)
base_url = 'https://tayweid.github.io/econ-0150/projects/replications/0020/data/'

df = pd.read_csv(base_url + 'usa_00046_sample.csv')

print(f"Sample size: {len(df):,} observations")
print(f"Columns: {df.columns.tolist()}")
df.head()
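
Because the file is large, it may help to read only the columns that later steps use. The sketch below shows the `usecols` argument of `pd.read_csv` on a small stand-in CSV; the `YEAR` column and the two data rows are made up for the demonstration, and the real file's full column list should be checked first.

```python
import io
import pandas as pd

# Columns referenced by the later steps of this notebook
NEEDED = ["AGE", "EDUC", "DEGFIELD", "INCTOT"]

# Stand-in for the real CSV; the YEAR column and both rows are made up
fake_csv = io.StringIO(
    "YEAR,AGE,EDUC,DEGFIELD,INCTOT\n"
    "2023,30,10,24,65000\n"
    "2023,45,6,0,28000\n"
)

# usecols keeps only the listed columns while parsing
small = pd.read_csv(fake_csv, usecols=NEEDED)
print(small.columns.tolist())
```

Exercises that need additional variables (for example sex in Exercise 1) would need those columns added to the list.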

---
## Step 1 | Data Preparation

In [None]:
# Major field codes mapping
degfield_map = {
    0: "N/A (No Degree)",
    11: "Agriculture",
    13: "Environment & Natural Resources",
    14: "Architecture",
    15: "Area & Ethnic Studies",
    19: "Communications",
    20: "Communication Tech",
    21: "Computer & Information Sciences",
    22: "Cosmetology",
    23: "Education Admin & Teaching",
    24: "Engineering",
    25: "Engineering Tech",
    26: "Linguistics & Foreign Languages",
    29: "Family & Consumer Sciences",
    32: "Law",
    33: "English Language & Literature",
    34: "Liberal Arts & Humanities",
    35: "Library Science",
    36: "Biology & Life Sciences",
    37: "Mathematics & Statistics",
    38: "Military Tech",
    40: "Interdisciplinary Studies",
    41: "Physical Fitness & Leisure",
    48: "Philosophy & Religion",
    49: "Theology",
    50: "Physical Sciences",
    51: "Nuclear & Biological Tech",
    52: "Psychology",
    53: "Criminal Justice & Fire",
    54: "Public Affairs & Social Work",
    55: "Social Sciences",
    56: "Construction Services",
    57: "Electrical & Mechanical Tech",
    59: "Transportation Sciences",
    60: "Fine Arts",
    61: "Medical & Health Services",
    62: "Business",
    64: "History"
}

# Add major labels
df['DEGFIELD_LABEL'] = df['DEGFIELD'].map(degfield_map)
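
`Series.map` silently returns `NaN` for any code that is missing from the dictionary, and those rows would then drop out of every `groupby` on `DEGFIELD_LABEL`. A minimal sketch of a check for this, using a hypothetical helper `unmapped_codes` and a tiny made-up example:

```python
import pandas as pd

def unmapped_codes(codes: pd.Series, mapping: dict) -> pd.Series:
    """Count rows for each non-missing code that has no entry in `mapping`."""
    missing = codes[~codes.isin(list(mapping.keys())) & codes.notna()]
    return missing.value_counts().sort_index()

# Made-up example: code 99 is not in the mapping
demo_map = {0: "N/A (No Degree)", 24: "Engineering"}
demo = pd.Series([0, 24, 99, 99, 24])
result = unmapped_codes(demo, demo_map)
print(result)
```

On the real data this would be called as `unmapped_codes(df['DEGFIELD'], degfield_map)`; an empty result means every code received a label.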

In [None]:
# Filter to positive income and create log income
df = df[df['INCTOT'] > 0].copy()
df['logINCTOT'] = np.log(df['INCTOT'])

# Drop rows missing any variable used in the regression (EDUC included)
df = df.dropna(subset=['AGE', 'EDUC', 'logINCTOT', 'DEGFIELD'])

print(f"Analysis sample: {len(df):,} observations")
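
IPUMS variables often use large numeric placeholder codes for not-applicable values, and a filter on `INCTOT > 0` alone would keep such rows. The sketch below assumes that 9999999 is the N/A code for `INCTOT` in this extract; that assumption should be verified against the extract's codebook. The helper name `flag_income_sentinels` and the demo rows are hypothetical.

```python
import numpy as np
import pandas as pd

NA_CODE = 9999999  # assumed N/A placeholder for INCTOT; verify in the codebook

def flag_income_sentinels(income: pd.Series, na_code: int = NA_CODE) -> pd.Series:
    """Return a boolean mask marking rows whose income equals the placeholder."""
    return income == na_code

# Made-up rows: one placeholder, one zero, one negative income
demo = pd.DataFrame({"INCTOT": [52000, 9999999, 0, 31000, -500]})
mask = flag_income_sentinels(demo["INCTOT"])

# Keep positive incomes that are not the placeholder, then take logs
clean = demo[(demo["INCTOT"] > 0) & ~mask].copy()
clean["logINCTOT"] = np.log(clean["INCTOT"])
print(f"Placeholder rows: {mask.sum()}, rows kept: {len(clean)}")
```

If the check finds no placeholder rows in the real data, the original filter is sufficient as written.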

---
## Step 2 | Data Exploration

In [None]:
# Average income by major
major_income = df.groupby('DEGFIELD_LABEL')['INCTOT'].mean().sort_values(ascending=False)
print("Mean income by major field:")
print(major_income.apply(lambda x: f"${x:,.0f}"))
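
Income is typically right-skewed, so a few very high earners can pull a group's mean well above what a typical member earns. A minimal sketch that reports the median and group size next to the mean, using a hypothetical helper `income_summary` and made-up numbers:

```python
import pandas as pd

def income_summary(data: pd.DataFrame, group_col: str, income_col: str) -> pd.DataFrame:
    """Mean, median, and count of income per group, sorted by median."""
    summary = data.groupby(group_col)[income_col].agg(["mean", "median", "count"])
    return summary.sort_values("median", ascending=False)

# Made-up incomes: one very high earner inflates the Engineering mean
demo = pd.DataFrame({
    "DEGFIELD_LABEL": ["Engineering"] * 3 + ["Fine Arts"] * 3,
    "INCTOT": [60000, 80000, 400000, 20000, 30000, 40000],
})
summary = income_summary(demo, "DEGFIELD_LABEL", "INCTOT")
print(summary)
```

On the real data the call would be `income_summary(df, 'DEGFIELD_LABEL', 'INCTOT')`. The count column also shows which majors have too few observations for a stable average.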

In [None]:
# Distribution of majors
plt.figure(figsize=(12, 8))
major_counts = df['DEGFIELD_LABEL'].value_counts()
major_counts.plot(kind='barh')
plt.xlabel('Count')
plt.ylabel('Major Field')
plt.title('Distribution of Major Fields in Sample')
plt.tight_layout()
plt.show()

In [None]:
# Log income distribution
plt.figure(figsize=(10, 5))
plt.hist(df['logINCTOT'], bins=50, edgecolor='black')
plt.xlabel('Log Income')
plt.ylabel('Frequency')
plt.title('Distribution of Log Total Income')
plt.show()

---
## Step 3 | Visualization

In [None]:
# Income by age (fixed seed so the 10,000-row subsample is reproducible)
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df.sample(10000, random_state=0), x='AGE', y='logINCTOT', alpha=0.3)
plt.xlabel('Age')
plt.ylabel('Log Income')
plt.title('Log Income vs Age (Sample of 10,000)')
plt.show()

In [None]:
# Box plot of income by major (top 10 majors by count)
top_majors = df['DEGFIELD_LABEL'].value_counts().head(10).index.tolist()
df_top = df[df['DEGFIELD_LABEL'].isin(top_majors)]

plt.figure(figsize=(12, 6))
sns.boxplot(data=df_top, x='DEGFIELD_LABEL', y='logINCTOT', showfliers=False)
plt.xticks(rotation=45, ha='right')
plt.xlabel('Major')
plt.ylabel('Log Income')
plt.title('Log Income Distribution by Major (Top 10 Fields)')
plt.tight_layout()
plt.show()

---
## Step 4 | Statistical Analysis

In [None]:
# OLS regression with categorical major variable and controls
model = smf.ols('logINCTOT ~ C(DEGFIELD) + AGE + EDUC', data=df).fit()
print(model.summary())
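
In the formula above, `C(DEGFIELD)` tells statsmodels to treat the numeric codes as categories and expand them into dummy variables, leaving one level out as the baseline. The sketch below reproduces that idea by hand on made-up numbers. It uses `pd.get_dummies` and `np.linalg.lstsq` instead of statsmodels and includes no controls, so each coefficient is simply a difference in group means from the baseline.

```python
import numpy as np
import pandas as pd

# Made-up data: three groups with mean log incomes 9.5, 11.5, and 10.5
demo = pd.DataFrame({
    "DEGFIELD": [0, 0, 24, 24, 60, 60],
    "logINCTOT": [9.0, 10.0, 11.0, 12.0, 10.0, 11.0],
})

# One 0/1 column per major; drop_first=True omits the lowest code (0) as baseline
dummies = pd.get_dummies(demo["DEGFIELD"], prefix="major", drop_first=True).astype(float)
X = np.column_stack([np.ones(len(demo)), dummies.to_numpy()])
y = demo["logINCTOT"].to_numpy()

# Least-squares fit of y on an intercept plus the dummy columns
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
coefs = pd.Series(beta, index=["Intercept"] + dummies.columns.tolist())
print(coefs)
```

The intercept equals the baseline group's mean, and each dummy coefficient is that major's mean minus the baseline mean. With `AGE` and `EDUC` added, the same coefficients become differences holding those controls fixed.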

In [None]:
# Extract major coefficients (excluding baseline)
major_coefs = model.params.filter(like='DEGFIELD')
major_pvals = model.pvalues.filter(like='DEGFIELD')

# Create summary dataframe
coef_summary = pd.DataFrame({
    'Coefficient': major_coefs,
    'P-value': major_pvals
}).sort_values('Coefficient', ascending=False)

print("\nMajor coefficients (relative to no degree baseline):")
print(coef_summary.head(15))
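
The coefficient table is indexed by names such as `C(DEGFIELD)[T.24]`, which are hard to read without the code list. A minimal sketch of relabeling them, assuming the names follow that pattern (a trailing `.0` is also accepted in case the codes were stored as floats). The helper `label_major_coefs` is hypothetical, and the demo values are the coefficients reported in this notebook.

```python
import re
import pandas as pd

def label_major_coefs(params: pd.Series, mapping: dict) -> pd.Series:
    """Rename 'C(DEGFIELD)[T.<code>]' entries using `mapping`; leave other names as-is."""
    pattern = re.compile(r"C\(DEGFIELD\)\[T\.(\d+)(?:\.0)?\]")

    def rename(name: str) -> str:
        match = pattern.fullmatch(name)
        if match is None:
            return name  # not a major dummy (e.g. AGE)
        code = int(match.group(1))
        return mapping.get(code, name)  # keep the raw name if the code is unmapped

    return params.rename(index=rename)

demo_params = pd.Series({
    "C(DEGFIELD)[T.24]": 3.68,
    "C(DEGFIELD)[T.60]": 2.86,
    "AGE": -0.023,
})
demo_map = {24: "Engineering", 60: "Fine Arts"}
labeled = label_major_coefs(demo_params, demo_map)
print(labeled)
```

On the fitted model this would be `label_major_coefs(major_coefs, degfield_map)`.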

In [None]:
# Predicted income by major
df['predicted_logINCTOT'] = model.predict(df)

# Average predicted income by major
predicted_by_major = df.groupby('DEGFIELD_LABEL')['predicted_logINCTOT'].mean().sort_values(ascending=False)

plt.figure(figsize=(12, 8))
predicted_by_major.plot(kind='barh')
plt.xlabel('Predicted Log Income')
plt.ylabel('Major')
plt.title('Average Predicted Log Income by Major')
plt.tight_layout()
plt.show()

---
## Step 5 | Results Interpretation

### Key Findings

**Model Results (R² = 0.54):**

The regression model explains 54% of the variation in log income using major field, age, and education level.

**Highest Income Majors (log income premium relative to no degree):**
1. Engineering (3.68)
2. Physical Sciences (3.63)
3. Mathematics & Statistics (3.59)
4. History (3.54)
5. Biology & Life Sciences (3.51)

**Lowest Income Majors:**
1. Physical Fitness & Leisure (2.83)
2. Fine Arts (2.86)
3. Family & Consumer Sciences (2.89)

### Interpretation

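- All major categories have positive coefficients relative to the baseline (DEGFIELD = 0, no degree field reported)
- STEM fields (engineering, physical sciences, math) have the highest income premiums, although History also ranks in the top five
- The age coefficient is negative (-0.023): holding major and education fixed, each additional year of age is associated with roughly 2.3% lower income. A single linear term cannot show income peaking and then declining; that requires a squared age term (see Exercise 2)
- The education coefficient is negative, which is counterintuitive. A likely explanation is multicollinearity: the ACS records a degree field only for respondents with a bachelor's degree or higher, so EDUC and the major dummies carry much of the same information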
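
Because the outcome is log income, the coefficients are in log points rather than dollars or percentages. A small sketch of the usual conversions, using the values reported in this notebook (the helper names `pct_effect` and `income_ratio` are hypothetical):

```python
import math

def pct_effect(log_coef: float) -> float:
    """Proportional change in income implied by a log-point coefficient."""
    return math.exp(log_coef) - 1

def income_ratio(coef_a: float, coef_b: float) -> float:
    """Predicted income of category A relative to category B, other variables equal."""
    return math.exp(coef_a - coef_b)

age_effect = pct_effect(-0.023)          # change per additional year of age
eng_vs_arts = income_ratio(3.68, 2.86)   # Engineering relative to Fine Arts

print(f"Age: {age_effect:.1%} per year")
print(f"Engineering / Fine Arts income ratio: {eng_vs_arts:.2f}")
```

Comparing two majors directly is often easier to interpret than comparing each to the baseline, because the baseline and the shared controls cancel out of the difference.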

### Caveats

- Cross-sectional data cannot establish causation
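- Selection effects: people who choose different majors may also differ in ability, background, and preferences, all of which affect income
- Occupational sorting may explain much of the major-income relationship, and occupation is not controlled for in this model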

---
## Replication Exercises

### Exercise 1: Gender Interaction
Add sex as a control and interact it with major. Do the income premiums for different majors differ by gender?

### Exercise 2: Age Profiles
Add a squared age term (AGE²) to capture non-linear income-age profiles. How does this change R²?
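
As a starting hint: in a statsmodels formula a squared term is usually written `I(AGE**2)`. Once both age coefficients are estimated, the implied peak age is $-b_1 / (2 b_2)$. The sketch below illustrates that calculation with `np.polyfit` on a made-up, noise-free profile; it does not use the ACS data or reproduce the regression.

```python
import numpy as np

# Made-up, noise-free profile that peaks at age 50 (illustration only)
age = np.arange(20, 71)
log_income = 10.0 + 0.10 * age - 0.001 * age**2

# polyfit returns coefficients from the highest power down: [b2, b1, b0]
b2, b1, b0 = np.polyfit(age, log_income, deg=2)

# A quadratic b0 + b1*x + b2*x^2 with b2 < 0 is maximized at -b1 / (2*b2)
peak_age = -b1 / (2 * b2)
print(f"Estimated peak age: {peak_age:.1f}")
```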

### Exercise 3: Specific Majors
Focus on comparing just a few majors (e.g., Business vs Engineering vs Education). What's the income gap?

### Challenge Exercise
Research the "major choice" literature in economics. What factors predict major choice, and how does this affect causal interpretation?

In [None]:
# Your code for exercises
