# ECON 0150 | Replication Notebook

**Title:** Wages and Immigration

**Original Authors:** Asanbe

**Original Date:** Fall 2024

---

This notebook documents the analysis from a student final project in ECON 0150: Economic Data Analysis.

**Note:** The original IPUMS CPS dataset is too large to host. This notebook documents the methodology and findings, and runs the code on a smaller synthetic dataset built to match the original patterns.

## About This Replication

**Research Question:** Is there a relationship between wages earned and immigration status?

**Data Source:** IPUMS Current Population Survey (CPS) - 536,569 observations (2015-2025)

**Methods:** OLS regression: log(wage) ~ immigrant + education + age + year

**Main Finding:** Immigrant coefficient = -0.016 (approximately 1.6% lower wages), but NOT statistically significant at the 0.05 level (p = 0.061).

**Course Concepts Used:**
- Log transformation of dependent variable
- Multiple regression with controls
- Time series visualization
- Interpretation of coefficients as percent changes
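
Written out, the specification listed under Methods looks like the equation below. The symbols are notation introduced here for illustration, not taken from the original project: $\text{immigrant}_i$ is a 0/1 indicator and $\varepsilon_i$ is the error term.

$$
\log(\text{wage}_i) = \beta_0 + \beta_1\,\text{immigrant}_i + \beta_2\,\text{education}_i + \beta_3\,\text{age}_i + \beta_4\,\text{year}_i + \varepsilon_i
$$

The coefficient of interest is $\beta_1$, and the null hypothesis tested later in the notebook is $\beta_1 = 0$.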

---
## Original Data Structure

The original IPUMS CPS data contained:

| Variable | Description |
|----------|-------------|
| YEAR | Survey year (2015-2025) |
| AGE | Respondent age |
| EDUC | Education level code |
| INCWAGE | Wage and salary income |
| NATIVITY | 1 = Native-born, 2 = Immigrant |

Sample restrictions:
- Working-age adults (18-65)
- Positive wages
- Years 2015-2025

---
## Step 0 | Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

In [None]:
# Create synthetic data based on original analysis patterns
np.random.seed(42)
n = 5000  # Smaller sample for demonstration

# Generate data matching original patterns
year = np.random.choice(range(2015, 2026), n)
age = np.random.randint(18, 66, n)
educ = np.random.randint(50, 125, n)  # IPUMS education codes
immigrant = np.random.binomial(1, 0.025, n)  # ~2.5% immigrant

# Generate log wages with realistic patterns
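# Target scale: ~$45,000 median, log(45000) ≈ 10.7. At the mean covariates
# (EDUC 87, AGE 41.5, YEAR 2020) these coefficients give log wage ≈ 10.53 (~$37,000)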
log_wage = (
    -76.75 +  # Intercept
    0.018 * educ +  # Education premium
    0.021 * age +  # Age/experience premium
    0.042 * year +  # Time trend
    -0.016 * immigrant +  # Immigrant gap (small, not significant)
    np.random.normal(0, 0.8, n)  # Random variation
)

data = pd.DataFrame({
    'YEAR': year,
    'AGE': age,
    'EDUC': educ,
    'immigrant': immigrant,
    'log_wage': log_wage,
    'wage_dollars': np.exp(log_wage)
})

data['immigrant_label'] = np.where(data['immigrant'] == 1, 'Immigrant', 'Native-born')

print(f"Sample size: {len(data):,}")
print("\nImmigrant distribution:")
print(data['immigrant_label'].value_counts())

---
## Step 1 | Data Exploration

In [None]:
# Summary statistics
print("Summary Statistics:")
print(data[['log_wage', 'wage_dollars', 'AGE', 'EDUC']].describe().round(2))

In [None]:
# Average wages by immigrant status
print("\nAverage Wages by Immigrant Status:")
print(data.groupby('immigrant_label')['wage_dollars'].mean().round(0))

---
## Step 2 | Visualization

In [None]:
# Average wages over time by group
yearly_avg = (
    data
    .groupby(['YEAR', 'immigrant_label'])['wage_dollars']
    .mean()
    .reset_index()
)

plt.figure(figsize=(10, 6))
sns.lineplot(
    data=yearly_avg,
    x='YEAR',
    y='wage_dollars',
    hue='immigrant_label',
    marker='o'
)
plt.xlabel('Year')
plt.ylabel('Average Annual Wage (USD)')
plt.title('Average Annual Wages Over Time by Immigration Status')
plt.legend(title='Status')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Wage gap over time
yearly_avg_wide = yearly_avg.pivot(
    index='YEAR',
    columns='immigrant_label',
    values='wage_dollars'
)

yearly_avg_wide['Wage Gap'] = (
    yearly_avg_wide['Native-born'] - yearly_avg_wide['Immigrant']
)

plt.figure(figsize=(10, 5))
plt.plot(yearly_avg_wide.index, yearly_avg_wide['Wage Gap'], marker='o')
plt.axhline(0, linestyle='--', color='gray')
plt.xlabel('Year')
plt.ylabel('Wage Gap (Native-born − Immigrant) in USD')
plt.title('Wage Gap Between Native-born and Immigrant Workers Over Time')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---
## Step 3 | Statistical Analysis

In [None]:
# OLS Regression with controls
model = smf.ols(
    'log_wage ~ immigrant + EDUC + AGE + YEAR',
    data=data
).fit()

print(model.summary())
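
To show what the regression call above computes, here is a NumPy-only sketch of the same OLS fit. It draws its own stand-in data with a separate random generator (`rng`, an assumption of this example), so the numbers will not match the `model` output digit for digit. The standard errors use the classical homoskedastic formula.

```python
import numpy as np

# Hypothetical stand-in data, drawn the same way as the synthetic sample above.
rng = np.random.default_rng(0)
n = 5000
year = rng.integers(2015, 2026, n)
age = rng.integers(18, 66, n)
educ = rng.integers(50, 125, n)
immigrant = rng.binomial(1, 0.025, n)
log_wage = (-76.75 + 0.018 * educ + 0.021 * age + 0.042 * year
            - 0.016 * immigrant + rng.normal(0, 0.8, n))

# Design matrix: a column of ones for the intercept, then each regressor.
X = np.column_stack([np.ones(n), immigrant, educ, age, year])

# OLS chooses the coefficients that minimize the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X, log_wage, rcond=None)

# Classical standard errors: sqrt of the diagonal of sigma^2 * (X'X)^{-1}.
resid = log_wage - X @ coef
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

names = ["Intercept", "immigrant", "EDUC", "AGE", "YEAR"]
for name, b, s in zip(names, coef, se):
    print(f"{name:>9}: coef = {b: .4f}, se = {s:.4f}")
```

With about 2.5% immigrants in 5,000 rows, the standard error on `immigrant` is roughly 0.07, several times the 0.016 gap built into the data. A sample this small therefore cannot be expected to detect that gap.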

In [None]:
# Key results
beta = model.params['immigrant']
pval = model.pvalues['immigrant']

print("\n" + "="*50)
print("KEY RESULTS")
print("="*50)
print("\nNull Hypothesis: No difference in wages (beta_immigrant = 0)")
print(f"\nImmigrant coefficient: {beta:.4f}")
print(f"Approximate percent difference: {100*beta:.2f}%")
print(f"P-value: {pval:.3f}")
print(f"\nSignificant at 0.05? {'Yes' if pval < 0.05 else 'No'}")

print("\nOther coefficients:")
print(f"  Education: {model.params['EDUC']:.4f} (each unit = {100*model.params['EDUC']:.2f}% higher wages)")
print(f"  Age: {model.params['AGE']:.4f} (each year = {100*model.params['AGE']:.2f}% higher wages)")
print(f"  Year: {model.params['YEAR']:.4f} (annual wage growth = {100*model.params['YEAR']:.2f}%)")
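
The printout above reads each coefficient as `100 * beta` percent, which is a small-coefficient approximation. In a log-linear model the exact percent change is `100 * (exp(beta) - 1)`. The sketch below compares the two, using the coefficients that generate the synthetic data in Step 0. The helper names `pct_approx` and `pct_exact` are introduced here for illustration.

```python
import math

# Coefficients used to generate the synthetic data in Step 0.
coefficients = {
    "immigrant": -0.016,
    "EDUC": 0.018,
    "AGE": 0.021,
    "YEAR": 0.042,
}

def pct_approx(beta):
    """Percent change using the small-coefficient shortcut 100 * beta."""
    return 100 * beta

def pct_exact(beta):
    """Exact percent change implied by a log-linear model: 100 * (e^beta - 1)."""
    return 100 * math.expm1(beta)

for name, beta in coefficients.items():
    print(f"{name:>9}: approx = {pct_approx(beta):6.2f}%, exact = {pct_exact(beta):6.2f}%")
```

For coefficients this small the two versions agree to within about a tenth of a percentage point. The shortcut degrades as the coefficient grows: a coefficient of 0.5 implies a change of about 65%, not 50%.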

---
## Original Analysis Results

The original analysis with 536,569 observations found:

| Variable | Coefficient | P-value |
|----------|-------------|--------|
| Immigrant | -0.0159 | 0.061 |
| Education | 0.0179 | < 0.001 |
| Age | 0.0210 | < 0.001 |
| Year | 0.0420 | < 0.001 |

**Key Finding:** Immigrant workers earn approximately 1.6% less than native-born workers with similar education and age in the same survey year, but this difference is NOT statistically significant at the 0.05 level (p = 0.061).
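
The original standard error is not reported here, but it can be roughly backed out from the coefficient and p-value. The sketch below assumes a two-sided test and a normal approximation to the t distribution, which is reasonable with more than 500,000 observations. Both inputs are rounded, so the results are approximate.

```python
from statistics import NormalDist

beta = -0.0159   # reported immigrant coefficient
p_value = 0.061  # reported two-sided p-value

std_normal = NormalDist()

# A two-sided p-value p corresponds to |z| = Phi^{-1}(1 - p/2).
z = std_normal.inv_cdf(1 - p_value / 2)

# Implied standard error, since |z| = |beta| / se.
se = abs(beta) / z

# Approximate 95% confidence interval: beta +/- z_crit * se, z_crit ≈ 1.96.
z_crit = std_normal.inv_cdf(0.975)
ci_low, ci_high = beta - z_crit * se, beta + z_crit * se

print(f"implied |z| = {z:.3f}, implied se = {se:.4f}")
print(f"approx. 95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
```

The interval includes zero, which carries the same information as p > 0.05: the data are consistent with no gap, and also with a gap of roughly 3%.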

---
## Step 4 | Results Interpretation

### Key Findings

1. **Small Wage Gap:** After controlling for education, age, and survey year, immigrants earn about 1.6% less

2. **Not Statistically Significant:** p = 0.061 means we cannot reject the null hypothesis of no difference at the 0.05 level

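3. **Education and Age Matter More:** Both controls are highly significant (p < 0.001), and because they vary over many units their implied wage differences are much larger than the immigrant gap (ten years of age corresponds to about 0.21 log points higher wages)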

### Important Caveats

- **Occupation not controlled:** Immigrants may work in different occupations
- **Industry effects:** Some industries employ more immigrants
- **English proficiency:** Language skills affect wages
- **Legal status:** Dataset may not distinguish visa types
- **Selection effects:** Which immigrants are in the workforce?

### The Literature

Research on immigrant-native wage gaps generally finds:
- Initial gap exists for new immigrants
- Gap narrows with time in country
- Second generation often matches or exceeds native-born wages
- Results vary by origin country and education level

---
## Replication Exercises

### Exercise 1: Occupation Controls
If occupation data were available, how might adding it change the immigrant coefficient?

### Exercise 2: Time in Country
Research suggests wage gaps narrow over time. How would you test this?

### Exercise 3: Education Interaction
Does the immigrant gap differ by education level? Add an interaction term.

### Challenge Exercise
Research the Borjas vs. Card debate on immigration and wages. What are the key empirical disagreements?

In [None]:
# Your code for exercises

# Example: Test for interaction with education
# model_interaction = smf.ols(
#     'log_wage ~ immigrant * EDUC + AGE + YEAR',
#     data=data
# ).fit()
# print(model_interaction.summary())