# ECON 0150 | Replication Notebook

**Title:** Housing Prices and Birth Rates

**Original Authors:** Canavan

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis.

## About This Replication

**Research Question:** How are housing prices related to birth rates across U.S. states?

**Data Source:** Zillow Home Value Index (ZHVI) and CDC birth rate data by state (2023)

**Methods:** OLS regression of fertility rate on average home value

**Main Finding:** Negative relationship: higher home values are associated with lower fertility rates (coefficient = -1.3e-05, p = 0.006, R² = 0.15).

**Course Concepts Used:**
- Simple linear regression
- Data merging from multiple sources
- Cross-state comparison
- Scatter plots with regression lines

---
## Step 0 | Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

In [None]:
# Load data from course website
base_url = 'https://tayweid.github.io/econ-0150/projects/replications/0031/data/'

birth = pd.read_csv(base_url + 'Birth rate.csv')
zillow = pd.read_csv(base_url + 'zillow.csv')

print("Birth Rate Data:")
print(birth.head())

print("\nZillow Data:")
print(zillow.head())

---
## Step 1 | Data Preparation

In [None]:
# Extract state abbreviations from Zillow MSA names
def get_state(region):
    if isinstance(region, str) and "," in region:
        return region.split(",")[-1].strip()
    return None

zillow['StateAbbr'] = zillow['RegionName'].apply(get_state)

# Keep only rows with valid state abbreviations
msa = zillow[zillow['StateAbbr'].notnull()]

# Select December 2023 home value column
date_col = '12/31/23'

# Compute average home value per state
state_values = (
    msa.groupby('StateAbbr')[date_col]
       .mean()
       .reset_index()
       .rename(columns={date_col: 'HomeValue'})
)

print(f"States with home value data: {len(state_values)}")
state_values.head()
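
The extraction and averaging steps can be checked on a small made-up table before trusting them on the full Zillow file. In the sketch below, the region names and dollar values are invented for illustration; only the `RegionName` format ("City, ST") and the `12/31/23` column name are taken from the cell above.

```python
import pandas as pd

# Hypothetical rows mimicking the Zillow layout (all values are made up)
toy = pd.DataFrame({
    'RegionName': ['United States', 'Pittsburgh, PA', 'Philadelphia, PA', 'Austin, TX'],
    '12/31/23': [350000.0, 200000.0, 300000.0, 450000.0],
})

def get_state(region):
    # Return the text after the last comma, or None when there is no comma
    if isinstance(region, str) and "," in region:
        return region.split(",")[-1].strip()
    return None

toy['StateAbbr'] = toy['RegionName'].apply(get_state)

# The national row has no comma, so it is dropped here
toy_msa = toy[toy['StateAbbr'].notnull()]

# Unweighted mean across metro areas within each state
toy_state_values = (
    toy_msa.groupby('StateAbbr')['12/31/23']
           .mean()
           .reset_index()
           .rename(columns={'12/31/23': 'HomeValue'})
)
print(toy_state_values)
```

Note that the state figure is an unweighted mean of metro areas, so a small metro counts as much as a large one.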

In [None]:
# Merge birth rate and housing data
merged = pd.merge(
    birth,
    state_values,
    left_on='STATE',
    right_on='StateAbbr',
    how='inner'
)

# Filter to 2023 data
df2023 = merged[merged['YEAR'] == 2023].copy()

print(f"Merged data: {len(df2023)} states")
df2023.head()
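
An inner merge silently drops any state that appears in only one of the two tables. One way to see what would be lost is an outer merge with `indicator=True`, sketched here on made-up rows. The column names follow the notebook; the states and values are invented, and whether the real files have unmatched keys is not something this sketch can tell you.

```python
import pandas as pd

# Hypothetical birth-rate rows (values are made up)
toy_birth = pd.DataFrame({
    'YEAR': [2023, 2023, 2023],
    'STATE': ['PA', 'TX', 'AK'],
    'FERTILITY RATE': [50.0, 60.0, 65.0],
})

# Hypothetical state-level home values (values are made up)
toy_values = pd.DataFrame({
    'StateAbbr': ['PA', 'TX', 'DC'],
    'HomeValue': [250000.0, 300000.0, 600000.0],
})

# Outer merge keeps every key; the _merge column records where each row came from
check = pd.merge(
    toy_birth,
    toy_values,
    left_on='STATE',
    right_on='StateAbbr',
    how='outer',
    indicator=True
)

# Rows that an inner merge would have dropped
unmatched = check[check['_merge'] != 'both']
print(unmatched[['STATE', 'StateAbbr', '_merge']])

n_matched = int((check['_merge'] == 'both').sum())
print(f"Matched rows: {n_matched}")
```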

---
## Step 2 | Data Exploration

In [None]:
# Summary statistics
print("Summary Statistics:")
print(df2023[['FERTILITY RATE', 'HomeValue']].describe())

In [None]:
# Correlation
correlation = df2023['FERTILITY RATE'].corr(df2023['HomeValue'])
print(f"Correlation between fertility rate and home value: {correlation:.3f}")
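
In a simple regression with one predictor and an intercept, R² equals the squared correlation coefficient. With the reported R² of 0.15 and a negative slope, the correlation printed above should therefore be roughly -0.39. The sketch below verifies the identity on synthetic data; the data-generating numbers are assumptions chosen only to resemble the scale of this analysis.

```python
import numpy as np

# Synthetic data (made up), loosely on the scale of home values and fertility rates
rng = np.random.default_rng(0)
x = rng.uniform(150_000, 700_000, size=50)
y = 60 - 1.3e-05 * x + rng.normal(0, 5, size=50)

# Pearson correlation
r = np.corrcoef(x, y)[0, 1]

# R-squared from a one-variable least-squares fit
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
r_squared = 1 - resid.var() / y.var()

print(f"r squared: {r**2:.6f}")
print(f"R-squared: {r_squared:.6f}")

# Correlation implied by the reported R-squared of 0.15 and a negative slope
implied_r = -np.sqrt(0.15)
print(f"Implied correlation: {implied_r:.3f}")  # about -0.387
```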

In [None]:
# Distribution of fertility rates
plt.figure(figsize=(8, 5))
plt.hist(df2023['FERTILITY RATE'], bins=10, edgecolor='black')
plt.xlabel('Fertility Rate (Births per 1,000 Women)')
plt.ylabel('Frequency')
plt.title('Distribution of Fertility Rates by State (2023)')
plt.tight_layout()
plt.show()

---
## Step 3 | Visualization

In [None]:
# Scatter plot with regression line
plt.figure(figsize=(10, 6))

# Scatterplot
plt.scatter(df2023['HomeValue'], df2023['FERTILITY RATE'], alpha=0.7)

# Regression line
x = df2023['HomeValue']
y = df2023['FERTILITY RATE']
coef = np.polyfit(x, y, 1)
poly_fn = np.poly1d(coef)
plt.plot(np.sort(x), poly_fn(np.sort(x)), color='red', linewidth=2, label='Trendline')

plt.xlabel('Average Home Value ($)')
plt.ylabel('Fertility Rate (Births per 1,000 Women)')
plt.title('Relationship Between Housing Prices and Fertility Rates (2023)')
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Alternative visualization with seaborn
plt.figure(figsize=(10, 6))
sns.regplot(data=df2023, x='HomeValue', y='FERTILITY RATE', ci=95)
plt.xlabel('Average Home Value ($)')
plt.ylabel('Fertility Rate (Births per 1,000 Women)')
plt.title('Housing Prices vs Fertility Rates with 95% CI')
plt.tight_layout()
plt.show()

---
## Step 4 | Statistical Analysis

In [None]:
# OLS Regression
model = smf.ols("Q('FERTILITY RATE') ~ HomeValue", data=df2023).fit()
print(model.summary())

In [None]:
# Key results
print("\n" + "="*50)
print("KEY RESULTS")
print("="*50)
print(f"Intercept: {model.params['Intercept']:.2f}")
print(f"HomeValue coefficient: {model.params['HomeValue']:.2e}")
print("\nInterpretation:")
print("  Each $100,000 increase in average home value is associated with")
print(f"  a {model.params['HomeValue'] * 100_000:+.2f} change in the fertility rate (births per 1,000 women)")
print(f"\nR-squared: {model.rsquared:.3f}")
print(f"P-value: {model.pvalues['HomeValue']:.4f}")

In [None]:
# Residual plot
plt.figure(figsize=(8, 5))
sns.residplot(x=model.fittedvalues, y=model.resid, lowess=True)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot for Fertility Rate Model')
plt.tight_layout()
plt.show()

---
## Step 5 | Results Interpretation

### Key Findings

| Metric | Value |
|--------|-------|
| HomeValue Coefficient | -1.3e-05 |
| R-squared | 0.15 |
| P-value | 0.006 |

### Interpretation

1. **Statistically Significant:** The negative relationship is statistically significant at the 1% level (p = 0.006).

2. **Effect Size:** Each $100,000 increase in average home value is associated with about 1.3 fewer births per 1,000 women.

3. **R² = 0.15:** Home values explain about 15% of the cross-state variation in fertility rates, so most of the variation is left unexplained.
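
The effect-size figure is just the per-dollar coefficient rescaled: -1.3e-05 × 100,000 = -1.3. The sketch below repeats that arithmetic and then confirms on synthetic data that measuring home value in units of $100,000 multiplies the slope by 100,000. The simulated data are made up for illustration.

```python
import numpy as np

# Rescale the reported per-dollar coefficient to a per-$100,000 change
coef_per_dollar = -1.3e-05
per_100k = coef_per_dollar * 100_000
print(f"Change per $100,000: {per_100k:.1f}")

# Synthetic data (made up) to check how the slope responds to rescaling
rng = np.random.default_rng(1)
home_value = rng.uniform(150_000, 700_000, size=50)
fertility = 60 - 1.3e-05 * home_value + rng.normal(0, 5, size=50)

# Slope with home value in dollars vs. in units of $100,000
slope_dollars, _ = np.polyfit(home_value, fertility, 1)
slope_100k, _ = np.polyfit(home_value / 100_000, fertility, 1)

print(f"Slope per dollar:   {slope_dollars:.3e}")
print(f"Slope per $100,000: {slope_100k:.3f}")
```

Rescaling changes only the units of the coefficient; the fitted line, R², and p-value stay the same.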

### Why Might Housing Costs Affect Fertility?

- High housing costs delay family formation
- Couples may have fewer children when housing is unaffordable
- Expensive areas often have higher opportunity costs for parents
- Selection effects: different types of people live in expensive vs. affordable areas (a non-causal explanation for the same pattern)

### Causal Interpretation?

A cross-sectional regression like this one shows correlation, not causation. Expensive states often have:
- Higher education levels (associated with lower fertility)
- More urban populations
- Different cultural norms around family size
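
To see how a confounder like those listed above could produce a negative slope on its own, here is a small simulation. Everything in it is an assumption made for illustration: a made-up "education" index raises home values and lowers fertility, while home value itself has no effect on fertility. It is not an estimate from the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Hypothetical confounder: a standardized education index
education = rng.normal(0, 1, size=n)

# Home value (in $100,000s) rises with education
home_value = 3 + 1.0 * education + rng.normal(0, 0.5, size=n)

# Fertility falls with education; home value has NO direct effect by construction
fertility = 60 - 3.0 * education + rng.normal(0, 1, size=n)

# Simple regression: fertility on home value only
X_simple = np.column_stack([np.ones(n), home_value])
beta_simple, *_ = np.linalg.lstsq(X_simple, fertility, rcond=None)

# Regression with the confounder included as a control
X_ctrl = np.column_stack([np.ones(n), home_value, education])
beta_ctrl, *_ = np.linalg.lstsq(X_ctrl, fertility, rcond=None)

print(f"Home value slope without control: {beta_simple[1]:.2f}")
print(f"Home value slope with control:    {beta_ctrl[1]:.2f}")
```

In this simulated world the slope without the control is clearly negative, and it shrinks toward zero once education is included. That is the pattern to look for when adding controls to the real data.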

---
## Replication Exercises

### Exercise 1: Time Trends
Extend the analysis to multiple years. Has the relationship strengthened or weakened over time?

### Exercise 2: Controls
Add state-level controls (median income, education, urbanization). Does the home value coefficient remain negative and statistically significant?

### Exercise 3: Regional Analysis
Compare the relationship across different Census regions (Northeast, South, Midwest, West).

### Challenge Exercise
Research the economics of fertility decisions. What does the literature say about housing costs and family formation?

In [None]:
# Your code for exercises

# Example: Identify high-cost, low-fertility states
# df2023['HighCost'] = df2023['HomeValue'] > df2023['HomeValue'].median()
# print(df2023.groupby('HighCost')['FERTILITY RATE'].mean())