# ECON 0150 | Replication Notebook

**Title:** Tuition and Enrollment

**Original Authors:** Banawan, Cooper, Voss

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis. You can run this notebook yourself to explore the data, reproduce the findings, and try the extension exercises at the end.

## About This Replication

**Research Question:** How Do Changes in In-State Tuition Affect Changes in Undergraduate Enrollment, and Does This Relationship Differ Between Public and Private Institutions?

**Data Source:** IPEDS data on tuition and enrollment for 2016 and 2023 (2,025 institutions)

**Methods:** OLS regression with arcsinh transformation and interaction term (tuition x institution type)

**Main Finding:** The estimates point toward a difference by institution type, but the evidence is weak. Private institutions show a small positive association that is not statistically significant (coef = 0.04, p = 0.22), while the interaction term for public institutions is negative (interaction = -0.12, p = 0.09), which is significant at the 10% level but not at the 5% level.

**Course Concepts Used:**
- OLS regression
- Arcsinh transformations (handling negative values)
- Interaction terms
- First-differencing to study changes over time
- Robust standard errors

---
## Step 0 | Setup

First, we import the necessary libraries and load the data.

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

In [None]:
# Load data from course website
base_url = 'https://tayweid.github.io/econ-0150/projects/replications/0050/data/'

# Load 2016 and 2023 data
data16 = pd.read_csv(base_url + 'Tuition_Enrollment_2016(HC).csv', index_col='unitid')
data23 = pd.read_csv(base_url + 'Tuition_Enrollment_2023(HC1).csv', index_col='unitid')

print(f"2016 data: {len(data16)} institutions")
print(f"2023 data: {len(data23)} institutions")

---
## Step 1 | Data Preparation

We merge the two years and calculate changes in tuition and enrollment.

In [None]:
# Select key columns from 2016 data
data16 = data16[['institution name', 'year',
       'HD2016.Control of institution',
       'IC2016_AY.Published in-state tuition and fees 2016-17',
       'DRVEF2016_RV.Undergraduate enrollment']].dropna()

# Select key columns from 2023 data
data23 = data23[['institution name', 'year',
       'HD2023.Control of institution',
       'IC2023_AY.Published in-state tuition and fees 2023-24',
       'DRVEF2023.Undergraduate enrollment']].dropna()

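# Merge on institution ID (inner join: keeps only institutions present in both years)
# Suffixes label the shared columns ('institution name', 'year') by year
data = data16.merge(data23, left_index=True, right_index=True,
                    how='inner', suffixes=('_2016', '_2023'))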
print(f"Merged data: {len(data)} institutions")
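
An inner join on `unitid` keeps only institutions that appear in both years, so it can be worth checking how many are dropped. The sketch below shows one way to do that with an outer join and `indicator=True`. The two small tables, their `unitid` values, and the column names `tuition_2016` and `tuition_2023` are made up for illustration and are not taken from the IPEDS files.

```python
import pandas as pd

# Toy stand-ins for the two yearly tables (hypothetical values)
toy16 = pd.DataFrame({'tuition_2016': [9000, 30000, 12000]},
                     index=pd.Index([101, 102, 103], name='unitid'))
toy23 = pd.DataFrame({'tuition_2023': [11000, 34000, 8000]},
                     index=pd.Index([101, 102, 104], name='unitid'))

# An outer join with indicator=True labels where each row came from
check = toy16.merge(toy23, left_index=True, right_index=True,
                    how='outer', indicator=True)
coverage = check['_merge'].value_counts()
print(coverage)

# Only rows labelled 'both' would survive an inner join
matched = check[check['_merge'] == 'both']
print(f"Matched institutions: {len(matched)}")
```

Rows labelled `left_only` or `right_only` are the institutions an inner join would discard.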

In [None]:
# Calculate changes (first differences)
data['tuition_diff'] = (data['IC2023_AY.Published in-state tuition and fees 2023-24'] - 
                        data['IC2016_AY.Published in-state tuition and fees 2016-17'])
data['enrollment_diff'] = (data['DRVEF2023.Undergraduate enrollment'] - 
                           data['DRVEF2016_RV.Undergraduate enrollment'])

# Apply arcsinh transformation (defined for zero and negative values, unlike log)
data['arcsinh_tuition_diff'] = np.arcsinh(data['tuition_diff'])
data['arcsinh_enrollment_diff'] = np.arcsinh(data['enrollment_diff'])

# Copy the 2023 control variable to a shorter column name for easier use
data['control'] = data['HD2023.Control of institution']

data[['tuition_diff', 'enrollment_diff', 'arcsinh_tuition_diff', 'arcsinh_enrollment_diff', 'control']].head()

---
## Step 2 | Data Exploration

We explore the distributions of our key variables.

In [None]:
# Distribution of changes
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(data['tuition_diff'], bins=30, edgecolor='black')
axes[0].set_xlabel('Change in Tuition (USD)')
axes[0].set_ylabel('Count')
axes[0].set_title('Distribution of Tuition Change (2016-2023)')

axes[1].hist(data['enrollment_diff'], bins=30, edgecolor='black')
axes[1].set_xlabel('Change in Enrollment')
axes[1].set_ylabel('Count')
axes[1].set_title('Distribution of Enrollment Change (2016-2023)')

plt.tight_layout()
plt.show()

In [None]:
# Count by institution type
data['control'].value_counts()

---
## Step 3 | Visualization

We visualize the relationship between tuition and enrollment changes, by institution type.

In [None]:
# Scatter plot with regression lines by institution type
sns.lmplot(data=data, x='arcsinh_tuition_diff', y='arcsinh_enrollment_diff', 
           hue='control', scatter=True, ci=None, 
           scatter_kws={'s': 30, 'alpha': 0.5})
plt.title('Enrollment Change vs. Tuition Change (2016-2023) by Institution Type')
plt.xlabel('Arcsinh(Tuition Change)')
plt.ylabel('Arcsinh(Enrollment Change)')
plt.show()

---
## Step 4 | Statistical Analysis

We run an OLS regression with an interaction term to test whether the tuition-enrollment relationship differs by institution type.

In [None]:
# Model with interaction term and robust (HC3) standard errors
# Note: keep only public and private not-for-profit institutions
# (for-profit institutions are excluded for a cleaner comparison)
data_clean = data[data['control'].isin(['Public', 'Private not-for-profit'])].copy()
data_clean['is_public'] = (data_clean['control'] == 'Public').astype(int)

model = smf.ols('arcsinh_enrollment_diff ~ arcsinh_tuition_diff * is_public', 
                data=data_clean).fit(cov_type='HC3')
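print(model.summary().tables[1])
# The coefficient table omits R-squared, so print it separately
print(f"R-squared: {model.rsquared:.3f}")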

In [None]:
# Residual plot
data_clean['residuals'] = model.resid

plt.figure(figsize=(10, 6))
plt.scatter(model.fittedvalues, model.resid, alpha=0.5)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()
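
The regression in this step is fit with `cov_type='HC3'`, which leaves the coefficients unchanged and only changes the standard errors. The sketch below illustrates that on simulated data. The variable names `x` and `y` and all numbers are made up, and the noise is heteroskedastic by construction, so this shows the mechanics only and is not a statement about the IPEDS data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
# Noise whose spread grows with |x| (heteroskedastic by construction)
y = 1.0 + 0.5 * x + rng.normal(size=n) * (0.5 + np.abs(x))
sim = pd.DataFrame({'x': x, 'y': y})

# Same model, two covariance estimators
classical = smf.ols('y ~ x', data=sim).fit()
robust = smf.ols('y ~ x', data=sim).fit(cov_type='HC3')

comparison = pd.DataFrame({'coef': classical.params,
                           'se_classical': classical.bse,
                           'se_HC3': robust.bse})
print(comparison)
```

When a residual plot fans out as fitted values change, the classical standard errors can be too small, which is one reason to report robust ones.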

---
## Step 5 | Results Interpretation

### Key Findings

**Regression Results:**
- **Arcsinh tuition change coefficient:** ~0.04 (p = 0.22)
- **Public indicator:** ~-0.74 (p = 0.17)
- **Interaction (tuition x public):** ~-0.12 (p = 0.09)

### Interpretation

1. **For private institutions**: The baseline coefficient on tuition change is small and not statistically significant (coef = 0.04, p = 0.22), so the association between tuition changes and enrollment changes cannot be distinguished from zero.

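2. **For public institutions**: The interaction term (-0.12, p = 0.09) measures how the slope for public institutions differs from the private baseline. The implied slope for public institutions is about 0.04 + (-0.12) = -0.08, a more negative relationship between tuition changes and enrollment changes. The difference is significant at the 10% level but not at the 5% level, so it should be read as suggestive.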

3. **Practical significance**: The R-squared is low (0.015), indicating that tuition changes explain only a small portion of enrollment changes. Many other factors (demographics, program quality, job market conditions) also affect enrollment.
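
In an interaction model, the slope for the group coded 1 is the baseline slope plus the interaction coefficient. Whether that sum differs from zero is a separate question from the interaction's own p-value. The sketch below shows one way to compute and test the combined slope on simulated data. The names `x` and `y`, and the true slopes of 0.5 and -0.3, are made up for illustration and are not the notebook's estimates.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
sim = pd.DataFrame({'x': rng.normal(size=n),
                    'is_public': rng.integers(0, 2, size=n)})
# Made-up truth: slope 0.5 when is_public = 0, slope 0.5 - 0.8 = -0.3 when is_public = 1
sim['y'] = (0.5 * sim['x'] - 0.8 * sim['x'] * sim['is_public']
            + rng.normal(scale=0.5, size=n))

fit = smf.ols('y ~ x * is_public', data=sim).fit(cov_type='HC3')

slope_private = fit.params['x']
slope_public = fit.params['x'] + fit.params['x:is_public']

# Test H0: baseline slope + interaction = 0
# (weight 1 on both coefficients, 0 on everything else)
contrast = fit.params.index.isin(['x', 'x:is_public']).astype(float)
public_test = fit.t_test(contrast)

print(f"Private slope: {slope_private:.3f}")
print(f"Public slope:  {slope_public:.3f}")
print(public_test)
```

Applied to the notebook's `model`, the same contrast would use the names `arcsinh_tuition_diff` and `arcsinh_tuition_diff:is_public`.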

### Why Arcsinh Transformation?

The arcsinh (inverse hyperbolic sine) transformation is used because:
- Some institutions had enrollment *decreases* (negative values)
- Log transformation doesn't work for negative or zero values
- Arcsinh(x) ≈ log(2x) for large positive x, so it behaves similarly to log for positive values
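
These properties can be checked numerically. The sketch below applies `np.arcsinh` to a few made-up change values (not taken from the data) and compares it with log(2x) for one large value.

```python
import numpy as np

# Hypothetical changes: large decrease, small decrease, no change, and increases
changes = np.array([-5000.0, -10.0, 0.0, 10.0, 5000.0])
transformed = np.arcsinh(changes)

# Defined at zero and for negatives, and symmetric: arcsinh(-x) = -arcsinh(x)
for c, t in zip(changes, transformed):
    print(f"arcsinh({c:>8.1f}) = {t:>7.3f}")

# For large positive x, arcsinh(x) is close to log(2x)
big = 5000.0
gap = np.arcsinh(big) - np.log(2 * big)
print(f"arcsinh({big}) - log(2 * {big}) = {gap:.2e}")
```

The symmetry means a decrease of a given size is treated as the mirror image of an equal increase, which a log transformation cannot do.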

---
## Replication Exercises

Try extending this analysis with the following exercises:

### Exercise 1: Include For-Profit Institutions
The analysis above excluded for-profit institutions. Include them and add a separate indicator variable. How does the tuition-enrollment relationship differ for for-profit schools?

### Exercise 2: Regional Analysis
If you can identify institution regions (or states), test whether the relationship differs by region. Do some regions show stronger price sensitivity?

### Exercise 3: Large vs Small Schools
Create a binary variable for "large" schools (above median 2016 enrollment). Does the tuition-enrollment relationship differ by school size?

### Challenge Exercise
The analysis uses 7-year changes (2016 to 2023). What are the advantages and disadvantages of this approach compared to year-over-year analysis? What confounding factors might affect a 7-year window that wouldn't affect a 1-year window?

In [None]:
# Your code for Exercise 1: Include For-Profit Institutions


In [None]:
# Your code for Exercise 2: Regional Analysis


In [None]:
# Your code for Exercise 3: Large vs Small Schools


In [None]:
# Your code for Challenge Exercise
