# ECON 0150 | Replication Notebook

**Title:** Bus Stops and Home Values

**Original Authors:** Weir

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis.

## About This Replication

**Research Question:** Is there a relationship between the number of bus stops and median home value in Pittsburgh neighborhoods?

**Data Source:** Pittsburgh Regional Transit Open Data and Zillow home values

**Methods:** OLS regression of median housing cost on bus stop count

**Main Finding:** Positive relationship: each additional bus stop is associated with approximately $375 higher median home value (p = 0.047).

**Course Concepts Used:**
- Simple linear regression
- Aggregation by geographic unit
- Scatter plots with regression lines
- Hypothesis testing

---
## Step 0 | Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [None]:
# Load data from course website
base_url = 'https://tayweid.github.io/econ-0150/projects/replications/0029/data/'

df = pd.read_csv(base_url + 'Transit_stops__by_route.csv', low_memory=False)

print(f"Number of rows: {len(df):,}")
print(f"Columns: {df.columns.tolist()[:10]}...")  # Show first 10 columns
df.head()

---
## Step 1 | Data Preparation

In [None]:
# Select relevant columns
data = df[['hood', 'Count', 'median_housing_cost']].copy()

# Convert to numeric
data['Count'] = pd.to_numeric(data['Count'], errors='coerce')
data['median_housing_cost'] = pd.to_numeric(data['median_housing_cost'], errors='coerce')

# Drop missing values
data = data.dropna(subset=['hood', 'Count', 'median_housing_cost'])

# Rename columns for clarity
data = data.rename(columns={
    'hood': 'Neighborhood',
    'Count': 'BusStops',
    'median_housing_cost': 'MedianHomeValue'
})

print(f"Clean data: {len(data):,} observations")
print(f"Unique neighborhoods: {data['Neighborhood'].nunique()}")

In [None]:
# Find the neighborhood with the most rows (stop records) in the cleaned data
neighborhood_counts = data['Neighborhood'].value_counts()
print(f"Neighborhood with most stop records: {neighborhood_counts.idxmax()}")
print(f"Number of stop records: {neighborhood_counts.max()}")

---
## Step 2 | Data Exploration

In [None]:
# Summary statistics
print("Summary Statistics:")
print(data[['BusStops', 'MedianHomeValue']].describe())

In [None]:
# Top neighborhoods by bus stops
top_neighborhoods = data.groupby('Neighborhood')['BusStops'].count().sort_values(ascending=False).head(15)

plt.figure(figsize=(12, 6))
plt.barh(top_neighborhoods.index, top_neighborhoods.values)
plt.xlabel('Number of Bus Stop Observations')
plt.title('Top 15 Neighborhoods by Bus Transit Coverage')
plt.tight_layout()
plt.show()

In [None]:
# Distribution of median home values
plt.figure(figsize=(8, 5))
plt.boxplot(data['MedianHomeValue'].dropna())
plt.ylabel('Median Home Value ($)')
plt.title('Distribution of Median Home Values')
plt.show()

---
## Step 3 | Visualization

In [None]:
# Scatter plot: Bus Stops vs Median Home Value
plt.figure(figsize=(10, 6))
plt.scatter(data['BusStops'], data['MedianHomeValue'], alpha=0.3)
plt.xlabel('Number of Bus Stops (Count)')
plt.ylabel('Median Home Value ($)')
plt.title('Median Home Value vs Number of Bus Stops')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Scatter plot with regression line
plt.figure(figsize=(10, 6))
sns.regplot(data=data, x='BusStops', y='MedianHomeValue', 
            scatter_kws={'alpha': 0.3}, line_kws={'color': 'red'})
plt.xlabel('Number of Bus Stops (Count)')
plt.ylabel('Median Home Value ($)')
plt.title('Median Home Value vs Number of Bus Stops with Regression Line')
plt.tight_layout()
plt.show()

---
## Step 4 | Statistical Analysis

In [None]:
# OLS Regression
X = sm.add_constant(data['BusStops'])
y = data['MedianHomeValue']

model = sm.OLS(y, X).fit()
print(model.summary())

In [None]:
# Key results
print("\n" + "="*50)
print("KEY RESULTS")
print("="*50)
print(f"Intercept: ${model.params['const']:,.2f}")
print(f"Bus Stops coefficient: ${model.params['BusStops']:.2f}")
print("\nInterpretation:")
print("  Each additional bus stop is associated with")
print(f"  ${model.params['BusStops']:.2f} higher median home value")
print(f"\nR-squared: {model.rsquared:.4f}")
print(f"P-value: {model.pvalues['BusStops']:.4f}")

---
## Step 5 | Results Interpretation

### Key Findings

| Metric | Value |
|--------|-------|
| Bus Stops Coefficient | ~$375 |
| P-value | 0.047 |
| R-squared | Low |

### Interpretation

1. **Statistically Significant:** The positive relationship between bus stops and home values is statistically significant at the 5% level, but only marginally so (p = 0.047)

2. **Effect Size:** Each additional bus stop is associated with approximately $375 higher median home value

3. **Low R²:** Bus stop count explains only a small fraction of the variation in home values
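
To make the idea of a low R² concrete, here is a minimal sketch of how R² is computed from a fitted line. The data are simulated purely for illustration (they are not the Pittsburgh data), and the slope, intercept, and noise level are assumptions chosen so that the predictor explains very little of the outcome.

```python
import numpy as np

# Simulated data for illustration only (NOT the Pittsburgh data).
# The slope, intercept, and noise level below are assumptions.
rng = np.random.default_rng(0)
x = rng.uniform(0, 50, size=500)                          # hypothetical stop counts
y = 100_000 + 375 * x + rng.normal(0, 60_000, size=500)   # noisy home values

# Fit y = b0 + b1 * x by ordinary least squares
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

# R-squared = 1 - (unexplained variation) / (total variation)
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"Estimated slope: {beta[1]:.1f}")
print(f"R-squared: {r_squared:.4f}")
```

When the noise is large relative to the slope, R² stays close to zero even though the slope itself is positive. A small R² and a significant coefficient can therefore appear together.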

### What Else Matters for Home Values?

- Neighborhood safety
- School quality
- Distance to downtown
- Housing stock age and quality
- Walkability and other amenities

### Causal Interpretation?

Does transit access *cause* higher home values, or is this correlation driven by:
- Denser areas having both more transit AND higher land values?
- Transit being built in areas that are already valuable?
- Omitted variable bias from factors like proximity to downtown?
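
The omitted variable concern can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of Pittsburgh: the variable `distance`, the helper `ols`, and every number below are hypothetical. In this simulated world, bus stops have no direct effect on home values, yet both depend on distance to downtown.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2_000

# Hypothetical confounder: distance to downtown (miles)
distance = rng.uniform(0, 10, size=n)

# Assumption: closer neighborhoods have more stops AND higher values.
# Stops do NOT enter the home value equation at all.
stops = 40 - 3 * distance + rng.normal(0, 5, size=n)
value = 300_000 - 20_000 * distance + rng.normal(0, 30_000, size=n)

def ols(y, *regressors):
    """Return least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

short_reg = ols(value, stops)            # distance omitted
long_reg = ols(value, stops, distance)   # distance included as a control

print(f"Stops coefficient, distance omitted:  {short_reg[1]:,.0f}")
print(f"Stops coefficient, distance included: {long_reg[1]:,.0f}")
```

With distance omitted, the stops coefficient is large and positive because it absorbs the effect of distance. Once distance is included, the coefficient falls toward zero, which is the true value in this simulation.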

---
## Replication Exercises

### Exercise 1: Aggregate Analysis
Aggregate data to the neighborhood level (one observation per neighborhood) and re-run the analysis.
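
One possible way to build the neighborhood-level table is sketched below on a toy DataFrame. The toy values and the column name `StopRecords` are hypothetical. The sketch assumes each row is one stop record and that median home value is constant within a neighborhood. Whether to count rows or sum the `Count` column depends on what `Count` measures, so check the data first.

```python
import pandas as pd

# Toy stand-in for the cleaned `data` frame (values are made up)
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B', 'C'],
    'MedianHomeValue': [150_000, 150_000, 150_000, 220_000, 220_000, 180_000],
})

# One row per neighborhood: count the records and keep the
# (assumed constant) median home value for that neighborhood
neighborhood_data = (
    toy.groupby('Neighborhood')
       .agg(StopRecords=('MedianHomeValue', 'size'),
            MedianHomeValue=('MedianHomeValue', 'first'))
       .reset_index()
)

print(neighborhood_data)
```

After aggregating, the regression has one observation per neighborhood, so the sample size matches the number of neighborhoods rather than the number of stop records.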

### Exercise 2: Multiple Regression
Add other predictors like distance to downtown. How does this change the bus stop coefficient?

### Exercise 3: Route Quality
Does the type of transit service (express vs. local) matter? Analyze by route type.

### Challenge Exercise
Research the economics of transit-oriented development (TOD). What does the literature say about transit and property values?

In [None]:
# Your code for exercises

# Example: Aggregate to neighborhood level
# neighborhood_data = data.groupby('Neighborhood').agg({
#     'BusStops': 'sum',
#     'MedianHomeValue': 'mean'
# }).reset_index()
# print(neighborhood_data.head())