# ECON 0150 | Replication Notebook

**Title:** NYC Housing Price Growth

**Original Author:** Chen

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis.

## About This Replication

**Research Question:** Which New York boroughs have the highest housing price growth?

**Data Source:** NYC Department of Finance Rolling Sales Data (November 2024 - October 2025)

**Methods:** Descriptive statistics by borough, a chi-square test of independence, and an OLS regression of sale price on borough

**Main Finding:** Sale price distributions differ significantly across NYC boroughs. Manhattan and Brooklyn have higher median prices than the other boroughs.

**Course Concepts Used:**
- Data cleaning and preparation
- Combining multiple data sources
- Chi-square test
- Descriptive statistics by group

---
## Step 0 | Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from scipy.stats import chi2_contingency

In [None]:
# Load data from course website
base_url = 'https://tayweid.github.io/econ-0150/projects/replications/0011/data/'

# Load sales data for each borough
manhattan = pd.read_csv(base_url + 'manhattan_sales.csv')
bronx = pd.read_csv(base_url + 'bronx_sales.csv')
brooklyn = pd.read_csv(base_url + 'brooklyn_sales.csv')
queens = pd.read_csv(base_url + 'queens_sales.csv')
staten = pd.read_csv(base_url + 'statenisland_sales.csv')

print(f"Manhattan: {len(manhattan)} sales")
print(f"Bronx: {len(bronx)} sales")
print(f"Brooklyn: {len(brooklyn)} sales")
print(f"Queens: {len(queens)} sales")
print(f"Staten Island: {len(staten)} sales")

---
## Step 1 | Data Preparation

In [None]:
# Add borough labels and combine
manhattan['borough_name'] = 'Manhattan'
bronx['borough_name'] = 'Bronx'
brooklyn['borough_name'] = 'Brooklyn'
queens['borough_name'] = 'Queens'
staten['borough_name'] = 'Staten Island'

# Combine all boroughs
data = pd.concat([manhattan, bronx, brooklyn, queens, staten], ignore_index=True)

print(f"Combined dataset: {len(data)} sales")
data.head()

In [None]:
# Clean sale price - convert to numeric and remove $0 sales
data['sale_price'] = pd.to_numeric(data['sale_price'], errors='coerce')
data = data[data['sale_price'] > 0].copy()

# Parse sale date and drop rows where it is missing or unparseable
data['sale_date'] = pd.to_datetime(data['sale_date'], errors='coerce')
data = data.dropna(subset=['sale_date']).copy()

# Extract the year after dropping missing dates so it stays an integer
data['year'] = data['sale_date'].dt.year

print(f"Cleaned dataset: {len(data)} sales")

---
## Step 2 | Data Exploration

In [None]:
# Summary statistics by borough
data.groupby('borough_name')['sale_price'].describe()

In [None]:
# Median sale price by borough and year
borough_stats = data.groupby(['borough_name', 'year'])['sale_price'].median().reset_index()
borough_stats
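
The research question asks about growth, while the table above reports median price levels. One way to turn those medians into growth rates is a percent change within each borough. The sketch below is a minimal illustration, not part of the original analysis: the helper name `median_growth` is hypothetical, and the small table stands in for `borough_stats` with made-up numbers.

```python
import pandas as pd

def median_growth(stats: pd.DataFrame) -> pd.DataFrame:
    """Year-over-year percent change in median sale price within each borough.

    Expects columns 'borough_name', 'year', 'sale_price'
    (the layout of borough_stats above).
    """
    out = stats.sort_values(['borough_name', 'year']).copy()
    out['growth_pct'] = out.groupby('borough_name')['sale_price'].pct_change() * 100
    return out

# Illustrative numbers only, NOT values from the Rolling Sales data
toy_stats = pd.DataFrame({
    'borough_name': ['A', 'A', 'B', 'B'],
    'year': [2024, 2025, 2024, 2025],
    'sale_price': [500_000, 550_000, 800_000, 780_000],
})

growth = median_growth(toy_stats)
print(growth)
```

The first year in each borough has no earlier value to compare against, so its growth is missing. With only part of 2024 and part of 2025 in the sample, any growth figure computed this way is a rough comparison of two partial years rather than a full annual rate.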

---
## Step 3 | Visualization

In [None]:
# Line plot: Median sale price by borough over time
plt.figure(figsize=(12, 6))
sns.lineplot(data=borough_stats, x='year', y='sale_price', hue='borough_name', marker='o')
plt.xticks(sorted(borough_stats['year'].unique()))  # one tick per calendar year
plt.xlabel('Year')
plt.ylabel('Median Sale Price ($)')
plt.title('Median Sale Price by Borough Over Time')
plt.legend(title='Borough')
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
# Box plot: Sale price distribution by borough
plt.figure(figsize=(12, 6))
order = ['Manhattan', 'Brooklyn', 'Queens', 'Bronx', 'Staten Island']
sns.boxplot(data=data, x='borough_name', y='sale_price', order=order)
# Log scale keeps the wide range of sale prices readable
plt.yscale('log')
plt.xlabel('Borough')
plt.ylabel('Sale Price (log scale)')
plt.title('Sale Price Distribution by Borough')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

---
## Step 4 | Statistical Analysis

In [None]:
# Bin sale prices into citywide terciles (low / medium / high) for the chi-square test
data['price_bin'] = pd.qcut(data['sale_price'], q=3, labels=['low', 'medium', 'high'])

# Contingency table
contingency_table = pd.crosstab(data['borough_name'], data['price_bin'])
print("Contingency Table:")
print(contingency_table)

In [None]:
# Chi-square test for independence
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"\nChi-square statistic: {chi2:.2f}")
print(f"Degrees of freedom: {dof}")
print(f"p-value: {p_value:.2e}")
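
With a large number of sales, a chi-square test can return a tiny p-value even when the association is weak, so it can help to report an effect size next to it. Cramér's V is one common choice; it was not part of the original analysis. The sketch below assumes a hypothetical helper `cramers_v` and uses a small made-up table in place of `contingency_table`.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V for an r x c contingency table (0 = no association, 1 = perfect)."""
    chi2, _, _, _ = chi2_contingency(table)
    n = table.to_numpy().sum()       # total number of observations
    k = min(table.shape) - 1         # smaller table dimension minus one
    return float(np.sqrt(chi2 / (n * k)))

# Illustrative counts only, NOT values from the Rolling Sales data
toy_table = pd.DataFrame(
    {'low': [30, 10], 'medium': [20, 20], 'high': [10, 30]},
    index=['Borough A', 'Borough B'],
)

v = cramers_v(toy_table)
print(f"Cramér's V: {v:.3f}")
```

In the notebook, the same helper could be called as `cramers_v(contingency_table)`.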

In [None]:
# Calculate percentage distribution by borough
pct_table = contingency_table.div(contingency_table.sum(axis=1), axis=0) * 100
print("\nPercentage in each price category:")
print(pct_table.round(1))

In [None]:
# OLS regression of sale price on borough (Bronx, first alphabetically, is the reference category)
model = smf.ols('sale_price ~ C(borough_name)', data=data).fit()
print(model.summary().tables[1])
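
Because the regression above uses sale price in dollars, a few very large sales can pull the borough coefficients. A common alternative, which the original project did not use, is to regress log price on borough so that coefficients read as approximate percentage differences. The sketch below uses made-up prices for two boroughs; the names `toy` and `log_model` are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative prices only, NOT values from the Rolling Sales data
toy = pd.DataFrame({
    'borough_name': ['Bronx'] * 3 + ['Manhattan'] * 3,
    'sale_price': [400_000, 500_000, 600_000, 900_000, 1_000_000, 1_200_000],
})
toy['log_price'] = np.log(toy['sale_price'])

# Same specification as above, but with log price as the outcome
log_model = smf.ols('log_price ~ C(borough_name)', data=toy).fit()

# Bronx is the reference category (first alphabetically)
coef = log_model.params['C(borough_name)[T.Manhattan]']
pct_gap = (np.exp(coef) - 1) * 100
print(f"Manhattan vs. Bronx: {pct_gap:.1f}% difference in geometric-mean price")
```

The coefficient is the difference in mean log price between the two boroughs, so `exp(coef) - 1` gives the percentage gap in geometric-mean prices.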

---
## Step 5 | Results Interpretation

### Key Findings

**Chi-Square Test:**
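- Chi-square statistic: very large relative to its 8 degrees of freedom (5 boroughs × 3 price bins; see the test output in Step 4)
- p-value: effectively zero
- **Conclusion:** Reject the null hypothesis that price category is independent of borough; price distributions differ significantly across boroughs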

**Price Distribution by Borough:**
- **Manhattan:** Highest percentage of "high" priced sales
- **Brooklyn:** Also skewed toward higher prices
- **Queens:** More balanced distribution
- **Bronx:** Skewed toward lower prices
- **Staten Island:** Concentrated in low-to-medium range

### Interpretation

The NYC housing market shows clear geographic stratification in price levels. Manhattan commands premium prices, followed by Brooklyn. The Bronx, Queens, and Staten Island tend toward more affordable pricing.

### Caveats

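- Data cover a single twelve-month window (November 2024 - October 2025), so they describe price levels better than price growth
- Different property types may dominate different boroughs, which could account for part of the price differences
- $0 sales were excluded because they likely represent ownership transfers rather than market sales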
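
The $0 filter may not catch every non-market transfer, since some deeds can record a nominal amount such as $1 or $10. As a rough robustness check, one could recompute borough medians under different minimum-price cutoffs. The sketch below is an assumption-laden illustration: the helper `median_by_threshold`, the $10,000 cutoff, and the toy data are all hypothetical and not taken from the original analysis.

```python
import pandas as pd

def median_by_threshold(df, thresholds, price_col='sale_price', group_col='borough_name'):
    """Median price per group after dropping sales at or below each threshold."""
    columns = {}
    for t in thresholds:
        kept = df[df[price_col] > t]
        columns[t] = kept.groupby(group_col)[price_col].median()
    return pd.DataFrame(columns)

# Illustrative prices only, NOT values from the Rolling Sales data
toy = pd.DataFrame({
    'borough_name': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'sale_price': [0, 10, 400_000, 600_000, 1, 700_000, 900_000],
})

# Column 0 mirrors the notebook's filter; 10_000 is an arbitrary stricter cutoff
sensitivity = median_by_threshold(toy, thresholds=[0, 10_000])
print(sensitivity)
```

If the medians move substantially between cutoffs, nominal transfers are influencing the comparison and the choice of filter deserves a mention in the write-up.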

---
## Replication Exercises

### Exercise 1: Property Types
Filter to single-family homes only. Does the borough ranking change?

### Exercise 2: Neighborhood Analysis
Which neighborhoods within each borough have the highest prices?

### Exercise 3: Price per Square Foot
Calculate price per square foot (where available). How does this change the comparison?

### Challenge Exercise
Research historical NYC housing data. How have these borough differentials changed over the past decade?

In [None]:
# Your code for exercises
