# ECON 0150 | Replication Notebook

**Title:** Room Sizes and Housing Prices

**Original Authors:** Arthur

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis.

## About This Replication

**Research Question:** Is the average number of rooms per dwelling in a census tract (RM) positively and linearly related to the tract's median home value (MEDV)?

**Data Source:** Boston Housing Dataset (506 observations)

**Methods:** OLS regression: MEDV ~ RM (Median Value ~ Average Rooms)

**Main Finding:** Strong positive relationship: each additional room is associated with a median home value roughly $9,000 higher (p < 0.001, R² = 0.48).

**Course Concepts Used:**
- Simple linear regression
- Scatter plots with regression lines
- Correlation analysis
- Residual analysis
- Classic econometric dataset
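
The regression listed under Methods can be written as the model below. The notation ($\beta_0$, $\beta_1$, $\varepsilon_i$, and the tract index $i$) is standard notation assumed here, not taken from the original project.

$$
\text{MEDV}_i = \beta_0 + \beta_1 \, \text{RM}_i + \varepsilon_i
$$

Because MEDV is measured in $1000s, $\beta_1$ is read as thousands of dollars per additional room, and the hypothesis test in Step 4 is of $\beta_1 = 0$.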

---
## Step 0 | Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

In [None]:
# Load data from course website
base_url = 'https://tayweid.github.io/econ-0150/projects/replications/0062/data/'

data = pd.read_csv(base_url + 'boston-housing-dataset.csv')

print(f"Number of observations: {len(data)}")
data.head()

---
## Step 1 | Data Preparation

In [None]:
# Check columns
print("Columns:", data.columns.tolist())
print("\nKey variables:")
print("  RM: Average number of rooms per dwelling")
print("  MEDV: Median value of owner-occupied homes ($1000s)")

In [None]:
# Check for missing values
print("\nMissing values:")
print(data[['RM', 'MEDV']].isnull().sum())

# Keep tracts with MEDV above 5 (i.e., $5,000) and at least 3 rooms on average
data_clean = data[(data['MEDV'] > 5) & (data['RM'] >= 3)]

print(f"\nOriginal rows: {len(data)}")
print(f"Cleaned rows: {len(data_clean)}")

---
## Step 2 | Data Exploration

In [None]:
# Summary statistics
print("Summary Statistics:")
print(data_clean[['RM', 'MEDV']].describe())

In [None]:
# Correlation
correlation = data_clean['RM'].corr(data_clean['MEDV'])
print(f"\nPearson correlation between RM and MEDV: {correlation:.4f}")

---
## Step 3 | Visualization

In [None]:
# Distribution of rooms
plt.figure(figsize=(10, 6))
sns.histplot(data_clean['RM'], kde=True, bins=30)
plt.title('Distribution of Average Number of Rooms (RM)')
plt.xlabel('Average Number of Rooms')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Scatter plot with regression line
plt.figure(figsize=(10, 6))
sns.regplot(x='RM', y='MEDV', data=data_clean,
            scatter_kws={'alpha': 0.6}, line_kws={'color': 'red'})
plt.title('Relationship: Average Rooms (RM) vs. Median Home Price (MEDV)')
plt.xlabel('Average Number of Rooms (RM)')
plt.ylabel('Median Home Price (MEDV in $1000s)')
plt.grid(True, alpha=0.3)
plt.show()

---
## Step 4 | Statistical Analysis

In [None]:
# OLS Regression
model = smf.ols('MEDV ~ RM', data=data_clean).fit()
print("OLS Regression: MEDV ~ RM")
print(model.summary())

In [None]:
# Residual plot
fitted_vals = model.fittedvalues
residuals = model.resid

plt.figure(figsize=(10, 6))
sns.scatterplot(x=data_clean['RM'], y=residuals, alpha=0.6)
plt.axhline(y=0, color='red', linestyle='--', linewidth=2)
plt.title('Residual Plot (RM vs. Residuals)')
plt.xlabel('Average Number of Rooms (RM)')
plt.ylabel('Residuals')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Key results
print("\n" + "="*50)
print("KEY RESULTS")
print("="*50)
print("\nNull Hypothesis: Room count does not predict home price (beta = 0)")
print(f"\nRegression Equation: MEDV = {model.params['Intercept']:.2f} + {model.params['RM']:.2f}(RM)")
print("\nModel Results:")
print(f"  Intercept: {model.params['Intercept']:.2f} ($1000s)")
print(f"  RM coefficient: {model.params['RM']:.2f} ($1000s per room)")
# Use general format so a very small p-value is not printed as 0.0000
print(f"  P-value: {model.pvalues['RM']:.3g}")
print(f"  R-squared: {model.rsquared:.4f}")
print("\nInterpretation:")
print(f"  Each additional room is associated with ${model.params['RM']*1000:,.0f} higher median price")
print(f"  Room count explains {model.rsquared*100:.1f}% of price variation")

---
## Step 5 | Results Interpretation

### Key Findings

| Statistic | Value |
|-----------|-------|
| RM Coefficient | ~$9,040 per room |
| P-value | < 0.001 |
| R-squared | 0.48 |
| Correlation | 0.70 |

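1. **Strong Positive Relationship:** Tracts with more rooms per dwelling have higher median home values

2. **Good Explanatory Power:** RM alone explains ~48% of the variance in MEDV

3. **Highly Significant:** The p-value is below 0.001, so the null hypothesis of no relationship (beta = 0) is rejected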
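
The table reports both a correlation (0.70) and an R-squared (0.48). With a single predictor these are linked: R-squared equals the squared Pearson correlation, which is why 0.70² lands close to 0.48. The sketch below checks this on synthetic data. The variable names and the simulated numbers are illustrative assumptions, not values from the Boston Housing dataset.

```python
import numpy as np

# Synthetic data for illustration only (NOT the Boston Housing data)
rng = np.random.default_rng(0)
x = rng.normal(6.0, 0.7, size=500)                   # stand-in for RM
y = -30.0 + 9.0 * x + rng.normal(0, 6.5, size=500)   # stand-in for MEDV

# OLS slope and intercept for a single predictor
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

# R-squared: 1 - (residual sum of squares / total sum of squares)
resid = y - (intercept + slope * x)
r_squared = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}, R^2 = {r_squared:.4f}")
```

The same check can be run in this notebook by comparing `correlation ** 2` with `model.rsquared`.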

### The Boston Housing Dataset

This is a classic dataset from the 1970s used in many econometrics textbooks:
- **506 census tracts** in the Boston area
- **MEDV** is the key outcome variable (median value of owner-occupied homes, in $1000s)
- Other features include crime rate, highway access, pupil-teacher ratio

### Why Does RM Predict Price?

- **Size proxy:** More rooms generally mean a larger home
- **Family needs:** Families pay a premium for additional bedrooms
- **Quality signal:** A higher room count often correlates with higher construction quality
- **Land value:** Larger homes require more land

### Residual Analysis

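The residual plot shows some heteroskedasticity: the variance of the residuals increases with RM. This is common in housing data, where more expensive homes tend to have more variable prices. Heteroskedasticity does not bias the coefficient estimate, but it can make the conventional OLS standard errors unreliable.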
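
A visual impression of heteroskedasticity can be backed by a formal test. The sketch below hand-rolls a Breusch-Pagan style LM test (the studentized form, n times the R-squared of an auxiliary regression) using only NumPy. The data are synthetic, built so that the error spread grows with the predictor. The helper `ols_fit` and all variable names are hypothetical, and this is not a procedure from the original project.

```python
import math
import numpy as np

# Synthetic data for illustration only: the error spread grows with x
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(4.0, 8.5, size=n)                     # stand-in for RM
y = -30.0 + 9.0 * x + rng.normal(0, 1.0, size=n) * x  # noise sd proportional to x

def ols_fit(X, y):
    """Return OLS coefficients and residuals for design matrix X (hypothetical helper)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

# Main regression: y on a constant and x
X = np.column_stack([np.ones(n), x])
_, resid = ols_fit(X, y)

# Auxiliary regression: squared residuals on the same regressors
u2 = resid ** 2
_, aux_resid = ols_fit(X, u2)
aux_r2 = 1 - np.sum(aux_resid ** 2) / np.sum((u2 - u2.mean()) ** 2)

# LM statistic = n * R^2 of the auxiliary regression; chi-squared with 1 df here
lm_stat = n * aux_r2
p_value = math.erfc(math.sqrt(lm_stat / 2))  # chi-squared(1) upper-tail probability

print(f"LM statistic = {lm_stat:.2f}, p-value = {p_value:.3g}")
```

A small p-value is evidence against constant error variance. In practice, statsmodels offers `het_breuschpagan` in `statsmodels.stats.diagnostic`, and passing `cov_type='HC3'` to `.fit()` gives heteroskedasticity-robust standard errors.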

---
## Replication Exercises

### Exercise 1: Other Predictors
Which other variables in the dataset predict MEDV? Try LSTAT (percentage of lower-status population) or CRIM (per capita crime rate).

### Exercise 2: Multiple Regression
Build a model with RM and one other predictor. How does R² change?

### Exercise 3: Log Transformation
Try log(MEDV) ~ RM. Does this improve the residual pattern?

### Challenge Exercise
Explore the correlation matrix. Which variables are highly correlated with each other?

In [None]:
# Your code for exercises

# Example: Correlation matrix
# print(data_clean[['MEDV', 'RM', 'LSTAT', 'CRIM', 'PTRATIO']].corr())