# ECON 0150 | Replication Notebook

**Title:** MLB Attendance and Manager Change

**Original Authors:** Chirinos, Papa, Mostofa

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis. You can run this notebook yourself to explore the data, reproduce the findings, and try the extension exercises at the end.

## About This Replication

**Research Question:** Does winning affect MLB attendance?

**Data Source:** MLB team data (150 team-season observations) including wins and attendance percentages

**Methods:** OLS regression, residual analysis

**Main Finding:** Each additional win is associated with a 0.67 percentage point increase in attendance as a share of stadium capacity, and this relationship is statistically significant (p < 0.001).

**Course Concepts Used:**
- OLS regression
- Coefficient interpretation
- Residual analysis
- R-squared interpretation

---
## Step 0 | Setup

First, we import the necessary libraries and load the data.

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

In [None]:
# Load data from course website
data_url = 'https://tayweid.github.io/econ-0150/projects/replications/0005/data/mlb_attendance.csv'
data = pd.read_csv(data_url)

# Preview the data
data.head()

In [None]:
# Check the shape and columns
print(f"Dataset has {len(data)} rows and {len(data.columns)} columns")
print(f"\nColumns: {list(data.columns)}")

---
## Step 1 | Data Exploration

Before analyzing the data, we explore its structure and key variables.

In [None]:
# Summary statistics
data.describe()

In [None]:
# Distributions of wins and attendance percent (side-by-side histograms)
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.hist(data['wins'], bins=15, edgecolor='black')
plt.xlabel('Wins')
plt.ylabel('Frequency')
plt.title('Distribution of Wins')

plt.subplot(1, 2, 2)
plt.hist(data['attend_percent'], bins=15, edgecolor='black')
plt.xlabel('Attendance Percent')
plt.ylabel('Frequency')
plt.title('Distribution of Attendance %')
plt.tight_layout()
plt.show()

---
## Step 2 | Visualization

We visualize the relationship between wins and attendance.

In [None]:
# Scatter plot of wins vs attendance percent
plt.figure(figsize=(10, 6))
plt.scatter(data['wins'], data['attend_percent'], alpha=0.6)
plt.xlabel('Wins', fontsize=12)
plt.ylabel('Attendance Percent', fontsize=12)
plt.title('MLB Wins vs Stadium Attendance Percentage', fontsize=14)
plt.grid(True, alpha=0.3)
plt.show()

---
## Step 3 | Statistical Analysis

We run an OLS regression to quantify the relationship between wins and attendance.

In [None]:
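# OLS Regression: Attendance Percent ~ Wins
model = smf.ols('attend_percent ~ wins', data=data).fit()
print(model.summary().tables[1])  # coefficient table: estimates, standard errors, p-values
print(f"R-squared: {model.rsquared:.3f}")  # share of variation in attend_percent explained by wins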
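
The fitted model object also exposes each reported statistic individually. The sketch below shows one way to read the slope, its p-value, a 95% confidence interval, and R-squared. It runs on synthetic data: the names `demo` and `demo_model` and every generated value are illustrative assumptions, not the course dataset, so the numbers will not match the replication results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (assumption): NOT the course dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame({'wins': rng.integers(50, 111, size=150)})
demo['attend_percent'] = 10 + 0.7 * demo['wins'] + rng.normal(0, 12, size=150)

demo_model = smf.ols('attend_percent ~ wins', data=demo).fit()

slope = demo_model.params['wins']        # estimated change in attend_percent per extra win
p_value = demo_model.pvalues['wins']     # two-sided p-value for the slope
r_squared = demo_model.rsquared          # share of variance explained by the model
ci_low, ci_high = demo_model.conf_int().loc['wins']  # 95% confidence interval for the slope

print(f"slope={slope:.3f}, p={p_value:.3g}, R^2={r_squared:.3f}")
print(f"95% CI for slope: [{ci_low:.3f}, {ci_high:.3f}]")
```

Swapping `demo` for `data` would apply the same attribute lookups to the real regression.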

In [None]:
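# Residual plot: residuals vs fitted values with a LOWESS trend line (red).
# Note: residplot re-fits y on x and plots those residuals; OLS residuals are
# uncorrelated with the fitted values, so model.resid is left essentially unchanged.
plt.figure(figsize=(10, 6))
sns.residplot(x=model.fittedvalues, y=model.resid, lowess=True, line_kws={'color': 'red', 'lw': 1})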
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.axhline(y=0, color='grey', linestyle='--')
plt.show()

---
## Step 4 | Results Interpretation

### Key Findings

**Regression Results:**
- **Wins coefficient:** +0.67 (p < 0.001)
- **Interpretation:** Each additional win is associated with a 0.67 percentage point increase in attendance as a share of stadium capacity
- **R-squared:** 0.229 - wins explain about 23% of the variation in attendance
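
To make the coefficient concrete, the arithmetic below scales the reported slope of 0.67 to a larger gap in wins. The helper name `predicted_gap` is hypothetical, and the calculation assumes the linear fit holds across the range of wins being compared.

```python
# Reported slope from the regression: percentage points of capacity per win.
SLOPE_PER_WIN = 0.67

def predicted_gap(extra_wins, slope=SLOPE_PER_WIN):
    """Predicted difference in attendance percent for a given difference in wins."""
    return slope * extra_wins

# A team with 10 more wins is predicted to fill roughly 6.7 more
# percentage points of its stadium capacity.
print(round(predicted_gap(10), 2))
```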

**Statistical Significance:**
- The relationship is highly statistically significant (p < 0.001)
- We can reject the null hypothesis that the slope on wins is zero, that is, that wins and attendance percent are linearly unrelated

### Conclusion

There is a statistically significant positive relationship between team performance (wins) and stadium attendance. Teams that win more games tend to fill a larger share of their stadium capacity. However, wins explain only about 23% of the variation in attendance, suggesting that other factors (market size, ticket prices, stadium quality, etc.) also play important roles.

---
## Replication Exercises

Try extending this analysis with the following exercises:

### Exercise 1: Raw Attendance
Instead of using attendance percent, use the raw attendance numbers. Does the relationship still hold? How does the interpretation change?

### Exercise 2: Identify Outliers
Which teams have unusually high or low attendance given their win totals? Calculate the residuals and identify the top 5 over-performers and under-performers.
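
As a starting point, here is a hedged sketch of ranking observations by residual. It runs on synthetic data, and the `team` column name is a hypothetical label: check `data.columns` for the identifier the real dataset uses.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (assumption): NOT the course dataset.
# The 'team' column name is hypothetical.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'team': [f'Team {i}' for i in range(30)],
    'wins': rng.integers(50, 111, size=30),
})
demo['attend_percent'] = 10 + 0.7 * demo['wins'] + rng.normal(0, 12, size=30)

demo_model = smf.ols('attend_percent ~ wins', data=demo).fit()

# Residual = actual attendance percent minus the value predicted from wins.
demo['residual'] = demo_model.resid

over_performers = demo.nlargest(5, 'residual')    # attendance furthest above prediction
under_performers = demo.nsmallest(5, 'residual')  # attendance furthest below prediction

print(over_performers[['team', 'wins', 'attend_percent', 'residual']])
print(under_performers[['team', 'wins', 'attend_percent', 'residual']])
```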

### Exercise 3: Non-linear Relationship
Add a squared term for wins (`wins + I(wins**2)`) to test whether the relationship is non-linear. Does winning have diminishing returns for attendance?
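
The formula syntax can be sketched as follows on synthetic data. The data are deliberately generated with a concave wins effect, so the negative squared term here is built in by assumption and says nothing about the real dataset. The names `demo` and `quad_model` are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (assumption): built with a concave wins effect on purpose.
rng = np.random.default_rng(2)
demo = pd.DataFrame({'wins': rng.integers(50, 111, size=150)})
demo['attend_percent'] = (
    -60 + 2.5 * demo['wins'] - 0.012 * demo['wins'] ** 2
    + rng.normal(0, 5, size=150)
)

# I(...) makes the formula parser treat wins**2 as arithmetic, not formula syntax.
quad_model = smf.ols('attend_percent ~ wins + I(wins**2)', data=demo).fit()

# Look up the squared term by prefix, since the parser may normalize its spacing.
squared_name = [name for name in quad_model.params.index if name.startswith('I(')][0]
linear_term = quad_model.params['wins']
squared_term = quad_model.params[squared_name]

# Wins level where the fitted curve peaks (only meaningful when squared_term < 0).
turning_point = -linear_term / (2 * squared_term)

print(quad_model.summary().tables[1])
print(f"Fitted curve peaks near {turning_point:.0f} wins")
```

A negative and statistically significant squared term would be consistent with diminishing returns. A turning point far outside the observed range of wins suggests the curve is merely flattening rather than reversing.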

### Challenge Exercise
The data includes multiple years. How might you test whether the wins-attendance relationship has changed over time? What would you expect to find and why?

In [None]:
# Your code for Exercise 1: Raw Attendance


In [None]:
# Your code for Exercise 2: Identify Outliers


In [None]:
# Your code for Exercise 3: Non-linear Relationship


In [None]:
# Your code for Challenge Exercise
