# ECON 0150 | Replication Notebook

**Title:** MLB Payroll and Attendance

**Original Authors:** Lis; Fernandez

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis.

## About This Replication

**Research Question:** To what extent do payroll and fan attendance predict the number of wins for MLB teams?

**Data Source:** MLB team statistics including payroll, attendance, and wins

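**Methods:** Multiple regression: W_Record ~ Payroll_Pct + Average_Attendance (wins on payroll percentage and average attendance in thousands), with HC3 robust standard errors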

**Main Finding:** Both payroll and attendance positively predict wins (Payroll: coef = 2.06, p < 0.001; Attendance: coef = 0.44, p < 0.001).

**Course Concepts Used:**
- Multiple regression
- Categorical variables (attendance groups)
- Scatter plots with group coloring
- Residual analysis

---
## Step 0 | Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

In [None]:
# Load data from course website
base_url = 'https://tayweid.github.io/econ-0150/projects/replication/replications/0045/data/'

data = pd.read_csv(base_url + 'example_data.csv')

print(f"Number of observations: {len(data)}")
data.head()

---
## Step 1 | Data Preparation

In [None]:
# Check and clean column names
data.columns = data.columns.str.replace(' ', '_')
print("Columns:", data.columns.tolist())

In [None]:
# Rename columns for easier use
if 'Avg._Attendance' in data.columns:
    data = data.rename(columns={'Avg._Attendance': 'Average_Attendance'})
if 'Payroll_%' in data.columns:
    data = data.rename(columns={'Payroll_%': 'Payroll_Pct'})

# Clean attendance column: remove thousands separators, convert to numeric,
# and rescale to thousands of fans per game
if 'Average_Attendance' in data.columns:
    data['Average_Attendance'] = data['Average_Attendance'].astype(str).str.replace(',', '', regex=False).astype(float) / 1000

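# Remove the 2020 season (COVID-shortened) and rows with no recorded wins
if 'Year' in data.columns and 'W_Record' in data.columns:
    # .copy() avoids a SettingWithCopyWarning when columns are added later
    data = data[(data['Year'] != 2020) & (data['W_Record'] > 0)].copy()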

print(f"Cleaned data: {len(data)} observations")
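
The cleaning steps above can be checked on a tiny frame before running them on the full file. The sketch below is only an illustration: the raw column names (`Avg. Attendance`, `Payroll %`, `W Record`) and every value are hypothetical, chosen to mimic what the renaming logic appears to expect.

```python
import pandas as pd

# Hypothetical raw rows; names and values are made up for illustration
toy = pd.DataFrame({
    'Year': [2019, 2020, 2021],
    'Avg. Attendance': ['36,090', '0', '18,901'],
    'Payroll %': [4.1, 3.9, 2.2],
    'W Record': [92, 30, 77],
})

# Same steps as the notebook: underscores, friendlier names
toy.columns = toy.columns.str.replace(' ', '_')
toy = toy.rename(columns={'Avg._Attendance': 'Average_Attendance',
                          'Payroll_%': 'Payroll_Pct'})

# Strip commas, convert to float, rescale to thousands of fans
toy['Average_Attendance'] = (
    toy['Average_Attendance']
    .astype(str)
    .str.replace(',', '', regex=False)
    .astype(float) / 1000
)

# Drop the 2020 season and any row without wins
toy = toy[(toy['Year'] != 2020) & (toy['W_Record'] > 0)].copy()
print(toy)
```

After this runs, attendance is stored in thousands, which is why the regression coefficient on `Average_Attendance` reads as wins per 1,000 fans.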

---
## Step 2 | Data Exploration

In [None]:
# Summary statistics
cols_to_describe = [c for c in ['Payroll_Pct', 'Average_Attendance', 'W_Record', 'W-L_Record'] if c in data.columns]
print("Summary Statistics:")
print(data[cols_to_describe].describe())

In [None]:
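# Create attendance groups (Average_Attendance is in thousands of fans per game)
if 'Average_Attendance' in data.columns:
    # Open-ended top bin so values above 50k land in '35k+' instead of becoming NaN
    bins = [0, 15, 25, 35, np.inf]
    labels = ['< 15k', '15-25k', '25-35k', '35k+']
    data['Attendance_Group'] = pd.cut(data['Average_Attendance'], bins=bins, labels=labels)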
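
Two `pd.cut` behaviors matter for these groups: intervals are closed on the right by default, and any value outside the outermost edges is returned as NaN. The sketch below uses hypothetical attendance values (in thousands) to compare a top edge fixed at 50 with an open-ended top edge.

```python
import numpy as np
import pandas as pd

labels = ['< 15k', '15-25k', '25-35k', '35k+']

# Hypothetical average attendance values, in thousands of fans
attendance = pd.Series([9.8, 15.0, 24.9, 36.1, 52.3])

# Top edge fixed at 50: anything above 50 gets no label
capped = pd.cut(attendance, bins=[0, 15, 25, 35, 50], labels=labels)

# Open-ended top edge: everything above 35 is labeled '35k+'
open_top = pd.cut(attendance, bins=[0, 15, 25, 35, np.inf], labels=labels)

print(pd.DataFrame({'attendance': attendance,
                    'capped': capped,
                    'open_top': open_top}))
```

Note that 15.0 falls in `< 15k`, because the first interval is (0, 15]. Passing `right=False` to `pd.cut` would flip that convention if left-closed bins are preferred.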

---
## Step 3 | Visualization

In [None]:
# Scatter plot: payroll vs. win % (W-L_Record), colored by attendance group, with a pooled fit line
if 'Payroll_Pct' in data.columns and 'W-L_Record' in data.columns:
    plt.figure(figsize=(12, 6))
    sns.scatterplot(
        data=data,
        x='Payroll_Pct',
        y='W-L_Record',
        hue='Attendance_Group' if 'Attendance_Group' in data.columns else None,
        alpha=0.7
    )
    sns.regplot(
        data=data,
        x='Payroll_Pct',
        y='W-L_Record',
        scatter=False,
        color='black',
        ci=None
    )
    plt.title('Payroll %, Win %, and Attendance')
    plt.xlabel('Payroll as % of League Average')
    plt.ylabel('Win %')
    plt.grid(True, alpha=0.3)
    plt.show()

---
## Step 4 | Statistical Analysis

In [None]:
# Model 1: Wins ~ Payroll
if 'Payroll_Pct' in data.columns and 'W_Record' in data.columns:
    model_1 = smf.ols('W_Record ~ Payroll_Pct', data=data).fit()
    print("Model 1: Wins ~ Payroll")
    print(model_1.summary().tables[1])

In [None]:
# Model 2: Wins ~ Attendance
if 'Average_Attendance' in data.columns and 'W_Record' in data.columns:
    model_2 = smf.ols('W_Record ~ Average_Attendance', data=data).fit()
    print("\nModel 2: Wins ~ Attendance")
    print(model_2.summary().tables[1])

In [None]:
# Model 3: Multiple regression
if all(c in data.columns for c in ['Payroll_Pct', 'Average_Attendance', 'W_Record']):
    model_3 = smf.ols('W_Record ~ Payroll_Pct + Average_Attendance', data=data).fit(cov_type='HC3')
    print("\nModel 3: Wins ~ Payroll + Attendance (with robust SEs)")
    print(model_3.summary().tables[1])

In [None]:
# Key results
print("\n" + "="*50)
print("KEY RESULTS")
print("="*50)
if 'model_3' in globals():
    print("\nMultiple Regression Model:")
    print(f"  Intercept: {model_3.params['Intercept']:.2f} wins")
    print(f"  Payroll coefficient: {model_3.params['Payroll_Pct']:.2f} (p = {model_3.pvalues['Payroll_Pct']:.4f})")
    print(f"  Attendance coefficient: {model_3.params['Average_Attendance']:.2f} (p = {model_3.pvalues['Average_Attendance']:.4f})")
    print("\nInterpretation (holding the other predictor fixed):")
    print(f"  Each 1-point increase in Payroll_Pct is associated with {model_3.params['Payroll_Pct']:.1f} more wins")
    print(f"  Each additional 1,000 fans per game is associated with {model_3.params['Average_Attendance']:.1f} more wins")
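
Residual analysis is listed among the course concepts, so a minimal sketch of one way to do it may be useful here. It runs on synthetic data rather than the project file: the sample size, the coefficients used to generate `W_Record`, and the noise level are all assumptions made up for illustration, and only loosely echo the scale of the real model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the cleaned data (all numbers are made up)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'Payroll_Pct': rng.uniform(1.5, 6.0, n),
    'Average_Attendance': rng.uniform(10, 50, n),
})
toy['W_Record'] = (63 + 2.0 * toy['Payroll_Pct']
                   + 0.4 * toy['Average_Attendance']
                   + rng.normal(0, 8, n))

fit = smf.ols('W_Record ~ Payroll_Pct + Average_Attendance', data=toy).fit()

# Residual = actual wins minus predicted wins
resid = fit.resid
fitted = fit.fittedvalues

# With an intercept, OLS residuals average to zero and are
# uncorrelated with the fitted values by construction
resid_mean = resid.mean()
resid_fitted_corr = np.corrcoef(resid, fitted)[0, 1]

print(f"Mean residual: {resid_mean:.6f}")
print(f"Corr(residual, fitted): {resid_fitted_corr:.6f}")
print(f"Largest absolute residual: {resid.abs().max():.1f} wins")
```

On the real data, the same `resid` and `fittedvalues` attributes are available on `model_3`. Plotting residuals against fitted values (for example with `plt.scatter`) is the usual visual check: a funnel shape would suggest heteroskedasticity, which is one reason to report robust standard errors.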

---
## Step 5 | Results Interpretation

### Key Findings

| Variable | Coefficient | P-value |
|----------|-------------|--------|
| Intercept | ~63 wins | < 0.001 |
| Payroll % | ~2.1 | < 0.001 |
| Attendance (thousands) | ~0.4 | < 0.001 |

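1. **Payroll Matters:** Holding attendance fixed, a one-point increase in Payroll_Pct is associated with about 2.1 more wins

2. **Attendance Also Matters:** Holding payroll fixed, 1,000 more fans per game is associated with about 0.4 more wins

3. **Both Significant:** Each predictor remains statistically significant (p < 0.001) when the other is included in the model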
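
To make the table concrete, the rounded coefficients can be plugged into the fitted equation. The sketch below does only that arithmetic. The input values (a Payroll_Pct of 3.0, attendance of 25 or 30 thousand) are hypothetical, and `predicted_wins` is a helper name introduced here, not part of the original notebook.

```python
# Rounded coefficients taken from the Key Findings table
INTERCEPT = 63.0      # wins
B_PAYROLL = 2.1       # wins per one-point increase in Payroll_Pct
B_ATTENDANCE = 0.4    # wins per 1,000 fans per game


def predicted_wins(payroll_pct, attendance_thousands):
    """Predicted wins from the rounded multiple-regression equation."""
    return (INTERCEPT
            + B_PAYROLL * payroll_pct
            + B_ATTENDANCE * attendance_thousands)


# Hypothetical team: same payroll, two attendance levels
baseline = predicted_wins(3.0, 25.0)
more_fans = predicted_wins(3.0, 30.0)

print(f"Baseline prediction:   {baseline:.1f} wins")
print(f"With 5,000 more fans:  {more_fans:.1f} wins")
print(f"Difference:            {more_fans - baseline:.1f} wins")
```

The difference between the two predictions is just 5 times the attendance coefficient, which is what "holding payroll fixed" means in practice.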

### Interpretation Challenges

**Causation is complex.** The regression measures association, and at least three pathways could produce it:
- Money → Better players → More wins (causal)
- Winning → More fans → More revenue → Higher payroll (reverse)
- Market size → Both high payroll and high attendance (common cause)
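
The common-cause pathway can be illustrated with simulated data. In this sketch, which uses made-up numbers and no MLB data, a hypothetical `market` variable drives both payroll and wins, while payroll has no direct effect on wins at all. A regression of wins on payroll alone still finds a positive slope, and that slope vanishes once market size is controlled for.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Simulated world: market size is the only real driver
market = rng.normal(size=n)
payroll = market + rng.normal(size=n)        # bigger markets spend more
wins = 2.0 * market + rng.normal(size=n)     # payroll does NOT enter here


def ols_slopes(y, *regressors):
    """Least-squares slopes (intercept dropped) of y on the regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs[1:]


naive_slope = ols_slopes(wins, payroll)[0]
controlled_slope = ols_slopes(wins, payroll, market)[0]

print(f"Payroll slope, no controls:      {naive_slope:.2f}")
print(f"Payroll slope, market included:  {controlled_slope:.2f}")
```

The real data cannot be diagnosed this cleanly, because market size is not observed in the notebook and the other two pathways may operate at the same time. The simulation only shows why a significant payroll coefficient is not, by itself, evidence that spending causes wins.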

### Baseball Economics

- Revenue sharing and luxury taxes affect team strategies
- Small-market teams can compete with smart spending
- Attendance reflects local fan base, not just team quality

---
## Replication Exercises

### Exercise 1: Postseason Success
Does payroll predict postseason success? Use postseason rank as the outcome variable.

### Exercise 2: Year Fixed Effects
Add year dummies. Does the payroll effect change?
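
As a starting point, year dummies can be added in a statsmodels formula by wrapping the column in `C(...)`. The sketch below shows the mechanics on synthetic data. The three seasons, the 30 rows per season, and the numbers used to generate wins are assumptions made up for illustration, so the printed coefficients say nothing about the real answer.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic panel: 3 seasons x 30 teams (all values are made up)
rng = np.random.default_rng(1)
years = np.repeat([2017, 2018, 2019], 30)
toy = pd.DataFrame({
    'Year': years,
    'Payroll_Pct': rng.uniform(1.5, 6.0, len(years)),
    'Average_Attendance': rng.uniform(10, 50, len(years)),
})
toy['W_Record'] = (63 + 2.0 * toy['Payroll_Pct']
                   + 0.4 * toy['Average_Attendance']
                   + rng.normal(0, 8, len(years)))

pooled = smf.ols('W_Record ~ Payroll_Pct + Average_Attendance',
                 data=toy).fit()

# C(Year) treats Year as categorical: one dummy per season,
# with the first season absorbed into the intercept
with_fe = smf.ols('W_Record ~ Payroll_Pct + Average_Attendance + C(Year)',
                  data=toy).fit()

print(f"Payroll coefficient, pooled:        {pooled.params['Payroll_Pct']:.2f}")
print(f"Payroll coefficient, year dummies:  {with_fe.params['Payroll_Pct']:.2f}")
```

On the project data, replacing `toy` with `data` and comparing the two payroll coefficients answers the exercise.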

### Exercise 3: Efficiency
Calculate wins per dollar spent. Which teams are most efficient?

### Challenge Exercise
Research the Moneyball revolution. How has the relationship between spending and winning changed?

In [None]:
# Your code for exercises

# Example: wins per point of Payroll_Pct (a proxy for wins per dollar;
# assumes the data includes 'Team' and 'Year' columns)
# data['Efficiency'] = data['W_Record'] / data['Payroll_Pct']
# print(data.nlargest(10, 'Efficiency')[['Team', 'Year', 'W_Record', 'Payroll_Pct', 'Efficiency']])