# ECON 0150 | Replication Notebook

**Title:** MLB Payroll Success Rate

**Original Authors:** McCollick; Holcombe

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis.

## About This Replication

**Research Question:** How well does MLB team payroll predict on-field success? Do teams with higher payrolls win more games?

**Data Source:** MLB team payroll and win/loss data (2010-2024)

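**Methods:** OLS regression of win/loss ratio on payroll: Win_Loss_Ratio ~ Total_Payroll, with variations using active 26-man roster payroll and inflation-adjusted total payroll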

**Main Finding:** Positive but weak relationship (coef = 1.95e-09 per payroll dollar, or about 0.195 in win/loss ratio per additional $100 million; p < 0.001; R² = 0.11). Active roster payroll explains more variance (R² = 0.24) than total payroll.

**Course Concepts Used:**
- Simple linear regression
- Data cleaning and transformation
- Scatter plots with regression lines
- Inflation adjustment
- Sports economics

---
## Step 0 | Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

In [None]:
# Load data from course website
base_url = 'https://tayweid.github.io/econ-0150/projects/replications/0057/data/'

data = pd.read_csv(base_url + 'mlb_payrolls.csv')

print(f"Number of team-seasons: {len(data)}")
data.head()

---
## Step 1 | Data Preparation

In [None]:
# Check columns
print("Columns:", data.columns.tolist())

In [None]:
# Create Win/Loss Ratio
data['Win_Loss_Ratio'] = data['Wins'] / data['Losses']

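# Clean payroll columns: strip "$" and "," and convert to numbers.
# errors='coerce' turns any entry that still fails to parse into NaN.
# Total payroll is required by every later step, so it is cleaned unconditionally.
data['Total_Payroll_Clean'] = pd.to_numeric(
    data['Total Payroll Allocations'].replace({r'[$,]': ''}, regex=True), errors='coerce')

# Active roster payroll is optional; later cells check for it before using it.
if 'Active 26-Man' in data.columns:
    data['Active_26_Man_Clean'] = pd.to_numeric(
        data['Active 26-Man'].replace({r'[$,]': ''}, regex=True), errors='coerce')

# Drop any rows with missing key variables
data = data.dropna(subset=['Win_Loss_Ratio', 'Total_Payroll_Clean'])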

print(f"\nCleaned data: {len(data)} observations")
print(f"Years covered: {data['Year'].min()} - {data['Year'].max()}")
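
As a quick check of the cleaning step, the sketch below applies the same strip-and-convert idea to a few made-up payroll strings. The sample values and the helper name `clean_currency` are illustrative assumptions, not part of the original project. `pd.to_numeric(..., errors='coerce')` is used so that unparseable entries become NaN rather than raising an error.

```python
import pandas as pd

def clean_currency(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' from currency strings and convert to numbers (hypothetical helper)."""
    stripped = series.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce")

# Made-up example values, not taken from the course dataset
raw = pd.Series(["$123,456,789", "98,765,432", None, "n/a"])
cleaned = clean_currency(raw)
print(cleaned)
```

Missing and unparseable entries come back as NaN, so they can be removed afterwards with `dropna` in the same way the notebook does.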

---
## Step 2 | Data Exploration

In [None]:
# Summary statistics
print("Summary Statistics:")
print(data[['Win_Loss_Ratio', 'Total_Payroll_Clean', 'Wins', 'Losses']].describe())

In [None]:
# Correlation
correlation = data['Total_Payroll_Clean'].corr(data['Win_Loss_Ratio'])
print(f"\nCorrelation between payroll and win/loss ratio: {correlation:.3f}")
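
In a simple linear regression with one predictor, R² equals the square of the correlation coefficient, which links this cell to the models in Step 4. The sketch below checks that identity on synthetic data; the data and variable names are made up for illustration and are not the MLB dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data for illustration only (not the MLB dataset)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

# Fit y = intercept + slope * x by least squares and compute R² from the residuals
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)
r_squared = 1 - residuals.var() / y.var()

print(f"r = {r:.3f}, r² = {r**2:.3f}, R² = {r_squared:.3f}")
```

Applied to the reported result, an R² of 0.109 for total payroll implies a correlation of about 0.33.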

---
## Step 3 | Visualization

In [None]:
# Scatter plot: Total Payroll vs Win/Loss Ratio
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Total_Payroll_Clean', y='Win_Loss_Ratio', data=data, alpha=0.6)
sns.regplot(x='Total_Payroll_Clean', y='Win_Loss_Ratio', data=data, scatter=False, color='red')
plt.title('Win/Loss Ratio vs. Total Payroll Allocations')
plt.xlabel('Total Payroll Allocations ($)')
plt.ylabel('Win/Loss Ratio')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Scatter plot: Active Roster Payroll vs Win/Loss Ratio
if 'Active_26_Man_Clean' in data.columns:
    plt.figure(figsize=(10, 6))
    sns.scatterplot(x='Active_26_Man_Clean', y='Win_Loss_Ratio', data=data, alpha=0.6)
    sns.regplot(x='Active_26_Man_Clean', y='Win_Loss_Ratio', data=data, scatter=False, color='red')
    plt.title('Win/Loss Ratio vs. Active 26-Man Roster Payroll')
    plt.xlabel('Active 26-Man Roster Payroll ($)')
    plt.ylabel('Win/Loss Ratio')
    plt.grid(True, alpha=0.3)
    plt.show()

---
## Step 4 | Statistical Analysis

In [None]:
# Model 1: Total Payroll
model_1 = smf.ols('Win_Loss_Ratio ~ Total_Payroll_Clean', data=data).fit()
print("Model 1: Win_Loss_Ratio ~ Total_Payroll")
print(model_1.summary().tables[1])
print(f"\nR-squared: {model_1.rsquared:.3f}")

In [None]:
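# Model 2: Active Roster Payroll
# Note: the formula interface drops rows with missing values, so this model
# may be fit on fewer team-seasons than Model 1.
if 'Active_26_Man_Clean' in data.columns:
    model_2 = smf.ols('Win_Loss_Ratio ~ Active_26_Man_Clean', data=data).fit()
    print("\nModel 2: Win_Loss_Ratio ~ Active_26_Man_Payroll")
    print(model_2.summary().tables[1])
    print(f"\nR-squared: {model_2.rsquared:.3f}")
    print(f"Observations used: {int(model_2.nobs)}")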
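
R² values are easiest to compare when both models are fit on the same rows. If the active-roster column has missing values for some team-seasons (an assumption; check this in the real data), one option is to restrict both models to the common sample. The sketch below uses a synthetic stand-in for `data` with made-up values, and a hypothetical helper `r_squared` that relies on the fact that R² in a one-predictor regression equals the squared correlation.

```python
import numpy as np
import pandas as pd

def r_squared(df: pd.DataFrame, x: str, y: str) -> float:
    """R² of a simple linear regression of y on x (the squared correlation)."""
    return df[x].corr(df[y]) ** 2

# Synthetic stand-in for the notebook's `data`; all values are made up
rng = np.random.default_rng(1)
n = 120
demo = pd.DataFrame({"Total_Payroll_Clean": rng.uniform(50e6, 300e6, n)})
demo["Active_26_Man_Clean"] = demo["Total_Payroll_Clean"] * rng.uniform(0.5, 0.9, n)
demo["Win_Loss_Ratio"] = 0.8 + 2e-9 * demo["Active_26_Man_Clean"] + rng.normal(0, 0.15, n)
demo.loc[:39, "Active_26_Man_Clean"] = np.nan  # pretend 40 rows lack this column

# Keep only rows where every variable used by either model is present
cols = ["Win_Loss_Ratio", "Total_Payroll_Clean", "Active_26_Man_Clean"]
common = demo.dropna(subset=cols)

r2_total = r_squared(common, "Total_Payroll_Clean", "Win_Loss_Ratio")
r2_active = r_squared(common, "Active_26_Man_Clean", "Win_Loss_Ratio")
print(f"Rows in common sample: {len(common)} of {len(demo)}")
print(f"Total payroll R² (common sample):  {r2_total:.3f}")
print(f"Active payroll R² (common sample): {r2_active:.3f}")
```

On the real data, the same `dropna` call on `data` before fitting both `smf.ols` models would serve the same purpose.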

In [None]:
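# Inflation-adjusted analysis
# Approximation: assumes a constant 2.5% annual inflation rate (not actual CPI data)
# and converts each season's payroll into 2024 dollars.
base_year = 2024
inflation_rate = 0.025
data['inflation_multiplier'] = (1 + inflation_rate) ** (base_year - data['Year'])
data['Total_Payroll_Adjusted'] = data['Total_Payroll_Clean'] * data['inflation_multiplier']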

model_3 = smf.ols('Win_Loss_Ratio ~ Total_Payroll_Adjusted', data=data).fit()
print("\nModel 3: Win_Loss_Ratio ~ Inflation-Adjusted Total Payroll")
print(model_3.summary().tables[1])
print(f"\nR-squared: {model_3.rsquared:.3f}")
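
To see what the flat-rate adjustment does, the sketch below evaluates the same compounding formula for a few seasons. It reuses the notebook's `base_year` and `inflation_rate`; the function name `inflation_multiplier` is a hypothetical helper added for illustration.

```python
base_year = 2024
inflation_rate = 0.025  # same constant-rate assumption as the notebook

def inflation_multiplier(year: int) -> float:
    """Factor converting `year` dollars into base-year dollars under a constant rate."""
    return (1 + inflation_rate) ** (base_year - year)

for year in (2010, 2017, 2024):
    print(year, round(inflation_multiplier(year), 3))
```

Under this assumption a 2010 payroll is scaled up by a factor of roughly 1.41, so a $100 million payroll in 2010 is treated as about $141 million in 2024 dollars, while 2024 payrolls are left unchanged.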

In [None]:
# Key results
print("\n" + "="*50)
print("KEY RESULTS")
print("="*50)
print("\nNull Hypothesis: Payroll does not predict winning (beta = 0)")
print("\nModel Comparison:")
print(f"  Total Payroll R²: {model_1.rsquared:.3f}")
# model_2 exists only if the active roster column was present
has_model_2 = 'Active_26_Man_Clean' in data.columns
if has_model_2:
    print(f"  Active Roster Payroll R²: {model_2.rsquared:.3f}")
print(f"  Inflation-Adjusted Payroll R²: {model_3.rsquared:.3f}")
# Report fitted values instead of hard-coding the percentages
p_value = model_1.pvalues['Total_Payroll_Clean']
print("\nInterpretation:")
print(f"  Higher payroll IS significantly associated with more wins (p = {p_value:.3g})")
print(f"  But payroll explains only ~{model_1.rsquared:.0%} of win/loss variation")
if has_model_2:
    print(f"  Active roster payroll is a better predictor (~{model_2.rsquared:.0%})")

---
## Step 5 | Results Interpretation

### Key Findings

| Model | R-squared |
|-------|----------|
| Total Payroll | 0.109 |
| Active 26-Man Payroll | 0.240 |
| Inflation-Adjusted | 0.115 |

1. **Money Helps, But Not Much:** Higher payroll is significantly associated with more wins, but total payroll explains only about 11% of the variation in win/loss ratio

2. **Active Roster Matters More:** Spending on players actually on the field predicts wins better (R² = 0.24) than total payroll, which also includes injured, retained, and buried money

3. **Diminishing Returns:** The low R² is consistent with the Moneyball insight that smart spending matters more than total spending, although these linear models do not test for diminishing returns directly
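
To put the reported slope on a more familiar scale, the sketch below converts a change in win/loss ratio into wins. It assumes a full 162-game season, a team starting at .500, and a hypothetical $100 million payroll increase; the only number taken from the analysis is the reported coefficient of 1.95e-09.

```python
GAMES = 162      # full MLB regular season (assumption: no shortened seasons)
coef = 1.95e-09  # reported slope: change in win/loss ratio per payroll dollar

def wins_from_ratio(ratio: float, games: int = GAMES) -> float:
    """Convert a win/loss ratio W/L into wins, given W + L = games."""
    return games * ratio / (1 + ratio)

baseline_ratio = 1.0   # a .500 team (hypothetical starting point)
extra_payroll = 100e6  # hypothetical $100 million increase
new_ratio = baseline_ratio + coef * extra_payroll

extra_wins = wins_from_ratio(new_ratio) - wins_from_ratio(baseline_ratio)
print(f"Win/loss ratio: {baseline_ratio:.3f} -> {new_ratio:.3f}")
print(f"Predicted extra wins over {GAMES} games: {extra_wins:.1f}")
```

Under these assumptions the fitted line associates an extra $100 million with roughly seven more wins for an average team. Because R² is low, individual team-seasons can sit far from that line.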

### Baseball Economics

- **Revenue sharing:** Large-market teams share revenue with small-market teams
- **Luxury tax:** Teams pay a penalty for exceeding thresholds
- **Development:** Teams can develop cheap, young talent
- **Inefficiencies:** Some teams overpay for declining players

### Why Isn't Payroll More Predictive?

- **Variance in baseball:** Even great teams lose 60+ games in a 162-game season
- **Player development:** Young, cheap players can outperform expensive veterans
- **Injuries:** Expensive players get hurt
- **Team construction:** Chemistry and roster balance matter

---
## Replication Exercises

### Exercise 1: Postseason Success
Does payroll predict playoff appearances better than regular season wins?

### Exercise 2: Team Efficiency
Calculate wins per million dollars. Which teams are most efficient?

### Exercise 3: Year Trends
Has the relationship between payroll and wins changed over time?
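
One possible starting point is to estimate the payroll slope separately for each season and look at how it moves over time. The sketch below is a minimal version of that idea, run on synthetic data with made-up values. The helper name `slope_by_year` is hypothetical, and payroll is rescaled to millions of dollars so the slopes are easier to read.

```python
import numpy as np
import pandas as pd

def slope_by_year(df: pd.DataFrame, x: str, y: str, year: str = "Year") -> pd.Series:
    """Least-squares slope of y on x (x in $ millions), estimated separately per year."""
    def fit(group: pd.DataFrame) -> float:
        slope, _intercept = np.polyfit(group[x] / 1e6, group[y], deg=1)
        return slope
    return df.groupby(year)[[x, y]].apply(fit)

# Synthetic stand-in for the notebook's `data`; all values are made up
rng = np.random.default_rng(2)
years = np.repeat(np.arange(2010, 2013), 30)
payroll = rng.uniform(50e6, 300e6, years.size)
ratio = 0.8 + 2e-9 * payroll + rng.normal(0, 0.15, years.size)
demo = pd.DataFrame({"Year": years, "Total_Payroll_Clean": payroll, "Win_Loss_Ratio": ratio})

slopes = slope_by_year(demo, "Total_Payroll_Clean", "Win_Loss_Ratio")
print(slopes)
```

On the real data the call would be `slope_by_year(data, 'Total_Payroll_Clean', 'Win_Loss_Ratio')`. With only about 30 teams per season each yearly slope is noisy, so a regression with a payroll-by-year interaction may be a useful follow-up.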

### Challenge Exercise
Research the Moneyball revolution. How did the Oakland A's compete with a low payroll?

In [None]:
# Your code for exercises

# Example: Most efficient teams
# data['Wins_Per_Million'] = data['Wins'] / (data['Total_Payroll_Clean'] / 1e6)
# print(data.nlargest(10, 'Wins_Per_Million')[['Team', 'Year', 'Wins', 'Total_Payroll_Clean', 'Wins_Per_Million']])