# ECON 0150 | Replication Notebook

**Title:** MLB Payroll and Wins

**Original Authors:** Michalak

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis.

## About This Replication

**Research Question:** Is there a relationship between an MLB team's payroll and the number of games it wins?

**Data Source:** MLB team payroll and win data (30 teams)

**Methods:** OLS regression: Wins ~ Payroll

**Main Finding:** Positive relationship between payroll and wins.

**Course Concepts Used:**
- Simple linear regression
- Scatter plots with regression lines
- Sports economics
- Correlation vs. causation

---
## Step 0 | Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

In [None]:
# Load data from course website
base_url = 'https://tayweid.github.io/econ-0150/projects/replications/0059/data/'

# Reading .xlsx files with pandas requires the openpyxl package
data = pd.read_excel(base_url + 'Final Project.xlsx')

print(f"Number of teams: {len(data)}")
data.head(10)

---
## Step 1 | Data Preparation

In [None]:
# Check columns
print("Columns:", data.columns.tolist())

In [None]:
# Rename columns for clarity
data = data.rename(columns={
    'Payroll (in millions)': 'Payroll_Millions'
})

# Calculate win percentage as a proportion (assuming a 162-game season)
data['Win_Pct'] = data['Wins'] / 162

print(f"\nData prepared: {len(data)} teams")
data.head()

---
## Step 2 | Data Exploration

In [None]:
# Summary statistics
print("Summary Statistics:")
print(data[['Payroll_Millions', 'Wins']].describe())

In [None]:
# Correlation
correlation = data['Payroll_Millions'].corr(data['Wins'])
print(f"\nCorrelation between payroll and wins: {correlation:.4f}")

In [None]:
# Top and bottom teams
print("\nHighest payroll teams:")
print(data.nlargest(5, 'Payroll_Millions')[['Team', 'Payroll_Millions', 'Wins']])

print("\nLowest payroll teams:")
print(data.nsmallest(5, 'Payroll_Millions')[['Team', 'Payroll_Millions', 'Wins']])

---
## Step 3 | Visualization

In [None]:
# Bar chart: Payroll by team
plt.figure(figsize=(14, 6))
data_sorted = data.sort_values('Payroll_Millions', ascending=False)
plt.bar(data_sorted['Team'], data_sorted['Payroll_Millions'])
plt.xticks(rotation=45, ha='right')
plt.ylabel('Payroll ($ Millions)')
plt.title('MLB Team Payrolls')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Scatter plot with regression line
plt.figure(figsize=(10, 6))
sns.regplot(x='Payroll_Millions', y='Wins', data=data,
            scatter_kws={'s': 80, 'alpha': 0.7})
plt.title('MLB Payroll vs. Wins')
plt.xlabel('Payroll ($ Millions)')
plt.ylabel('Wins')
plt.axhline(81, linestyle='--', color='red', alpha=0.5, label='.500 Record (81 wins)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Labeled scatter plot
plt.figure(figsize=(12, 8))
plt.scatter(data['Payroll_Millions'], data['Wins'], s=80, alpha=0.7)

# Label select teams (very high payroll, or unusually high or low win totals)
for _, row in data.iterrows():
    if row['Payroll_Millions'] > 250 or row['Wins'] > 90 or row['Wins'] < 65:
        plt.annotate(row['Team'], (row['Payroll_Millions'], row['Wins']),
                    fontsize=8, alpha=0.8)

plt.xlabel('Payroll ($ Millions)')
plt.ylabel('Wins')
plt.title('MLB Payroll vs. Wins (Select Teams Labeled)')
plt.grid(True, alpha=0.3)
plt.show()

---
## Step 4 | Statistical Analysis

In [None]:
# OLS Regression
model = smf.ols('Wins ~ Payroll_Millions', data=data).fit()
print("OLS Regression: Wins ~ Payroll_Millions")
print(model.summary())

In [None]:
# Key results
print("\n" + "="*50)
print("KEY RESULTS")
print("="*50)
print("\nNull Hypothesis: Payroll does not predict wins (beta = 0)")
print("\nModel Results:")
print(f"  Intercept: {model.params['Intercept']:.2f} wins")
print(f"  Payroll coefficient: {model.params['Payroll_Millions']:.4f}")
print(f"  P-value: {model.pvalues['Payroll_Millions']:.4f}")
print(f"  R-squared: {model.rsquared:.3f}")
print("\nInterpretation:")
print("  Each additional $10 million in payroll is associated with")
print(f"  {model.params['Payroll_Millions']*10:.2f} additional wins")
# Compare the slope's p-value to the 5% significance level
if model.pvalues['Payroll_Millions'] < 0.05:
    print("\nConclusion: REJECT null hypothesis")
    print("  Payroll IS significantly associated with wins at the 5% level")
else:
    print("\nConclusion: FAIL TO REJECT null hypothesis")
    print("  Payroll is NOT significantly associated with wins at the 5% level")

---
## Step 5 | Results Interpretation

### Key Findings

1. **Positive Relationship:** Higher payroll is associated with more wins

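2. **Effect Size:** The payroll coefficient is the estimated change in wins associated with one additional $1 million of payroll; it describes an association, not a causal effect

3. **Unexplained Variance:** R-squared is the share of the variation in wins that payroll explains; the remainder is left unexplained by this model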
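
To connect the correlation from Step 2 with the R-squared from Step 4: in a regression with a single predictor, R-squared equals the squared correlation. The sketch below checks this with `numpy` on made-up numbers (the arrays are illustrative assumptions, not the project data), so it is a demonstration of the identity rather than a result about MLB.

```python
import numpy as np

# Made-up illustrative numbers, NOT the project data.
payroll = np.array([60.0, 95.0, 120.0, 150.0, 180.0, 240.0])  # $ millions
wins = np.array([70.0, 88.0, 75.0, 84.0, 79.0, 95.0])

# Fit Wins = b0 + b1 * Payroll by least squares.
b1, b0 = np.polyfit(payroll, wins, 1)
fitted = b0 + b1 * payroll

# R-squared = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((wins - fitted) ** 2)
ss_tot = np.sum((wins - wins.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Pearson correlation between payroll and wins.
r = np.corrcoef(payroll, wins)[0, 1]

print(f"Slope: {b1:.4f} wins per $1 million")
print(f"R-squared: {r_squared:.4f}")
print(f"Correlation squared: {r**2:.4f}")
print(f"Share of variation left unexplained: {1 - r_squared:.4f}")
```

On the real data, the same comparison can be made between `correlation**2` and `model.rsquared`.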

### Economic Context

- **Moneyball Effect:** Some low-payroll teams (like Oakland) find ways to compete
- **Diminishing Returns:** Additional spending may yield fewer additional wins at higher payroll levels; the linear model above assumes a constant slope and does not test for this
- **Competitive Balance:** MLB has revenue sharing but no salary cap
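
One possible way to probe the diminishing-returns idea is to add a squared payroll term and look at its sign. The sketch below uses `np.polyfit` on made-up numbers; the helper name `quadratic_fit` and the data are assumptions for illustration, and this is not part of the original project's analysis. A negative coefficient on the squared term would be consistent with diminishing returns, though with only 30 teams such an estimate would be imprecise.

```python
import numpy as np

def quadratic_fit(payroll, wins):
    """Least-squares fit of wins = c0 + c1*payroll + c2*payroll**2."""
    # np.polyfit returns coefficients from the highest power down.
    c2, c1, c0 = np.polyfit(payroll, wins, 2)
    return c0, c1, c2

# Made-up illustrative numbers, NOT the project data.
payroll = np.array([60.0, 90.0, 120.0, 150.0, 200.0, 250.0, 300.0])  # $ millions
wins = np.array([68.0, 76.0, 80.0, 85.0, 88.0, 91.0, 92.0])

c0, c1, c2 = quadratic_fit(payroll, wins)
print(f"Squared-term coefficient: {c2:.6f}")

# The implied slope at payroll level p is c1 + 2*c2*p.
for p in (100, 250):
    print(f"Implied wins per extra $1 million at ${p}M: {c1 + 2 * c2 * p:.4f}")
```

To get a standard error and p-value for the squared term, the same specification could be fit with `smf.ols('Wins ~ Payroll_Millions + I(Payroll_Millions**2)', data=data)`.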

### Outliers to Note

- **High Payroll + Low Wins:** Some big spenders underperform
- **Low Payroll + High Wins:** Efficient teams get more "bang for buck"
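
A rough way to make these outliers concrete is to rank teams by their regression residual: actual wins minus the wins predicted from payroll. The sketch below assumes the column names `Team`, `Payroll_Millions`, and `Wins` from Step 1; the helper `add_wins_vs_expected`, the new column names, and the demo rows are hypothetical and are not the project data.

```python
import numpy as np
import pandas as pd

def add_wins_vs_expected(df, x="Payroll_Millions", y="Wins"):
    """Return a sorted copy of df with fitted wins and residuals (actual - fitted)."""
    slope, intercept = np.polyfit(df[x], df[y], 1)
    out = df.copy()
    out["Expected_Wins"] = intercept + slope * out[x]
    out["Wins_Above_Expected"] = out[y] - out["Expected_Wins"]
    # Largest positive residuals first: teams that won more than payroll predicts.
    return out.sort_values("Wins_Above_Expected", ascending=False)

# Made-up teams for illustration only.
demo = pd.DataFrame({
    "Team": ["A", "B", "C", "D", "E"],
    "Payroll_Millions": [70.0, 110.0, 160.0, 220.0, 300.0],
    "Wins": [86, 72, 90, 78, 94],
})

ranked = add_wins_vs_expected(demo)
print(ranked[["Team", "Payroll_Millions", "Wins", "Wins_Above_Expected"]].round(1))
```

Calling `add_wins_vs_expected(data)` on the notebook's DataFrame would put the "Low Payroll + High Wins" style teams near the top and the underperforming big spenders near the bottom.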

### Limitations

- Single season (results may vary from year to year)
- Payroll doesn't account for player injuries
- Some payroll is tied up in bad contracts
- The regression shows an association between payroll and wins, not causation

---
## Replication Exercises

### Exercise 1: Cost Per Win
Calculate each team's cost per win. Which teams are most/least efficient?

### Exercise 2: Historical Analysis
Collect data from multiple years. Is the relationship consistent?
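
As a starting point, one option is to stack the seasons into a single table and compute the payroll-wins correlation separately for each year. The sketch below assumes a hypothetical `Season` column alongside the notebook's `Payroll_Millions` and `Wins`; all values are made up for illustration.

```python
import pandas as pd

# Made-up two-season panel for illustration only, NOT real MLB data.
panel = pd.DataFrame({
    "Season": [2022] * 4 + [2023] * 4,
    "Team": ["A", "B", "C", "D"] * 2,
    "Payroll_Millions": [80.0, 120.0, 180.0, 260.0, 85.0, 130.0, 170.0, 280.0],
    "Wins": [74, 81, 88, 95, 90, 70, 84, 79],
})

# Correlation between payroll and wins within each season.
by_season = pd.Series(
    {
        season: group["Payroll_Millions"].corr(group["Wins"])
        for season, group in panel.groupby("Season")
    },
    name="Correlation",
)
print(by_season.round(3))
```

If the per-season correlations differ widely, that would suggest the single-season result should be read with caution.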

### Exercise 3: Playoff Success
Add playoff performance. Does payroll predict postseason success?

### Challenge Exercise
Research the "Moneyball" approach. How did the Oakland A's compete with low payroll?

In [None]:
# Your code for exercises

# Example: Cost per win
# data['Cost_Per_Win'] = data['Payroll_Millions'] / data['Wins']
# print(data.nsmallest(5, 'Cost_Per_Win')[['Team', 'Payroll_Millions', 'Wins', 'Cost_Per_Win']])