# ECON 0150 | Replication Notebook

**Title:** MLB Salaries and Wins

**Original Authors:** Harrer; Reardon; Hu

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis.

## About This Replication

**Research Question:** Do higher team salaries lead to higher win percentages in the MLB?

**Data Source:** MLB team salary and win percentage data (2020-2025)

**Methods:** OLS regression of win percentage on total team salary

**Main Finding:** Higher team salaries are associated with higher win percentages. Each $100 million increase in salary is associated with approximately 5.7 percentage points higher win percentage (p < 0.001, R² = 0.21).

**Course Concepts Used:**
- Simple linear regression
- Scatter plots with regression lines
- Residual analysis
- Interpreting small coefficients with large predictors
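
Written out, the simple linear regression used in this notebook takes the form below. The symbols are notation introduced here for illustration: $i$ indexes team-season observations and $\varepsilon_i$ is the error term.

```latex
\text{WinPct}_i = \beta_0 + \beta_1 \, \text{Salary}_i + \varepsilon_i
```

Because salary is measured in dollars, $\beta_1$ is the change in win percentage per additional dollar, which is why the estimate is so small. Measuring salary in millions multiplies the coefficient by $10^6$ without changing the fit.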

---
## Step 0 | Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [None]:
# Load data from course website
base_url = 'https://tayweid.github.io/econ-0150/projects/replications/0028/data/'

df = pd.read_csv(base_url + 'mlb_salary_wins.csv')

print(f"Number of observations: {len(df)}")
print(f"Columns: {df.columns.tolist()}")
df.head()

---
## Step 1 | Data Preparation

In [None]:
# Clean and prepare data
# Keep only relevant columns and drop missing values
data = df[['Year', 'Team', 'Total Salary', 'Win Percentage']].copy()

# Convert to numeric
data['Total Salary'] = pd.to_numeric(data['Total Salary'], errors='coerce')
data['Win Percentage'] = pd.to_numeric(data['Win Percentage'], errors='coerce')

# Drop missing values
data = data.dropna(subset=['Total Salary', 'Win Percentage'])

# Create salary in millions for easier interpretation
data['Salary_Millions'] = data['Total Salary'] / 1_000_000

print(f"Clean data: {len(data)} team-season observations")
print(f"Years: {data['Year'].min()} to {data['Year'].max()}")

---
## Step 2 | Data Exploration

In [None]:
# Summary statistics
print("Summary Statistics:")
print(data[['Salary_Millions', 'Win Percentage']].describe())

In [None]:
# Distribution of variables
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(data['Salary_Millions'], bins=20, edgecolor='black')
axes[0].set_xlabel('Total Salary ($ Millions)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Team Salaries')

axes[1].hist(data['Win Percentage'], bins=20, edgecolor='black')
axes[1].set_xlabel('Win Percentage')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Win Percentages')

plt.tight_layout()
plt.show()

In [None]:
# Correlation
correlation = data['Salary_Millions'].corr(data['Win Percentage'])
print(f"Correlation between salary and win percentage: {correlation:.3f}")

---
## Step 3 | Visualization

In [None]:
# Scatter plot of raw salary (in dollars) against win percentage, no fitted line yet
plt.figure(figsize=(10, 6))
plt.scatter(data['Total Salary'], data['Win Percentage'], alpha=0.6)
plt.xlabel('Total Salary ($)')
plt.ylabel('Win Percentage')
plt.title('MLB Team Salary vs Win Percentage')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Scatter plot with regression line using seaborn
plt.figure(figsize=(10, 6))
sns.regplot(data=data, x='Salary_Millions', y='Win Percentage', 
            scatter_kws={'alpha': 0.6}, line_kws={'color': 'red'})
plt.xlabel('Total Salary ($ Millions)')
plt.ylabel('Win Percentage')
plt.title('MLB Team Salary vs Win Percentage with Regression Line')
plt.grid(True, alpha=0.3)
plt.show()

---
## Step 4 | Statistical Analysis

In [None]:
# OLS Regression using raw salary (in dollars)
X = sm.add_constant(data['Total Salary'])
y = data['Win Percentage']

model = sm.OLS(y, X).fit()
print(model.summary())

In [None]:
# Alternative: Regression with salary in millions for easier interpretation
model_millions = smf.ols('Q("Win Percentage") ~ Salary_Millions', data=data).fit()
print("\nRegression with Salary in Millions:")
print(model_millions.summary().tables[1])

In [None]:
# Residual analysis
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Residual plot
axes[0].scatter(model.fittedvalues, model.resid, alpha=0.5)
axes[0].axhline(0, color='red', linestyle='--')
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residual Plot')

# Histogram of residuals
axes[1].hist(model.resid, bins=20, edgecolor='black')
axes[1].set_xlabel('Residuals')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Residuals')

plt.tight_layout()
plt.show()

In [None]:
# Key results
print("\n" + "="*50)
print("KEY RESULTS")
print("="*50)
print(f"Intercept: {model.params['const']:.4f}")
print(f"Salary coefficient: {model.params['Total Salary']:.2e}")
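print("\nInterpretation:")
print("  Each $100 million increase in team salary is associated with")
# Scale the per-dollar slope up to $100 million; win percentage appears to be stored as a 0-1 fraction
print(f"  a {model.params['Total Salary'] * 100_000_000:.3f} increase in win percentage (0-1 scale)")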
print(f"\nR-squared: {model.rsquared:.3f}")
print(f"P-value: {model.pvalues['Total Salary']:.2e}")

---
## Step 5 | Results Interpretation

### Key Findings

| Metric | Value |
|--------|-------|
| Salary Coefficient | 5.73e-10 |
| R-squared | 0.21 |
| P-value | < 0.001 |
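
The coefficient in the table is measured per dollar of salary, which is why it looks so small. The sketch below walks through the unit conversion using the reported value of 5.73e-10. It assumes win percentage is stored as a fraction between 0 and 1, which is what the 5.7 percentage point interpretation implies.

```python
# Reported slope: change in win percentage (as a 0-1 fraction) per $1 of salary
coef_per_dollar = 5.73e-10

# Rescale the predictor: per $1 million and per $100 million
coef_per_million = coef_per_dollar * 1_000_000
coef_per_100m = coef_per_dollar * 100_000_000

# Convert the fraction to percentage points (assumes a 0-1 outcome scale)
pp_per_100m = coef_per_100m * 100

print(f"Per $1M:   {coef_per_million:.6f}")
print(f"Per $100M: {coef_per_100m:.4f} ({pp_per_100m:.1f} percentage points)")
```

Rescaling the predictor changes only the units of the coefficient. The p-value and R-squared are the same whether salary is entered in dollars or in millions.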

### Interpretation

1. **Statistically Significant:** There is a significant positive relationship between team salary and winning (p < 0.001)

2. **Effect Size:** Each $100 million in additional salary is associated with approximately 5.7 percentage points higher win percentage

3. **R-squared = 0.21:** Salary explains about 21% of the variation in win percentage
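
In a simple regression with one predictor and an intercept, R-squared equals the squared correlation between the two variables, so an R² of 0.21 corresponds to a correlation of roughly 0.46. The sketch below checks that identity on a small made-up dataset (the numbers are illustrative, not the MLB data) using only the standard library.

```python
import math

# Hypothetical toy data: payroll in $ millions and win fraction (not the MLB data)
salary = [60.0, 90.0, 120.0, 150.0, 200.0, 250.0]
win_pct = [0.43, 0.52, 0.47, 0.55, 0.50, 0.60]

n = len(salary)
mean_x = sum(salary) / n
mean_y = sum(win_pct) / n

# Sums of squares and cross-products around the means
sxx = sum((x - mean_x) ** 2 for x in salary)
syy = sum((y - mean_y) ** 2 for y in win_pct)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(salary, win_pct))

# Pearson correlation
r = sxy / math.sqrt(sxx * syy)

# OLS slope and intercept, then R-squared from the residuals
slope = sxy / sxx
intercept = mean_y - slope * mean_x
ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(salary, win_pct))
r_squared = 1 - ss_resid / syy

print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}, R^2 = {r_squared:.3f}")
print(f"Correlation implied by R^2 = 0.21: {math.sqrt(0.21):.2f}")
```

On the real data, the same comparison can be made between the correlation printed in Step 2 and the R-squared reported in Step 4.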

### What Else Matters?

An R² of 0.21 leaves roughly 79% of the variation in win percentage unexplained, so other factors also matter:
- Team chemistry and coaching
- Player injuries
- Efficient salary allocation (getting value)
- Minor league development

### Causal Interpretation?

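Does spending more *cause* more wins? That is likely the main causal direction, but a regression on observational data shows an association rather than proof of causation, and salary is far from decisive:
- Some high-salary teams underperform (the Yankees, for example)
- Some low-salary teams overperform (the Oakland A's in the "Moneyball" era)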

---
## Replication Exercises

### Exercise 1: Year Effects
Add year fixed effects. Has the salary-wins relationship changed over time?
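
One way to approach this exercise: with the formula interface, year fixed effects are typically added by including `C(Year)` in the formula, for example `Q("Win Percentage") ~ Salary_Millions + C(Year)`. The salary slope from that regression equals the slope from regressing year-demeaned win percentage on year-demeaned salary. The sketch below illustrates that within-year demeaning on hypothetical toy numbers (not the MLB data), using only the standard library.

```python
from collections import defaultdict

# Hypothetical toy rows: (year, salary in $ millions, win fraction)
rows = [
    (2021, 80.0, 0.45), (2021, 150.0, 0.52), (2021, 220.0, 0.58),
    (2022, 90.0, 0.44), (2022, 160.0, 0.50), (2022, 240.0, 0.57),
    (2023, 100.0, 0.46), (2023, 180.0, 0.51), (2023, 260.0, 0.55),
]

def ols_slope(xs, ys):
    """Slope from a simple regression of ys on xs (with intercept)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Group observations by year to compute within-year means
by_year = defaultdict(list)
for year, sal, win in rows:
    by_year[year].append((sal, win))

year_means = {
    year: (sum(s for s, _ in obs) / len(obs), sum(w for _, w in obs) / len(obs))
    for year, obs in by_year.items()
}

# Subtract each year's mean: this removes anything common to all teams in a year
sal_dm = [sal - year_means[year][0] for year, sal, _ in rows]
win_dm = [win - year_means[year][1] for year, _, win in rows]

pooled = ols_slope([r[1] for r in rows], [r[2] for r in rows])
within = ols_slope(sal_dm, win_dm)

print(f"Pooled slope (no year effects):    {pooled:.6f}")
print(f"Within-year slope (fixed effects): {within:.6f}")
```

Year fixed effects only shift the intercept by year. To ask whether the slope itself differs across years, an interaction such as `Salary_Millions * C(Year)` is one option to explore.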

### Exercise 2: Playoff Success
Does salary predict playoff appearances better than regular season wins?

### Exercise 3: Efficiency
Calculate win percentage per million dollars of salary for each team-season. Which teams are most efficient?

### Challenge Exercise
Research the "Moneyball" hypothesis. Does spending on analytics substitute for spending on salary?

In [None]:
# Your code for exercises

# Example: Calculate win percentage per million dollars of salary
# data['Efficiency'] = data['Win Percentage'] / data['Salary_Millions']
# print(data.nlargest(10, 'Efficiency')[['Team', 'Year', 'Salary_Millions', 'Win Percentage', 'Efficiency']])