# Model 1: Baseline Models Tutorial

## A Complete Guide to Hockey Prediction Baselines

This tutorial teaches you how to use baseline models for hockey goal prediction. Baselines are essential reference points for evaluating more complex models.

---

## Table of Contents

1. [What Are Baseline Models?](#1-what-are-baseline-models)
2. [Why Baselines Matter](#2-why-baselines-matter)
3. [The 6 Baseline Models](#3-the-6-baseline-models)
4. [Setup and Data](#4-setup-and-data)
5. [Training Baselines](#5-training-baselines)
6. [Making Predictions](#6-making-predictions)
7. [Evaluation and Comparison](#7-evaluation-and-comparison)
8. [Choosing the Right Baseline](#8-choosing-the-right-baseline)
9. [Next Steps](#9-next-steps)

## 1. What Are Baseline Models?

A **baseline model** is a simple prediction approach that serves as a reference point. Before building complex models like ELO ratings or neural networks, you need baselines to answer:

- *"Is my complex model actually better than just predicting the average?"*
- *"How much improvement does my model provide?"*

### Common Baseline Approaches

| Baseline | Prediction Method | Complexity |
|----------|-------------------|------------|
| Global Mean | League average goals | Simplest |
| Team Mean | Team-specific averages | Simple |
| Home/Away | Location-aware averages | Simple |
| Moving Average | Recent games only | Medium |
| Weighted History | Decay older games | Medium |
| Poisson | Statistical model | Medium |

### The Baseline Rule of Thumb

> **A model that can't beat GlobalMean is worthless.**  
> **A model that barely beats TeamMean needs work.**  
> **A model that significantly beats Poisson is promising.**

## 2. Why Baselines Matter

### Real-World Example

Imagine you build a complex neural network for hockey prediction that achieves RMSE = 1.85.

**Is that good?** You can't know without baselines.

- If GlobalMean gets RMSE = 1.90, your model is barely better
- If GlobalMean gets RMSE = 2.50, your model is a significant improvement
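
The comparison above can be made concrete by expressing the gap as a percentage reduction in RMSE. The helper below is a minimal sketch; the name `relative_improvement` is illustrative and is not part of the tutorial's code.

```python
def relative_improvement(baseline_rmse, model_rmse):
    """Percentage reduction in RMSE relative to a baseline (positive = better)."""
    return (baseline_rmse - model_rmse) / baseline_rmse * 100


model_rmse = 1.85  # the neural network from the example above
for baseline_rmse in (1.90, 2.50):
    gain = relative_improvement(baseline_rmse, model_rmse)
    print(f"GlobalMean RMSE {baseline_rmse:.2f} -> improvement {gain:.1f}%")
```

Against a GlobalMean RMSE of 1.90 the improvement is about 2.6%, while against 2.50 it is 26%. The same model score reads very differently depending on the baseline.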

### Competition Context

For the Wharton HS Hockey competition:

- Target RMSE: < 2.0
- GlobalMean typically: ~1.8-2.2 (depending on data)
- Your models need to beat baselines to be competitive

## 3. The 6 Baseline Models

### 3.1 GlobalMeanBaseline

**Concept:** Predict the league-wide average for all games.

```
home_prediction = average(all home goals in training data)
away_prediction = average(all away goals in training data)
```

**Use Case:** The absolute minimum bar. Any model must beat this.

---

### 3.2 TeamMeanBaseline

**Concept:** Each team has its own offensive and defensive average.

```
home_pred = (home_team_offense_avg + away_team_defense_avg) / 2
away_pred = (away_team_offense_avg + home_team_defense_avg) / 2
```

**Use Case:** Standard baseline that accounts for team quality.

---

### 3.3 HomeAwayBaseline

**Concept:** Separate statistics for home and away games.

**Why?** Teams often perform differently at home vs away.
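
This tutorial's code cells do not include an implementation of this baseline, so the following is only one possible sketch. It assumes each game is a plain dict with `home_team`, `away_team`, `home_goals`, and `away_goals` keys, and it predicts a team's scoring from its own venue-specific average while ignoring the opponent's defense. The class name `HomeAwaySketch` is hypothetical.

```python
from collections import defaultdict


def _mean(values):
    return sum(values) / len(values)


class HomeAwaySketch:
    """Illustrative only: per-team scoring averages split by venue."""

    def fit(self, games):
        home_scored = defaultdict(list)  # goals each team scored at home
        away_scored = defaultdict(list)  # goals each team scored on the road
        for g in games:
            home_scored[g["home_team"]].append(g["home_goals"])
            away_scored[g["away_team"]].append(g["away_goals"])
        self.home_avg = {t: _mean(v) for t, v in home_scored.items()}
        self.away_avg = {t: _mean(v) for t, v in away_scored.items()}
        # League-wide fallbacks for teams with no games at that venue
        self.league_home = _mean([x for v in home_scored.values() for x in v])
        self.league_away = _mean([x for v in away_scored.values() for x in v])
        return self

    def predict_goals(self, game):
        home_pred = self.home_avg.get(game["home_team"], self.league_home)
        away_pred = self.away_avg.get(game["away_team"], self.league_away)
        return home_pred, away_pred


sample_games = [
    {"home_team": "A", "away_team": "B", "home_goals": 4, "away_goals": 2},
    {"home_team": "A", "away_team": "C", "home_goals": 2, "away_goals": 1},
    {"home_team": "B", "away_team": "A", "home_goals": 3, "away_goals": 3},
]
model = HomeAwaySketch().fit(sample_games)
print(model.predict_goals({"home_team": "A", "away_team": "B"}))
```

A fuller version could blend in the opponent's venue-specific goals against, in the same way TeamMean averages offense and defense.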

---

### 3.4 MovingAverageBaseline

**Concept:** Only use the last N games (window size).

**Why?** Teams change over a season, so recent form can matter more than early-season results.

**Hyperparameter:** `window` (default: 5)
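
The effect of `window` can be seen on a single team's goal history. This is a small sketch using a plain list; the function name `recent_average` and the sample numbers are made up for illustration.

```python
def recent_average(values, window=5):
    """Mean of the last `window` values (or of all values if fewer exist)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if not values:
        raise ValueError("need at least one value")
    recent = values[-window:]
    return sum(recent) / len(recent)


goals_by_game = [1, 2, 6, 4, 5, 3, 4, 4]  # oldest game first
print(recent_average(goals_by_game, window=5))    # last five games only
print(recent_average(goals_by_game, window=100))  # window longer than history
```

A small window reacts quickly to form changes but is noisy. A large window approaches the plain team mean.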

---

### 3.5 WeightedHistoryBaseline

**Concept:** Recent games count more than older games (exponential decay).

```
weight = decay^(days_ago)
```

**Hyperparameter:** `decay` (default: 0.9, range 0.7-0.99)
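
This baseline is also not implemented in the tutorial's code cells. The sketch below shows only the weighting step, assuming each past game has a goal count and an age in days. The function name `weighted_average` is hypothetical.

```python
def weighted_average(goals, days_ago, decay=0.9):
    """Exponentially weighted mean: each game gets weight decay ** days_ago."""
    if not goals or len(goals) != len(days_ago):
        raise ValueError("goals and days_ago must be non-empty and equal length")
    weights = [decay ** d for d in days_ago]
    return sum(w * g for w, g in zip(weights, goals)) / sum(weights)


# A 5-goal game today and a 1-goal game ten days ago:
# the recent game dominates because 0.9 ** 10 is roughly 0.35.
print(round(weighted_average([5, 1], [0, 10], decay=0.9), 3))

# With decay=1.0 every game has weight 1, which recovers the plain mean.
print(weighted_average([5, 1], [0, 10], decay=1.0))
```

Because the exponent is measured in days, a `decay` near the low end of the suggested range discounts games from even a few weeks ago almost completely.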

---

### 3.6 PoissonBaseline

**Concept:** A simple Poisson-style model built from multiplicative team strengths.

Goal counts in hockey are commonly modeled as approximately Poisson-distributed. This baseline estimates:
- Attack strength per team
- Defense strength per team
- Home-ice advantage factor

**Why?** Poisson models are a widely used reference point in sports prediction.
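
As a worked illustration, one common multiplicative form combines the three quantities listed above into an expected goal count, which the Poisson distribution then turns into probabilities for each scoreline. The strength values below are invented for the example, and the function names are illustrative.

```python
import math


def expected_goals(league_avg, attack, opp_defense, venue_factor=1.0):
    """Expected goals = league average x attack x opponent defense x venue."""
    return league_avg * attack * opp_defense * venue_factor


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Hypothetical home team: above-average attack against a slightly
# better-than-average defense, with a 10% home-ice boost.
lam = expected_goals(3.0, attack=1.2, opp_defense=0.9, venue_factor=1.1)
print(f"Expected goals: {lam:.3f}")
for k in range(7):
    print(f"  P({k} goals) = {poisson_pmf(k, lam):.3f}")
```

Strengths above 1.0 mean "more than league average": a high attack value raises a team's own scoring, and a high defense value means the team concedes more.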

## 4. Setup and Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error
import os
import sys

# Add the parent directory to the import path so project modules can be found
sys.path.insert(0, os.path.dirname(os.getcwd()))

print("Libraries loaded.")

In [None]:
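# Baseline model implementations (self-contained).
# This cell defines GlobalMean, TeamMean, MovingAverage, and Poisson;
# HomeAway and WeightedHistory from Section 3 are not implemented here.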

COLUMN_ALIASES = {
    'home_team': ['home_team', 'home', 'team_home'],
    'away_team': ['away_team', 'away', 'team_away', 'visitor'],
    'home_goals': ['home_goals', 'home_score', 'h_goals', 'goals_home'],
    'away_goals': ['away_goals', 'away_score', 'a_goals', 'goals_away'],
}

def get_value(game, field, default=None):
    """Get value from game record with column alias support."""
    aliases = COLUMN_ALIASES.get(field, [field])
    for alias in aliases:
        if alias in game:
            val = game[alias]
            if pd.isna(val):
                return default
            return val
    return default

def get_column(df, field):
    """Find the correct column name in DataFrame."""
    aliases = COLUMN_ALIASES.get(field, [field])
    for alias in aliases:
        if alias in df.columns:
            return alias
    return None


class BaselineModel:
    """Abstract base class for baseline models."""
    
    def __init__(self, params=None):
        self.params = params or {}
        self.is_fitted = False
    
    def fit(self, games_df):
        raise NotImplementedError
    
    def predict_goals(self, game):
        raise NotImplementedError
    
    def evaluate(self, games_df):
        """Evaluate home-goal predictions on a dataset (away goals are not scored)."""
        if not self.is_fitted:
            raise RuntimeError("Model must be fitted first")
        
        home_preds, home_actuals = [], []
        for _, game in games_df.iterrows():
            hp, ap = self.predict_goals(game)
            home_preds.append(hp)
            home_actuals.append(get_value(game, 'home_goals', 0))
        
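        # Take the square root explicitly: the `squared=False` argument
        # was removed from mean_squared_error in recent scikit-learn releases
        rmse = float(np.sqrt(mean_squared_error(home_actuals, home_preds)))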
        mae = mean_absolute_error(home_actuals, home_preds)
        return {'rmse': rmse, 'mae': mae}


class GlobalMeanBaseline(BaselineModel):
    """Predict league-wide average goals."""
    
    def fit(self, games_df):
        hc = get_column(games_df, 'home_goals')
        ac = get_column(games_df, 'away_goals')
        self.home_mean = games_df[hc].mean()
        self.away_mean = games_df[ac].mean()
        self.is_fitted = True
        return self
    
    def predict_goals(self, game):
        return self.home_mean, self.away_mean


class TeamMeanBaseline(BaselineModel):
    """Predict based on team-specific averages."""
    
    def fit(self, games_df):
        htc = get_column(games_df, 'home_team')
        atc = get_column(games_df, 'away_team')
        hgc = get_column(games_df, 'home_goals')
        agc = get_column(games_df, 'away_goals')
        
        goals_for, goals_against, games = {}, {}, {}
        
        for _, g in games_df.iterrows():
            ht, at = g[htc], g[atc]
            hg, ag = g[hgc], g[agc]
            
            goals_for[ht] = goals_for.get(ht, 0) + hg
            goals_against[ht] = goals_against.get(ht, 0) + ag
            games[ht] = games.get(ht, 0) + 1
            
            goals_for[at] = goals_for.get(at, 0) + ag
            goals_against[at] = goals_against.get(at, 0) + hg
            games[at] = games.get(at, 0) + 1
        
        self.offense = {t: goals_for[t] / games[t] for t in games}
        self.defense = {t: goals_against[t] / games[t] for t in games}
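        # Fallback for unseen teams: league-wide goals per team per game
        # (home and away combined, matching how team averages are built)
        self.global_mean = (games_df[hgc].mean() + games_df[agc].mean()) / 2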
        self.is_fitted = True
        return self
    
    def predict_goals(self, game):
        ht = get_value(game, 'home_team')
        at = get_value(game, 'away_team')
        ho = self.offense.get(ht, self.global_mean)
        hd = self.defense.get(ht, self.global_mean)
        ao = self.offense.get(at, self.global_mean)
        ad = self.defense.get(at, self.global_mean)
        return (ho + ad) / 2, (ao + hd) / 2


class MovingAverageBaseline(BaselineModel):
    """Use only last N games."""
    
    def __init__(self, params=None):
        super().__init__(params)
        self.window = self.params.get('window', 5)
    
    def fit(self, games_df):
        htc = get_column(games_df, 'home_team')
        atc = get_column(games_df, 'away_team')
        hgc = get_column(games_df, 'home_goals')
        agc = get_column(games_df, 'away_goals')
        
        self.history = {}
        
        for _, g in games_df.iterrows():
            ht, at = g[htc], g[atc]
            hg, ag = g[hgc], g[agc]
            
            if ht not in self.history: self.history[ht] = []
            if at not in self.history: self.history[at] = []
            
            self.history[ht].append((hg, ag))
            self.history[at].append((ag, hg))
        
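        # Fallback for unseen teams: league-wide goals per team per game
        # (home and away combined, matching how team histories are built)
        self.global_mean = (games_df[hgc].mean() + games_df[agc].mean()) / 2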
        self.is_fitted = True
        return self
    
    def _recent_avg(self, team):
        if team not in self.history:
            return self.global_mean, self.global_mean
        rec = self.history[team][-self.window:]
        return np.mean([x[0] for x in rec]), np.mean([x[1] for x in rec])
    
    def predict_goals(self, game):
        ht = get_value(game, 'home_team')
        at = get_value(game, 'away_team')
        ho, hd = self._recent_avg(ht)
        ao, ad = self._recent_avg(at)
        return (ho + ad) / 2, (ao + hd) / 2


class PoissonBaseline(BaselineModel):
    """Statistical Poisson model."""
    
    def fit(self, games_df):
        hgc = get_column(games_df, 'home_goals')
        agc = get_column(games_df, 'away_goals')
        htc = get_column(games_df, 'home_team')
        atc = get_column(games_df, 'away_team')
        
        home_mean = games_df[hgc].mean()
        away_mean = games_df[agc].mean()
        # League average goals per team per game, home and away combined,
        # so it matches the per-team averages used for attack/defense below
        self.league_avg = max((home_mean + away_mean) / 2, 0.01)
        # Venue multipliers relative to the league average
        self.home_factor = home_mean / self.league_avg
        self.away_factor = away_mean / self.league_avg
        
        goals_for, goals_against, games = {}, {}, {}
        
        for _, g in games_df.iterrows():
            ht, at = g[htc], g[atc]
            hg, ag = g[hgc], g[agc]
            
            goals_for[ht] = goals_for.get(ht, 0) + hg
            goals_against[ht] = goals_against.get(ht, 0) + ag
            games[ht] = games.get(ht, 0) + 1
            
            goals_for[at] = goals_for.get(at, 0) + ag
            goals_against[at] = goals_against.get(at, 0) + hg
            games[at] = games.get(at, 0) + 1
        
        # Strengths are ratios to the league average (1.0 = average team)
        self.attack = {t: (goals_for[t]/games[t]) / self.league_avg for t in games}
        self.defense = {t: (goals_against[t]/games[t]) / self.league_avg for t in games}
        self.is_fitted = True
        return self
    
    def predict_goals(self, game):
        ht = get_value(game, 'home_team')
        at = get_value(game, 'away_team')
        
        # Unseen teams default to league-average strength
        ha = self.attack.get(ht, 1.0)
        hd = self.defense.get(ht, 1.0)
        aa = self.attack.get(at, 1.0)
        ad = self.defense.get(at, 1.0)
        
        # Two average teams now reproduce the league home and away means
        home_goals = self.league_avg * ha * ad * self.home_factor
        away_goals = self.league_avg * aa * hd * self.away_factor
        return home_goals, away_goals


print("All baseline models loaded.")

In [None]:
# Create sample data for the tutorial
np.random.seed(42)

teams = ['Bruins', 'Rangers', 'Maple Leafs', 'Canadiens', 'Flyers', 'Penguins']
team_strength = {'Bruins': 1.2, 'Rangers': 1.1, 'Maple Leafs': 1.0, 
                 'Canadiens': 0.9, 'Flyers': 0.95, 'Penguins': 1.15}

games = []
for i in range(200):
    home, away = np.random.choice(teams, 2, replace=False)
    
    # Poisson goals based on team strength
    home_lambda = 3.0 * team_strength[home] * 1.1  # Home advantage
    away_lambda = 3.0 * team_strength[away] * 0.9
    
    games.append({
        'game_date': pd.Timestamp('2025-10-01') + pd.Timedelta(days=i),
        'home_team': home,
        'away_team': away,
        'home_goals': np.random.poisson(home_lambda),
        'away_goals': np.random.poisson(away_lambda)
    })

games_df = pd.DataFrame(games)

# Chronological 80/20 split: rows are in date order, so the test set is the most recent 20% of games
split = int(len(games_df) * 0.8)
train_df = games_df.iloc[:split]
test_df = games_df.iloc[split:]

print(f"Training games: {len(train_df)}")
print(f"Test games: {len(test_df)}")
print("\nSample games:")
games_df.head(3)

## 5. Training Baselines

Training is simple: call `.fit()` with your training data.

In [None]:
# Train GlobalMean - the simplest baseline
global_model = GlobalMeanBaseline()
global_model.fit(train_df)

print("GlobalMean trained!")
print(f"  Average home goals: {global_model.home_mean:.2f}")
print(f"  Average away goals: {global_model.away_mean:.2f}")

In [None]:
# Train TeamMean - accounts for team quality
team_model = TeamMeanBaseline()
team_model.fit(train_df)

print("TeamMean trained!")
print("\nTeam Offensive Averages:")
for team, avg in sorted(team_model.offense.items(), key=lambda x: -x[1]):
    print(f"  {team}: {avg:.2f} goals/game")

In [None]:
# Train MovingAverage with window=5 (last 5 games)
ma_model = MovingAverageBaseline({'window': 5})
ma_model.fit(train_df)

print("MovingAverage (window=5) trained!")
print(f"  Using last {ma_model.window} games for each team")

In [None]:
# Train Poisson - statistical model
poisson_model = PoissonBaseline()
poisson_model.fit(train_df)

print("Poisson trained!")
print(f"  League average: {poisson_model.league_avg:.2f}")
print(f"  Home factor: {poisson_model.home_factor:.2f}")

## 6. Making Predictions

Use `.predict_goals(game)` to get predicted home and away goals.

In [None]:
# Predict a specific game
example_game = pd.Series({
    'home_team': 'Bruins',
    'away_team': 'Canadiens'
})

print(f"Game: {example_game['home_team']} vs {example_game['away_team']}")
print("="*50)

models = {
    'GlobalMean': global_model,
    'TeamMean': team_model,
    'MovingAverage': ma_model,
    'Poisson': poisson_model
}

for name, model in models.items():
    home_pred, away_pred = model.predict_goals(example_game)
    print(f"  {name:15}: Home {home_pred:.2f} - Away {away_pred:.2f}")

In [None]:
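# Notice how the models give different predictions:
# GlobalMean: same prediction for every game
# TeamMean: reflects team strength (Bruins strong, Canadiens weak) but not venue
# MovingAverage: reflects each team's most recent games only
# Poisson: combines team strength with a home-ice factor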

# Try another matchup
example_game2 = pd.Series({
    'home_team': 'Canadiens',
    'away_team': 'Bruins'
})

print(f"\nReversed: {example_game2['home_team']} vs {example_game2['away_team']}")
print("="*50)

for name, model in models.items():
    home_pred, away_pred = model.predict_goals(example_game2)
    print(f"  {name:15}: Home {home_pred:.2f} - Away {away_pred:.2f}")

## 7. Evaluation and Comparison

Use `.evaluate(test_df)` to compute RMSE and MAE.
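
For reference, both metrics can be computed by hand. The sketch below also scores home and away goals together, which is one possible extension of the home-goals-only evaluation. It assumes games are plain dicts and that predictions come from any callable returning a `(home, away)` pair. The names `rmse_mae` and `evaluate_both_sides` are illustrative.

```python
import math


def rmse_mae(actuals, preds):
    """Root mean squared error and mean absolute error of paired values."""
    errors = [p - a for a, p in zip(actuals, preds)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae


def evaluate_both_sides(predict_goals, games):
    """Score home and away predictions together as one pool of errors."""
    actuals, preds = [], []
    for g in games:
        home_pred, away_pred = predict_goals(g)
        preds.extend([home_pred, away_pred])
        actuals.extend([g["home_goals"], g["away_goals"]])
    return rmse_mae(actuals, preds)


toy_games = [
    {"home_goals": 4, "away_goals": 2},
    {"home_goals": 2, "away_goals": 3},
]
# A constant predictor standing in for a fitted model
rmse, mae = evaluate_both_sides(lambda g: (3.0, 2.5), toy_games)
print(f"RMSE={rmse:.4f}, MAE={mae:.4f}")
```

RMSE penalizes large misses more heavily than MAE, so a gap between the two suggests that a few games are predicted badly.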

In [None]:
# Evaluate all models on test set
results = []

for name, model in models.items():
    metrics = model.evaluate(test_df)
    results.append({
        'Model': name,
        'RMSE': round(metrics['rmse'], 4),
        'MAE': round(metrics['mae'], 4)
    })
    print(f"{name}: RMSE={metrics['rmse']:.4f}, MAE={metrics['mae']:.4f}")

results_df = pd.DataFrame(results).sort_values('RMSE')
results_df

In [None]:
# Visualize comparison
plt.figure(figsize=(10, 5))

colors = ['green' if x == results_df['RMSE'].min() else 'steelblue' 
          for x in results_df['RMSE']]

plt.barh(results_df['Model'], results_df['RMSE'], color=colors)
plt.xlabel('RMSE (lower is better)')
plt.title('Baseline Model Comparison')
plt.gca().invert_yaxis()

# Add value labels
for i, (model, rmse) in enumerate(zip(results_df['Model'], results_df['RMSE'])):
    plt.text(rmse + 0.02, i, f'{rmse:.3f}', va='center')

plt.tight_layout()
plt.show()

## 8. Choosing the Right Baseline

### Decision Guide

| Situation | Recommended Baseline |
|-----------|---------------------|
| Competition submission benchmark | GlobalMean |
| Standard model comparison | TeamMean |
| Academic paper | Poisson |
| Data with team form changes | MovingAverage |

### Key Insights

1. **GlobalMean** is the floor - any model must beat this
2. **TeamMean** is a fair baseline - accounts for team quality
3. **Poisson** is the academic standard - uses statistical theory
4. **MovingAverage** captures form - good for dynamic seasons

In [None]:
# Calculate improvement over GlobalMean
global_rmse = results_df[results_df['Model'] == 'GlobalMean']['RMSE'].values[0]

print("Improvement over GlobalMean baseline:")
print("="*40)

for _, row in results_df.iterrows():
    improvement = (global_rmse - row['RMSE']) / global_rmse * 100
    symbol = "✅" if improvement > 0 else "➖" if improvement == 0 else "❌"
    print(f"  {symbol} {row['Model']}: {improvement:+.1f}%")

## 9. Next Steps

Now that you understand baseline models:

### For the Competition:

1. Run `train_baseline.ipynb` on your data to find the best baseline
2. Record the best baseline RMSE as your benchmark
3. Build more complex models (ELO, XGBoost, ensemble)
4. Compare all models against your baseline benchmark

### Recommended Next Models:

| Model | Expected Improvement | Complexity |
|-------|---------------------|------------|
| ELO Ratings | 5-15% | Medium |
| XGBoost | 10-20% | High |
| Neural Network | 5-25% | Very High |
| Ensemble | 15-30% | High |

In [None]:
# Final summary
best_baseline = results_df.iloc[0]

print("\n" + "="*50)
print("TUTORIAL COMPLETE")
print("="*50)
print(f"\nBest Baseline: {best_baseline['Model']}")
print(f"RMSE: {best_baseline['RMSE']}")
print(f"\nThis is your benchmark to beat with advanced models!")