# Forward Validation Demo
## Hybrid Ensemble Model: Premier League 2021-22 to 2025-26

This notebook demonstrates **time-aware forward validation** of a hybrid probabilistic football model blending:
- **Dixon-Coles** goal-distribution parameters (per-team attack/defence strength)
- **LightGBM** gradient boosting on engineered features: xG differential, Elo ratings, recent form, rest days

Training is strictly cut off before the evaluation window — no lookahead, no data leakage.
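
The blend described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's actual implementation: the expected goal rates, the low-score correlation parameter `rho`, the blend weight `w_dc`, the stand-in LightGBM probabilities, and the choice of a simple linear weighted average are all hypothetical.

```python
import numpy as np
from math import exp, factorial

def poisson_pmf(k, rate):
    """Poisson probability of exactly k goals given an expected goal rate."""
    return exp(-rate) * rate ** k / factorial(k)

def dixon_coles_hda(home_rate, away_rate, rho, max_goals=10):
    """Return [P(home win), P(draw), P(away win)] from a Dixon-Coles score grid."""
    # grid[h, a] = independent-Poisson probability of a h-a scoreline
    grid = np.array([[poisson_pmf(h, home_rate) * poisson_pmf(a, away_rate)
                      for a in range(max_goals + 1)]
                     for h in range(max_goals + 1)])
    # Dixon-Coles low-score correction for 0-0, 0-1, 1-0 and 1-1
    grid[0, 0] *= 1 - home_rate * away_rate * rho
    grid[0, 1] *= 1 + home_rate * rho
    grid[1, 0] *= 1 + away_rate * rho
    grid[1, 1] *= 1 - rho
    grid /= grid.sum()  # renormalise after truncating at max_goals
    p_home = np.tril(grid, -1).sum()  # rows are home goals, so below the diagonal = home win
    p_draw = np.trace(grid)
    p_away = np.triu(grid, 1).sum()
    return np.array([p_home, p_draw, p_away])

# Hypothetical inputs for a single fixture
dc_probs = dixon_coles_hda(home_rate=1.6, away_rate=1.1, rho=-0.1)
gbm_probs = np.array([0.50, 0.26, 0.24])  # stand-in for LightGBM H/D/A probabilities
w_dc = 0.4                                # assumed blend weight, not taken from the notebook

blended = w_dc * dc_probs + (1 - w_dc) * gbm_probs
blended = blended / blended.sum()
print(dict(zip("HDA", blended.round(3))))
```

In a real pipeline the goal rates would come from the fitted per-team attack and defence strengths, and the blend weight would be chosen on training data only.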

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from sklearn.metrics import accuracy_score, brier_score_loss
import warnings
warnings.filterwarnings('ignore')

import os
os.makedirs('assets', exist_ok=True)

plt.rcParams.update({
    'figure.facecolor': 'white',
    'axes.facecolor': '#f8f9fa',
    'axes.grid': True,
    'grid.alpha': 0.4,
    'grid.linestyle': '--',
    'font.family': 'sans-serif',
    'axes.spines.top': False,
    'axes.spines.right': False,
})
print("Libraries loaded.")

In [None]:
df = pd.read_csv('sample_dataset.csv')
df['match_date'] = pd.to_datetime(df['match_date'])
df = df.sort_values('match_date').reset_index(drop=True)

# Derive binary target and home-win probability from clean columns
df['home_win'] = (df['actual_result'] == 'H').astype(int)
df['model_prob_home_win'] = df['prob_H']

print(f"Dataset: {len(df)} matches | {df['match_date'].min().date()} → {df['match_date'].max().date()}")
print(f"Seasons: {sorted(df['season'].unique())}")
print(f"\nSample rows:")
print(df[['match_date','home_team','away_team','actual_result','predicted_result','correct','prob_H','prob_D','prob_A','elo_diff']].head(6).to_string(index=False))

## 1. Season Boundary Train / Test Split

The model is evaluated using a **hard season boundary split** — training on complete seasons 2021-22 through 2024-25, evaluating on the entire 2025-26 season as a genuinely held-out walk-forward validation.

This is the correct unit of separation in football analytics: promotion/relegation reshuffles the team population each summer, meaning a mid-season percentage cut would split newly-promoted sides across train and test. A season boundary avoids that entirely and supports the precise claim: **"4 complete seasons trained; 2025-26 held out."**
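
One way to make the season-boundary guarantee explicit is to check that no season label ends up on both sides of the cut. The sketch below assumes the `match_date` and `season` columns used in this notebook; the helper name `split_by_season_boundary` and the toy rows are hypothetical.

```python
import pandas as pd

def split_by_season_boundary(matches, boundary):
    """Split chronologically at `boundary` and refuse a cut that straddles a season."""
    boundary = pd.Timestamp(boundary)
    train_part = matches[matches['match_date'] < boundary]
    test_part = matches[matches['match_date'] >= boundary]
    shared = set(train_part['season']) & set(test_part['season'])
    if shared:
        raise ValueError(f"Seasons on both sides of the split: {sorted(shared)}")
    return train_part, test_part

# Toy data: two matches from each of two seasons
toy_matches = pd.DataFrame({
    'match_date': pd.to_datetime(['2024-08-17', '2025-05-25', '2025-08-16', '2025-12-26']),
    'season': ['2024-25', '2024-25', '2025-26', '2025-26'],
})

train_toy, test_toy = split_by_season_boundary(toy_matches, '2025-08-01')
print(len(train_toy), len(test_toy))
```

A mid-season boundary such as `'2025-01-01'` would raise `ValueError` on this toy frame, because 2024-25 matches would then appear in both halves.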


In [None]:
season_boundary = pd.Timestamp('2025-08-01')
train = df[df['match_date'] < season_boundary].copy()
test  = df[df['match_date'] >= season_boundary].copy()
split_idx = len(train)  # for chart positioning

print(f"Training : {train['match_date'].min().date()} to {train['match_date'].max().date()}  ({len(train)} matches, 2021-22 → 2024-25)")
print(f"Test     : {test['match_date'].min().date()} to {test['match_date'].max().date()}   ({len(test)} matches, 2025-26)")

fig, ax = plt.subplots(figsize=(12, 2.5))
ax.barh(y=0, width=len(train), left=0, color='#003366',
        label=f'Training — 4 complete seasons ({len(train)} matches)', height=0.6)
ax.barh(y=0, width=len(test), left=len(train), color='#E63946',
        label=f'Walk-Forward Validation — 2025-26 ({len(test)} matches)', height=0.6)
ax.axvline(split_idx, color='white', linewidth=2, linestyle='--')

ax.text(split_idx / 2, 0,
        f"Train\n{train['match_date'].min().date()} to {train['match_date'].max().date()}",
        ha='center', va='center', color='white', fontsize=9, fontweight='bold')
ax.text(split_idx + len(test) / 2, 0,
        f"Test\n{test['match_date'].min().date()} to {test['match_date'].max().date()}",
        ha='center', va='center', color='white', fontsize=9, fontweight='bold')

ax.set_xlabel('Match index (chronological)', fontsize=10)
ax.set_title('Forward Validation — Season Boundary Split (2025-26 Held Out Entirely)', fontsize=12, fontweight='bold')
ax.legend(loc='upper right', fontsize=9)
ax.set_xlim(0, len(df))
ax.set_yticks([])
plt.tight_layout()
plt.savefig('assets/forward_validation_split.png', dpi=150, bbox_inches='tight')
plt.show()


![Forward Validation Split](https://raw.githubusercontent.com/vkenard/football-performance-analytics/main/assets/forward_validation_split.png)

## 2. Performance Metrics on the Test Set

Four standard metrics are reported on the held-out test set only:

- **3-Way Accuracy** -- fraction of matches where the predicted H/D/A outcome matched reality
- **Binary Accuracy** -- home win correctly identified (Yes/No) using a 0.5 probability threshold
- **Brier Score** -- mean squared error of probabilities; lower is better (0.0 = perfect)
- **Brier Skill Score (BSS)** -- relative improvement over a naive baseline that always predicts the test-set mean home-win rate, computed as `1 - brier / brier_baseline`; a positive value means the model beats that base rate, zero means no improvement, and a negative value means it does worse
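
To make the Brier and BSS definitions concrete, here is a small worked example on made-up numbers. The helper names `brier_score` and `brier_skill_score` and the four toy forecasts are illustrative only; the notebook itself uses `sklearn.metrics.brier_score_loss`.

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared difference between forecast probability and the 0/1 outcome."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

def brier_skill_score(y_true, y_prob, reference_prob):
    """1 - BS_model / BS_reference, where the reference always predicts `reference_prob`."""
    reference = np.full(len(y_true), reference_prob, dtype=float)
    return 1.0 - brier_score(y_true, y_prob) / brier_score(y_true, reference)

outcomes = [1, 0, 1, 0]            # toy home-win flags
forecasts = [0.8, 0.3, 0.6, 0.4]   # toy model probabilities

bs = brier_score(outcomes, forecasts)
# (0.04 + 0.09 + 0.16 + 0.16) / 4 = 0.1125
bss_toy = brier_skill_score(outcomes, forecasts, reference_prob=np.mean(outcomes))
# the reference always predicts 0.5, so its Brier score is 0.25 and BSS = 1 - 0.1125 / 0.25 = 0.55
print(round(bs, 4), round(bss_toy, 4))
```

Passing the training-set home-win rate as `reference_prob` instead would keep test labels out of the baseline; which reference is more appropriate is a judgment call.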

In [None]:
y_true = test['home_win'].values
y_prob = test['model_prob_home_win'].values
y_pred_binary = (y_prob >= 0.5).astype(int)

# 3-way accuracy: mean of the precomputed `correct` flag already in the CSV
acc_3way   = test['correct'].mean()
acc_binary = accuracy_score(y_true, y_pred_binary)

# Brier Score vs test-set naive baseline (always predict historical mean)
brier          = brier_score_loss(y_true, y_prob)
baseline_prob  = np.full(len(y_true), y_true.mean())
brier_baseline = brier_score_loss(y_true, baseline_prob)
bss            = 1 - (brier / brier_baseline)

print("=" * 56)
print("  FORWARD VALIDATION -- TEST SET RESULTS")
print("=" * 56)
print(f"  {'Test set home win rate':<38} {y_true.mean():.3f}")
print(f"  {'3-Way Match Accuracy (H/D/A)':<38} {acc_3way:.3f}")
print(f"  {'Binary Accuracy (Home Win, t=0.5)':<38} {acc_binary:.3f}")
print(f"  {'Brier Score (model)':<38} {brier:.4f}")
print(f"  {'Brier Score (naive baseline)':<38} {brier_baseline:.4f}")
print(f"  {'Brier Skill Score':<38} {bss:+.4f}  ({'beats naive baseline' if bss > 0 else 'below naive baseline'})")
print("=" * 56)
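
The 3-way accuracy above relies on the `correct` column shipped with the CSV. If that column had to be rebuilt, one plausible derivation is to take the most probable of `prob_H`, `prob_D` and `prob_A` as the prediction. This is an assumption: the notebook does not show how `predicted_result` was produced, and the toy rows below are made up.

```python
import pandas as pd

toy_preds = pd.DataFrame({
    'prob_H': [0.55, 0.30, 0.25],
    'prob_D': [0.25, 0.30, 0.35],
    'prob_A': [0.20, 0.40, 0.40],
    'actual_result': ['H', 'A', 'D'],
})
prob_cols = ['prob_H', 'prob_D', 'prob_A']

# idxmax returns the name of the column with the largest probability in each row;
# stripping the 'prob_' prefix leaves the H / D / A label
toy_preds['predicted_result'] = (
    toy_preds[prob_cols].idxmax(axis=1).str.replace('prob_', '', regex=False)
)
toy_preds['correct'] = (toy_preds['predicted_result'] == toy_preds['actual_result']).astype(int)

acc_3way_toy = toy_preds['correct'].mean()
print(toy_preds[['predicted_result', 'actual_result', 'correct']])
print(f"3-way accuracy: {acc_3way_toy:.3f}")
```

Because draws are rarely the single most likely outcome, an argmax rule like this tends to predict few of them, which is worth bearing in mind when reading 3-way accuracy.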

## 3. Rolling Performance Drift Detection

A **20-match rolling Brier score** is tracked across the full dataset to identify periods of performance drift.  
Rising values mean the probability estimates are getting less accurate; plausible causes include squad changes, managerial turnover, or shifting tactical profiles mid-season.  
This kind of longitudinal monitoring is directly applicable to academy development tracking.
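
The rolling Brier score is just a moving average of per-match squared errors, so it can also be written with pandas `rolling`. The sketch below uses synthetic outcomes and probabilities (not the notebook's data) and a hypothetical helper name, and checks the result against an explicit window-by-window loop.

```python
import numpy as np
import pandas as pd

def rolling_brier_score(y_true, y_prob, window):
    """Rolling mean of squared errors; returns one value per complete window."""
    sq_err = pd.Series(
        (np.asarray(y_true, dtype=float) - np.asarray(y_prob, dtype=float)) ** 2
    )
    return sq_err.rolling(window).mean().dropna().to_numpy()

# Synthetic data purely for illustration
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=50)
p_demo = rng.random(50)
window_demo = 20

rolled = rolling_brier_score(y_demo, p_demo, window_demo)

# Reference: the same quantity computed one window at a time
manual = np.array([
    np.mean((y_demo[i:i + window_demo] - p_demo[i:i + window_demo]) ** 2)
    for i in range(len(y_demo) - window_demo + 1)
])
print(len(rolled), np.allclose(rolled, manual))
```

Each value summarises the `window` matches ending at that position, so it pairs naturally with the date of the last match in the window.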

In [None]:
window = 20
y_all  = df['home_win'].values
p_all  = df['model_prob_home_win'].values

rolling_brier = [
    np.mean((y_all[i:i+window] - p_all[i:i+window]) ** 2)
    for i in range(len(df) - window + 1)
]
rolling_dates = df['match_date'].iloc[window - 1:].reset_index(drop=True)

fig, ax = plt.subplots(figsize=(13, 4.5))
ax.plot(rolling_dates, rolling_brier, color='#003366', linewidth=1.8,
        label=f'Rolling Brier ({window}-match window)')
ax.axhline(brier, color='#E63946', linewidth=1.5, linestyle='--',
           label=f'Overall test Brier = {brier:.4f}')
ax.axhline(0.25, color='#888888', linewidth=1.2, linestyle=':',
           label='Coin-flip reference (0.25)')
# season_boundary is the pd.Timestamp defined in the split cell above
ax.axvline(season_boundary, color='#2a9d8f', linewidth=1.8, linestyle='--',
           label=f'Train / Test split ({season_boundary.date()})')

ax.fill_between(rolling_dates, rolling_brier, 0.25,
                where=[v < 0.25 for v in rolling_brier],
                alpha=0.12, color='#003366', interpolate=True, label='Better than coin-flip')
ax.fill_between(rolling_dates, rolling_brier, 0.25,
                where=[v >= 0.25 for v in rolling_brier],
                alpha=0.12, color='#E63946', interpolate=True, label='Worse than coin-flip')

ax.set_ylim(0.10, 0.35)
ax.set_xlabel('Match Date', fontsize=11)
ax.set_ylabel('Rolling Brier Score', fontsize=11)
ax.set_title('Rolling Brier Score -- Performance Drift Detection\n(Lower = More Accurate Probability Estimates)',
             fontsize=12, fontweight='bold')
ax.legend(fontsize=9, loc='upper right')
plt.tight_layout()
plt.savefig('assets/drift_monitoring.png', dpi=150, bbox_inches='tight')
plt.show()
print("Charts saved to assets/")

![Drift Monitoring](https://raw.githubusercontent.com/vkenard/football-performance-analytics/main/assets/drift_monitoring.png)

## 4. Feature Importance -- What is Driving the Predictions?

Understanding *why* the model produces a particular probability is as important as the probability itself. This matters directly for coaching and academy staff: if rest days are a top driver, that informs scheduling decisions; if xG differential is dominant, it validates the use of expected goals as a development metric.

A Random Forest is fitted to the 8 engineered input features (excluding the probability outputs) on the full dataset, training and test seasons combined, to quantify which features contribute most to predicting home win outcomes. Feature importance is measured by mean decrease in impurity across all trees.

> **Note:** The Dixon-Coles and LightGBM blend already encodes most of this signal in `prob_H`. The chart below shows the *raw feature signal* independently, as seen by a separate Random Forest. It indicates which inputs carry information about home wins; on its own it does not show which inputs the ensemble relies on.
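
As a sanity check on how mean decrease in impurity behaves, the sketch below fits a small Random Forest to synthetic data with one informative column and one pure-noise column. All data and variable names here are made up for illustration and are unrelated to the notebook's features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_rows = 500

signal = rng.normal(size=n_rows)   # drives the label
noise = rng.normal(size=n_rows)    # unrelated to the label
labels = (signal + 0.5 * rng.normal(size=n_rows) > 0).astype(int)

X_toy = np.column_stack([signal, noise])
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
forest.fit(X_toy, labels)

# feature_importances_ holds the impurity-based importances, normalised to sum to 1
mdi = forest.feature_importances_
for name, value in zip(['signal', 'noise'], mdi):
    print(f"{name:<8} {value:.3f}")
```

The noise column still receives a small non-zero share, a reminder that low-ranked features in an impurity-based chart are not necessarily contributing anything real.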

In [None]:
from sklearn.ensemble import RandomForestClassifier

feature_cols = ['elo_diff', 'xg_diff', 'dc_home_attack', 'dc_away_defence',
                'form_home_5', 'form_away_5', 'rest_days_home', 'rest_days_away']
feature_labels = {
    'elo_diff':         'Elo Rating Gap',
    'xg_diff':          'xG Differential',
    'dc_home_attack':   'DC Home Attack',
    'dc_away_defence':  'DC Away Defence',
    'form_home_5':      'Home Form (5 games)',
    'form_away_5':      'Away Form (5 games)',
    'rest_days_home':   'Home Rest Days',
    'rest_days_away':   'Away Rest Days',
}

X = df[feature_cols].values
y = df['home_win'].values

rf = RandomForestClassifier(n_estimators=300, max_depth=5, random_state=42)
rf.fit(X, y)

importances = (
    pd.Series(rf.feature_importances_, index=feature_cols)
    .rename(index=feature_labels)
    .sort_values()
)

fig, ax = plt.subplots(figsize=(9, 5))
colors = ['#003366' if v >= importances.median() else '#6c9ec8' for v in importances.values]
bars = ax.barh(importances.index, importances.values, color=colors, edgecolor='none')

for bar, val in zip(bars, importances.values):
    ax.text(val + 0.002, bar.get_y() + bar.get_height() / 2,
            f'{val:.3f}', va='center', fontsize=9)

ax.set_xlabel('Feature Importance (Mean Decrease in Impurity)', fontsize=10)
# Use len(df) rather than a hardcoded match count so the title tracks the data
ax.set_title(f'Feature Importance -- Drivers of Home Win Prediction\n(Random Forest, 300 trees, full {len(df)}-match sample)',
             fontsize=11, fontweight='bold')
ax.set_xlim(0, importances.max() * 1.2)
plt.tight_layout()
plt.savefig('assets/feature_importance.png', dpi=150, bbox_inches='tight')
plt.show()
print("Saved: assets/feature_importance.png")

![Feature Importance](https://raw.githubusercontent.com/vkenard/football-performance-analytics/main/assets/feature_importance.png)