# Probability Calibration Analysis
## Hybrid Ensemble Model — Premier League 2023–2025

This notebook evaluates the **probability calibration** of the hybrid Dixon-Coles / LightGBM ensemble.  
A model is well calibrated if, among the matches where it predicts a 70% chance of a home win, the home side wins approximately 70% of the time.  
Poor calibration, even alongside high accuracy, means the probabilities are systematically over- or under-confident, which undermines any downstream decision in recruitment or performance analysis that relies on them.
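
To make the definition concrete, here is a minimal simulation on synthetic data (not the notebook's dataset). It assumes a hypothetical over-confident model whose true win chances sit halfway between its stated probability and 0.5, and compares observed win rates in the 65–75% prediction band.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stated probabilities (illustrative only, not real model output)
p_sim = rng.uniform(0.05, 0.95, size=200_000)

# Calibrated case: each event occurs with exactly the stated probability
y_calibrated = (rng.random(p_sim.size) < p_sim).astype(int)

# Over-confident case (assumption): true chances are pulled halfway towards 0.5
true_p = 0.5 + 0.5 * (p_sim - 0.5)
y_overconfident = (rng.random(p_sim.size) < true_p).astype(int)

# Look only at predictions of roughly 70%
band = (p_sim >= 0.65) & (p_sim < 0.75)
rate_calibrated = y_calibrated[band].mean()
rate_overconfident = y_overconfident[band].mean()

print(f"Mean stated probability in band : {p_sim[band].mean():.3f}")
print(f"Observed rate, calibrated model : {rate_calibrated:.3f}")
print(f"Observed rate, over-confident   : {rate_overconfident:.3f}")
```

In the calibrated case the observed rate in the band should land near 0.70; in the over-confident case it should land near 0.60, even though both models rank matches identically.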

In [None]:
import os
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Output directory for saved figures
os.makedirs('assets', exist_ok=True)

plt.rcParams.update({
    'figure.facecolor': 'white',
    'axes.facecolor': '#f8f9fa',
    'axes.grid': True,
    'grid.alpha': 0.4,
    'grid.linestyle': '--',
    'font.family': 'sans-serif',
    'axes.spines.top': False,
    'axes.spines.right': False,
})
print("Libraries loaded.")

In [None]:
df = pd.read_csv('sample_dataset.csv')
df['match_date'] = pd.to_datetime(df['match_date'])
df = df.sort_values('match_date').reset_index(drop=True)

# Derive binary target and home-win probability from clean columns
df['home_win'] = (df['actual_result'] == 'H').astype(int)
df['model_prob_home_win'] = df['prob_H']

y = df['home_win'].values
p = df['model_prob_home_win'].values

print(f"Loaded {len(df)} matches | Home win rate: {y.mean():.3f} | Mean predicted prob: {p.mean():.3f}")
print(df[['match_date','home_team','away_team','actual_result','predicted_result','correct','prob_H','prob_D','prob_A']].head(5).to_string(index=False))
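
Before trusting any calibration metric, it can help to confirm that the three outcome probabilities are valid. The sketch below assumes `prob_H`, `prob_D` and `prob_A` are meant to form a full distribution that sums to 1 for each match (the notebook does not state this explicitly); `check_outcome_probs` and the demo frame are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def check_outcome_probs(frame, cols=("prob_H", "prob_D", "prob_A"), tol=1e-6):
    """Return a boolean array flagging rows whose probabilities look invalid."""
    probs = frame[list(cols)].to_numpy(dtype=float)
    has_nan = np.isnan(probs).any(axis=1)
    out_of_range = ((probs < 0) | (probs > 1)).any(axis=1)
    bad_sum = np.abs(probs.sum(axis=1) - 1.0) > tol
    return has_nan | out_of_range | bad_sum

# Hypothetical rows: valid, sum > 1, out of range, missing value
demo = pd.DataFrame({
    "prob_H": [0.5, 0.6, 1.2, np.nan],
    "prob_D": [0.3, 0.3, -0.1, 0.5],
    "prob_A": [0.2, 0.2, -0.1, 0.5],
})
mask = check_outcome_probs(demo)
print(f"{mask.sum()} of {len(demo)} rows flagged")
```

On the real data this could be called as `check_outcome_probs(df)`; the tolerance may need loosening if the probabilities were rounded on export.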

## 1. Reliability Curve (Calibration Curve)

The diagonal represents perfect calibration. Points above the line mean home wins occurred more often than predicted, so the model is **under-confident** there. Points below the line mean they occurred less often than predicted, indicating **over-confidence**.  
Quantile-based binning places a roughly equal number of matches in each bin, so every point on the curve is estimated from a similar sample size. Uniform-width bins can leave the extreme bins nearly empty, which makes their observed frequencies noisy.

In [None]:
prob_true, prob_pred = calibration_curve(y, p, n_bins=10, strategy='quantile')

fig, ax = plt.subplots(figsize=(7, 6))
ax.plot([0, 1], [0, 1], linestyle='--', color='#888888', linewidth=1.5, label='Perfect calibration')
ax.plot(prob_pred, prob_true, marker='o', color='#003366', linewidth=2,
        markersize=7, label='Ensemble model (Dixon-Coles + LightGBM)')

# interpolate=True extends the shading to the points where the curve crosses the diagonal
ax.fill_between(prob_pred, prob_pred, prob_true, where=(prob_true > prob_pred),
                interpolate=True, alpha=0.12, color='#2a9d8f', label='Under-confident')
ax.fill_between(prob_pred, prob_pred, prob_true, where=(prob_true <= prob_pred),
                interpolate=True, alpha=0.12, color='#E63946', label='Over-confident')

ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_xlabel('Mean Predicted Probability', fontsize=11)
ax.set_ylabel('Observed Win Frequency', fontsize=11)
ax.set_title('Reliability Curve — Home Win Probability\n(Premier League 2023–2025)', fontsize=12, fontweight='bold')
ax.legend(fontsize=9)
plt.tight_layout()
plt.savefig('assets/calibration_curve.png', dpi=150, bbox_inches='tight')
plt.show()
print("Saved: assets/calibration_curve.png")

![Calibration Curve](https://raw.githubusercontent.com/vkenard/football-performance-analytics/main/assets/calibration_curve.png)
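
The stability argument for quantile binning can be illustrated on synthetic data. The sketch below draws skewed probabilities from a Beta distribution (an assumption for illustration, not the notebook's predictions) and counts how many samples fall in each bin under the two strategies. The quantile edges are computed directly with NumPy and may differ in detail from scikit-learn's internal binning.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic, skewed "predicted probabilities" (illustrative only)
p_demo = rng.beta(2, 3, size=1000)

# Uniform strategy: ten equal-width bins on [0, 1]
uniform_edges = np.linspace(0.0, 1.0, 11)
uniform_counts, _ = np.histogram(p_demo, bins=uniform_edges)

# Quantile strategy: edges at the 0%, 10%, ..., 100% quantiles of the data
quantile_edges = np.quantile(p_demo, np.linspace(0.0, 1.0, 11))
quantile_counts, _ = np.histogram(p_demo, bins=quantile_edges)

print("Uniform bin counts :", uniform_counts)
print("Quantile bin counts:", quantile_counts)
```

With skewed predictions the uniform bins at the high end hold very few samples, so their observed frequencies rest on little evidence, while the quantile bins stay close to equal in size.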

## 3. Brier Score — Measuring Probability Accuracy

The **Brier Score** is the mean squared error between predicted probabilities and actual outcomes. Unlike accuracy, it rewards confident *correct* predictions and penalises confident *wrong* ones.

$$BS = \frac{1}{N}\sum_{i=1}^{N}(p_i - y_i)^2$$
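
As a worked example of the formula, the sketch below scores two hypothetical sets of predictions on four made-up matches. Both sets pick the same winners at a 0.5 threshold, so their accuracy is identical, but the second is more confident on the match it gets wrong. The arrays and the `brier` helper are illustrative and not part of the notebook's data or code.

```python
import numpy as np

def brier(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean((p_pred - y_true) ** 2))

y_demo = np.array([1, 0, 1, 0])            # made-up outcomes
p_a = np.array([0.9, 0.2, 0.6, 0.7])       # wrong on match 4, moderately confident
p_b = np.array([0.6, 0.4, 0.6, 0.9])       # wrong on match 4, very confident

# Accuracy at a 0.5 threshold is the same for both models
acc_a = float(np.mean((p_a >= 0.5) == y_demo))
acc_b = float(np.mean((p_b >= 0.5) == y_demo))

# Squared errors for p_a: 0.01, 0.04, 0.16, 0.49 -> mean 0.175
bs_a = brier(y_demo, p_a)
# Squared errors for p_b: 0.16, 0.16, 0.16, 0.81 -> mean 0.3225
bs_b = brier(y_demo, p_b)

print(f"Accuracy A={acc_a:.2f}, B={acc_b:.2f}")
print(f"Brier    A={bs_a:.4f}, B={bs_b:.4f}")
```

The single confident miss in `p_b` (0.9 on a match the home side did not win) contributes 0.81 to the sum, which is why its score is so much worse despite equal accuracy.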

- **Range:** 0 (perfect) to 1 (every prediction fully confident and wrong)
- **Baseline:** A naive model that predicts the same probability for every match: the overall home-win rate of the evaluated sample
- **Brier Skill Score (BSS):** $1 - BS / BS_{baseline}$, the relative reduction in Brier Score versus the naive baseline; positive values indicate that the model adds predictive value

> A Brier Score below the baseline indicates that the ensemble captures signal about match outcomes beyond the average home-win rate.

In [None]:
brier = brier_score_loss(y, p)

# Baseline: predict the global mean win rate for every match
baseline_prob = np.full(len(y), y.mean())
brier_baseline = brier_score_loss(y, baseline_prob)

# Brier Skill Score: how much better than the naive baseline
bss = 1 - (brier / brier_baseline)

print("=" * 42)
print("  Brier Score Analysis")
print("=" * 42)
print(f"  Model Brier Score    : {brier:.4f}")
print(f"  Baseline Brier Score : {brier_baseline:.4f}")
print(f"  Brier Skill Score    : {bss:+.4f}")
print("-" * 42)
if bss > 0:
    print(f"  Brier Score is {bss*100:.1f}% lower than the naive baseline")
else:
    print("  Model does not improve on the naive baseline")
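
The baseline above uses the home-win rate of the whole evaluated sample, so each match's reference probability includes outcomes that happened after it. If "historical" is read strictly, one alternative is an expanding mean over earlier matches only. The sketch below is one possible way to build it; `expanding_baseline` and the `prior` value for the first match are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def expanding_baseline(y_sorted, prior=0.5):
    """For each match, the mean outcome of all earlier matches.

    y_sorted must be in chronological order. The first match has no
    history, so it receives `prior` (an arbitrary assumption).
    """
    s = pd.Series(y_sorted, dtype=float)
    # expanding().mean() includes the current row; shift(1) drops it
    return s.expanding().mean().shift(1).fillna(prior).to_numpy()

# Made-up outcome sequence for illustration
y_demo = np.array([1, 0, 1, 1, 0])
baseline_demo = expanding_baseline(y_demo)
print(np.round(baseline_demo, 4))
```

Because `df` is already sorted by `match_date`, this could be applied as `brier_score_loss(y, expanding_baseline(y))` to obtain an out-of-sample reference score. The earliest values are based on very little history, so it may be preferable to exclude the first few matches from the comparison.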

## 4. Probability Sharpness — Decile Reliability Table

A well-calibrated model should show **monotonic alignment** between predicted probability bands and observed frequencies: observed home-win rates should rise with the predicted band and sit close to that band's mean prediction. If matches the model rates as 70–80% likely to produce a home win actually do so ~75% of the time, the probabilities are genuinely informative.

This decile table splits all predictions into 10 roughly equally populated buckets (fewer if tied probabilities force duplicate bin edges to be dropped) and compares:
- **mean_predicted** — average model probability in that bucket
- **actual_rate** — true observed frequency in that bucket
- **count** — number of matches

*A monotonically increasing actual_rate indicates that the model discriminates meaningfully between high- and low-probability matches; small reversals between neighbouring deciles can arise from sampling noise alone.*

In [None]:
df_cal = df[['model_prob_home_win', 'home_win']].copy()
df_cal['decile'] = pd.qcut(df_cal['model_prob_home_win'], q=10, labels=False, duplicates='drop')

summary = (
    df_cal.groupby('decile')
    .agg(
        mean_predicted=('model_prob_home_win', 'mean'),
        actual_rate=('home_win', 'mean'),
        count=('home_win', 'count')
    )
    .reset_index(drop=True)
)
summary.index = [f"D{i+1}" for i in range(len(summary))]

print(summary.to_string())

# Bar chart: predicted vs actual by decile
fig, ax = plt.subplots(figsize=(10, 5))
x = np.arange(len(summary))
w = 0.35
ax.bar(x - w/2, summary['mean_predicted'], w, label='Mean Predicted', color='#003366', alpha=0.85)
ax.bar(x + w/2, summary['actual_rate'], w, label='Actual Rate', color='#E63946', alpha=0.85)
ax.set_xticks(x)
ax.set_xticklabels(summary.index, fontsize=9)
ax.set_ylabel('Probability / Win Rate')
# Label the top bucket dynamically in case qcut dropped duplicate edges
ax.set_xlabel(f'Probability Decile (D1 = lowest predicted, D{len(summary)} = highest)')
ax.set_title('Predicted vs Actual Win Rate by Probability Decile\n(Premier League 2023–2025)', fontsize=12, fontweight='bold')
ax.legend()
plt.tight_layout()
plt.savefig('assets/decile_reliability.png', dpi=150, bbox_inches='tight')
plt.show()
print("Saved: assets/decile_reliability.png")

![Decile Reliability](https://raw.githubusercontent.com/vkenard/football-performance-analytics/main/assets/decile_reliability.png)
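
The monotonicity check described for the decile table can also be done programmatically rather than by eye. The sketch below works on a hand-made table with the same column names as `summary`; the values, the `reliability_diagnostics` helper and the added `gap` and `drops_vs_previous` columns are hypothetical and for illustration only.

```python
import pandas as pd

def reliability_diagnostics(table):
    """Add the calibration gap and a monotonicity flag to each bucket."""
    out = table.copy()
    # Positive gap: events happened more often than predicted
    out["gap"] = out["actual_rate"] - out["mean_predicted"]
    # True where the observed rate falls relative to the previous bucket
    out["drops_vs_previous"] = out["actual_rate"].diff() < 0
    return out

# Made-up four-bucket table (not results from the notebook)
demo = pd.DataFrame(
    {
        "mean_predicted": [0.20, 0.40, 0.60, 0.80],
        "actual_rate": [0.25, 0.35, 0.65, 0.60],
        "count": [50, 50, 50, 50],
    },
    index=["D1", "D2", "D3", "D4"],
)

diag = reliability_diagnostics(demo)
is_monotonic = not diag["drops_vs_previous"].any()
violations = diag.index[diag["drops_vs_previous"]].tolist()

print(diag.to_string())
print(f"Monotonic: {is_monotonic} | Reversals at: {violations}")
```

Calling `reliability_diagnostics(summary)` on the notebook's table would give the same columns. With only a few dozen matches per bucket, a reversal may reflect noise rather than a model flaw, so it is worth reading the flag alongside the `count` column.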