# SCOPE - Model Training & Evaluation
## Premier League Corner Prediction using XGBoost

This notebook trains and evaluates the corner prediction model following GUIDE.md specifications.

**Workflow:**
1. Load and prepare data
2. Compute venue-aware rolling features
3. Train XGBoost model with time-based split
4. Evaluate on test set (2025-26 season, partial: 200 matches through 2026-01-04)
5. Analyze feature importance and calibration

In [11]:
# =============================================================================
# CELL 1: Imports
# =============================================================================

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
import pickle
from datetime import datetime

# ML imports
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import TimeSeriesSplit

# XGBoost
import xgboost as xgb

warnings.filterwarnings('ignore')
print("Imports complete")

Imports complete


In [12]:
# =============================================================================
# CELL 2: CONFIGURABLE PARAMETERS
# =============================================================================
# Adjust these parameters to experiment with different configurations

# ----- Data Parameters -----
ROLLING_WINDOW = 5          # Number of previous games for rolling stats (try: 3, 5, 10)
MIN_MATCHES_REQUIRED = 3    # Minimum matches before making predictions (not used below; the rolling loop requires ROLLING_WINDOW prior venue games)

# ----- Train/Test Split -----
TEST_SEASON = '2025-26'     # Season to use as test set

# ----- XGBoost Parameters -----
XGBOOST_PARAMS = {
    'objective': 'reg:squarederror',  # or 'count:poisson' for count data
    'eval_metric': 'rmse',
    'max_depth': 4,                    # Tree depth (try: 3, 4, 5, 6)
    'learning_rate': 0.05,             # Learning rate (try: 0.01, 0.03, 0.05, 0.1)
    'n_estimators': 500,               # Max trees (early stopping will find optimal)
    'min_child_weight': 10,            # Min hessian sum per leaf; equals row count for squared error (try: 5, 10, 20)
    'subsample': 0.8,                  # Row sampling (try: 0.7, 0.8, 0.9)
    'colsample_bytree': 0.8,           # Feature sampling (try: 0.7, 0.8, 0.9)
    'reg_alpha': 0.1,                  # L1 regularization
    'reg_lambda': 1.0,                 # L2 regularization
    'random_state': 42,
    'n_jobs': -1,
    'verbosity': 0
}

# ----- Early Stopping -----
EARLY_STOPPING_ROUNDS = 50  # Stop if no improvement for N rounds
VALIDATION_SPLIT = 0.2     # Fraction of training data for validation

# ----- Over/Under Thresholds -----
OU_THRESHOLDS = [8.5, 9.5, 10.5, 11.5, 12.5]

print("Parameters configured:")
print(f"  Rolling window: {ROLLING_WINDOW}")
print(f"  Test season: {TEST_SEASON}")
print(f"  XGBoost max_depth: {XGBOOST_PARAMS['max_depth']}")
print(f"  XGBoost learning_rate: {XGBOOST_PARAMS['learning_rate']}")

Parameters configured:
  Rolling window: 5
  Test season: 2025-26
  XGBoost max_depth: 4
  XGBoost learning_rate: 0.05


---
## Section 1: Data Loading

In [13]:
# =============================================================================
# CELL 3: Load Data from Football-Data.co.uk
# =============================================================================

SEASONS = {
    '2020-21': '2021',
    '2021-22': '2122',
    '2022-23': '2223',
    '2023-24': '2324',
    '2024-25': '2425',
    '2025-26': '2526'  # Test season
}

BASE_URL = 'https://www.football-data.co.uk/mmz4281/{code}/E0.csv'

COLS = ['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG',
        'HS', 'AS', 'HST', 'AST', 'HC', 'AC']

print("Loading data from Football-Data.co.uk...\n")
dfs = []
for season_name, season_code in SEASONS.items():
    url = BASE_URL.format(code=season_code)
    try:
        df = pd.read_csv(url, encoding='utf-8')
        available_cols = [c for c in COLS if c in df.columns]
        df = df[available_cols].copy()
        df['Season'] = season_name
        print(f"  {season_name}: {len(df)} matches")
        dfs.append(df)
    except Exception as e:
        print(f"  {season_name}: Failed - {e}")

df = pd.concat(dfs, ignore_index=True)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True, errors='coerce')
df = df.sort_values('Date').reset_index(drop=True)

# Create target
df['TotalCorners'] = df['HC'] + df['AC']

print(f"\nTotal matches: {len(df)}")
print(f"Date range: {df['Date'].min().strftime('%Y-%m-%d')} to {df['Date'].max().strftime('%Y-%m-%d')}")

Loading data from Football-Data.co.uk...

  2020-21: 380 matches
  2021-22: 380 matches
  2022-23: 380 matches
  2023-24: 380 matches
  2024-25: 380 matches
  2025-26: 200 matches

Total matches: 2100
Date range: 2020-09-12 to 2026-01-04


---
## Section 2: Feature Engineering

In [14]:
# =============================================================================
# CELL 4: Compute Venue-Aware Rolling Features
# =============================================================================

def compute_rolling_features(df, n=5):
    """
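    Compute all venue-aware rolling features as defined in GUIDE.md.
    Uses only each team's previous n games at the same venue (no data leakage).
    Assumes df is sorted by Date with a fresh index; windows run across season
    boundaries, and df is modified in place.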
    """
    print(f"Computing rolling features (N={n})...")
    
    # Initialize all feature columns
    rolling_cols = [
        # Category 1: Corners
        'home_corners_for', 'home_corners_against', 'home_corners_total', 'home_corner_std',
        'away_corners_for', 'away_corners_against', 'away_corners_total', 'away_corner_std',
        # Category 2: Shots
        'home_shots_for', 'home_shots_against', 'home_sot_for', 'home_sot_against',
        'away_shots_for', 'away_shots_against', 'away_sot_for', 'away_sot_against',
        # Category 5: Blocked shots (proxy: HS - HST, i.e. shots not on target)
        'home_blocked_shots', 'home_blocked_against',
        'away_blocked_shots', 'away_blocked_against',
        # Category 6: Shot dominance
        'home_shot_dominance', 'away_shot_dominance'
    ]
    
    for col in rolling_cols:
        df[col] = np.nan
    
    # Get all teams
    all_teams = set(df['HomeTeam'].unique()) | set(df['AwayTeam'].unique())
    
    # Process each team
    for team in all_teams:
        home_mask = df['HomeTeam'] == team
        away_mask = df['AwayTeam'] == team
        
        home_indices = df[home_mask].index.tolist()
        away_indices = df[away_mask].index.tolist()
        
        # HOME games rolling stats
        for i, idx in enumerate(home_indices):
            if i >= n:
                prev = home_indices[i-n:i]
                prev_data = df.loc[prev]
                
                df.loc[idx, 'home_corners_for'] = prev_data['HC'].mean()
                df.loc[idx, 'home_corners_against'] = prev_data['AC'].mean()
                df.loc[idx, 'home_corners_total'] = (prev_data['HC'] + prev_data['AC']).mean()
                df.loc[idx, 'home_corner_std'] = prev_data['HC'].std()
                
                df.loc[idx, 'home_shots_for'] = prev_data['HS'].mean()
                df.loc[idx, 'home_shots_against'] = prev_data['AS'].mean()
                df.loc[idx, 'home_sot_for'] = prev_data['HST'].mean()
                df.loc[idx, 'home_sot_against'] = prev_data['AST'].mean()
                
                df.loc[idx, 'home_blocked_shots'] = (prev_data['HS'] - prev_data['HST']).mean()
                df.loc[idx, 'home_blocked_against'] = (prev_data['AS'] - prev_data['AST']).mean()
                df.loc[idx, 'home_shot_dominance'] = (prev_data['HS'] - prev_data['AS']).mean()
        
        # AWAY games rolling stats
        for i, idx in enumerate(away_indices):
            if i >= n:
                prev = away_indices[i-n:i]
                prev_data = df.loc[prev]
                
                df.loc[idx, 'away_corners_for'] = prev_data['AC'].mean()
                df.loc[idx, 'away_corners_against'] = prev_data['HC'].mean()
                df.loc[idx, 'away_corners_total'] = (prev_data['HC'] + prev_data['AC']).mean()
                df.loc[idx, 'away_corner_std'] = prev_data['AC'].std()
                
                df.loc[idx, 'away_shots_for'] = prev_data['AS'].mean()
                df.loc[idx, 'away_shots_against'] = prev_data['HS'].mean()
                df.loc[idx, 'away_sot_for'] = prev_data['AST'].mean()
                df.loc[idx, 'away_sot_against'] = prev_data['HST'].mean()
                
                df.loc[idx, 'away_blocked_shots'] = (prev_data['AS'] - prev_data['AST']).mean()
                df.loc[idx, 'away_blocked_against'] = (prev_data['HS'] - prev_data['HST']).mean()
                df.loc[idx, 'away_shot_dominance'] = (prev_data['AS'] - prev_data['HS']).mean()
    
    print("  Rolling features computed.")
    return df

df = compute_rolling_features(df, n=ROLLING_WINDOW)

Computing rolling features (N=5)...
  Rolling features computed.
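
The nested loops in Cell 4 assign one cell at a time with `df.loc[idx, ...]`, which can get slow as the number of seasons grows. A minimal sketch of a vectorized alternative, assuming the frame is already sorted by date: group by team at a given venue, shift by one match so the current match is excluded from its own window, then take a rolling mean. The toy data and the helper name `venue_rolling_mean` are hypothetical, not part of the notebook.

```python
import pandas as pd

def venue_rolling_mean(df, team_col, value_col, n):
    """Mean of `value_col` over each team's previous `n` games at this venue.

    shift(1) drops the current match from its own window (no leakage);
    min_periods=n leaves NaN until `n` earlier venue games exist,
    which mirrors the `if i >= n` guard in the loop version.
    """
    return df.groupby(team_col)[value_col].transform(
        lambda s: s.shift(1).rolling(n, min_periods=n).mean()
    )

# Toy example (made-up numbers): team "A" plays four home games, "B" plays one.
toy = pd.DataFrame({
    "HomeTeam": ["A", "A", "B", "A", "A"],
    "HC":       [4,   6,   3,   8,   10],
})
toy["home_corners_for"] = venue_rolling_mean(toy, "HomeTeam", "HC", n=2)
print(toy)
```

The same call with `team_col="AwayTeam"` and `value_col="AC"` would give the away-side equivalent; swapping `.mean()` for `.std()` would cover the volatility columns.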


In [15]:
# =============================================================================
# CELL 5: Compute Match-Level Composite Features
# =============================================================================

def compute_match_features(df):
    """Compute match-level composite features from rolling stats."""
    print("Computing match-level features...")
    
    # Category 1: Corner composites
    df['expected_corners_for'] = df['home_corners_for'] + df['away_corners_for']
    df['expected_corners_against'] = df['home_corners_against'] + df['away_corners_against']
    df['expected_corners_total'] = (df['home_corners_total'] + df['away_corners_total']) / 2
    df['corner_differential'] = df['home_corners_for'] - df['away_corners_for']
    
    # Category 2: Shot composites
    df['combined_shots_for'] = df['home_shots_for'] + df['away_shots_for']
    df['combined_sot_for'] = df['home_sot_for'] + df['away_sot_for']
    df['shot_differential'] = df['home_shots_for'] - df['away_shots_for']
    
    # Category 3: Efficiency ratios
    df['home_shot_accuracy'] = df['home_sot_for'] / df['home_shots_for'].replace(0, np.nan)
    df['away_shot_accuracy'] = df['away_sot_for'] / df['away_shots_for'].replace(0, np.nan)
    df['avg_shot_accuracy'] = (df['home_shot_accuracy'] + df['away_shot_accuracy']) / 2
    df['home_corners_per_shot'] = df['home_corners_for'] / df['home_shots_for'].replace(0, np.nan)
    df['away_corners_per_shot'] = df['away_corners_for'] / df['away_shots_for'].replace(0, np.nan)
    df['combined_corners_per_shot'] = df['home_corners_per_shot'] + df['away_corners_per_shot']
    
    # Category 4: Pressure index
    df['home_shot_share'] = df['home_shots_for'] / (df['home_shots_for'] + df['home_shots_against']).replace(0, np.nan)
    df['away_shot_share'] = df['away_shots_for'] / (df['away_shots_for'] + df['away_shots_against']).replace(0, np.nan)
    df['home_corner_share'] = df['home_corners_for'] / (df['home_corners_for'] + df['home_corners_against']).replace(0, np.nan)
    df['away_corner_share'] = df['away_corners_for'] / (df['away_corners_for'] + df['away_corners_against']).replace(0, np.nan)
    df['pressure_sum'] = df['home_shot_share'] + df['away_shot_share']
    df['pressure_gap'] = df['home_shot_share'] - df['away_shot_share']
    
    # Category 5: Blocked shots
    df['combined_blocked_shots'] = df['home_blocked_shots'] + df['away_blocked_shots']
    df['blocked_shot_ratio'] = df['combined_blocked_shots'] / df['combined_shots_for'].replace(0, np.nan)
    
    # Category 6: Shot imbalance
    df['expected_shot_imbalance'] = abs(df['home_shot_dominance'] - df['away_shot_dominance'])
    df['dominance_mismatch'] = df['home_shot_dominance'] + df['away_shot_dominance']
    
    # Category 7: Volatility
    df['home_corner_cv'] = df['home_corner_std'] / df['home_corners_for'].replace(0, np.nan)
    df['away_corner_cv'] = df['away_corner_std'] / df['away_corners_for'].replace(0, np.nan)
    df['combined_corner_volatility'] = df['home_corner_std'] + df['away_corner_std']
    
    print("  Match-level features computed.")
    return df

df = compute_match_features(df)
print(f"\nTotal columns: {len(df.columns)}")

Computing match-level features...
  Match-level features computed.

Total columns: 61


In [16]:
# =============================================================================
# CELL 6: Define Feature List
# =============================================================================

# All features to use for training
FEATURE_COLUMNS = [
    # Category 1: Rolling Corners
    'home_corners_for', 'home_corners_against', 'home_corners_total',
    'away_corners_for', 'away_corners_against', 'away_corners_total',
    'expected_corners_for', 'expected_corners_against', 'expected_corners_total',
    'corner_differential', 'home_corner_std', 'away_corner_std',
    
    # Category 2: Rolling Shots
    'home_shots_for', 'home_shots_against', 'home_sot_for', 'home_sot_against',
    'away_shots_for', 'away_shots_against', 'away_sot_for', 'away_sot_against',
    'combined_shots_for', 'combined_sot_for', 'shot_differential', 'avg_shot_accuracy',
    
    # Category 3: Efficiency Ratios
    'home_shot_accuracy', 'away_shot_accuracy',
    'home_corners_per_shot', 'away_corners_per_shot', 'combined_corners_per_shot',
    
    # Category 4: Pressure Index
    'home_shot_share', 'away_shot_share',
    'home_corner_share', 'away_corner_share',
    'pressure_sum', 'pressure_gap',
    
    # Category 5: Blocked Shots
    'home_blocked_shots', 'home_blocked_against',
    'away_blocked_shots', 'away_blocked_against',
    'combined_blocked_shots', 'blocked_shot_ratio',
    
    # Category 6: Shot Imbalance
    'home_shot_dominance', 'away_shot_dominance',
    'expected_shot_imbalance', 'dominance_mismatch',
    
    # Category 7: Volatility
    'home_corner_cv', 'away_corner_cv', 'combined_corner_volatility'
]

TARGET_COLUMN = 'TotalCorners'

print(f"Total features: {len(FEATURE_COLUMNS)}")
print(f"Target: {TARGET_COLUMN}")

Total features: 48
Target: TotalCorners


---
## Section 3: Train/Test Split

In [17]:
# =============================================================================
# CELL 7: Prepare Train/Test Data
# =============================================================================

# Filter to rows with complete features
df_model = df.dropna(subset=FEATURE_COLUMNS + [TARGET_COLUMN]).copy()
print(f"Matches with complete features: {len(df_model)} / {len(df)}")

# Time-based split
train_df = df_model[df_model['Season'] != TEST_SEASON].copy()
test_df = df_model[df_model['Season'] == TEST_SEASON].copy()

print(f"\nTraining set: {len(train_df)} matches")
print(f"  Seasons: {train_df['Season'].unique().tolist()}")
print(f"\nTest set: {len(test_df)} matches")
print(f"  Season: {TEST_SEASON}")

# Prepare X and y
X_train = train_df[FEATURE_COLUMNS]
y_train = train_df[TARGET_COLUMN]

X_test = test_df[FEATURE_COLUMNS]
y_test = test_df[TARGET_COLUMN]

print(f"\nFeature matrix shape: {X_train.shape}")

Matches with complete features: 1920 / 2100

Training set: 1730 matches
  Seasons: ['2020-21', '2021-22', '2022-23', '2023-24', '2024-25']

Test set: 190 matches
  Season: 2025-26

Feature matrix shape: (1730, 48)


In [18]:
# =============================================================================
# CELL 8: Create Validation Split for Early Stopping
# =============================================================================

# Use last portion of training data as validation (time-based)
val_size = int(len(X_train) * VALIDATION_SPLIT)
X_train_fit = X_train.iloc[:-val_size]
y_train_fit = y_train.iloc[:-val_size]
X_val = X_train.iloc[-val_size:]
y_val = y_train.iloc[-val_size:]

print(f"Training (fit): {len(X_train_fit)} matches")
print(f"Validation: {len(X_val)} matches")
print(f"Test: {len(X_test)} matches")

Training (fit): 1384 matches
Validation: 346 matches
Test: 190 matches
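
A single 20% holdout gives only one validation estimate, which can be noisy on a few hundred matches. As a sketch, scikit-learn's `TimeSeriesSplit` (imported in Cell 1 but not used) produces expanding-window folds in which every validation block comes strictly after its training block. The small array below is a stand-in for the chronologically sorted training rows.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Stand-in for chronologically ordered rows (row order = time order).
X = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = [(train_idx.tolist(), val_idx.tolist()) for train_idx, val_idx in tscv.split(X)]

for k, (train_idx, val_idx) in enumerate(folds):
    # Every validation index is later than every training index.
    print(f"fold {k}: train={train_idx} val={val_idx}")
```

Averaging the validation RMSE across folds would give a steadier signal for comparing the parameter settings in Cell 2 than one split alone.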


---
## Section 4: Model Training

In [19]:
# =============================================================================
# CELL 9: Train XGBoost Model
# =============================================================================

print("Training XGBoost model...")
print(f"Parameters: {XGBOOST_PARAMS}")

# Create model with early stopping in the constructor (accepted there since XGBoost 1.6;
# the older fit() argument was deprecated and later removed)
model = xgb.XGBRegressor(
    **XGBOOST_PARAMS,
    early_stopping_rounds=EARLY_STOPPING_ROUNDS
)

# Train with eval set for early stopping
model.fit(
    X_train_fit, y_train_fit,
    eval_set=[(X_val, y_val)],
    verbose=False
)

print(f"\nTraining complete!")
print(f"Best iteration: {model.best_iteration}")
print(f"Best validation RMSE: {model.best_score:.4f}")

Training XGBoost model...
Parameters: {'objective': 'reg:squarederror', 'eval_metric': 'rmse', 'max_depth': 4, 'learning_rate': 0.05, 'n_estimators': 500, 'min_child_weight': 10, 'subsample': 0.8, 'colsample_bytree': 0.8, 'reg_alpha': 0.1, 'reg_lambda': 1.0, 'random_state': 42, 'n_jobs': -1, 'verbosity': 0}

Training complete!
Best iteration: 5
Best validation RMSE: 3.4900
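
A best iteration of 5 (zero-based, so six trees) with a patience of 50 means the validation RMSE never improved after the sixth tree. The rule behind that can be sketched in a few lines. This is an illustration of patience-based stopping in general, not XGBoost's internal implementation; the function name and the loss curve are hypothetical.

```python
def best_round_with_patience(val_losses, patience):
    """Return (best_round, stopped_round) for a patience-based stopping rule.

    Training stops once `patience` rounds have passed without improving on
    the best validation loss seen so far. Rounds are zero-based.
    """
    best_round, best_loss = 0, float("inf")
    for rnd, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = rnd, loss
        elif rnd - best_round >= patience:
            return best_round, rnd
    return best_round, len(val_losses) - 1

# Hypothetical validation RMSE curve: improves until round 2, then drifts up.
losses = [3.60, 3.52, 3.49, 3.50, 3.51, 3.53, 3.55]
print(best_round_with_patience(losses, patience=3))
```

When the best round lands this early, the features likely carry little signal beyond the mean, so a lower learning rate or more trees is unlikely to help on its own.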


---
## Section 5: Model Evaluation

In [20]:
# =============================================================================
# CELL 10: Make Predictions
# =============================================================================

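# Predictions on all sets
# Note: X_train holds the fit rows plus the validation rows, so the
# "Training Set" metrics in Cell 11 cover both.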
y_pred_train = model.predict(X_train)
y_pred_val = model.predict(X_val)
y_pred_test = model.predict(X_test)

# Add predictions to test dataframe
test_df['Predicted'] = y_pred_test
test_df['Residual'] = test_df['TotalCorners'] - test_df['Predicted']

print("Predictions generated.")

Predictions generated.


In [21]:
# =============================================================================
# CELL 11: Evaluation Metrics
# =============================================================================

def evaluate_predictions(y_true, y_pred, set_name):
    """Calculate and print evaluation metrics."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    corr = np.corrcoef(y_true, y_pred)[0, 1]
    
    print(f"\n{set_name}:")
    print(f"  RMSE: {rmse:.4f}")
    print(f"  MAE:  {mae:.4f}")
    print(f"  R²:   {r2:.4f}")
    print(f"  Correlation: {corr:.4f}")
    
    return {'rmse': rmse, 'mae': mae, 'r2': r2, 'corr': corr}

print("="*60)
print("MODEL EVALUATION")
print("="*60)

train_metrics = evaluate_predictions(y_train, y_pred_train, "Training Set")
val_metrics = evaluate_predictions(y_val, y_pred_val, "Validation Set")
test_metrics = evaluate_predictions(y_test, y_pred_test, "Test Set")

print("\n" + "="*60)

MODEL EVALUATION

Training Set:
  RMSE: 3.3790
  MAE:  2.6983
  R²:   0.0347
  Correlation: 0.3815

Validation Set:
  RMSE: 3.4900
  MAE:  2.7896
  R²:   0.0010
  Correlation: 0.0731

Test Set:
  RMSE: 3.3665
  MAE:  2.6842
  R²:   -0.0296
  Correlation: -0.0262
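
A negative test R² means the model scores worse than a constant prediction equal to the test-set mean. Scoring a constant baseline with the same metrics makes that comparison explicit. The sketch below uses made-up corner totals and hypothetical helper names; in the notebook, the baseline would be `y_train.mean()` scored against `y_test`.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination, relative to the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up corner totals for illustration only.
y_train_toy = [9, 11, 10, 12, 8]
y_test_toy = [7, 13, 10, 12]

baseline = sum(y_train_toy) / len(y_train_toy)   # constant prediction = 10.0
const_pred = [baseline] * len(y_test_toy)

print(f"baseline RMSE: {rmse(y_test_toy, const_pred):.4f}")
print(f"baseline R^2:  {r2(y_test_toy, const_pred):.4f}")
```

A training-mean baseline scores an R² at or slightly below zero by construction, so a model is only adding value if its test RMSE is clearly lower than this number.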



In [28]:
# =============================================================================
# CELL 12: Over/Under Accuracy
# =============================================================================

print("="*60)
print("OVER/UNDER ACCURACY (Test Set)")
print("="*60)

for threshold in OU_THRESHOLDS:
    actual_over = (y_test > threshold)
    actual_under = (y_test <= threshold)
    pred_over = (y_pred_test > threshold)
    pred_under = (y_pred_test <= threshold)
    
    # Overall accuracy
    overall_acc = ((actual_over == pred_over).mean()) * 100
    
    # Over accuracy: when we predict Over, how often correct?
    over_correct = ((pred_over) & (actual_over)).sum()
    over_total = pred_over.sum()
    over_acc = (over_correct / over_total * 100) if over_total > 0 else 0
    
    # Under accuracy: when we predict Under, how often correct?
    under_correct = ((pred_under) & (actual_under)).sum()
    under_total = pred_under.sum()
    under_acc = (under_correct / under_total * 100) if under_total > 0 else 0
    
    # Rates
    actual_over_rate = actual_over.mean() * 100
    pred_over_rate = pred_over.mean() * 100
    
    print(f"\n{'='*40}")
    print(f"Threshold: {threshold}")
    print(f"{'='*40}")
    print(f"  Overall Accuracy:    {overall_acc:.1f}%")
    print(f"  Over Accuracy:       {over_acc:.1f}% ({over_correct}/{over_total} correct)")
    print(f"  Under Accuracy:      {under_acc:.1f}% ({under_correct}/{under_total} correct)")
    print(f"  ---")
    print(f"  Actual Over rate:    {actual_over_rate:.1f}%")
    print(f"  Predicted Over rate: {pred_over_rate:.1f}%")

OVER/UNDER ACCURACY (Test Set)

Threshold: 8.5
  Overall Accuracy:    65.8%
  Over Accuracy:       65.8% (125/190 correct)
  Under Accuracy:      0.0% (0/0 correct)
  ---
  Actual Over rate:    65.8%
  Predicted Over rate: 100.0%

Threshold: 9.5
  Overall Accuracy:    55.3%
  Over Accuracy:       55.3% (105/190 correct)
  Under Accuracy:      0.0% (0/0 correct)
  ---
  Actual Over rate:    55.3%
  Predicted Over rate: 100.0%

Threshold: 10.5
  Overall Accuracy:    53.2%
  Over Accuracy:       40.5% (17/42 correct)
  Under Accuracy:      56.8% (84/148 correct)
  ---
  Actual Over rate:    42.6%
  Predicted Over rate: 22.1%

Threshold: 11.5
  Overall Accuracy:    72.1%
  Over Accuracy:       0.0% (0/0 correct)
  Under Accuracy:      72.1% (137/190 correct)
  ---
  Actual Over rate:    27.9%
  Predicted Over rate: 0.0%

Threshold: 12.5
  Overall Accuracy:    80.5%
  Over Accuracy:       0.0% (0/0 correct)
  Under Accuracy:      80.5% (153/190 correct)
  ---
  Actual Over rate:    19.5%
  Predicted Over rate: 0.0%

In [23]:
# =============================================================================
# CELL 13: Actual vs Predicted Plot
# =============================================================================

fig = px.scatter(
    x=y_test,
    y=y_pred_test,
    labels={'x': 'Actual Corners', 'y': 'Predicted Corners'},
    title=f'Actual vs Predicted Total Corners (Test Set, R² = {test_metrics["r2"]:.4f})',
    template='plotly_white',
    opacity=0.6
)

# Add perfect prediction line
min_val = min(y_test.min(), y_pred_test.min())
max_val = max(y_test.max(), y_pred_test.max())
fig.add_trace(go.Scatter(
    x=[min_val, max_val],
    y=[min_val, max_val],
    mode='lines',
    name='Perfect Prediction',
    line=dict(color='red', dash='dash')
))

fig.show()

In [24]:
# =============================================================================
# CELL 14: Residual Distribution
# =============================================================================

fig = make_subplots(rows=1, cols=2, subplot_titles=(
    'Residual Distribution', 'Residuals Over Time'
))

# Histogram of residuals
fig.add_trace(
    go.Histogram(x=test_df['Residual'], nbinsx=30, name='Residuals'),
    row=1, col=1
)

# Residuals over time
fig.add_trace(
    go.Scatter(
        x=test_df['Date'],
        y=test_df['Residual'],
        mode='markers',
        name='Residual',
        opacity=0.6
    ),
    row=1, col=2
)
fig.add_hline(y=0, line_dash='dash', line_color='red', row=1, col=2)

fig.update_layout(
    title='Residual Analysis (Test Set)',
    template='plotly_white',
    showlegend=False,
    height=400
)
fig.show()

print(f"Residual mean: {test_df['Residual'].mean():.4f}")
print(f"Residual std: {test_df['Residual'].std():.4f}")

Residual mean: -0.5260
Residual std: 3.3339
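
The residual mean and standard deviation tie back to the test RMSE through the identity RMSE² = bias² + variance, where the variance is the population form. Plugging in the printed values is a quick consistency check. The one subtlety is that pandas `.std()` uses the sample formula (ddof=1), so it is rescaled by (n - 1) / n first.

```python
import math

n = 190                 # test matches
bias = -0.5260          # residual mean (actual - predicted)
sample_std = 3.3339     # pandas .std(), ddof=1

# Convert sample variance to population variance, then add squared bias.
pop_var = sample_std ** 2 * (n - 1) / n
rmse = math.sqrt(bias ** 2 + pop_var)
print(f"{rmse:.4f}")
```

The result should land very close to the reported test RMSE of 3.3665. Because the residual is actual minus predicted, the negative mean indicates the model over-predicts by roughly half a corner per match, though that bias accounts for only a small share of the squared error.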


In [25]:
# =============================================================================
# CELL 15: Feature Importance
# =============================================================================

# Get feature importances
importance_df = pd.DataFrame({
    'Feature': FEATURE_COLUMNS,
    'Importance': model.feature_importances_
}).sort_values('Importance', ascending=False)

# Plot top 20 features
top_20 = importance_df.head(20)

fig = px.bar(
    top_20,
    x='Importance',
    y='Feature',
    orientation='h',
    title='Top 20 Feature Importances',
    template='plotly_white'
)
fig.update_layout(yaxis={'categoryorder': 'total ascending'}, height=600)
fig.show()

print("\nTop 10 Features:")
print(importance_df.head(10).to_string(index=False))


Top 10 Features:
               Feature  Importance
   home_shot_dominance    0.030819
    away_shots_against    0.030343
    away_corners_total    0.029979
  away_blocked_against    0.027746
 away_corners_per_shot    0.027573
    away_blocked_shots    0.025282
    combined_shots_for    0.025275
combined_blocked_shots    0.025093
          home_sot_for    0.024167
     shot_differential    0.023891


---
## Section 6: Calibration Analysis

In [26]:
# =============================================================================
# CELL 16: Calibration - Predicted vs Actual by Bin
# =============================================================================

# Bin predictions
test_df['PredBin'] = pd.cut(test_df['Predicted'], bins=10, labels=False)

calibration = test_df.groupby('PredBin').agg({
    'Predicted': 'mean',
    'TotalCorners': 'mean',
    'Date': 'count'
}).rename(columns={'Date': 'Count'}).reset_index()

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=calibration['Predicted'],
    y=calibration['TotalCorners'],
    mode='markers+lines',
    name='Actual',
    marker=dict(size=calibration['Count'] / 2)
))

# Perfect calibration line
min_val = calibration['Predicted'].min()
max_val = calibration['Predicted'].max()
fig.add_trace(go.Scatter(
    x=[min_val, max_val],
    y=[min_val, max_val],
    mode='lines',
    name='Perfect Calibration',
    line=dict(color='red', dash='dash')
))

fig.update_layout(
    title='Calibration Plot: Predicted vs Actual Corners',
    xaxis_title='Mean Predicted Corners',
    yaxis_title='Mean Actual Corners',
    template='plotly_white'
)
fig.show()

---
## Section 7: Save Model

In [27]:
# =============================================================================
# CELL 17: Save Model and Artifacts
# =============================================================================

# Save model
model_filename = f'model_xgb_{datetime.now().strftime("%Y%m%d_%H%M%S")}.pkl'
with open(model_filename, 'wb') as f:
    pickle.dump({
        'model': model,
        'feature_columns': FEATURE_COLUMNS,
        'params': XGBOOST_PARAMS,
        'rolling_window': ROLLING_WINDOW,
        'test_metrics': test_metrics,
        'train_date': datetime.now().isoformat()
    }, f)

print(f"Model saved to: {model_filename}")

# Save predictions
predictions_filename = f'predictions_{TEST_SEASON.replace("-", "")}.csv'
test_df[['Date', 'HomeTeam', 'AwayTeam', 'TotalCorners', 'Predicted', 'Residual']].to_csv(
    predictions_filename, index=False
)
print(f"Predictions saved to: {predictions_filename}")

Model saved to: model_xgb_20260106_202443.pkl
Predictions saved to: predictions_202526.csv


---
## Summary

This notebook trained an XGBoost model for predicting Premier League total corners.

**Key Results:**
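- Trained on five seasons (2020-21 to 2024-25) with venue-aware rolling features; early stopping selected iteration 5 of a possible 500, with validation RMSE 3.4900
- On the held-out 2025-26 test set (190 matches): RMSE 3.3665, MAE 2.6842, R² -0.0296, correlation -0.0262, so the model does no better than a constant prediction
- Over/under calls are one-sided at every threshold except 10.5, where overall accuracy is 53.2%; at the other thresholds accuracy simply equals the majority-class rate
- Feature importances are nearly flat (top feature 0.031 across 48 features), so no single feature drives the predictions
- Residual mean is -0.5260, meaning the model over-predicts total corners by about half a corner on average
- Calibration analysis (Section 6) compares binned mean predictions against mean actual totals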

**Next Steps:**
- Tune hyperparameters in Cell 2 to improve performance
- Try different rolling windows (N = 3, 5, 10)
- Experiment with `count:poisson` objective for count data
- Add more features or remove low-importance ones
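
If the `count:poisson` objective is tried, the prediction can be read as a Poisson mean, which gives an over/under probability for each threshold in `OU_THRESHOLDS` instead of a hard call. A minimal sketch, assuming total corners follow a Poisson distribution (real corner counts may be over-dispersed, so the probabilities are rough); `prob_over` and the mean of 10.3 are hypothetical.

```python
import math

def prob_over(mu, threshold):
    """P(X > threshold) for X ~ Poisson(mu), for a line such as 10.5."""
    k_max = math.floor(threshold)   # e.g. 10.5 -> counts 0..10 are "under"
    cdf = sum(math.exp(-mu) * mu ** k / math.factorial(k) for k in range(k_max + 1))
    return 1.0 - cdf

# Hypothetical predicted mean of 10.3 total corners.
for line in [8.5, 9.5, 10.5, 11.5, 12.5]:
    print(f"P(total > {line}) at mu=10.3: {prob_over(10.3, line):.3f}")
```

Probabilities like these can be checked against the observed over rates per threshold, which is a more direct calibration test for over/under use than the binned mean comparison.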