# Model 4: XGBoost - Training and Hyperparameter Tuning

This notebook trains and tunes two XGBoost regressors for hockey goal prediction: one for home goals and one for away goals.

## Table of Contents

1. Setup and Imports
2. Load or Generate Data
3. Baseline XGBoost Model
4. Random Search Hyperparameter Tuning
5. Evaluate Tuned Models
6. Cross-Validation Analysis
7. Feature Importance Analysis
8. Final Evaluation on Test Set
9. Save Results

## Hyperparameter Tuning Strategy

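Based on `config/hyperparams/model4_xgboost.yaml`, the tuning order is listed below. Note that this notebook searches all of these parameters jointly with a random search (Section 4) rather than stage by stage.

1. Start with defaults
2. Tune `max_depth` and `min_child_weight` (tree complexity; control overfitting)
3. Tune `gamma` (minimum loss reduction required to make a split)
4. Tune `subsample` and `colsample_bytree` (row and column sampling; reduce overfitting)
5. Tune regularization (`reg_alpha` for L1, `reg_lambda` for L2)
6. Lower `learning_rate` and increase `n_estimators`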
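
The staged order above can be sketched as a coordinate-wise search: tune one group of parameters, freeze the best values, then move to the next group. This is only an illustration under stated assumptions. `STAGES`, `staged_search`, and `toy_score` are hypothetical names, the candidate values are made up, and `toy_score` stands in for "fit a model and return validation RMSE".

```python
from itertools import product

# Stage order mirrors the strategy list; candidate values are illustrative only.
STAGES = [
    {"max_depth": [3, 5, 7], "min_child_weight": [1, 3, 5]},
    {"gamma": [0.0, 0.1, 0.3]},
    {"subsample": [0.7, 0.85, 1.0], "colsample_bytree": [0.7, 0.85, 1.0]},
    {"reg_alpha": [0.0, 0.1, 1.0], "reg_lambda": [0.5, 1.0, 5.0]},
    {"learning_rate": [0.1, 0.05, 0.01], "n_estimators": [300, 600, 1200]},
]


def staged_search(base_params, stages, score_fn):
    """Tune one stage at a time, freezing each stage's best values."""
    params = dict(base_params)
    for grid in stages:
        names = list(grid)
        best_score, best_combo = float("inf"), None
        for combo in product(*(grid[name] for name in names)):
            candidate = {**params, **dict(zip(names, combo))}
            score = score_fn(candidate)  # lower is better, e.g. validation RMSE
            if score < best_score:
                best_score, best_combo = score, combo
        params.update(zip(names, best_combo))
    return params


def toy_score(p):
    # Hypothetical stand-in for "fit a model and return validation RMSE".
    return abs(p.get("max_depth", 6) - 5) + abs(p.get("gamma", 0.0) - 0.1)


tuned = staged_search({"objective": "reg:squarederror"}, STAGES, toy_score)
print(tuned["max_depth"], tuned["gamma"])
```

A staged search evaluates far fewer combinations than a full grid, at the cost of ignoring interactions between parameters tuned in different stages.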

## 1. Setup and Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import os
import sys
import json
import yaml
import pathlib
from itertools import product
from datetime import datetime
import warnings

warnings.filterwarnings('ignore')

# Reliably set cwd to the python/ folder
_cwd = pathlib.Path(os.path.abspath('')).resolve()
if (_cwd / 'python').is_dir():
    _python_dir = _cwd / 'python'
elif _cwd.name == 'xgboost' and (_cwd.parent.parent / 'data').is_dir():
    _python_dir = _cwd.parent.parent
elif _cwd.name == 'training' and (_cwd.parent / 'data').is_dir():
    _python_dir = _cwd.parent
elif (_cwd / 'data').is_dir():
    _python_dir = _cwd
else:
    raise RuntimeError(f'Cannot locate python/ directory from {_cwd}')

os.chdir(_python_dir)
sys.path.insert(0, str(_python_dir))

# Configure plotting
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11
sns.set_style('whitegrid')

# Check XGBoost
try:
    import xgboost as xgb
    print(f" XGBoost version: {xgb.__version__}")
except ImportError:
    print(" XGBoost not installed. Run: pip install xgboost")
    raise

print(f"CWD: {os.getcwd()}")
print("Setup complete.")

In [None]:
# Load hyperparameter configuration
config_path = '../config/hyperparams/model4_xgboost.yaml'

if os.path.exists(config_path):
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)
    print(f"Loaded config: {config['model_name']}")
    print(f"Description: {config['description']}")
else:
    print(f"Config not found at {config_path}, using defaults")
    config = None
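
The loaded `config` is only printed in the cell above. If the YAML file also carried parameter defaults, they could be overlaid on hardcoded fallbacks as sketched here. The `default_params` key and the `params_from_config` helper are assumptions for illustration; the actual schema of `model4_xgboost.yaml` is not shown in this notebook.

```python
def params_from_config(config, fallback, key="default_params"):
    """Overlay config values on hardcoded fallbacks.

    `key` is a hypothetical section name, not a confirmed part of the YAML schema.
    """
    merged = dict(fallback)
    if config and isinstance(config.get(key), dict):
        merged.update(config[key])
    return merged


fallback = {"learning_rate": 0.05, "max_depth": 6}

# Config missing (the notebook sets config = None in that case): fallbacks win.
no_config = params_from_config(None, fallback)

# Config present with an override for one parameter.
with_config = params_from_config({"default_params": {"max_depth": 4}}, fallback)

print(no_config)
print(with_config)
```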

## 2. Load or Generate Data

In [None]:
# Try to load real data, otherwise generate synthetic
data_path = 'data/hockey_features.csv'

if os.path.exists(data_path):
    data = pd.read_csv(data_path)
    print(f"Loaded {len(data)} games from {data_path}")
else:
    print("Generating synthetic hockey data for demonstration...")
    
    np.random.seed(42)
    n_games = 2000
    
    data = pd.DataFrame({
        # Team strength metrics
        'home_win_pct': np.random.uniform(0.3, 0.7, n_games),
        'away_win_pct': np.random.uniform(0.3, 0.7, n_games),
        'home_points_pct': np.random.uniform(0.4, 0.8, n_games),
        'away_points_pct': np.random.uniform(0.4, 0.8, n_games),
        
        # Offensive metrics
        'home_goals_avg': np.random.uniform(2.5, 3.8, n_games),
        'away_goals_avg': np.random.uniform(2.5, 3.8, n_games),
        'home_shots_avg': np.random.uniform(28, 35, n_games),
        'away_shots_avg': np.random.uniform(28, 35, n_games),
        
        # Defensive metrics
        'home_goals_against_avg': np.random.uniform(2.2, 3.5, n_games),
        'away_goals_against_avg': np.random.uniform(2.2, 3.5, n_games),
        'home_save_pct': np.random.uniform(0.88, 0.93, n_games),
        'away_save_pct': np.random.uniform(0.88, 0.93, n_games),
        
        # Special teams
        'home_pp_pct': np.random.uniform(0.15, 0.28, n_games),
        'away_pp_pct': np.random.uniform(0.15, 0.28, n_games),
        'home_pk_pct': np.random.uniform(0.75, 0.88, n_games),
        'away_pk_pct': np.random.uniform(0.75, 0.88, n_games),
        
        # Context
        'home_rest_days': np.random.randint(1, 5, n_games),
        'away_rest_days': np.random.randint(1, 5, n_games),
        'home_b2b': np.random.binomial(1, 0.15, n_games),
        'away_b2b': np.random.binomial(1, 0.15, n_games),
        
        # Recent form (last 5 games)
        'home_goals_last5': np.random.uniform(2.0, 4.0, n_games),
        'away_goals_last5': np.random.uniform(2.0, 4.0, n_games),
        'home_wins_last5': np.random.randint(0, 6, n_games),
        'away_wins_last5': np.random.randint(0, 6, n_games),
    })
    
    # Generate realistic goal totals
    home_advantage = 0.35
    
    data['home_goals'] = np.round(
        data['home_goals_avg'] * 0.3 +
        data['home_goals_last5'] * 0.2 +
        (4 - data['away_goals_against_avg']) * 0.3 +
        data['home_pp_pct'] * 3 +
        home_advantage +
        (data['home_rest_days'] - data['away_rest_days']) * 0.1 +
        np.random.normal(0, 0.8, n_games)
    ).clip(0, 9).astype(int)
    
    data['away_goals'] = np.round(
        data['away_goals_avg'] * 0.3 +
        data['away_goals_last5'] * 0.2 +
        (4 - data['home_goals_against_avg']) * 0.3 +
        data['away_pp_pct'] * 3 +
        np.random.normal(0, 0.8, n_games)
    ).clip(0, 9).astype(int)
    
    print(f"Generated {n_games} synthetic games")

print(f"\nDataset shape: {data.shape}")
print(f"Home goals mean: {data['home_goals'].mean():.2f}")
print(f"Away goals mean: {data['away_goals'].mean():.2f}")
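
As a sanity check for the synthetic branch, the expected goal totals can be worked out from the generating formula before rounding and clipping, using the fact that a uniform variable on [a, b] has mean (a + b) / 2. The rest-day term averages to zero because both teams draw from the same distribution. The helper name `uniform_mean` is illustrative, and rounding and clipping shift the printed means slightly.

```python
def uniform_mean(low, high):
    """Mean of a uniform distribution on [low, high]."""
    return (low + high) / 2


goals_avg = uniform_mean(2.5, 3.8)       # 3.15
goals_last5 = uniform_mean(2.0, 4.0)     # 3.0
goals_against = uniform_mean(2.2, 3.5)   # 2.85
pp_pct = uniform_mean(0.15, 0.28)        # 0.215
home_advantage = 0.35

# Same weights as the synthetic generator; the noise term has mean zero.
away_expected = (
    goals_avg * 0.3
    + goals_last5 * 0.2
    + (4 - goals_against) * 0.3
    + pp_pct * 3
)
home_expected = away_expected + home_advantage

print(f"Expected home goals: {home_expected:.3f}")
print(f"Expected away goals: {away_expected:.3f}")
```

If the means printed by the cell above are far from roughly 2.9 (home) and 2.5 (away), the data was probably loaded from `data/hockey_features.csv` rather than generated.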

In [None]:
# Prepare features and targets
target_cols = ['home_goals', 'away_goals']
exclude_cols = target_cols + ['home_team', 'away_team', 'date', 'game_id', 'season']

feature_cols = [col for col in data.columns if col not in exclude_cols]
print(f"Features ({len(feature_cols)}): {feature_cols[:10]}...")

X = data[feature_cols]
y_home = data['home_goals']
y_away = data['away_goals']

In [None]:
# Train/validation/test split (60/20/20)
X_trainval, X_test, y_home_trainval, y_home_test, y_away_trainval, y_away_test = train_test_split(
    X, y_home, y_away, test_size=0.2, random_state=42
)

X_train, X_val, y_home_train, y_home_val, y_away_train, y_away_val = train_test_split(
    X_trainval, y_home_trainval, y_away_trainval, test_size=0.25, random_state=42
)

print(f"Training set: {len(X_train)} games")
print(f"Validation set: {len(X_val)} games")
print(f"Test set: {len(X_test)} games")
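
The second split uses `test_size=0.25` because it is applied to the remaining 80% of the data: 0.25 × 0.8 = 0.2, which yields the 60/20/20 proportions. The arithmetic can be checked with the small sketch below. `split_sizes` is a hypothetical helper, and the ceiling on the held-out size reflects my understanding of how scikit-learn rounds a fractional `test_size`.

```python
import math


def split_sizes(n_samples, test_frac=0.2, val_frac_of_rest=0.25):
    """Return (train, validation, test) sizes for the two-step split."""
    n_test = math.ceil(test_frac * n_samples)
    n_trainval = n_samples - n_test
    n_val = math.ceil(val_frac_of_rest * n_trainval)
    return n_trainval - n_val, n_val, n_test


sizes = split_sizes(2000)
print(sizes)  # (1200, 400, 400)
```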

## 3. Baseline XGBoost Model

In [None]:
# Baseline parameters (hardcoded here; the loaded `config` is not read in this cell)
default_params = {
    'learning_rate': 0.05,
    'n_estimators': 500,
    'max_depth': 6,
    'min_child_weight': 3,
    'gamma': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'reg_alpha': 0.1,
    'reg_lambda': 1.0,
    'objective': 'reg:squarederror',
    'random_state': 42,
    'n_jobs': -1,
    'tree_method': 'hist',
}

print("Default XGBoost Parameters:")
for k, v in default_params.items():
    print(f"  {k}: {v}")

In [None]:
# Train baseline model for home goals
baseline_home = xgb.XGBRegressor(**default_params)
baseline_home.fit(
    X_train, y_home_train,
    eval_set=[(X_train, y_home_train), (X_val, y_home_val)],
    verbose=False
)

# Train baseline model for away goals
baseline_away = xgb.XGBRegressor(**default_params)
baseline_away.fit(
    X_train, y_away_train,
    eval_set=[(X_train, y_away_train), (X_val, y_away_val)],
    verbose=False
)

print("Baseline models trained!")

In [None]:
# Evaluate baseline on validation set
def evaluate_models(home_model, away_model, X, y_home, y_away):
    """Evaluate both models and return combined metrics."""
    home_pred = home_model.predict(X)
    away_pred = away_model.predict(X)
    
    metrics = {
        'home_rmse': np.sqrt(mean_squared_error(y_home, home_pred)),
        'away_rmse': np.sqrt(mean_squared_error(y_away, away_pred)),
        'home_mae': mean_absolute_error(y_home, home_pred),
        'away_mae': mean_absolute_error(y_away, away_pred),
        'home_r2': r2_score(y_home, home_pred),
        'away_r2': r2_score(y_away, away_pred),
    }
    
    # Combined metrics
    all_pred = np.concatenate([home_pred, away_pred])
    all_actual = np.concatenate([y_home, y_away])
    metrics['combined_rmse'] = np.sqrt(mean_squared_error(all_actual, all_pred))
    metrics['combined_mae'] = mean_absolute_error(all_actual, all_pred)
    metrics['combined_r2'] = r2_score(all_actual, all_pred)
    
    return metrics

baseline_metrics = evaluate_models(baseline_home, baseline_away, X_val, y_home_val, y_away_val)

print("\n Baseline Validation Performance")
print("=" * 45)
print(f"\nHome Goals:")
print(f"  RMSE: {baseline_metrics['home_rmse']:.4f}")
print(f"  MAE:  {baseline_metrics['home_mae']:.4f}")
print(f"  R²:   {baseline_metrics['home_r2']:.4f}")
print(f"\nAway Goals:")
print(f"  RMSE: {baseline_metrics['away_rmse']:.4f}")
print(f"  MAE:  {baseline_metrics['away_mae']:.4f}")
print(f"  R²:   {baseline_metrics['away_r2']:.4f}")
print(f"\nCombined:")
print(f"  RMSE: {baseline_metrics['combined_rmse']:.4f}")
print(f"  MAE:  {baseline_metrics['combined_mae']:.4f}")
print(f"  R²:   {baseline_metrics['combined_r2']:.4f}")
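
The combined RMSE is computed on the concatenated home and away predictions, so it is not the average of the two per-target RMSE values. With equally sized targets it equals the square root of the mean of the two MSEs. The toy numbers below are hypothetical and only illustrate that relationship.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error for two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical toy values, not results from this notebook.
home_actual, home_pred = [3, 2, 4], [2.5, 2.5, 3.0]
away_actual, away_pred = [1, 3, 2], [2.0, 2.0, 2.0]

home_rmse = rmse(home_actual, home_pred)
away_rmse = rmse(away_actual, away_pred)

# Same pooling as evaluate_models: concatenate, then score once.
combined_rmse = rmse(home_actual + away_actual, home_pred + away_pred)

# Equivalent form when both targets have the same number of samples.
pooled = math.sqrt((home_rmse ** 2 + away_rmse ** 2) / 2)
naive_average = (home_rmse + away_rmse) / 2

print(f"combined={combined_rmse:.4f} pooled={pooled:.4f} average={naive_average:.4f}")
```

Because squaring weights large errors more heavily, the combined RMSE is never smaller than the simple average of the two RMSE values.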

## 4. Random Search Hyperparameter Tuning

In [None]:
# Parameter distributions for random search
param_distributions = {
    'learning_rate': [0.01, 0.03, 0.05, 0.1],
    'n_estimators': [200, 500, 800, 1000],
    'max_depth': [3, 4, 5, 6, 8],
    'min_child_weight': [1, 3, 5, 7],
    'gamma': [0, 0.1, 0.2, 0.3],
    'subsample': [0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.7, 0.8, 0.9, 1.0],
    'reg_alpha': [0, 0.1, 0.5, 1.0],
    'reg_lambda': [0.5, 1.0, 2.0, 5.0],
}

print(f"Parameters to tune: {len(param_distributions)}")
for k, v in param_distributions.items():
    print(f"  {k}: {v}")
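
To put the 50-iteration budget in context, the number of distinct combinations in this search space can be counted directly. The dictionary below repeats the values from the cell above so the snippet runs on its own.

```python
import math

param_distributions = {
    'learning_rate': [0.01, 0.03, 0.05, 0.1],
    'n_estimators': [200, 500, 800, 1000],
    'max_depth': [3, 4, 5, 6, 8],
    'min_child_weight': [1, 3, 5, 7],
    'gamma': [0, 0.1, 0.2, 0.3],
    'subsample': [0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.7, 0.8, 0.9, 1.0],
    'reg_alpha': [0, 0.1, 0.5, 1.0],
    'reg_lambda': [0.5, 1.0, 2.0, 5.0],
}

# Product of the number of candidate values per parameter.
n_combinations = math.prod(len(v) for v in param_distributions.values())
n_iter = 50

print(n_combinations)  # 327680
print(f"Coverage with {n_iter} samples: {n_iter / n_combinations:.4%}")
```

Fifty samples cover roughly 0.015% of the grid, which is why random search is used here instead of an exhaustive grid search.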

In [None]:
def random_search(X_train, y_train, X_val, y_val, param_dist, n_iter=50):
    """
    Perform random search for hyperparameters.
    """
    results = []
    best_score = float('inf')
    best_params = None
    best_model = None
    
    print(f"Running {n_iter} random search iterations...")
    
    for i in range(n_iter):
        # Sample random parameters
        params = {
            'objective': 'reg:squarederror',
            'random_state': 42,
            'n_jobs': -1,
            'tree_method': 'hist',
        }
        
        for key, values in param_dist.items():
            # Index into the list so sampled values stay native Python types
            params[key] = values[np.random.randint(len(values))]
        
        try:
            model = xgb.XGBRegressor(**params)
            model.fit(X_train, y_train, verbose=False)
            
            preds = model.predict(X_val)
            rmse = np.sqrt(mean_squared_error(y_val, preds))
            
            results.append({
                **{k: v for k, v in params.items() if k in param_dist},
                'val_rmse': rmse
            })
            
            if rmse < best_score:
                best_score = rmse
                best_params = params.copy()
                best_model = model
            
            if (i + 1) % 10 == 0:
                print(f"  Iteration {i + 1}/{n_iter} - Best RMSE: {best_score:.4f}")
                
        except Exception as e:
            print(f"  Error at iteration {i + 1}: {e}")
    
    return best_params, best_score, best_model, pd.DataFrame(results)

# Run random search for home goals
print("\n Random Search: Home Goals Model")
print("=" * 45)
best_home_params, best_home_score, best_home_model, home_results = random_search(
    X_train, y_home_train, X_val, y_home_val, param_distributions, n_iter=50
)

In [None]:
# Run random search for away goals
print("\n Random Search: Away Goals Model")
print("=" * 45)
best_away_params, best_away_score, best_away_model, away_results = random_search(
    X_train, y_away_train, X_val, y_away_val, param_distributions, n_iter=50
)
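
The searches above draw from NumPy's global random state, which is only seeded in the synthetic-data branch, so the sampled configurations can differ between runs when real data is loaded. One way to make sampling reproducible is to pass an explicitly seeded generator. The sketch below swaps in the standard library's `random.Random` for `np.random.choice`; `sample_params` is a hypothetical helper, not part of the notebook.

```python
import random


def sample_params(param_dist, rng):
    """Draw one value per parameter using the supplied generator."""
    return {name: rng.choice(values) for name, values in param_dist.items()}


grid = {
    "max_depth": [3, 4, 5, 6, 8],
    "subsample": [0.7, 0.8, 0.9, 1.0],
}

# Two generators with the same seed produce the same sequence of samples.
run_a = [sample_params(grid, rng) for rng in [random.Random(42)] for _ in range(3)]
run_b = [sample_params(grid, rng) for rng in [random.Random(42)] for _ in range(3)]

print(run_a == run_b)
print(type(run_a[0]["max_depth"]).__name__, type(run_a[0]["subsample"]).__name__)
```

Sampled values keep their native `int` and `float` types, which also keeps them directly JSON-serializable.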

In [None]:
# Show best parameters
print("\n Best Home Goals Parameters:")
print(f"   Validation RMSE: {best_home_score:.4f}")
for k, v in best_home_params.items():
    if k in param_distributions:
        print(f"   {k}: {v}")

print("\n Best Away Goals Parameters:")
print(f"   Validation RMSE: {best_away_score:.4f}")
for k, v in best_away_params.items():
    if k in param_distributions:
        print(f"   {k}: {v}")

In [None]:
# Visualize parameter importance from random search
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Top parameters to visualize
params_to_plot = ['learning_rate', 'max_depth', 'n_estimators', 'min_child_weight']

for ax, param in zip(axes.flatten(), params_to_plot):
    home_results.boxplot(column='val_rmse', by=param, ax=ax)
    ax.set_title(f'RMSE by {param}')
    ax.set_xlabel(param)
    ax.set_ylabel('Validation RMSE')

plt.suptitle('Parameter Impact on Validation RMSE (Home Goals)', y=1.02)
plt.tight_layout()
plt.show()

## 5. Evaluate Tuned Models

In [None]:
# Compare baseline vs tuned
tuned_metrics = evaluate_models(best_home_model, best_away_model, X_val, y_home_val, y_away_val)

print("\n Performance Comparison (Validation Set)")
print("=" * 50)
print(f"\n{'Metric':<20} {'Baseline':<15} {'Tuned':<15} {'Change':<10}")
print("-" * 60)

for metric in ['combined_rmse', 'combined_mae', 'combined_r2']:
    baseline_val = baseline_metrics[metric]
    tuned_val = tuned_metrics[metric]
    
    if 'r2' in metric:
        change = tuned_val - baseline_val
        change_str = f"{change:+.4f}"
    else:
        change = (tuned_val - baseline_val) / baseline_val * 100
        change_str = f"{change:+.2f}%"
    
    print(f"{metric:<20} {baseline_val:<15.4f} {tuned_val:<15.4f} {change_str:<10}")

## 6. Cross-Validation Analysis

In [None]:
# 5-fold cross-validation on the best home model parameters
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

cv_model = xgb.XGBRegressor(**best_home_params)
cv_scores = cross_val_score(
    cv_model, X_trainval, y_home_trainval,
    cv=kfold, scoring='neg_root_mean_squared_error'
)

cv_rmse = -cv_scores

print("\n 5-Fold Cross-Validation (Home Goals)")
print("=" * 45)
print(f"\nFold RMSE values: {cv_rmse}")
print(f"\nMean RMSE: {cv_rmse.mean():.4f} (+/- 2 std: {cv_rmse.std() * 2:.4f})")
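
The cell above reports the spread as twice the standard deviation of the fold RMSE values. NumPy's `std()` defaults to the population formula (`ddof=0`), which is slightly smaller than the sample standard deviation when there are only five folds. The fold values below are hypothetical and only show the difference.

```python
import statistics

# Hypothetical fold RMSE values, not results from this notebook.
fold_rmse = [0.91, 0.88, 0.93, 0.90, 0.89]

mean_rmse = statistics.fmean(fold_rmse)
population_std = statistics.pstdev(fold_rmse)  # divides by n, like NumPy's default
sample_std = statistics.stdev(fold_rmse)       # divides by n - 1

print(f"Mean RMSE: {mean_rmse:.4f}")
print(f"+/- 2 std (population): {2 * population_std:.4f}")
print(f"+/- 2 std (sample):     {2 * sample_std:.4f}")
```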

In [None]:
# CV visualization
fig, ax = plt.subplots(figsize=(10, 5))

folds = range(1, len(cv_rmse) + 1)
ax.bar(folds, cv_rmse, color='steelblue', edgecolor='black')
ax.axhline(cv_rmse.mean(), color='red', linestyle='--', label=f'Mean: {cv_rmse.mean():.4f}')
ax.fill_between(
    [0.5, len(cv_rmse) + 0.5],  # span all folds instead of hardcoding 5
    cv_rmse.mean() - cv_rmse.std(),
    cv_rmse.mean() + cv_rmse.std(),
    alpha=0.2, color='red', label=f'±1 std: {cv_rmse.std():.4f}'
)

ax.set_xlabel('Fold')
ax.set_ylabel('RMSE')
ax.set_title('Cross-Validation RMSE by Fold')
ax.set_xticks(folds)
ax.legend()

plt.tight_layout()
plt.show()

## 7. Feature Importance Analysis

In [None]:
# Feature importance from best home model
feature_importance = pd.Series(
    best_home_model.feature_importances_,
    index=feature_cols
).sort_values(ascending=False)

print("\n Top 15 Features (Home Goals Model)")
print("=" * 45)
for feat, imp in feature_importance.head(15).items():
    print(f"  {feat:<30} {imp:.4f}")

In [None]:
# Feature importance plot
fig, axes = plt.subplots(1, 2, figsize=(14, 8))

# Home model
home_importance = pd.Series(
    best_home_model.feature_importances_,
    index=feature_cols
).sort_values(ascending=True).tail(15)

axes[0].barh(home_importance.index, home_importance.values, color='steelblue')
axes[0].set_xlabel('Importance')
axes[0].set_title('Top 15 Features: Home Goals Model')

# Away model
away_importance = pd.Series(
    best_away_model.feature_importances_,
    index=feature_cols
).sort_values(ascending=True).tail(15)

axes[1].barh(away_importance.index, away_importance.values, color='darkorange')
axes[1].set_xlabel('Importance')
axes[1].set_title('Top 15 Features: Away Goals Model')

plt.tight_layout()
plt.show()

## 8. Final Evaluation on Test Set

In [None]:
# Retrain on full training+validation set
final_home = xgb.XGBRegressor(**best_home_params)
final_home.fit(X_trainval, y_home_trainval, verbose=False)

final_away = xgb.XGBRegressor(**best_away_params)
final_away.fit(X_trainval, y_away_trainval, verbose=False)

# Final test evaluation
test_metrics = evaluate_models(final_home, final_away, X_test, y_home_test, y_away_test)

print("\n FINAL TEST SET PERFORMANCE")
print("=" * 50)
print(f"\nHome Goals:")
print(f"  RMSE: {test_metrics['home_rmse']:.4f}")
print(f"  MAE:  {test_metrics['home_mae']:.4f}")
print(f"  R²:   {test_metrics['home_r2']:.4f}")
print(f"\nAway Goals:")
print(f"  RMSE: {test_metrics['away_rmse']:.4f}")
print(f"  MAE:  {test_metrics['away_mae']:.4f}")
print(f"  R²:   {test_metrics['away_r2']:.4f}")
print(f"\nCombined:")
print(f"  RMSE: {test_metrics['combined_rmse']:.4f}")
print(f"  MAE:  {test_metrics['combined_mae']:.4f}")
print(f"  R²:   {test_metrics['combined_r2']:.4f}")

In [None]:
# Prediction vs Actual plots
home_pred_test = final_home.predict(X_test)
away_pred_test = final_away.predict(X_test)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Home goals
axes[0].scatter(y_home_test, home_pred_test, alpha=0.4, edgecolors='black', linewidth=0.5)
axes[0].plot([0, 8], [0, 8], 'r--', linewidth=2, label='Perfect prediction')
axes[0].set_xlabel('Actual Home Goals')
axes[0].set_ylabel('Predicted Home Goals')
axes[0].set_title(f'Home Goals: Predicted vs Actual (RMSE={test_metrics["home_rmse"]:.3f})')
axes[0].legend()

# Away goals
axes[1].scatter(y_away_test, away_pred_test, alpha=0.4, edgecolors='black', linewidth=0.5, color='orange')
axes[1].plot([0, 8], [0, 8], 'r--', linewidth=2, label='Perfect prediction')
axes[1].set_xlabel('Actual Away Goals')
axes[1].set_ylabel('Predicted Away Goals')
axes[1].set_title(f'Away Goals: Predicted vs Actual (RMSE={test_metrics["away_rmse"]:.3f})')
axes[1].legend()

plt.tight_layout()
plt.show()

## 9. Save Results

In [None]:
# Create output directories
os.makedirs('output/models/xgboost', exist_ok=True)
os.makedirs('output/predictions/xgboost', exist_ok=True)

# Save models
final_home.save_model('output/models/xgboost/xgboost_home_best.json')
final_away.save_model('output/models/xgboost/xgboost_away_best.json')

print("Saved:")
print("   output/models/xgboost/xgboost_home_best.json")
print("   output/models/xgboost/xgboost_away_best.json")

In [None]:
# Save hyperparameters and results
results = {
    'timestamp': datetime.now().isoformat(),
    'model': 'XGBoost',
    # Convert any NumPy scalars (int or float) to native Python types so json.dump succeeds
    'home_params': {k: v.item() if isinstance(v, np.generic) else v
                    for k, v in best_home_params.items() if k in param_distributions},
    'away_params': {k: v.item() if isinstance(v, np.generic) else v
                    for k, v in best_away_params.items() if k in param_distributions},
    'test_metrics': {k: float(v) for k, v in test_metrics.items()},
    'cv_mean_rmse': float(cv_rmse.mean()),
    'cv_std_rmse': float(cv_rmse.std()),
    'n_train': len(X_trainval),
    'n_test': len(X_test),
    'n_features': len(feature_cols),
}

with open('output/predictions/xgboost/xgboost_results.json', 'w') as f:
    json.dump(results, f, indent=2)

print("\n Results saved to output/predictions/xgboost/xgboost_results.json")

In [None]:
# Save random search results
home_results.to_csv('output/predictions/xgboost/xgboost_home_random_search.csv', index=False)
away_results.to_csv('output/predictions/xgboost/xgboost_away_random_search.csv', index=False)

print(" Random search results saved")

## Summary

### Results

| Metric | Baseline | Tuned | Improvement |
|--------|----------|-------|-------------|
| Combined RMSE | - | - | - |
| Combined MAE | - | - | - |
| Combined R² | - | - | - |

### Key Findings

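1. **Best Parameters**: Selected by validation RMSE from 50 random search iterations per model (home and away)
2. **Most Important Features**: See the feature importance analysis (Section 7)
3. **Cross-validation**: 5-fold CV on the home goals model; judge stability from the spread of the fold RMSE values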

### Next Steps

- Try Optuna for more sophisticated hyperparameter search
- Add SHAP analysis for better interpretability
- Compare with Random Forest (`train_random_forest.ipynb`)
- Integrate into ensemble model