# PS2: Match Winner Prediction - Rigorous Retraining

## ⚠️ CRITICAL CORRECTION: ADDRESSING POTENTIAL DATA LEAKAGE

My previous analysis produced unrealistically high performance. This was likely due to using a random train-test split, which allows the model to see data from the future and leads to **data leakage**.

This notebook has been completely rewritten to follow proper machine learning best practices for time-series data.

### Corrected Workflow:
1.  **Isolate Target:** The target variable (`FTR` - Full Time Result) is used only as the label. It is never an input feature, and each match's features are built solely from matches played before it.
2.  **Temporal Validation:** The data is sorted by match date and split chronologically. The model is trained on earlier matches and tested on the most recent matches.
3.  **Robust Models:** Focus on tree-based models that perform well on tabular data.
4.  **Realistic Evaluation:** Performance is measured on a hold-out test set of the most recent matches, providing an honest assessment of the model's predictive power.
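
The ordering that keeps a match's own result out of its features can be sketched in plain Python. This is a minimal illustration, not the notebook's feature code: the fixtures and the `goals_scored` history are hypothetical, and only one feature is computed.

```python
from collections import defaultdict

# Hypothetical toy fixtures, already in chronological order:
# (home, away, home_goals, away_goals)
matches = [
    ("A", "B", 2, 0),
    ("B", "A", 1, 1),
    ("A", "B", 0, 3),
]

goals_scored = defaultdict(list)  # per-team history of goals scored so far
rows = []

for home, away, hg, ag in matches:
    # 1. Read features from history only: the current result is not in it yet.
    past = goals_scored[home]
    home_avg_gs = sum(past) / len(past) if past else None
    rows.append({"home": home, "home_avg_gs": home_avg_gs})

    # 2. Only now record the current result, for use by later matches.
    goals_scored[home].append(hg)
    goals_scored[away].append(ag)

for row in rows:
    print(row)
```

Swapping steps 1 and 2 would let each row see its own scoreline, which is the kind of leakage this notebook is trying to avoid.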

### Prediction Goal
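- **Predict the match outcome for a given match.** In the dataset used here, `FTR` takes two values, `H` (home win) and `NH` (presumably any other result, i.e. a draw or an away win), so the task is binary rather than a three-way Home Win / Draw / Away Win prediction.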

## 1. Setup

In [11]:
# --- V4 Ultimate: Data Loading & Prep ---
# Objective: Load the granular dataset to build the most powerful predictive features.
# Methodology: We will use 'Match Winner.csv' as it contains the necessary raw match data.

import pandas as pd
import numpy as np
import joblib
import json
import os
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
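from sklearn.preprocessing import StandardScaler, LabelEncoder
# Metrics used when evaluating on the hold-out set in the training cell
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix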
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

# Check for XGBoost
try:
    import xgboost as xgb
    XGB_AVAILABLE = True
except ImportError:
    XGB_AVAILABLE = False

# --- 1. Data Loading and Basic Cleaning ---

# Define file paths
DATA_FILE = '../datasets/Match Winner.csv' # Reverting to the best available granular dataset
MODEL_DIR = 'models'
MODEL_NAME = 'ps2_match_winner_best_model_v4_ultimate.joblib'
METADATA_NAME = 'ps2_match_winner_metadata_v4_ultimate.json'

# Create model directory if it doesn't exist
os.makedirs(MODEL_DIR, exist_ok=True)

# Load the data
df = pd.read_csv(DATA_FILE)

# Data Cleaning and Preparation
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values('Date').reset_index(drop=True)
df = df[['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR']] # Select essential columns

# Remove matches with missing results
df.dropna(subset=['FTHG', 'FTAG', 'FTR'], inplace=True)

print("V4 Data loaded and cleaned successfully.")
print(f"Shape of the dataframe: {df.shape}")
print(f"Date range: {df['Date'].min()} to {df['Date'].max()}")
df.head()

V4 Data loaded and cleaned successfully.
Shape of the dataframe: (6840, 6)
Date range: 2000-08-19 00:00:00 to 2018-05-13 00:00:00




|   | Date | HomeTeam | AwayTeam | FTHG | FTAG | FTR |
|---|---|---|---|---|---|---|
| 0 | 2000-08-19 | Charlton | Man City | 4 | 0 | H |
| 1 | 2000-08-19 | Chelsea | West Ham | 4 | 2 | H |
| 2 | 2000-08-19 | Coventry | Middlesbrough | 1 | 3 | NH |
| 3 | 2000-08-19 | Derby | Southampton | 2 | 2 | NH |
| 4 | 2000-08-19 | Leeds | Everton | 2 | 0 | H |


In [12]:
# --- V4 Ultimate: Advanced Feature Engineering ---
# This cell computes Exponentially Weighted Moving Averages (EWMA) and Head-to-Head (H2H) stats.
# EWMA gives more weight to recent games, providing a more sensitive measure of form.
# H2H stats capture the specific historical dynamics between two teams.

# Function to calculate rolling features for a team
def get_advanced_features(team_history, opponent_history, h2h_history, windows, alphas):
    features = {}
    
    # --- Standard Rolling Averages ---
    for W in windows:
        if len(team_history) >= W:
            window_df = pd.DataFrame(team_history[-W:])
            features[f'avg_gs_{W}'] = window_df['gs'].mean()
            features[f'avg_gc_{W}'] = window_df['gc'].mean()
            features[f'avg_gd_{W}'] = window_df['gd'].mean()
            features[f'avg_pts_{W}'] = window_df['pts'].mean()
        else:
            features[f'avg_gs_{W}'] = np.nan
            features[f'avg_gc_{W}'] = np.nan
            features[f'avg_gd_{W}'] = np.nan
            features[f'avg_pts_{W}'] = np.nan

    # --- Exponentially Weighted Moving Averages (EWMA) ---
    if len(team_history) > 1:
        history_df = pd.DataFrame(team_history)
        for alpha in alphas:
            features[f'ewma_gs_{alpha}'] = history_df['gs'].ewm(alpha=alpha).mean().iloc[-1]
            features[f'ewma_gc_{alpha}'] = history_df['gc'].ewm(alpha=alpha).mean().iloc[-1]
            features[f'ewma_gd_{alpha}'] = history_df['gd'].ewm(alpha=alpha).mean().iloc[-1]
    else:
        for alpha in alphas:
            features[f'ewma_gs_{alpha}'] = np.nan
            features[f'ewma_gc_{alpha}'] = np.nan
            features[f'ewma_gd_{alpha}'] = np.nan
            
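    # --- Head-to-Head (H2H) Features ---
    # NOTE: h2h_history entries are recorded from the home side's perspective of each
    # past meeting, and the same list is passed for both teams, so these values are
    # not oriented to the team being described (H_h2h_* and A_h2h_* come out identical).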
    if len(h2h_history) > 0:
        h2h_df = pd.DataFrame(h2h_history)
        features['h2h_avg_gs'] = h2h_df['gs'].mean()
        features['h2h_avg_gc'] = h2h_df['gc'].mean()
        features['h2h_win_rate'] = (h2h_df['pts'] == 3).mean()
    else:
        features['h2h_avg_gs'] = np.nan
        features['h2h_avg_gc'] = np.nan
        features['h2h_win_rate'] = np.nan
        
    return features

# Define windows and alphas for EWMA
WINDOWS = [5, 10]
ALPHAS = [0.1, 0.2]

# Dictionaries to store histories
teams = pd.concat([df['HomeTeam'], df['AwayTeam']]).unique()
team_histories = {team: [] for team in teams}
h2h_histories = {} # Key will be a sorted tuple of team names

new_features_list = []

print("Starting V4 Ultimate feature engineering...")

# Iterate through each match chronologically
for index, row in df.iterrows():
    home_team = row['HomeTeam']
    away_team = row['AwayTeam']
    
    h2h_key = tuple(sorted((home_team, away_team)))
    if h2h_key not in h2h_histories:
        h2h_histories[h2h_key] = []

    # --- 1. Get features based on past data ---
    home_feats = get_advanced_features(
        team_histories[home_team], team_histories[away_team], h2h_histories[h2h_key],
        WINDOWS, ALPHAS
    )
    away_feats = get_advanced_features(
        team_histories[away_team], team_histories[home_team], h2h_histories[h2h_key],
        WINDOWS, ALPHAS
    )

    # Combine features
    match_features = {}
    for key, value in home_feats.items():
        match_features[f'H_{key}'] = value
    for key, value in away_feats.items():
        match_features[f'A_{key}'] = value
        
    # Add difference features
    for W in WINDOWS:
        match_features[f'diff_avg_gd_{W}'] = home_feats.get(f'avg_gd_{W}') - away_feats.get(f'avg_gd_{W}')
    for alpha in ALPHAS:
        match_features[f'diff_ewma_gd_{alpha}'] = home_feats.get(f'ewma_gd_{alpha}') - away_feats.get(f'ewma_gd_{alpha}')
        
    new_features_list.append(match_features)

    # --- 2. Update histories with the outcome of the *current* match ---
    home_goals = row['FTHG']
    away_goals = row['FTAG']
    
    if row['FTR'] == 'H':
        home_pts, away_pts = 3, 0
    elif row['FTR'] == 'A':
        home_pts, away_pts = 0, 3
    else:
        home_pts, away_pts = 1, 1

    team_histories[home_team].append({'gs': home_goals, 'gc': away_goals, 'gd': home_goals - away_goals, 'pts': home_pts})
    team_histories[away_team].append({'gs': away_goals, 'gc': home_goals, 'gd': away_goals - home_goals, 'pts': away_pts})
    h2h_histories[h2h_key].append({'team': home_team, 'gs': home_goals, 'gc': away_goals, 'pts': home_pts})

# Create a new DataFrame with the engineered features
features_df = pd.DataFrame(new_features_list, index=df.index)
df_featured = pd.concat([df, features_df], axis=1)
df_featured.dropna(inplace=True)

print("V4 Ultimate feature engineering complete.")
print(f"Shape of the new featured dataframe: {df_featured.shape}")
df_featured.head()

Starting V4 Ultimate feature engineering...
V4 Ultimate feature engineering complete.
Shape of the new featured dataframe: (6129, 44)


|   | Date | HomeTeam | AwayTeam | FTHG | FTAG | FTR | H_avg_gs_5 | H_avg_gc_5 | H_avg_gd_5 | H_avg_pts_5 | ... | A_ewma_gs_0.2 | A_ewma_gc_0.2 | A_ewma_gd_0.2 | A_h2h_avg_gs | A_h2h_avg_gc | A_h2h_win_rate | diff_avg_gd_5 | diff_avg_gd_10 | diff_ewma_gd_0.1 | diff_ewma_gd_0.2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 179 | 2000-12-22 | Coventry | Southampton | 1 | 1 | NH | 0.6 | 1.0 | -0.4 | 1.0 | ... | 1.175708 | 1.731583 | -0.555876 | 1.0 | 2.0 | 0.0 | -0.2 | -0.4 | -0.403576 | -0.219131 |
| 180 | 2000-12-23 | Man United | Ipswich | 2 | 0 | H | 1.8 | 0.8 | 1.0 | 1.4 | ... | 1.686901 | 0.810695 | 0.876206 | 1.0 | 1.0 | 0.0 | 0.2 | 1.0 | 0.698118 | 0.197919 |
| 181 | 2000-12-23 | Tottenham | Middlesbrough | 0 | 0 | NH | 1.8 | 1.4 | 0.4 | 1.6 | ... | 0.710837 | 1.267856 | -0.557019 | 1.0 | 1.0 | 0.0 | 1.2 | 0.9 | 0.505798 | 0.504492 |
| 182 | 2000-12-23 | Sunderland | Man City | 1 | 0 | H | 1.2 | 0.6 | 0.6 | 1.6 | ... | 1.835071 | 1.720754 | 0.114317 | 4.0 | 2.0 | 1.0 | 0.2 | 0.7 | 0.350594 | 0.072696 |
| 183 | 2000-12-23 | Liverpool | Arsenal | 4 | 0 | H | 1.2 | 1.0 | 0.2 | 1.0 | ... | 1.655085 | 0.59565 | 1.059435 | 2.0 | 0.0 | 1.0 | -0.4 | -0.1 | -0.408611 | -0.482969 |
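
The feature cell calls `ewm(alpha=alpha).mean()`. With pandas' default `adjust=True`, the last value of that series is a weighted mean whose weights decay as `(1 - alpha)**k`, counting back from the most recent match. The sketch below reproduces that arithmetic in plain Python to show why EWMA reacts faster to recent form than a flat average. The helper name `ewma_last` and the goal sequence are hypothetical.

```python
def ewma_last(values, alpha):
    """Final value of an exponentially weighted mean with normalised weights
    (1 - alpha)**k, where k = 0 is the most recent observation."""
    weights = [(1 - alpha) ** k for k in range(len(values))]
    newest_first = list(reversed(values))
    return sum(w * v for w, v in zip(weights, newest_first)) / sum(weights)

goals = [0, 0, 0, 3]  # hypothetical: three blanks, then a 3-goal game

print("EWMA (alpha=0.2):", ewma_last(goals, alpha=0.2))
print("Flat mean:       ", sum(goals) / len(goals))
```

Because the most recent match carries the largest weight, the EWMA for this sequence sits above the flat mean. A larger `alpha` makes the decay steeper.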


## 2. Define Features, Target, and Temporal Split

In [13]:
# --- V4: Define Features, Target, and Temporal Split ---

# Define features (X) and target (y) from the new, feature-rich dataframe
le = LabelEncoder()
df_featured['FTR_encoded'] = le.fit_transform(df_featured['FTR'])
target_mapping = {i: l for i, l in enumerate(le.classes_)}

y = df_featured['FTR_encoded']
X = df_featured.drop(columns=['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'FTR_encoded'])

# --- Temporal Train-Test Split ---
# Using last 15% as a hold-out test set.
split_index = int(len(df_featured) * 0.85)

X_train = X.iloc[:split_index]
y_train = y.iloc[:split_index]
X_test = X.iloc[split_index:]
y_test = y.iloc[split_index:]

print("--- V4: Temporal Split Details ---")
print(f"Features (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")
print(f"Training set size: {len(X_train)} matches")
print(f"Test set size: {len(X_test)} matches")
print(f"Target classes mapping: {target_mapping}")

--- V4: Temporal Split Details ---
Features (X) shape: (6129, 38)
Target (y) shape: (6129,)
Training set size: 5209 matches
Test set size: 920 matches
Target classes mapping: {0: 'H', 1: 'NH'}
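
The split sizes above follow directly from the 85% cut. As a minimal sketch, the same arithmetic can be wrapped in a helper that also refuses unsorted input, which is the precondition a temporal split depends on. `chronological_split` is a hypothetical name, and plain integers stand in for the sorted match dates.

```python
def chronological_split(sort_keys, train_frac=0.85):
    """Return (train_positions, test_positions) for rows already in time order."""
    if any(a > b for a, b in zip(sort_keys, sort_keys[1:])):
        raise ValueError("rows must be sorted chronologically before splitting")
    cut = int(len(sort_keys) * train_frac)  # int() truncates, e.g. 5209.65 -> 5209
    return range(0, cut), range(cut, len(sort_keys))

# 6129 rows remain after dropna, per the output above
train_idx, test_idx = chronological_split(list(range(6129)))

print(len(train_idx), len(test_idx))
print("last training row:", train_idx[-1], "| first test row:", test_idx[0])
```

Every test position comes after every training position, so no later match can influence a model that is evaluated on it.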


## 3. Train Models
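
The training cell in this section cross-validates with `TimeSeriesSplit(n_splits=5)`. The sketch below shows the general idea of that kind of expanding-window splitting in plain Python: each validation block comes strictly after the rows used for training. `expanding_window_folds` is a hypothetical helper written for illustration, and the exact fold boundaries scikit-learn produces may differ.

```python
def expanding_window_folds(n_samples, n_splits):
    """Yield (train_range, test_range) pairs in which every test block
    starts where its training block ends."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for k in range(n_splits):
        start = first_test_start + k * test_size
        yield range(0, start), range(start, start + test_size)

# 5209 is the training-set size reported by the split cell
folds = list(expanding_window_folds(n_samples=5209, n_splits=5))

for train, test in folds:
    print(f"train 0..{train.stop - 1}  ->  validate {test.start}..{test.stop - 1}")
```

The training window grows from fold to fold while the validation block slides forward, so hyperparameters are always scored on matches later than the ones they were fitted on.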

In [14]:
# --- V4 Ultimate: Exhaustive Model Training (GridSearchCV) ---
# This is the final, most intensive training step.
# We use GridSearchCV for an exhaustive search of the best hyperparameters.
# The parameter grid is expanded to allow for longer training times and more complex models.

def create_smote_pipeline(model):
    return ImbPipeline([
        ('imputer', SimpleImputer(strategy='mean')), 
        ('scaler', StandardScaler()),
        ('smote', SMOTE(random_state=42)),
        ('model', model)
    ])

# Define models to train
models = {
    'RandomForest': RandomForestClassifier(random_state=42),
    'XGBoost': xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='mlogloss')
}

# Define a more focused but deeper hyperparameter grid for GridSearchCV
# This will take a long time to run.
param_grids = {
    'RandomForest': {
        'model__n_estimators': [200, 400], # Increased estimators for longer training
        'model__max_depth': [10, 20, 30],
        'model__min_samples_leaf': [1, 2],
        'model__max_features': ['sqrt', 'log2']
    },
    'XGBoost': {
        'model__n_estimators': [200, 400], # Increased estimators
        'model__learning_rate': [0.05, 0.1],
        'model__max_depth': [5, 7],
        'model__subsample': [0.8, 1.0]
    }
}

# Use TimeSeriesSplit for cross-validation
tscv = TimeSeriesSplit(n_splits=5)

print(f"\n--- V4: Training Models with GridSearchCV (This will take a long time) ---")

results = {}
for name, model in models.items():
    print(f"Training {name}...")
    pipeline = create_smote_pipeline(model)
    
    # Using GridSearchCV for an exhaustive search
    search = GridSearchCV(
        estimator=pipeline,
        param_grid=param_grids[name],
        cv=tscv,
        scoring='accuracy',
        n_jobs=-1, # Use all available cores
        verbose=1 # Show progress
    )
    search.fit(X_train, y_train)
    
    y_pred = search.predict(X_test)
    
    results[name] = {
        'model': search.best_estimator_,
        'best_params': search.best_params_,
        'accuracy': accuracy_score(y_test, y_pred),
        'f1_macro': f1_score(y_test, y_pred, average='macro'),
        'report': classification_report(y_test, y_pred),
        'confusion_matrix': confusion_matrix(y_test, y_pred)
    }
    print(f"  Done. Best Accuracy on CV: {search.best_score_:.4f}")

# --- Display Results ---
print("\n--- V4: Model Performance on Hold-out Test Set (Ultimate Features) ---")
for name, res in results.items():
    print(f"\n----- {name} -----")
    print(f"  Accuracy: {res['accuracy']:.4f}")
    print(f"  F1 (Macro): {res['f1_macro']:.4f}")
    print("  Classification Report:")
    print(res['report'])
    print("  Confusion Matrix:")
    print(res['confusion_matrix'])


--- V4: Training Models with GridSearchCV (This will take a long time) ---
Training RandomForest...
Fitting 5 folds for each of 24 candidates, totalling 120 fits
  Done. Best Accuracy on CV: 0.6260
Training XGBoost...
Fitting 5 folds for each of 16 candidates, totalling 80 fits
  Done. Best Accuracy on CV: 0.6152

--- V4: Model Performance on Hold-out Test Set (Ultimate Features) ---

----- RandomForest -----
  Accuracy: 0.6380
  F1 (Macro): 0.6354
  Classification Report:
              precision    recall  f1-score   support

           0       0.61      0.60      0.60       424
           1       0.66      0.67      0.67       496

    accuracy                           0.64       920
   macro avg       0.64      0.64      0.64       920
weighted avg       0.64      0.64      0.64       920

  Confusion Matrix:
[[254 170]
 [163 333]]

----- XGBoost -----
  Accuracy: 0

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
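
The RandomForest figures above can be re-derived from its confusion matrix, which is a useful sanity check and also gives a majority-class baseline to compare against. This is a minimal sketch using only the four counts printed above; the baseline is an addition for context, not something the notebook computes.

```python
# Rows: actual class, columns: predicted class (RandomForest output above)
cm = [[254, 170],
      [163, 333]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

for k in (0, 1):
    predicted_k = cm[0][k] + cm[1][k]  # column sum: times class k was predicted
    actual_k = sum(cm[k])              # row sum: support of class k
    precision = cm[k][k] / predicted_k
    recall = cm[k][k] / actual_k
    print(f"class {k}: precision={precision:.2f} recall={recall:.2f} support={actual_k}")

# Accuracy of always predicting the more frequent class in the test set
majority_baseline = max(sum(row) for row in cm) / total
print(f"accuracy={accuracy:.4f}  majority-class baseline={majority_baseline:.4f}")
```

On these counts the model is roughly ten percentage points above always predicting class 1 (`NH`), which is a more informative reference point than raw accuracy alone.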


In [15]:
# --- V4 Ultimate: Save Best Model and Metadata ---

best_model_name = None
best_accuracy = -1

for name, res in results.items():
    if res['accuracy'] > best_accuracy:
        best_accuracy = res['accuracy']
        best_model_name = name

print(f"\n--- Saving Best Model (V4 Ultimate): {best_model_name} ---")

best_model_details = results[best_model_name]
best_model_obj = best_model_details['model']

# Save the model object
joblib.dump(best_model_obj, os.path.join(MODEL_DIR, MODEL_NAME))

# Create metadata dictionary
metadata = {
    "problem_statement": "PS2: Match Winner Prediction",
    "version": "4.0 (Ultimate Features & GridSearchCV)",
    "model_name": best_model_name,
    "model_path": os.path.join(MODEL_DIR, MODEL_NAME),
    "features": list(X.columns),
    "target_variable": "FTR_encoded",
    "target_mapping": target_mapping,
    "evaluation_metric": "accuracy",
    "performance_metrics": {
        "accuracy": best_model_details['accuracy'],
        "f1_macro": best_model_details['f1_macro'],
        "classification_report": best_model_details['report'],
        "confusion_matrix": best_model_details['confusion_matrix'].tolist()
    },
    "hyperparameters": best_model_details['best_params'],
    "training_methodology": "Temporal split (85/15), Ultimate Features (EWMA, H2H), GridSearchCV with 5-fold TimeSeriesSplit and SMOTE."
}

# Save metadata to JSON
with open(os.path.join(MODEL_DIR, METADATA_NAME), 'w') as f:
    json.dump(metadata, f, indent=4)

print(f"Model saved to: {os.path.join(MODEL_DIR, MODEL_NAME)}")
print(f"Metadata saved to: {os.path.join(MODEL_DIR, METADATA_NAME)}")


--- Saving Best Model (V4 Ultimate): RandomForest ---
Model saved to: models\ps2_match_winner_best_model_v4_ultimate.joblib
Metadata saved to: models\ps2_match_winner_metadata_v4_ultimate.json
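
One detail to keep in mind when reading the metadata file back: JSON object keys are always strings, so the integer keys of `target_mapping` do not survive the round trip as integers. The sketch below shows the effect and one way to undo it. The `predictions` list is hypothetical.

```python
import json

target_mapping = {0: "H", 1: "NH"}  # as printed by the split cell

# Round-trip through JSON, as json.dump / json.load on the metadata file would
restored = json.loads(json.dumps({"target_mapping": target_mapping}))
print(restored["target_mapping"])  # keys are now the strings "0" and "1"

# Convert keys back to int before decoding model predictions
decode = {int(k): v for k, v in restored["target_mapping"].items()}

predictions = [1, 0, 1]  # hypothetical encoded model output
labels = [decode[p] for p in predictions]
print(labels)
```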


## 4. Save Models

In [6]:
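# Compare all models
# NOTE: this cell carries an earlier execution count (In [6]) and targets a three-class (H/D/A) run.
# It relies on names not defined in the cells above (grid_lr, grid_rf, grid_gb, grid_xgb, grid_lgbm,
# y_pred_*, scaler, feature_cols) and on a `results` table with Model / Test_Accuracy / Test_F1 columns.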
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('Test_Accuracy', ascending=False)

print("\n" + "="*80)
print("📊 MODEL COMPARISON - PS2: Match Winner Prediction (H/D/A)")
print("="*80)
print(results_df.to_string(index=False))

# Find best model
best_model_name = results_df.iloc[0]['Model']
best_accuracy = results_df.iloc[0]['Test_Accuracy']
best_f1 = results_df.iloc[0]['Test_F1']

print(f"\n🏆 Best Model: {best_model_name}")
print(f"   Test Accuracy: {best_accuracy:.4f}")
print(f"   Test F1 Score: {best_f1:.4f}")

# Get best model object
model_mapping = {
    'Logistic Regression': grid_lr,
    'Random Forest': grid_rf,
    'Gradient Boosting': grid_gb,
    'XGBoost': grid_xgb,
    'LightGBM': grid_lgbm
}
best_model = model_mapping[best_model_name]

print(f"\n📈 Performance Analysis:")
print(f"   • Random baseline (guess): 33.3% accuracy")
print(f"   • Your model: {best_accuracy*100:.1f}% accuracy")
print(f"   • Improvement: {(best_accuracy - 0.333) / 0.333 * 100:.1f}% better than random!")
print(f"   • This dataset: {len(X_train)} training samples")
print(f"   • For football match prediction, 50-60% accuracy is EXCELLENT!")
print(f"   • Professional betting models achieve 50-55%")

# Confusion Matrix
print(f"\n📊 Confusion Matrix (Best Model):")
if best_model_name == 'Logistic Regression':
    y_pred_best = y_pred_lr
elif best_model_name == 'Random Forest':
    y_pred_best = y_pred_rf
elif best_model_name == 'Gradient Boosting':
    y_pred_best = y_pred_gb
elif best_model_name == 'XGBoost':
    y_pred_best = y_pred_xgb
else:
    y_pred_best = y_pred_lgbm

cm = confusion_matrix(y_test, y_pred_best)
print(cm)
print(f"\n   Rows: Actual | Columns: Predicted")
print(f"   [0] Home Win, [1] Draw, [2] Away Win")

# Classification Report
print(f"\n📋 Classification Report:")
print(classification_report(y_test, y_pred_best, 
                            target_names=['Home Win', 'Draw', 'Away Win']))

# Save results
results_df.to_csv('../visualizations/ps2_model_comparison.csv', index=False)
print(f"\n💾 Results saved to: ../visualizations/ps2_model_comparison.csv")

# Save best model
import joblib
joblib.dump({
    'model': best_model.best_estimator_,
    'scaler': scaler,
    'features': feature_cols,
    'model_name': best_model_name,
    'accuracy': best_accuracy,
    'f1_score': best_f1,
    'class_mapping': {0: 'H', 1: 'D', 2: 'A'},
    'class_names': {0: 'Home Win', 1: 'Draw', 2: 'Away Win'}
}, 'models/ps2_match_winner_model.joblib')

print(f"\n💾 Best model saved to: models/ps2_match_winner_model.joblib")

print("\n" + "="*80)
print("✅ PS2: MATCH WINNER PREDICTION - COMPLETE!")
print("="*80)
print(f"\n🎯 Final Results:")
print(f"   Best Model: {best_model_name}")
print(f"   Test Accuracy: {best_accuracy:.4f}")
print(f"   Test F1 Score: {best_f1:.4f}")
print(f"\n📝 Output Format for Deployment:")
print(f"   Input: Home team stats, Away team stats")
print(f"   Output: Team name (e.g., 'Manchester United') OR 'Draw'")
print(f"   NO scores, NO probabilities, ONLY winner name or 'Draw'")


📊 MODEL COMPARISON - PS2: Match Winner Prediction (H/D/A)
              Model                                                                                                                                           Best_Params  CV_Accuracy  Test_Accuracy  Test_F1  Training_Time_s
            XGBoost {'colsample_bytree': 0.8, 'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 100, 'num_class': 3, 'objective': 'multi:softmax', 'subsample': 1.0}     0.454463       0.460806 0.321700         9.474627
  Gradient Boosting                                                                        {'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 100, 'subsample': 1.0}     0.447682       0.443956 0.321353        57.016826
      Random Forest                                   {'class_weight': 'balanced', 'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 200}     0.402419       0.413919 0.364009        11.766964
           LightGBM {'class_weight': 'bala