# PS5: Match Result Prediction (Retrained with Temporal Validation)

This notebook retrains the model for **Problem Statement 5: Predicting the match result (Home Win, Draw, Away Win)**.

**Correction Log:**
1.  **Data Source**: Switched to `data/raw/data_raw_match.csv` which contains the necessary `Date` and `FTR` columns for a proper temporal analysis.
2.  **Target Engineering**: The multi-class target `result` ('H', 'D', 'A') is derived from the full-time goal counts (`fthg` vs. `ftag`) rather than read directly from `FTR`, then encoded as 0, 1, 2.
3.  **Temporal Splitting**: Replaced the incorrect `train_test_split` with a strict chronological split. The first 80% of the data is used for training and the final 20% for testing, preserving the timeline.
4.  **Cross-Validation**: Implemented `TimeSeriesSplit` for all cross-validation procedures to prevent data leakage during hyperparameter tuning.
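
To make the expanding-window behaviour of `TimeSeriesSplit` concrete, the sketch below runs it on a placeholder array with the same number of rows as the training set reported later in this notebook (5,472). It assumes the rows are already sorted by date and uses the scikit-learn defaults (no `gap`, no `max_train_size`); `X_dummy` and its contents are illustrative only.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder for chronologically sorted training rows (values are irrelevant here).
n_train_rows = 5472
X_dummy = np.zeros((n_train_rows, 1))

tscv = TimeSeriesSplit(n_splits=5)
folds = list(tscv.split(X_dummy))

for fold, (train_idx, val_idx) in enumerate(folds, start=1):
    # Every validation index comes strictly after every training index.
    assert train_idx.max() < val_idx.min()
    print(f"fold {fold}: train rows 0-{train_idx[-1]}, "
          f"validate rows {val_idx[0]}-{val_idx[-1]}")
```

Because the training window grows from fold to fold, the early folds are fit on far fewer matches than the final model, so per-fold scores can vary noticeably.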

# Problem Statement 5: Match Result Prediction
## Predict Match Result (Home/Draw/Away)

**Author:** ScoreSight ML Team  
**Date:** 2025-11-12  
**Problem Type:** Multi-class Classification (H/D/A)

### Dataset
- **File:** `data/raw/data_raw_match.csv`
- **Task:** Predict match result (Home/Draw/Away)
- **Features:** 21 numeric columns plus 10 categorical columns (`hm1`-`hm5`, `am1`-`am5`)
- **Target:** Result (H/D/A)
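
As a minimal sketch of the target definition, the snippet below derives H/D/A from full-time goals on a three-row toy frame. The column names `fthg` and `ftag` follow the lower-cased raw file loaded later in this notebook; the scores are made up, and the vectorised `np.sign` approach is one possible alternative to a row-wise function, not necessarily what the notebook itself runs.

```python
import numpy as np
import pandas as pd

# Toy matches (made-up scores): home win, draw, away win.
toy = pd.DataFrame({"fthg": [2, 1, 0], "ftag": [1, 1, 3]})

# Sign of the goal difference: 1 = home win, 0 = draw, -1 = away win.
goal_diff_sign = np.sign(toy["fthg"] - toy["ftag"])
toy["result"] = goal_diff_sign.map({1: "H", 0: "D", -1: "A"})

# Integer encoding used for modelling: H -> 0, D -> 1, A -> 2.
toy["target"] = toy["result"].map({"H": 0, "D": 1, "A": 2})
print(toy)
```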

## 1. Setup

In [5]:
import pandas as pd
import numpy as np
import json
import joblib
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report
from sklearn.impute import SimpleImputer

# Define file paths
MODELS_DIR = Path('models')
MODELS_DIR.mkdir(exist_ok=True)

DATA_PATH = Path('../data/raw/data_raw_match.csv')

print("[OK] All libraries imported and paths configured.")

[OK] All libraries imported and paths configured.


## 2. Load Data

In [7]:
# --- 1. Data Loading and Initial Preprocessing ---
df = pd.read_csv(DATA_PATH)
df.columns = df.columns.str.lower().str.strip()

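# Convert date and sort
# format='mixed' (pandas >= 2.0) infers the format per element, e.g. yy vs yyyy years.
# Caution: ambiguous strings such as 01/10/00 are read month-first unless dayfirst=True
# is passed, so check the parsed date range against the seasons the file should cover.
df['date'] = pd.to_datetime(df['date'], format='mixed')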
df = df.sort_values('date').reset_index(drop=True)

# --- 2. Target Engineering ---
# The target is derived from full-time goals ('fthg' vs 'ftag') rather than read from
# 'ftr', because the raw 'ftr' values may be 'NH' (non-home win) instead of separate
# 'D' and 'A' labels. This yields a proper 3-class label:
# 'H' (Home Win), 'D' (Draw), 'A' (Away Win), later encoded as H -> 0, D -> 1, A -> 2.
def create_target(row):
    if row['fthg'] > row['ftag']:
        return 'H'
    elif row['fthg'] == row['ftag']:
        return 'D'
    else:
        return 'A'

df['result'] = df.apply(create_target, axis=1)
target_map = {'H': 0, 'D': 1, 'A': 2}
df['target'] = df['result'].map(target_map)


# --- 3. Feature Selection ---
# Exclude identifiers, date, target, and other non-feature columns
exclude_cols = [
    'unnamed: 0', 'date', 'hometeam', 'awayteam', 
    'fthg', 'ftag', 'ftr', 'result', 'target'
]

# Select numeric and categorical features
numeric_features = df.select_dtypes(include=np.number).columns.tolist()
numeric_features = [col for col in numeric_features if col not in exclude_cols]

categorical_features = [
    'hm1', 'hm2', 'hm3', 'hm4', 'hm5', 
    'am1', 'am2', 'am3', 'am4', 'am5'
]

# Final feature list
features = numeric_features + categorical_features
X = df[features].copy()
y = df['target'].copy()

print("--- Data Loading and Preparation ---")
print(f"Data loaded from: {DATA_PATH}")
print(f"Shape after sorting: {df.shape}")
print("\n--- Target Variable ---")
print("Engineered 'result' column from goal difference.")
print(df['result'].value_counts())
print("\nEncoded 'target' column (0=H, 1=D, 2=A):")
print(y.value_counts())

print("\n--- Features ---")
print(f"Total features selected: {len(features)}")
print(f"Numeric features ({len(numeric_features)}): {numeric_features[:5]}...")
print(f"Categorical features ({len(categorical_features)}): {categorical_features}")
print(f"\n[OK] Data prepared for temporal splitting.")

--- Data Loading and Preparation ---
Data loaded from: ..\data\raw\data_raw_match.csv
Shape after sorting: (6840, 42)

--- Target Variable ---
Engineered 'result' column from goal difference.
result
H    3176
A    1913
D    1751
Name: count, dtype: int64

Encoded 'target' column (0=H, 1=D, 2=A):
target
0    3176
2    1913
1    1751
Name: count, dtype: int64

--- Features ---
Total features selected: 31
Numeric features (21): ['htgs', 'atgs', 'htgc', 'atgc', 'htp']...
Categorical features (10): ['hm1', 'hm2', 'hm3', 'hm4', 'hm5', 'am1', 'am2', 'am3', 'am4', 'am5']

[OK] Data prepared for temporal splitting.
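
Date parsing deserves a sanity check before any chronological split, because a string such as `01/10/00` is ambiguous. The sketch below parses the same made-up strings under two explicit formats to show how far apart the results land. Whether the raw file is day-first or month-first is an assumption to verify against the data, not something this notebook states.

```python
import pandas as pd

raw = pd.Series(["01/10/00", "19/08/00"])  # made-up example strings

# Day-first reading: 1 October 2000 and 19 August 2000.
day_first = pd.to_datetime(raw, format="%d/%m/%y")

# Month-first reading of the first string alone: 10 January 2000.
month_first = pd.to_datetime(raw.iloc[0], format="%m/%d/%y")

print(day_first.dt.date.tolist())
print(month_first.date())
```

If the parsed date range does not match the seasons the file is known to cover, passing `dayfirst=True` or an explicit `format` to `pd.to_datetime` is worth trying.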


## 3. Temporal Train-Test Split

In [8]:
# --- 4. Temporal Train-Test Split ---
# Split the data chronologically: 80% for training, 20% for testing.
split_index = int(len(df) * 0.8)

X_train = X.iloc[:split_index]
X_test = X.iloc[split_index:]
y_train = y.iloc[:split_index]
y_test = y.iloc[split_index:]

train_dates = df['date'].iloc[:split_index]
test_dates = df['date'].iloc[split_index:]

print("--- Temporal Splitting ---")
print(f"Training data: {len(X_train)} samples (from {train_dates.min().date()} to {train_dates.max().date()})")
print(f"Testing data:  {len(X_test)} samples (from {test_dates.min().date()} to {test_dates.max().date()})")

print("\n--- Training Set Target Distribution ---")
print(y_train.value_counts(normalize=True))

print("\n--- Test Set Target Distribution ---")
print(y_test.value_counts(normalize=True))
print("\n[OK] Temporal split completed.")

--- Temporal Splitting ---
Training data: 5472 samples (from 2000-01-10 to 2014-12-13)
Testing data:  1368 samples (from 2014-12-13 to 2018-12-03)

--- Training Set Target Distribution ---
target
0    0.466557
2    0.275585
1    0.257858
Name: proportion, dtype: float64

--- Test Set Target Distribution ---
target
0    0.455409
2    0.296053
1    0.248538
Name: proportion, dtype: float64

[OK] Temporal split completed.
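
The output above shows the training period ending on the same date the test period begins, which means matches played on that day fall on both sides of a positional split. One possible alternative, sketched here on a made-up ten-row frame, is to take the date at the 80% position as a cutoff and split on it. The names `toy` and `cutoff` and all dates are illustrative, not part of the notebook.

```python
import pandas as pd

# Ten made-up matches, already sorted; rows 7 and 8 share a date.
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2014-11-01", "2014-11-08", "2014-11-15", "2014-11-22", "2014-11-29",
        "2014-12-06", "2014-12-06", "2014-12-13", "2014-12-13", "2014-12-20",
    ]),
    "target": [0, 1, 2, 0, 0, 1, 2, 0, 2, 1],
})

# A positional 80/20 split would put rows 7 and 8 (same day) on opposite sides.
split_index = int(len(toy) * 0.8)

# Date-based split: everything on or after the cutoff date goes to the test set.
cutoff = toy["date"].iloc[split_index]
train = toy[toy["date"] < cutoff]
test = toy[toy["date"] >= cutoff]

print(len(train), len(test))
print(train["date"].max().date(), test["date"].min().date())
```

The trade-off is that the split is no longer exactly 80/20, since every match on the cutoff date moves to the test set.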


## 4. Train, Evaluate, and Save Models

In [9]:
# --- 5. Model Training with Temporal Cross-Validation ---

# Define preprocessing steps for different column types
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Create a preprocessor object using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough'
)

# Define models and their hyperparameter grids
models_to_train = {
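    # With lbfgs, multi-class targets are fit with a multinomial model by default; the
    # explicit multi_class argument is deprecated in recent scikit-learn releases.
    'LogisticRegression': LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42),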
    'RandomForestClassifier': RandomForestClassifier(random_state=42),
    'GradientBoostingClassifier': GradientBoostingClassifier(random_state=42)
}

param_grids = {
    'LogisticRegression': {
        'classifier__C': [0.01, 0.1, 1, 10, 100]
    },
    'RandomForestClassifier': {
        'classifier__n_estimators': [100, 200, 300],
        'classifier__max_depth': [10, 20, 30, None],
        'classifier__min_samples_split': [2, 5, 10]
    },
    'GradientBoostingClassifier': {
        'classifier__n_estimators': [100, 200],
        'classifier__learning_rate': [0.01, 0.1, 0.2],
        'classifier__max_depth': [3, 5, 7]
    }
}

# Use TimeSeriesSplit for cross-validation
n_splits = 5
tscv = TimeSeriesSplit(n_splits=n_splits)

# Store results
trained_models = {}
results_summary = {}

print("--- Starting Model Training with Temporal CV ---")
for model_name, model in models_to_train.items():
    print(f"\n[TRAINING] ==> {model_name}")

    # Create the full pipeline
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])

    # Randomized Search with Temporal CV
    search = RandomizedSearchCV(
        pipeline,
        param_distributions=param_grids[model_name],
        n_iter=10,
        scoring='f1_macro',
        cv=tscv,
        random_state=42,
        n_jobs=-1,
        verbose=1
    )
    
    search.fit(X_train, y_train)

    # Store results
    trained_models[model_name] = search.best_estimator_
    results_summary[model_name] = {
        'best_score_f1_macro': search.best_score_,
        'best_params': search.best_params_
    }
    
    print(f"[RESULT] Best F1-Macro (avg over {n_splits} splits): {search.best_score_:.4f}")
    print(f"[RESULT] Best Parameters: {search.best_params_}")

# --- 6. Identify the Best Overall Model ---
best_model_name = max(results_summary, key=lambda k: results_summary[k]['best_score_f1_macro'])
best_model = trained_models[best_model_name]
best_model_score = results_summary[best_model_name]['best_score_f1_macro']

print("\n--- Cross-Validation Summary ---")
summary_df = pd.DataFrame(results_summary).T
summary_df = summary_df.sort_values('best_score_f1_macro', ascending=False)
print(summary_df)

print(f"\nüèÜ Best performing model from CV: '{best_model_name}' with F1-Macro: {best_model_score:.4f}")
print("\n[OK] Model training and cross-validation complete.")

--- Starting Model Training with Temporal CV ---

[TRAINING] ==> LogisticRegression
Fitting 5 folds for each of 5 candidates, totalling 25 fits
[RESULT] Best F1-Macro (avg over 5 splits): 0.4385
[RESULT] Best Parameters: {'classifier__C': 0.1}

[TRAINING] ==> RandomForestClassifier
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[RESULT] Best F1-Macro (avg over 5 splits): 0.4250
[RESULT] Best Parameters: {'classifier__n_estimators': 200, 'classifier__min_samples_split': 5, 'classifier__max_depth': 20}

[TRAINING] ==> GradientBoostingClassifier
Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [10]:
# --- 7. Evaluate Best Model on Temporal Test Set ---
print(f"--- Evaluating Best Model: {best_model_name} ---")

# Make predictions on the test set
y_pred = best_model.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='macro')
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')

# Store final metrics
final_metrics = {
    'model_name': best_model_name,
    'accuracy': accuracy,
    'f1_macro': f1,
    'precision_macro': precision,
    'recall_macro': recall,
    'best_cv_params': results_summary[best_model_name]['best_params']
}

print(f"Test Set Accuracy: {accuracy:.4f}")
print(f"Test Set F1-Macro: {f1:.4f}")
print(f"Test Set Precision-Macro: {precision:.4f}")
print(f"Test Set Recall-Macro: {recall:.4f}")

print("\n--- Classification Report ---")
print(classification_report(y_test, y_pred, target_names=['Home Win', 'Draw', 'Away Win']))

print("\n[OK] Evaluation on test set complete.")

--- Evaluating Best Model: LogisticRegression ---
Test Set Accuracy: 0.5197
Test Set F1-Macro: 0.4190
Test Set Precision-Macro: 0.4499
Test Set Recall-Macro: 0.4453

--- Classification Report ---
              precision    recall  f1-score   support

    Home Win       0.55      0.80      0.65       623
        Draw       0.29      0.08      0.13       340
    Away Win       0.51      0.45      0.48       405

    accuracy                           0.52      1368
   macro avg       0.45      0.45      0.42      1368
weighted avg       0.47      0.52      0.47      1368


[OK] Evaluation on test set complete.
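
For context, it helps to compare these scores with a trivial baseline that always predicts a home win. The sketch below computes that baseline's accuracy and macro F1 directly from the per-class support counts in the classification report above (623, 340, 405). It is plain arithmetic, not an additional model run.

```python
# Per-class support from the classification report above.
support = {"Home Win": 623, "Draw": 340, "Away Win": 405}
total = sum(support.values())

# Always predicting "Home Win": accuracy equals the home-win share.
baseline_accuracy = support["Home Win"] / total

# F1 for the predicted class (precision = home share, recall = 1);
# the two never-predicted classes score an F1 of 0.
precision_home = support["Home Win"] / total
recall_home = 1.0
f1_home = 2 * precision_home * recall_home / (precision_home + recall_home)
baseline_f1_macro = (f1_home + 0.0 + 0.0) / len(support)

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline F1-macro: {baseline_f1_macro:.4f}")
```

Against this baseline (about 0.455 accuracy and 0.209 macro F1), the reported 0.5197 accuracy and 0.4190 macro F1 are an improvement, although the Draw recall of 0.08 shows that the model rarely predicts draws.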


In [11]:
# --- 8. Save Model and Metrics ---
# Define paths for artifacts
model_path = MODELS_DIR / 'ps5_match_result_model.joblib'
metrics_path = MODELS_DIR / 'ps5_match_result_metrics.json'

# Save the best model pipeline
joblib.dump(best_model, model_path)

# Save the final metrics
with open(metrics_path, 'w') as f:
    json.dump(final_metrics, f, indent=4)

print(f"--- Artifacts Saved ---")
print(f"✅ Best Model saved to: {model_path}")
print(f"✅ Final Metrics saved to: {metrics_path}")

# Display the saved metrics
print("\n--- Saved Metrics ---")
print(json.dumps(final_metrics, indent=4))

--- Artifacts Saved ---
✅ Best Model saved to: models\ps5_match_result_model.joblib
✅ Final Metrics saved to: models\ps5_match_result_metrics.json

--- Saved Metrics ---
{
    "model_name": "LogisticRegression",
    "accuracy": 0.5197368421052632,
    "f1_macro": 0.41903158203900454,
    "precision_macro": 0.449946891389925,
    "recall_macro": 0.445303003987002,
    "best_cv_params": {
        "classifier__C": 0.1
    }
}
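
Anything that loads the saved pipeline will receive integer class labels, so the encoding has to travel with the model. Below is a minimal sketch of decoding, assuming the H -> 0, D -> 1, A -> 2 mapping defined earlier in this notebook; `example_predictions` is a made-up stand-in for the output of `best_model.predict(...)`.

```python
# Encoding used when the target was built.
target_map = {"H": 0, "D": 1, "A": 2}

# Invert it to turn model output back into match results.
inverse_target_map = {code: label for label, code in target_map.items()}

# Made-up stand-in for the array returned by best_model.predict(X_new).
example_predictions = [0, 2, 1, 0]
decoded = [inverse_target_map[int(code)] for code in example_predictions]
print(decoded)  # ['H', 'A', 'D', 'H']
```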


In [None]:
# --- 9. Final Model Performance Comparison ---
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
from pathlib import Path

# Define paths to all metric files
METRICS_PATHS = {
    'PS1: League Winner': Path('models/ps1_league_winner_metadata.json'),
    'PS2: Match Winner': Path('models/ps2_match_winner_metadata_v4_ultimate.json'),
    'PS4: Total Points': Path('../models/ps4_total_points_metadata.json'), # Note the different path
    'PS5: Match Result': Path('models/ps5_match_result_metrics.json')
}

# --- Data Loading ---
# Check which files exist to avoid errors
existing_paths = {name: path for name, path in METRICS_PATHS.items() if path.exists()}
print("Found the following metric files:")
for name, path in existing_paths.items():
    print(f"- {name}: {path}")

# --- Metric Extraction ---
summary_data = []
for name, path in existing_paths.items():
    with open(path, 'r') as f:
        metrics = json.load(f)
    
    if 'PS4' in name:
        # Regression model
        score = metrics.get('r2_score')
        metric_name = 'R-Squared'
    else:
        # Classification models
        # Use the primary metric from the dictionary, could be 'f1_macro' or 'accuracy'
        score = metrics.get('f1_macro', metrics.get('accuracy'))
        metric_name = 'F1-Macro / Accuracy'
        
    summary_data.append({
        'Model': name,
        'Primary Metric': metric_name,
        'Score': score
    })

summary_df = pd.DataFrame(summary_data)

# --- Visualization ---
plt.style.use('seaborn-v0_8-whitegrid')
fig, ax = plt.subplots(figsize=(12, 7))

# Create the bar plot
sns.barplot(x='Score', y='Model', data=summary_df, ax=ax, palette='viridis')

# Add score labels to the bars
for index, row in summary_df.iterrows():
    ax.text(row['Score'] + 0.01, index, f"{row['Score']:.3f}", 
            color='black', ha="left", va='center', fontsize=12)

# Set titles and labels
ax.set_title('Final Retrained Model Performance Comparison', fontsize=18, pad=20)
ax.set_xlabel('Performance Score (F1-Macro or R-Squared)', fontsize=12)
ax.set_ylabel('Problem Statement', fontsize=12)
ax.set_xlim(0, 1.05) # R-squared can be up to 1.0

# --- Save the Figure ---
VIZ_DIR = Path('../visualizations')
VIZ_DIR.mkdir(exist_ok=True)
output_path = VIZ_DIR / 'final_model_performance_comparison.png'
plt.savefig(output_path, bbox_inches='tight')

print(f"\n--- Comparison Chart Generated ---")
print(summary_df)
print(f"\n✅ Chart saved to: {output_path}")

plt.show()
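
The comparison above places F1-macro, accuracy, and R-squared on one axis, so it can be useful to record exactly which metric each score came from. The helper below is a hypothetical sketch (`pick_metric` is not part of the notebook) that returns both the key and the value, or `None` when neither preferred key is present. The dictionaries passed to it are made-up examples.

```python
from typing import Optional, Tuple


def pick_metric(metrics: dict,
                preferred=("f1_macro", "accuracy")) -> Optional[Tuple[str, float]]:
    """Return (key, value) for the first preferred key present, else None."""
    for key in preferred:
        value = metrics.get(key)
        if value is not None:
            return key, float(value)
    return None


print(pick_metric({"f1_macro": 0.419, "accuracy": 0.52}))
print(pick_metric({"accuracy": 0.61}))
print(pick_metric({"r2_score": 0.9}))
```

Storing the returned key alongside the score would let the chart label each bar with the metric it actually shows, and rows that return `None` can be skipped before plotting.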