# Match preditor

## Environment Setup

**Installing XGBoost**

In [1]:
#!pip install xgboost

**Import libaries**

In [2]:
import pandas as pd
import numpy as np
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

In [3]:

try:
    from xgboost import XGBClassifier
    HAS_XGB = True
except ImportError:
    HAS_XGB = False
    print("XGBoost install error")

## 1. Helper methods

These utility functions support preprocessing steps used throughout the project:
- **normalize_squad_column(df)** - Ensures that all datasets use a consistent column name (Squad) for team identifiers, renaming the column if needed and cleaning whitespace.
- **compute_form_score(last5)** - Converts a team’s last-five-matches string (e.g., “WDLWW”) into a numerical form score by mapping wins, draws, and losses to +1, 0, and −1 respectively.
  

In [1]:
def normalize_squad_column(df):
    if "Squad" in df.columns:
        pass
    elif "Team" in df.columns:
        df = df.rename(columns={"Team": "Squad"})
    else:
        raise ValueError(f"No Squad/Team column found: {df.columns.tolist()}")
    df["Squad"] = df["Squad"].astype(str).str.strip()
    return df


def compute_form_score(last5):
    if not isinstance(last5, str):
        return 0.0
    mapping = {"W": 1, "D": 0, "L": -1}
    return sum(mapping.get(ch, 0) for ch in last5.strip())

## 2. Match-level dataset construction per season

The function below builds a complete match-level dataset for a given Premier League season by combining fixtures with corresponding team statistics.
- **Load team statistics** - Reads the merged squad–team CSV, normalizes team names, and calculates a form score from each team’s last five matches.
- **Load fixtures and derive outcomes** - Reads the season’s fixture list, converts goal columns to numeric values, computes goal difference, and assigns the match result as H (home win), A (away win), or D (draw).
- **Merge home and away features** - Joins team-level stats onto each fixture, once for the home team and once for the away team, creating a full set of performance features for both sides.
- **Add season metadata** - Labels each row with the season and assigns a season-specific weight so that more recent seasons contribute more to the model.

This produces a unified, feature-rich match dataset for a single Premier League season.

In [2]:
def build_match_dataset_for_season(season_dir, season_weight):
    season_dir = Path(season_dir)
    season_label = season_dir.name

    # Team stats
    stats_files = list(season_dir.glob("*squad_team_merged*.csv"))
    if not stats_files:
        raise FileNotFoundError(f"No *squad_team_merged*.csv in {season_dir}")
    stats = pd.read_csv(stats_files[0])
    stats = normalize_squad_column(stats)

    # Adding recent performance score from the last 5 games
    if "Last 5" in stats.columns:
        stats["form_score"] = stats["Last 5"].apply(compute_form_score)
    else:
        stats["form_score"] = 0.0

    # Fixtures
    fix_files = list(season_dir.glob("*fixtures*.csv"))
    if not fix_files:
        raise FileNotFoundError(f"No *fixtures*.csv in {season_dir}")
    fixtures = pd.read_csv(fix_files[0])

    fixtures["Home_Goals"] = pd.to_numeric(fixtures["Home_Goals"], errors="coerce")
    fixtures["Away_Goals"] = pd.to_numeric(fixtures["Away_Goals"], errors="coerce")

    fixtures["goal_diff"] = fixtures["Home_Goals"] - fixtures["Away_Goals"]
    fixtures["result"] = np.where(
        fixtures["goal_diff"] > 0, "H",
        np.where(fixtures["goal_diff"] < 0, "A", "D")
    )
    fixtures["season"] = season_label

    # Merge home stats
    home_stats = stats.add_prefix("home_")
    merged = fixtures.merge(
        home_stats, left_on="Home_Team", right_on="home_Squad", how="left"
    )

    # Merge away stats
    away_stats = stats.add_prefix("away_")
    merged = merged.merge(
        away_stats, left_on="Away_team", right_on="away_Squad",
        how="left", suffixes=("", "_dupAway")
    )

    merged = merged.drop(columns=["home_Squad", "away_Squad"], errors="ignore")

    # Season weights -> recent seasons are more important1
    merged["season_weight"] = season_weight

    return merged

## 3. Building the full multi-season dataset

The function below constructs the complete match-level dataset by loading and combining all available Premier League seasons from the project’s data folder.

- **Identify all season folders** -  Scans the base data directory for subfolders that represent individual seasons, each containing the corresponding fixtures and team statistics.
- **Apply season weights** - Newer seasons receive higher weights. This ensures that the model prioritizes more recent team performance trends.
- **Build per-season datasets** - For each season directory, we call build_match_dataset_for_season() to generate a unified feature table of fixtures, results, and team-level stats.
- **Combine into one dataset** - All season-specific DataFrames are concatenated into a single, large dataset used for training and evaluation.

This produces the full historical dataset required for feature engineering and model training.

In [6]:
def build_full_dataset(base_dir = "CSV_files"):
    base = Path(base_dir)
    season_dirs = sorted(base.glob("Season_*"))
    if not season_dirs:
        raise ValueError("CSV_files doesn't contain the correct folder.")

    all_dfs = []
    n = len(season_dirs)

    for i, season_dir in enumerate(season_dirs):
        weight = 0.6 + 0.4 * (i / (n - 1)) if n > 1 else 1.0
        print(f"{season_dir.name} -> weight = {weight:.2f}")
        df_season = build_match_dataset_for_season(season_dir, season_weight=weight)
        all_dfs.append(df_season)

    full = pd.concat(all_dfs, ignore_index=True)
    return full

## 4. Data preparation and feature engineering

This section prepares the full dataset for model training by constructing meaningful numerical features and splitting the data into training and testing sets.
- **Load and clean data** The multi-season dataset is built and rows with missing match results are removed.
- **Create home–away difference features** For each numerical team statistic, we compute **home–away difference features** capturing the relative advantage of the home team for that match.
- **Select model features** - Only the engineered columns are used as predictors, ensuring a consistent and fully numeric feature space.
- **Prepare target labels and weights** - The match result (H, D, A) is used as the prediction target, and season weights are carried over to emphasize recent seasons.
- **Train-test split** - The dataset is split into training and testing subsets (80/20) using stratification to preserve the overall distribution of match outcomes.

This produces a clean and structured feature matrix ready for model development.

In [7]:
full_df = build_full_dataset("../CSV_files")
full_df = full_df.dropna(subset=["result"])

for col in list(full_df.columns):
    if col.startswith("home_"):
        base = col[5:]
        away_col = "away_" + base
        if away_col in full_df.columns:
            if full_df[col].dtype != "O" and full_df[away_col].dtype != "O":
                full_df[f"diff_{base}"] = full_df[col] - full_df[away_col]

feature_cols = [
    c for c in full_df.columns
    if c.startswith("diff_") and full_df[c].dtype != "O"
]

X = full_df[feature_cols]
y = full_df["result"].astype(str)
w = full_df["season_weight"]

mask = X.notna().all(axis=1)
X = X[mask]
y = y[mask]
w = w[mask]

X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X, y, w, test_size=0.2, random_state=42, stratify=y
)

Season_2020-2021 -> weight = 0.60
Season_2021-2022 -> weight = 0.68
Season_2022-2023 -> weight = 0.76
Season_2023-2024 -> weight = 0.84
Season_2024-2025 -> weight = 0.92
Season_2025-2026 -> weight = 1.00


## 5. Training three prediction models: Logistic Regression, Random Forest, and XGBoost

In this section we train three different machine-learning models to predict match outcomes (H/D/A) using the engineered match-level features:

- **Logistic Regression** - A linear classifier with feature scaling and balanced class weights.
Useful as a baseline model to evaluate linear separability in the data.
- **Random Forest** - A non-linear ensemble of decision trees capable of capturing complex interactions between features.
Trained with class balancing and 500 trees for stable predictions.
- **XGBoost** - A gradient-boosted tree model well-suited for multi-class classification and structured tabular data.
Labels are encoded numerically for training, and sample weights are applied to emphasize recent seasons.

Each model is trained using the weighted training set and evaluated on the held-out test data.
The trained models are stored in the MODELS dictionary for later comparison and prediction.

In [8]:
MODELS = {}

# Logistic Regression
logreg_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        max_iter=2000,
        class_weight="balanced",
        random_state=42,
    )
)
logreg_model.fit(X_train, y_train, logisticregression__sample_weight=w_train)
print("LogReg accuracy:", logreg_model.score(X_test, y_test))
MODELS["logreg"] = logreg_model

# RandomForest
rf_model = RandomForestClassifier(
    n_estimators=500,
    max_depth=None,
    class_weight="balanced",
    random_state=42,
)
rf_model.fit(X_train, y_train, sample_weight=w_train)
print("RandomForest accuracy:", rf_model.score(X_test, y_test))
MODELS["rf"] = rf_model

# XGBoost
if HAS_XGB:
    # labels to values
    label_to_int = {"A": 0, "D": 1, "H": 2}
    int_to_label = {v: k for k, v in label_to_int.items()}

    y_train_xgb = y_train.map(label_to_int).values
    y_test_xgb = y_test.map(label_to_int).values

    xgb_model = XGBClassifier(
        n_estimators=500,
        max_depth=6,
        learning_rate=0.05,
        subsample=0.8,
        colsample_bytree=0.8,
        objective="multi:softprob",
        eval_metric="mlogloss",
        random_state=42,
    )
    xgb_model.fit(X_train, y_train_xgb, sample_weight=w_train)

    y_pred_xgb = xgb_model.predict(X_test)
    xgb_acc = (y_pred_xgb == y_test_xgb).mean()
    print("XGBoost accuracy:", xgb_acc)

    MODELS["xgb"] = xgb_model
else:
    xgb_model = None
    label_to_int = None
    int_to_label = None

LogReg accuracy: 0.5098522167487685
RandomForest accuracy: 0.5788177339901478
XGBoost accuracy: 0.5492610837438424


## 6. Generating match predictions

This section includes the utilities required to generate match-outcome probabilities for any future fixture using the trained models.
- **Load latest team statistics** -  Retrieves the most recent team-level data and computes form indicators to ensure predictions rely on up-to-date performance information.
- **Construct feature inputs for a match** - Given a home and away team, prepares the full engineered feature vector (e.g., statistical differences) in the same structure used during model training.
- **Run predictions with a selected model**- Applies any trained model to the constructed feature vector and returns standardised probabilities for a home win, draw, or away win.
- **Ensemble prediction** - Combines the outputs of several models using predefined weights to produce a more stable and robust overall prediction.

These components enable real-time match forecasting and form the bridge between the trained models and the probability outputs used in analysis and visualisations.

In [9]:
def load_latest_team_stats(base_dir = "../CSV_files"):
    base = Path(base_dir)
    season_dirs = sorted(base.glob("Season_*"))
    if not season_dirs:
        raise ValueError(f"No Season_* folders found under {base}")
    latest = season_dirs[-1]
    stats_file = list(latest.glob("*squad_team_merged*.csv"))[0]
    stats = pd.read_csv(stats_file)
    stats = normalize_squad_column(stats)
    if "Last 5" in stats.columns:
        stats["form_score"] = stats["Last 5"].apply(compute_form_score)
    else:
        stats["form_score"] = 0.0
    return stats


LATEST_STATS = load_latest_team_stats()


def build_feature_row_for_prediction(home_team, away_team):
    home = LATEST_STATS[LATEST_STATS["Squad"] == home_team]
    away = LATEST_STATS[LATEST_STATS["Squad"] == away_team]

    if home.empty:
        raise ValueError(f"Home team '{home_team}' not found in latest stats.")
    if away.empty:
        raise ValueError(f"Away team '{away_team}' not found in latest stats.")

    home = home.reset_index(drop=True)
    away = away.reset_index(drop=True)

    data = {}
    for col in feature_cols:
        base = col[5:]
        if base in home.columns and base in away.columns:
            hv = home[base].iloc[0]
            av = away[base].iloc[0]
            if pd.api.types.is_numeric_dtype(type(hv)) and pd.api.types.is_numeric_dtype(type(av)):
                try:
                    data[col] = float(hv) - float(av)
                except Exception:
                    data[col] = np.nan
            else:
                data[col] = np.nan
        else:
            data[col] = np.nan

    row = pd.DataFrame([data], columns=feature_cols)
    return row


# Predicing a single match with a certain model.
def predict_match(home_team, away_team, model_name):

    if model_name not in MODELS:
        raise ValueError(f"Unknown model '{model_name}'. Valid: {list(MODELS.keys())}")

    if home_team == away_team or home_team is None or away_team is None:
        return "Insert correct inputs."

    row = build_feature_row_for_prediction(home_team, away_team)
    model = MODELS[model_name]

    probs = model.predict_proba(row)[0]
    classes = model.classes_

    if model_name == "xgb":
        class_labels = [int_to_label[c] for c in classes]
    else:
        class_labels = list(classes)

    prob_map = {cls: round(float(p), 3) for cls, p in zip(class_labels, probs)}

    return {
        "home_team": home_team,
        "away_team": away_team,
        "model": model_name,
        "probs": {
            "home_win": prob_map.get("H", 0.0),
            "draw": prob_map.get("D", 0.0),
            "away_win": prob_map.get("A", 0.0),
        },
    }

# Averages from current models with weights based on accuracy
def predict_ensemble(home_team, away_team):

    if home_team == away_team or home_team is None or away_team is None:
        return "Insert correct inputs."
    
    # Static weights for models based on accuracy, can be changed according to needs
    weights = {
        "logreg": 0.2,
        "rf": 0.5,
        "xgb": 0.3
    }
    
    models = ["logreg", "rf", "xgb"]
    probs_list = []
    w_list = []
    used_models = []

    for m in models:
        # If XGB is not available
        if m not in MODELS:
            continue

        pred = predict_match(home_team, away_team, m)
        probs_list.append(pred["probs"])
        w_list.append(weights[m])
        used_models.append(m)

    if len(probs_list) == 0:
        raise ValueError("No valid models available for ensemble prediction.")

    # Normalizing (in case weight inputs are entered differently and don't sum up to 1.0)
    w_arr = np.array(w_list, dtype=float)
    w_arr = w_arr / w_arr.sum()

    home_vals = np.array([p["home_win"] for p in probs_list], dtype=float)
    draw_vals = np.array([p["draw"] for p in probs_list], dtype=float)
    away_vals = np.array([p["away_win"] for p in probs_list], dtype=float)

    avg_home = round(float(np.sum(home_vals * w_arr)), 3)
    avg_draw = round(float(np.sum(draw_vals * w_arr)), 3)
    avg_away = round(float(np.sum(away_vals * w_arr)), 3)

    return {
        "home_team": home_team,
        "away_team": away_team,
        "models_used": used_models,
        "probs": {
            "home_win": avg_home,
            "draw": avg_draw,
            "away_win": avg_away
        }
    }

## Selecting Teams for Prediction

This section lets you specify any valid Premier League teams to generate a match prediction.
The model expects exact team names from the dataset, so the full list of available teams for the **2025/26 season** is provided below:

**Valid Team Names (2025/26 season):**
Arsenal, Aston Villa, Bournemouth, Brentford, Brighton, Burnley, Chelsea, Crystal Palace, Everton, Fulham,
Leeds United, Liverpool, Manchester City, Manchester Utd, Newcastle Utd, Nott’ham Forest,
Sunderland, Tottenham, West Ham, Wolves

You can set:
- **home_team** — the home side
- **away_team** — the away side

Once both teams are selected, you may generate predictions using any of the trained models:
- **RANDOM FOREST**
- **Logistic Regression**
- **XGBoost**
- Or use the **ensemble method**, which combines all models into a weighted final prediction

The resulting output displays the predicted probabilities for a home win, draw, and away win, allowing to compare how different models assess the same fixture and evaluate the confidence of each prediction.

In [10]:
home_team = "Manchester Utd"
away_team = "Manchester City"

## RANDOM FOREST

In [11]:
predict_match(home_team, away_team, model_name="rf")

{'home_team': 'Manchester Utd',
 'away_team': 'Manchester City',
 'model': 'rf',
 'probs': {'home_win': 0.314, 'draw': 0.308, 'away_win': 0.378}}

## Logistic Regression

In [12]:
predict_match(home_team, away_team, model_name="logreg")

{'home_team': 'Manchester Utd',
 'away_team': 'Manchester City',
 'model': 'logreg',
 'probs': {'home_win': 0.236, 'draw': 0.43, 'away_win': 0.334}}

## XGBoost

In [13]:
predict_match(home_team, away_team, model_name="xgb")

{'home_team': 'Manchester Utd',
 'away_team': 'Manchester City',
 'model': 'xgb',
 'probs': {'home_win': 0.485, 'draw': 0.391, 'away_win': 0.125}}

## Ensemble method(Weighted averages from predictions)

In [14]:
predict_ensemble(home_team, away_team)

{'home_team': 'Manchester Utd',
 'away_team': 'Manchester City',
 'models_used': ['logreg', 'rf', 'xgb'],
 'probs': {'home_win': 0.35, 'draw': 0.357, 'away_win': 0.293}}