# Feature Engineering – European Soccer Match Outcome Prediction

## Summary of Outputs


This notebook converts the raw European Soccer relational database into a **clean, leak-free, time-aware match-level dataset** suitable for machine learning.

The final engineered dataset is saved as:

- `data/processed/features_matches_main.csv`
- `data/processed/features_matches_main.parquet`

Each row in this dataset corresponds to **one match**, containing:

- metadata  
- engineered features (X)  
- the target label (`match_result`)  
- season/league/team identifiers for temporal & fairness evaluation  

---

## What this notebook accomplishes

### **1. Loads and prepares match information**
We extract:

- match date  
- season, stage  
- home/away team IDs  
- goals scored  
- country/league context  

We then generate the supervised learning target:



`match_result ∈ {home_win, draw, away_win}`


---

### **2. Attaches Team_Attributes using a leak-free as-of merge**

For each team (home & away), we retrieve the **latest tactical attributes recorded on or before the match date**, such as:

- buildUpPlaySpeed  
- buildUpPlayPassing  
- chanceCreationPassing  
- chanceCreationShooting  
- defencePressure  
- defenceAggression  
- defenceTeamWidth  

This ensures **no future information leaks into the features**.
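
As a minimal sketch of how a backward as-of merge behaves, the toy example below uses made-up dates and values for a single team. It assumes nothing beyond `pandas.merge_asof`; the frame names `attrs` and `games` are illustrative only.

```python
import pandas as pd

# Toy data (hypothetical values): one team, two attribute snapshots.
attrs = pd.DataFrame({
    "date": pd.to_datetime(["2010-02-22", "2011-02-22"]),
    "buildUpPlaySpeed": [60, 52],
})
games = pd.DataFrame({
    "match_api_id": [1, 2, 3],
    "date": pd.to_datetime(["2009-08-15", "2010-09-01", "2011-03-05"]),
})

# direction="backward" picks the last snapshot dated on or before each match.
# Both frames must already be sorted by the "on" key.
merged = pd.merge_asof(games, attrs, on="date", direction="backward")
print(merged)
```

Match 1 is played before any snapshot exists, so its attribute is missing rather than filled from the future; matches 2 and 3 receive the 2010 and 2011 snapshots respectively.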

---

### **3. Computes a synthetic team strength score**

Since the dataset does not include a single “overall rating”, we build a simple, transparent composite strength:



`team_strength = mean(buildUpPlaySpeed, chanceCreationPassing, chanceCreationShooting, defencePressure)`


This yields:

- `home_team_strength`  
- `away_team_strength`  
- `team_strength_diff`  

This score is a transparent heuristic rather than a validated rating: its inputs describe tactical style, so its predictive value should be checked in the modeling notebook.
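
The composite can be sketched as a row-wise mean. The helper `team_strength` and the constant `STRENGTH_ATTRS` below are hypothetical names, and the toy values are made up. Using `skipna=False` is an assumption chosen so that a row with any missing attribute yields a missing score, which matches summing the four columns and dividing by 4.

```python
import numpy as np
import pandas as pd

# The four attributes that enter the composite score.
STRENGTH_ATTRS = [
    "buildUpPlaySpeed",
    "chanceCreationPassing",
    "chanceCreationShooting",
    "defencePressure",
]

def team_strength(frame: pd.DataFrame, prefix: str) -> pd.Series:
    """Row-wise mean of the four attributes; NaN if any one is missing."""
    cols = [f"{prefix}{c}" for c in STRENGTH_ATTRS]
    return frame[cols].mean(axis=1, skipna=False)

# Toy rows with made-up values; the second row lacks one attribute.
toy = pd.DataFrame({
    "home_buildUpPlaySpeed": [60.0, 50.0],
    "home_chanceCreationPassing": [50.0, 40.0],
    "home_chanceCreationShooting": [70.0, np.nan],
    "home_defencePressure": [40.0, 30.0],
})
toy["home_team_strength"] = team_strength(toy, "home_")
print(toy["home_team_strength"])
```

Switching to `skipna=True` would instead average whatever attributes are present, which changes the meaning of the score for teams with partial data.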

---

### **4. Creates tactical difference features (home − away)**

To capture matchups, we compute:

- `buildUpPlaySpeed_diff`  
- `buildUpPlayPassing_diff`  
- `chanceCreationPassing_diff`  
- `chanceCreationShooting_diff`  
- `defencePressure_diff`  
- `defenceAggression_diff`  
- `defenceTeamWidth_diff`  

Difference features help the model learn relative advantages rather than absolute values.
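
A compact way to build these columns is a loop over attribute names. This is only a sketch: `add_diff_features` and `TACTICAL_ATTRS` are hypothetical names, and the toy numbers are invented.

```python
import pandas as pd

# Attribute names shared by the home_ and away_ column families.
TACTICAL_ATTRS = [
    "buildUpPlaySpeed",
    "buildUpPlayPassing",
    "chanceCreationPassing",
    "chanceCreationShooting",
    "defencePressure",
    "defenceAggression",
    "defenceTeamWidth",
]

def add_diff_features(frame: pd.DataFrame, attrs=TACTICAL_ATTRS) -> pd.DataFrame:
    """Return a copy with one `<attr>_diff` column (home minus away) per attribute."""
    out = frame.copy()
    for attr in attrs:
        out[f"{attr}_diff"] = out[f"home_{attr}"] - out[f"away_{attr}"]
    return out

# Toy example restricted to a single attribute.
toy = pd.DataFrame({
    "home_defencePressure": [45, 30],
    "away_defencePressure": [35, 50],
})
toy = add_diff_features(toy, attrs=["defencePressure"])
print(toy)
```

A positive value means the home side has the higher attribute; a negative value means the away side does.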

---

### **5. Builds rolling form features (last 5 matches)**

For each team, we construct a chronological history and compute:

- `win_rate_last5`  
- `avg_goals_for_last5`  
- `avg_goals_against_last5`  
- `goal_diff_avg_last5`  
- `points_per_game_last5`  
- `games_played_history` (number of prior matches in the window, capped at 5)  

All windows apply `shift(1)` before rolling, which ensures we **only use information from matches before the current match**, preserving temporal correctness.

We compute both **home** and **away** histories, plus differences such as:

- `win_rate_last5_diff`  
- `avg_goal_diff_last5_diff`  
- `points_per_game_last5_diff`  
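
To illustrate why the shift matters, the sketch below compares a rolling mean with and without `shift(1)` on made-up goal counts for a single team. The variable names are illustrative.

```python
import pandas as pd

# Made-up goals for one team across six consecutive matches, in date order.
goals_for = pd.Series([2, 0, 1, 3, 1, 4])

# Leaky: the window ending at match i includes match i itself.
leaky = goals_for.rolling(5, min_periods=1).mean()

# Leak-free: shift first, so the window ends at match i-1.
safe = goals_for.shift(1).rolling(5, min_periods=1).mean()

print(pd.DataFrame({"goals_for": goals_for, "leaky": leaky, "safe": safe}))
```

The first match has no history, so the leak-free value is missing there. For the last match the leaky mean already contains the 4 goals scored in that match, while the leak-free mean covers only the five earlier matches.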

---

### **6. Produces a clean, final dataset for modeling**

The final dataset contains:

- all engineered features  
- the match label (`match_result`)  
- metadata needed for chronological train/val/test splitting  
- no duplicated columns  
- compatibility with scikit-learn, XGBoost, SHAP, and fairness tools  

The dataset is saved to disk for use by the modeling notebook.
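
As a rough sketch of how the `season_index` metadata could support a chronological split, the helper below holds out the most recent seasons. The function `chronological_split`, its parameters, and the toy rows are all assumptions for illustration; the actual split is defined in the modeling notebook.

```python
import pandas as pd

def chronological_split(frame: pd.DataFrame, val_seasons: int = 1, test_seasons: int = 1):
    """Split by season_index: latest seasons go to test, the ones just before to validation."""
    seasons = sorted(frame["season_index"].dropna().unique())
    test_set = set(seasons[-test_seasons:])
    val_set = set(seasons[-(test_seasons + val_seasons):-test_seasons])
    is_test = frame["season_index"].isin(test_set)
    is_val = frame["season_index"].isin(val_set)
    # Everything that is neither validation nor test is used for training.
    return frame[~is_test & ~is_val], frame[is_val], frame[is_test]

# Toy rows with made-up labels across four seasons.
toy = pd.DataFrame({
    "season_index": [0, 0, 1, 2, 3, 3],
    "match_result": ["home_win", "draw", "away_win", "home_win", "draw", "home_win"],
})
train, val, test = chronological_split(toy)
print(len(train), len(val), len(test))
```

Splitting on whole seasons keeps every training match strictly earlier than every validation and test match.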

---

## **Final Output Summary**

This notebook generates a high-quality dataset that is:

- **leakage-free**  
- **time-aware**  
- **feature-engineered using domain knowledge**  
- **aligned with the Feasibility Report design**  
- **ready for modeling in the next stage**  

The resulting dataset (`features_matches_main`) is the foundation for:

- ML modeling  
- evaluation  
- calibration  
- fairness analysis  
- SHAP explainability  
- model card documentation  


## Prediction Task and Design Principles

This notebook builds the **match-level feature matrix** for our main prediction task:

> Predict **home_win / draw / away_win** *before kickoff* using only **pre-match** information.

Design principles:

- **No target leakage** (only use info available before each match)
- **Time-aware**: features use past matches only
- **Team-centric**: use `Team_Attributes` (as-of date) and rolling form
- **Reproducible**: we save a clean feature table for modeling

Output:

- `features_matches_main` with:
  - one row per match
  - feature matrix `X`
  - target `match_result`
  - metadata for chronological splitting and fairness evaluation


In [None]:
import os
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)

sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

os.makedirs("data/processed", exist_ok=True)
print("Created directory: data/processed")


## 0. Setup and Data Download


In [None]:
from predictiq.data import load_db
conn, q = load_db()

## 1. Load Match Table and Define Target

We load the core `Match` table and define the target label:

- `match_result ∈ {home_win, draw, away_win}`

Then we sort matches by date and add an integer `season_index`.

In [None]:
matches = q("""
SELECT
    id,
    match_api_id,
    country_id,
    league_id,
    season,
    stage,
    date,
    home_team_api_id,
    away_team_api_id,
    home_team_goal,
    away_team_goal
FROM Match
""")

matches["date"] = pd.to_datetime(matches["date"], errors="coerce")

# Vectorized label: compare the goal columns directly instead of a row-wise apply.
matches["match_result"] = np.select(
    [
        matches["home_team_goal"] > matches["away_team_goal"],
        matches["home_team_goal"] < matches["away_team_goal"],
    ],
    ["home_win", "away_win"],
    default="draw",
)

matches = matches.sort_values("date").reset_index(drop=True)
print(matches.shape)
matches.head()



In [None]:
seasons = sorted(matches["season"].dropna().unique())
season_to_idx = {s: i for i, s in enumerate(seasons)}

matches["season_index"] = matches["season"].map(season_to_idx).astype("Int64")

matches[["season", "season_index"]].drop_duplicates().sort_values("season").head()


In [None]:
team_attrs = q("""
SELECT
    id,
    team_api_id,
    date,
    buildUpPlaySpeed,
    buildUpPlayPassing,
    chanceCreationPassing,
    chanceCreationShooting,
    defencePressure,
    defenceAggression,
    defenceTeamWidth
FROM Team_Attributes
""")

team_attrs["date"] = pd.to_datetime(team_attrs["date"], errors="coerce")

team_attrs = team_attrs.sort_values(["team_api_id", "date"]).reset_index(drop=True)

print(team_attrs.shape)
team_attrs.head()


## 2. As-Of Merge: Attach Team_Attributes to Each Match

For each match and team (home/away), we want the **latest Team_Attributes row dated on or before the match date**.

`merge_asof` requires both inputs to be sorted by date, so to avoid sorting issues and leakage we merge one team at a time:

- For each team separately:
  - sort that team's matches by date
  - sort that team's attributes by date
  - apply `merge_asof` on date only
- Repeat for home and away teams
- Merge everything back into the Match table
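
For comparison, `merge_asof` can also match within teams in a single call through its `by` argument. The sketch below is an alternative shown on made-up rows, not the approach this notebook uses; it assumes both frames are sorted globally by date and that `team_api_id` has the same dtype on both sides.

```python
import pandas as pd

# Made-up rows: two teams, attribute snapshots in arbitrary order.
attrs = pd.DataFrame({
    "team_api_id": [1, 2, 1],
    "date": pd.to_datetime(["2010-02-22", "2010-02-22", "2011-02-22"]),
    "defencePressure": [40, 55, 45],
})
games = pd.DataFrame({
    "match_api_id": [10, 11, 12],
    "date": pd.to_datetime(["2011-05-01", "2010-03-01", "2009-09-01"]),
    "team_api_id": [1, 2, 1],
})

# Sort both sides by the "on" key; "by" restricts matches to the same team.
merged = pd.merge_asof(
    games.sort_values("date"),
    attrs.sort_values("date"),
    on="date",
    by="team_api_id",
    direction="backward",
)
print(merged)
```

The per-team loop in this notebook trades speed for explicitness, since each team's rows are sorted and merged in isolation.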


In [None]:
def asof_merge_team_attributes(matches_df, team_attrs_df):
    """
    As-of merge team attributes for home and away teams into the match dataframe,
    done per team in a robust way.

    For each (team, match), we attach the latest Team_Attributes row
    with date <= match date.
    """

    # Feature columns from Team_Attributes (exclude keys and date)
    attr_feature_cols = [
        c for c in team_attrs_df.columns
        if c not in ["id", "team_api_id", "date"]
    ]

    # Clean + sort Team_Attributes
    team_attrs_local = team_attrs_df.dropna(subset=["team_api_id", "date"]).copy()
    team_attrs_local["team_api_id"] = team_attrs_local["team_api_id"].astype(int)
    team_attrs_local = team_attrs_local.sort_values(["team_api_id", "date"])

    def merge_for_side(team_col_name: str, prefix: str) -> pd.DataFrame:
        """
        Perform as-of merge for one side (home or away).

        team_col_name: 'home_team_api_id' or 'away_team_api_id'
        prefix: 'home_' or 'away_'
        """
        side = matches_df[["match_api_id", "date", team_col_name]].copy()
        side = side.rename(columns={team_col_name: "team_api_id"})
        side = side.dropna(subset=["team_api_id", "date"])

        side["team_api_id"] = side["team_api_id"].astype(int)
        side = side.sort_values(["team_api_id", "date"])

        merged_groups = []

        for team_id, g in side.groupby("team_api_id"):
            g = g.sort_values("date").copy()

            ta_team = team_attrs_local[team_attrs_local["team_api_id"] == team_id]

            if ta_team.empty:
                tmp = g.copy()
                for col in attr_feature_cols:
                    tmp[col] = np.nan
            else:
                ta_team_sorted = ta_team.sort_values("date")
                right_for_merge = ta_team_sorted[["date"] + attr_feature_cols]

                tmp = pd.merge_asof(
                    g.sort_values("date"),
                    right_for_merge,
                    on="date",
                    direction="backward"
                )

            merged_groups.append(tmp)

        if merged_groups:
            merged_side = pd.concat(merged_groups, ignore_index=True)
        else:
            merged_side = side.copy()
            for col in attr_feature_cols:
                merged_side[col] = np.nan

        merged_side = merged_side.add_prefix(prefix)
        merged_side = merged_side.rename(columns={f"{prefix}match_api_id": "match_api_id"})

        drop_cols = [f"{prefix}team_api_id", f"{prefix}date"]
        for col in drop_cols:
            if col in merged_side.columns:
                merged_side = merged_side.drop(columns=[col])

        return merged_side

    home_merged = merge_for_side("home_team_api_id", "home_")
    away_merged = merge_for_side("away_team_api_id", "away_")

    merged = matches_df.merge(home_merged, on="match_api_id", how="left")
    merged = merged.merge(away_merged, on="match_api_id", how="left")



    return merged


In [None]:
matches_with_team = asof_merge_team_attributes(matches, team_attrs)

matches_with_team[[
    "match_api_id", "date",
    "home_team_api_id", "away_team_api_id",
    "home_buildUpPlaySpeed", "away_buildUpPlaySpeed"
]].head()


## 3. Tactical Features and Team Strength


In [None]:
df = matches_with_team.copy()

# --- Synthetic team strength (simple average of core attributes) ---
df["home_team_strength"] = (
    df["home_buildUpPlaySpeed"] +
    df["home_chanceCreationPassing"] +
    df["home_chanceCreationShooting"] +
    df["home_defencePressure"]
) / 4

df["away_team_strength"] = (
    df["away_buildUpPlaySpeed"] +
    df["away_chanceCreationPassing"] +
    df["away_chanceCreationShooting"] +
    df["away_defencePressure"]
) / 4

df["team_strength_diff"] = df["home_team_strength"] - df["away_team_strength"]

# --- Tactical difference features (home - away) ---
df["buildUpPlaySpeed_diff"] = df["home_buildUpPlaySpeed"] - df["away_buildUpPlaySpeed"]
df["buildUpPlayPassing_diff"] = df["home_buildUpPlayPassing"] - df["away_buildUpPlayPassing"]
df["chanceCreationPassing_diff"] = df["home_chanceCreationPassing"] - df["away_chanceCreationPassing"]
df["chanceCreationShooting_diff"] = df["home_chanceCreationShooting"] - df["away_chanceCreationShooting"]
df["defencePressure_diff"] = df["home_defencePressure"] - df["away_defencePressure"]
df["defenceAggression_diff"] = df["home_defenceAggression"] - df["away_defenceAggression"]
df["defenceTeamWidth_diff"] = df["home_defenceTeamWidth"] - df["away_defenceTeamWidth"]

df.head()


## 4. Rolling Form Features (Last 5 Matches)

We build a long-format team-match table:

- one row per (team, match)
- with goals_for, goals_against, and result from that team's perspective

Then we compute rolling stats (e.g. last 5 matches) per team:

- win rate
- average goals for/against
- average goal difference
- points per game

Finally, we map them back into the main match dataframe as home/away form features.
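
The per-team rolling step can be sketched with `groupby(...).transform`, which keeps the original row index. This is an illustrative alternative on made-up data; `prior_rolling_mean` and the frame name `long` are hypothetical.

```python
import pandas as pd

# Made-up long-format rows: one row per (team, match), sorted by team and date.
long = pd.DataFrame({
    "team_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2010-01-01", "2010-01-08", "2010-01-15", "2010-01-02", "2010-01-09"]
    ),
    "goals_for": [2, 0, 3, 1, 1],
}).sort_values(["team_id", "date"]).reset_index(drop=True)

def prior_rolling_mean(values: pd.Series, window: int = 5) -> pd.Series:
    """Mean over up to `window` earlier rows, excluding the current row."""
    return values.shift(1).rolling(window, min_periods=1).mean()

# transform runs the function separately for each team, so no window
# ever spans two different teams.
long["avg_goals_for_last5"] = (
    long.groupby("team_id")["goals_for"].transform(prior_rolling_mean)
)
print(long)
```

Each team's first match has no history and therefore a missing value, which a downstream model or imputer has to handle.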


In [None]:
home_part = df[[
    "match_api_id", "date",
    "home_team_api_id", "home_team_goal", "away_team_goal"
]].rename(
    columns={
        "home_team_api_id": "team_id",
        "home_team_goal": "goals_for",
        "away_team_goal": "goals_against",
    }
)
home_part["is_home"] = 1

away_part = df[[
    "match_api_id", "date",
    "away_team_api_id", "away_team_goal", "home_team_goal"
]].rename(
    columns={
        "away_team_api_id": "team_id",
        "away_team_goal": "goals_for",
        "home_team_goal": "goals_against",
    }
)
away_part["is_home"] = 0

team_matches = pd.concat([home_part, away_part], ignore_index=True)
team_matches = team_matches.sort_values(["team_id", "date"]).reset_index(drop=True)

# Result from each team's own perspective, computed column-wise.
team_matches["result"] = np.select(
    [
        team_matches["goals_for"] > team_matches["goals_against"],
        team_matches["goals_for"] < team_matches["goals_against"],
    ],
    ["win", "loss"],
    default="draw",
)
team_matches.head()


In [None]:
k = 5  # rolling window size

team_matches["win_flag"] = (team_matches["result"] == "win").astype(int)
team_matches["draw_flag"] = (team_matches["result"] == "draw").astype(int)
team_matches["loss_flag"] = (team_matches["result"] == "loss").astype(int)

team_matches = team_matches.sort_values(["team_id", "date"]).reset_index(drop=True)
grouped = team_matches.groupby("team_id", group_keys=False)

def add_rolling_features(group):
    group = group.sort_values("date").copy()

    group["goals_for_rolled"] = group["goals_for"].shift(1)
    group["goals_against_rolled"] = group["goals_against"].shift(1)
    group["win_flag_rolled"] = group["win_flag"].shift(1)
    group["draw_flag_rolled"] = group["draw_flag"].shift(1)
    group["loss_flag_rolled"] = group["loss_flag"].shift(1)

    # Count of prior matches inside the window (capped at k), not the team's full history.
    group["games_played_history"] = (
        group["goals_for_rolled"].rolling(k, min_periods=1).count()
    )

    group["avg_goals_for_last5"] = (
        group["goals_for_rolled"].rolling(k, min_periods=1).mean()
    )
    group["avg_goals_against_last5"] = (
        group["goals_against_rolled"].rolling(k, min_periods=1).mean()
    )
    group["win_rate_last5"] = (
        group["win_flag_rolled"].rolling(k, min_periods=1).mean()
    )
    group["points_per_game_last5"] = (
        (group["win_flag_rolled"] * 3 + group["draw_flag_rolled"] * 1)
        .rolling(k, min_periods=1)
        .mean()
    )
    group["goal_diff_avg_last5"] = (
        (group["goals_for_rolled"] - group["goals_against_rolled"])
        .rolling(k, min_periods=1)
        .mean()
    )

    return group

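# Explicit loop instead of groupby.apply: each group keeps its team_id column
# regardless of how the installed pandas version treats grouping columns.
team_matches_fe = pd.concat(
    [add_rolling_features(g) for _, g in grouped]
)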
team_matches_fe.head()


In [None]:
# Home history
home_hist = team_matches_fe[team_matches_fe["is_home"] == 1][[
    "match_api_id", "team_id",
    "games_played_history",
    "avg_goals_for_last5",
    "avg_goals_against_last5",
    "win_rate_last5",
    "points_per_game_last5",
    "goal_diff_avg_last5"
]].rename(columns={
    "team_id": "home_team_api_id",
    "games_played_history": "home_history_games_count",
    "avg_goals_for_last5": "home_avg_goals_for_last5",
    "avg_goals_against_last5": "home_avg_goals_against_last5",
    "win_rate_last5": "home_win_rate_last5",
    "points_per_game_last5": "home_points_per_game_last5",
    "goal_diff_avg_last5": "home_goal_diff_avg_last5"
})

# Away history
away_hist = team_matches_fe[team_matches_fe["is_home"] == 0][[
    "match_api_id", "team_id",
    "games_played_history",
    "avg_goals_for_last5",
    "avg_goals_against_last5",
    "win_rate_last5",
    "points_per_game_last5",
    "goal_diff_avg_last5"
]].rename(columns={
    "team_id": "away_team_api_id",
    "games_played_history": "away_history_games_count",
    "avg_goals_for_last5": "away_avg_goals_for_last5",
    "avg_goals_against_last5": "away_avg_goals_against_last5",
    "win_rate_last5": "away_win_rate_last5",
    "points_per_game_last5": "away_points_per_game_last5",
    "goal_diff_avg_last5": "away_goal_diff_avg_last5"
})

df = df.merge(home_hist, on=["match_api_id", "home_team_api_id"], how="left")
df = df.merge(away_hist, on=["match_api_id", "away_team_api_id"], how="left")

df[[
    "match_api_id",
    "home_win_rate_last5", "away_win_rate_last5",
    "home_avg_goals_for_last5", "away_avg_goals_for_last5"
]].head()


In [None]:
df["win_rate_last5_diff"] = df["home_win_rate_last5"] - df["away_win_rate_last5"]
df["avg_goal_diff_last5_diff"] = (
    df["home_goal_diff_avg_last5"] - df["away_goal_diff_avg_last5"]
)
df["points_per_game_last5_diff"] = (
    df["home_points_per_game_last5"] - df["away_points_per_game_last5"]
)

df[[
    "match_api_id",
    "win_rate_last5_diff",
    "avg_goal_diff_last5_diff",
    "points_per_game_last5_diff"
]].head()


In [None]:
target_col = "match_result"

meta_cols = [
    "match_api_id",
    "id",
    "date",
    "season",
    "season_index",
    "stage",
    "league_id",
    "country_id",
    "home_team_api_id",
    "away_team_api_id",
]
meta_cols = [c for c in meta_cols if c in df.columns]

feature_cols = [
    # Context
    "season_index",
    "stage",
    "league_id",

    # Home team attributes
    "home_buildUpPlaySpeed",
    "home_buildUpPlayPassing",
    "home_chanceCreationPassing",
    "home_chanceCreationShooting",
    "home_defencePressure",
    "home_defenceAggression",
    "home_defenceTeamWidth",

    # Away team attributes
    "away_buildUpPlaySpeed",
    "away_buildUpPlayPassing",
    "away_chanceCreationPassing",
    "away_chanceCreationShooting",
    "away_defencePressure",
    "away_defenceAggression",
    "away_defenceTeamWidth",

    # Synthetic team strength
    "home_team_strength",
    "away_team_strength",
    "team_strength_diff",

    # Tactical diffs
    "buildUpPlaySpeed_diff",
    "buildUpPlayPassing_diff",
    "chanceCreationPassing_diff",
    "chanceCreationShooting_diff",
    "defencePressure_diff",
    "defenceAggression_diff",
    "defenceTeamWidth_diff",

    # Home history
    "home_win_rate_last5",
    "home_avg_goals_for_last5",
    "home_avg_goals_against_last5",
    "home_goal_diff_avg_last5",
    "home_points_per_game_last5",
    "home_history_games_count",

    # Away history
    "away_win_rate_last5",
    "away_avg_goals_for_last5",
    "away_avg_goals_against_last5",
    "away_goal_diff_avg_last5",
    "away_points_per_game_last5",
    "away_history_games_count",

    # Historical diffs
    "win_rate_last5_diff",
    "avg_goal_diff_last5_diff",
    "points_per_game_last5_diff",
]

feature_cols = [c for c in feature_cols if c in df.columns]

len(feature_cols), feature_cols[:10]


In [None]:
# Combine meta, target, and feature columns
all_cols = meta_cols + [target_col] + feature_cols

# Remove duplicate column names while preserving order
final_cols = list(dict.fromkeys(all_cols))

print("Total requested columns:", len(all_cols))
print("Unique columns after de-dup:", len(final_cols))

# Build final table with unique column names only
features_matches_main = df[final_cols].copy()

print("Rows in final feature table:", len(features_matches_main))
features_matches_main.head()


In [None]:
out_path_csv = Path("data/processed/features_matches_main.csv")
out_path_parquet = Path("data/processed/features_matches_main.parquet")

features_matches_main.to_csv(out_path_csv, index=False)
# to_parquet requires a parquet engine such as pyarrow or fastparquet.
features_matches_main.to_parquet(out_path_parquet, index=False)

print("Saved:", out_path_csv)
print("Saved:", out_path_parquet)


In [None]:
features_matches_main["match_result"].value_counts(normalize=True)


## 5. Assemble Final Feature Table and Save


In [None]:
features_matches_main[feature_cols].describe().T.head(15)
