# Building an Expected Goals Model

In this notebook, we will create a model for computing the probability of a shot being a goal. This probability is referred to as *Expected Goals* (xG) and it is a popular metric in football today to understand how good a team is at creating chances.

Based on the videos by [Prof. David Sumpter](https://uppsala.instructure.com/courses/28112/pages/2-statistical-models-of-actions), we will fit a *Logistic Regression* model to estimate xG. This model will have two input variables - distance of a shot from goal and angle of the shot to the width of the goal. To fit the model, we will use [event data](https://github.com/statsbomb/open-data) from La Liga (Spanish league) matches provided by Statsbomb.
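
For reference, a Logistic Regression model with these two inputs has the form below. The names $\text{dist}$ and $\text{angle}$ for the two inputs and $\beta_0$, $\beta_{\text{dist}}$, $\beta_{\text{angle}}$ for the fitted coefficients are notation assumed here for illustration; they are not taken from the videos.

$$
\mathrm{xG} = P(\text{goal} \mid \text{dist}, \text{angle}) = \frac{1}{1 + e^{-(\beta_0 + \beta_{\text{dist}} \cdot \text{dist} + \beta_{\text{angle}} \cdot \text{angle})}}
$$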

## Imports

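To load and inspect the data, we need the `json`, `os`, and `pandas` packages. We use `numpy` for intermediate transformations, `mplsoccer` to convert pitch coordinates, `statsmodels` to fit the Logistic Regression model, `pickle` to save its parameters, and `scikit-learn` to compare our xG values with those of Statsbomb.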

In [None]:
import json
import os
import pickle
from typing import List, Optional

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from mplsoccer import Standardizer
from sklearn.metrics import mean_absolute_error

## Data

We have already downloaded the data and placed it in the `../data/statsbomb/data` directory. Event data is available per match in a JSON file. So, in order to fetch data for all available La Liga matches, we first need the competition ID of La Liga. Let us find it by loading `competitions.json` and inspecting it.

In [None]:
competitions_fn: str = "../data/statsbomb/data/competitions.json"
with open(competitions_fn, "r") as f:
    competitions: pd.DataFrame = pd.read_json(f)
competitions

We see that La Liga has the ID `11`. Let us now load the list of La Liga matches for which event data is available. In `../data/statsbomb/data/matches/11/`, we have one JSON file for every La Liga season. Each JSON file provides basic information about all matches in that season. We are interested in the ID of each match so that we can load its event data from `../data/statsbomb/data/events`. So, let us write a function to get the IDs of all matches for which we want event data.

In [None]:
def get_competition_match_ids(comp_id: int, data_dir: str = "../data/statsbomb/data/") -> list:
    """
    Get IDs of matches from all seasons of a competition e.g. La Liga.
    :param comp_id:  Competition ID from one of those provided in `competitions.json`
    :param data_dir: Path to directory containing `competitions.json` and `matches` directory
    from Statsbomb open data.
    :return: List of match IDs for all seasons.
    """
    comp_dir: str = os.path.join(data_dir, "matches", str(comp_id))
    season_fns: list = os.listdir(comp_dir)
    season_fn: str
    match: dict
    match_ids: list = []
    for season_fn in season_fns:
        with open(os.path.join(comp_dir, season_fn), "r") as jf:
            matches: list = json.load(jf)

        for match in matches:
            match_ids.append(match["match_id"])

    return match_ids

In [None]:
required_match_ids: list = get_competition_match_ids(11)

Let us write a quick test for our function. Let us pick one match ID at random from the following files:
- `1.json`
- `27.json`
- `42.json`

and verify that our list contains these IDs.

In [None]:
assert 9609 in required_match_ids
assert 266166 in required_match_ids
assert 303532 in required_match_ids

Having fetched the match IDs, let us now load the events of each match one at a time. For our first model, let us only consider shots at goal from open play. To do so, we build on what we learned in the [previous exploratory notebook](./00_loading_investigating_world_cup_data.ipynb).

In [None]:
def load_match_events(match_id: int, data_dir: str = "../data/statsbomb/data/events") -> List[dict]:
    """
    Load event data of a match.
    :param match_id: ID of the match which matches the name of the JSON file from which
    to load events
    :param data_dir: Path to the `events` directory of Statsbomb open data.
    :return: A list of dictionaries with each dictionary denoting an event i.e. on-ball action.
    """
    with open(os.path.join(data_dir, f"{match_id}.json"), "r") as jf:
        return json.load(jf)

def frame_events(e: list, match_id: int) -> pd.DataFrame:
    """
    Convert a list of on-ball match events to a Pandas dataframe.
    :param e:        List of dictionaries with each dictionary denoting an event.
    :param match_id: ID of the match whose events are transformed.
    :return: Pandas dataframe of events.
    """
    return (pd.json_normalize(e, sep="_")
            .assign(match_id=match_id))

def filter_events(e: pd.DataFrame, event_type: str, event_type_filter: Optional[dict] = None) -> pd.DataFrame:
    """
    Filter events to include the specified actions. Supported event types are
    one of `["Shot", "Pass"]`. Further filters are supplied as key-value pairs
    with the key of the dictionary interpreted as the column name and the
    dictionary value as the value in the column that will be searched for.
    :param e:                 Dataframe of on-ball match events.
    :param event_type:        Type of events to keep. Supported values are one
    of `["Shot", "Pass"]`.
    :param event_type_filter: A dictionary specifying filters specific to the event
    specified.
    :return: A Pandas dataframe of events of the specified type.
    """
    if not event_type_filter:
        event_type_filter = {}

    required_event: pd.DataFrame = e.loc[e["type_name"] == event_type].set_index("id")
    col: str
    val: str
    for col, val in event_type_filter.items():
        required_event = required_event.loc[required_event[col] == val]

    return required_event

m_id: int
match_wise_shots: list = []
for m_id in required_match_ids:
    match_events: list = load_match_events(m_id)
    events: pd.DataFrame = frame_events(match_events, m_id)
    match_shots: pd.DataFrame = filter_events(events, "Shot", {"shot_type_name": "Open Play"})
    match_wise_shots.append(match_shots)

la_liga_shots: pd.DataFrame = pd.concat(match_wise_shots)
la_liga_shots
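
To see where column names such as `type_name` and `shot_type_name` come from, here is a minimal sketch of how `pd.json_normalize` with `sep="_"` flattens nested event dictionaries. The two toy events are hand-written for illustration and only mimic the nesting of the real event files; their values are assumptions, not records from the Statsbomb data.

```python
import pandas as pd

# Hand-written toy events that mimic the nesting of Statsbomb events (illustrative values only).
toy_events = [
    {"id": "a1", "type": {"id": 16, "name": "Shot"}, "location": [108.0, 36.0],
     "shot": {"type": {"id": 87, "name": "Open Play"},
              "outcome": {"id": 97, "name": "Goal"}}},
    {"id": "b2", "type": {"id": 30, "name": "Pass"}, "location": [60.0, 40.0]},
]

# Nested keys are joined with "_", e.g. event["shot"]["type"]["name"] -> "shot_type_name".
toy_df = pd.json_normalize(toy_events, sep="_")

# Same two-step filter as filter_events: event type first, then a column-value filter.
# Rows without shot information hold NaN in the shot columns and therefore never match.
toy_open_play_shots = toy_df.loc[(toy_df["type_name"] == "Shot")
                                 & (toy_df["shot_type_name"] == "Open Play")]
print(sorted(toy_df.columns))
```

List-valued fields such as `location` are kept as lists in a single column, which is why we split them into separate X and Y columns later.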

## Baseline model

Let us now construct a baseline Logistic Regression model. To do so, we require the following information:
- X and Y coordinates from where the shot was taken
- Whether the shot resulted in a goal

From the X and Y coordinates, we will next create two more columns - one to compute the distance of the shot from goal and the other to compute the angle of the shot to the goal.

Looking at the data above, we see a column named `shot_statsbomb_xg`. We can use this column as a reference for our model results, but we won't consider those values to be the ground truth.

The X and Y coordinates are available as a list in the `location` column. We will need to extract them and put them in separate columns. The column `shot_outcome_id` tells us the outcome of the shot. Based on page 20 of the document `./data/statsbomb/doc/Open Data Events v4.0.0.pdf`, we can see that shots with `shot_outcome_id = 97` are goals while the others are not. So, we need to construct a boolean column from it accordingly. Let us define a function that will perform these steps.

In addition, the [tutorial](https://www.youtube.com/watch?v=wHOgINJ5g54) we follow for building this model uses Wyscout data, and the two data providers measure the pitch, and thus record shot coordinates, in different units. As we will reuse the logic written for Wyscout data to compute shot distance and angle, we use the `Standardizer` class of the `mplsoccer` package to transform our coordinates from Statsbomb to Wyscout units.

In [None]:
def create_shot_modelling_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Transform event data of shots for modelling. This includes:
    - Splitting the `location` column into two - one for representing X coordinate and the other Y.
    - Converting the coordinate values to Wyscout units as further computations assume
    the coordinates to be in Wyscout units.
    - Creating a boolean column representing if the shot resulted in a goal (1) or not (0).
    - Dropping all columns except the coordinates, boolean indicator of goal, and Statsbomb's xG value.
    - Renaming Statsbomb's xG value column.
    :param df: Pandas dataframe of event data for shots.
    :return: A Pandas dataframe with columns - X and Y coordinate of the shot, boolean indicator of goal,
    and Statsbomb's xG.
    """
    statsbomb_to_wyscout = Standardizer(pitch_from="statsbomb", pitch_to="wyscout")
    # Convert all shots in one call, always starting from the original Statsbomb coordinates.
    wyscout_x, wyscout_y = statsbomb_to_wyscout.transform([l[0] for l in df["location"]],
                                                          [l[1] for l in df["location"]])
    return (df.assign(X=np.round(wyscout_x, 2), Y=np.round(wyscout_y, 2))
            .assign(is_goal=lambda x: (x["shot_outcome_id"] == 97).astype(int))
            .filter(["X", "Y", "is_goal", "shot_statsbomb_xg"], axis=1)
            .rename(columns={"shot_statsbomb_xg": "statsbomb_xg"}))

la_liga_shots_model_data: pd.DataFrame = create_shot_modelling_data(la_liga_shots)
la_liga_shots_model_data

Let us now compute shot distance and angle.

In [None]:
def transform_coordinates_for_computation(x: float, y: float) -> (float, float):
    """
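    Re-express Wyscout coordinates (percentages of pitch length and width) relative to the
    centre of the goal being attacked: the distance from the goal line and the absolute
    sideways offset from the centre of the goal, both scaled to meters on a 105 m x 65 m pitch.
    :param x: X coordinate of the shot taken.
    :param y: Y coordinate of the shot taken.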
    :return: Tuple of transformed X, Y coordinates.
    """
    m_x: float = 100 - x
    c: float = abs(y - 50)

    return (m_x * 105) / 100, (c * 65) / 100

def compute_shot_distance(x: float, y: float) -> float:
    """
    Computes distance of a shot from goal.
    :param x: X coordinate of the shot taken.
    :param y: Y coordinate of the shot taken.
    :return: Float denoting distance (in meters) of the shot from goal.
    """
    t_x: float
    t_y: float
    t_x, t_y = transform_coordinates_for_computation(x, y)

    return np.sqrt(t_x ** 2 + t_y ** 2)

def compute_shot_angle(x: float, y: float) -> float:
    """
    Computes the angle subtended by the goal (the opening between the two posts) at the shot location.
    :param x: X coordinate of the shot taken.
    :param y: Y coordinate of the shot taken.
    :return: Float denoting the angle (in radians) of the shot.
    """
    t_x: float
    t_y: float
    t_x, t_y = transform_coordinates_for_computation(x, y)

    angle: float = np.arctan((7.32 * t_x) / (t_x ** 2 + t_y ** 2 - (7.32 / 2) ** 2))
    if angle < 0:
        angle = np.pi + angle

    return angle

la_liga_xg_model_data: pd.DataFrame = (la_liga_shots_model_data
                                       .assign(dist=lambda x: [compute_shot_distance(xc, yc)
                                                               for xc, yc in zip(x["X"], x["Y"])])
                                       .assign(angle=lambda x: [compute_shot_angle(xc, yc)
                                                                for xc, yc in zip(x["X"], x["Y"])]))
la_liga_xg_model_data
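
As a sanity check on the angle formula, here is a minimal sketch that compares it with a more direct construction. With half the goal width $a = 3.66$ m, the angle is the difference between the directions from the shot location to the two posts, and the tangent subtraction identity reduces that difference to $\tan\theta = 2 a t_x / (t_x^2 + t_y^2 - a^2)$, which is the expression inside `np.arctan` above. The helper names below are hypothetical and exist only for this check.

```python
import numpy as np

GOAL_WIDTH_M = 7.32  # goal width in meters, the constant used in compute_shot_angle

def to_goal_frame(x, y):
    """Same arithmetic as transform_coordinates_for_computation."""
    return (100 - x) * 105 / 100, abs(y - 50) * 65 / 100

def angle_closed_form(t_x, t_y):
    """Closed-form expression, as used in compute_shot_angle."""
    angle = np.arctan((GOAL_WIDTH_M * t_x) / (t_x ** 2 + t_y ** 2 - (GOAL_WIDTH_M / 2) ** 2))
    return angle + np.pi if angle < 0 else angle

def angle_between_posts(t_x, t_y):
    """Difference of the directions from the shot location to each post."""
    half_width = GOAL_WIDTH_M / 2
    return np.arctan2(t_y + half_width, t_x) - np.arctan2(t_y - half_width, t_x)

# A central shot at x=90, y=50 in Wyscout units, and a hand-picked close-range shot
# (already in meters) where the goal subtends more than 90 degrees.
central_shot = to_goal_frame(90, 50)
close_shot = (2.0, 1.0)

for t_x, t_y in (central_shot, close_shot):
    print(t_x, t_y, angle_closed_form(t_x, t_y), angle_between_posts(t_x, t_y))
```

The close-range case exercises the `angle < 0` branch: there the denominator is negative, `np.arctan` returns a negative value, and adding $\pi$ recovers the obtuse angle.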

Let us now fit a Logistic Regression model using the `glm()` function of `statsmodels`. As our output is binary, we set the `family` argument to `sm.families.Binomial()`. After fitting the model, let us print its summary.

In [None]:
baseline_model = smf.glm(formula="is_goal ~ dist + angle", data=la_liga_xg_model_data,
                         family=sm.families.Binomial()).fit()
baseline_model.summary()

Looking at the last table of the model summary above, the coefficient of `dist` is negative, so the probability of a shot becoming a goal decreases with increasing distance. Similarly, as the shot angle increases, i.e. as the shooter sees more of the goal, for example from close range in front of it, the probability of hitting the back of the net also increases. The near-zero p-values of both coefficients suggest that we have sufficient evidence to reject the null hypothesis that the true coefficient value is zero.
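
The size of a coefficient can be read on the odds scale. As a sketch, writing $p$ for the probability of a goal and $\beta_0$, $\beta_{\text{dist}}$, $\beta_{\text{angle}}$ for the fitted coefficients (notation assumed here for illustration), the model is linear in the log-odds, so each extra meter of distance multiplies the odds of scoring by a constant factor when the angle is held fixed:

$$
\log\frac{p}{1 - p} = \beta_0 + \beta_{\text{dist}} \cdot \text{dist} + \beta_{\text{angle}} \cdot \text{angle}
\quad\Longrightarrow\quad
\frac{\text{odds}(\text{dist} + 1)}{\text{odds}(\text{dist})} = e^{\beta_{\text{dist}}}
$$

A negative $\beta_{\text{dist}}$ gives a factor below 1, which is the decrease described above.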

Let us now save the model parameters in a `.pkl` file.

In [None]:
def save_model(params: pd.Series, model_fn: str, model_dir: str = "../models"):
    """
    Save parameters of Logistic Regression to a pickle file.
    :param params:    A Pandas Series of Logistic Regression parameters.
    :param model_fn:  Name of pickle file (with extension) to save the parameters to.
    :param model_dir: Directory to save the model to.
    """
    os.makedirs(model_dir, exist_ok=True)  # create the directory if it does not exist yet
    with open(os.path.join(model_dir, model_fn), "wb") as pf:
        pickle.dump(params, pf)

baseline_model_params: pd.Series = baseline_model.params
baseline_model_fn: str = "baseline_logistic_model.pkl"
save_model(baseline_model_params, baseline_model_fn)
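
A quick way to convince ourselves that pickling preserves the parameters is a round trip through a temporary file. This is a minimal sketch: the parameter values below are made up for illustration and are not the fitted coefficients.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical parameter values, used only to exercise the round trip.
example_params = pd.Series({"Intercept": -1.0, "dist": -0.1, "angle": 1.5})

with tempfile.TemporaryDirectory() as tmp_dir:
    params_path = os.path.join(tmp_dir, "example_params.pkl")
    with open(params_path, "wb") as pf:
        pickle.dump(example_params, pf)
    with open(params_path, "rb") as pf:
        reloaded_params = pickle.load(pf)

# The reloaded Series has the same index labels and values as the original.
round_trip_ok = reloaded_params.equals(example_params)
print(round_trip_ok)
```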

## Serving the model

Let us now define a function that loads the parameters from a pickle file and computes the xG value.

In [None]:
def compute_xg(xc: float, yc: float, model_fn: str, model_dir: str = "../models") -> float:
    """
    Compute the xG of a shot given the (X, Y) coordinates of where it was taken from.
    :param xc:        X-coordinate of where the shot was taken from.
    :param yc:        Y-coordinate of where the shot was taken from.
    :param model_fn:  Name of the file containing Logistic Regression parameters.
    :param model_dir: Directory containing the model file.
    :return: A float representing xG value.
    """
    shot_dist: float = compute_shot_distance(xc, yc)
    shot_ang: float = compute_shot_angle(xc, yc)

    with open(os.path.join(model_dir, model_fn), "rb") as pf:
        model_params: pd.Series = pickle.load(pf)

    linear_sum: float = (model_params["Intercept"]
                         + (shot_dist * model_params["dist"])
                         + (shot_ang * model_params["angle"]))

    return 1 / (1 + np.exp(-1 * linear_sum))

la_liga_xg_model_data = (la_liga_xg_model_data
                         .assign(xg=lambda x: [compute_xg(xi, yi, baseline_model_fn) for xi, yi in zip(x["X"], x["Y"])]))
la_liga_xg_model_data
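
`compute_xg` reopens the pickle file for every shot, which is convenient when serving a single prediction but slow for a whole dataframe. A possible alternative, sketched below under the assumption that the `dist` and `angle` columns are already computed, is to load the parameters once and apply the logistic function to entire columns. The function name and the parameter values are hypothetical.

```python
import numpy as np
import pandas as pd

def compute_xg_vectorized(dist: pd.Series, angle: pd.Series, params: pd.Series) -> pd.Series:
    """Apply the logistic function to whole columns of distances and angles at once."""
    linear_sum = params["Intercept"] + params["dist"] * dist + params["angle"] * angle
    return 1 / (1 + np.exp(-linear_sum))

# Made-up parameters and shots for illustration; not the fitted model or real data.
example_params = pd.Series({"Intercept": -1.0, "dist": -0.1, "angle": 1.5})
toy_shots = pd.DataFrame({"dist": [10.5, 25.0], "angle": [0.67, 0.25]})

toy_shots["xg"] = compute_xg_vectorized(toy_shots["dist"], toy_shots["angle"], example_params)
print(toy_shots)
```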

From the dataframe above, we see that our computed xG differs from the Statsbomb one. Measured as the mean absolute error, the two xG values differ by about 0.07 on average, i.e. 7 percentage points.

In [None]:
mean_absolute_error(la_liga_xg_model_data["statsbomb_xg"], la_liga_xg_model_data["xg"])

## Improvements

If we look at the documentation of event information in `./data/statsbomb/doc/Open Data Events v4.0.0.pdf`, we can spot additional parameters that might improve the model. These include:
- Freeze-frame, which tells us about the opposition players in the vicinity when a shot was taken. This can be an important variable, as more players nearby means more pressure on the shooter, which in turn can lead to a poorer shot.
- Open goal which tells us if the shot was taken in front of an open goal.
- Deflected which tells us if the shot was deflected.
- Technique, whose value is one of Backheel, Diving Header, Half Volley, Lob, Normal, Overhead Kick, or Volley.
- Body part which indicates if the shot was taken with the head, left foot, right foot, or other body part. This variable combined with information about which foot a player prefers can be useful in predicting if a shot will turn into a goal.

In his video [The Ultimate Guide to Expected Goals](https://www.youtube.com/watch?v=310_eW0hUqQ), Prof. Sumpter asserts that given the right variables, models more complex than Logistic Regression might not provide much better performance. Thus, improvements to the model can focus on adding more variables first and evaluating if they contribute to the model fit before experimenting with other models.
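
As a starting point for such experiments, here is a minimal sketch of turning some of these variables into model inputs. It assumes that, after `pd.json_normalize(..., sep="_")`, they appear as columns named `shot_body_part_name`, `shot_open_goal`, and `shot_deflected`, and that the boolean flags are missing (NaN) when not set; both assumptions should be checked against the actual dataframe. The toy rows are hand-written.

```python
import numpy as np
import pandas as pd

# Hand-written toy rows; the column names are assumed, not verified against the data.
toy_shots = pd.DataFrame({
    "shot_body_part_name": ["Right Foot", "Head", "Left Foot"],
    "shot_open_goal": [np.nan, True, np.nan],
    "shot_deflected": [True, np.nan, np.nan],
})

extra_features = pd.concat(
    [
        # Missing flags compare unequal to True, so they become 0.
        toy_shots["shot_open_goal"].eq(True).astype(int).rename("open_goal"),
        toy_shots["shot_deflected"].eq(True).astype(int).rename("deflected"),
        # One indicator column per body part.
        pd.get_dummies(toy_shots["shot_body_part_name"], prefix="body_part", dtype=int),
    ],
    axis=1,
)
print(extra_features)
```

When fitting, one of the body part indicator columns would need to be dropped, or the variable passed as a categorical term in the formula, to avoid perfect collinearity with the intercept.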