<h1><center> Football sequence data with continuous and categorical features </center></h1>

In this tutorial, we take advantage of the [socceraction](https://github.com/ML-KULeuven/socceraction) library and the [statsbomb open data](https://github.com/statsbomb/open-data). The code is heavily inspired by this [tutorial](https://github.com/ML-KULeuven/socceraction/blob/master/public-notebooks/1-load-and-convert-statsbomb-data.ipynb).



In [None]:
import os
import pickle
import warnings
from random import sample
from typing import List, Tuple

import numpy as np
import pandas as pd
import progressbar
import socceraction.spadl as spadl
import tqdm
from fastcore.foundation import *
from socceraction.data.statsbomb import StatsBombLoader

pd.set_option("display.max_columns", None)
warnings.simplefilter(action="ignore", category=pd.errors.PerformanceWarning)

Let's now set the base folder (for example a mounted Google Drive folder) where we can save some files when needed:

In [None]:
base_path = "."
data_path = os.path.join(base_path, "data")

# Get the raw data

In [None]:
free_open_data_remote = (
    "https://raw.githubusercontent.com/statsbomb/open-data/master/data/"
)
SBL = StatsBombLoader(root=free_open_data_remote, getter="remote")

# check available competitions
competitions = SBL.competitions()
set(competitions.competition_name)

{'Champions League',
 "FA Women's Super League",
 'FIFA World Cup',
 'La Liga',
 'NWSL',
 'Premier League',
 'UEFA Euro',
 "Women's World Cup"}

It is enough to focus on a few competitions here. It will provide us with enough games to create our sequences:

In [None]:
selected_comp_names = ["Champions League", "La Liga", "Premier League"]
selected_competitions = competitions[
    competitions.competition_name.isin(selected_comp_names)
].sort_values(["season_name", "competition_name"])
selected_competitions.head()

Unnamed: 0,season_id,competition_id,competition_name,country_name,competition_gender,season_name
14,76,16,Champions League,Europe,male,1999/2000
13,44,16,Champions League,Europe,male,2003/2004
37,44,2,Premier League,England,male,2003/2004
12,37,16,Champions League,Europe,male,2004/2005
35,37,11,La Liga,Spain,male,2004/2005


Now we are ready to extract the list of games available in each season/competition combination:

In [None]:
games = pd.concat(
    [
        SBL.games(row.competition_id, row.season_id)
        for row in selected_competitions.itertuples()
    ]
)
games[["home_team_id", "away_team_id", "game_date", "home_score", "away_score"]]

Unnamed: 0,home_team_id,away_team_id,game_date,home_score,away_score
0,129,256,2004-05-26 12:00:00,0,3
0,46,1,2004-02-07 16:00:00,1,3
1,1,46,2003-12-26 13:00:00,3,0
2,1,39,2004-03-28 17:05:00,1,1
3,1,22,2004-05-15 16:00:00,2,1
...,...,...,...,...,...
30,217,422,2020-11-29 14:00:00,4,0
31,217,552,2021-02-21 14:00:00,1,1
32,217,1042,2021-02-24 19:00:00,3,0
33,222,217,2021-04-25 16:15:00,1,2


We can see that we have 500+ games to work with. That's more than enough given that we can extract many sequences per game. 

Let's now use the `socceraction` functions to load the event data and convert it to the *SPADL* format. More information about this standard format can be found in this [paper](https://dl.acm.org/doi/10.1145/3292500.3330758):

In [None]:
# games_verbose = tqdm.tqdm(list(games.itertuples()), desc="Loading game data")
# teams, players, events, actions = [], [], [], []
# for game in games_verbose:
#    # load data
#    teams.append(SBL.teams(game.game_id))
#    players.append(SBL.players(game.game_id))
#    game_events = SBL.events(game.game_id)

#    # convert data
#    spadl_actions = spadl.statsbomb.convert_to_actions(game_events,
#                                                       game.home_team_id)
#    spadl_actions["game_id"] = game.game_id
#    events.append(game_events)
#    actions.append(spadl_actions)

# teams = pd.concat(teams).drop_duplicates(subset="team_id")
# players = pd.concat(players).drop_duplicates(subset="player_id")
# events = pd.concat(events)
# actions = pd.concat(actions)

# teams.to_pickle(os.path.join(data_path, "teams.pkl"))
# players.to_pickle(os.path.join(data_path,"players.pkl"))
# events.to_pickle(os.path.join(data_path,"events.pkl"))
# actions.to_pickle(os.path.join(data_path,"spadl_actions.pkl"))

The previous step may take some time, so it makes sense to save the objects once and load them back when needed:

In [None]:
!ls {data_path}

'Copy of sequences.pkl'


In [None]:
teams = pd.read_pickle(os.path.join(data_path, "teams.pkl"))
players = pd.read_pickle(os.path.join(data_path, "players.pkl"))
events = pd.read_pickle(os.path.join(data_path, "events.pkl"))
actions = pd.read_pickle(os.path.join(data_path, "spadl_actions.pkl"))
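
The save-then-reload pattern used above can be wrapped in a small helper. The sketch below is only an illustration: `load_or_compute`, `expensive_step` and the file name are hypothetical names that do not appear in the notebook, and the "expensive" step is a stand-in for the StatsBomb download and conversion.

```python
import os
import pickle
import tempfile
from typing import Any, Callable


def load_or_compute(path: str, compute: Callable[[], Any]) -> Any:
    """Return the pickled object at `path`, computing and saving it if absent."""
    if os.path.isfile(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


# Toy usage: the "expensive" step runs once, the second call reads the file.
calls = []


def expensive_step():
    calls.append(1)
    return {"teams": ["A", "B"]}


cache_file = os.path.join(tempfile.mkdtemp(), "teams_demo.pkl")
first = load_or_compute(cache_file, expensive_step)
second = load_or_compute(cache_file, expensive_step)
```

With such a helper, the slow conversion cell would not need to be commented out after the first run.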

# Prepare sequence data

## Enrich the `SPADL` actions

The standardized `SPADL` `DataFrame` can be enriched with extra information available in the events data, such as the possession number or the pattern of play.

We can also add some metadata to make the data easier to read, visualise and understand.



In [None]:
## collect extra information from events
extra_info = (
    events[
        [
            "event_id",
            "possession",
            "possession_team_id",
            "play_pattern_id",
            "play_pattern_name",
            "under_pressure",
            "counterpress",
        ]
    ]
    .rename(columns={"event_id": "original_event_id"})
    .astype(
        {
            "possession": "int64",
            "possession_team_id": "int64",
            "play_pattern_id": "int64",
        }
    )
)

## first, we add some game metadata
full_actions = (
    actions.merge(
        games.drop(["game_day", "venue", "referee_id"], axis="columns"),
        how="left",
        on="game_id",
    )
    .merge(spadl.actiontypes_df(), how="left", on="type_id")
    .merge(spadl.results_df(), how="left", on="result_id")
    .merge(spadl.bodyparts_df(), how="left", on="bodypart_id")
    .merge(
        players[["player_id", "player_name", "nickname"]].drop_duplicates(
            subset="player_id"
        ),
        how="left",
        on="player_id",
    )
    .merge(teams, how="left", on="team_id")
    .merge(extra_info, how="left", on="original_event_id")
)
full_actions["is_home"] = full_actions["team_id"] == full_actions["home_team_id"]

## check that the merges did not add or drop rows
assert actions.shape[0] == full_actions.shape[0]
full_actions.to_pickle(os.path.join(data_path, "spadl_full_actions.pkl"))

full_actions

Unnamed: 0,game_id,original_event_id,period_id,time_seconds,team_id,player_id,start_x,start_y,end_x,end_y,type_id,result_id,bodypart_id,action_id,season_id,competition_id,competition_stage,game_date,home_team_id,away_team_id,home_score,away_score,type_name,result_name,bodypart_name,player_name,nickname,team_name,possession,possession_team_id,play_pattern_id,play_pattern_name,under_pressure,counterpress,is_home
0,3752619,fa5fba91-0000-4373-8df2-e96d3a64f54e,1,0.0,129,25878,52.058824,34.430380,51.264706,36.668354,0,1,0,0,44,16,Final,2004-05-26 12:00:00,129,256,0,3,pass,success,foot,Ludovic Giuly,,AS Monaco,2.0,129.0,9.0,From Kick Off,False,False,True
1,3752619,d6b4ca02-934c-48c3-ae05-ef7198b9fdf3,1,1.0,129,20130,51.264706,36.668354,51.088235,36.410127,21,1,0,1,44,16,Final,2004-05-26 12:00:00,129,256,0,3,dribble,success,foot,Fernando Morientes Sánchez,Fernando Morientes,AS Monaco,2.0,129.0,9.0,From Kick Off,False,False,True
2,3752619,c850b82b-63a3-4b21-a0d9-710d051e8108,1,1.0,129,20130,51.088235,36.410127,41.029412,11.792405,0,1,0,2,44,16,Final,2004-05-26 12:00:00,129,256,0,3,pass,success,foot,Fernando Morientes Sánchez,Fernando Morientes,AS Monaco,2.0,129.0,9.0,From Kick Off,False,False,True
3,3752619,a7161093-9063-459c-a7e6-18d51a72bb52,1,3.0,129,26140,41.029412,11.792405,59.647059,7.058228,21,1,0,3,44,16,Final,2004-05-26 12:00:00,129,256,0,3,dribble,success,foot,Hugo Benjamín Ibarra,Hugo Ibarra,AS Monaco,2.0,129.0,9.0,From Kick Off,True,False,True
4,3752619,00b1730d-ada4-4021-b53e-cb81d030bfc9,1,6.0,129,26140,59.647059,7.058228,63.970588,9.382278,0,0,0,4,44,16,Final,2004-05-26 12:00:00,129,256,0,3,pass,fail,foot,Hugo Benjamín Ibarra,Hugo Ibarra,AS Monaco,2.0,129.0,9.0,From Kick Off,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1250802,3773477,4a798266-43d7-4c9a-91d5-276cc5228b69,2,2972.0,218,6851,68.823529,64.556962,37.676471,57.068354,0,1,0,2235,90,11,Regular Season,2020-11-07 16:15:00,217,218,5,2,pass,success,foot,Aitor Ruibal García,Aitor Ruibal,Real Betis,180.0,218.0,8.0,From Keeper,False,False,False
1250803,3773477,edd38889-aaed-45b8-8ad4-16718a477d66,2,2974.0,217,6826,42.000000,58.101266,42.000000,58.101266,8,1,0,2236,90,11,Regular Season,2020-11-07 16:15:00,217,218,5,2,foul,success,foot,Clément Lenglet,,Barcelona,180.0,218.0,8.0,From Keeper,False,False,True
1250804,3773477,6a5e2ead-5a23-4cc7-8f17-e2cc14e98a53,2,3002.0,218,41083,48.970588,62.577215,6.529412,34.688608,3,0,0,2237,90,11,Regular Season,2020-11-07 16:15:00,217,218,5,2,freekick_crossed,fail,foot,Rodrigo Sánchez Rodriguez,Rodri,Real Betis,181.0,218.0,3.0,From Free Kick,False,False,False
1250805,3773477,09fcb776-16cd-41fc-82f5-387115a4f97e,2,3006.0,217,20055,5.735294,35.463291,6.970588,31.503797,21,1,0,2238,90,11,Regular Season,2020-11-07 16:15:00,217,218,5,2,dribble,success,foot,Marc-André ter Stegen,Marc-André ter Stegen,Barcelona,181.0,218.0,3.0,From Free Kick,False,False,True
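
The row-count assertion in the cell above guards against merges that silently duplicate rows. As a complementary sketch on made-up toy tables (not the notebook's data), pandas can perform a similar check at merge time through the `validate` argument of `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-ins for the actions table and a lookup table (hypothetical values).
toy_actions = pd.DataFrame({"action_id": [0, 1, 2], "type_id": [0, 21, 0]})
toy_types = pd.DataFrame({"type_id": [0, 21], "type_name": ["pass", "dribble"]})

# validate="many_to_one" raises pandas.errors.MergeError if `type_id`
# is duplicated in the right-hand table, which would multiply rows.
toy_full = toy_actions.merge(
    toy_types, how="left", on="type_id", validate="many_to_one"
)
assert len(toy_full) == len(toy_actions)

# A duplicated key on the right is caught before it can inflate the result.
dup_types = pd.concat([toy_types, toy_types.iloc[[0]]])
try:
    toy_actions.merge(dup_types, how="left", on="type_id", validate="many_to_one")
    caught = False
except pd.errors.MergeError:
    caught = True
```

The advantage over a single final assertion is that the failing merge is identified directly.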


## Create the playing sequences

Now that we have the standardized playing events, we can start preparing the (playing sequence, target) tuples.

The question we want to answer here is fairly simple: *What is the probability that a sequence of play results in a goal **by the end of the possession**?*

It is important to note a few things at this stage:
* The target is only known at the end of the possession, so we look ahead to assign it (`goal` if the possession ends in a goal, `no_goal` otherwise).
* Elements in the sequence are not equally spaced in time. It could be useful to add a feature for the time elapsed since the start of the possession (i.e. a new column in the sequence).
* It is important to derive several training examples from one possession and not just pass the entire possession as one example. Passing only full possessions may lead the model to associate goals with a final 'shot' action and to miss important actions that happen beforehand.
* Similar to sentences in NLP, sequences of play have different lengths, so we may have to use `padding` and even take advantage of the [SortedDL](https://docs.fast.ai/text.data.html#SortedDL) to form batches with similar lengths.
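
To make the padding point concrete, here is a minimal NumPy sketch. It is an assumption about how padding could be done, not the approach used later with fastai; `pad_sequences` is a hypothetical helper and the feature values are made up.

```python
import numpy as np


def pad_sequences(seqs, pad_value=0.0):
    """Right-pad 2-D arrays of shape (length, n_features) to a common length.

    Returns the padded batch and a boolean mask marking the real steps.
    """
    max_len = max(len(s) for s in seqs)
    n_feat = seqs[0].shape[1]
    batch = np.full((len(seqs), max_len, n_feat), pad_value, dtype=float)
    mask = np.zeros((len(seqs), max_len), dtype=bool)
    for i, s in enumerate(seqs):
        batch[i, : len(s)] = s
        mask[i, : len(s)] = True
    return batch, mask


# Two toy sequences with 2 features each (values are made up).
short = np.array([[1.0, 2.0], [3.0, 4.0]])
long = np.array([[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]])

# Sorting by length first keeps sequences of similar length together,
# which is the idea behind length-sorted batching.
ordered = sorted([long, short], key=len)
batch, mask = pad_sequences(ordered)
```

The mask lets a model ignore the padded steps; grouping sequences of similar length reduces the amount of padding per batch.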

The strategy adopted to form training examples is described below:
* We iterate over all games and possessions
* For a given possession, we extract the target. As explained above, this is just a look ahead step to see if the final action of the possession resulted in a goal.
* We sample random points in the sequence. A training example is formed by taking all the playing events up to the selected point. We should always include the full possession as an example.
* Once the required number of examples is reached (for a given game/possession), we move to the next possession.
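
The sampling step can be sketched on its own as below. This is a variant written for illustration, not the notebook's `_form_examples`, which samples cut points uniformly and so may or may not pick the last one; here the final time is always kept so that the full possession is one of the examples. `sample_cut_points`, `min_prefix` and `seed` are hypothetical names, and `max_examples` is assumed to be at least 1.

```python
import random


def sample_cut_points(time_events, max_examples=5, min_prefix=2, seed=None):
    """Pick cut times for one possession, always keeping the last one.

    `time_events` is the sorted list of distinct event times. The first
    `min_prefix` times are skipped so every example has some context.
    """
    rng = random.Random(seed)
    candidates = time_events[min_prefix:]
    if not candidates:
        return []
    last, rest = candidates[-1], candidates[:-1]
    n_extra = min(len(rest), max_examples - 1)
    return sorted(rng.sample(rest, n_extra)) + [last]


# Made-up event times (seconds since the start of the half).
times = [508.0, 512.0, 513.0, 515.0, 519.0, 520.0, 524.0, 530.0, 531.0]
cuts = sample_cut_points(times, max_examples=5, seed=0)
```

Each cut time `ti` then defines one example made of all events with `time_seconds <= ti`.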

This problem is expected to be (very) unbalanced as goals are rare events. We may have to deal with this through some sampling strategy when we fit our learner.
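
Besides resampling, one common alternative is to weight the classes in the loss. The sketch below computes inverse-frequency weights (the same heuristic as scikit-learn's `class_weight="balanced"`); it is only an illustration with a made-up label distribution, and `inverse_frequency_weights` is a hypothetical helper.

```python
from collections import Counter


def inverse_frequency_weights(targets):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(targets)
    n_samples, n_classes = len(targets), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


# Made-up label distribution, only to illustrate the imbalance.
toy_targets = ["no_goal"] * 98 + ["goal"] * 2
weights = inverse_frequency_weights(toy_targets)
```

In this toy case the rare `goal` class receives a much larger weight than `no_goal`, so errors on goals count more during training.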

In [None]:
## start by collecting all the combination of game_id/possession
game_poss = full_actions[["game_id", "possession"]].dropna().drop_duplicates()
game_poss.game_id = pd.to_numeric(game_poss.game_id, downcast="integer")

In [None]:
PlayingData = Tuple[pd.DataFrame, str]

## Columns to keep in sequence
KEEP_COLS = [
    "_id",
    "period_id",
    "time_seconds",
    "start_x",
    "start_y",
    "end_x",
    "end_y",
    "type_name",
    "bodypart_name",
    "player_id",
    "player_name",
    "team_id",
    "team_name",
    "is_home",
    "play_pattern_name",
    "under_pressure",
    "counterpress",
    "seconds_since_poss",
    "is_poss_team",
    "result_name",
]


def _compute_target(poss_df: pd.DataFrame) -> str:
    """Compute the sequence target"""
    tgt_df = poss_df[
        (poss_df.result_name == "owngoal")
        | (
            (poss_df["type_name"].str.contains("shot"))
            & (poss_df["result_name"] == "success")
        )
    ]
    return "goal" if tgt_df.shape[0] > 0 else "no_goal"


def _get_playing_seq(ti: float, poss_df: pd.DataFrame, target: str) -> PlayingData:
    """Format the playing sequence"""
    seq_def = poss_df[poss_df.time_seconds <= ti].copy()

    ## add an id for the sequence
    seq_def["_id"] = (
        str(seq_def["game_id"].values[0])
        + "_"
        + str(seq_def["original_event_id"].values[0])
        + "_"
        + str(seq_def["original_event_id"].values[-1])
    )

    return (seq_def[KEEP_COLS], target)


def _form_examples(
    game_id: int, poss_id: int, max_examples: int = 5
) -> List[PlayingData]:
    """
    Create Training example for given possession in a game

    Parameters
    ----------
    game_id: int
      game identifier
    poss_id: int
      possession number
    max_examples: int
      maximum total number of examples


    Returns
    -------
    List[PlayingData]
      Every tuple is a `pandas DataFrame` (giving the sequence info)
      and a `str` (target `goal`/`no_goal`)

    """

    ## extract the fraction of data we care about
    _poss_df = full_actions[
        (full_actions.game_id == game_id) & (full_actions.possession == poss_id)
    ].sort_values(["time_seconds"])
    st_time, en_time, period_id, poss_team_id = (
        _poss_df["time_seconds"].min(),
        _poss_df["time_seconds"].max(),
        _poss_df.period_id.unique().tolist()[0],
        _poss_df.possession_team_id.unique().tolist()[0],
    )
    poss_df = full_actions[
        (full_actions.game_id == game_id)
        & (full_actions.period_id == period_id)
        & (full_actions.time_seconds >= st_time)
        & (full_actions.time_seconds <= en_time)
    ].copy()

    poss_df["seconds_since_poss"] = poss_df["time_seconds"] - st_time
    poss_df["is_poss_team"] = poss_df["team_id"] == poss_team_id

    ## compute target
    target = _compute_target(poss_df)

    ## sample the appropriate number of events
    time_events = poss_df["time_seconds"].unique().tolist()

    if len(time_events) < (max_examples + 2):
        time_events = time_events[2:]
    else:
        time_events = sample(time_events[2:], max_examples)

    ## create the list of tuples
    return L(_get_playing_seq(ti, poss_df, target) for ti in time_events)

In [None]:
poss_ex = _form_examples(game_id=303696, poss_id=19)
poss_ex[0]

(                                                       _id  period_id  \
 1132366  303696_0a675ea0-1fd4-4dd5-aa2d-e740554092db_fd...          1   
 1132367  303696_0a675ea0-1fd4-4dd5-aa2d-e740554092db_fd...          1   
 1132368  303696_0a675ea0-1fd4-4dd5-aa2d-e740554092db_fd...          1   
 1132369  303696_0a675ea0-1fd4-4dd5-aa2d-e740554092db_fd...          1   
 1132370  303696_0a675ea0-1fd4-4dd5-aa2d-e740554092db_fd...          1   
 
          time_seconds    start_x    start_y      end_x      end_y type_name  \
 1132366         508.0  13.941176  18.678481  13.941176  18.678481    tackle   
 1132367         508.0  13.941176  18.678481  31.852941   5.594937   dribble   
 1132368         512.0  31.852941   5.594937  36.882353  16.268354      pass   
 1132369         513.0  36.882353  16.268354  35.117647  16.096203   dribble   
 1132370         513.0  35.117647  16.096203  28.676471  17.387342      pass   
 
         bodypart_name  player_id           player_name  team_id  \
 113

We can see that every sequence is made up of three types of features (actually two, as binary features can also be treated as categorical).

+ **continuous** features:
  + `seconds_since_poss`: *float* time in seconds since the start of the possession
  + `time_seconds`: *float* time in seconds since the start of the half
  + `start_x`, `start_y`, `end_x`, `end_y`: *float* start and end x,y location of an event.
+ **binary** features:
  + `period_id`: *int* either the first (1) or second (2) period of a game.
  + `is_home`: *bool* is the team performing the action playing at home?
  + `under_pressure`: *bool* is the action performed under pressure?
  + `counterpress`: *bool* is the action part of a counter-press?
  + `is_poss_team`: *bool* is the player performing the action part of the team in possession?
+ **categorical** (non binary) features:
  + `type_name`: *str* type of action performed
  + `bodypart_name`: *str* body part used to perform the action
  + `player_id`/`player_name`: *int*/*str*: player id and name
  + `team_id`/`team_name`: *int*/*str*: player's team id and name
  + `play_pattern_name`: *str* pattern of play


Now that we have the function to create the training tuples (although we may need to pre-process them before training), we can loop over all the game/possession pairs and save the results. Note that every game/possession pair produces a number of examples.

The data is stored in 2 separate dictionaries:

*   `sequence_data`: a `DataFrame` describing the playing sequence
*   `target_data`: a `str` indicating whether the sequence ended with a `goal` or `no_goal`

Both dictionaries use the sequence `_id` as key, which is unique for every example.

In order to reduce computation time, we only compute sequences for games that have not been processed before.
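
The resume logic relies on the key format `<game_id>_<first_event_id>_<last_event_id>`. Here is a small stand-alone sketch with shortened, made-up event ids (real ones are UUIDs, which contain hyphens but no underscores):

```python
# Keys follow the `<game_id>_<first_event_id>_<last_event_id>` pattern
# (event ids below are shortened, made-up placeholders).
toy_sequence_keys = [
    "303696_0a675ea0_fd01",
    "303696_0a675ea0_fd02",
    "3752619_fa5fba91_d6b4",
]

# The game id is everything before the first underscore.
computed_game_ids = {int(key.split("_", 1)[0]) for key in toy_sequence_keys}

# Only (game, possession) pairs from games not seen yet remain to be processed.
toy_game_poss = [(303696, 19), (3752619, 2), (3773477, 180)]
remaining = [(g, p) for g, p in toy_game_poss if g not in computed_game_ids]
```

Note that the filter works at game level: a game that was only partially processed when the last checkpoint was written would be skipped entirely on resume.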

In [None]:
seq_path = os.path.join(data_path, "sequences.pkl")
tgt_path = os.path.join(data_path, "targets.pkl")

## reload previous results only if both files are present
if os.path.isfile(seq_path) and os.path.isfile(tgt_path):
    with open(seq_path, "rb") as f:
        sequence_data = pickle.load(f)
    with open(tgt_path, "rb") as f:
        target_data = pickle.load(f)
else:
    sequence_data = {}
    target_data = {}

## get computed game_ids
computed_gm_ids = L(int(k.split("_")[0]) for k in sequence_data.keys()).unique()
game_poss = game_poss[~game_poss.game_id.isin(computed_gm_ids)]
print("Working on: " + str(game_poss.game_id.unique().shape[0]) + " games ...")

Working on: 282 games ...


In [None]:
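def _save_progress() -> None:
    ## persist both dictionaries so an interrupted run can resume
    with open(os.path.join(data_path, "sequences.pkl"), "wb") as f:
        pickle.dump(sequence_data, f)
    with open(os.path.join(data_path, "targets.pkl"), "wb") as f:
        pickle.dump(target_data, f)


for i, game in enumerate(progressbar.progressbar(list(game_poss.itertuples()))):
    poss_ex = _form_examples(game_id=game.game_id, poss_id=game.possession)
    for seq in poss_ex:
        _id = seq[0]["_id"].values[0]
        sequence_data[_id] = seq[0]
        target_data[_id] = seq[1]

    ## checkpoint every 100 possessions
    if i > 0 and i % 100 == 0:
        _save_progress()

## final save so the last (partial) batch is not lost
_save_progress()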

100% (51227 of 51227) |##################| Elapsed Time: 6:05:41 Time:  6:05:41
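
Since the loop runs for hours, an interruption in the middle of a checkpoint write could leave a truncated pickle on disk. One possible safeguard, sketched here as an assumption rather than something the notebook does, is to write to a temporary file and rename it over the target. `atomic_pickle_dump` is a hypothetical helper and the demo key is made up.

```python
import os
import pickle
import tempfile


def atomic_pickle_dump(obj, path):
    """Write `obj` to a temporary file, then rename it over `path`."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
        # The rename is atomic when both paths are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "targets_demo.pkl")
atomic_pickle_dump({"303696_a_b": "no_goal"}, demo_path)
with open(demo_path, "rb") as f:
    restored = pickle.load(f)
```

With this approach the previous checkpoint stays intact until the new one has been written completely.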
