<a href="https://colab.research.google.com/github/rosslogan702/expected_goals_model/blob/master/first_xg_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Building an xG (Expected Goals) Model**

The following notebook walks through building an expected goals model from scratch using a dataset provided by SciSports.   

The dataset is artificial, but it is sufficient to walk through the process of building an xG model from scratch.

#### **What is xG?**

The best definition of xG I could find after a quick Google search comes from the official Bundesliga website, which states that:

*   Expected goals (xG) is a predictive model used to assess every goal-scoring chance and give a prediction on the likelihood of scoring
*   An xG model computes for each chance the probability of scoring based on a number of features that we know about the chance, e.g. distance from goal, or whether the shot was taken with the foot or the head
*   The higher the xG, the higher the probability of scoring, with 1 being the maximum value assigned to a shot and 0 the minimum.

The predictive model that assigns a probability of scoring to each chance is what we are going to create in this notebook.

[Bundesliga definition](https://www.bundesliga.com/en/bundesliga/news/expected-goals-xg-model-what-is-it-and-why-is-it-useful-sportec-solutions-3177)

### **Download Dataset**

Download the initial dataset.

In [0]:
!wget "https://github.com/JanVanHaaren/how-to-expected-goals/raw/master/shots.parquet" -O "shots.parquet"

--2020-04-24 09:30:20--  https://github.com/JanVanHaaren/how-to-expected-goals/raw/master/shots.parquet
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/JanVanHaaren/how-to-expected-goals/master/shots.parquet [following]
--2020-04-24 09:30:28--  https://raw.githubusercontent.com/JanVanHaaren/how-to-expected-goals/master/shots.parquet
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7793234 (7.4M) [application/octet-stream]
Saving to: ‘shots.parquet’


2020-04-24 09:30:29 (71.1 MB/s) - ‘shots.parquet’ saved [7793234/7793234]



### **Import Libraries**

In [0]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve
# Import SciPy function
from scipy.spatial import distance

### **View Dataset**

View the dataset downloaded and have a look at some of the features in the dataset.

In [0]:
shots_df = pd.read_parquet('shots.parquet')
print('Number of shots in dataset: {}'.format(shots_df.shape[0]))

Number of shots in dataset: 127643


In [0]:
shots_df.head(5)

Unnamed: 0,action_game_id,action_team_id,action_player_id,action_period,action_seconds,action_type_id,action_type_name,action_body_part_id,action_result,action_start_x,action_start_y,action_end_x,action_end_y,action1_game_id,action1_team_id,action1_player_id,action1_period,action1_seconds,action1_type_id,action1_type_name,action1_body_part_id,action1_result,action1_start_x,action1_start_y,action1_end_x,action1_end_y,action2_game_id,action2_team_id,action2_player_id,action2_period,action2_seconds,action2_type_id,action2_type_name,action2_body_part_id,action2_result,action2_start_x,action2_start_y,action2_end_x,action2_end_y
0,0,6,5703,1,188.23,11,shot,0,0,98.75,40.17,105.0,40.85,0.0,6.0,3726.0,1.0,187.75,1.0,cross,0.0,1.0,99.28,49.54,99.28,40.7,0.0,6.0,4070.0,1.0,185.62,0.0,pass,0.0,1.0,93.53,65.36,98.78,49.04
1,0,6,3726,1,609.31,11,shot,0,0,84.2,48.48,105.0,27.4,0.0,6.0,5797.0,1.0,607.57,0.0,pass,0.0,1.0,97.48,60.35,83.83,48.11,0.0,6.0,3726.0,1.0,606.29,0.0,pass,0.0,1.0,89.66,61.61,98.06,60.93
2,0,281,2690,1,969.28,11,shot,0,0,92.68,46.52,105.0,41.08,0.0,281.0,6886.0,1.0,967.0,0.0,pass,0.0,1.0,88.35,62.71,92.55,46.39,0.0,281.0,2690.0,1.0,963.64,0.0,pass,0.0,1.0,72.3,60.37,88.05,62.41
3,0,6,4070,1,1193.44,11,shot,0,0,94.4,46.14,104.9,37.3,0.0,6.0,3726.0,1.0,1191.37,0.0,pass,1.0,1.0,89.3,64.65,94.55,46.29,0.0,6.0,5748.0,1.0,1189.19,0.0,pass,0.0,1.0,73.04,15.18,88.79,64.14
4,0,6,6514,1,1359.36,11,shot,1,0,93.52,27.95,105.0,40.87,0.0,6.0,6514.0,1.0,1357.96,21.0,dribble,0.0,1.0,96.79,28.07,93.64,28.07,0.0,6.0,1015.0,1.0,1356.12,3.0,freekick_crossed,0.0,1.0,79.67,6.67,96.47,27.75


In [0]:
print("Features included in the dataset are: \n{}".format(shots_df.columns))

Features included in the dataset are: 
Index(['action_game_id', 'action_team_id', 'action_player_id', 'action_period',
       'action_seconds', 'action_type_id', 'action_type_name',
       'action_body_part_id', 'action_result', 'action_start_x',
       'action_start_y', 'action_end_x', 'action_end_y', 'action1_game_id',
       'action1_team_id', 'action1_player_id', 'action1_period',
       'action1_seconds', 'action1_type_id', 'action1_type_name',
       'action1_body_part_id', 'action1_result', 'action1_start_x',
       'action1_start_y', 'action1_end_x', 'action1_end_y', 'action2_game_id',
       'action2_team_id', 'action2_player_id', 'action2_period',
       'action2_seconds', 'action2_type_id', 'action2_type_name',
       'action2_body_part_id', 'action2_result', 'action2_start_x',
       'action2_start_y', 'action2_end_x', 'action2_end_y'],
      dtype='object')


There are a number of features in the dataset which are all explained below.   

The definitions of the features come from https://github.com/JanVanHaaren/how-to-expected-goals/blob/master/how-to-expected-goals.ipynb



### **Feature Definitions**

For each shot, the dataset contains the following information for the shot as well as the two actions immediately preceding the shot:



*   game_id: a unique identifier of the game;
*   team_id: a unique identifier of the team who performed the action;
*   player_id: a unique identifier of the player who performed the action;
*   period: 1 for the first half and 2 for the second half;
*   seconds: the time elapsed in seconds since the start of the half;
*   type_id: the identifier for the type of action;
*   type_name: the name for the type of action;
*   body_part_id: 0 for foot, 1 for head, 2 for other body part;
*   result: the result of the action: 0 for failure, 1 for success;
*   start_x: the x coordinate for the location where the action started, ranges from 0 to 105;
*   start_y: the y coordinate for the location where the action started, ranges from 0 to 68;
*   end_x: the x coordinate for the location where the action ended, ranges from 0 to 105;
*   end_y: the y coordinate for the location where the action ended, ranges from 0 to 68.

The prefix action refers to the shot, whereas the prefixes action1 and action2 refer to the last and second-to-last actions prior to the shot.
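
As a small sketch of this naming scheme, the column names can be rebuilt by combining each prefix with each feature name. The variable names here are illustrative and are not part of the original notebook.

```python
from itertools import product

# Prefixes: the shot itself, then the last and second-to-last actions before it
prefixes = ["action", "action1", "action2"]

# The 13 per-action features listed above
features = [
    "game_id", "team_id", "player_id", "period", "seconds",
    "type_id", "type_name", "body_part_id", "result",
    "start_x", "start_y", "end_x", "end_y",
]

# Column names follow the pattern "<prefix>_<feature>"
column_names = [f"{prefix}_{feature}" for prefix, feature in product(prefixes, features)]

print(len(column_names))   # 3 prefixes x 13 features
print(column_names[:3])
```

This gives 39 names, which matches the number of columns printed by `shots_df.columns` above.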

The mapping between the type_id and type_name values is as follows:

*   0: pass
*   1: cross
*   2: throw in
*   3: freekick crossed
*   4: freekick short
*   5: corner crossed
*   6: corner short
*   7: take on
*   8: foul
*   9: tackle
*  10: interception
*  11: shot
*  12: shot penalty
*  13: shot freekick
*  14: keeper save
*  18: clearance
*  21: dribble
*  22: goalkick
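
The mapping above can be kept as a plain dictionary for lookups. This is only a sketch: the helper `action_type_name` is hypothetical, and the use of underscores in multi-word names is an assumption based on the `freekick_crossed` value visible in the sample rows.

```python
# Mapping from type_id to type_name, copied from the list above
ACTION_TYPES = {
    0: "pass", 1: "cross", 2: "throw in", 3: "freekick crossed",
    4: "freekick short", 5: "corner crossed", 6: "corner short",
    7: "take on", 8: "foul", 9: "tackle", 10: "interception",
    11: "shot", 12: "shot penalty", 13: "shot freekick",
    14: "keeper save", 18: "clearance", 21: "dribble", 22: "goalkick",
}

def action_type_name(type_id):
    """Hypothetical helper: look up a type name, using underscores like the dataset."""
    # Assumption: multi-word names use underscores in the data (e.g. freekick_crossed)
    return ACTION_TYPES[int(type_id)].replace(" ", "_")

print(action_type_name(11))   # shot
print(action_type_name(3.0))  # freekick_crossed
```

The `int(...)` conversion is there because the `action1_type_id` and `action2_type_id` columns are stored as floats in the sample output.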

#### **Create a Baseline Model**

Firstly, we are going to create a very basic and standard baseline model for predicting xG values for shots.  

This model is going to be extremely basic and will contain a limited feature set without doing anything complicated.  

The purpose of creating such a model is that it can serve as a barometer for more complicated models further down the line. That is, more complicated models are only worthwhile if they outperform the simple model.

In fact, the very first model we create will be "dumb": a Python classifier whose predict method always returns 1. That is, it will always predict that the shot was scored.

In [0]:
# Sanity check to ensure that we are dealing with all shots
shots_df.action_type_name.unique()

array(['shot'], dtype=object)

In [0]:
# Of the shots, how many ended up in a success (indicated by 1, ie a goal) and how many ended up in a failure state (indicated by 0)
shots_df.action_result.value_counts()

0    114102
1     13541
Name: action_result, dtype: int64

We can see from the above counts that far more shots end in failure (114,102) than in success (13,541), so only about 10.6% of the shots in the dataset are goals. This is OK, but we need to be careful to preserve this ratio when splitting the data into training and test sets.

##### **Train/Test Split**

Split the data into a training & test set to enable us to evaluate the model performance fairly.  

In [0]:
shot_features_df = shots_df.drop(columns=['action_result'])
shot_features_df.shape

(127643, 38)

In [0]:
# X_ variables are the features and y_ are the results
X_train, X_test, y_train, y_test = train_test_split(shot_features_df, 
                                                    shots_df['action_result'],
                                                    train_size=0.8, 
                                                    stratify=shots_df['action_result'],
                                                    random_state=42)

In [0]:
# Let's check how many successes (shots that led to a goal) & failures (shots that did not lead to a goal) are in the training set
print('y_train: {}'.format(y_train.value_counts()))
# Let's check the same for y_test
print('y_test: {}'.format(y_test.value_counts()))

y_train: 0    91281
1    10833
Name: action_result, dtype: int64
y_test: 0    22821
1     2708
Name: action_result, dtype: int64
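
To check that the stratified split kept the class balance, the share of goals can be computed from the counts printed above. This is a minimal sketch using plain Python; the dictionary layout is just for illustration.

```python
# Goal / no-goal counts printed in the cells above
counts = {
    "full":  {"no_goal": 114102, "goal": 13541},
    "train": {"no_goal": 91281, "goal": 10833},
    "test":  {"no_goal": 22821, "goal": 2708},
}

goal_share = {
    name: c["goal"] / (c["goal"] + c["no_goal"]) for name, c in counts.items()
}

for name, share in goal_share.items():
    print(f"{name}: {share:.4f}")
```

All three shares come out at roughly 0.106, which is what `stratify` is meant to achieve.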


#### **Baseline Predictions**

Now that we have our first train/test split, we can calculate our first set of predictions.  

The very first model that we are going to use is a dumb model that always returns 1 (success, i.e. a goal) without taking any features into account.  

As mentioned above, this provides a baseline against which more complicated models can be compared.



In [0]:
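class AlwaysGoalClassifier(BaseEstimator):
  """Baseline that ignores the features and predicts a goal (1) for every shot."""

  def fit(self, X, y=None):
    # Nothing to learn; return self to follow the scikit-learn estimator convention
    return self

  def predict(self, X):
    # One prediction per row, as a 1-D array of ones
    return np.ones(len(X), dtype=int)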

In [0]:
always_goal_clf = AlwaysGoalClassifier()
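
As an aside, scikit-learn ships a `DummyClassifier` that can play the same role as a hand-written baseline. The sketch below runs on toy data invented for illustration, not on the shots dataset, and is not the approach used in the rest of this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data for illustration only (not the shots dataset)
X_toy = np.zeros((5, 2))
y_toy = np.array([0, 0, 0, 1, 0])

# strategy="constant" always predicts the value passed as `constant`
dummy_clf = DummyClassifier(strategy="constant", constant=1)
dummy_clf.fit(X_toy, y_toy)

toy_pred = dummy_clf.predict(X_toy)
print(toy_pred)
```

Every prediction is 1 regardless of the features, which is the same behaviour as an always-goal baseline.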

#### **Evaluation Metrics**

We will now evaluate the performance of our `AlwaysGoalClassifier` which, as the name suggests, always predicts a goal.

##### **Accuracy**

In [0]:
cross_val_score(always_goal_clf, X_train, y_train, cv=3, scoring="accuracy")

array([0.10576415, 0.10679241, 0.10570539])
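
A classifier that always predicts a goal is correct exactly when the shot was a goal, so its accuracy should be close to the share of goals in the training set. The sketch below checks this using only the numbers printed above.

```python
# Fold accuracies reported by cross_val_score above
fold_accuracies = [0.10576415, 0.10679241, 0.10570539]
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)

# Share of goals in the training set (counts from the train/test split)
goals, no_goals = 10833, 91281
goal_share = goals / (goals + no_goals)

print(f"mean accuracy: {mean_accuracy:.4f}")
print(f"goal share:    {goal_share:.4f}")
```

The small differences between folds simply reflect slightly different numbers of goals in each fold.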

##### **Confusion Matrix**

In [0]:
y_train_pred = cross_val_predict(always_goal_clf, X_train, y_train, cv=3)

In [0]:
confusion_matrix(y_train, y_train_pred)

array([[    0, 91281],
       [    0, 10833]])

Reading the confusion matrix, each row represents an actual label (in our case, goal or no goal) and each column represents a predicted label (again, goal or no goal).  

According to the sklearn documentation, when we don't provide labels, it takes the labels that appear in y_train or y_train_pred and uses them in sorted order. In our case this means that no goal (0) appears first in both the rows and the columns.

Looking at the numbers from the confusion matrix, this means that:



*   When the actual label was no goal, we predicted no goal 0 times
*   When the actual label was no goal, we predicted a goal 91281 times
*   When the actual label was goal, we predicted no goal 0 times
*   When the actual label was goal, we predicted goal 10833 times

These numbers tie up with our "dumb" classifier, which always predicts a goal. As you can see, we never predict no goal; we predict a goal regardless of the shot.
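
The four cells can also be unpacked into the usual TN/FP/FN/TP names. This is a small sketch that re-enters the matrix printed above by hand rather than recomputing it.

```python
import numpy as np

# Confusion matrix from the cell above: rows = actual, columns = predicted,
# both ordered as [no goal (0), goal (1)]
cm = np.array([[0, 91281],
               [0, 10833]])

# Flattening row by row gives TN, FP, FN, TP for a binary problem
tn, fp, fn, tp = cm.ravel()

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

These four counts are all that is needed for the precision, recall and F1 formulas that follow.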



##### **Precision**

Precision is the accuracy of the positive predictions: of all the shots we predicted to be goals, the fraction that actually were goals.  

$precision = \frac{TP}{TP + FP}$

In [0]:
precision_score(y_train, y_train_pred)

0.10608731417827134

##### **Recall**

Recall (also called sensitivity or the true positive rate) is the ratio of positive instances that are correctly detected by the classifier.

$recall = \frac{TP}{TP + FN}$

In [0]:
recall_score(y_train, y_train_pred)

1.0
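
Both scores can be reproduced by hand from the confusion matrix counts, which is a useful sanity check on the formulas. A minimal sketch:

```python
# Counts taken from the confusion matrix above
tp, fp, fn = 10833, 91281, 0

precision = tp / (tp + fp)   # share of predicted goals that were goals
recall = tp / (tp + fn)      # share of actual goals that were predicted

print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
```

Recall is 1.0 because there are no false negatives: a classifier that never predicts "no goal" cannot miss a goal.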

##### **F1-Score**

The F1 score is the harmonic mean of precision and recall. 

$F_1 = \frac{TP}{TP + \frac{FN + FP}{2}}$

The classifier will only get a high F1 score if both the recall and precision are high.

In [0]:
f1_score(y_train, y_train_pred)

0.19182448405004116
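
The harmonic-mean definition and the count-based formula above are two ways of writing the same quantity. The sketch below computes both from the confusion matrix counts to show that they agree.

```python
import math

tp, fp, fn = 10833, 91281, 0
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1_harmonic = 2 * precision * recall / (precision + recall)

# Count-based form used in the formula above
f1_counts = tp / (tp + (fn + fp) / 2)

print(f"{f1_harmonic:.4f} {f1_counts:.4f}")
```

Because the harmonic mean is pulled towards the smaller of the two values, the perfect recall cannot compensate for the low precision.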

Precision and recall trade off against each other, and the right balance depends on the problem that we are solving.

There are additional ways to evaluate the performance of a classifier (e.g. the ROC curve); however, with such a simple model we will put that on hold until we train a better one.

By always guessing that a goal will be scored, we end up with fairly low precision and F1 scores. Recall is high, but only because we always guess that a goal will be scored without any real logic.
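
To make the trade-off concrete, here is a toy sketch with invented labels and predicted probabilities (not taken from the shots dataset). The helper `precision_recall_at` is hypothetical; it predicts a goal whenever the probability is at or above a threshold.

```python
# Toy labels and predicted goal probabilities, invented for illustration
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1]

def precision_recall_at(threshold):
    """Hypothetical helper: precision and recall when predicting a goal if prob >= threshold."""
    y_pred = [int(p >= threshold) for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.0, 0.5, 0.85):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

A threshold of 0.0 behaves like the always-goal classifier (perfect recall, poor precision). Raising the threshold improves precision on this toy data at the cost of recall.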

#### **Model V1**

Now that we have assessed a very "dumb" classifier, which simply guesses that every shot ends in a goal, we can move on to creating something a little more intelligent.

We are going to borrow some feature engineering from this notebook to get us started to create our first V1 model.

https://github.com/JanVanHaaren/how-to-expected-goals/blob/master/how-to-expected-goals.ipynb

#### **Normalize Location Features**

The first step of feature engineering that we are going to borrow is to normalize the location features of the dataset.

In [0]:
for action in ['action', 'action1', 'action2']:
  for side in ['start', 'end']:
    # Normalize the X location
    key_x = '{}_{}_x'.format(action, side)
    shots_df[key_x] = shots_df[key_x] / 105
    
    # Normalize the Y location
    key_y = '{}_{}_y'.format(action, side)
    shots_df[key_y] = shots_df[key_y] / 68
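
As a worked example of the normalization, here is the start location of the first shot shown in `shots_df.head()` scaled by the pitch dimensions. The constant names are illustrative only.

```python
# Pitch dimensions implied by the coordinate ranges (x: 0-105, y: 0-68)
PITCH_LENGTH = 105
PITCH_WIDTH = 68

# Start location of the first shot shown in shots_df.head()
start_x, start_y = 98.75, 40.17

norm_x = start_x / PITCH_LENGTH
norm_y = start_y / PITCH_WIDTH

print(f"({norm_x:.3f}, {norm_y:.3f})")
```

Note that the notebook cell above overwrites the columns in place, so running it twice would divide by the pitch dimensions twice.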

#### **Compute Additional Features**

In [0]:
# Normalized location for the center of the opposing goal
goal = (1, 0.5)

In [0]:
# Compute distance to goal for each action's start location
for action in ['action', 'action1', 'action2']:
  key_start_x = '{action}_start_x'.format(action=action)
  key_start_y = '{action}_start_y'.format(action=action)
  key_start_distance = '{action}_start_distance'.format(action=action)
  
  shots_df[key_start_distance] = shots_df.apply(lambda s: distance.euclidean((s[key_start_x], s[key_start_y]), goal), axis=1)
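
The row-wise `apply` above calls a Python function once per shot, which can be slow on a dataset of this size. A vectorized alternative is sketched below with NumPy, using the first two shots from `shots_df.head()` as sample input. This is one possible rewrite, not the method used in the borrowed notebook.

```python
import numpy as np

# Normalized centre of the opposing goal, as defined above
goal_x, goal_y = 1.0, 0.5

# Normalized start locations of the first two shots from shots_df.head()
start_x = np.array([98.75, 84.2]) / 105
start_y = np.array([40.17, 48.48]) / 68

# np.hypot computes sqrt(dx**2 + dy**2) element-wise, with no Python-level loop
start_distance = np.hypot(start_x - goal_x, start_y - goal_y)

print(start_distance.round(3))
```

With the DataFrame, the same idea would be `np.hypot(shots_df[key_start_x] - goal[0], shots_df[key_start_y] - goal[1])`, which should give the same values as the `apply` version.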

In [0]:
# Determine body part used for each action
for action in ['action', 'action1', 'action2']:
  key_body_part_id = '{action}_body_part_id'.format(action=action)
  
  key_is_foot = '{action}_is_foot'.format(action=action)
  key_is_head = '{action}_is_head'.format(action=action)
  key_is_other = '{action}_is_other'.format(action=action)
  
  shots_df[key_is_foot] = shots_df[key_body_part_id] == 0
  shots_df[key_is_head] = shots_df[key_body_part_id] == 1
  shots_df[key_is_other] = shots_df[key_body_part_id] == 2
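
The three boolean columns amount to a one-hot encoding of `body_part_id`. The sketch below shows the same comparisons on a small example array (the values are invented for illustration).

```python
import numpy as np

# Example body_part_id values: 0 = foot, 1 = head, 2 = other body part
body_part_id = np.array([0, 0, 1, 2])

is_foot = body_part_id == 0
is_head = body_part_id == 1
is_other = body_part_id == 2

# Each action is flagged for exactly one body part
flags_per_action = is_foot.astype(int) + is_head.astype(int) + is_other.astype(int)
print(flags_per_action)
```

Because exactly one flag is set per action, any two of the three columns determine the third. This is presumably why the feature list below keeps only `action_is_foot` and `action_is_head`.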

#### **Retrieve Features**

We are now going to select an initial set of features on which to train a model that predicts whether or not a shot resulted in a goal.

In [0]:
# Features
columns_features = [
    'action_start_x',
    'action_start_y',
    'action_is_foot',
    'action_is_head',
    'action_start_distance',
    'action1_start_distance',
    'action2_start_distance'
]

# Label: 1 if a goal, 0 otherwise
column_target = 'action_result'

In [0]:
X = shots_df[columns_features]
y = shots_df[column_target]

##### **Train/Test Split**

Split the data into a training & test set to enable us to evaluate the model performance fairly.

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    train_size=0.8, 
                                                    stratify=y,
                                                    random_state=42)