<a id='top'></a>

# 3) CatBoost Gradient Boosted Chance Quality Model from Shots Data
##### Notebook to further improve the performance of the Chance Quality Model (CQM) created in the previous notebook, from a provided sample of just under 11,000 shots, through the application of the CatBoost Gradient Boosting algorithm.

### By [Edd Webster](https://www.twitter.com/eddwebster)
Notebook first written: 02/05/2021<br>
Notebook last updated: 02/05/2021

![title](../../img/expected_goals_visual.png)

Photo credit to David Sumpter ([@Soccermatics](https://twitter.com/Soccermatics?)).

---

## <a id='introduction'>Introduction</a>
This notebook is a short walk-through of how to create an Expected Goals (xG) model from a sample of just under 11,000 shots, in [Python](https://www.python.org/), using [pandas](http://pandas.pydata.org/) DataFrames, [scikit-learn](https://scikit-learn.org/stable/) and [CatBoost](https://catboost.ai/) for Machine Learning, [matplotlib](https://matplotlib.org/contents.html?v=20200411155018) visualisations, and [SHAP](https://shap.readthedocs.io/en/latest/) for feature importance.

For more information about this notebook and the author, I am available through all the following channels:
*    [eddwebster.com](https://www.eddwebster.com/);
*    edd.j.webster@gmail.com;
*    [@eddwebster](https://www.twitter.com/eddwebster);
*    [linkedin.com/in/eddwebster](https://www.linkedin.com/in/eddwebster/);
*    [github/eddwebster](https://github.com/eddwebster/); and
*    [public.tableau.com/profile/edd.webster](https://public.tableau.com/profile/edd.webster).

![title](../../img/edd_webster/fifa21eddwebsterbanner.png)

The accompanying GitHub repository for this notebook can be found [here](https://github.com/eddwebster/mcfc_submission/) and a static version of this notebook can be found [here](https://nbviewer.jupyter.org/github/eddwebster/mcfc_submission/blob/main/notebooks/chance_quality_modelling/Creating%20a%20Chance%20Quality%20Model%20from%20Shots%20Data.ipynb).

___

## <a id='notebook_contents'>Notebook Contents</a>
1.      [Notebook Dependencies](#section1)<br>
2.      [Project Brief](#section2)<br>
3.      [Introduction to CatBoost](#section3)<br>
4.      [Data Sources](#section4)<br>
        1.    [Data Dictionary](#section4.1)<br>
        2.    [Creating the DataFrame](#section4.2)<br>
        3.    [Initial Data Handling](#section4.3)<br>    
5.      [Initial Modeling](#section5)<br>
6.      [k-fold Cross Validation using CatBoost](#section6)<br>
7.      [Feature Importance with CatBoost](#section7)<br>
8.      [Visualisation of CatBoost Tree](#section8)<br>
9.      [Hyperparameter Optimisation](#section9)<br>
10.     [Final Optimised CatBoost Model](#section10)<br>
11.     [Performance comparison of CatBoost with XGBoost and Logistic Regression](#section11)<br>
12.     [Assessment of the Performance of the Teams in Game 2 of the Metrica Sports Shot Data](#section12)<br>
13.     [Summary](#section13)<br>
14.     [Next Steps](#section14)<br>
15.     [References and Further Reading](#section15)<br>

---

## <a id='#section1'>1. Notebook Dependencies</a>
This notebook was written using [Python 3](https://docs.python.org/3.7/) and requires the following libraries:
*    [`Jupyter notebooks`](https://jupyter.org/) for this notebook environment with which this project is presented;
*    [`NumPy`](http://www.numpy.org/) for multidimensional array computing;
*    [`pandas`](http://pandas.pydata.org/) for data analysis and manipulation;
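*    [`matplotlib`](https://matplotlib.org/contents.html?v=20200411155018) for data visualisations;
*    [`scikit-learn`](https://scikit-learn.org/stable/index.html) for Machine Learning;
*    [`CatBoost`](https://catboost.ai/) for Gradient Boosting; and
*    [`SHAP`](https://shap.readthedocs.io/en/latest/) for feature importance.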

Most packages used for this notebook can be obtained by downloading and installing the [Conda](https://anaconda.org/anaconda/conda) distribution, available on all platforms (Windows, Linux and Mac OSX). Step-by-step guides on how to install Anaconda can be found for Windows [here](https://medium.com/@GalarnykMichael/install-python-on-windows-anaconda-c63c7c3d1444) and Mac [here](https://medium.com/@GalarnykMichael/install-python-on-mac-anaconda-ccd9f2014072), as well as in the Anaconda documentation itself [here](https://docs.anaconda.com/anaconda/install/).

### Import Libraries and Modules

In [None]:
%load_ext autoreload
%autoreload 2

# Python ≥3.5 (ideally)
import platform
import sys, getopt
assert sys.version_info >= (3, 5)
import csv

# Import Dependencies
%matplotlib inline

# Math Operations
import numpy as np
import math
from math import pi

# Datetime
import datetime
from datetime import date
import time

# Data Preprocessing
import pandas as pd
import pandas_profiling as pp
import os
import re
import random
from io import BytesIO
from pathlib import Path

# Reading directories
import glob
import os
from os.path import basename

# Working with JSON
import json
from pandas import json_normalize    # top-level import; the pandas.io.json path is deprecated

# Data Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import patches
from matplotlib.patches import Arc
from matplotlib.colors import ListedColormap
import plotly
import plotly.graph_objects as go
import ruamel.yaml
import seaborn as sns
plt.style.use('seaborn-whitegrid')
import missingno as msno
#from xgboost import plot_tree
#import graphviz

# Downloading data sources
from urllib.parse import urlparse
from urllib.request import urlopen, urlretrieve
from zipfile import ZipFile, is_zipfile
from tqdm import tqdm

# Football libraries
#import FCPython
#from FCPython import createPitch
#import matplotsoccer

# Machine Learning
import scipy as sp
from scipy.spatial import distance
import sklearn
from sklearn.ensemble import RandomForestClassifier, IsolationForest
#from sklearn.inspection import permutation_importance
import sklearn.metrics as sk_metrics
from sklearn.metrics import log_loss, brier_score_loss, roc_auc_score , roc_curve, average_precision_score, accuracy_score
from sklearn.model_selection import train_test_split, StratifiedKFold, KFold, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression, LinearRegression
from scikitplot.metrics import plot_roc_curve, plot_precision_recall_curve, plot_calibration_curve
import pickle
import shap
#import lightgbm as lgb
#import xgboost as xgb
#from xgboost import XGBClassifier, cv
import catboost
from catboost import CatBoostClassifier, Pool, cv, MetricVisualizer

# Display in Jupyter
from IPython.display import Image, Video, YouTubeVideo
from IPython.core.display import HTML

# Ignore Warnings
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

print('Setup Complete')

In [None]:
# Python / module versions used here for reference
print('Python: {}'.format(platform.python_version()))
print('NumPy: {}'.format(np.__version__))
print('pandas: {}'.format(pd.__version__))
print('matplotlib: {}'.format(mpl.__version__))
print('Seaborn: {}'.format(sns.__version__))
print('Plotly: {}'.format(plotly.__version__))
print('CatBoost: {}'.format(catboost.__version__))

### Defined Variables

In [None]:
# Define today's date
today = datetime.datetime.now().strftime('%d%m%Y')

### Defined Filepaths

In [None]:
# Set up initial paths to subfolders
base_dir = os.path.join('..', '..')
data_dir = os.path.join(base_dir, 'data')
data_dir_shots = os.path.join(base_dir, 'data', 'shots')
data_dir_metrica = os.path.join(base_dir, 'data', 'metrica-sports')
models_dir = os.path.join(base_dir, 'models')
models_dir_shots = os.path.join(base_dir, 'models', 'shots')
scripts_dir = os.path.join(base_dir, 'scripts')
img_dir = os.path.join(base_dir, 'img')
fig_dir = os.path.join(base_dir, 'img', 'fig')
fig_shots_dir = os.path.join(base_dir, 'img', 'fig', 'shots')
video_dir = os.path.join(base_dir, 'video')

### Custom Functions

In [None]:
# Simple timer function to time algorithms
def timer(start_time=None):
    if not start_time:
        start_time = datetime.datetime.now()
        return start_time
    elif start_time:
        thour, temp_sec = divmod((datetime.datetime.now() - start_time).total_seconds(), 3600)
        tmin, tsec = divmod(temp_sec, 60)
        print('\n Time taken: %i hours %i minutes and %s seconds.' % (thour, tmin, round(tsec, 2)))
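
The helper is meant to be called twice: once with no argument to record a start time, and once with that start time to report how long a step took. The sketch below is a slightly modified variant (it also returns the elapsed seconds, which the original does not), with `time.sleep` standing in for a slow step such as model training.

```python
import datetime
import time


def timer(start_time=None):
    """Return a start timestamp, or print and return the seconds elapsed since `start_time`."""
    if start_time is None:
        return datetime.datetime.now()
    elapsed = (datetime.datetime.now() - start_time).total_seconds()
    hours, remainder = divmod(elapsed, 3600)
    minutes, seconds = divmod(remainder, 60)
    print(f'Time taken: {int(hours)} hours {int(minutes)} minutes and {round(seconds, 2)} seconds.')
    return elapsed


start = timer()                  # first call: record the start time
time.sleep(0.1)                  # stand-in for a slow step
elapsed_seconds = timer(start)   # second call: report the elapsed time
```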

### Notebook Settings

In [None]:
# Display all columns of pandas DataFrames
pd.set_option('display.max_columns', None)

---

## <a id='#section2'>2. Project Brief</a>

### <a id='#section2.1'>2.1. About this notebook</a>
In most real-world machine-learning tasks, a well-tuned gradient booster such as [CatBoost](https://catboost.ai/), [LightGBM](https://lightgbm.readthedocs.io/en/latest/) or [XGBoost](https://xgboost.readthedocs.io/en/latest/) outperforms Logistic Regression when using the same set of features.

Often, carefully hand-crafted features can bring Logistic Regression to (almost) similar performance to Gradient Boosting algorithms. This is especially true when dealing with rather simple problems with a relatively small number of features.

Using the engineered dataset from the first notebook, the following sections build upon that modelling with gradient boosting algorithms, to try to further reduce the Log Loss of the model and improve the predictions made on the Metrica Sports data, in order to analyse which team was more deserving to win the game based solely on the chances created.

This notebook builds upon the work conducted in the first Chance Quality notebook, which trained a Logistic Regression model, and aims to further improve the performance using [CatBoost](https://catboost.ai/). The model uses the engineered dataset of just under 11,000 shots derived in the previous notebook, using [pandas](http://pandas.pydata.org/) DataFrames for data manipulation, [matplotlib](https://matplotlib.org/contents.html?v=20200411155018) for data visualisation, and [SHAP](https://shap.readthedocs.io/en/latest/) for feature importance.

**Notebook Conventions**:<br>
*    Variables that refer to a `DataFrame` object are prefixed with `df_`.
*    Variables that refer to a collection of `DataFrame` objects (e.g., a list, a set or a dict) are prefixed with `dfs_`.

### <a id='#section2.2'>2.2. Challenge</a>
Defined in the previous notebook but included here as a reminder of the big picture.

<b>Step 1:</b>
We have attached to this email a sample of just under 11,000 shots (ShotData.csv, and the associated description in ShotData.txt). We would like you to use this data to build a chance quality model that calculates the probability of a shot resulting in a goal (i.e. P[goal|shot,situation]) using whichever situational variables in the data you think are informative. We ask that you provide a description of the method that you chose, including any metrics and plots that you have used to understand and assess the performance of your model. This description may take the form of a short written report (no more than one page of text plus additional room for figures & tables) or a slide pack (PowerPoint, Google slides, etc; no more than a total of 10 slides).

<b>Step 2:</b>
In the second step we ask you to work with the tracking data for a single game to analyse the shooting opportunities that each team created. In this github repository* you will find the tracking data for two matches, along with a description of the data. Using the data for sample game 2 in the repository, identify the shots in this game and write a short report describing the major chances that each team created during the game, making use of the chance quality model that you developed in Step 1 and any other information that you think is relevant. Based solely on the quality of chances that each team created, which team do you think deserved to win the game? Your report may take the form of a document (1 page plus additional room for figures & tables) or presentation (no more than 10 slides).

This notebook is concerned with <b>Step 1</b> - building a Chance Quality Model, using [CatBoost](https://catboost.ai/).

### <a id='#section2.3'>2.3. What is CatBoost?</a>
CatBoost is ...

This notebook aims to serve as an explanatory guide to [CatBoost](https://catboost.ai/), for beginners and those with previous experience alike, and goes into much more detail about this algorithm in the following sections.

### <a id='#section2.4'>2.4. Modelling Approach</a>
This model builds upon the work done in the first notebook in the series, which built a Chance Quality Model using Logistic Regression, and looks to improve on the results using the popular CatBoost algorithm. The approach taken in this notebook can be defined as the following:

*    <b>Introduction to CatBoost</b>: introduction to the concept of CatBoost ([section 3](#section3));
*    <b>Data Sources</b>: ([section 4](#section4));
*    <b>Initial Modeling</b>: first model created as a baseline on which iterations of improvement are based ([section 5](#section5));
*    <b>k-fold Cross Validation using CatBoost</b>: ([section 6](#section6));
*    <b>Feature Importance with CatBoost</b>: ([section 7](#section7));
*    <b>Visualisation of CatBoost Tree</b>: ([section 8](#section8));
*    <b>Hyperparameter Optimisation</b>: ([section 9](#section9));
*    <b>Final Optimised CatBoost Model</b>: ([section 10](#section10));
*    <b>Performance Comparison of CatBoost with XGBoost and Logistic Regression</b>: ([section 11](#section11));
*    <b>Assessment of the Performance of the Teams in Game 2 of the Metrica Sports Shot Data</b>: ([section 12](#section12));
*    <b>Summary</b>: ([section 13](#section13));
*    <b>Next Steps</b>: ([section 14](#section14)); and
*    <b>References and Further Reading</b>: ([section 15](#section15)).

---

## <a id='#section3'>3. Introduction to CatBoost</a>
*    
*    
*    

---

## <a id='#section4'>4. Data Sources</a>
The following cells read in the engineered Shots data as a CSV file, created in the first Chance Quality Model notebook [[link](https://github.com/eddwebster/mcfc_submission/blob/main/notebooks/chance_quality_modelling/Creating%20a%20Chance%20Quality%20Model%20from%20Shots%20Data.ipynb)]. The original dataset of just under 11,000 shots was provided by [Laurie Shaw](https://twitter.com/EightyFivePoint) from [City Football Group](https://www.cityfootballgroup.com/) (see docs [[link](https://github.com/eddwebster/mcfc_submission/blob/main/documentation/shots/ShotData.txt)]).

### <a id='#section4.1'>4.1. Data Dictionary</a>
The following information is as per the definition in the `ShotData.txt` documentation, provided with the data [[link](https://github.com/eddwebster/mcfc_submission/blob/main/documentation/shots/ShotData.txt)].

The engineered shots DataFrame contains the following features:

| Feature                           | Original/Engineered?     | Variable Type      | Data Type    | Description    |
|-----------------------------------|--------------------------|--------------------|--------------|----------------------------------------------------|
| `match_minute`                    | original                 | continuous         | int64        | Minute of the match in which the shot was taken     |
| `match_second`                    | original                 | continuous         | int64        | Second of match_minute in which the shot was taken     |
| `position_x`                      | original                 | continuous         | float64      | Position of the shot on the pitch in meters (x-coordinate)     |
| `position_y`                      | original                 | continuous         | float64      | Position of the shot on the pitch in meters (y-coordinate)     |
| `play_type`                       | original                 | categorical        | object       | Game situation in which the shot was taken (open play, penalty, direct free kick, direct from a corner)     |
| `BodyPart`                        | original                 | categorical        | object       | Body part with which shot was taken (left foot, right foot, head, other)                        |
| `Number_Intervening_Opponents`    | original                 | discrete           | int64        | The number of opposing players that were obscuring the goal at the instant of the shot (from the perspective of the shot-taker)     |
| `Number_Intervening_Teammates`    | original                 | discrete           | int64        | The number of teammates that are obscuring the goal at the instant of the shot (from the perspective of the shot-taker)     |
| `Interference_on_Shooter`         | original                 | categorical        | object       | The degree of direct interference exerted on the shot-taker from defenders (Low - no or minimal interference, Medium - a single defender was in close proximity to the shot-taker; High - multiple defenders in close proximity and interfering with the shot).     |
| `outcome`                         | original                 | discrete           | object       | The outcome of the shot (blocked, missed, goal frame (post or bar), saved, goal or own goal).     |
| `position_xM`                     | engineered               | discrete           | float64      | Converted position of the shot along the x-axis, derived from the `position_x` feature for x(-53, +53) dimensions.     |
| `position_yM`                     | engineered               | discrete           | float64      | Converted position of the shot along the y-axis, derived from the `position_y` feature for y(-34, +34) dimensions.     |
| `isGoal`                          | engineered               | discrete           | object       | Indicates whether the resulting shot was a goal, or not. Derived from the `outcome` feature. Used as the target variable.     |
| `distance_to_goalM`               | engineered               | continuous         | float64      | Distance of the shot from the goal, in meters. Determined from the `position_xM` and `position_yM` features.    |
| `distance_to_centerM`             | engineered               | continuous         | float64      | Distance of the shot from the center of the pitch, in meters. Determined from the `position_yM` feature.     |
| `angle`                           | engineered               | discrete           | float64      | Angle of the shot to the goal, in degrees. Determined from the `position_xM` and `position_yM` features.     |
| `isFoot`                          | engineered               | discrete           | int64        | Indicates whether the shot was taken with the foot - yes (1) or no (0).     |
| `isHead`                          | engineered               | discrete           | int64        | Indicates whether the shot was taken with the head - yes (1) or no (0).     |
| `header_distance_to_goalM`        | engineered               | continuous         | float64      | Distance of the headed shot from the goal, in meters. Determined from the `position_xM` and `position_yM` features.     |
| `High`                            | engineered               | discrete           | int64        | One-hot/dummy encoding of the `Interference_on_Shooter` feature. Indicates whether the shot experiences a high level of interference (multiple defenders in close proximity and interfering with the shot) - yes (1) or no (0).      |
| `Medium`                          | engineered               | discrete           | int64        | One-hot/dummy encoding of the `Interference_on_Shooter` feature. Indicates whether the shot experiences a medium level of interference (a single defender was in close proximity to the shot-taker) - yes (1) or no (0).      |
| `Low`                             | engineered               | discrete           | int64        | One-hot/dummy encoding of the `Interference_on_Shooter` feature. Indicates whether the shot experiences a low level of interference (no or minimal interference) - yes (1) or no (0).      |

The dataset also contains the following columns, which are used for data visualisation purposes only and are not considered for the model (some of these may be deleted from the dataset later on, after this notebook is further tidied): `position_yM`, `position_xM_r`, `position_yM_r`, `position_xM_std`, `position_yM_std`, `position_xM_std_r`, `position_yM_std_r`, `Interference_on_Shooter_Code`, and `BodyPartCode`.

Each row consists of a single shot event.

The following data engineering of the raw shots dataset took place in the previous notebook:
*    Dataset of 10,925 shots filtered for Open Play shots only, leaving 10,269 Open-Play shots before data engineering (656 shots removed, corresponding to 156 goals).
*    Filtered out shots where `BodyPart` is 'Other' to prevent confusion when using this trained model with the Metrica Sports data where the body part with which the shot is taken is always known. 58 shots removed from the Open Play dataset, corresponding to 12 goals.
*    Filtered out shots where `Interference_on_Shooter` is 'Unknown' to prevent confusion when using this trained model with the Metrica Sports data where the Interference on the shooter is always known as it is calculated. 25 shots removed from the Open Play dataset, corresponding to 9 goals.
*    As part of the outlier removal, relabelled goals as non-goals where the `distance_to_goalM` is greater than 35m (38.27 yards), or where the `distance_to_goalM` is greater than 20m (21.87 yards) and the `angle` to the goal is also greater than 35 degrees. 33 shots were relabelled in the Open Play dataset (2.71% of all goals).

This leaves a remaining 10,165 shots for analysis, corresponding to 1,142 goals (11.2%).
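
As a rough illustration of the steps listed above, the sketch below applies the same filters and the outlier relabelling to a tiny made-up DataFrame. The column names come from the Data Dictionary, but the exact category labels (`'Open Play'`, `'Other'`, `'Unknown'`) and the `df_raw` / `df_filtered` names are assumptions, not taken from the previous notebook.

```python
import pandas as pd

# Tiny made-up sample; the category labels are assumptions about how the raw values are spelled
df_raw = pd.DataFrame({
    'play_type': ['Open Play', 'Penalty', 'Open Play', 'Open Play', 'Open Play'],
    'BodyPart': ['Left', 'Right', 'Other', 'Head', 'Right'],
    'Interference_on_Shooter': ['Low', 'Low', 'Medium', 'Unknown', 'High'],
    'distance_to_goalM': [12.0, 11.0, 8.0, 6.0, 40.0],
    'angle': [10.0, 0.0, 20.0, 5.0, 3.0],
    'isGoal': [1, 1, 0, 0, 1],
})

# Steps 1-3: keep open-play shots with a known body part and interference level
keep = (
    (df_raw['play_type'] == 'Open Play')
    & (df_raw['BodyPart'] != 'Other')
    & (df_raw['Interference_on_Shooter'] != 'Unknown')
)
df_filtered = df_raw[keep].copy()

# Step 4: relabel implausible long-range or wide-angle goals as non-goals
is_outlier = (
    (df_filtered['distance_to_goalM'] > 35)
    | ((df_filtered['distance_to_goalM'] > 20) & (df_filtered['angle'] > 35))
)
df_filtered.loc[is_outlier, 'isGoal'] = 0

print(df_filtered[['distance_to_goalM', 'angle', 'isGoal']])
```

Note that the outlier step changes only the `isGoal` label; the shot itself stays in the dataset.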

### <a id='#section4.2'>4.2. Read in CSV as pandas DataFrame</a>
The following cell reads the `CSV` file as a pandas `DataFrame`.

In [None]:
# Read data directory
print(glob.glob(os.path.join(data_dir_shots, 'engineered/*')))

In [None]:
# Read in engineered Shots CSVs as a pandas DataFrames
df_shots = pd.read_csv(os.path.join(data_dir_shots, 'engineered', 'complete_shots_engineered.csv'))
#df_shots_train = pd.read_csv(os.path.join(data_dir_shots, 'engineered', 'train_shots_engineered.csv'))
#df_shots_test = pd.read_csv(os.path.join(data_dir_shots, 'engineered', 'test_shots_engineered.csv'))

### <a id='#section4.3'>4.3. Initial Data Handling</a>

#### <a id='#section4.3.1'>4.3.1. Summary Report</a>
The initial step of the data handling and Exploratory Data Analysis (EDA) is to create a quick summary report of the dataset using [pandas Profiling Report](https://github.com/pandas-profiling/pandas-profiling).

In [None]:
# Summary of the data using pandas Profiling Report
pp.ProfileReport(df_shots)

#### <a id='#section4.3.2'>4.3.2. Further Inspection</a>
The following commands give a more bespoke summary of the dataset. Some of the commands include content covered in the [pandas Profiling](https://github.com/pandas-profiling/pandas-profiling) summary above, but using the standard [pandas](https://pandas.pydata.org/) functions and methods that most people will be more familiar with.

First check the quality of the dataset by looking at the first and last rows in pandas using the [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) and [tail()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) methods.

In [None]:
# Display the first 5 rows of the engineered DataFrame, df_shots
df_shots.head()

In [None]:
# Display the last 5 rows of the engineered DataFrame, df_shots
df_shots.tail()

In [None]:
# Print the shape of the engineered DataFrame, df_shots
print(df_shots.shape)

In [None]:
# Print the column names of the engineered DataFrame, df_shots
print(df_shots.columns)

The dataset has thirty features (columns). Full details of these attributes can be found in the [Data Dictionary](#section4.1).

In [None]:
# Data types of the features of the engineered DataFrame, df_shots
df_shots.dtypes

Full details of these attributes and their data types can be found in the [Data Dictionary](#section4.1).

In [None]:
# Print statements about the dataset

## Assign variables
count_shots = len(df_shots)
count_goals = int(df_shots['isGoal'].sum())    # use the cleaned target; 'outcome' still counts the relabelled outlier goals
cols = list(df_shots)
vals_play_type = df_shots['play_type'].unique()
vals_body_part = df_shots['BodyPart'].unique()
vals_interference = df_shots['Interference_on_Shooter'].unique()
vals_outcome = df_shots['outcome'].unique()

## Print statements
print(f'The engineered shots DataFrame contains {count_shots:,} shots and {count_goals:,} goals ({round(100*count_goals/count_shots,1)}%).\n')
print(f"The dataset contains the following columns: {cols}\n")
print(f"Unique values in the 'play_type' column: {vals_play_type}\n")    
print(f"Unique values in the 'BodyPart' column: {vals_body_part}\n")    
print(f"Unique values in the 'Interference_on_Shooter' column: {vals_interference}\n")    
print(f"Unique values in the 'outcome' column: {vals_outcome}\n")    

In [None]:
# Shot outcome types and their frequency - 'Goal' count inaccurate as some outlier goals were changed using the 'isGoal' attribute (see following command)
df_shots.groupby(['outcome']).outcome.count()

The 'Goal' count is inaccurate, as some outlier goals were changed via the `isGoal` attribute from 1 to 0 (see following command):

In [None]:
# Shot outcomes types and their frequency
df_shots.groupby(['isGoal']).outcome.count()

This is the accurate Goal count - 1,142 goals.

In [None]:
# Shot interference types and their frequency
df_shots.groupby(['Interference_on_Shooter']).outcome.count()

In [None]:
# Info for the engineered DataFrame, df_shots
df_shots.info()

In [None]:
# Description of the DataFrame, df_shots, showing some summary statistics for each numerical column in the DataFrame
df_shots.describe()

In [None]:
# Plot visualisation of the missing values for each feature of the engineered DataFrame, df_shots
msno.matrix(df_shots, figsize = (30, 7))

In [None]:
# Counts of missing values
null_value_stats = df_shots.isnull().sum(axis=0)
null_value_stats[null_value_stats != 0]

The visualisation shows us very quickly that there are no missing values, as they were treated in the first notebook. [CatBoost](https://catboost.ai/) can process missing values in numerical features (controlled by its `nan_mode` parameter), but missing values in categorical features need to be filled before training.

If the dataset did contain NULL values, these could be filled easily using the following code:

In [None]:
# Fill NULL values with a placeholder value of -999
#df_shots.fillna(-999, inplace=True)
#null_value_stats[null_value_stats != 0]
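
A single numeric placeholder is not always the most natural choice for text columns. As a minimal sketch, assuming made-up data and a hypothetical `'Missing'` label, gaps could instead be filled per column type, so that categorical gaps become their own explicit category:

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing value per column type
df_example = pd.DataFrame({
    'distance_to_goalM': [10.5, np.nan, 22.0],
    'Interference_on_Shooter': ['Low', 'High', None],
})

# Split columns by dtype: non-numeric columns are treated as categorical here
num_cols = df_example.select_dtypes(include='number').columns
cat_cols = df_example.select_dtypes(exclude='number').columns

# Categorical gaps become an explicit 'Missing' category (hypothetical label);
# numeric gaps get the -999 sentinel
df_filled = df_example.copy()
df_filled[cat_cols] = df_filled[cat_cols].fillna('Missing')
df_filled[num_cols] = df_filled[num_cols].fillna(-999)

print(df_filled.isnull().sum())
```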

---

## <a id='#section5'>5. Data Engineering</a>

### <a id='#section5.1'>5.1. Create `head_foot` attribute</a>

In [None]:
# Define dictionary of BodyPart codes
dict_head_foot = {'Left': 'Foot',
                  'Right': 'Foot',
                  'Head': 'Head',
                 }

# Map the BodyPart values to a new 'head_foot' column
df_shots['head_foot'] = df_shots['BodyPart'].map(dict_head_foot)
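
One detail of `Series.map` worth keeping in mind: any value that is not a key in the dictionary becomes `NaN`. The small sketch below uses a made-up Series, with `'Other'` included only to show the effect.

```python
import pandas as pd

dict_head_foot = {'Left': 'Foot', 'Right': 'Foot', 'Head': 'Head'}

# 'Other' is included only to show what an unmapped value turns into
body_part = pd.Series(['Left', 'Head', 'Right', 'Other'])
head_foot = body_part.map(dict_head_foot)

print(head_foot.tolist())
print(f'Unmapped rows: {head_foot.isna().sum()}')
```

Since shots with a `BodyPart` of 'Other' were filtered out in the previous notebook, no such gaps are expected here, but a quick `isna().sum()` check after the mapping would confirm it.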

---

## <a id='#section6'>6. Initial Modeling</a>
First model created as a baseline on which iterations of improvement are based.

### <a id='#section6.1'>6.1. Declare Feature Vector and Target Variable</a>

In [None]:
# Define features, as determined in the final Logistic Regression notebook
features = ['distance_to_goalM',
            'angle',
            'Number_Intervening_Opponents',
            'Number_Intervening_Teammates',
            'head_foot',
            'Interference_on_Shooter',
            'header_distance_to_goalM'
           ]

"""
# Alternative feature list: features prepared in the previous notebook that one-hot encoded categorical features, however, CatBoost can deal with categorical features
features = ['distance_to_goalM',
            'angle',
            'Number_Intervening_Opponents',
            'Number_Intervening_Teammates',
            'isFoot',
            'High',
            'Low',
            'header_distance_to_goalM'
           ]

"""

X = df_shots[features]
y = df_shots['isGoal']

Now, let's take a look at feature vector (`X`) and target variable (`y`):

In [None]:
# Display feature vector
X.head()

In [None]:
# Display target variable
y.head()

### <a id='#section6.2'>6.2. State the Categorical Features</a>
CatBoost needs to be told which features are categorical.

In [None]:
# Define categorical variables
cat_features = ['Number_Intervening_Opponents',
                'Number_Intervening_Teammates',
                'head_foot',
                'Interference_on_Shooter',
               ]

# Alternative method - all variables that aren't floats (returns column indices rather than names)
#cat_features = np.where(X.dtypes != float)[0]

# See categories
cat_features
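
As a small sketch of the alternative mentioned in the comment above, the categorical list can also be derived from the column dtypes and returned as names. The data below is made up and `cat_features_auto` is a hypothetical name; this assumes every non-float column really should be treated as categorical.

```python
import pandas as pd

# Made-up rows using the same column names as the feature list
X_example = pd.DataFrame({
    'distance_to_goalM': [10.5, 22.0],
    'angle': [12.0, 30.5],
    'Number_Intervening_Opponents': [2, 0],
    'head_foot': ['Foot', 'Head'],
})

# Treat every non-float column as categorical
cat_features_auto = [col for col in X_example.columns
                     if not pd.api.types.is_float_dtype(X_example[col])]

print(cat_features_auto)
```

Listing the features by hand, as done in this notebook, is more explicit and avoids surprises if an integer column is meant to stay numeric.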

### <a id='#section6.3'>6.3. Look at the Label Balance of the Dataset</a>

In [None]:
print(f'Labels: {set(y)}')
print(f'Zero count = {len(y) - sum(y)}, One count = {sum(y)}')

The dataset is unbalanced. This observation is considered as part of the decision making for the model.
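
One practical consequence of the imbalance is that a model with no features at all already achieves a fairly low Log Loss simply by predicting the overall goal rate for every shot. The sketch below works this out from the shot and goal counts quoted in the Data Sources section, as a reference point for the Log Loss values reported later; the variable names are illustrative only.

```python
import math

n_shots = 10_165   # shots remaining after data engineering
n_goals = 1_142    # goals among those shots

base_rate = n_goals / n_shots

# Log Loss of a model that predicts the base rate for every shot
baseline_log_loss = -(base_rate * math.log(base_rate)
                      + (1 - base_rate) * math.log(1 - base_rate))

print(f'Goal rate: {base_rate:.3f}')
print(f'Baseline Log Loss: {baseline_log_loss:.4f}')
```

This comes out at roughly 0.35 on the full dataset, so any useful model should land clearly below that figure.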

### <a id='#section6.4'>6.4. Correlation Matrix</a>

In [None]:
## Assign data to be used - correlation is only computed for the numeric features
df = X.select_dtypes(include='number').corr()
top_corr_features = df.index

## Set background colour
background = 'aliceblue'

## Create figure 
fig, ax = plt.subplots(figsize=(12, 8))
fig.set_facecolor(background)

## Seaborn heat map
sns.heatmap(df , 
            xticklabels=df.columns,
            yticklabels=df.columns,
            annot=True,
            cmap='RdYlGn',
           #vmin=0,
           #vmax=0.5
           )
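
Correlation is only defined for numeric columns, so text features such as `head_foot` do not appear in a correlation matrix unless they are encoded first. A minimal sketch of one way to include them, using made-up data and one-hot encoding via `pd.get_dummies` (an illustrative choice, not something the model itself needs):

```python
import pandas as pd

X_demo = pd.DataFrame({
    'distance_to_goalM': [8.0, 25.0, 14.0, 30.0],
    'angle': [40.0, 10.0, 25.0, 5.0],
    'head_foot': ['Head', 'Foot', 'Foot', 'Foot'],
})

# One-hot encode the text column so it can appear in the correlation matrix
X_encoded = pd.get_dummies(X_demo, columns=['head_foot'], dtype=int)
df_corr = X_encoded.corr()

print(df_corr.round(2))
```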

### <a id='#section6.5'>6.5. Store Dataset as a Pool Class</a>
There are several ways of passing a dataset to training: using `X` and `y` directly, or using the `Pool` class. `Pool` is CatBoost's class for storing a dataset (see the next few commands).

The `Pool` class is useful if the dataset has more than just `X` and `y` (for example, sample weights or groups), or if the dataset is large and takes a long time to read into Python.

In [None]:
pool = Pool(data=X, label=y, cat_features=cat_features)

### <a id='#section6.6'>6.6. Split Data into Separate Training and Test Set</a>

In [None]:
# Split X and y into Training and Testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
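
The split above is a plain random split. With an imbalanced target, one option worth considering is a stratified split, which keeps the goal rate the same in the training and test sets. The sketch below shows the idea on synthetic data with hypothetical `_demo` names; it is an alternative to, not a description of, the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the shots data: 1,000 rows with a 10% positive rate
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

# stratify=y_demo keeps the positive rate the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(f'Train goal rate: {y_tr.mean():.3f}')
print(f'Test goal rate:  {y_te.mean():.3f}')
```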

In [None]:
# Define Train pool
train_pool = Pool(data=X_train, 
                  label=y_train, 
                  cat_features=cat_features
                 )

# Define Test pool
test_pool = Pool(data=X_test, 
                 label=y_test, 
                 cat_features=cat_features
                )

### <a id='#section6.7'>6.7. Train the Baseline CatBoost</a>
When selecting the objective function for binary classification, there are two options:
1.    `Logloss` for a binary target.
2.    `CrossEntropy` for probabilities in the target.    

All available evaluation metrics: [eval_metric](https://catboost.ai/docs/concepts/python-reference_utils_eval_metric.html)
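
To make the objective concrete, the sketch below computes the binary Log Loss by hand using only the standard library. The `binary_log_loss` helper is a hypothetical name written for illustration, not part of CatBoost or scikit-learn. The same expression also accepts targets between 0 and 1, which is the situation `CrossEntropy` is intended for.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the targets under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


labels = [1, 0, 0, 1]
confident = [0.9, 0.1, 0.2, 0.8]   # mostly right, and confident about it
hedged = [0.5, 0.5, 0.5, 0.5]      # no information at all

print(binary_log_loss(labels, confident))
print(binary_log_loss(labels, hedged))
```

Lower is better: predictions that are confident and correct score well below the uninformative 50/50 prediction, while confident mistakes are penalised heavily.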

In [None]:
# Declare initial parameters - will later be tuned
params = {'iterations': 1_000,
          'learning_rate': 0.5,                   # set manually as a starting point (this is not CatBoost's default, which is much lower)
          'cat_features': cat_features,           # unique to CatBoost, parameter to state the categorical features
          'depth': 2,
         #'eval_metric': 'Logloss',               # 'AUC',
         #'custom_metric': ['Logloss', 'AUC'],
          'custom_loss': ['Logloss', 'AUC'],
          'verbose': 10,    # True
         }

# Display parameters
params

In [None]:
# Instantiate the classifier 
catboost_clf = CatBoostClassifier(**params)

# Fit the classifier to the training data
catboost_clf.fit(X_train,
                 y_train,
                 eval_set=(X_test, y_test),
                 use_best_model=True,
                 plot=True
                )

# Print statements
print(f'Model is fitted: {catboost_clf.is_fitted()}')
print(f'Model params:\n{catboost_clf.get_params()}')
print(f'Tree count: {catboost_clf.tree_count_}')

In [None]:
# Other way to visualise
#MetricVisualizer(['catboost_clf']).start()

### <a id='#section5.7'>5.7. Predictions on Test Data and Log Loss</a>

In [None]:
# Predict the probabilities of Test set
y_pred = catboost_clf.predict_proba(X_test)

print(f'Log Loss of the initial CatBoost model: {sk_metrics.log_loss(y_test, y_pred):.5f}')
#print(f'AUC of the initial CatBoost model: {sk_metrics.roc_auc_score(y_test, y_pred[:, 1])*100:.2f}%')

In [None]:
# Convert the Probability Predictions array to a pandas DataFrame, reusing the Test set index so that rows stay aligned
df_predictions_initial = pd.DataFrame(y_pred, columns=['prob_no_goal', 'prob_goal'], index=X_test.index)

# Join the Probability Predictions back onto the original Test DataFrame
df_test_predictions_initial = pd.merge(X_test, df_predictions_initial, left_index=True, right_index=True)

# Display DataFrame
df_test_predictions_initial.head()

### <a id='#section5.8'>5.8. Feature Importance</a>
The features will be analysed in more detail using [SHAP](https://shap.readthedocs.io/en/latest/) later in the notebook.

In [None]:
# Pair each feature with its importance score
feat_import = list(zip(features, catboost_clf.get_feature_importance()))
feat_import_df = pd.DataFrame(feat_import, columns=['Feature', 'VarImp'])
feat_import_df = feat_import_df.sort_values('VarImp', ascending=False)
feat_import_df

### <a id='#section5.9'>5.9. Visualise Initial CatBoost Test Predictions</a>

In [None]:
# TO DO

### <a id='#section5.10'>5.10. Evaluation</a>
We now have a baseline [CatBoost](https://catboost.ai/) model with a Log Loss of 0.29954 on the test set, which is worse than the final Logistic Regression model. To beat the performance of the existing Logistic Regression model, we're looking to reduce this Log Loss to ~0.280. The following sections work towards this through k-fold Cross Validation and Hyperparameter Optimisation.
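
To put a Log Loss figure in context, it can help to compare it with a naive model that predicts the same probability for every shot. The sketch below computes the Log Loss of such a constant prediction. The goal rate of 10% is an assumption chosen for illustration only and has not been measured from the shots data, and `constant_baseline_log_loss` is a hypothetical helper name.

```python
import math

def constant_baseline_log_loss(goal_rate):
    """Log Loss of predicting `goal_rate` for every shot, when that is the true share of goals."""
    return -(goal_rate * math.log(goal_rate) + (1 - goal_rate) * math.log(1 - goal_rate))

# Assumption for illustration only - not measured from the shots data
hypothetical_goal_rate = 0.10

baseline = constant_baseline_log_loss(hypothetical_goal_rate)
print(f'Constant-prediction baseline Log Loss: {baseline:.5f}')
```

Any useful Chance Quality Model should score below the equivalent baseline computed from the real share of goals in the data.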

## <a id='#section6'>6. Cross-validation</a>
Documentation [[link](https://catboost.ai/docs/concepts/python-reference_cv.html)].

In [None]:
# Declare initial parameters - will later be tuned
params = {'iterations': 1_000,
          'learning_rate': 0.5,
          'cat_features': cat_features,
          'depth': 2,
          'loss_function': 'Logloss',
          'custom_loss': 'AUC',
          'verbose': 10    # True
         }

# Display parameters
params

In [None]:
cv_data = cv(params=params,
             pool=train_pool,
             fold_count=10,
             shuffle=True,
             stratified=True,    # stratified by default, important for an unbalanced dataset
             partition_random_seed=42,
             plot=True,
             verbose=False
            )

In [None]:
cv_data.head(10)

In [None]:
# Find the best iteration first, then look up the standard deviation at that iteration
best_value = np.min(cv_data['test-Logloss-mean'])
best_iter = np.argmin(cv_data['test-Logloss-mean'])
standard_dev = cv_data['test-Logloss-std'][best_iter]

print(f'Best validation Log Loss score, stratified: {best_value:.4f}±{standard_dev:.4f} on step {best_iter}.')
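
The lookup of the best cross-validation step can be sketched without CatBoost or pandas. The example below uses plain lists of hypothetical per-iteration means and standard deviations (made-up numbers, not results from this notebook), and `best_cv_iteration` is a hypothetical helper name. The key point is that the best iteration has to be found before the standard deviation at that iteration can be read.

```python
def best_cv_iteration(means, stds):
    """Return (iteration, mean, std) at the lowest mean validation loss."""
    best_iter = min(range(len(means)), key=means.__getitem__)
    return best_iter, means[best_iter], stds[best_iter]

# Hypothetical per-iteration cross-validation results, for illustration only
means = [0.40, 0.31, 0.29, 0.30]
stds = [0.02, 0.015, 0.012, 0.013]

it, m, s = best_cv_iteration(means, stds)
print(f'Best validation Log Loss score: {m:.4f}±{s:.4f} on step {it}.')
```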

## <a id='#section7'>7. Feature Importance</a>
The features will be analysed in more detail using [SHAP](https://shap.readthedocs.io/en/latest/) later in the notebook.

In [None]:
# Pair each feature with its importance score
feat_import = list(zip(features, catboost_clf.get_feature_importance()))
feat_import_df = pd.DataFrame(feat_import, columns=['Feature', 'VarImp'])
feat_import_df = feat_import_df.sort_values('VarImp', ascending=False)
feat_import_df

*    We can see that the feature `angle` has been given the highest importance score among all the features, closely followed by `distance_to_goalM`.
*    Based upon these importance scores, we can keep the features with the highest scores and discard the redundant ones.
*    Thus CatBoost also gives us a way to do feature selection.
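
As a rough sketch of what importance-based feature selection could look like, the example below keeps only the features whose score clears a threshold. The importance values and the threshold of 1.0 are hypothetical and chosen purely for illustration (they are not the scores produced by the model above), and `select_features` is a hypothetical helper name.

```python
def select_features(importances, min_importance):
    """Keep features whose importance is at least `min_importance`, most important first."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked if score >= min_importance]

# Hypothetical importance scores, for illustration only
importances = {'angle': 38.0, 'distance_to_goalM': 35.5, 'isFoot': 4.0, 'Low': 0.5}

selected = select_features(importances, min_importance=1.0)
print(selected)
```

Any threshold of this kind is a judgment call, so the reduced feature set should be re-evaluated with cross-validation before it is adopted.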

## <a id='#section8'>8. Hyperparameter Optimisation</a>
So far, we have developed a simple baseline CatBoost model.

The next step is to tune these hyperparameters to improve the model and take full advantage of the CatBoost library.

### <a id='#section8.1'>8.1. Grid Search</a>
Documentation [[link](https://catboost.ai/docs/concepts/python-reference_catboost_grid_search.html)].

In [None]:
from catboost import CatBoost

In [None]:
catboost_clf = CatBoostClassifier(**params)

In [None]:
grid = {'learning_rate': [0.03, 0.1],
        'depth': [4, 6, 10],
        'l2_leaf_reg': [1, 3, 5, 7, 9]
       }
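
To get a feel for the size of this search, the sketch below enumerates every combination in the grid using only the standard library. It illustrates what an exhaustive grid implies rather than how CatBoost walks the grid internally.

```python
from itertools import product

grid = {'learning_rate': [0.03, 0.1],
        'depth': [4, 6, 10],
        'l2_leaf_reg': [1, 3, 5, 7, 9]
       }

# Every combination of values, as a list of parameter dictionaries
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

print(f'Number of combinations: {len(combinations)}')
print(f'First combination: {combinations[0]}')
```

With 2 x 3 x 5 = 30 combinations, each extra value added to the grid multiplies the training cost, which is worth bearing in mind before widening the search.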

In [None]:
grid_search_result = catboost_clf.grid_search(grid, 
                                              X=train_pool, 
                                              y=None,
                                              cv=10,
                                              partition_random_seed=42,
                                              calc_cv_statistics=True,
                                              search_by_train_test_split=True,
                                              refit=True,
                                              shuffle=True,
                                              stratified=True,
                                              verbose=True,
                                              plot=True
                                             )

Parameters giving the best value of the loss function:

In [None]:
grid_search_result['params']

Available cross-validation statistics:

In [None]:
grid_search_result['cv_results'].keys()

Quality estimated using cross-validation:

In [None]:
grid_search_result['cv_results']['test-Logloss-mean'][-1]

Model is ready to use after searching:

In [None]:
predicted = catboost_clf.predict_proba(test_pool)
predicted[:3]

### <a id='#section8.2'>8.2. Randomised Search</a>
Documentation [[link](https://catboost.ai/docs/concepts/python-reference_catboost_randomized_search.html)].

In [None]:
# Declare tuned parameters (values taken from the search grid above)
tuned_params = {'iterations': 1_000,
                'learning_rate': 0.1,
                'l2_leaf_reg': 9,
                'cat_features': cat_features,
                'depth': 4,
                'loss_function': 'Logloss',
                'custom_loss': 'AUC',
                'verbose': 10    # True
               }

# Display parameters
tuned_params

## <a id='#section9'>9. Overfitting Detector</a>

In [None]:
model_with_early_stop = CatBoostClassifier(eval_metric='Logloss',    # Logloss by default
                                           iterations=1_000,
                                           learning_rate=0.1,
                                           early_stopping_rounds=20
                                          )

model_with_early_stop.fit(train_pool,
                          eval_set=test_pool,
                          verbose=True,
                          plot=True
                         )

In [None]:
print(model_with_early_stop.tree_count_)
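
The idea behind `early_stopping_rounds` can be sketched in a few lines of plain Python: training stops once the validation metric has not improved for a set number of iterations, and the best iteration seen so far is kept. This is a simplified illustration and not CatBoost's actual overfitting detector; the loss values and the helper name `early_stopping_iteration` are hypothetical.

```python
def early_stopping_iteration(losses, patience):
    """Return the index of the best loss, stopping once `patience` rounds pass without improvement."""
    best_iter, best_loss = 0, float('inf')
    for i, loss in enumerate(losses):
        if loss < best_loss:
            best_iter, best_loss = i, loss
        elif i - best_iter >= patience:
            break  # no improvement for `patience` consecutive rounds
    return best_iter

# Hypothetical validation losses per iteration, for illustration only
losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.34, 0.38]

print(early_stopping_iteration(losses, patience=2))
print(early_stopping_iteration(losses, patience=3))
```

In this made-up example a patience of 2 stops before the later, lower loss is reached, whereas a patience of 3 finds it. This is the trade-off behind the choice of `early_stopping_rounds`.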

## <a id='#section10'>10. Final Optimised CatBoost Model</a>

---

## <a id='#section11'>11. Performance Comparison of CatBoost with XGBoost and Logistic Regression</a>

Final reported Log Loss for each model:
*    CatBoost: ....
*    XGBoost: 0.28600
*    Logistic Regression: 0.28924

This is a reduction in Log Loss of 0.00324 when using XGBoost instead of Logistic Regression.

## <a id='#section12'>12. Assessment of the Performance of the Teams in Game 2 of the Metrica Sports Shot Data</a>
The next stage of this analysis is to take the Chance Quality Model derived and apply it to the DataFrame of identified shots from game 2 of the sample Metrica Sports data, to determine which team deserved to win the game, based solely on the quality of chances that each team created.

### <a id='#section14.1'>14.1. Apply the Trained Chance Quality Model from XGBoost to the Metrica Sports Data</a>

##### Import Metrica Sports Shots data for game 2

In [None]:
# Read data directory
print(glob.glob(os.path.join(data_dir_metrica, 'engineered/*')))

In [None]:
# Read in exported Metrica Sports game 2 shots CSV as a pandas DataFrame
df_metrica = pd.read_csv(os.path.join(data_dir_metrica, 'engineered', 'game_2_shots_with_xg.csv'))

In [None]:
# Embed video of the 24 shots in game 2 of the Metrica Sports sample data
Video('../../video/fig/metrica-sports/tracking_shots_all.mp4', width=770, height=530)

In [None]:
# Rename xG column derived in previous model using Logistic Regression (later compared to value derived from XGBoost) and drop previous prediction columns
df_metrica = (df_metrica
                  .rename(columns={'xG': 'xG_LR'})
                  .drop(['prob_no_goal', 'prob_goal'], axis=1)
             )

# Display DataFrame
df_metrica.head()

##### Subset Metrica Sports data to be compatible with trained Chance Quality Model

In [None]:
# Define function to transform the Metrica Sports data to be compatible with the trained Chance Quality Model
def transformation_metrica(df):
    """
    Perform all the transformation steps that took place in the first CQM notebook.
    """
    
    df = df.copy()
    
    # Compute the distance to the center - distance_to_centerM
    df['distance_to_centerM'] = np.abs(df['position_yM'])

    # Compute header distance to goal
    df['header_distance_to_goalM'] = df['isHead'] * df['distance_to_goalM']
    
    # One-hot encode the Interference on the Shooter
    
    ## One-hot encoding
    df_dummy_interference_on_shooter = pd.get_dummies(df['Interference_on_Shooter'])
    
    # Attach separate columns back onto the dataset
    df = pd.concat([df, df_dummy_interference_on_shooter], axis=1)
    
    return df

In [None]:
# Transform and subset Metrica Sports data to be compatible with trained Chance Quality Model

## Transform data
df_metrica_trans = transformation_metrica(df_metrica)

## Define Features
features = ['distance_to_goalM',
            'angle',
            'Number_Intervening_Opponents',
            'Number_Intervening_Teammates',
            'isFoot',
            'High',     # 'Interference_on_Shooter'
            'Low',       # 'Interference_on_Shooter'
            'header_distance_to_goalM'
           ]

## Subset Metrica Sports data to the model features
df_metrica_trans = df_metrica_trans[features]

# Display DataFrame
df_metrica_trans.head()
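
One thing to watch with one-hot encoding a small sample such as a single match: if a category never occurs, no column is created for it, and the subsetting by `features` would then fail. A possible safeguard is to encode against a fixed list of categories. The sketch below shows the idea in plain Python. The helper `one_hot` is hypothetical, and `'Medium'` is a made-up third level used only for illustration.

```python
def one_hot(values, categories):
    """One-hot encode `values` against a fixed category list, so every column always exists."""
    return [{cat: int(value == cat) for cat in categories} for value in values]

# Hypothetical sample in which no shot was taken under 'High' interference
interference = ['Low', 'Medium', 'Low']

encoded = one_hot(interference, categories=['High', 'Low'])
print(encoded)
```

The `High` key is present in every row even though the value never occurs in this sample.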

##### Probability Predictions

In [None]:
# Predict the probabilities of MetricaSports shot data
y_pred_metrica = xgb_clf_final.predict_proba(df_metrica_trans[features])

In [None]:
# Convert the Probability Predictions array to a pandas DataFrame
df_metrica_predictions = pd.DataFrame(y_pred_metrica, columns=['prob_no_goal', 'prob_goal'])

# Join the Probability Predictions back onto the original test DataFrame
df_metrica_predictions_final = pd.merge(df_metrica, df_metrica_predictions, left_index=True, right_index=True)

# Create xG column and assign all penalties not accounted for in trained model an xG of 0.76
df_metrica_xg = df_metrica_predictions_final
df_metrica_xg['xG'] = df_metrica_xg['prob_goal']
df_metrica_xg['xG'] = np.where(df_metrica_xg['isPenalty'] == 1, 0.76, df_metrica_xg['xG'])

# Display DataFrame
df_metrica_xg.head()

### <a id='#section14.2'>14.2. Assessment of the Performance of the Teams in Game 2 of the Metrica Sports Shot Data</a>

##### xG Race Chart of XGBoost model predictions

In [None]:
# Create xG Race Chart of XGBoost model with revised xG for penalties

## Assign DataFrame
df = df_metrica_xg

## Create four lists to plot the different xG values - home, away, xG, and minutes. We start these with zero so our charts will start at 0
a_xG = [0]
h_xG = [0]
a_min = [0]
h_min = [0]

## Define team names from the DataFrame
hteam = 'Home'
ateam = 'Away'

## For loop to append the xG and minute for both the Home and Away teams
for x in range(len(df['xG'])):
    if df['team'][x]==ateam:
        a_xG.append(df['xG'][x])
        a_min.append(df['match_minute'][x])
    if df['team'][x]==hteam:
        h_xG.append(df['xG'][x])
        h_min.append(df['match_minute'][x])
        
## Function we use to make the xG values cumulative rather than single shot values. Goes through the list and adds the numbers together
def nums_cumulative_sum(nums_list):
    return [sum(nums_list[:i+1]) for i in range(len(nums_list))]

## Apply defined nums_cumulative_sum function to the home and away xG lists
a_cumulative = nums_cumulative_sum(a_xG)
h_cumulative = nums_cumulative_sum(h_xG)

## Find the total xG. Create a new variable from the last item in the cumulative list
alast = round(a_cumulative[-1],2)
hlast = round(h_cumulative[-1],2)

## Set background colour
background = 'aliceblue'

## Create figure 
fig, ax = plt.subplots(figsize=(15, 6))
fig.set_facecolor(background)
ax.patch.set_facecolor(background)

# Set up our base layer
mpl.rcParams['xtick.color'] = 'midnightblue'
mpl.rcParams['ytick.color'] = 'midnightblue'

## Create xG Race Chart
plt.xticks([0, 15, 30, 45, 60, 75, 90])
plt.xlabel('Minute', fontfamily='DejaVu Sans', color='midnightblue', fontsize=16)
plt.ylabel('xG', fontfamily='DejaVu Sans', color='midnightblue', fontsize=16)

# Plot the step graphs
ax.step(x=a_min, y=a_cumulative, color='blue', label=ateam, linewidth=5, where='post')
ax.step(x=h_min, y=h_cumulative, color='red', label=hteam, linewidth=5, where='post')

## Set Gridlines 
ax.grid(linewidth=0.25, color='midnightblue', axis='y', zorder=1)

## Hide all four spines
for spine in ['top', 'bottom', 'left', 'right']:
    ax.spines[spine].set_visible(False)

## Set title
ax.set_title('xG Race Chart (XGBoost)',
             loc='center',
             color='midnightblue', 
             fontweight='bold',
             fontfamily='DejaVu Sans',
             fontsize=22,
            )

## Show Legend
#plt.legend()

## Save figure
if not os.path.exists(fig_shots_dir + '/metrica_shots_race_chart_xg_xgboost.png'):
    plt.savefig(fig_shots_dir + '/metrica_shots_race_chart_xg_xgboost.png', bbox_inches='tight', dpi=300)
else:
    pass

## Show figure
plt.tight_layout()
plt.show()
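
The cumulative xG lists used in the race chart can also be built with `itertools.accumulate` from the standard library, which makes a single pass over the list instead of re-summing the prefix for every shot. The sketch below uses hypothetical per-shot xG values and checks that both approaches agree.

```python
from itertools import accumulate

def nums_cumulative_sum(nums_list):
    # Notebook-style version: re-sums the start of the list for every element
    return [sum(nums_list[:i+1]) for i in range(len(nums_list))]

# Hypothetical per-shot xG values, with a leading zero so the chart starts at 0
shot_xg = [0, 0.25, 0.5, 0.125]

# Single-pass running total
cumulative = list(accumulate(shot_xg))

print(cumulative)
print(cumulative == nums_cumulative_sum(shot_xg))
```

For 24 shots the difference in speed is negligible, so this is a readability choice more than a performance one.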

##### xG Race Chart of Logistic Regression model predictions (derived in the previous CQM notebook)

In [None]:
# Create xG Race Chart of Logistic Regression model with revised xG for penalties

## Assign DataFrame
df = df_metrica_xg

## Create four lists to plot the different xG values - home, away, xG, and minutes. We start these with zero so our charts will start at 0
a_xG = [0]
h_xG = [0]
a_min = [0]
h_min = [0]

## Define team names from the DataFrame
hteam = 'Home'
ateam = 'Away'

## For loop to append the xG and minute for both the Home and Away teams
for x in range(len(df['xG_LR'])):
    if df['team'][x]==ateam:
        a_xG.append(df['xG_LR'][x])
        a_min.append(df['match_minute'][x])
    if df['team'][x]==hteam:
        h_xG.append(df['xG_LR'][x])
        h_min.append(df['match_minute'][x])
        
## Function we use to make the xG values cumulative rather than single shot values. Goes through the list and adds the numbers together
def nums_cumulative_sum(nums_list):
    return [sum(nums_list[:i+1]) for i in range(len(nums_list))]

## Apply defined nums_cumulative_sum function to the home and away xG lists
a_cumulative = nums_cumulative_sum(a_xG)
h_cumulative = nums_cumulative_sum(h_xG)

## Find the total xG. Create a new variable from the last item in the cumulative list
alast = round(a_cumulative[-1],2)
hlast = round(h_cumulative[-1],2)

## Set background colour
background = 'aliceblue'

## Create figure 
fig, ax = plt.subplots(figsize=(15, 6))
fig.set_facecolor(background)
ax.patch.set_facecolor(background)

# Set up our base layer
mpl.rcParams['xtick.color'] = 'midnightblue'
mpl.rcParams['ytick.color'] = 'midnightblue'

## Create xG Race Chart
plt.xticks([0, 15, 30, 45, 60, 75, 90])
plt.xlabel('Minute', fontfamily='DejaVu Sans', color='midnightblue', fontsize=16)
plt.ylabel('xG', fontfamily='DejaVu Sans', color='midnightblue', fontsize=16)

# Plot the step graphs
ax.step(x=a_min, y=a_cumulative, color='blue', label=ateam, linewidth=5, where='post')
ax.step(x=h_min, y=h_cumulative, color='red', label=hteam, linewidth=5, where='post')

## Set Gridlines 
ax.grid(linewidth=0.25, color='midnightblue', axis='y', zorder=1)

## Hide all four spines
for spine in ['top', 'bottom', 'left', 'right']:
    ax.spines[spine].set_visible(False)

## Set title
ax.set_title('xG Race Chart (Logistic Regression)',
             loc='center',
             color='midnightblue', 
             fontweight='bold',
             fontfamily='DejaVu Sans',
             fontsize=22,
            )

## Show Legend
#plt.legend()

## Save figure
if not os.path.exists(fig_shots_dir + '/metrica_shots_race_chart_xg_lr.png'):
    plt.savefig(fig_shots_dir + '/metrica_shots_race_chart_xg_lr.png', bbox_inches='tight', dpi=300)
else:
    pass

## Show figure
plt.tight_layout()
plt.show()

A comparison of the xG Race Charts shows that both the XGBoost and Logistic Regression Chance Quality Models predict that the Away team (blue) accumulates the greater amount of xG over the course of the 90 minutes. However, in the XGBoost model the gap is much narrower, further confirming that these basic models need to be tested with more data and analysed more thoroughly.

### <a id='#section14.3'>14.3. Comparison of the XGBoost and Logistic Regression Predictions</a>

In [None]:
# Rename xG column derived from XGBoost (to be compared with the previously calculated xG value from the Logistic Regression model)
df_metrica_xg = df_metrica_xg.rename(columns={'xG': 'xG_XGB'})

In [None]:
# Compare the two xG columns of the LR and XGB models

## Difference in xG between LR and XGB models
df_metrica_xg['diff_xG'] = df_metrica_xg['xG_LR'] - df_metrica_xg['xG_XGB']

## Percentage variance in xG between LR and XGB models
df_metrica_xg['percentage_variance_xG'] = (((df_metrica_xg['xG_XGB'] - df_metrica_xg['xG_LR']) / df_metrica_xg['xG_LR'])*100).round(2)

# Display DataFrame
df_metrica_xg

In [None]:
# Description of the selected columns
df_metrica_xg[['xG_LR', 'xG_XGB', 'diff_xG', 'percentage_variance_xG']].describe()

The `diff_xG` and `percentage_variance_xG` columns show the difference between the predictions made by the Logistic Regression and XGBoost models. 8 of the 24 shots have a variance greater than 50%. I won't go into too much detail about these variances at this stage (I might come back to this later on), but to summarise, these simple models show that much more testing is required before the predictions can be confirmed as robust.
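
As a worked illustration of the two comparison columns, the sketch below applies the same definitions to a handful of hypothetical (xG_LR, xG_XGB) pairs and counts how many differ by more than 50% in absolute terms. The values are made up for illustration and are not taken from the Metrica Sports shots, and `compare_xg` is a hypothetical helper name.

```python
def compare_xg(xg_lr, xg_xgb):
    """Return (difference, percentage variance) using the same definitions as the DataFrame columns."""
    diff = xg_lr - xg_xgb
    pct = round((xg_xgb - xg_lr) / xg_lr * 100, 2)
    return diff, pct

# Hypothetical (xG_LR, xG_XGB) pairs, for illustration only
shots = [(0.10, 0.16), (0.30, 0.27), (0.05, 0.02)]

# Shots whose absolute percentage variance exceeds 50%
large = [pair for pair in shots if abs(compare_xg(*pair)[1]) > 50]
print(f'{len(large)} of {len(shots)} shots have a variance greater than 50%')
```

Note that the percentage variance is relative to the Logistic Regression value, so low-xG shots can show large percentages from small absolute differences.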

### <a id='#section14.4'>14.4. Export the Final Dataset</a>

In [None]:
# Export the final dataset
df_metrica_xg.to_csv(os.path.join(data_dir_metrica, 'engineered', 'game_2_shots_with_xg_lr_xgb.csv'), index=None, header=True)

## <a id='#section15'>15. Summary</a>
To summarise, this notebook builds upon the initial Chance Quality Model created using Logistic Regression from shots data, this time using XGBoost. This notebook then compares the variances between the two models.

The steps to create this model can be summarised as the following:
1.     Set up the notebook environment in which to apply the XGBoost Machine Learning algorithm to a dataset of shots data.
2.     Re-explained the challenge to create a Chance Quality Model and defined the key proxy in which to determine this - <b>Expected Goals</b>.
3.     Imported the provided CSV data file as a pandas DataFrame and conducted a basic Exploratory Data Analysis.
4.     Introduction to XGBoost.
5.     Explanation of the difference between Bagging Vs. Boosting.
6.     Theory Behind the XGBoost Algorithm.
7.     Imported the engineered shots dataset created in the first Chance Quality Model notebook.
8.     Created an initial model using the data and features that were immediately available from the starting data, before any model optimisation, to determine a baseline figure for the model. This data was split into a training and a test set, which were kept separate during the entire modelling process and the test data was never incorporated into the training data.
9.     k-fold Cross Validation using XGBoost.
10.    Feature Importance with XGBoost.
11.    Visualisation of XGBoost Tree.
12.    Hyperparameter Optimisation.
13.    Final Optimised XGBoost Model - training of the final model using the parameters found via Grid Search.
14.    Performance comparison of XGBoost with Logistic Regression - final XGBoost model has a resulting Log Loss of 0.28600, where the final Logistic Regression had a Log Loss of 0.28924 - reduction of 0.00324.
15.    Assessment of the teams' performance in the Metrica Sports data to determine who deserved to win the game, based solely on the quality of chances that each team created - the <b>Away</b> (blue) team. The predicted xG values for the shots determined by the two Chance Quality Models were compared.

---

## <a id='#section16'>16. Next Steps</a>

Proposals and ideas for further work can be divided into two sections - Expected Goals Models and Tracking data. Current suggestions are as follows:

#### Expected Goals Models
*    The focus of my approach to this data challenge was not to build the absolute best performing ML model, with the best performance metrics and fanciest algorithm. My objective was to conduct an end-to-end process for building a model, including all the key stages such as feature engineering, univariate and multivariate analysis, and iterated performance assessment and improvement. For this reason, for the submission of the data task, a simple Chance Quality Model was built using Logistic Regression. However, Gradient Boosting algorithms lead to improved performance and, for this reason, this second notebook (of currently two) creates a Chance Quality Model using [XGBoost](https://xgboost.readthedocs.io/en/latest/). Potential further models that can be deployed to try and further improve the performance of the Chance Quality Model include other Gradient Boosting algorithms, such as [LightGBM](https://lightgbm.readthedocs.io/en/latest/) or [CatBoost](https://catboost.ai/). More detail about these modelling approaches can be found in the Data Science pack that I submitted as part of my initial application (see: https://docs.google.com/presentation/d/16stYbJoI8aYqtn_grJHSdTbMA1xwo-Iy9g9JJruMOBQ/edit?usp=sharing).

*    Application of a full Event data set, such as those from StatsBomb and Wyscout, to create an Expected Goals model with more features. Features that were not possible to include in this model but that could be added with Event and/or Tracking data include: strong/weak foot, flag for counter attack, flag for smart pass, whether another shot had been taken immediately before, and whether the shot was from a cross. This is discussed in more detail in the Feature Engineering section (section 9) of the first Chance Quality Model notebook [[link](https://nbviewer.jupyter.org/github/eddwebster/mcfc_submission/blob/main/notebooks/chance_quality_modelling/Creating%20a%20Chance%20Quality%20Model%20from%20Shots%20Data.ipynb)]. A comparison of the features used in the respected xG models by [Sam Green](https://twitter.com/aSamGreen) and [Michael Caley](https://twitter.com/MC_of_A) can be found in the index.

*    To really test the performance of this model, it would be great to compare its performance and xG predictions with those of other providers such as StatsBomb, and observe the level of variance in predictions between this basic model and a professional one created using a much larger dataset with many more features.

*    Creation of separate Expected Goals models for Direct Free Kicks and Corners. Currently, only Open-Play shots are considered and an xG value for penalties was taken from StatsBomb/FBref [[link](https://fbref.com/en/expected-goals-model-explained/)].

*    Add fake shots to the shots data – see David Sumpter’s tweet for the benefits of including fake data in an Expected Goals model [[link](https://twitter.com/Soccermatics/status/1260598182624575490)].

#### Tracking data
*    This current analysis only focuses on the shots taken by the teams. The next stage of this analysis would be to apply this Tracking data, not just to a Shots dataset, but to a full Events dataset, taking the basic concepts of analysis and feature extraction observed in this submission and then starting to apply more sophisticated modelling approaches such as Pitch Control or Expected Possession Value (EPV) models such as the VAEP model by SciSports and KULeuven, or the Expected Threat (xT) model by Karun Singh. This can be taken further by combining these two modelling approaches to analyse the value that certain actions of interest brought to the team during a particular play in the match and determine the Expected Value-Added. This was unfortunately not possible to do in this analysis as the Event data provided only included Shot data, but it is something I would like to take on in the future, using publicly available Event data from StatsBomb and Wyscout, with the sample Tracking data from Metrica Sports. More detail about these models can be found in the Data Science pack that I submitted in my initial application (see: https://docs.google.com/presentation/d/16stYbJoI8aYqtn_grJHSdTbMA1xwo-Iy9g9JJruMOBQ/edit?usp=sharing).

*    Further enrich the Event data through Tracking data, adding further detail and specificity, which again can be used to further improve the Expected Goals model. This was observed in this analysis with addition of the Intervening and Interfering teammates and opponents. Features that were not considered in this analysis, include aspects such as the goalkeeper and defender positions in the moment of the shot e.g.: how much of the goal was covered by the goalkeeper? are the defenders in position? These are attributes that can be derived from the Tracking data to gain additional insight previously not possible.

---

## <a id='#section17'>17. References and Further Reading</a>
Please see my [`football_analytics`](https://github.com/eddwebster/football_analytics) repository for my attempt to create as concise a list as possible of publicly available resources published by the football analytics community.

The following resources are those that were specifically used to inform and create my submission for the CFG Junior Data Scientist Data Challenge, specifically focusing on Expected Goals and Tracking data. I have also included links to other topics related to the role such as the application of Reinforcement Learning in football. Credits to all those cited below.

This list is also available in the project GitHub repo [[link](https://github.com/eddwebster/mcfc_submission)].

### Football Analytics

#### Tutorials
*    Friends of Tracking YouTube channel [[link](https://www.youtube.com/channel/UCUBFJYcag8j2rm_9HkrrA7w)] and Mathematical Modelling of Football course by Uppsala University [[link](https://uppsala.instructure.com/courses/28112)]. The GitHub repo with all code featured can be found at the following [[link](https://github.com/Friends-of-Tracking-Data-FoTD)]. Lectures of note include:
     +    Laurie Shaw's Metrica Sports Tracking data series for #FoT - [Introduction](https://www.youtube.com/watch?v=8TrleFklEsE), [Measuring Physical Performance](https://www.youtube.com/watch?v=VX3T-4lB2o0), [Pitch Control modelling](https://www.youtube.com/watch?v=5X1cSehLg6s), and [Valuing Actions](https://www.youtube.com/watch?v=KXSLKwADXKI). See the following for code [[link](https://github.com/Friends-of-Tracking-Data-FoTD/LaurieOnTracking)];
     +    David Sumpter's Expected Goals webinars for #FoT - [How to Build An Expected Goals Model 1: Data and Model](https://www.youtube.com/watch?v=bpjLyFyLlXs), [How to Build An Expected Goals Model 2: Statistical fitting](https://www.youtube.com/watch?v=wHOgINJ5g54), and [The Ultimate Guide to Expected Goals](https://www.youtube.com/watch?v=310_eW0hUqQ). See the following for code [3xGModel](https://github.com/Friends-of-Tracking-Data-FoTD/SoccermaticsForPython/blob/master/3xGModel.py), [4LinearRegression](https://github.com/Friends-of-Tracking-Data-FoTD/SoccermaticsForPython/blob/master/4LinearRegression.py), [5xGModelFit.py](https://github.com/Friends-of-Tracking-Data-FoTD/SoccermaticsForPython/blob/master/5xGModelFit.py), and [6MeasuresOfFit](https://github.com/Friends-of-Tracking-Data-FoTD/SoccermaticsForPython/blob/master/6MeasuresOfFit.py);
     +    Peter McKeever's ['Good practice in data visualisation'](https://www.youtube.com/watch?v=md0pdsWtq_o) webinar for Friends of Tracking. See the following for code [[link](https://github.com/petermckeeverPerform/friends-of-tracking-viz-lecture)];
*    [Soccer Analytics Handbook](https://github.com/devinpleuler/analytics-handbook) by [Devin Pleuler](https://twitter.com/devinpleuler). See tutorial notebooks (also available in Google Colab) that notably include: [3. Logistic Regression](https://github.com/devinpleuler/analytics-handbook/blob/master/notebooks/logistic_regression.ipynb), and [7. Data Visualization](https://github.com/devinpleuler/analytics-handbook/blob/master/notebooks/data_visualization.ipynb);
*    [FC Python](https://twitter.com/fc_python) tutorials [[link](https://fcpython.com/)];
*    DataViz, Python, and matplotlib tutorials by Peter McKeever [[link](http://petermckeever.com/)] - I think his website is currently in redevelopment, with many of the old tutorials not currently available (28/02/2021). Check out his revamped [How to Draw a Football Pitch](http://petermckeever.com/2020/10/how-to-draw-a-football-pitch/) tutorial;
*    [McKay Johns YouTube channel](https://www.youtube.com/channel/UCmqincDKps3syxvD4hbODSg);
*    [Tech how-to: build your own Expected Goals model](https://www.scisports.com/tech-how-to-build-your-own-expected-goals-model/) by [Jan Van Haaren](https://twitter.com/JanVanHaaren) and [SciSports](https://twitter.com/SciSportsNL).
*    [Fitting your own football xG model](https://www.datofutbol.cl/xg-model/) by [Dato Fútbol](https://twitter.com/DatoFutbol_cl) (Ismael Gómez Schmidt). See GitHub repo [[link](https://github.com/Dato-Futbol/xg-model)];
*    [Python for Fantasy Football series](http://www.fantasyfutopia.com/python-for-fantasy-football-introduction/) by [Fantasy Futopia](https://twitter.com/FantasyFutopia) ([Thomas Whelan](https://twitter.com/tom_whelan)).  See the following posts:
     +    [Introduction to Machine Learning](http://www.fantasyfutopia.com/python-for-fantasy-football-introduction-to-machine-learning/)
     +    [Addressing Class Imbalance in Machine Learning](http://www.fantasyfutopia.com/python-for-fantasy-football-addressing-class-imbalance-in-machine-learning/)
     +    [Addressing Class Imbalance Part 2](http://www.fantasyfutopia.com/python-for-fantasy-football-addressing-class-imbalance-part-2/)
     +    [Understanding Random Forests](http://www.fantasyfutopia.com/python-for-fantasy-football-understanding-random-forests/)
     +    [Feature Engineering for Machine Learning](http://www.fantasyfutopia.com/python-for-fantasy-football-feature-engineering-for-machine-learning/)
*    [Building an Expected Goals Model in Python](https://web.archive.org/web/20200301071559/http://petermckeever.com/2019/01/building-an-expected-goals-model-in-python/) by [Peter McKeever](https://twitter.com/petermckeever) (using WayBackMachine);
*    [An xG Model for Everyone in 20 minutes (ish)](https://differentgame.wordpress.com/2017/04/29/an-xg-model-for-everyone-in-20-minutes-ish/ ) by [Football Fact Man](https://twitter.com/footballfactman) (Paul Riley).
*    [How to Draw a Football Pitch](http://petermckeever.com/2020/10/how-to-draw-a-football-pitch/) by Peter McKeever
*    [How To Create xG Flow Charts in Python](https://www.youtube.com/watch?v=bvoOOYMQkac) by [McKay Johns](https://twitter.com/mckayjohns). For code, see [[link](https://github.com/mckayjohns/Viz-Templates)]

#### Libraries and GitHub Repos
*    [`Friends-of-Tracking-Data-FoTD`](https://github.com/Friends-of-Tracking-Data-FoTD);
*    [`SoccermaticsForPython`](https://github.com/Friends-of-Tracking-Data-FoTD/SoccermaticsForPython) - repo by David Sumpter dedicated for people getting started with Python using the concepts derived from the book Soccermatics;
*    [`LaurieOnTracking`](https://github.com/Friends-of-Tracking-Data-FoTD/LaurieOnTracking) by [Laurie Shaw](https://twitter.com/EightyFivePoint) - Python code for working with Metrica tracking data; and
*    [`Expected Goals Thesis`](https://github.com/andrewRowlinson/expected-goals-thesis) by [Andrew Rowlinson](https://twitter.com/numberstorm). See both his thesis [[link](https://github.com/andrewRowlinson/expected-goals-thesis/blob/master/FOOTBALL%20SHOT%20QUALITY%20-%20Visualizing%20the%20Quality%20of%20Football%20Soccer%20Goals.pdf)] and the following notebooks:
     +    [Explore Data Quality Overlap](https://github.com/andrewRowlinson/expected-goals-thesis/blob/master/notebooks/00-explore-data-quality-overlap.ipynb);
     +    [Expected Goals Model](https://github.com/andrewRowlinson/expected-goals-thesis/blob/master/notebooks/01-expected-goals-model.ipynb);
     +    [Expected Goals Calculate xG and Shap](https://github.com/andrewRowlinson/expected-goals-thesis/blob/master/notebooks/02-expected-goals-calculate-xg-and-shap.ipynb);
     +    [Visualise Models](https://github.com/andrewRowlinson/expected-goals-thesis/blob/master/notebooks/03-visualize-models.ipynb);
     +    [Kernel Density Probability Scoring](https://github.com/andrewRowlinson/expected-goals-thesis/blob/master/notebooks/04-kernel-density-probability-scoring.ipynb);
     +    [Simulate Match Results from xG](https://github.com/andrewRowlinson/expected-goals-thesis/blob/master/notebooks/05-simulate-match-results-from-xg.ipynb);
     +    [Freeze Frame Examples](https://github.com/andrewRowlinson/expected-goals-thesis/blob/master/notebooks/06-freeze_frame-example.ipynb);
     +    [Red Zone Heatmap](https://github.com/andrewRowlinson/expected-goals-thesis/blob/master/notebooks/07-red-zone-heatmap.ipynb);
     +    [Shots Follow Poisson Distribution](https://github.com/andrewRowlinson/expected-goals-thesis/blob/master/notebooks/08-shots_follow_poisson_distribution.ipynb); and
     +    [Angle Features](https://github.com/andrewRowlinson/expected-goals-thesis/blob/master/notebooks/09_figure3_angle_features.ipynb).
*    [`expected_goals_deep_dive`](https://github.com/andrewsimplebet/expected_goals_deep_dive) by [Andrew Puopolo](https://twitter.com/andrew_puopolo). See the following notebooks:
     +    [Setting Our Data Up](https://github.com/andrewsimplebet/expected_goals_deep_dive/blob/master/0.%20Setting%20Our%20Data%20Up.ipynb)
     +    [Random Forest Cross Validation And Hyperparameter Tuning](https://github.com/andrewsimplebet/expected_goals_deep_dive/blob/master/1.%20Random%20Forest%20Cross%20Validation%20And%20Hyperparameter%20Tuning.ipynb)
     +    [Comparing Logistic Regression and Random Forest For Expected Goals](https://github.com/andrewsimplebet/expected_goals_deep_dive/blob/master/2.%20Basic%20Logistic%20Regression%20and%20Comparison%20To%20Random%20Forests.ipynb)
     +    [Calibrating Expected Goals Models](https://github.com/andrewsimplebet/expected_goals_deep_dive/blob/master/3.%20Calibrating%20Expected%20Goals%20Models.ipynb)
     +    [Sanity Checking Our Expected Goals Model and Final Thoughts](https://github.com/andrewsimplebet/expected_goals_deep_dive/blob/master/4.%20Sanity%20Checking%20Our%20Expected%20Goals%20Models%20And%20Final%20Thoughts.ipynb)
*    [`soccer_analytics`](https://github.com/CleKraus/soccer_analytics) by [Kraus Clemens](https://twitter.com/CleKraus). See the following notebooks:
     +    [Expected goal model with logistic regression](https://github.com/CleKraus/soccer_analytics/blob/master/notebooks/expected_goal_model_lr.ipynb)
     +    [Challenges using gradient boosters](https://github.com/CleKraus/soccer_analytics/blob/master/notebooks/challenges_with_gradient_boosters.ipynb)
*    [`xg-model`](https://github.com/Dato-Futbol/xg-model) by [Dato Fútbol](https://twitter.com/DatoFutbol_cl) (Ismael Gómez Schmidt)
*    [`soccer-xg`](https://pypi.org/project/soccer-xg/) by [Jesse Davis](https://twitter.com/jessejdavis1) and [Pieter Robberechts](https://twitter.com/p_robberechts) - a Python package for training and analyzing expected goals (xG) models in soccer (not used in this assignment but referenced here); and
*    [`Google Research Football`](https://github.com/google-research/football). See the Kaggle Competition alongside Manchester City [[link](https://www.kaggle.com/c/google-football)] (ended October 2020).

#### Written Pieces
For a full list of Expected Goals literature, see the following [[link](https://docs.google.com/document/d/1OY0dxqXIBgncj0UDgb97zOtczC-b6JUknPFWgD77ng4/edit)].

##### Papers
The following Shiny App from Lars Maurath is a great tool for looking up publications [[link](https://larsmaurath.shinyapps.io/soccer-analytics-library/)].
*    [Routine Inspection: A Playbook for Corner Kicks](https://www.springerprofessional.de/en/routine-inspection-a-playbook-for-corner-kicks/18671052) (2020) by [Laurie Shaw](https://twitter.com/EightyFivePoint) and Sudarshan 'Suds' Gopaladesikan. Accompanying talk - [2020 Harvard Sports Analytics Lab](https://www.youtube.com/watch?v=yfPC1O_g-I8);
*    [Dynamic Analysis of Team Strategy in Professional Football](https://static.capabiliaserver.com/frontend/clients/barca/wp_prod/wp-content/uploads/2020/01/56ce723e-barca-conference-paper-laurie-shaw.pdf) (2019) by [Laurie Shaw](https://twitter.com/EightyFivePoint) and [Mark Glickman](https://twitter.com/glicko). Accompanying talks - [NESSIS 2019](https://www.youtube.com/watch?v=VU4BOu6VfbU), [2020 Google Sports Analytics Meetup](https://www.youtube.com/watch?v=aQ9L6IkWI8U);
*    [Football Shot Quality: Visualising the Quality of Soccer/Football Shots](https://github.com/andrewRowlinson/expected-goals-thesis/blob/master/FOOTBALL%20SHOT%20QUALITY%20-%20Visualizing%20the%20Quality%20of%20Football%20Soccer%20Goals.pdf) by [Andrew Rowlinson](https://twitter.com/numberstorm). See his GitHub repo for code [[link](https://github.com/andrewRowlinson/expected-goals-thesis)];
*    [Game Plan: What AI can do for Football, and What Football can do for AI](https://arxiv.org/pdf/2011.09192.pdf) (2020) by Karl Tuyls, Shayegan Omidshafiei, Paul Muller, Zhe Wang, Jerome Connor, Daniel Hennes, Ian Graham, Will Spearman, Tim Waskett, Dafydd Steele, Pauline Luc, Adria Recasens, Alexandre Galashov, Gregory Thornton, Romuald Elie, Pablo Sprechmann, Pol Moreno, Kris Cao, Marta Garnelo, Praneet Dutta, Michal Valko, Nicolas Heess, Alex Bridgland, Julien Perolat, Bart De Vylder, Ali Eslami, Mark Rowland, Andrew Jaegle, Remi Munos, Trevor Back, Razia Ahamed, Simon Bouton, Nathalie Beauguerlange, Jackson Broshear, Thore Graepel, and Demis Hassabis;
*    [Google Research Football: A Novel Reinforcement Learning Environment](https://arxiv.org/pdf/1907.11180.pdf) (2020) by Karol Kurach, Anton Raichuk, Piotr Stańczyk, Michał Zając, Olivier Bachem, Lasse Espeholt, Carlos Riquelme, Damien Vincent, Marcin Michalski, Olivier Bousquet, Sylvain Gelly. See the GitHub repo [[link](https://github.com/google-research/football)];
*    [A Framework for the Fine-Grained Evaluation of the Instantaneous Expected Value of Soccer Possessions](https://arxiv.org/abs/2011.09426) (2020) by Javier Fernández, Luke Bornn and Daniel Cervone;
*    [Decomposing the Immeasurable Sport: A deep learning expected possession value framework for soccer](https://www.semanticscholar.org/paper/Decomposing-the-Immeasurable-Sport%3A-A-deep-learning-Fern%C3%A1ndez/fc78b144a531a8ffdf3216a677f3a65e70dad3c7) (2019) by [Javier Fernández](https://twitter.com/JaviOnData), [Luke Bornn](https://twitter.com/LukeBornn), and [Dan Cervone](https://twitter.com/dcervone0). Accompanying talks - [SSAC19](https://www.youtube.com/watch?v=JIa7Td3YXxI), [StatsBomb conference](https://www.youtube.com/watch?v=nfPEEbKJbpM);
*    [Ready Player Run: Off-ball run identification and classification](https://static.capabiliaserver.com/frontend/clients/barca/wp_prod/wp-content/uploads/2020/01/40ba07f4-ready-player-run-barcelona.pdf) (2020) by [Sam Gregory](https://twitter.com/GregorydSam);
*    [Beyond Expected Goals](https://www.researchgate.net/profile/William_Spearman/publication/327139841_Beyond_Expected_Goals/links/5b7c3023a6fdcc5f8b5932f7/Beyond-Expected-Goals.pdf) (2018) by [Will Spearman](https://twitter.com/the_spearman);
*    [Wide Open Spaces: A statistical technique for measuring space creation in professional soccer](https://www.researchgate.net/publication/324942294_Wide_Open_Spaces_A_statistical_technique_for_measuring_space_creation_in_professional_soccer) (2018) by [Javier Fernandez](https://twitter.com/JaviOnData) and [Luke Bornn](https://twitter.com/LukeBornn);
*    [“The Leicester City Fairytale?”: Utilizing New Soccer Analytics Tools to Compare Performance in the 15/16 & 16/17 EPL Seasons](https://userpages.umbc.edu/~nroy/courses/fall2018/cmisr/papers/soccer_analytics.pdf) (2017) by Hector Ruiz, Paul Power, Xinyu Wei, and Patrick Lucey;
*    [Not all passes are created equal: objectively measuring the risk and reward of passes in soccer from tracking data](http://library.usc.edu.ph/ACM/KKD%202017/pdfs/p1605.pdf) (2017) by Paul Power, Hector Ruiz, Xinyu Wei, and Patrick Lucey. See Paul Power's talk [[link](https://dl.acm.org/action/downloadSupplement?doi=10.1145%2F3097983.3098051&file=power_tracking_data.mp4&download=true)] (downloadable MP4), and the webpage [[link](https://dl.acm.org/doi/10.1145/3097983.3098051)];
*    [“Quality vs Quantity”: Improved Shot Prediction in Soccer using Strategic Features from Spatiotemporal Data](https://s3-us-west-1.amazonaws.com/disneyresearch/wp-content/uploads/20150308192147/Quality-vs-Quantity%E2%80%9D-Improved-Shot-Prediction-in-Soccer-using-Strategic-Features-from-Spatiotemporal-Data-Paper.pdf) (2015) by Patrick Lucey, Alina Bialkowski, Mathew Monfort, Peter Carr, and Iain Matthews; and
*    [A Framework for Tactical Analysis and Individual Offensive Production Assessment in Soccer Using Markov Chains](http://nessis.org/nessis11/rudd.pdf) (2011) by [Sarah Rudd](https://twitter.com/srudd_ok). Accompanying NESSIS talk on Metacafe [[link](https://www.metacafe.com/watch/7337475/2011_nessis_talk_by_sarah_rudd/)].

##### Blogs
*    [Sam Green](https://twitter.com/aSamGreen)'s [xG model](https://www.optasportspro.com/news-analysis/assessing-the-performance-of-premier-league-goalscorers/);
*    [Michael Caley](https://twitter.com/MC_of_A)'s [xG model](https://cartilagefreecaptain.sbnation.com/2014/9/11/6131661/premier-league-projections-2014#methoderology);
*    [Using Data to Analyse Team Formations](https://eightyfivepoints.blogspot.com/2019/11/using-data-to-analyse-team-formations.html) by [Laurie Shaw](https://twitter.com/EightyFivePoint);
*    [Structure in football: putting formations into context](https://eightyfivepoints.blogspot.com/2020/12/structure-in-football-putting.html) by [Laurie Shaw](https://twitter.com/EightyFivePoint);
*    [xG explained](https://fbref.com/en/expected-goals-model-explained/) by [FBref](https://twitter.com/fbref);
*    [What are expected Goals?](https://www.americansocceranalysis.com/explanation) by [American Soccer Analysis](https://twitter.com/AnalysisEvolved);
*    [David Sumpter](https://twitter.com/Soccermatics)'s Expected Goals pieces:
     +    [Should you write about real goals or expected goals? A guide for journalists](https://soccermatics.medium.com/should-you-write-about-real-goals-or-expected-goals-a-guide-for-journalists-2cf0c7ec6bb6);
     +    [Football’s magical equation?](https://soccermatics.medium.com/footballs-magical-equation-bfe212ce7d4a)
     +    [The Geometry of Shooting](https://soccermatics.medium.com/the-geometry-of-shooting-ae7a67fdf760).
*    [Michael Caley](https://twitter.com/MC_of_A)'s Expected Goals pieces:
     +    [Shot Matrix I: Shot Location and Expected Goals](https://cartilagefreecaptain.sbnation.com/2013/11/13/5098186/shot-matrix-i-shot-location-and-expected-goals)
     +    [Let's talk about expected goals](https://cartilagefreecaptain.sbnation.com/2015/4/10/8381071/football-statistics-expected-goals-michael-caley-deadspin)
*    [Jesse Davis](https://twitter.com/jessejdavis1) and [Pieter Robberechts](https://twitter.com/p_robberechts)' Expected Goals pieces for KU Leuven:
     +    [How Data Availability Affects the Ability to learn Good xG Models](https://dtai.cs.kuleuven.be/sports/blog/how-data-availability-affects-the-ability-to-learn-good-xg-models)
     +    [Illustrating the Interplay between Features and Models in xG](https://dtai.cs.kuleuven.be/sports/blog/illustrating-the-interplay-between-features-and-models-in-xg)
     +    [How Data Quality Affects xG](https://dtai.cs.kuleuven.be/sports/blog/how-data-quality-affects-xg)
*    [Will Gürpinar-Morgan](https://twitter.com/WillTGM)'s Expected Goals pieces:
     +    [Unexpected goals](https://2plus2equals11.com/2015/12/31/unexpected-goals/) on [2+2=11](https://2plus2equals11.com/);
     +    [Great Expectations](https://2plus2equals11.com/2015/05/31/great-expectations/) on [2+2=11](https://2plus2equals11.com/);
     +    [On single match expected goal totals](https://2plus2equals11.com/2015/12/16/on-single-match-expected-goal-totals/) on [2+2=11](https://2plus2equals11.com/);
     +    [How StatsBomb Data Helps Measure Counter-Pressing](https://statsbomb.com/2018/05/how-statsbomb-data-helps-measure-counter-pressing/) for StatsBomb
*    [Martin Eastwood](https://twitter.com/penaltyblog) (Pena.lt/y)'s Expected Goals pieces [[link](https://pena.lt/y/category/expected-goals.html)]:
     +    [Expected Goals For All.](https://pena.lt/y/2014/02/12/expected-goals-for-all)
     +    [Actual Goals Versus Expected Goals](https://pena.lt/y/2014/02/15/actual-goals-versus-expected-goals);
     +    [Expected Goals Updated](https://pena.lt/y/2014/03/01/expected-goals-updated);
     +    [Expected Goals: The Y Axis](https://pena.lt/y/2014/04/16/expected-goals-the-y-xis);
     +    [Expected Goals And Exponential Decay](https://pena.lt/y/2014/04/22/expected-goals-and-exponential-decay);
     +    [Expected Goals: Foot Shots Versus Headers](https://pena.lt/y/2014/08/28/expected-goals-foot-shots-versus-headers);
     +    [Expected Goals And Support Vector Machines](https://pena.lt/y/2015/07/13/expected-goals-svm);
     +    [Expected Goals and Uncertainty](https://pena.lt/y/2016/04/29/expected-goals-and-uncertainty); and
     +    [Sharing xG Using Multi-touch Attribution Modelling](https://pena.lt/y/2019/11/23/multitouch-attributed-xg).
*    [Garry Gelade](https://twitter.com/GarryGelade)'s Expected Goals pieces:
     +    [Expected Goals and Unexpected Goals](https://web.archive.org/web/20200724125157/http://business-analytic.co.uk/blog/expected-goals-and-unexpected-goals/) (using WayBackMachine);
     +    [Assessing Expected Goals Models. Part 1: Shots](https://web.archive.org/web/20200724125157/http://business-analytic.co.uk/blog/evaluating-expected-goals-models/) (using WayBackMachine);
     +    [Assessing Expected Goals Models. Part 2: Anatomy of a Big Chance](https://web.archive.org/web/20200724125157/http://business-analytic.co.uk/blog/assessing-expected-goals-models-part-2-anatomy-of-a-big-chance/) (using WayBackMachine);
*    [Introducing xGChain and xGBuildup](https://statsbomb.com/2018/08/introducing-xgchain-and-xgbuildup/) by [Thom Lawrence](https://twitter.com/lemonwatcher);
*    [Quantifying finishing skill](https://statsbomb.com/2017/07/quantifying-finishing-skill/) by [Marek Kwiatkowski](https://twitter.com/statlurker);
*    [The Dual Life of Expected Goals (Part 1)](https://statsbomb.com/2018/05/the-dual-life-of-expected-goals-part-1/) by [Mike L. Goodman](https://twitter.com/TheM_L_G);
*    [A close look at my new Expected Goals Model](https://web.archive.org/web/20200320193539/http://11tegen11.net/2015/08/14/a-close-look-at-my-new-expected-goals-model/) by [11tegen11](https://twitter.com/11tegen11) ([Sander IJtsma](https://twitter.com/IJtsma)) (using WayBackMachine);
*    [An analysis of different expected goals models](https://www.pinnacle.com/en/betting-articles/Soccer/expected-goals-model-analysis/MEP2N9VMG5CTW99D) by [Benjamin Cronin](https://twitter.com/PinnacleBen);
*    [Expected Goals 3.0 Methodology](https://www.americansocceranalysis.com/home/2015/4/14/expected-goals-methodology) by [Matthias Kullowatz](https://twitter.com/mattyanselmo);
*    [Explaining and Training Shot Quality](https://statsbomb.com/2016/04/explaining-and-training-shot-quality/) by [Ted Knutson](https://twitter.com/mixedknuts);
*    [A simple Expected Goals model](https://cricketsavant.wordpress.com/2017/01/21/a-simple-expected-goals-model/) by Cricket Savant;
*    [How we calculate Expected Goals (xG)](https://www.fantasyfootballfix.com/blog-index/how-we-calculate-expected-goals-xg/) by Fantasy Football Fix; and
*    [Una mirada al Soccer Analytics usando R — Parte III](https://medium.com/datos-y-ciencia/una-mirada-al-soccer-analytics-usando-r-parte-iii-3bdff9cd3752) by [Dato Fútbol](https://twitter.com/DatoFutbol_cl) (Ismael Gómez Schmidt).

##### News Articles
*    [Liverpool sign up for StatsBomb 360: Ted Knutson explains why this stats revolution will change the game](https://www.skysports.com/football/news/11669/12248621/liverpool-sign-up-for-statsbomb-360-ted-knutson-explains-why-this-stats-revolution-will-change-the-game) (18/03/2021) by Adam Bate for Sky Sports News;
*    [Man City’s Big Winter Signing Is a Former Hedge Fund Brain](https://www.bloombergquint.com/markets/man-city-s-big-winter-signing-is-a-former-hedge-fund-brain) (31/01/2021) by David Dellier and Adam Blenford for Bloomberg;
*    [Man City land big signing in quest to be the best in data science](https://trainingground.guru/articles/man-city-land-big-signing-in-quest-to-be-the-best-in-data-science) (17/01/2021) by [Simon Austin](https://twitter.com/sport_simon) for [Training Ground Guru](https://trainingground.guru/);
*    [Man City launch AI football competition with Google](https://trainingground.guru/articles/man-city-launch-ai-football-competition-with-google) (13/10/2020) by [Simon Austin](https://twitter.com/sport_simon) for [Training Ground Guru](https://trainingground.guru/);
*    [Manchester City hire Huddersfield Town recruitment co-ordinator](https://trainingground.guru/articles/manchester-city-hire-huddersfield-recruitment-co-ordinator) (05/08/2020) by [Simon Austin](https://twitter.com/sport_simon) for [Training Ground Guru](https://trainingground.guru/);
*    [Manchester City appoint Sisman to new role of Performance Physicist](https://trainingground.guru/articles/manchester-city-appoint-sisman-to-new-role-of-performance-physicist) (20/07/2020) by [Training Ground Guru](https://trainingground.guru/);
*    [Prestidge promoted to top data science job at Man City](https://trainingground.guru/articles/prestidge-promoted-to-top-data-science-job-at-man-city) (02/01/2020) by [Simon Austin](https://twitter.com/sport_simon) for [Training Ground Guru](https://trainingground.guru/);
*    [Man City Head of Data Insights leaves after six years](https://trainingground.guru/articles/man-city-head-of-data-insights-leaves-after-six-years) (10/11/2019) by [Simon Austin](https://twitter.com/sport_simon) for [Training Ground Guru](https://trainingground.guru/); and
*    [Manchester City create new first team data science role](https://trainingground.guru/articles/manchester-city-create-new-first-team-data-science-role) (20/06/2019) by [Simon Austin](https://twitter.com/sport_simon) for [Training Ground Guru](https://trainingground.guru/).

##### Books
*    [The Numbers Game](https://www.amazon.co.uk/Numbers-Game-Everything-About-Football/) by [Chris Anderson](https://twitter.com/soccerquant) and [David Sally](https://twitter.com/DavidSally6);
*    [Football Hackers](https://www.amazon.co.uk/Football-Hackers-Science-Data-Revolution/) by [Christoph Biermann](https://twitter.com/chbiermann);
*    [Soccermatics](https://www.amazon.co.uk/Soccermatics-Mathematical-Adventures-Pro-Bloomsbury/dp/1472924142/ref=tmm_pap_swatch_0?_encoding=UTF8&qid=&sr=) by [David Sumpter](https://twitter.com/Soccermatics);
*    [Soccernomics](https://www.amazon.co.uk/Soccernomics-England-Germany-France-Finally/) by Simon Kuper and [Stefan Szymanski](https://twitter.com/sszy);
*    [Money and Football: A Soccernomics Guide ](https://www.amazon.co.uk/dp/B06XCKCVQR/) by Simon Kuper and [Stefan Szymanski](https://twitter.com/sszy); and
*    [Data Analytics in Football](https://www.amazon.co.uk/Data-Analytics-Football-Daniel-Memmert/) by [Daniel Memmert](https://twitter.com/DMemmert) and Dominik Raabe.

#### Videos
For a YouTube playlist of videos collated around the topic of Expected Goals, see [[link](https://www.youtube.com/playlist?list=PL38nJNjpNpH_VPRZJrkaPZOJfyuIaZHUY)]. For a playlist specific to tracking data in football, see [[link](https://www.youtube.com/playlist?list=PL38nJNjpNpH-UX0YVNu7oN5gAWQc2hq8F)].

##### Webinars and Lectures
*    Laurie Shaw's Metrica Sports Tracking data series for [Friends of Tracking](https://www.youtube.com/channel/UCUBFJYcag8j2rm_9HkrrA7w) (see the following for code [[link](https://github.com/Friends-of-Tracking-Data-FoTD/LaurieOnTracking)]):
     +    [Introduction](https://www.youtube.com/watch?v=8TrleFklEsE);
     +    [Measuring Physical Performance](https://www.youtube.com/watch?v=VX3T-4lB2o0);
     +    [Pitch Control modelling](https://www.youtube.com/watch?v=5X1cSehLg6s); and
     +    [Valuing Actions](https://www.youtube.com/watch?v=KXSLKwADXKI).
*    [Demystifying Tracking data Sportlogiq webinar](https://www.youtube.com/watch?v=miEWHSTYvX4) by Sam Gregory and Devin Pleuler;
*    [Will Spearman's masterclass in Pitch Control](https://www.youtube.com/watch?v=X9PrwPyolyU&list=PL38nJNjpNpH-l59NupDBW7oG7CmWBgp7Y) for Friends of Tracking;
*    [How Tracking Data is Used in Football and What are the Future Challenges](https://www.youtube.com/watch?v=kHTq9cwdkGA) with Javier Fernández, Sudarshan 'Suds' Gopaladesikan, Laurie Shaw, Will Spearman and David Sumpter for Friends of Tracking;
*    [Introduction to tracking data in football](https://www.youtube.com/watch?v=fYqEnoOV9Po) by David Sumpter for Friends of Tracking;
*    [Learning to Watch Football: Self-Supervised Representations for Tracking Data](https://vimeo.com/398489039/80d8dcfb58) by Karun Singh. See accompanying blog post [[link](https://karun.in/blog/ssr-tracking-data.html)];
*    David Sumpter's Expected Goals webinars for [Friends of Tracking](https://www.youtube.com/channel/UCUBFJYcag8j2rm_9HkrrA7w) (see the following for code [3xGModel](https://github.com/Friends-of-Tracking-Data-FoTD/SoccermaticsForPython/blob/master/3xGModel.py), [4LinearRegression](https://github.com/Friends-of-Tracking-Data-FoTD/SoccermaticsForPython/blob/master/4LinearRegression.py), [5xGModelFit.py](https://github.com/Friends-of-Tracking-Data-FoTD/SoccermaticsForPython/blob/master/5xGModelFit.py), and [6MeasuresOfFit](https://github.com/Friends-of-Tracking-Data-FoTD/SoccermaticsForPython/blob/master/6MeasuresOfFit.py)):
     +    [How to Build An Expected Goals Model 1: Data and Model](https://www.youtube.com/watch?v=bpjLyFyLlXs);
     +    [How to Build An Expected Goals Model 2: Statistical fitting](https://www.youtube.com/watch?v=wHOgINJ5g54); and
     +    [The Ultimate Guide to Expected Goals](https://www.youtube.com/watch?v=310_eW0hUqQ).
*    ['Good practice in data visualisation'](https://www.youtube.com/watch?v=md0pdsWtq_o) webinar by Peter McKeever for Friends Of Tracking. See the following for code [[link](https://github.com/petermckeeverPerform/friends-of-tracking-viz-lecture)];
*    [The Ultimate Guide to Expected Goals](https://www.youtube.com/watch?v=310_eW0hUqQ) by David Sumpter for Friends of Tracking;
*    [How to explain Expected Goals to a football player](https://www.youtube.com/watch?v=Xc6IG9-Dt18) by David Sumpter;
*    [What is xG?](https://www.youtube.com/watch?v=zSaeaFcm1SY) by [Tifo Football](https://www.youtube.com/channel/UCGYYNGmyhZ_kwBF_lqqXdAQ);
*    [Opta Expected Goals](https://www.youtube.com/watch?v=w7zPZsLGK18) by [The Analyst](https://www.youtube.com/user/optasports) (formerly Opta);
*    [What are Expected Goals?](https://www.youtube.com/watch?v=Xc6IG9-Dt18) by [David Sumpter](https://twitter.com/Soccermatics) and Axel Pershagen;
*    [Anatomy of a Goal](https://www.youtube.com/watch?v=YJuHC7xXsGA) by [Numberphile](https://twitter.com/numberphile) ([Brady Haran](https://twitter.com/BradyHaran));
*    [Sam Green OptaPro Interview](https://www.youtube.com/watch?v=gHIY-MgDh_o);
*    [How Did These Goals Go In? - We Explain How Goal Probability Works](https://www.youtube.com/watch?v=_vGhocyvKhA) by the Bundesliga;
*    [Soccer Analytics: Expected Goals](https://www.youtube.com/watch?v=3rsDCxszCD0) by [Dan Altman](https://twitter.com/NYAsports);
*    [Anatomy of an Expected Goal](https://www.youtube.com/watch?v=mgHIx0LSrqM) by [11tegen11](https://twitter.com/11tegen11) ([Sander IJtsma](https://twitter.com/IJtsma));
*    [Anatomy of a Goal (with Sam Green)](https://www.youtube.com/watch?v=YJuHC7xXsGA) by Numberphile;
*    ["Is Our Model Learning What We Think It Is?" Estimating the xG Impact of Actions in Football](https://www.youtube.com/watch?v=i7Ra4Qv4_m4) by [Tom Decroos](https://twitter.com/TomDecroos) from the 2019 StatsBomb Innovation in Football Conference;
*    [Statsbomb Data Launch - Beyond Naive xG](https://www.youtube.com/watch?v=_AYY9XlWEB0) by [Ted Knutson](https://twitter.com/mixedknuts);
*    [Karol Kurach - Google Research Football](https://www.youtube.com/watch?v=Va5dIxejqx0);
*    [Karol Kurach (Google Brain) "Google Research Football: Learning to Play Football with Deep RL"](https://www.youtube.com/watch?v=lsN5y2frNig);
*    [Google Research Football](https://www.youtube.com/watch?v=esQvSg2qeS0) by Piotr Stanczyk;
*    [Google's AI Plays Football…For Science!](https://www.youtube.com/watch?v=Uk9p4Kk98_g) by Two Minute Papers;
*    [Changing the soccer transfer market with big data](https://www.youtube.com/watch?v=UMeDP-lIBD8) by [Giels Brouwer](https://twitter.com/gielsbrouwer);
*    [Soccermatics: how maths explains football](https://www.youtube.com/watch?v=Nv7JYtVbzvI) by [David Sumpter](https://twitter.com/Soccermatics);
*    [Data Robot Opening Remarks & Keynote: Making Better Decisions, Faster](https://www.datarobot.com/recordings/ai-experience-emea-on-demand/ai-experience-opening-remarks-keynote/watch/090a6990db580257e9e6046fc48ab035/) with [Brian Prestidge](https://twitter.com/brianprestidge).

##### Miscellaneous
*    [Jeff Stelling xG rant](https://facebook.com/SoccerAM/videos/1740454985978128/); and
*    [Craig Burley xG rant](https://www.youtube.com/watch?v=JBWKGij9Y5A).

##### YouTube Channels
*    [Friends of Tracking](https://www.youtube.com/channel/UCUBFJYcag8j2rm_9HkrrA7w) with [David Sumpter](https://twitter.com/Soccermatics), [Javier Fernández](https://twitter.com/JaviOnData), [Laurie Shaw](https://twitter.com/EightyFivePoint), [Sudarshan 'Suds' Gopaladesikan](https://twitter.com/suds_g), [Pascal Bauer](https://twitter.com/pascal_bauer), and [Fran Peralta](https://twitter.com/PeraltaFran23);
*    [McKay Johns](https://www.youtube.com/channel/UCmqincDKps3syxvD4hbODSg);
*    [Mark Glickman](https://www.youtube.com/channel/UC-gtC2WYRAr_4eYRIUb4ovg) – for NESSIS talks, uploaded to his personal channel. Old talks are available on his [Metacafe channel](https://www.metacafe.com/channels/Mark%20Glickman/). See the official website [[link](http://www.nessis.org/)];
*    [42 Analytics](https://www.youtube.com/user/42analytics) – for SSAC conferences;
*    [StatsBomb](https://www.youtube.com/channel/UCmZ2ArreL9muPvH49Gaw0Bw);
*    [Opta](https://www.youtube.com/user/optasports) - including Opta Pro Forum talks; and
*    [Tifo Football](https://www.youtube.com/channel/UCGYYNGmyhZ_kwBF_lqqXdAQ).

#### Podcasts 
List of notable episodes:
*    [All Stats Aren't We](https://open.spotify.com/show/22eR0UCjDdVXY2JTtjD3OI?si=kt_lY1m2QKukOvKvmWpsPA):
     +    [Bonus Episode: David Sumpter - The Ten Equations that Rule the World](https://open.spotify.com/episode/2aWNiGHVH29qnXdrw12Iet?si=gU5__QfvRsCCxjE7XjcRhQ)
*    [Analytics FC Podcast](https://analyticsfc.co.uk/podcast/):
     +    [Episode 27: David Sumpter](https://open.spotify.com/episode/6gG4VY5hRlIio0smhgTnWh?si=meS7GqPxR4WXf2PGMPATZw)
*    [The Conor J Show](https://open.spotify.com/show/2VeRpUoHzC7KN9zxB5N2iz?si=oSMPSpwbR7-IgxSzqlk6Ig):
     +    [The Role Mathematics plays in Sports and Politics (Part 1) | David Sumpter | TCJS #11 (1/2)](https://open.spotify.com/episode/7BBAToNN9Mt0Ol8c9bDtGS?si=5YNtRObXQMu7xYBEgLgmbQ)
*    [Expected Value](https://open.spotify.com/show/5xFeWbaaLFepY5n73SfWwr?si=yn23mqUpQa-mvcL6CYWpgA):
     +    [NESSIS, Part 2 - Laurie Shaw & Sam Gregory](https://open.spotify.com/episode/42z1UFcfgpx17acCCg5rip?si=Pyu8gFJxRiej9fE15Gs89A)
*    [The Football Collective Podcast](https://open.spotify.com/show/3fqNuhWi6hkagJ1U0UDJfe?si=e10JT2ACS86A3JXyO1AzGQ):
     +    [S3 E1 | Sarthak Mondal speaks to Laurie Shaw about the advent of Data Science in Football](https://open.spotify.com/episode/1gJXuovD1L6VMimN5BtukS?si=Y-Ot43T8TluU7UEiSvReyg)
*    [The Football Fanalytics Podcast](https://open.spotify.com/show/6JwWRPMaHfGicFBtl7nI3V?si=IwQ00tyTRPaBcW-0XLwS4w&nd=1):
     +    [#1: What Did You Expect?](https://open.spotify.com/episode/3CkvTYcsLmNmD5BCIZhpvi?si=NaeVt2zOStm9EJ56n4EozQ)
*    [The Football Pod](https://open.spotify.com/show/3QhwCTOvJN3AZqNalgjtnO?si=173ZCWfsTs-jktoS7Bz9XQ):
     +    [Episode 3 with David Sumpter](https://open.spotify.com/episode/4mnDHbUo097JuC2lQiFijo?si=7abgc4_vRM21jSKff04rWg)
*    [Football Today](https://open.spotify.com/show/1WRaXZgVlksph0IjsTNBaG?si=0zyUX59sTKqCRnq92SEylQ&nd=1):
     +    [Manchester City Enters Data Arms Race With Liverpool](https://open.spotify.com/episode/311rLza8goz2b2SBORBORn?si=aqZX6ooOSfGkY3312jJozA)
*    [Pinnacle Podcast](https://open.spotify.com/show/091oYrS0glFhP81fq32bpE?si=TmR79XGnRAm5987Zj33ImA):
     +    [Serious About Betting: David Sumpter](https://open.spotify.com/episode/6Bt5L7wXDGvrSezYL2FU2O?si=IgQB1TArR9CBTnecZxb6qA)
*    [The Scouted Football Podcast](https://open.spotify.com/show/4qYVKC8RlHCJrwrRCx0w6H?si=M6xgCGtdTjiy0wEl1e2CJw):
     +    #56: Dominic Calvert-Lewin & Explaining Expected Goals - [Spotify](https://open.spotify.com/episode/37SlOJmtoviAKgNanq7Fxq?si=AAnRaCUOTw6FaVkreD5Rzg) and [YouTube](https://www.youtube.com/watch?v=EE_m3VBcASU) by [The Scouted Football Podcast](https://open.spotify.com/show/4qYVKC8RlHCJrwrRCx0w6H?si=M6xgCGtdTjiy0wEl1e2CJw).
*    [Squawka Talker Football Podcast](https://open.spotify.com/show/7xqylrPDX54uo01n4erZQZ?si=XpMNQ43aQxKUa6QuB0dp2w):
     +    [BIG interview with David Sumpter: Putting GPS trackers on under 10s football teams](https://open.spotify.com/episode/6c6vOWNhvUah7Mz01oiCgt?si=t7Sc0W8WRUS0ZnVWTaGt0g)
*    [Tifo Podcast](https://open.spotify.com/show/06QIGhqK31Qw1UvfHzRIDA?si=eJzpmtMeSPWUDP9fQ-5pqA):
     +    The Future of Stats: xG, xA - [Spotify](https://open.spotify.com/episode/7fPpKZSt2o9SSNynayROwd?si=WxuV2PFCQ7yRdNSE-QOZ6g) and [YouTube](https://www.youtube.com/watch?v=sNCeA27sDvI)
*    [Trademate Sports](https://open.spotify.com/show/2LPzUrtsWvz5iSayEGeEQK?si=prrlKqiwQ7-bKIUtbemkeQ):
     +    [Ep 87: Mathematics Professor David Sumpter & Trademate CEO Marius Norheim - Using Mathematics in...](https://open.spotify.com/episode/6jYhTHxujga5D9j37uQbLt?si=3gwKy27dRcOgVij5bYpGzA)
*    [UCN/USF Sport Management - Sports Business Podcast](https://soundcloud.com/user-736114890):
     +    [Kenneth Cortsen talks to Laurie Shaw from Harvard University](https://soundcloud.com/user-736114890/sport-data-analytics-in-football-kenneth-cortsen-talks-to-laurie-shaw-harvard-university)

#### Tweets
*    The benefits of including fake data in an Expected Goals model by David Sumpter [[link](https://twitter.com/Soccermatics/status/1260598182624575490)].


### Data Science

#### Mathematics
*    [Find if a point lies inside a Circle](https://www.geeksforgeeks.org/find-if-a-point-lies-inside-or-on-circle/) by Utkarsh Trivedi
*    [How to know if a point is inside a circle?](https://math.stackexchange.com/questions/198764/how-to-know-if-a-point-is-inside-a-circle)
*    [Check whether a given point lies inside a triangle or not](https://www.geeksforgeeks.org/check-whether-a-given-point-lies-inside-a-triangle-or-not/)

#### Classification Metrics

##### General
*    [Confusion Matrix, Accuracy, Specificity, Precision, and Recall](https://www.coursera.org/lecture/supervised-learning-classification/confusion-matrix-accuracy-specificity-precision-and-recall-e9U0e)
*    [Confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix)

##### Overview
*    [scikit-learn documentation classification metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)

##### ROC AUC
*    [Receiver operating characteristic Wiki](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
*    [Understanding AUC-ROC](https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5)
*    [How to Use ROC Curves and Precision-Recall Curves for Classification in Python](https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/) by Jason Brownlee
*    [scikit-learn ROC AUC score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score)
*    [Intuition behind ROC-AUC score](https://towardsdatascience.com/intuition-behind-roc-auc-score-1456439d1f30)

##### Log Loss
*    [Log Loss](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html)
*    [Accuracy, Recall, Precision, F-Score & Specificity, which to optimise on?](https://towardsdatascience.com/accuracy-recall-precision-f-score-specificity-which-to-optimize-on-867d3f11124)
*    [Understanding binary cross-entropy / log loss: a visual explanation](https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a)
*    [Intuition behind Log-loss score](https://towardsdatascience.com/intuition-behind-log-loss-score-4e0c9979680a)
*    [Log Loss or Cross-Entropy Cost Function in Logistic Regression](https://www.youtube.com/watch?v=MztgenIfGgM)
*    [Lecture 6.4 — Logistic Regression | Cost Function](https://www.youtube.com/watch?v=HIQlmHxI6-0) by Andrew Ng
*    [Binary Cross Entropy/Log Loss for Binary Classification](https://www.analyticsvidhya.com/blog/2021/03/binary-cross-entropy-log-loss-for-binary-classification/)
*    [A Gentle Introduction to Cross-Entropy for Machine Learning](https://machinelearningmastery.com/cross-entropy-for-machine-learning/) by Jason Brownlee
*    [What is Log Loss?](https://www.kaggle.com/dansbecker/what-is-log-loss) by Dan Becker

#### Modeling

##### Logistic Regression
*    [scikit-learn Logistic Regression official docs](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
*    [Logistic Regression wiki](https://en.wikipedia.org/wiki/Logistic_regression#:~:text=Logistic%20regression%20is%20a%20statistical,a%20form%20of%20binary%20regression)
*    [Logistic Regression](https://www.youtube.com/watch?v=yIYKR4sgzI8) by StatQuest (Josh Starmer)
*    [Logistic Regression for Machine Learning](https://machinelearningmastery.com/logistic-regression-for-machine-learning/) by Jason Brownlee
*    [Logistic Regression Tutorial for Machine Learning](https://machinelearningmastery.com/logistic-regression-tutorial-for-machine-learning/) by Jason Brownlee

##### XGBoost
*    [Greedy Function Approximation: A Gradient Boosting Machine](https://statweb.stanford.edu/~jhf/ftp/trebst.pdf) by Jerome H. Friedman
*    [XGBoost: A Scalable Tree Boosting System](https://arxiv.org/abs/1603.02754) by Tianqi Chen and Carlos Guestrin (the authors of XGBoost)
*    [XGBoost GitHub repo](https://github.com/dmlc/xgboost)
*    [Awesome XGBoost repo](https://github.com/dmlc/xgboost/tree/master/demo)
*    [XGBoost official documentation](https://xgboost.readthedocs.io/en/latest/index.html). See the tutorials [[link](https://xgboost.readthedocs.io/en/latest/tutorials/index.html)]:
     +    [Parameter Tuning](https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html)
     +    [General Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters)
*    [Gradient Boosting wiki](https://en.wikipedia.org/wiki/Gradient_boosting)
*    [A Gentle Introduction to XGBoost for Applied Machine Learning](https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/) by Jason Brownlee
*    [How to Develop Your First XGBoost Model in Python](https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/) by Jason Brownlee
*    [How to Configure XGBoost for Imbalanced Classification](https://machinelearningmastery.com/xgboost-for-imbalanced-classification/) by Jason Brownlee
*    [How to Visualize Gradient Boosting Decision Trees With XGBoost in Python](https://machinelearningmastery.com/visualize-gradient-boosting-decision-trees-xgboost-python/) by Jason Brownlee
*    [Story and Lessons Behind the Evolution of XGBoost](https://sites.google.com/site/nttrungmtwiki/home/it/data-science---python/xgboost/story-and-lessons-behind-the-evolution-of-xgboost) - brief history and backstory to the creation of XGBoost by Tianqi Chen
*    [XGBoost A Scalable Tree Boosting System](https://www.youtube.com/watch?v=Vly8xGnNiWs) - talk by Tianqi Chen at the LA Machine Learning Meetup Group on 02/06/2016
*    [Kaggle Winning Solution Xgboost Algorithm](https://www.youtube.com/watch?v=ufHo8vbk6g4) - talk by Tong He (author of the R XGBoost package) at the NYC Data Science Academy
*    [Gradient Boosting Machine Learning](https://www.youtube.com/watch?v=wPqtzj5VZus) - talk by Professor Trevor Hastie
*    [XGBoost](https://www.kaggle.com/alexisbcook/xgboost) lesson, part of the [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) course by [Kaggle](https://www.kaggle.com/)
*    [Hyperparameter Optimization for Xgboost](https://www.youtube.com/watch?v=9HomdnM12o4) by Krish Naik
*    [XGBoost in Python from Start to Finish](https://www.youtube.com/watch?v=GrJP9FLV3FE) by StatQuest (Josh Starmer)
*    [XGBoost + k-fold CV + Feature Importance](https://www.kaggle.com/prashant111/xgboost-k-fold-cv-feature-importance) by Prashant Banerjee
*    [A Guide on XGBoost hyperparameters tuning](https://www.kaggle.com/prashant111/a-guide-on-xgboost-hyperparameters-tuning) by Prashant Banerjee
*    [Hyperparameter Grid Search with XGBoost](https://www.kaggle.com/tilii7/hyperparameter-grid-search-with-xgboost)
*    [Hyperopt the Xgboost model](https://www.kaggle.com/yassinealouini/hyperopt-the-xgboost-model) by Yassine Alouini
*    [Using XGBoost in Python](https://www.datacamp.com/community/tutorials/xgboost-in-python) by Manish Pathak
*    [Getting started with XGBoost](https://blog.cambridgespark.com/getting-started-with-xgboost) by Kevin Lemagnen
*    [A Beginner’s guide to XGBoost](https://towardsdatascience.com/a-beginners-guide-to-xgboost-87f5d4c30ed7) by George Seif
*    [Boosting your Machine Learning Models Using XGBoost](https://heartbeat.fritz.ai/boosting-your-machine-learning-models-using-xgboost-d2cabb3e948f) by Derrick Mwiti
*    [XGBoost Algorithm: Long May She Reign!](https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d) by Vishal Morde
*    [Gradient Boosting and XGBoost](https://medium.com/@gabrieltseng/gradient-boosting-and-xgboost-c306c1bcfaf5) by Gabriel Tseng
*    [Gradient Boosting from scratch](https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d) by Prince Grover
*    [HyperParameter Tuning — Hyperopt Bayesian Optimization for (Xgboost and Neural network)](https://medium.com/analytics-vidhya/hyperparameter-tuning-hyperopt-bayesian-optimization-for-xgboost-and-neural-network-8aedf278a1c9) by Tinu Rohith D
*    [Complete Guide to Parameter Tuning in XGBoost with codes in Python](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/) by Aarshay Jain


##### CatBoost
*    [CatBoost.ai](https://catboost.ai/)
*    [Official CatBoost GitHub repo](https://github.com/catboost)
*    [Mastering gradient boosting with CatBoost](https://www.youtube.com/watch?v=usdEWSDisS0) by [Anna Veronika Dorogush](https://ru.linkedin.com/in/anna-veronika-dorogush-08739637) at PyData London 2019
*    [Mastering Fast Gradient Boosting on Google Colaboratory with free GPU](https://towardsdatascience.com/mastering-fast-gradient-boosting-on-google-colaboratory-with-free-gpu-65c1dd47d1c5) by [Anna Veronika Dorogush](https://ru.linkedin.com/in/anna-veronika-dorogush-08739637)

#### Feature Interpretation

##### SHAP
*    [Official SHAP GitHub repo](https://github.com/slundberg/shap)
*    [Official documentation](https://shap.readthedocs.io/en/latest/tabular_examples.html)
     +    [Linear models](https://shap.readthedocs.io/en/latest/tabular_examples.html#linear-models)
     +    [Tree-based models](https://shap.readthedocs.io/en/latest/tabular_examples.html#tree-based-models)
*    [A Unified Approach to Interpreting Model Predictions](https://papers.nips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf) by Scott M. Lundberg and Su-In Lee
*    [True to the Model or True to the Data?](https://arxiv.org/pdf/2006.16234.pdf) by Hugh Chen, Joseph D. Janizek, Scott Lundberg, and Su-In Lee
*    [Interpretable Machine Learning with XGBoost](https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27) by Scott Lundberg (creator of the SHAP library)
*    [Explain Your Model with the SHAP Values](https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d) by Dr. Dataman
*    [SHAP Values Explained Exactly How You Wished Someone Explained to You](https://towardsdatascience.com/shap-explained-the-way-i-wish-someone-explained-it-to-me-ab81cc69ef30)

#### Visualisation
*    [How to Make a Plot with Two Different Y-axis in Python with Matplotlib](https://cmdlinetips.com/2019/10/how-to-make-a-plot-with-two-different-y-axis-in-python-with-matplotlib/)
*    [Correlation heatmap](https://stackoverflow.com/questions/39409866/correlation-heatmap)
*    [Control color in seaborn heatmaps](https://www.python-graph-gallery.com/92-control-color-in-seaborn-heatmaps)

---

***Visit my website [eddwebster.com](https://www.eddwebster.com) or my [GitHub Repository](https://github.com/eddwebster) for more projects. If you'd like to get in contact, my Twitter handle is [@eddwebster](http://www.twitter.com/eddwebster) and my email is: edd.j.webster@gmail.com.***

[Back to the top](#top)