### So, what is xG? Why do we need it? Let's dig deeper.

Put simply, xG (expected goals) is the probability that a given shot results in a goal, based on the variables describing that shot. In theory it might sound like a redundant stat, since xG is usually calculated after the shot has been taken and its outcome is already known, but in practice it has a lot of useful applications.
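
To make the definition concrete, here is a minimal sketch of how a logistic model can turn shot variables into an xG value. The coefficients and the choice of two features (distance and angle) are made up for illustration; they are not the fitted values of the model loaded later in this notebook.

```python
import math

# Hypothetical coefficients, chosen only for illustration (not fitted to any data).
INTERCEPT = -1.0
COEF_DIST = -0.12   # xG drops as the distance from goal grows
COEF_ANGLE = 0.02   # xG rises as the shooting angle opens up

def toy_xg(shot_dist, shot_ang):
    """Return a probability in (0, 1) from a logistic function of the shot variables."""
    z = INTERCEPT + COEF_DIST * shot_dist + COEF_ANGLE * shot_ang
    return 1.0 / (1.0 + math.exp(-z))

close_shot = toy_xg(shot_dist=3.0, shot_ang=90.0)
long_shot = toy_xg(shot_dist=25.0, shot_ang=20.0)
print(round(close_shot, 3), round(long_shot, 3))
```

With these assumed coefficients the close, central shot gets a much higher xG than the long-range one, which is the behaviour we would expect from any sensible xG model.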


### What are they?

##### 1. It is a predictive statistic: 
What this implies is that expected goals is a better predictor of future goals and assists than Total Shots Ratio (the ratio of shots for and against) and even goals ratio (goals scored vs goals conceded). Check out [this article by Michael Caley](https://cartilagefreecaptain.sbnation.com/2014/2/28/5452786/shot-matrix-tottenham-hotspur-stats-analysis-expected-goals) for a detailed analysis of how it helps analyse the future performance of teams.
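
As a rough illustration of the three team-level metrics being compared, the sketch below computes each one as a share of the "for" side over the total. Treating TSR as shots for divided by all shots is an assumption about the exact definition, and the team numbers are invented.

```python
def share(value_for, value_against):
    """Fraction of the combined total that belongs to the 'for' side."""
    return value_for / (value_for + value_against)

# Made-up season totals for one hypothetical team.
shots_for, shots_against = 150, 100
goals_for, goals_against = 20, 20
xg_for, xg_against = 18.0, 22.0

tsr = share(shots_for, shots_against)          # Total Shots Ratio
goals_ratio = share(goals_for, goals_against)  # goals ratio
xg_ratio = share(xg_for, xg_against)           # expected goals ratio

print(tsr, goals_ratio, xg_ratio)
```

In this toy case the team outshoots its opponents but takes lower-quality shots than it concedes, which is the kind of detail shot counts alone would hide.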

##### 2. It can help determine the finishing ability and playmaking ability of players:
For years, a player's finishing ability was judged by the number of goals they scored, and a player's playmaking ability by the number of assists they provided. However, this is an intuitively flawed notion, because both abilities are at play in every goal scored. A bad pass can still lead to a goal through good finishing, and a good pass can be wasted by bad finishing. This is where xG can help us circumvent these flaws.

#### Analysis 1 - Finishing Ability:

In [73]:
#import necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import glob
%matplotlib inline
pd.options.display.max_columns = 15
plt.style.use('fivethirtyeight')

In [74]:
df2 = pd.read_csv('prelimdat.csv')
df1 = pd.read_csv('xgdata.csv')
df = pd.read_csv('finalxgdata.csv')

df.head()

Unnamed: 0,shotDist,shotAng,isOnTarget,isGoal,isHeader,isBigChance,isCounter,isTapIn,isThroughball,isGround,goalLoc,minute
0,8.0,90.0,1,1,0,0,0,0,0,0,s11,53
1,2.236068,63.434949,1,1,0,0,0,0,0,1,s17,54
2,4.472136,63.434949,1,0,0,0,0,0,0,1,s21,54
3,4.123106,75.963757,1,0,0,0,0,0,0,1,s16,55
4,3.0,90.0,1,0,0,0,0,0,1,1,s21,55


Let's get down to the machine learning pipeline:

In [75]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

In [76]:
X = df.loc[:,df.columns!='isGoal']
y=df['isGoal']

#initializing scaler and encoder
#scikit-learn >= 1.2 calls this argument sparse_output; older versions use sparse=False
ohe=OneHotEncoder(sparse_output=False)
ss = StandardScaler()

#selecting categorical and numerical columns
cat_columns = ['goalLoc']
num_columns = ['shotDist','shotAng','isOnTarget','isHeader','isBigChance','isCounter','isTapIn','isThroughball','isGround','minute']

transformers = [('cat',ohe,cat_columns),('num',ss,num_columns)]

ct = ColumnTransformer(transformers,remainder='passthrough') 
X_t = ct.fit_transform(X)

In [77]:
from joblib import load
xgmodel = load('best_logit.joblib')

In [78]:
predictedXg = xgmodel.predict_proba(X_t)
df1['predicted_xg']=predictedXg[:,1] #the positive class probabilities are our xG
df1.head()

Unnamed: 0,indx,N,Category,Start,Click,End,Descriptors,...,prevD6,prevCat,isThroughball,isGround,goalLoc,minute,predicted_xg
0,9,1.0,R Goal,53:47:04,53:49:04,53:50:04,,...,,L Unsucc Pass,0,0,s11,53,0.507344
1,12,2.0,R Goal,54:11:03,54:13:03,54:14:03,,...,,R Grounded Pass,0,1,s17,54,0.473017
2,17,1.0,L Shot On,54:34:03,54:36:03,54:37:03,,...,,R Grounded Pass,0,1,s21,54,0.403839
3,21,1.0,R Shot On,55:06:24,55:08:24,55:09:24,,...,,L Grounded Pass,0,1,s16,55,0.376687
4,25,2.0,R Shot On,55:20:07,55:22:07,55:23:07,,...,,R Grounded Pass,1,1,s21,55,0.65442


In [84]:
#map each shot's index to its predicted xG
colDict = dict(zip(df1['indx'], df1['predicted_xg']))

#events that are not shots get an xG of 0
df2['xg_predicted'] = df2['indx'].map(colDict).fillna(0)
df2['isShot'] = df2.apply(lambda row: 1 if 'Shot' in row.Category or 'Goal' in row.Category else 0, axis=1)
df2['isGoal'] = df2.apply(lambda row: 1 if 'Goal' in row.Category else 0, axis=1)
df2.head()


# test code -
# dftest = df2[df2['isShot'] == 1]
# dftest.head()


Unnamed: 0,indx,N,Category,Start,Click,End,Descriptors,...,Des_3,Des_4,Des_5,Des_6,xg_predicted,isShot,isGoal
9,9,1.0,R Goal,53:47:04,53:49:04,53:50:04,,...,s11,Long Range,,,0.507344,1,1
12,12,2.0,R Goal,54:11:03,54:13:03,54:14:03,,...,s17,,,,0.473017,1,1
17,17,1.0,L Shot On,54:34:03,54:36:03,54:37:03,,...,s21,,,,0.403839,1,0
21,21,1.0,R Shot On,55:06:24,55:08:24,55:09:24,,...,s16,,,,0.376687,1,0
25,25,2.0,R Shot On,55:20:07,55:22:07,55:23:07,,...,s21,,,,0.65442,1,0


In [85]:
df_player_level = df2[['Des_1','isGoal','xg_predicted']]
df_player_level=df_player_level.groupby(['Des_1']).sum()
df_player_level['xG_Diff']=df_player_level['isGoal']-df_player_level['xg_predicted']

df_player_level_xG=df_player_level.sort_values(by=['isGoal'],ascending=False)
df_player_level_xG=df_player_level_xG.round(1)
df_player_level_xG.columns=['goals','xG_sum','xG_Diff']
df_player_level_xG.head(10)

Des_1,goals,xG_sum,xG_Diff
TG,79,57.0,22.0
TB,65,53.4,11.6
B3,16,17.9,-1.9
B2,9,10.4,-1.4
B6,9,8.4,0.6
G4,9,11.2,-2.2
B4,8,14.9,-6.9
B5,8,10.9,-2.9
G3,8,10.4,-2.4
G5,6,7.9,-1.9


##### What does this mean?

Essentially, for any given shot, xG says "the average player would score this shot with probability xG".
* So when players (take TG or TB, for example) score more goals than expected, they can be said to be good finishers.
* Conversely, when players (B4, for example) score far fewer goals than their expected goals, they can be said to be bad finishers.
* Finally, players whose goals and expected goals fall in line with each other (like B6) can be deemed average finishers.
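
One way to judge whether a gap between goals and xG is large enough to say anything about finishing is to compare it with the spread expected from chance alone. The sketch below assumes each shot is an independent event that goes in with probability equal to its xG; the shot values and the function name `finishing_summary` are made up for illustration.

```python
import math

def finishing_summary(shot_xgs, goals):
    """Compare goals with summed xG and with the spread expected from chance alone."""
    xg_sum = sum(shot_xgs)
    # If each shot is an independent event with probability p of scoring,
    # the variance of the goal total is the sum of p * (1 - p).
    std = math.sqrt(sum(p * (1 - p) for p in shot_xgs))
    diff = goals - xg_sum
    z = diff / std if std > 0 else 0.0
    return xg_sum, diff, z

shots = [0.5, 0.1, 0.3, 0.05, 0.7, 0.2, 0.15, 0.4]  # made-up xG values
xg_sum, diff, z = finishing_summary(shots, goals=4)
print(round(xg_sum, 2), round(diff, 2), round(z, 2))
```

A difference that is small relative to this spread is hard to tell apart from luck, so the labels above are more trustworthy for players with many shots.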

#### Why should we care?
With players who have good finishing but low xG, coaches can focus their training on movement to get into better scoring positions. With players who have good xG but poor finishing, on the other hand, coaches can focus their training on finishing drills.

The same holds for behind-the-scenes decisions such as transfers and wages. When comparing players like B2 and G4, who both scored 9 goals, a team can look at both xG and goals scored rather than goals alone, and weigh that against its needs (i.e., whether it needs a player who can finish tough shots or one who can exploit open spaces better). That leads to better transfer and wage decisions.

#### Analysis 2 - Playmaking Ability/Expected Assists: 

In [87]:
df2['prevPlr'] = df2['Des_1'].shift(1).astype(str)
df2['prevCat'] = df2['Category'].shift(1).astype(str)

In [88]:
df2['xA'] = df2.apply(lambda row: row.xg_predicted if "Pass" in row.prevCat else 0, axis=1)
df2['Assists'] = df2.apply(lambda row: 1 if "Pass" in row.prevCat and row.isGoal == 1 else 0, axis=1)

dftest = df2[df2['isShot'] == 1]
dftest.head()

Unnamed: 0,indx,N,Category,Start,Click,End,Descriptors,...,xg_predicted,isShot,isGoal,prevPlr,prevCat,xA,Assists
9,9,1.0,R Goal,53:47:04,53:49:04,53:50:04,,...,0.507344,1,1,B3,L Unsucc Pass,0.507344,1
12,12,2.0,R Goal,54:11:03,54:13:03,54:14:03,,...,0.473017,1,1,B2,R Grounded Pass,0.473017,1
17,17,1.0,L Shot On,54:34:03,54:36:03,54:37:03,,...,0.403839,1,0,TB,R Grounded Pass,0.403839,0
21,21,1.0,R Shot On,55:06:24,55:08:24,55:09:24,,...,0.376687,1,0,TG,L Grounded Pass,0.376687,0
25,25,2.0,R Shot On,55:20:07,55:22:07,55:23:07,,...,0.65442,1,0,B5,R Grounded Pass,0.65442,0
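
The attribution logic in the cells above can be sketched without pandas. This is a simplified illustration on an invented event log: each shot's xG is credited to the player of the immediately preceding event when that event was a pass. The player names and the "Tackle" and "L Shot Off" categories are hypothetical.

```python
from collections import defaultdict

# Made-up event log in match order: (player, category, xG if the event is a shot)
events = [
    ("P1", "R Grounded Pass", 0.0),
    ("P2", "R Shot On", 0.40),
    ("P2", "L Grounded Pass", 0.0),
    ("P1", "R Goal", 0.25),
    ("P3", "Tackle", 0.0),
    ("P1", "L Shot Off", 0.10),
]

xa = defaultdict(float)
assists = defaultdict(int)

# Pair each event with the one before it, mirroring shift(1) on the DataFrame.
for (prev_player, prev_cat, _), (_, cat, xg) in zip(events, events[1:]):
    is_shot = "Shot" in cat or "Goal" in cat
    if is_shot and "Pass" in prev_cat:
        xa[prev_player] += xg          # expected assists: xG of the shot created
        if "Goal" in cat:
            assists[prev_player] += 1  # actual assist: the shot was scored

print(dict(xa), dict(assists))
```

Note that a plain substring check on "Pass" also matches unsuccessful passes, so a stricter filter may be worth considering depending on how the categories are coded.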


In [90]:
df_player_level1 = df2[['prevPlr','Assists','xA']]
df_player_level1 = df_player_level1.groupby(['prevPlr']).sum()
df_player_level1['xA_Diff'] = df_player_level1['Assists']-df_player_level1['xA']

df_player_level_xA = df_player_level1.sort_values(by=['Assists'],ascending=False)
df_player_level_xA = df_player_level_xA.round(1)
df_player_level_xA.columns = ['Assists','xA','xA_Diff']
df_player_level_xA.head(10)

prevPlr,Assists,xA,xA_Diff
TB,34,36.5,-2.5
B3,23,20.6,2.4
B5,18,16.0,2.0
B4,17,19.7,-2.7
B6,16,14.0,2.0
TG,14,20.5,-6.5
G3,9,7.0,2.0
G5,9,6.4,2.6
B2,8,5.8,2.2
G1,7,6.0,1.0


##### What does this mean?

* Players like TG and B4 are unlucky: they created chances that their teammates failed to convert at the expected rate.
* Players like G5 and B3, on the other hand, are lucky: their teammates scored more goals than the quality of the chances they created would suggest.


#### Why should we care?
In conventional models, a player like B5 (18 assists) would be valued higher than a player like B4 (17 assists), and would probably get more playing time or command a higher transfer fee, even though B4 created the better chances (an xA of 19.7 against 16.0). Using expected assists in their analysis would allow decision makers to make better on- and off-pitch decisions, especially given the earlier point that xG is a better predictor of future goals than actual goals scored.


##### 3. It can be used by fans for predictions in fantasy leagues:

#### 3.1 - Fantasy Football:
The key point here is that fantasy football works on a system of price rises and price drops. Case in point:
* Mohamed Salah's 2018-19 season:

<img src="files/2018 epl 1st 10.png">

After the end of September in the 2018-19 season, Mohamed Salah led the league in xG but was underperforming it by a wide margin. In Fantasy Premier League (FPL), prices move in steps of 0.1m: when a player performs poorly over a period of time and managers sell him, his price drops, and when he performs well and managers buy him, it rises.

<img src="files/2018 epl final.png">

As we see above, Salah finished the season as the joint top goalscorer in the league and had a very respectable 8 assists to boot. Fantasy managers who used xG could have anticipated a burst in form and made a move for him when his price dropped due to underperformance.

However, there are risks involved here, such as:
* Players can be dropped to the bench before their output regresses to the mean.
* Sometimes the burst in form only comes in the next season, e.g. Danny Ings' burst in form in 2019-20.
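
The idea of spotting underperformers can be sketched as a simple filter. The player names, numbers and thresholds below are invented, and `buy_candidates` is a hypothetical helper rather than part of any fantasy football tool.

```python
# Made-up season-to-date numbers: name -> (goals, summed xG)
players = {
    "Forward A": (3, 6.1),
    "Forward B": (7, 4.0),
    "Midfielder C": (2, 2.3),
}

def buy_candidates(stats, min_xg=4.0, min_gap=2.0):
    """Return players with plenty of xG whose goals trail it by at least min_gap."""
    picks = []
    for name, (goals, xg) in stats.items():
        if xg >= min_xg and xg - goals >= min_gap:
            picks.append((name, round(xg - goals, 1)))
    # Biggest underperformance first.
    return sorted(picks, key=lambda item: item[1], reverse=True)

candidates = buy_candidates(players)
print(candidates)
```

Requiring a minimum xG keeps out players who simply are not getting chances; the risks listed above still apply to anyone the filter returns.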

I hope that these applications were fun and informative and changed the way you watch and enjoy the beautiful game, at least slightly.