# **Expected Goals Classifier**

# Overview

Create an Expected Goals (xG) classification model using existing historical match data to produce actionable recommendations which can be utilized in technical and tactical analysis to improve goal-scoring.

Project detailed on Github: [Expected Goals Classifier]()

# Data Cleaning Notebook

Continued from expected_goals_data_extraction_notebook

*Notebook 2 of 7*

### Index

1. Data extracted in [expected_goals_data_extraction_notebook]()
2. Data cleaned in [expected_goals_data_cleaning_notebook]()
3. Data explored in [expected_goals_data_exploration_notebook]()
4. Features engineered in [expected_goals_feature_engineering_notebook]()
5. Data preprocessed in [expected_goals_data_preprocessing_notebook]()
6. Modeling in [expected_goals_model_fitting_notebook]()
7. Conclusions in [expected_goals_model_assessment_notebook]()

<a id = 'packages'></a>
# Packages

In [51]:
# rpy2 to run R
%load_ext rpy2.ipython

# Drive  and IO to access saved files
from google.colab import drive, files
drive.mount('/content/drive')

import io

# Pathlib for file retrieval
import pathlib
from pathlib import Path as path

# PyPy to improve speed
!apt-get install pypy

# warnings to ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Pandas for dataframes
import pandas as pd

# Numpy and math for mathematical functions
import numpy as np

import math
from math import atan2

# ProfileReport, SweetViz, and AutoViz for exploratory data analysis

!pip install http://github.com/pandas-profiling/pandas-profiling/archive/master.zip
from pandas_profiling import ProfileReport as pr

!pip install sweetviz
import sweetviz as sv

!pip install autoviz
from autoviz.AutoViz_Class import AutoViz_Class as av

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Reading package lists... Done
Building dependency tree       
Reading state information... Done
pypy is already the newest version (5.10.0+dfsg-3build2).
0 upgraded, 0 newly installed, 0 to remove and 37 not upgraded.
Collecting http://github.com/pandas-profiling/pandas-profiling/archive/master.zip
  Using cached http://github.com/pandas-profiling/pandas-profiling/archive/master.zip


# Data

Data sourced from [StatsBomb](https://statsbomb.com/), a United Kingdom based football (soccer) data analytics company.

StatsBomb have provided free access to their proprietary dataset via GitHub: [StatsBomb Open Data](https://github.com/statsbomb/open-data)

In [2]:
# Import extracted_data from expected_goals_data_extraction_notebook

extracted_data = pd.read_parquet('/content/drive/MyDrive/flatiron/expected_goals/data_extraction/dataframes/extracted_data.parquet')

In [3]:
extracted_data.head()

Unnamed: 0,id,index_x,period_x,timestamp_x,minute_x,second_x,type_x,possession_x,possession_team_x,play_pattern_x,team_x,player_x,position_x,location_x,duration_x,under_pressure_x,related_events_x,match_id_x,shot_statsbomb_xg,shot_end_location,shot_key_pass_id,shot_technique,shot_outcome,shot_type,shot_body_part,shot_freeze_frame,shot_one_on_one,shot_aerial_won,shot_open_goal,shot_first_time,out_x,shot_redirect,shot_deflected,off_camera_x,shot_saved_off_target,shot_saved_to_post,shot_follows_dribble,index_y,period_y,timestamp_y,...,second_y,type_y,possession_y,possession_team_y,play_pattern_y,team_y,player_y,position_y,location_y,duration_y,related_events_y,match_id_y,pass_recipient,pass_length,pass_angle,pass_height,pass_end_location,pass_body_part,pass_type,under_pressure_y,pass_outcome,pass_aerial_won,pass_assisted_shot_id,pass_shot_assist,off_camera_y,pass_switch,pass_through_ball,pass_technique,pass_backheel,pass_cross,counterpress,pass_cut_back,pass_deflected,pass_goal_assist,pass_miscommunication,pass_inswinging,pass_straight,pass_outswinging,pass_no_touch,out_y
0,8f5a3b7c-db0b-42ec-bac0-adc0bedca2ea,258,1,00:04:38.609,4,38,Shot,11,Chelsea FCW,Regular Play,Chelsea FCW,Francesca Kirby,Center Forward,"[109.0, 46.0]",0.2788,True,"[011167bc-9cbc-46a3-9b7b-28065eab7af1, 2c37831...",19743,0.266154,"[112.0, 45.0]",bf82ea91-c3e3-4d8c-b91d-c9d0ccd44f11,Normal,Blocked,Open Play,Left Foot,"[{'location': [104.0, 50.0], 'player': {'id': ...",,,,,,,,,,,,253.0,1.0,00:04:35.786,...,35.0,Pass,11.0,Chelsea FCW,Regular Play,Chelsea FCW,Bethany England,Left Midfield,"[95.0, 49.0]",1.361685,"[58da4d74-7684-405d-a8cc-bef1d658f1b6, 60d1337...",19743.0,Francesca Kirby,11.18034,0.463648,Ground Pass,"[105.0, 54.0]",Left Foot,,True,,,8f5a3b7c-db0b-42ec-bac0-adc0bedca2ea,True,,,,,,,,,,,,,,,,
1,60ead7a6-4aa2-41ab-85a1-21357f50e4e0,542,1,00:11:45.046,11,45,Shot,24,Chelsea FCW,From Free Kick,Chelsea FCW,Bethany England,Left Midfield,"[113.0, 35.0]",0.25673,True,"[a4b77cbb-14d0-4bd3-ba8b-7312335098fe, b9b246c...",19743,0.093521,"[120.0, 32.9, 0.4]",b99082e1-812b-48dd-bf94-8856b1ff079b,Normal,Off T,Open Play,Head,"[{'location': [108.0, 45.0], 'player': {'id': ...",True,True,,,,,,,,,,539.0,1.0,00:11:42.863,...,42.0,Pass,24.0,Chelsea FCW,From Free Kick,Chelsea FCW,Erin Cuthbert,Right Midfield,"[82.0, 54.0]",2.1038,[540a29f4-8533-4852-b492-307d124cf084],19743.0,Bethany England,37.735924,-0.558599,High Pass,"[114.0, 34.0]",Right Foot,Free Kick,,,,60ead7a6-4aa2-41ab-85a1-21357f50e4e0,True,,,,,,,,,,,,,,,,
2,f68deb6f-0711-4b9d-8081-122dc3722c55,614,1,00:18:03.461,18,3,Shot,29,Chelsea FCW,Regular Play,Chelsea FCW,Drew Spence,Left Defensive Midfield,"[94.0, 43.0]",1.147883,True,"[3c03553f-3bed-4d21-8096-ed4ef269da62, bb13e23...",19743,0.036171,"[120.0, 42.8, 0.5]",5022d0b3-ea32-42a8-bd41-b46cc244beb9,Normal,Saved,Open Play,Left Foot,"[{'location': [118.0, 41.0], 'player': {'id': ...",,,,,,,,,,,,610.0,1.0,00:18:01.596,...,1.0,Pass,29.0,Chelsea FCW,Regular Play,Chelsea FCW,So-yun Ji,Center Attacking Midfield,"[98.0, 60.0]",0.918187,"[753c6e78-72f9-4963-bcb7-c3e4ed58be6a, c884125...",19743.0,Drew Spence,11.18034,-2.034444,Ground Pass,"[93.0, 50.0]",Right Foot,,True,,,f68deb6f-0711-4b9d-8081-122dc3722c55,True,,,,,,,,,,,,,,,,
3,f301190f-cc0a-4f16-8278-27e5279ea24e,877,1,00:23:11.935,23,11,Shot,43,Birmingham City WFC,From Goal Kick,Birmingham City WFC,Chloe Arthur,Right Back,"[86.0, 34.0]",2.161012,True,"[0bfe1b6c-d690-41a6-be3e-f9b6295ddd85, 570e15b...",19743,0.016625,"[119.0, 33.3, 0.5]",fdf4a564-4973-46e5-bc07-d84785f8c183,Normal,Off T,Open Play,Left Foot,"[{'location': [78.0, 58.0], 'player': {'id': 1...",,,,,,,,,,,,873.0,1.0,00:23:08.192,...,8.0,Pass,43.0,Birmingham City WFC,From Goal Kick,Birmingham City WFC,Emma Follis,Center Forward,"[86.0, 15.0]",2.033567,[7d3eb214-4b99-4e3f-ad83-155793b118fc],19743.0,Chloe Arthur,13.892444,2.098871,Ground Pass,"[79.0, 27.0]",Right Foot,,,,,f301190f-cc0a-4f16-8278-27e5279ea24e,True,,,,,,,,,,,,,,,,
4,8558535e-b1ee-4f53-b003-1b5fba2712bd,892,1,00:23:45.810,23,45,Shot,44,Chelsea FCW,From Goal Kick,Chelsea FCW,Bethany England,Left Midfield,"[94.0, 33.0]",1.225187,,[1455cb46-43a3-4e6f-b845-171abcd344bc],19743,0.030716,"[120.0, 34.8, 0.5]",37712221-3b0b-4090-a30c-08a3ee6492be,Normal,Off T,Open Play,Right Foot,"[{'location': [117.0, 40.0], 'player': {'id': ...",,,,,,,,,,,,888.0,1.0,00:23:41.728,...,41.0,Pass,44.0,Chelsea FCW,From Goal Kick,Chelsea FCW,Jonna Andersson,Left Back,"[83.0, 10.0]",1.243357,[fad5af63-bf6e-4e51-9321-644b99e9f2b8],19743.0,Bethany England,14.56022,1.292497,Ground Pass,"[87.0, 24.0]",Left Foot,,,,,8558535e-b1ee-4f53-b003-1b5fba2712bd,True,,,,,,,,,,,,,,,,


# Drop Features

In [4]:
print('Total Features:',
      extracted_data.shape[1])

Total Features: 81


In [5]:
list(extracted_data.columns.values.tolist())

['id',
 'index_x',
 'period_x',
 'timestamp_x',
 'minute_x',
 'second_x',
 'type_x',
 'possession_x',
 'possession_team_x',
 'play_pattern_x',
 'team_x',
 'player_x',
 'position_x',
 'location_x',
 'duration_x',
 'under_pressure_x',
 'related_events_x',
 'match_id_x',
 'shot_statsbomb_xg',
 'shot_end_location',
 'shot_key_pass_id',
 'shot_technique',
 'shot_outcome',
 'shot_type',
 'shot_body_part',
 'shot_freeze_frame',
 'shot_one_on_one',
 'shot_aerial_won',
 'shot_open_goal',
 'shot_first_time',
 'out_x',
 'shot_redirect',
 'shot_deflected',
 'off_camera_x',
 'shot_saved_off_target',
 'shot_saved_to_post',
 'shot_follows_dribble',
 'index_y',
 'period_y',
 'timestamp_y',
 'minute_y',
 'second_y',
 'type_y',
 'possession_y',
 'possession_team_y',
 'play_pattern_y',
 'team_y',
 'player_y',
 'position_y',
 'location_y',
 'duration_y',
 'related_events_y',
 'match_id_y',
 'pass_recipient',
 'pass_length',
 'pass_angle',
 'pass_height',
 'pass_end_location',
 'pass_body_part',
 'pass_typ

### Duplicate Features

In [6]:
# Drop duplicate features

extracted_data.drop(['shot_saved_off_target',
                     'shot_saved_to_post',
                     'pass_outcome',
                     'pass_assisted_shot_id',
                     'pass_shot_assist',
                     'pass_goal_assist',
                     'pass_end_location',
                     'index_y',
                     'period_y',
                     'timestamp_y',
                     'minute_x',
                     'second_x',
                     'minute_y',
                     'second_y',
                     'type_y',
                     'possession_y',
                     'possession_team_y',
                     'team_y',
                     'player_y',
                     'position_y',
                     'location_y',
                     'duration_y',
                     'related_events_y',
                     'match_id_y',
                     'under_pressure_y',
                     'off_camera_y',
                     'out_y'],
                    axis = 1,
                    inplace = True)

### Non-Shot-Specific Features

In [7]:
# Drop features unrelated to shot-specific data

extracted_data.drop(['id',
                     'index_x',
                     'type_x',
                     'possession_x',
                     'possession_team_x',
                     'team_x',
                     'player_x',
                     'position_x',
                     'duration_x',
                     'related_events_x',
                     'match_id_x',
                     'shot_key_pass_id',
                     'shot_freeze_frame',
                     'out_x',
                     'off_camera_x',
                     'pass_recipient',
                     'pass_aerial_won'],
                    axis = 1,
                    inplace = True)

In [8]:
print('Total Features:',
      extracted_data.shape[1])

Total Features: 37


In [9]:
list(extracted_data.columns.values.tolist())

['period_x',
 'timestamp_x',
 'play_pattern_x',
 'location_x',
 'under_pressure_x',
 'shot_statsbomb_xg',
 'shot_end_location',
 'shot_technique',
 'shot_outcome',
 'shot_type',
 'shot_body_part',
 'shot_one_on_one',
 'shot_aerial_won',
 'shot_open_goal',
 'shot_first_time',
 'shot_redirect',
 'shot_deflected',
 'shot_follows_dribble',
 'play_pattern_y',
 'pass_length',
 'pass_angle',
 'pass_height',
 'pass_body_part',
 'pass_type',
 'pass_switch',
 'pass_through_ball',
 'pass_technique',
 'pass_backheel',
 'pass_cross',
 'counterpress',
 'pass_cut_back',
 'pass_deflected',
 'pass_miscommunication',
 'pass_inswinging',
 'pass_straight',
 'pass_outswinging',
 'pass_no_touch']

In [10]:
extracted_data.head()

Unnamed: 0,period_x,timestamp_x,play_pattern_x,location_x,under_pressure_x,shot_statsbomb_xg,shot_end_location,shot_technique,shot_outcome,shot_type,shot_body_part,shot_one_on_one,shot_aerial_won,shot_open_goal,shot_first_time,shot_redirect,shot_deflected,shot_follows_dribble,play_pattern_y,pass_length,pass_angle,pass_height,pass_body_part,pass_type,pass_switch,pass_through_ball,pass_technique,pass_backheel,pass_cross,counterpress,pass_cut_back,pass_deflected,pass_miscommunication,pass_inswinging,pass_straight,pass_outswinging,pass_no_touch
0,1,00:04:38.609,Regular Play,"[109.0, 46.0]",True,0.266154,"[112.0, 45.0]",Normal,Blocked,Open Play,Left Foot,,,,,,,,Regular Play,11.18034,0.463648,Ground Pass,Left Foot,,,,,,,,,,,,,,
1,1,00:11:45.046,From Free Kick,"[113.0, 35.0]",True,0.093521,"[120.0, 32.9, 0.4]",Normal,Off T,Open Play,Head,True,True,,,,,,From Free Kick,37.735924,-0.558599,High Pass,Right Foot,Free Kick,,,,,,,,,,,,,
2,1,00:18:03.461,Regular Play,"[94.0, 43.0]",True,0.036171,"[120.0, 42.8, 0.5]",Normal,Saved,Open Play,Left Foot,,,,,,,,Regular Play,11.18034,-2.034444,Ground Pass,Right Foot,,,,,,,,,,,,,,
3,1,00:23:11.935,From Goal Kick,"[86.0, 34.0]",True,0.016625,"[119.0, 33.3, 0.5]",Normal,Off T,Open Play,Left Foot,,,,,,,,From Goal Kick,13.892444,2.098871,Ground Pass,Right Foot,,,,,,,,,,,,,,
4,1,00:23:45.810,From Goal Kick,"[94.0, 33.0]",,0.030716,"[120.0, 34.8, 0.5]",Normal,Off T,Open Play,Right Foot,,,,,,,,From Goal Kick,14.56022,1.292497,Ground Pass,Left Foot,,,,,,,,,,,,,,


# Target Feature

In [11]:
# The taget feature is shot_outcome

In [12]:
# Display value counts for shot_outcome

extracted_data['shot_outcome'].value_counts(dropna = False)

Off T               1912
Saved               1531
Blocked             1460
Goal                 664
Wayward              336
Post                 136
Saved Off Target      24
Saved to Post         17
Name: shot_outcome, dtype: int64

In [13]:
# For training a classification model to predict expected goals,
# the only import measurement is if the shot resulted in a goal

# Change to boolean goal

extracted_data['shot_outcome'] = extracted_data['shot_outcome'].apply(lambda i: 'True' if i == 'Goal' else 'False')

extracted_data.rename(columns = {'shot_outcome' : 'goal'},
                      inplace = True)

In [14]:
extracted_data['goal'].value_counts(dropna = False)

False    5416
True      664
Name: goal, dtype: int64

# Assessing Feature Values

## ProfileReport

In [15]:
pr_report = pr(extracted_data)
pr_report

Output hidden; open in https://colab.research.google.com to view.

## Split Location Coordinates

### location_x

In [16]:
# Split location_x into location_y and location_x

shot_location_df = pd.DataFrame(extracted_data['location_x'].tolist(),
                                index = extracted_data.index)

In [17]:
# Replace location_x with shot_location_x and shot_location_y

extracted_data.drop('location_x',
                    axis = 1,
                    inplace = True)

extracted_data['shot_location_y'] = shot_location_df[0]
extracted_data['shot_location_x'] = shot_location_df[1]

### shot_end_location

In [18]:
# Split shot_end_location into end_location_y, end_ocation_x, and end_location_z

end_location_df = pd.DataFrame(extracted_data['shot_end_location'].tolist(),
                               index = extracted_data.index)

In [19]:
end_location_df[0].describe()

count    6080.000000
mean      116.010099
std         6.252067
min        84.000000
25%       115.000000
50%       119.000000
75%       120.000000
max       120.000000
Name: 0, dtype: float64

In [20]:
end_location_df[0].isna().sum()

0

In [21]:
# Drop y-coordinate
# All shots are aimed to end at the endline (120)

In [22]:
end_location_df[1].describe()

count    6080.000000
mean       40.149474
std         6.305472
min         0.100000
25%        36.400000
50%        40.000000
75%        43.800000
max        80.000000
Name: 1, dtype: float64

In [23]:
end_location_df[1].isna().sum()

0

In [24]:
end_location_df[2].describe()

count    4284.000000
mean        1.752311
std         1.522281
min         0.000000
25%         0.500000
50%         1.300000
75%         2.400000
max         7.800000
Name: 2, dtype: float64

In [25]:
end_location_df[2].isna().sum()

1796

In [26]:
print('Percent NA for z-coordinate:',
      ((end_location_df[2].isna().sum()) / (extracted_data.shape[0]) * 100))

Percent NA for z-coordinate: 29.539473684210527


In [27]:
# Drop z-coordinate
# 29.53% values missing

In [28]:
# Replace shot_end_location with x-coordinate

extracted_data['shot_end_location'] = end_location_df[1]

## Correct Boolean Features

In [29]:
boolean_features = ['shot_one_on_one',
                    'shot_aerial_won',
                    'shot_open_goal',
                    'shot_first_time',
                    'shot_redirect',
                    'shot_deflected',
                    'shot_follows_dribble',
                    'under_pressure_x',
                    'counterpress',
                    'pass_switch',
                    'pass_through_ball',
                    'pass_backheel',
                    'pass_cross',
                    'pass_cut_back',
                    'pass_deflected',
                    'pass_miscommunication',
                    'pass_inswinging',
                    'pass_straight',
                    'pass_outswinging',
                    'pass_no_touch']

In [30]:
extracted_data[boolean_features] = extracted_data[boolean_features].astype(bool)

In [31]:
# Drop pass_miscommunication due to no True values

extracted_data.drop('pass_miscommunication',
                    axis = 1,
                    inplace = True)

## ProfileReport 2

In [32]:
pr_report = pr(extracted_data)
pr_report

Output hidden; open in https://colab.research.google.com to view.

## Missing Values

In [33]:
extracted_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6080 entries, 0 to 6079
Data columns (total 37 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   period_x              6080 non-null   int64  
 1   timestamp_x           6080 non-null   object 
 2   play_pattern_x        6080 non-null   object 
 3   under_pressure_x      6080 non-null   bool   
 4   shot_statsbomb_xg     6080 non-null   float64
 5   shot_end_location     6080 non-null   float64
 6   shot_technique        6080 non-null   object 
 7   goal                  6080 non-null   object 
 8   shot_type             6080 non-null   object 
 9   shot_body_part        6080 non-null   object 
 10  shot_one_on_one       6080 non-null   bool   
 11  shot_aerial_won       6080 non-null   bool   
 12  shot_open_goal        6080 non-null   bool   
 13  shot_first_time       6080 non-null   bool   
 14  shot_redirect         6080 non-null   bool   
 15  shot_deflected       

### No Pass Preceding Shot

In [34]:
# Note: pass_length, pass_angle, and pass_height each have 1942 missing values
# Assume these missing values are shots which were not preceded by a pass

In [35]:
extracted_data[['pass_length',
                'pass_angle']].fillna(0,
                                      inplace = True)

In [36]:
extracted_data.loc[extracted_data['pass_length'] == 0,
                   ['pass_height',
                    'pass_type',
                    'pass_technique',
                    'pass_body_part']] = 'No Pass'

### pass_type

In [37]:
extracted_data['pass_type'].value_counts(dropna = False)

NaN             5119
Corner           400
Recovery         305
Free Kick        201
Throw-in          42
Interception      10
Goal Kick          1
No Pass            1
Kick Off           1
Name: pass_type, dtype: int64

In [38]:
# Defined pass_type are set-plays
# Assume missing values are from open play

extracted_data['pass_type'].fillna('Open Play',
                                   inplace = True)

### pass_technique

In [39]:
extracted_data['pass_technique'].value_counts(dropna = False)

NaN             5724
Through Ball     198
Inswinging        76
Outswinging       55
Straight          26
No Pass            1
Name: pass_technique, dtype: int64

In [40]:
# Assume missing values are standard passes

extracted_data['pass_technique'].fillna('Standard',
                                        inplace = True)

### pass_body_part

In [41]:
extracted_data['pass_body_part'].value_counts(dropna = False)

Right Foot    2744
NaN           2040
Left Foot     1162
Head           105
Other           19
No Touch         5
Drop Kick        4
No Pass          1
Name: pass_body_part, dtype: int64

In [42]:
# Assume missing values as Other

extracted_data['pass_body_part'].fillna('Other',
                                        inplace = True)

# ProfileReport 3

In [43]:
pr_report = pr(extracted_data)
pr_report

Output hidden; open in https://colab.research.google.com to view.

# SweetViz

In [44]:
sv_report = sv.analyze(extracted_data)
sv_report.show_notebook()

Output hidden; open in https://colab.research.google.com to view.

# Cleaned Data

In [45]:
cleaned_data = extracted_data

In [46]:
print('Total Events:',
      len(cleaned_data))

Total Events: 6080


In [47]:
print('Total Features:',
      cleaned_data.shape[1])

Total Features: 37


In [48]:
# Save cleaned_data

cleaned_data.to_parquet('/content/drive/MyDrive/flatiron/expected_goals/data_cleaning/dataframes/cleaned_data.parquet')

In [53]:
print('extracted_data Filesize:',
      path('/content/drive/MyDrive/flatiron/expected_goals/data_cleaning/dataframes/cleaned_data.parquet').stat().st_size,
      'bytes')

extracted_data Filesize: 275320 bytes


Continued in [expected_goals_feature engineering_notebook]()

*3 of 7*