# **Expected Goals Classifier**

## Overview

Build an Expected Goals (xG) classification model from historical match data and use it to produce actionable recommendations for technical and tactical analysis aimed at improving goal scoring.

Project detailed on GitHub: [Expected Goals Classifier]()

# Data Cleaning Notebook

Continued from expected_goals_data_extraction_notebook

*Notebook 2 of 7*

### Index

1. Data extracted in [expected_goals_data_extraction_notebook]()
2. Data cleaned in [expected_goals_data_cleaning_notebook]()
3. Features engineered in [expected_goals_feature_engineering_notebook]()
4. Data explored in [expected_goals_data_exploration_notebook]()
5. Data preprocessed in [expected_goals_data_preprocessing_notebook]()
6. Predictions modeled in [expected_goals_model_fitting_notebook]()
7. Conclusions in [expected_goals_model_assessment_notebook]()

<a id = 'packages'></a>
# Packages

In [1]:
# rpy2 to run R
%load_ext rpy2.ipython

# Drive and IO to access saved files
from google.colab import drive, files
drive.mount('/content/drive')

import io

# Pathlib for file retrieval
import pathlib
from pathlib import Path as path

# PyPy (alternative Python interpreter)
# Note: installing it does not change the interpreter that runs this notebook
!apt-get install pypy

# warnings to ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Pandas for dataframes
import pandas as pd

# Numpy for mathematical functions
import numpy as np

# ProfileReport for exploratory data analysis
!pip install http://github.com/pandas-profiling/pandas-profiling/archive/master.zip
from pandas_profiling import ProfileReport as pr

# Matplotlib and Plotly for visualizations
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.express as px

# Scipy for statistical functions

from scipy import stats



# Data

Data sourced from [StatsBomb](https://statsbomb.com/), a United Kingdom based football (soccer) data analytics company.

StatsBomb have provided free access to their proprietary dataset via GitHub: [StatsBomb Open Data](https://github.com/statsbomb/open-data)

In [2]:
# Import extracted_data from expected_goals_data_extraction_notebook

extracted_data = pd.read_parquet('/content/drive/MyDrive/expected_goals/data_extraction/dataframes/extracted_data.parquet')

In [3]:
extracted_data.head()

Unnamed: 0,id,index_x,period_x,timestamp_x,minute_x,second_x,type_x,possession_x,possession_team_x,play_pattern_x,team_x,player_x,position_x,location_x,duration_x,under_pressure_x,related_events_x,match_id_x,shot_statsbomb_xg,shot_end_location,shot_key_pass_id,shot_technique,shot_outcome,shot_type,shot_body_part,shot_freeze_frame,shot_one_on_one,shot_aerial_won,shot_open_goal,shot_first_time,out_x,shot_redirect,shot_deflected,off_camera_x,shot_saved_off_target,shot_saved_to_post,shot_follows_dribble,index_y,period_y,timestamp_y,...,second_y,type_y,possession_y,possession_team_y,play_pattern_y,team_y,player_y,position_y,location_y,duration_y,related_events_y,match_id_y,pass_recipient,pass_length,pass_angle,pass_height,pass_end_location,pass_body_part,pass_type,under_pressure_y,pass_outcome,pass_aerial_won,pass_assisted_shot_id,pass_shot_assist,off_camera_y,pass_switch,pass_through_ball,pass_technique,pass_backheel,pass_cross,counterpress,pass_cut_back,pass_deflected,pass_goal_assist,pass_miscommunication,pass_inswinging,pass_straight,pass_outswinging,pass_no_touch,out_y
0,8f5a3b7c-db0b-42ec-bac0-adc0bedca2ea,258,1,00:04:38.609,4,38,Shot,11,Chelsea FCW,Regular Play,Chelsea FCW,Francesca Kirby,Center Forward,"[109.0, 46.0]",0.2788,True,"[011167bc-9cbc-46a3-9b7b-28065eab7af1, 2c37831...",19743,0.266154,"[112.0, 45.0]",bf82ea91-c3e3-4d8c-b91d-c9d0ccd44f11,Normal,Blocked,Open Play,Left Foot,"[{'location': [104.0, 50.0], 'player': {'id': ...",,,,,,,,,,,,253.0,1.0,00:04:35.786,...,35.0,Pass,11.0,Chelsea FCW,Regular Play,Chelsea FCW,Bethany England,Left Midfield,"[95.0, 49.0]",1.361685,"[58da4d74-7684-405d-a8cc-bef1d658f1b6, 60d1337...",19743.0,Francesca Kirby,11.18034,0.463648,Ground Pass,"[105.0, 54.0]",Left Foot,,True,,,8f5a3b7c-db0b-42ec-bac0-adc0bedca2ea,True,,,,,,,,,,,,,,,,
1,60ead7a6-4aa2-41ab-85a1-21357f50e4e0,542,1,00:11:45.046,11,45,Shot,24,Chelsea FCW,From Free Kick,Chelsea FCW,Bethany England,Left Midfield,"[113.0, 35.0]",0.25673,True,"[a4b77cbb-14d0-4bd3-ba8b-7312335098fe, b9b246c...",19743,0.093521,"[120.0, 32.9, 0.4]",b99082e1-812b-48dd-bf94-8856b1ff079b,Normal,Off T,Open Play,Head,"[{'location': [108.0, 45.0], 'player': {'id': ...",True,True,,,,,,,,,,539.0,1.0,00:11:42.863,...,42.0,Pass,24.0,Chelsea FCW,From Free Kick,Chelsea FCW,Erin Cuthbert,Right Midfield,"[82.0, 54.0]",2.1038,[540a29f4-8533-4852-b492-307d124cf084],19743.0,Bethany England,37.735924,-0.558599,High Pass,"[114.0, 34.0]",Right Foot,Free Kick,,,,60ead7a6-4aa2-41ab-85a1-21357f50e4e0,True,,,,,,,,,,,,,,,,
2,f68deb6f-0711-4b9d-8081-122dc3722c55,614,1,00:18:03.461,18,3,Shot,29,Chelsea FCW,Regular Play,Chelsea FCW,Drew Spence,Left Defensive Midfield,"[94.0, 43.0]",1.147883,True,"[3c03553f-3bed-4d21-8096-ed4ef269da62, bb13e23...",19743,0.036171,"[120.0, 42.8, 0.5]",5022d0b3-ea32-42a8-bd41-b46cc244beb9,Normal,Saved,Open Play,Left Foot,"[{'location': [118.0, 41.0], 'player': {'id': ...",,,,,,,,,,,,610.0,1.0,00:18:01.596,...,1.0,Pass,29.0,Chelsea FCW,Regular Play,Chelsea FCW,So-yun Ji,Center Attacking Midfield,"[98.0, 60.0]",0.918187,"[753c6e78-72f9-4963-bcb7-c3e4ed58be6a, c884125...",19743.0,Drew Spence,11.18034,-2.034444,Ground Pass,"[93.0, 50.0]",Right Foot,,True,,,f68deb6f-0711-4b9d-8081-122dc3722c55,True,,,,,,,,,,,,,,,,
3,f301190f-cc0a-4f16-8278-27e5279ea24e,877,1,00:23:11.935,23,11,Shot,43,Birmingham City WFC,From Goal Kick,Birmingham City WFC,Chloe Arthur,Right Back,"[86.0, 34.0]",2.161012,True,"[0bfe1b6c-d690-41a6-be3e-f9b6295ddd85, 570e15b...",19743,0.016625,"[119.0, 33.3, 0.5]",fdf4a564-4973-46e5-bc07-d84785f8c183,Normal,Off T,Open Play,Left Foot,"[{'location': [78.0, 58.0], 'player': {'id': 1...",,,,,,,,,,,,873.0,1.0,00:23:08.192,...,8.0,Pass,43.0,Birmingham City WFC,From Goal Kick,Birmingham City WFC,Emma Follis,Center Forward,"[86.0, 15.0]",2.033567,[7d3eb214-4b99-4e3f-ad83-155793b118fc],19743.0,Chloe Arthur,13.892444,2.098871,Ground Pass,"[79.0, 27.0]",Right Foot,,,,,f301190f-cc0a-4f16-8278-27e5279ea24e,True,,,,,,,,,,,,,,,,
4,8558535e-b1ee-4f53-b003-1b5fba2712bd,892,1,00:23:45.810,23,45,Shot,44,Chelsea FCW,From Goal Kick,Chelsea FCW,Bethany England,Left Midfield,"[94.0, 33.0]",1.225187,,[1455cb46-43a3-4e6f-b845-171abcd344bc],19743,0.030716,"[120.0, 34.8, 0.5]",37712221-3b0b-4090-a30c-08a3ee6492be,Normal,Off T,Open Play,Right Foot,"[{'location': [117.0, 40.0], 'player': {'id': ...",,,,,,,,,,,,888.0,1.0,00:23:41.728,...,41.0,Pass,44.0,Chelsea FCW,From Goal Kick,Chelsea FCW,Jonna Andersson,Left Back,"[83.0, 10.0]",1.243357,[fad5af63-bf6e-4e51-9321-644b99e9f2b8],19743.0,Bethany England,14.56022,1.292497,Ground Pass,"[87.0, 24.0]",Left Foot,,,,,8558535e-b1ee-4f53-b003-1b5fba2712bd,True,,,,,,,,,,,,,,,,


# ProfileReport

In [4]:
pr_report = pr(extracted_data)
pr_report


# Target Feature

In [5]:
# The target feature is shot_outcome

In [6]:
# Display value counts for shot_outcome

extracted_data['shot_outcome'].value_counts(dropna = False)

In [7]:
# xG measures the likelihood that a shot will result in a goal
# For this reason, the target must record whether the shot was a goal

# Update shot_outcome to a boolean goal indicator

extracted_data['shot_outcome'] = extracted_data['shot_outcome'] == 'Goal'

extracted_data.rename(columns = {'shot_outcome' : 'goal'},
                      inplace = True)

In [8]:
extracted_data['goal'].value_counts(dropna = False)
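
A minimal sketch of building a boolean goal target, shown on a toy frame rather than on `extracted_data`. The frame `toy_shots` and its rows are hypothetical; only the outcome labels mirror values that appear in `shot_outcome`.

```python
import pandas as pd

# Hypothetical toy data; the labels mirror values seen in shot_outcome
toy_shots = pd.DataFrame({'shot_outcome': ['Goal', 'Saved', 'Off T', 'Goal', 'Blocked']})

# An element-wise comparison yields a column with dtype bool
toy_shots['goal'] = toy_shots['shot_outcome'] == 'Goal'

# The mean of a boolean column is the share of shots that were goals
goal_rate = toy_shots['goal'].mean()

print(toy_shots['goal'].value_counts(dropna = False))
print('Goal rate:', goal_rate)
```

A boolean column can be summed or averaged directly, which is convenient when checking class balance before modeling.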

# Irrelevant Data

In [9]:
print('Total Features:',
      extracted_data.shape[1])

In [10]:
# List current features

extracted_data.columns.tolist()

In [11]:
# Drop duplicate features

extracted_data.drop(['shot_saved_off_target',
                     'shot_saved_to_post',
                     'pass_outcome',
                     'pass_assisted_shot_id',
                     'pass_shot_assist',
                     'pass_goal_assist',
                     'pass_end_location',
                     'index_y',
                     'period_y',
                     'timestamp_y',
                     'minute_x',
                     'second_x',
                     'minute_y',
                     'second_y',
                     'type_y',
                     'possession_y',
                     'possession_team_y',
                     'play_pattern_y',
                     'team_y',
                     'player_y',
                     'position_y',
                     'location_y',
                     'duration_y',
                     'related_events_y',
                     'match_id_y',
                     'under_pressure_y',
                     'off_camera_y',
                     'out_y'],
                    axis = 1,
                    inplace = True)

In [12]:
# Drop features unrelated to shot-specific data

extracted_data.drop(['id',
                     'index_x',
                     'type_x',
                     'possession_x',
                     'possession_team_x',
                     'team_x',
                     'player_x',
                     'position_x',
                     'duration_x',
                     'related_events_x',
                     'match_id_x',
                     'shot_key_pass_id',
                     'shot_freeze_frame',
                     'out_x',
                     'off_camera_x',
                     'shot_aerial_won',
                     'pass_recipient',
                     # pass_body_part is kept: it is filled and inspected under Missing Values
                     'pass_aerial_won',
                     'pass_deflected',
                     'pass_miscommunication',
                     'pass_no_touch'],
                    axis = 1,
                    inplace = True)

In [13]:
print('Total Features:',
      extracted_data.shape[1])

In [14]:
# List current features

extracted_data.columns.tolist()

# Data Types

## Boolean Features

In [15]:
# Define boolean features

boolean_features = ['shot_one_on_one',
                    'shot_open_goal',
                    'shot_first_time',
                    'shot_redirect',
                    'shot_deflected',
                    'shot_follows_dribble',
                    'under_pressure_x',
                    'counterpress',
                    'pass_switch',
                    'pass_through_ball',
                    'pass_backheel',
                    'pass_cross',
                    'pass_cut_back',
                    'pass_inswinging',
                    'pass_straight',
                    'pass_outswinging']

In [16]:
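# Convert boolean features to boolean data type
# A missing value means the flag was not recorded, so treat it as False
# (a direct cast would turn NaN into True)

extracted_data[boolean_features] = extracted_data[boolean_features].fillna(False).astype(bool)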
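
The flag columns hold `True` or a missing value. A minimal sketch, on hypothetical toy Series, of how a direct `astype(bool)` cast treats the two common kinds of missing value, and how filling with `False` first gives the same result for both. Which kind `extracted_data` actually contains depends on how the parquet file was written, so this is an illustration rather than a description of the data.

```python
import numpy as np
import pandas as pd

# Hypothetical flag columns: True where the event was tagged, missing otherwise
flags_none = pd.Series([True, None, True, None], dtype = object)
flags_nan = pd.Series([True, np.nan, True, np.nan], dtype = object)

# None is falsy, so a direct cast maps it to False
cast_none = flags_none.astype(bool)

# NaN is a non-zero float, so a direct cast maps it to True
cast_nan = flags_nan.astype(bool)

# Filling missing values first makes both cases behave the same way
safe_none = flags_none.fillna(False).astype(bool)
safe_nan = flags_nan.fillna(False).astype(bool)

print(cast_none.tolist(), cast_nan.tolist())
print(safe_none.tolist(), safe_nan.tolist())
```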

## Datetime

In [17]:
# Convert timestamp_x datatype to datetime

extracted_data['timestamp_x'] = extracted_data['timestamp_x'].astype(str)
extracted_data['timestamp_x'] = pd.to_datetime(extracted_data['timestamp_x'])
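
The `timestamp_x` strings such as `00:04:38.609` are elapsed times within a period, not calendar dates, so `pd.to_datetime` has to attach an arbitrary date to them. As an alternative sketch (not what this notebook does), `pd.to_timedelta` keeps them as durations that convert cleanly to seconds. The Series `toy_timestamps` is hypothetical; its values are copied from the preview above.

```python
import pandas as pd

# Hypothetical sample of period-relative timestamps
toy_timestamps = pd.Series(['00:04:38.609', '00:11:45.046', '00:18:03.461'])

# Parse as durations rather than as calendar datetimes
elapsed = pd.to_timedelta(toy_timestamps)

# Seconds since the start of the period, usable as a numeric feature
elapsed_seconds = elapsed.dt.total_seconds()

print(elapsed_seconds.tolist())
```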

# Missing Values

## No Pass

In [18]:
print('pass_length NA:',
      sum(extracted_data['pass_length'].isna()),
      '\n',
      'pass_angle NA:',
      sum(extracted_data['pass_angle'].isna()),
      '\n',
      'pass_height NA:',
      sum(extracted_data['pass_height'].isna()))

In [19]:
# Note: pass_length, pass_angle, and pass_height each have 1942 missing values
# Assume these missing values are shots which were not preceded by a pass
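
Equal missing counts do not by themselves show that the same rows are missing in all three columns. A minimal sketch of how that assumption could be checked, using a hypothetical toy frame `toy_passes` in place of `extracted_data`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 1 and 3 have no preceding pass
toy_passes = pd.DataFrame({'pass_length': [11.2, np.nan, 37.7, np.nan],
                           'pass_angle': [0.46, np.nan, -0.56, np.nan],
                           'pass_height': ['Ground Pass', None, 'High Pass', None]})

pass_columns = ['pass_length', 'pass_angle', 'pass_height']

missing = toy_passes[pass_columns].isna()

# Rows where every pass column is missing: candidate shots without a pass
all_missing = missing.all(axis = 1)

# Rows where only some pass columns are missing would contradict the assumption
partially_missing = missing.any(axis = 1) & ~all_missing

print('All missing:', all_missing.sum())
print('Partially missing:', partially_missing.sum())
```

If the partially missing count is zero, the three columns are missing on exactly the same rows.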

In [20]:
# Record which shots have no preceding pass before filling, so that genuine
# zero-length passes are not relabeled in the next cell

no_pass = extracted_data['pass_length'].isna()

extracted_data['pass_length'] = extracted_data['pass_length'].fillna(0)

extracted_data['pass_angle'] = extracted_data['pass_angle'].fillna(0)

In [21]:
# Fill values in pass-related categorical features for shots which were not preceded by a pass

# Note: this does not account for all missing values in pass-related categorical features,
# only those on shots identified as not preceded by a pass through missing values
# in pass_length, pass_angle, and pass_height

extracted_data.loc[no_pass,
                   ['pass_height',
                    'pass_type',
                    'pass_technique',
                    'pass_body_part']] = 'No Pass'

## pass_type

In [22]:
extracted_data['pass_type'].value_counts(dropna = False)

In [23]:
# pass_type defined values are set-plays
# Assume missing values are from open play

extracted_data['pass_type'] = extracted_data['pass_type'].fillna('Open Play')

## pass_technique

In [24]:
extracted_data['pass_technique'].value_counts(dropna = False)

In [25]:
# pass_technique defined values are specialized passes
# Assume missing values are standard passes

extracted_data['pass_technique'] = extracted_data['pass_technique'].fillna('Standard')

## pass_body_part

In [26]:
extracted_data['pass_body_part'].value_counts(dropna = False)

In [27]:
# Assume missing values as Other

extracted_data['pass_body_part'] = extracted_data['pass_body_part'].fillna('Other')

# Split Location Coordinates Features

### location_x

In [28]:
# Split shot location coordinates, location_x, into separate x and y-coordinates

shot_location_df = pd.DataFrame(extracted_data['location_x'].tolist(),
                                index = extracted_data.index)

In [29]:
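# Replace location_x with shot_location_x and shot_location_y

# Note: in this notebook x is the position across the pitch width (0 to 80, center at 40)
# and y is the position along the pitch length (0 to 120, goal at 120), so the first
# StatsBomb coordinate is stored as shot_location_y and the second as shot_location_x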

extracted_data.drop('location_x',
                    axis = 1,
                    inplace = True)

extracted_data['shot_location_y'] = shot_location_df[0]
extracted_data['shot_location_x'] = shot_location_df[1]

### shot_end_location

In [30]:
# Split shot end location coordinates, shot_end_location,
# into separate x, y, and z-coordinates

end_location_df = pd.DataFrame(extracted_data['shot_end_location'].tolist(),
                               index = extracted_data.index)

In [31]:
# Drop y-coordinate
# All shots are aimed to end at the endline (120)

In [32]:
print('z-coordinate NA:',
      end_location_df[2].isna().sum(),
      '\n',
      'z-coordinate percent NA:',
      (end_location_df[2].isna().sum() / extracted_data.shape[0] * 100))

In [33]:
# Drop z-coordinate

In [34]:
# Replace shot_end_location with x-coordinate

extracted_data['shot_end_location'] = end_location_df[1]
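
A minimal sketch of why the z-coordinate has missing values after the split: some end locations hold two coordinates and others three, and `pd.DataFrame` pads the shorter lists with `NaN`. The Series `toy_end_locations` is hypothetical; its values are copied from the preview above, and it assumes the column holds plain Python lists.

```python
import pandas as pd

# Hypothetical sample: a blocked shot has no height, the others do
toy_end_locations = pd.Series([[112.0, 45.0],
                               [120.0, 32.9, 0.4],
                               [120.0, 42.8, 0.5]])

# Lists of unequal length are padded with NaN on the right
toy_end_df = pd.DataFrame(toy_end_locations.tolist(),
                          index = toy_end_locations.index)

print(toy_end_df)
print('Missing third coordinate:', toy_end_df[2].isna().sum())
```

Column `1` is the coordinate this notebook keeps as `shot_end_location`.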

# Outliers

## Numeric Features

In [35]:
# Function defining outliers

def iqr_outliers(feature):
  q1 = extracted_data[feature].quantile(0.25)
  q3 = extracted_data[feature].quantile(0.75)

  iqr = q3 - q1

  lower_limit = q1 - (1.5 * iqr)
  upper_limit = q3 + (1.5 * iqr)

  return lower_limit, upper_limit
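
A self-contained sketch of the same 1.5 x IQR rule applied to a toy Series, so the limits can be checked by hand. The function name `iqr_limits`, its `k` parameter, and the values in `toy_widths` are hypothetical; unlike `iqr_outliers`, it takes the values directly instead of reading from `extracted_data`.

```python
import pandas as pd

def iqr_limits(values, k = 1.5):
    """Return (lower, upper) limits at k times the IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - (k * iqr), q3 + (k * iqr)

# Hypothetical width coordinates with one shot far from the rest
toy_widths = pd.Series([30.0, 36.0, 38.0, 40.0, 42.0, 44.0, 79.0])

lower, upper = iqr_limits(toy_widths)

# Values outside the limits are flagged as outliers
outliers = toy_widths[(toy_widths < lower) | (toy_widths > upper)]

print('Limits:', lower, upper)
print('Outliers:', outliers.tolist())
```

Here the quartiles are 37 and 43, so the limits are 28 and 52 and only the value 79 is flagged.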

### shot_location_x

In [36]:
# Visualize distribution of shot_location_x

fig = px.box(extracted_data,
             y = 'shot_location_x')

fig.show()

In [37]:
lower_q_shot_location_x, upper_q_shot_location_x = iqr_outliers('shot_location_x')

print('shot_location_x Outliers:',
      '\n',
      'Wider than',
      (round((40 - lower_q_shot_location_x),
       2)),
      'Yards Left of Center',
      '\n',
      'Wider than',
      (round((upper_q_shot_location_x - 40),
       2)),
      'Yards Right of Center')

In [38]:
outliers_shot_location_x = extracted_data[(extracted_data['shot_location_x'] <
                                           lower_q_shot_location_x) |
                                          (extracted_data['shot_location_x'] >
                                           upper_q_shot_location_x)]['shot_location_x']

print('shot_location_x Outlier Count:',
      len(outliers_shot_location_x))

In [39]:
outliers_shot_location_x

In [40]:
# Drop outliers

extracted_data.drop(outliers_shot_location_x.index,
                    inplace = True)

In [41]:
extracted_data['shot_location_x'].describe()

In [42]:
len(extracted_data)

### shot_location_y

In [None]:
# Visualize distribution of shot_location_y

fig = px.box(extracted_data,
             y = 'shot_location_y')

fig.show()

In [None]:
# Note: Only shots significantly further from goal than the median will be considered outliers
# Shots closer to goal are more desirable and will likely have higher xG

lower_q_shot_location_y, upper_q_shot_location_y = iqr_outliers('shot_location_y')

print('shot_location_y Outliers:',
      '\n',
      'Further than',
      (round((120 - lower_q_shot_location_y),
       2)),
      'Yards from Goal')

In [None]:
outliers_shot_location_y = extracted_data[(extracted_data['shot_location_y'] <
                                           lower_q_shot_location_y)]['shot_location_y']

print('shot_location_y Outlier Count:',
      len(outliers_shot_location_y))

In [None]:
outliers_shot_location_y

In [None]:
# Drop outliers

extracted_data.drop(outliers_shot_location_y.index,
                    inplace = True)

In [None]:
extracted_data['shot_location_y'].describe()

### shot_end_location

In [None]:
# Visualize distribution of shot_end_location

fig = px.box(extracted_data,
             y = 'shot_end_location')

fig.show()

In [None]:
lower_q_shot_end_location, upper_q_shot_end_location = iqr_outliers('shot_end_location')

print('shot_end_location Outliers:',
      '\n',
      'Wider than',
      (round((40 - lower_q_shot_end_location),
       2)),
      'Yards Left of Center',
      '\n',
      'Wider than',
      (round((upper_q_shot_end_location - 40),
       2)),
      'Yards Right of Center')

In [None]:
outliers_shot_end_location = extracted_data[(extracted_data['shot_end_location'] <
                                             lower_q_shot_end_location) |
                                            (extracted_data['shot_end_location'] >
                                             upper_q_shot_end_location)]['shot_end_location']

print('shot_end_location Outlier Count:',
      len(outliers_shot_end_location))

In [None]:
# Retain outliers

# Poor shot end location does not indicate a poor shot opportunity

### pass_length

In [None]:
# Visualize distribution of pass_length

fig = px.box(extracted_data,
             y = 'pass_length')

fig.show()

In [None]:
lower_q_pass_length, upper_q_pass_length = iqr_outliers('pass_length')

print('pass_length Outliers:',
      '\n',
      'Further than',
      (round((upper_q_pass_length),
       2)),
      'Yards')

In [None]:
outliers_pass_length = extracted_data[(extracted_data['pass_length'] >
                                       upper_q_pass_length)]['pass_length']

print('pass_length Outlier Count:',
      len(outliers_pass_length))

In [None]:
# Retain outliers
# Long passes can create high quality shot opportunities

### pass_angle

In [None]:
# Visualize distribution of pass_angle

fig = px.box(extracted_data,
             y = 'pass_angle')

fig.show()

In [None]:
lower_q_pass_angle, upper_q_pass_angle = iqr_outliers('pass_angle')

# pass_angle is recorded in radians, so report the pass_angle limits directly

print('pass_angle Outliers:',
      '\n',
      'Below',
      round(lower_q_pass_angle,
            2),
      'Radians',
      '\n',
      'Above',
      round(upper_q_pass_angle,
            2),
      'Radians')

In [59]:
outliers_pass_angle = extracted_data[(extracted_data['pass_angle'] <
                                      lower_q_pass_angle) |
                                     (extracted_data['pass_angle'] >
                                      upper_q_pass_angle)]['pass_angle']

print('pass_angle Outlier Count:',
      len(outliers_pass_angle))

pass_angle Outlier Count: 64


In [60]:
# Retain outliers
# Narrow angle passes can create high quality shot opportunities

## Total Dropped

In [61]:
# Percentage is relative to the number of shots before outliers were dropped

dropped_outliers = (len(outliers_shot_location_x) +
                    len(outliers_shot_location_y))

print('Total Dropped Outliers:',
      dropped_outliers,
      '\n',
      'Percent Dropped Outliers:',
      round((dropped_outliers /
             (len(extracted_data) + dropped_outliers)) * 100,
            3),
      '%')

Total Dropped Outliers: 37 
 Percent Dropped Outliers: 0.609 %


# Combine Redundant Features

## shot_technique

In [62]:
extracted_data['shot_technique'].value_counts()

Normal           5097
Half Volley       506
Volley            336
Lob                55
Backheel           20
Diving Header      15
Overhead Kick      14
Name: shot_technique, dtype: int64

In [63]:
extracted_data['shot_redirect'].value_counts()

False    6021
True       22
Name: shot_redirect, dtype: int64

In [64]:
# Compare shot_redirect values v shot_technique values

extracted_data.loc[extracted_data['shot_redirect']]['shot_technique'].value_counts()

Normal         13
Volley          6
Backheel        2
Half Volley     1
Name: shot_technique, dtype: int64

In [65]:
# No changes

# shot_redirect is a unique descriptor from shot_technique values
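
The comparison above filters on one value and counts the other feature. As a hedged alternative, `pd.crosstab` shows every combination of two features in a single table. The frame `toy` and its rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one flag and one categorical feature
toy = pd.DataFrame({'shot_redirect': [True, False, True, False, False],
                    'shot_technique': ['Normal', 'Normal', 'Volley', 'Lob', 'Normal']})

# Rows are shot_redirect values, columns are shot_technique values
table = pd.crosstab(toy['shot_redirect'],
                    toy['shot_technique'])

print(table)
```

A flag that is spread across several columns of the table, as here, carries information the categorical feature does not.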

## play_pattern_x

In [66]:
extracted_data['play_pattern_x'].value_counts()

Regular Play      2166
From Throw In     1207
From Corner       1006
From Free Kick     912
From Counter       335
From Goal Kick     214
Other               70
From Keeper         68
From Kick Off       65
Name: play_pattern_x, dtype: int64

In [67]:
extracted_data['shot_type'].value_counts()

Open Play    5804
Free Kick     186
Penalty        53
Name: shot_type, dtype: int64

In [68]:
# Compare shot_type 'Free Kick' values v play_pattern_x values

extracted_data.loc[extracted_data['shot_type'] == 'Free Kick']['play_pattern_x'].value_counts()

From Free Kick    186
Name: play_pattern_x, dtype: int64

In [69]:
# Combine shot_type 'Free Kick' values into play_pattern_x as 'Direct Free Kick'

# Assume differentiation between 186 shot_type 'Free Kick' values and 912
# play_pattern_x 'From Free Kick' values is whether the free kick was a direct
# shot or a pass leading to a shot

extracted_data.loc[extracted_data['shot_type'] == 'Free Kick',
                   'play_pattern_x'] = 'Direct Free Kick'

In [70]:
# Compare shot_type 'Penalty' values v play_pattern_x values

extracted_data.loc[extracted_data['shot_type'] == 'Penalty']['play_pattern_x'].value_counts()

Other    53
Name: play_pattern_x, dtype: int64

In [71]:
# Combine shot_type 'Penalty' values into play_pattern_x

extracted_data.loc[extracted_data['shot_type'] == 'Penalty',
                   'play_pattern_x'] = 'Penalty'

In [72]:
# Compare shot_type 'Corner' values v play_pattern_x values

extracted_data.loc[extracted_data['shot_type'] == 'Corner']['play_pattern_x'].value_counts()

Series([], Name: play_pattern_x, dtype: int64)

In [73]:
# No shots have shot_type 'Corner', so there are no values to combine into play_pattern_x

In [74]:
# Drop shot_type

extracted_data.drop('shot_type',
                    axis = 1,
                    inplace = True)

In [75]:
extracted_data['pass_type'].value_counts()

Open Play       3166
No Pass         1924
Corner           400
Recovery         299
Free Kick        201
Throw-in          41
Interception      10
Kick Off           1
Goal Kick          1
Name: pass_type, dtype: int64

In [76]:
# Compare pass_type 'Corner' values v play_pattern_x values

extracted_data.loc[extracted_data['pass_type'] == 'Corner']['play_pattern_x'].value_counts()

From Corner    400
Name: play_pattern_x, dtype: int64

In [77]:
# pass_type 'Corner' values match play_pattern_x 'From Corner' values

In [78]:
# Compare pass_type 'Recovery' values v play_pattern_x values

extracted_data.loc[extracted_data['pass_type'] == 'Recovery']['play_pattern_x'].value_counts()

Regular Play      113
From Throw In      50
From Corner        44
From Counter       39
From Free Kick     33
From Goal Kick     10
From Keeper         7
From Kick Off       2
Other               1
Name: play_pattern_x, dtype: int64

In [79]:
# Compare pass_type 'Interception' values v play_pattern_x values

extracted_data.loc[extracted_data['pass_type'] == 'Interception']['play_pattern_x'].value_counts()

Regular Play      5
From Counter      2
From Throw In     2
From Goal Kick    1
Name: play_pattern_x, dtype: int64

In [80]:
# Combine pass_type 'Recovery' values into play_pattern_x as 'Recovery'

# Assume pass_type 'Recovery' values are events in which the ball was recovered
# from the opposition's play described by play_pattern_x values

# Note: interceptions are a type of recovery, but pass_type 'Interception' values
# are not relabeled by this cell and keep their original play_pattern_x value

extracted_data.loc[extracted_data['pass_type'] == 'Recovery',
                   'play_pattern_x'] = 'Recovery'

In [81]:
# Compare pass_type 'Throw-in' values v play_pattern_x values

extracted_data.loc[extracted_data['pass_type'] == 'Throw-in']['play_pattern_x'].value_counts()

From Throw In    41
Name: play_pattern_x, dtype: int64

In [82]:
# pass_type 'Throw-in' values match play_pattern_x 'From Throw In' values

In [83]:
# Compare pass_type 'Kick Off' values v play_pattern_x values

extracted_data.loc[extracted_data['pass_type'] == 'Kick Off']['play_pattern_x'].value_counts()

From Kick Off    1
Name: play_pattern_x, dtype: int64

In [84]:
# pass_type 'Kick Off' values match play_pattern_x 'From Kick Off' values

In [85]:
# Compare pass_type 'Goal Kick' values v play_pattern_x values

extracted_data.loc[extracted_data['pass_type'] == 'Goal Kick']['play_pattern_x'].value_counts()

From Goal Kick    1
Name: play_pattern_x, dtype: int64

In [86]:
# pass_type 'Goal Kick' values match play_pattern_x 'From Goal Kick' values

In [87]:
# Drop pass_type

extracted_data.drop('pass_type',
                    axis = 1,
                    inplace = True)

In [88]:
extracted_data['counterpress'].value_counts()

False    6038
True        5
Name: counterpress, dtype: int64

In [89]:
# Compare counterpress values v play_pattern_x values

extracted_data.loc[extracted_data['counterpress']]['play_pattern_x'].value_counts()

Regular Play      2
From Counter      2
From Goal Kick    1
Name: play_pattern_x, dtype: int64

In [90]:
# Combine counterpress into play_pattern_x

extracted_data.loc[extracted_data['counterpress'] == True,
                   'play_pattern_x'] = 'From Counterpress'

In [91]:
# Drop counterpress

extracted_data.drop('counterpress',
                    axis = 1,
                    inplace = True)

In [92]:
extracted_data['shot_follows_dribble'].value_counts()

False    6040
True        3
Name: shot_follows_dribble, dtype: int64

In [93]:
# Compare shot_follows_dribble values v play_pattern_x values

extracted_data.loc[extracted_data['shot_follows_dribble']]['play_pattern_x'].value_counts()

Regular Play        1
Recovery            1
Direct Free Kick    1
Name: play_pattern_x, dtype: int64

In [94]:
# Combine shot_follows_dribble into play_pattern_x

extracted_data.loc[extracted_data['shot_follows_dribble'] == True,
                   'play_pattern_x'] = 'From Dribble'

In [95]:
# Drop shot_follows_dribble

extracted_data.drop('shot_follows_dribble',
                    axis = 1,
                    inplace = True)

In [96]:
extracted_data['play_pattern_x'].value_counts()

Regular Play         2050
From Throw In        1157
From Corner           962
From Free Kick        693
Recovery              298
From Counter          294
From Goal Kick        203
Direct Free Kick      185
From Kick Off          63
From Keeper            61
Penalty                53
Other                  16
From Counterpress       5
From Dribble            3
Name: play_pattern_x, dtype: int64

## pass_technique

In [97]:
extracted_data['pass_technique'].value_counts()

Standard        3764
No Pass         1924
Through Ball     198
Inswinging        76
Outswinging       55
Straight          26
Name: pass_technique, dtype: int64

In [98]:
extracted_data['pass_through_ball'].value_counts()

False    5845
True      198
Name: pass_through_ball, dtype: int64

In [99]:
# Compare pass_through_ball values v pass_technique

extracted_data.loc[extracted_data['pass_through_ball']]['pass_technique'].value_counts()

Through Ball    198
Name: pass_technique, dtype: int64

In [100]:
# Drop pass_through_ball

# pass_through_ball True values match pass_technique 'Through Ball' values

extracted_data.drop('pass_through_ball',
                    axis = 1,
                    inplace = True)
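
The check above shows that every `True` flag has the matching `pass_technique` value, and the equal counts (198 and 198) imply the reverse as well. A minimal sketch of a helper that tests both directions at once; the name `flag_matches_value` and the frame `toy_passes` are hypothetical.

```python
import pandas as pd

def flag_matches_value(df, flag, column, value):
    """Return True when flag is True on exactly the rows where column equals value."""
    return bool((df[flag] == (df[column] == value)).all())

# Hypothetical toy data
toy_passes = pd.DataFrame({'pass_through_ball': [True, False, False, True],
                           'pass_cross': [False, True, False, True],
                           'pass_technique': ['Through Ball', 'Standard',
                                              'Inswinging', 'Through Ball']})

# The flag duplicates the categorical value, so it is safe to drop
redundant = flag_matches_value(toy_passes,
                               'pass_through_ball',
                               'pass_technique',
                               'Through Ball')

# The flag disagrees with the categorical value on some rows
not_redundant = flag_matches_value(toy_passes,
                                   'pass_cross',
                                   'pass_technique',
                                   'Through Ball')

print(redundant, not_redundant)
```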

In [101]:
extracted_data['pass_inswinging'].value_counts()

False    5967
True       76
Name: pass_inswinging, dtype: int64

In [102]:
# Compare pass_inswinging values v pass_technique

extracted_data.loc[extracted_data['pass_inswinging']]['pass_technique'].value_counts()

Inswinging    76
Name: pass_technique, dtype: int64

In [103]:
# Drop pass_inswinging

# pass_inswinging True values match pass_technique 'Inswinging' values

extracted_data.drop('pass_inswinging',
                    axis = 1,
                    inplace = True)

In [104]:
extracted_data['pass_outswinging'].value_counts()

False    5988
True       55
Name: pass_outswinging, dtype: int64

In [105]:
# Compare pass_outswinging values v pass_technique

extracted_data.loc[extracted_data['pass_outswinging']]['pass_technique'].value_counts()

Outswinging    55
Name: pass_technique, dtype: int64

In [106]:
# Drop pass_outswinging

# pass_outswinging True values match pass_technique 'Outswinging' values

extracted_data.drop('pass_outswinging',
                    axis = 1,
                    inplace = True)

In [107]:
extracted_data['pass_straight'].value_counts()

False    6017
True       26
Name: pass_straight, dtype: int64

In [108]:
# Compare pass_straight values v pass_technique

extracted_data.loc[extracted_data['pass_straight']]['pass_technique'].value_counts()

Straight    26
Name: pass_technique, dtype: int64

In [109]:
# Drop pass_straight

# pass_straight True values match pass_technique 'Straight' values

extracted_data.drop('pass_straight',
                    axis = 1,
                    inplace = True)

In [110]:
extracted_data['pass_cross'].value_counts()

False    5289
True      754
Name: pass_cross, dtype: int64

In [111]:
# Compare pass_cross values v pass_technique

extracted_data.loc[extracted_data['pass_cross']]['pass_technique'].value_counts()

Standard        751
Through Ball      3
Name: pass_technique, dtype: int64

In [112]:
# Combine pass_cross into pass_technique

extracted_data.loc[extracted_data['pass_cross'] == True,
                   'pass_technique'] = 'Cross'

In [113]:
# Drop pass_cross

extracted_data.drop('pass_cross',
                    axis = 1,
                    inplace = True)

In [114]:
extracted_data['pass_backheel'].value_counts()

False    6030
True       13
Name: pass_backheel, dtype: int64

In [115]:
# Compare pass_backheel values v pass_technique

extracted_data.loc[extracted_data['pass_backheel']]['pass_technique'].value_counts()

Standard    13
Name: pass_technique, dtype: int64

In [116]:
# Combine pass_backheel into pass_technique

extracted_data.loc[extracted_data['pass_backheel'] == True,
                   'pass_technique'] = 'Backheel'

In [117]:
# Drop pass_backheel

extracted_data.drop('pass_backheel',
                    axis = 1,
                    inplace = True)

In [118]:
extracted_data['pass_cut_back'].value_counts()

False    5935
True      108
Name: pass_cut_back, dtype: int64

In [119]:
# Compare pass_cut_back values v pass_technique

extracted_data.loc[extracted_data['pass_cut_back']]['pass_technique'].value_counts()

Cross       63
Standard    43
Backheel     2
Name: pass_technique, dtype: int64

In [120]:
# Combine pass_cut_back into pass_technique for values of 'Standard'

extracted_data.loc[(extracted_data['pass_cut_back'] == True) &
                   (extracted_data['pass_technique'] == 'Standard'),
                   'pass_technique'] = 'Cut Back'

In [121]:
# Drop pass_cut_back

extracted_data.drop('pass_cut_back',
                    axis = 1,
                    inplace = True)

In [122]:
extracted_data['pass_switch'].value_counts()

False    5720
True      323
Name: pass_switch, dtype: int64

In [123]:
# Compare pass_switch values vs. pass_technique

extracted_data.loc[extracted_data['pass_switch'], 'pass_technique'].value_counts()

Standard        182
Cross            46
Inswinging       41
Outswinging      39
Straight         13
Through Ball      2
Name: pass_technique, dtype: int64

In [124]:
# Combine pass_switch into pass_technique for values of 'Standard'

extracted_data.loc[(extracted_data['pass_switch'] == True) &
                   (extracted_data['pass_technique'] == 'Standard'),
                   'pass_technique'] = 'Switch'

In [125]:
# Drop pass_switch

extracted_data.drop('pass_switch',
                    axis = 1,
                    inplace = True)

In [126]:
extracted_data['pass_technique'].value_counts()

Standard        2775
No Pass         1924
Cross            754
Through Ball     195
Switch           182
Inswinging        76
Outswinging       55
Cut Back          43
Straight          26
Backheel          13
Name: pass_technique, dtype: int64

In [127]:
# Combine pass_technique values 'Switch', 'Inswinging', 'Outswinging', and
# 'Straight' into value 'Cross'

# These values are all treated as specific types of cross
# 'Cross' is the least specific of them, but also the most frequent
# Merging them into one broader value may give it more weight in modeling

extracted_data.loc[extracted_data['pass_technique'].isin(['Switch',
                                                          'Inswinging',
                                                          'Outswinging',
                                                          'Straight']),
                   'pass_technique'] = 'Cross'

In [128]:
extracted_data['pass_technique'].value_counts()

Standard        2775
No Pass         1924
Cross           1093
Through Ball     195
Cut Back          43
Backheel          13
Name: pass_technique, dtype: int64
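
The merged 'Cross' count can be checked against the earlier `value_counts()` output: 754 + 182 + 76 + 55 + 26 = 1093. A minimal sketch of that arithmetic, with the pre-merge counts copied from the output above; the variable names are illustrative only:

```python
import pandas as pd

# Counts copied from the pass_technique value_counts() output before the merge
before = pd.Series({
    'Standard': 2775, 'No Pass': 1924, 'Cross': 754, 'Through Ball': 195,
    'Switch': 182, 'Inswinging': 76, 'Outswinging': 55, 'Cut Back': 43,
    'Straight': 26, 'Backheel': 13,
})

merged_into_cross = ['Switch', 'Inswinging', 'Outswinging', 'Straight']

# Relabel the merged values, then sum the counts that now share a label
labels = before.index.map(
    lambda name: 'Cross' if name in merged_into_cross else name)
after = before.groupby(labels).sum().sort_values(ascending=False)

print(after)
print('Total unchanged:', after.sum() == before.sum())
```

The total stays at 6043 events, so the merge relabels rows without adding or losing any.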

# Update Feature Names

In [129]:
extracted_data.rename(columns = {'period_x' : 'period',
                                 'timestamp_x' : 'timestamp',
                                 'play_pattern_x' : 'play_pattern',
                                 'under_pressure_x' : 'under_pressure',
                                 'shot_statsbomb_xg' : 'statsbomb_xg',
                                 'shot_end_location' : 'end_location',
                                 'shot_one_on_one' : 'one_on_one',
                                 'shot_open_goal' : 'open_goal',
                                 'shot_first_time' : 'first_time',
                                 'shot_redirect' : 'redirected',
                                 'shot_deflected' : 'deflected',
                                 'shot_location_y' : 'location_y',
                                 'shot_location_x' : 'location_x',
                                 'pass_technique' : 'pass_type'},
                      inplace = True)

# Update Feature Order

In [130]:
extracted_data = extracted_data[['goal',
                                 'statsbomb_xg',
                                 'period',
                                 'timestamp',
                                 'location_x',
                                 'location_y',
                                 'end_location',
                                 'shot_technique',
                                 'shot_body_part',
                                 'play_pattern',
                                 'pass_length',
                                 'pass_angle',
                                 'pass_height',
                                 'pass_type',
                                 'under_pressure',
                                 'one_on_one',
                                 'open_goal',
                                 'first_time',
                                 'redirected',
                                 'deflected']]
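
Selecting columns by an explicit list both reorders the frame and silently drops any column left off the list. A minimal sketch of a check that reports missing and dropped names before the selection is applied, assuming toy data in which `match_id` is a hypothetical extra column:

```python
import pandas as pd

# Toy frame for illustration only; 'match_id' is a hypothetical extra column
toy = pd.DataFrame({'period_x': [1],
                    'shot_open_goal': [False],
                    'goal': [True],
                    'match_id': [7]})

toy = toy.rename(columns={'period_x': 'period',
                          'shot_open_goal': 'open_goal'})

ordered_cols = ['goal', 'period', 'open_goal']

# Names requested but absent (would raise KeyError on selection)
missing = [col for col in ordered_cols if col not in toy.columns]
# Names present but not requested (would be dropped by the selection)
dropped = [col for col in toy.columns if col not in ordered_cols]

print('Missing:', missing)
print('Dropped:', dropped)

toy = toy[ordered_cols]
```

An empty `missing` list means the rename produced every expected name, and `dropped` makes the implicit removals visible.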

# ProfileReport 2

In [131]:
pr_report = pr(extracted_data)
pr_report


# Cleaned Data

In [132]:
cleaned_data = extracted_data
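
Plain assignment gives the same DataFrame a second name; it does not create an independent copy, so a later in-place change made through either name is visible through both. A minimal sketch of the difference on toy data, with `.copy()` shown only as an alternative, not as what this notebook does:

```python
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({'goal': [True, False]})

alias = toy                # second name for the same object
independent = toy.copy()   # separate object with its own data

# An in-place change through the original name...
toy.drop('goal', axis=1, inplace=True)

# ...is visible through the alias but not through the copy
print(alias is toy)
print(list(alias.columns))
print(list(independent.columns))
```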

In [133]:
cleaned_data.head()

|   | goal | statsbomb_xg | period | timestamp | location_x | location_y | end_location | shot_technique | shot_body_part | play_pattern | pass_length | pass_angle | pass_height | pass_type | under_pressure | one_on_one | open_goal | first_time | redirected | deflected |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | 0.266154 | 1 | 2021-10-08 00:04:38.609 | 46.0 | 109.0 | 45.0 | Normal | Left Foot | Regular Play | 11.18034 | 0.463648 | Ground Pass | Standard | True | False | False | False | False | False |
| 1 | False | 0.093521 | 1 | 2021-10-08 00:11:45.046 | 35.0 | 113.0 | 32.9 | Normal | Head | From Free Kick | 37.735924 | -0.558599 | High Pass | Standard | True | True | False | False | False | False |
| 2 | False | 0.036171 | 1 | 2021-10-08 00:18:03.461 | 43.0 | 94.0 | 42.8 | Normal | Left Foot | Regular Play | 11.18034 | -2.034444 | Ground Pass | Standard | True | False | False | False | False | False |
| 3 | False | 0.016625 | 1 | 2021-10-08 00:23:11.935 | 34.0 | 86.0 | 33.3 | Normal | Left Foot | From Goal Kick | 13.892444 | 2.098871 | Ground Pass | Standard | True | False | False | False | False | False |
| 4 | False | 0.030716 | 1 | 2021-10-08 00:23:45.810 | 33.0 | 94.0 | 34.8 | Normal | Right Foot | From Goal Kick | 14.56022 | 1.292497 | Ground Pass | Standard | False | False | False | False | False | False |


In [134]:
print('Total Events:',
      len(cleaned_data))

Total Events: 6043


In [135]:
print('Total Features:',
      cleaned_data.shape[1])

Total Features: 20


In [136]:
# Save cleaned_data

cleaned_data.to_parquet('/content/drive/MyDrive/expected_goals/data_cleaning/dataframes/cleaned_data.parquet')

In [137]:
# Report the size of the saved cleaned_data file

print('cleaned_data Filesize:',
      path('/content/drive/MyDrive/expected_goals/data_cleaning/dataframes/cleaned_data.parquet').stat().st_size,
      'bytes')

cleaned_data Filesize: 248621 bytes


Continued in [expected_goals_feature_engineering_notebook]()

*Notebook 3 of 7*