**Capstone Project Submission**

* Student Name: Wes Swager
* Student Pace: Full Time
* Instructor Name: Claude Fried
* Scheduled Project Review Date/Time
    * Friday, June 11, 2021, 2:30pm CST
    * Monday, June 13, 2021,

# **Expected Goals Classifier**

# Overview

Create an expected goals classification model from existing historical match data, for use in future match analysis and to produce actionable recommendations that can be applied in training to help improve goal-scoring.

Project detailed on GitHub: [milwaukee_fc](https://github.com/wswager/milwaukee_fc)

# Data Cleaning Notebook

*Notebook 4 of 8*

### Index

1. Data extracted in [expected_goals_data_extraction_notebook]()
2. Data organized in [expected_goals_data_organization_notebook]()
3. Features engineered in [expected_goals_feature_engineering_notebook]()
4. Data cleaned in [expected_goals_data_cleaning_notebook]()
5. Data explored in [expected_goals_data_exploration_notebook]()
6. Data preprocessed in [expected_goals_data_preprocessing_notebook]()
7. Model fitting and refinement in [expected_goals_model_fitting_notebook]()
8. Model assessment in [expected_goals_model_assessment_notebook]()

### Data

Data sourced from [StatsBomb](https://statsbomb.com/), a United Kingdom-based football (soccer) data analytics company.

StatsBomb have provided free access to their proprietary dataset via GitHub: [StatsBomb Open Data](https://github.com/statsbomb/open-data)

In [22]:
# Import data_with_engineered_features from expected_goals_feature_engineering_notebook

data_with_engineered_features = pd.read_csv('/content/drive/MyDrive/flatiron/expected_goals/feature_engineering/data_with_engineered_features.csv')

In [23]:
data_with_engineered_features = data_with_engineered_features.iloc[: , 1:]
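
The positional slice above drops the first column, which presumably holds the row index written by `to_csv` in the previous notebook (an assumption, since that notebook is not shown here). A minimal sketch of an alternative, using a hypothetical in-memory CSV rather than the project file, is to read that column back as the index with `index_col = 0`:

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV written by the previous notebook
saved = pd.DataFrame({'location_x': [109.0, 113.0],
                      'location_y': [46.0, 35.0]})

buffer = io.StringIO()
saved.to_csv(buffer)  # default index = True writes an unnamed first column
buffer.seek(0)

# index_col = 0 reads the first column back as the index, not as a feature
reloaded = pd.read_csv(buffer, index_col = 0)

print(list(reloaded.columns))
```

With this approach no column has to be dropped by position after loading.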

In [4]:
data_with_engineered_features.head()

Unnamed: 0,location_x,location_y,time,statsbomb_xg,outcome,player,team,bodypart,technique,first_touch,state_of_play,assist1,assist2,assist3,assist_state_of_play,inside_18_width,inside_18_depth,inside_18,shot_distance,shot_angle,bodypart_angle,significant_time
0,109.0,46.0,2021-06-14 00:04:38.609,0.266154,Blocked,Francesca Kirby,Chelsea FCW,Left Foot,Normal,False,Open Play,Ground Pass,,,Regular Play,True,True,True,12.529964,118.61,Right - Inside Foot,First 5min
1,113.0,35.0,2021-06-14 00:11:45.046,0.093521,Off T,Bethany England,Chelsea FCW,Head,Normal,False,Open Play,High Pass,,,From Free Kick,True,True,True,8.602325,54.46,Left - Head,Regular Time
2,94.0,43.0,2021-06-14 00:18:03.461,0.036171,Saved,Drew Spence,Chelsea FCW,Left Foot,Normal,False,Open Play,Ground Pass,,,Regular Play,True,False,False,26.172505,96.58,Right - Inside Foot,Regular Time
3,86.0,34.0,2021-06-14 00:23:11.935,0.016625,Off T,Chloe Arthur,Birmingham City WFC,Left Foot,Normal,False,Open Play,Ground Pass,,,From Goal Kick,True,False,False,34.525353,79.99,Left - Outside Foot,Regular Time
4,94.0,33.0,2021-06-14 00:23:45.810,0.030716,Off T,Bethany England,Chelsea FCW,Right Foot,Normal,False,Open Play,Ground Pass,,,From Goal Kick,True,False,False,26.925824,74.93,Left - Inside Foot,Regular Time


<a id = 'packages'></a>
# Packages

In [1]:
# Drive and IO to access saved data
from google.colab import drive, files
drive.mount('/content/drive')

import io

# Pandas for Dataframes
import pandas as pd
from IPython.display import display

# Numpy for mathematical functions
import numpy as np

import warnings
warnings.filterwarnings('ignore')

Mounted at /content/drive


# Reorder Columns

In [24]:
data_with_engineered_features = data_with_engineered_features[['statsbomb_xg',
                                 'outcome',
                                 'time',
                                 'significant_time',
                                 'player',
                                 'team',
                                 'location_x',
                                 'location_y',
                                 'shot_distance',
                                 'inside_18_width',
                                 'inside_18_depth',
                                 'inside_18',
                                 'shot_angle',
                                 'bodypart',
                                 'bodypart_angle',
                                 'technique',
                                 'first_touch',
                                 'assist1',
                                 'assist2',
                                 'assist3',
                                 'state_of_play',
                                 'assist_state_of_play']]

In [25]:
data_with_engineered_features.head()

Unnamed: 0,statsbomb_xg,outcome,time,significant_time,player,team,location_x,location_y,shot_distance,inside_18_width,inside_18_depth,inside_18,shot_angle,bodypart,bodypart_angle,technique,first_touch,assist1,assist2,assist3,state_of_play,assist_state_of_play
0,0.266154,Blocked,2021-06-14 00:04:38.609,First 5min,Francesca Kirby,Chelsea FCW,109.0,46.0,12.529964,True,True,True,118.61,Left Foot,Right - Inside Foot,Normal,False,Ground Pass,,,Open Play,Regular Play
1,0.093521,Off T,2021-06-14 00:11:45.046,Regular Time,Bethany England,Chelsea FCW,113.0,35.0,8.602325,True,True,True,54.46,Head,Left - Head,Normal,False,High Pass,,,Open Play,From Free Kick
2,0.036171,Saved,2021-06-14 00:18:03.461,Regular Time,Drew Spence,Chelsea FCW,94.0,43.0,26.172505,True,False,False,96.58,Left Foot,Right - Inside Foot,Normal,False,Ground Pass,,,Open Play,Regular Play
3,0.016625,Off T,2021-06-14 00:23:11.935,Regular Time,Chloe Arthur,Birmingham City WFC,86.0,34.0,34.525353,True,False,False,79.99,Left Foot,Left - Outside Foot,Normal,False,Ground Pass,,,Open Play,From Goal Kick
4,0.030716,Off T,2021-06-14 00:23:45.810,Regular Time,Bethany England,Chelsea FCW,94.0,33.0,26.925824,True,False,False,74.93,Right Foot,Left - Inside Foot,Normal,False,Ground Pass,,,Open Play,From Goal Kick


# Target Feature

In [26]:
# Display value counts for outcome

display(data_with_engineered_features['outcome'].value_counts(dropna = False))

Off T               1921
Saved               1537
Blocked             1467
Goal                 666
Wayward              336
Post                 136
Saved Off Target      24
Saved to Post         17
Name: outcome, dtype: int64

In [27]:
# Outcome will be used as the target feature
# Convert to a binary goal label, stored as the strings 'True' / 'False'

data_with_engineered_features['outcome'] = data_with_engineered_features['outcome'].apply(lambda i: 'True' if i == 'Goal' else 'False')

data_with_engineered_features.rename(columns = {'outcome' : 'goal'},
                                     inplace = True)

In [28]:
# No NA values

In [29]:
data_with_engineered_features['goal'].value_counts()

False    5438
True      666
Name: goal, dtype: int64
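
The notebook stores the target as the strings 'True' and 'False', and later cells filter on `== 'True'`. As a sketch of an alternative only, the same label can be built as a real boolean with a vectorized comparison, which also makes the class balance easy to compute. The series below is rebuilt from the outcome counts shown above; the variable names are illustrative.

```python
import pandas as pd

# Outcome counts copied from the value_counts output above
outcome_counts = {'Off T': 1921, 'Saved': 1537, 'Blocked': 1467, 'Goal': 666,
                  'Wayward': 336, 'Post': 136, 'Saved Off Target': 24,
                  'Saved to Post': 17}

# Rebuild a series with one entry per shot
outcome = pd.Series([name
                     for name, count in outcome_counts.items()
                     for _ in range(count)],
                    name = 'outcome')

# Vectorized comparison yields a boolean target
goal = (outcome == 'Goal').rename('goal')

# Share of shots that resulted in a goal, in percent
goal_share = round(goal.mean() * 100, 2)

print(goal.value_counts())
print('Goal share (%):', goal_share)
```

If a boolean target were adopted, the later filters comparing against the string 'True' would need to change accordingly.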

# Assessing Feature Values

* For the sake of simplicity in eventual results, values combined if:
  * Values likely represent the same description but may be recorded differently due to subjectivity in recording methods
  * Values both account for a low percentage of the total and, by definition, can be justified as another value

* Features with data extracted from multiple locations due to subjectivity in recording combined and their values standardized

* NA replaced with appropriate descriptions of why they are likely missing

* Assume the following hold unique identifiers or continuous measurements and, therefore, will not be modified:
  * player
  * team
  * shot_distance
  * shot_angle
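
The "low percentage of the total" criterion above can be checked with normalized value counts. The sketch below uses the bodypart counts reported later in this notebook; the 1% cutoff is a hypothetical threshold chosen for illustration, not one stated in the project.

```python
import pandas as pd

# bodypart counts copied from the value_counts output in this notebook
bodypart_counts = {'Right Foot': 3493, 'Left Foot': 1676, 'Head': 926, 'Other': 9}

bodypart = pd.Series([name
                      for name, count in bodypart_counts.items()
                      for _ in range(count)],
                     name = 'bodypart')

# Share of each value, in percent of all shots
shares = (bodypart.value_counts(normalize = True) * 100).round(2)

# Hypothetical cutoff: values below 1% are candidates for combining
low_share_threshold = 1.0
low_share_values = shares[shares < low_share_threshold].index.tolist()

print(shares)
print('Candidates for combining:', low_share_values)
```

A low share alone only flags a candidate; whether a value can be justified as another value still depends on its definition.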

## Time

In [30]:
# Convert time datatype to datetime

data_with_engineered_features['time'] = data_with_engineered_features['time'].astype(str)
data_with_engineered_features['time'] = pd.to_datetime(data_with_engineered_features['time'])

In [31]:
print('Earliest Shot:', data_with_engineered_features['time'].dt.time.min())
print('Latest Shot:', data_with_engineered_features['time'].dt.time.max())

Earliest Shot: 00:00:07.476000
Latest Shot: 00:54:19.062000


In [32]:
# Based on the distribution of times, assume time is measured from start of each half
# 00:00-45:00, plus stoppage-time

# Time does not appear to be measured from start of game, 00:00-90:00, plus stoppage-time

# Time does not appear to be measured as time of day

In [33]:
# Convert time datatype to integer for minutes

# Because time is measured from the start of each half, date and hours are not valuable
# Because shot times are nearly unique and their recording is assumed to be subjective,
# drop seconds

# Converting to integer will allow for direct inclusion in eventual modeling

data_with_engineered_features['time'] = data_with_engineered_features['time'].dt.minute

In [34]:
data_with_engineered_features['time'].describe()

count    6104.000000
mean       23.864187
std        13.977114
min         0.000000
25%        12.000000
50%        24.000000
75%        36.000000
max        54.000000
Name: time, dtype: float64
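
The conversion above can be illustrated on the first three timestamps from the `head()` output. This is a minimal sketch on a standalone series, not the project dataframe.

```python
import pandas as pd

# First three timestamps from the head() output above
sample_times = pd.Series(['2021-06-14 00:04:38.609',
                          '2021-06-14 00:11:45.046',
                          '2021-06-14 00:18:03.461'])

# Parse the strings, then keep only the minute component
parsed = pd.to_datetime(sample_times)
minutes = parsed.dt.minute

print(minutes.tolist())
```

Because `.dt.minute` discards the hour, this only works while every timestamp falls within the first hour, which holds here given the latest time of 00:54:19.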

## Significant Time

In [35]:
# Display value counts for significant_time

data_with_engineered_features['significant_time'].value_counts()

Regular Time     4433
Last 5min         611
First 5min        611
Stoppage Time     449
Name: significant_time, dtype: int64

In [36]:
# Assess ratio of shots and goals within specified intervals

# Create a dataframe of time intervals

time_intervals_list = ['First 5min', 'Regular Time',
                       'Last 5min', 'Stoppage-Time']

time_intervals_df = pd.DataFrame(time_intervals_list)

# Calculate ratios each interval represents of total

time_ratios_list = []
time_ratios_list.append((round(((5 / 45) * 100), 2)))
time_ratios_list.append((round((((40 - 5) / 45) * 100), 2)))
time_ratios_list.append((round((((45 - 40) / 45) * 100), 2)))
time_ratios_list.append((round((((55 - 45) / 45) * 100), 2)))

time_ratios_df = pd.DataFrame(time_ratios_list)

# Concatenate time_intervals_df and time_ratios_df

time_interval_ratios = pd.concat([time_intervals_df,
                                  time_ratios_df],
                                  axis = 1)
time_interval_ratios.columns = ['Time Interval', 'Ratio Time']

# Calculate ratio of shots within defined time intervals

shot_time_ratios = []
shot_time_ratios.append(round(((len(data_with_engineered_features[(data_with_engineered_features['time'] <
                                             5)])) /
                          (len(data_with_engineered_features)) * 100), 2))
shot_time_ratios.append(round(((len(data_with_engineered_features[(data_with_engineered_features['time'] >
                                             5) &
                                            (data_with_engineered_features['time'] <
                                             40)])) /
                          (len(data_with_engineered_features)) * 100), 2))
shot_time_ratios.append(round(((len(data_with_engineered_features[(data_with_engineered_features['time'] >
                                             40) &
                                            (data_with_engineered_features['time'] <
                                             45)])) /
                          (len(data_with_engineered_features)) * 100), 2))
shot_time_ratios.append(round(((len(data_with_engineered_features[(data_with_engineered_features['time'] >
                                             45)])) /
                          (len(data_with_engineered_features)) * 100), 2))

# Add shot_time_ratios to time_interval_ratios

time_interval_ratios['Ratio Total Shots'] = shot_time_ratios

# Calculate ratio of goals within defined time intervals

goal_time_ratios = []
goal_time_ratios.append(round(((len(data_with_engineered_features[(data_with_engineered_features['goal'] == 'True') &
                                                 (data_with_engineered_features['time'] <
                                                  5)]) /
                                len(data_with_engineered_features[data_with_engineered_features['goal'] == 'True'])) * 100), 2))
goal_time_ratios.append(round(((len(data_with_engineered_features[(data_with_engineered_features['goal'] == 'True') &
                                                 (data_with_engineered_features['time'] >
                                                  5) &
                                                 (data_with_engineered_features['time'] <
                                                  40)]) /
                                len(data_with_engineered_features[data_with_engineered_features['goal'] == 'True'])) * 100), 2))
goal_time_ratios.append(round(((len(data_with_engineered_features[(data_with_engineered_features['goal'] == 'True') &
                                                 (data_with_engineered_features['time'] >
                                                  40) &
                                                 (data_with_engineered_features['time'] <
                                                  45)]) /
                                len(data_with_engineered_features[data_with_engineered_features['goal'] == 'True'])) * 100), 2))
goal_time_ratios.append(round(((len(data_with_engineered_features[(data_with_engineered_features['goal'] == 'True') &
                                                 (data_with_engineered_features['time'] >
                                                  45)]) /
                                len(data_with_engineered_features[data_with_engineered_features['goal'] == 'True'])) * 100), 2))

# Add goal_time_ratios to time_interval_ratios

time_interval_ratios['Ratio Total Goals'] = goal_time_ratios

# Calculate the ratio of shots within the interval which resulted in a goal

goal_shot_time_ratios = []
goal_shot_time_ratios.append(round((((len(data_with_engineered_features[(data_with_engineered_features['goal'] == 'True') &
                                                                        (data_with_engineered_features['time'] < 5)]))/
                                     (len(data_with_engineered_features[(data_with_engineered_features['time'] < 5)]))) *
                                    100), 2))
goal_shot_time_ratios.append(round((((len(data_with_engineered_features[(data_with_engineered_features['goal'] == 'True') &
                                                                        (data_with_engineered_features['time'] > 5) &
                                                                        (data_with_engineered_features['time'] < 35)]))/
                                     (len(data_with_engineered_features[(data_with_engineered_features['time'] > 5) &
                                                                        (data_with_engineered_features['time'] < 35)]))) *
                                    100), 2))
goal_shot_time_ratios.append(round((((len(data_with_engineered_features[(data_with_engineered_features['goal'] == 'True') &
                                                                        (data_with_engineered_features['time'] > 40) &
                                                                        (data_with_engineered_features['time'] < 45)]))/
                                     (len(data_with_engineered_features[(data_with_engineered_features['time'] > 40) &
                                                                        (data_with_engineered_features['time'] < 45)]))) *
                                    100), 2))
goal_shot_time_ratios.append(round((((len(data_with_engineered_features[(data_with_engineered_features['goal'] == 'True') &
                                                                        (data_with_engineered_features['time'] > 45)]))/
                                     (len(data_with_engineered_features[(data_with_engineered_features['time'] > 45)]))) *
                                    100), 2))
# Add goal_shot_time_ratios to time_interval_ratios

time_interval_ratios['Ratio Goals within Interval'] = goal_shot_time_ratios

time_interval_ratios = time_interval_ratios[['Time Interval',
                                            'Ratio Time',
                                            'Ratio Total Shots',
                                            'Ratio Total Goals',
                                            'Ratio Goals within Interval']]
# Display results

time_interval_ratios

Unnamed: 0,Time Interval,Ratio Time,Ratio Total Shots,Ratio Total Goals,Ratio Goals within Interval
0,First 5min,11.11,10.01,9.46,10.31
1,Regular Time,77.78,70.77,74.17,11.46
2,Last 5min,11.11,8.06,7.81,10.57
3,Stoppage-Time,22.22,5.36,3.75,7.65
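
The cell above uses strict comparisons, so shots recorded at exactly minute 5, 40, or 45 fall into no interval, and one of the regular-time filters uses 35 rather than 40 as its upper bound. A possible alternative, sketched here on hypothetical toy data with an assumed upper bound of 60 minutes, is to assign every shot to exactly one left-closed bin with `pd.cut` and aggregate per bin:

```python
import pandas as pd

# Hypothetical toy shots: minute within the half and goal flag (1 = goal)
shots = pd.DataFrame({'time': [0, 4, 5, 20, 39, 40, 44, 45, 50],
                      'goal': [0, 1, 0, 0, 1, 0, 0, 1, 0]})

# Left-closed bins: [0, 5), [5, 40), [40, 45), [45, 60)
bins = [0, 5, 40, 45, 60]
labels = ['First 5min', 'Regular Time', 'Last 5min', 'Stoppage Time']

shots['interval'] = pd.cut(shots['time'],
                           bins = bins,
                           labels = labels,
                           right = False)

# Shots, goals, and goal rate per interval
grouped = shots.groupby('interval', observed = False)['goal']
summary = pd.DataFrame({'shots': grouped.size(),
                        'goals': grouped.sum(),
                        'goal_rate_pct': (grouped.mean() * 100).round(2)})

print(summary)
```

Since each minute belongs to exactly one bin, the per-interval shot counts always sum to the total number of shots.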


In [37]:
# By inspection, ratios of shots and goals broadly follow the ratio of time

# Ratios of goals within the specified intervals fall in a narrow range (7.65% to 11.46%)

# Time intervals do not appear to be strongly linked with goals

# Drop significant_time

data_with_engineered_features.drop('significant_time',
                                   axis = 1,
                                   inplace = True)

## Bodypart

In [38]:
# Display value counts for bodypart

display(data_with_engineered_features['bodypart'].value_counts(dropna = False))

Right Foot    3493
Left Foot     1676
Head           926
Other            9
Name: bodypart, dtype: int64

In [39]:
# Rename 'Other' as 'Other Bodypart' for clarity

data_with_engineered_features['bodypart'] = data_with_engineered_features['bodypart'].replace({'Other' : 'Other Bodypart'})

In [40]:
# No NA values

## Bodypart Angle

In [41]:
data_with_engineered_features['bodypart_angle'].value_counts()

Right - Outside Foot    1882
Left - Inside Foot      1518
Left - Outside Foot      891
Right - Inside Foot      734
Right - Head             440
Left - Head              436
Other                    203
Name: bodypart_angle, dtype: int64

In [None]:
# No changes required for values

In [42]:
# No NA Values

## Technique

In [43]:
# Display value counts for technique

data_with_engineered_features['technique'].value_counts(dropna = False)

Normal           5154
Half Volley       509
Volley            337
Lob                55
Backheel           20
Diving Header      15
Overhead Kick      14
Name: technique, dtype: int64

In [44]:
# Rename 'Normal' as 'Ground' for clarity

# 'Lob' and 'Backheel' are types of shots from the ground, so can be included as 'Ground'

# 'Half Volley', 'Diving Header', and 'Overhead Kick' are types of shots from the air,
# so can be included as 'Volley'

data_with_engineered_features['technique'] = data_with_engineered_features['technique'].replace({'Normal' : 'Ground',
                                                                                                 'Half Volley' : 'Volley',
                                                                                                 'Lob' : 'Ground',
                                                                                                 'Backheel' : 'Ground',
                                                                                                 'Diving Header' : 'Volley',
                                                                                                 'Overhead Kick' : 'Volley'})

In [45]:
# No NA values

In [46]:
data_with_engineered_features['technique'].value_counts()

Ground    5229
Volley     875
Name: technique, dtype: int64
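
The grouping above can be reproduced on a standalone series rebuilt from the original technique counts. The sketch assigns the result of `replace` back to the variable rather than relying on `inplace = True` on a column selection, which, depending on the pandas version, may not update the parent dataframe.

```python
import pandas as pd

# technique counts copied from the value_counts output above
technique_counts = {'Normal': 5154, 'Half Volley': 509, 'Volley': 337,
                    'Lob': 55, 'Backheel': 20, 'Diving Header': 15,
                    'Overhead Kick': 14}

technique = pd.Series([name
                       for name, count in technique_counts.items()
                       for _ in range(count)],
                      name = 'technique')

# Same mapping as in the notebook; 'Volley' is already a target value
grouping = {'Normal': 'Ground',
            'Half Volley': 'Volley',
            'Lob': 'Ground',
            'Backheel': 'Ground',
            'Diving Header': 'Volley',
            'Overhead Kick': 'Volley'}

# Assign the result back instead of modifying in place
technique = technique.replace(grouping)
grouped_counts = technique.value_counts()

print(grouped_counts)
```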

## First Touch

In [47]:
# Display value counts for first_touch

data_with_engineered_features['first_touch'].value_counts(dropna = False)

False    4808
True     1296
Name: first_touch, dtype: int64

In [48]:
# Values for first_touch are boolean

In [49]:
# No NA values

## Assist

In [50]:
# Display value counts for assist1

data_with_engineered_features['assist1'].value_counts(dropna = False)

Ground Pass    2088
NaN            1953
High Pass      1532
Low Pass        531
Name: assist1, dtype: int64

In [51]:
# Display value counts for assist2

data_with_engineered_features['assist2'].value_counts(dropna = False)

NaN             5746
Through Ball     200
Inswinging        76
Outswinging       56
Straight          26
Name: assist2, dtype: int64

In [52]:
# Display value counts for assist3

data_with_engineered_features['assist3'].value_counts(dropna = False)

NaN             5104
Cross            758
Through Ball     197
Cut Back          45
Name: assist3, dtype: int64

In [53]:
# assist1 was extracted from the 'pass' subset of the StatsBomb events data
# and holds the values 'Ground Pass', 'High Pass', and 'Low Pass'

# assist2 was extracted from the 'technique' subset of the StatsBomb events data
# and defines values 'Outswinging', 'Inswinging', and 'Straight' as types of crosses

# assist3 was extracted from StatsBomb events data points which included
# unique subsets 'Cross', 'Cut Back', and 'Through Ball' with boolean values

# For consistency, 'Outswinging', 'Inswinging', and 'Straight' from assist2 can be
# included as 'Cross'

# 'Cut Back' can be included as 'Cross'

data_with_engineered_features['assist2'] = data_with_engineered_features['assist2'].replace({'Outswinging' : 'Cross',
                                                                                             'Straight' : 'Cross',
                                                                                             'Inswinging' : 'Cross'})

data_with_engineered_features['assist3'] = data_with_engineered_features['assist3'].replace({'Cut Back' : 'Cross'})

In [54]:
# Combined three sources of assist data into single feature

data_with_engineered_features['assist3'] = data_with_engineered_features['assist3'].fillna(data_with_engineered_features['assist2'])
data_with_engineered_features['assist3'] = data_with_engineered_features['assist3'].fillna(data_with_engineered_features['assist1'])

data_with_engineered_features.drop(['assist1', 'assist2'],
                    axis = 1,
                    inplace = True)

data_with_engineered_features.rename(columns = {'assist3': 'assist'},
                      inplace = True)

In [55]:
# Fill NA values with 'Unassisted'

# Assume shots with NA assist values were unassisted
# Shooting player did not receive the ball via a pass from a teammate

data_with_engineered_features['assist'] = data_with_engineered_features['assist'].fillna('Unassisted')

In [56]:
# Display value counts for new combined assist

data_with_engineered_features['assist'].value_counts(dropna = False)

Unassisted      1953
Ground Pass     1787
Cross            961
High Pass        830
Low Pass         376
Through Ball     197
Name: assist, dtype: int64
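
The priority order used when combining the three assist sources (assist3 first, then assist2, then assist1, then 'Unassisted') can be sketched as a single chain of `fillna` calls. The four rows below are hypothetical and chosen to exercise each fallback.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, one per fallback case
toy = pd.DataFrame({'assist1': ['Ground Pass', 'High Pass', 'High Pass', np.nan],
                    'assist2': [np.nan, 'Inswinging', np.nan, np.nan],
                    'assist3': ['Through Ball', np.nan, 'Cut Back', np.nan]})

# Values the notebook treats as types of cross
cross_like = {'Inswinging': 'Cross',
              'Outswinging': 'Cross',
              'Straight': 'Cross',
              'Cut Back': 'Cross'}

# Most specific source first, falling back to less specific sources
assist = (toy['assist3'].replace(cross_like)
          .fillna(toy['assist2'].replace(cross_like))
          .fillna(toy['assist1'])
          .fillna('Unassisted'))

print(assist.tolist())
```

Each `fillna` only fills rows that are still missing, so a value from a more specific source is never overwritten by a less specific one.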

## State of Play

In [57]:
# Display value counts for state_of_play

data_with_engineered_features['state_of_play'].value_counts(dropna = False)

Open Play    5858
Free Kick     191
Penalty        53
Corner          2
Name: state_of_play, dtype: int64

In [58]:
# Rename 'Open Play' as 'Open Play - Shot' to distinguish it from 'Open Play - Assist'

# Add prefix 'Set Piece' to 'Free Kick' and 'Penalty' to distinguish
# set piece types from open play types

# Replace 'Free Kick' with 'Direct Free Kick' to specify a shot taken directly, as opposed to
# 'From Free Kick' in assist_state_of_play

# 'Corner' implies a shot taken directly on goal from a corner set piece
# It occurs only 2 times and, although it is not the same type of play, the intent,
# a direct shot from a set piece, is the same; therefore, it can be included as
# 'Direct Free Kick'

data_with_engineered_features['state_of_play'] = data_with_engineered_features['state_of_play'].replace({'Open Play' : 'Open Play - Shot',
                                                                                                         'Free Kick' : 'Set Piece - Direct Free Kick',
                                                                                                         'Penalty' : 'Set Piece - Penalty',
                                                                                                         'Corner' : 'Set Piece - Direct Free Kick'})

In [59]:
# No NA values

In [60]:
data_with_engineered_features['state_of_play'].value_counts(dropna = False)

Open Play - Shot                5858
Set Piece - Direct Free Kick     193
Set Piece - Penalty               53
Name: state_of_play, dtype: int64

## Assist State of Play Values

In [61]:
# Display value counts for assist_state_of_play

data_with_engineered_features['assist_state_of_play'].value_counts(dropna = False)

NaN               1953
Regular Play      1608
From Throw In      844
From Corner        667
From Free Kick     489
From Counter       296
From Goal Kick     146
From Keeper         48
From Kick Off       45
Other                8
Name: assist_state_of_play, dtype: int64

In [62]:
# Replace 'Regular Play' with 'Open Play - Assist' for clarity

# Add prefix 'Set Piece' to 'From Throw In', 'From Corner',
# 'From Free Kick', and 'From Goal Kick' to distinguish set piece types from open play types

# 'From Keeper' is an assist from open play, specifically
# from the keeper, as opposed to 'From Goal Kick', which is likely also from the keeper
# but is a goal kick set piece
# 'From Keeper' can be included with 'Open Play - Assist'

# 'From Kick Off' occurs only 45 times; although a kick off is a set piece type,
# the distance the play travels means it can be included as 'Open Play - Assist'

# Replace 'From Counter' with 'Open Play - Counter Attack' to specify an assist from open play,
# but specifically from a counter attack

# Due to lack of specificity, assume 'Other' can be included with 'Open Play - Assist'
# The NA value count, 1953, matches the unassisted count from assist, so assume 'Other' cannot
# be included with NA

data_with_engineered_features['assist_state_of_play'] = data_with_engineered_features['assist_state_of_play'].replace({'Regular Play' : 'Open Play - Assist',
                                                                                                                       'From Throw In' : 'Set Piece - Throw In',
                                                                                                                       'From Corner' : 'Set Piece - Corner',
                                                                                                                       'From Free Kick' : 'Set Piece - Free Kick',
                                                                                                                       'From Counter' : 'Open Play - Counter Attack',
                                                                                                                       'From Goal Kick' : 'Set Piece - Goal Kick',
                                                                                                                       'From Keeper' : 'Open Play - Assist',
                                                                                                                       'From Kick Off' : 'Open Play - Assist',
                                                                                                                       'Other' : 'Open Play - Assist'})

In [63]:
# Fill NA values with 'No Assist'

# Assume shots with NA assist_state_of_play values were unassisted

data_with_engineered_features['assist_state_of_play'] = data_with_engineered_features['assist_state_of_play'].fillna('No Assist')

In [64]:
data_with_engineered_features['assist_state_of_play'].value_counts()

No Assist                     1953
Open Play - Assist            1709
Set Piece - Throw In           844
Set Piece - Corner             667
Set Piece - Free Kick          489
Open Play - Counter Attack     296
Set Piece - Goal Kick          146
Name: assist_state_of_play, dtype: int64

In [65]:
# 'Open Play - Counter Attack' will be retained as a separate value, despite accounting
# for a low percentage of the total, due to the uniqueness of the state of play and the expectation
# of a comparatively high correlation with high xG

# Location

In [66]:
# location_x and location_y values were utilized to engineer shot_distance, inside_18,
# and shot_angle, and are no longer relevant independently

# Drop location_x and location_y

data_with_engineered_features.drop(['location_x',
                                    'location_y'],
                                   axis = 1,
                                   inplace = True)

# Inside 18

In [67]:
print('Ratio of Shots Within the Width of the 18-Yard Box:',
      (round((len(data_with_engineered_features[(data_with_engineered_features['inside_18_width'] == True)]) /
              (len(data_with_engineered_features)) * 100), 2)), '%')
print('Ratio of Shots Within the Depth of the 18-Yard Box:',
      (round((len(data_with_engineered_features[(data_with_engineered_features['inside_18_depth'] == True)]) /
              (len(data_with_engineered_features)) * 100), 2)), '%')
print('Ratio of Shots Within the 18-Yard Box:',
      (round((len(data_with_engineered_features[(data_with_engineered_features['inside_18'] == True)]) /
              (len(data_with_engineered_features)) * 100), 2)), '%')

Ratio of Shots Within the Width of the 18-Yard Box: 93.69 %
Ratio of Shots Within the Depth of the 18-Yard Box: 61.39 %
Ratio of Shots Within the 18-Yard Box: 58.78 %
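
Because these columns are boolean, the same percentages can be obtained from the column means. The sketch below uses four hypothetical shots and assumes `inside_18` is true only when both the width and depth flags are true, which matches the sample rows shown earlier but is not confirmed by this notebook.

```python
import pandas as pd

# Hypothetical shots with width and depth flags
toy = pd.DataFrame({'inside_18_width': [True, True, True, False],
                    'inside_18_depth': [True, True, False, False]})

# Assumption: inside the box means inside both its width and its depth
toy['inside_18'] = toy['inside_18_width'] & toy['inside_18_depth']

# The mean of a boolean column is the share of True values
ratios = (toy[['inside_18_width',
               'inside_18_depth',
               'inside_18']].mean() * 100).round(2)

print(ratios)
```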


In [68]:
# Majority of shots within the width of the 18-yard box
# Difference between shots within depth and shots within both width and depth minimal
# inside_18_width and inside_18_depth not valuable independently

# Drop inside_18_width and inside_18_depth

data_with_engineered_features.drop(['inside_18_width',
                                    'inside_18_depth'],
                                   axis = 1,
                                   inplace = True)

# Combine State of Play with Assist State of Play

In [69]:
# state_of_play only distinguishes shots taken directly from a set piece,
# 'Set Piece - Direct Free Kick' and 'Set Piece - Penalty', from open play
# If the shot followed an assist, whatever the assist's state of play, the
# shot is still defined as open play

# assist_state_of_play includes greater variety defining the overall play

# Combine state_of_play values and assist_state_of_play

data_with_engineered_features['state_of_play'] = np.where(data_with_engineered_features['state_of_play'] == 'Open Play - Shot', 
                                                          data_with_engineered_features['assist_state_of_play'],
                                                          data_with_engineered_features['state_of_play'])

data_with_engineered_features.drop('assist_state_of_play',
                                   axis = 1,
                                   inplace = True)

data_with_engineered_features['state_of_play'] = data_with_engineered_features['state_of_play'].replace({'Open Play - Assist' : 'Open Play',
                                                                                                         'No Assist' : 'Open Play - Unassisted'})

In [70]:
data_with_engineered_features['state_of_play'].value_counts()

Open Play                       1709
Open Play - Unassisted          1707
Set Piece - Throw In             844
Set Piece - Corner               667
Set Piece - Free Kick            489
Open Play - Counter Attack       296
Set Piece - Direct Free Kick     193
Set Piece - Goal Kick            146
Set Piece - Penalty               53
Name: state_of_play, dtype: int64
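
The combination rule above (keep the shot's own state of play for direct set pieces, otherwise take the assist's state of play) can be sketched on a few hypothetical rows:

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering assisted, set piece, and unassisted shots
toy = pd.DataFrame({'state_of_play': ['Open Play - Shot',
                                      'Open Play - Shot',
                                      'Set Piece - Penalty',
                                      'Open Play - Shot'],
                    'assist_state_of_play': ['Open Play - Assist',
                                             'Set Piece - Corner',
                                             'No Assist',
                                             'No Assist']})

# For open play shots take the assist's state of play,
# otherwise keep the shot's own set piece value
combined = pd.Series(np.where(toy['state_of_play'] == 'Open Play - Shot',
                              toy['assist_state_of_play'],
                              toy['state_of_play']),
                     index = toy.index,
                     name = 'state_of_play')

combined = combined.replace({'Open Play - Assist': 'Open Play',
                             'No Assist': 'Open Play - Unassisted'})

print(combined.tolist())
```

The penalty row keeps its set piece label even though its assist value is 'No Assist', which is why the unassisted count drops from 1953 to 1707 once the 193 direct free kicks and 53 penalties are excluded.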

# Cleaned Data

In [71]:
cleaned_data = data_with_engineered_features
cleaned_data.head()

Unnamed: 0,statsbomb_xg,goal,time,player,team,shot_distance,inside_18,shot_angle,bodypart,bodypart_angle,technique,first_touch,assist,state_of_play
0,0.266154,False,4,Francesca Kirby,Chelsea FCW,12.529964,True,118.61,Left Foot,Right - Inside Foot,Ground,False,Ground Pass,Open Play
1,0.093521,False,11,Bethany England,Chelsea FCW,8.602325,True,54.46,Head,Left - Head,Ground,False,High Pass,Set Piece - Free Kick
2,0.036171,False,18,Drew Spence,Chelsea FCW,26.172505,False,96.58,Left Foot,Right - Inside Foot,Ground,False,Ground Pass,Open Play
3,0.016625,False,23,Chloe Arthur,Birmingham City WFC,34.525353,False,79.99,Left Foot,Left - Outside Foot,Ground,False,Ground Pass,Set Piece - Goal Kick
4,0.030716,False,23,Bethany England,Chelsea FCW,26.925824,False,74.93,Right Foot,Left - Inside Foot,Ground,False,Ground Pass,Set Piece - Goal Kick


In [72]:
cleaned_data.to_csv('/content/drive/MyDrive/flatiron/expected_goals/data_cleaning/cleaned_data.csv')

Continued in [expected_goals_data_exploration_notebook]()