### Capstone Project Submission

* Student Name: Wes Swager
* Student Pace: Full Time
* Instructor Name: Claude Fried
* Scheduled Project Review Date/Time
    * Unknown

# Data Cleaning

<a id = 'proposal'></a>
### Proposal

**Problem Statement**

Create an expected goals (xG) metric from historical match data that can be used to analyze future matches and to provide specific training recommendations for improving the likelihood of scoring.

**Supervised Learning Target**

A classification model that predicts the probability of a goal (as a percentage), given features describing the shot and the preceding play.
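
As a rough illustration of this target, the sketch below derives a binary goal indicator from an `outcome` column. The toy rows and the column name `is_goal` are assumptions made for the example and are not part of the project data.

```python
import pandas as pd

# Toy shots table; the values are made up for illustration only.
shots = pd.DataFrame({
    'outcome': ['Goal', 'Saved', 'Off T', 'Goal', 'Blocked'],
    'location_x': [112.0, 101.0, 95.0, 108.0, 99.0],
})

# Binary target: 1 if the shot was scored, 0 for every other outcome.
# 'is_goal' is a hypothetical column name used only in this sketch.
shots['is_goal'] = (shots['outcome'] == 'Goal').astype(int)

# Share of shots that were goals, i.e. the base rate of the positive class.
goal_rate = shots['is_goal'].mean()

print(shots[['outcome', 'is_goal']])
print(f'Goal rate: {goal_rate:.2f}')
```

A classifier trained on such a target would output a probability per shot, which is the quantity an expected goals metric reports.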

**Data Source**

[StatsBomb Open Data](https://github.com/statsbomb/open-data)

# Contents

* **[Proposal](#proposal)**
* **[Packages](#packages)**
* **[Data](#data)**

<a id = 'packages'></a>
# Packages

In [1]:
# Pandas for Dataframes
import pandas as pd

# NumPy for numerical functions
import numpy as np

import warnings
warnings.filterwarnings('ignore')

<a id = 'data'></a>
# Data

Data was extracted from [StatsBomb Open Data](https://github.com/statsbomb/open-data) in [expected_goals_data_extraction_notebook](https://github.com/wswager/expected_goals/blob/main/notebooks/expected_goals_data_extraction_notebook.ipynb).

In [2]:
# Import dataframes extracted from
# expected_goals_data_extraction_notebook

%store -r extracted_data

In [3]:
extracted_data.head()

| | location_x | location_y | time | statsbomb_xg | outcome | player_shot | team | bodypart | technique | first_time | state_of_play | assist | assist2 | assist3 | assist_state_of_play |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 109.0 | 46.0 | 2021-06-05 00:04:38.609 | 0.266154 | Blocked | Francesca Kirby | Chelsea FCW | Left Foot | Normal | False | Open Play | Ground Pass | NaN | NaN | Regular Play |
| 1 | 113.0 | 35.0 | 2021-06-05 00:11:45.046 | 0.093521 | Off T | Bethany England | Chelsea FCW | Head | Normal | False | Open Play | High Pass | NaN | NaN | From Free Kick |
| 2 | 94.0 | 43.0 | 2021-06-05 00:18:03.461 | 0.036171 | Saved | Drew Spence | Chelsea FCW | Left Foot | Normal | False | Open Play | Ground Pass | NaN | NaN | Regular Play |
| 3 | 86.0 | 34.0 | 2021-06-05 00:23:11.935 | 0.016625 | Off T | Chloe Arthur | Birmingham City WFC | Left Foot | Normal | False | Open Play | Ground Pass | NaN | NaN | From Goal Kick |
| 4 | 94.0 | 33.0 | 2021-06-05 00:23:45.810 | 0.030716 | Off T | Bethany England | Chelsea FCW | Right Foot | Normal | False | Open Play | Ground Pass | NaN | NaN | From Goal Kick |


# Outcome

In [4]:
extracted_data['outcome'].value_counts()

Off T               1921
Saved               1537
Blocked             1467
Goal                 666
Wayward              336
Post                 136
Saved Off Target      24
Saved to Post         17
Name: outcome, dtype: int64

In [5]:
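# Combine similar values for outcome.
# Assign the result back rather than using inplace = True on a column
# selection, which may not update the DataFrame in newer pandas versions

extracted_data['outcome'] = extracted_data['outcome'].replace({'Off T' : 'Off Target',
                                                               'Wayward' : 'Off Target',
                                                               'Post' : 'Off Target',
                                                               'Saved Off Target' : 'Saved',
                                                               'Saved to Post' : 'Saved'})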

In [6]:
extracted_data['outcome'].value_counts()

Off Target    2393
Saved         1578
Blocked       1467
Goal           666
Name: outcome, dtype: int64
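
The counts above are consistent with the recoding (1921 + 336 + 136 = 2393 for Off Target and 1537 + 24 + 17 = 1578 for Saved). A minimal sketch of how such a recoding could be checked automatically is shown below. The toy series and the names `outcome_map` and `expected` are assumptions for illustration.

```python
import pandas as pd

# Toy outcome column using the raw labels seen in the value counts.
raw_outcome = pd.Series(['Off T', 'Wayward', 'Post', 'Saved to Post',
                         'Saved Off Target', 'Goal', 'Blocked', 'Saved'])

# Hypothetical name for the same mapping used in the cleaning step.
outcome_map = {'Off T': 'Off Target',
               'Wayward': 'Off Target',
               'Post': 'Off Target',
               'Saved Off Target': 'Saved',
               'Saved to Post': 'Saved'}

# Assign the result to a new Series instead of replacing in place.
clean_outcome = raw_outcome.replace(outcome_map)

# Every remaining label should belong to the expected final set.
expected = {'Off Target', 'Saved', 'Blocked', 'Goal'}
unexpected = set(clean_outcome.unique()) - expected
assert not unexpected, f'Unmapped outcome labels: {unexpected}'

# Recoding only relabels rows, so the row count must be unchanged.
assert len(clean_outcome) == len(raw_outcome)

print(clean_outcome.value_counts())
```

A check like this fails loudly if a new raw label appears in a later data extract, rather than silently leaving it as its own category.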

# Bodypart

In [7]:
extracted_data['bodypart'].value_counts()

Right Foot    3493
Left Foot     1676
Head           926
Other            9
Name: bodypart, dtype: int64

# Technique

In [8]:
extracted_data['technique'].value_counts()

Normal           5154
Half Volley       509
Volley            337
Lob                55
Backheel           20
Diving Header      15
Overhead Kick      14
Name: technique, dtype: int64

In [9]:
# Combine rare technique values into the closest common category

extracted_data['technique'] = extracted_data['technique'].replace({'Lob' : 'Normal',
                                                                   'Backheel' : 'Normal',
                                                                   'Diving Header' : 'Volley',
                                                                   'Overhead Kick' : 'Volley'})

In [10]:
extracted_data['technique'].value_counts()

Normal         5229
Half Volley     509
Volley          366
Name: technique, dtype: int64

# Combine Assist Values

In [11]:
extracted_data['assist'].value_counts()

Ground Pass    2088
High Pass      1532
Low Pass        531
Name: assist, dtype: int64

In [12]:
extracted_data['assist2'].value_counts()

Through Ball    200
Inswinging       76
Outswinging      56
Straight         26
Name: assist2, dtype: int64

In [13]:
extracted_data['assist3'].value_counts()

Cross           758
Through Ball    197
Cut Back         45
Name: assist3, dtype: int64

In [14]:
# Replace specific types of crosses from 'technique' feature
# with standard 'Cross' from 'pass' feature

extracted_data['assist2'] = extracted_data['assist2'].replace({'Outswinging' : 'Cross',
                                                               'Straight' : 'Cross',
                                                               'Inswinging' : 'Cross'})

# Combine three sources of assist data into a single column,
# preferring assist3, then assist2, then assist

extracted_data['assist3'] = extracted_data['assist3'].fillna(extracted_data['assist2'])
extracted_data['assist3'] = extracted_data['assist3'].fillna(extracted_data['assist'])

extracted_data = extracted_data.drop(['assist', 'assist2'], axis = 1)

extracted_data = extracted_data.rename(columns = {'assist3': 'assist'})

In [15]:
extracted_data['assist'].value_counts()

Ground Pass     1787
Cross            916
High Pass        830
Low Pass         376
Through Ball     197
Cut Back          45
Name: assist, dtype: int64
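
The combination above takes the first non-missing value in the order `assist3`, `assist2`, `assist`. The same priority can be sketched with `Series.combine_first` on a toy frame. The rows and the column name `assist_combined` are assumptions for illustration and do not come from the project data.

```python
import numpy as np
import pandas as pd

# Toy frame with the three assist columns; values are illustrative only.
toy = pd.DataFrame({
    'assist':  ['Ground Pass', 'High Pass', 'High Pass', np.nan],
    'assist2': [np.nan, 'Inswinging', np.nan, np.nan],
    'assist3': [np.nan, np.nan, 'Cross', np.nan],
})

# Collapse the specific cross types into a single 'Cross' label.
toy['assist2'] = toy['assist2'].replace({'Outswinging': 'Cross',
                                         'Straight': 'Cross',
                                         'Inswinging': 'Cross'})

# Take assist3 where present, otherwise assist2, otherwise assist.
# 'assist_combined' is a hypothetical column name for this sketch.
toy['assist_combined'] = (toy['assist3']
                          .combine_first(toy['assist2'])
                          .combine_first(toy['assist']))

print(toy)
```

Rows with no value in any of the three columns stay missing, which matches the 4151 non-null assist values reported later by `info()`.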

# Match State of Play vs. Assist State of Play Values

In [16]:
extracted_data['state_of_play'].value_counts()

Open Play    5858
Free Kick     191
Penalty        53
Corner          2
Name: state_of_play, dtype: int64

In [17]:
extracted_data['assist_state_of_play'].value_counts()

Regular Play      1608
From Throw In      844
From Corner        667
From Free Kick     489
From Counter       296
From Goal Kick     146
From Keeper         48
From Kick Off       45
Other                8
Name: assist_state_of_play, dtype: int64

In [18]:
# Replace value names for state_of_play and assist_state_of_play
# for consistency

extracted_data['state_of_play'] = extracted_data['state_of_play'].replace({'Free Kick' : 'Set Piece',
                                                                           'Penalty' : 'Set Piece',
                                                                           'Corner' : 'Set Piece'})

extracted_data['assist_state_of_play'] = extracted_data['assist_state_of_play'].replace({'From Free Kick' : 'Set Piece'})

# Only 'From Free Kick' is recoded as a set piece;
# assume missing values are 'Open Play'.
# Other existing labels (e.g. 'Regular Play') are left unchanged

extracted_data['assist_state_of_play'] = extracted_data['assist_state_of_play'].fillna('Open Play')

In [19]:
extracted_data['state_of_play'].value_counts()

Open Play    5858
Set Piece     246
Name: state_of_play, dtype: int64

In [20]:
extracted_data['assist_state_of_play'].value_counts()

Open Play         1953
Regular Play      1608
From Throw In      844
From Corner        667
Set Piece          489
From Counter       296
From Goal Kick     146
From Keeper         48
From Kick Off       45
Other                8
Name: assist_state_of_play, dtype: int64
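
The counts above still contain both 'Open Play' and 'Regular Play'. If these two labels are meant to describe the same situation (an assumption, not something the data source confirms), they could be merged with one further replacement. The sketch below shows this on a toy series; the extra step is hypothetical and is not applied in this notebook.

```python
import pandas as pd

# Toy assist_state_of_play values, including missing entries.
toy_state = pd.Series(['Regular Play', None, 'From Free Kick',
                       'From Corner', None])

# Same steps as the cleaning cell: recode free kicks, then fill missing values.
toy_state = (toy_state
             .replace({'From Free Kick': 'Set Piece'})
             .fillna('Open Play'))

# Hypothetical extra step: treat 'Regular Play' as 'Open Play'.
toy_state = toy_state.replace({'Regular Play': 'Open Play'})

print(toy_state.value_counts())
```

Whether labels such as 'From Corner' should also be grouped under 'Set Piece' is a separate modeling decision and is left open here.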

# Fill NA values

In [21]:
extracted_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6104 entries, 0 to 6103
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   location_x            6104 non-null   float64       
 1   location_y            6104 non-null   float64       
 2   time                  6104 non-null   datetime64[ns]
 3   statsbomb_xg          6104 non-null   float64       
 4   outcome               6104 non-null   object        
 5   player_shot           6104 non-null   object        
 6   team                  6104 non-null   object        
 7   bodypart              6104 non-null   object        
 8   technique             6104 non-null   object        
 9   first_time            6104 non-null   bool          
 10  state_of_play         6104 non-null   object        
 11  assist                4151 non-null   object        
 12  assist_state_of_play  6104 non-null   object        
dtypes: bool(1), datetime64[ns](1), float64(3), object(8)

In [22]:
# Fill NA values for assist with the string 'None' (shots with no recorded assist)

extracted_data['assist'] = extracted_data['assist'].fillna('None')

# Reorder Columns

In [23]:
extracted_data = extracted_data[['statsbomb_xg',
                                 'outcome',
                                 'time',
                                 'player_shot',
                                 'team',
                                 'location_x',
                                 'location_y',
                                 'bodypart',
                                 'technique',
                                 'first_time',
                                 'state_of_play',
                                 'assist',
                                 'assist_state_of_play']]

# Cleaned Data

In [24]:
# Copy so that later changes to cleaned_data do not alter extracted_data
cleaned_data = extracted_data.copy()
cleaned_data.head()

| | statsbomb_xg | outcome | time | player_shot | team | location_x | location_y | bodypart | technique | first_time | state_of_play | assist | assist_state_of_play |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.266154 | Blocked | 2021-06-05 00:04:38.609 | Francesca Kirby | Chelsea FCW | 109.0 | 46.0 | Left Foot | Normal | False | Open Play | Ground Pass | Regular Play |
| 1 | 0.093521 | Off Target | 2021-06-05 00:11:45.046 | Bethany England | Chelsea FCW | 113.0 | 35.0 | Head | Normal | False | Open Play | High Pass | Set Piece |
| 2 | 0.036171 | Saved | 2021-06-05 00:18:03.461 | Drew Spence | Chelsea FCW | 94.0 | 43.0 | Left Foot | Normal | False | Open Play | Ground Pass | Regular Play |
| 3 | 0.016625 | Off Target | 2021-06-05 00:23:11.935 | Chloe Arthur | Birmingham City WFC | 86.0 | 34.0 | Left Foot | Normal | False | Open Play | Ground Pass | From Goal Kick |
| 4 | 0.030716 | Off Target | 2021-06-05 00:23:45.810 | Bethany England | Chelsea FCW | 94.0 | 33.0 | Right Foot | Normal | False | Open Play | Ground Pass | From Goal Kick |


In [25]:
cleaned_data.to_csv(r'C:\Users\westi\Documents\github\expected_goals\data\saved_dataframes\data_cleaning\cleaned_data.csv')
%store cleaned_data

Stored 'cleaned_data' (DataFrame)
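
The save step above uses an absolute path that only exists on one machine. A more portable variant could build the path with `pathlib` and skip writing the index, as sketched below. The temporary directory, the toy frame, and the folder layout are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the cleaned data; values are illustrative only.
toy_cleaned = pd.DataFrame({'statsbomb_xg': [0.266154, 0.093521],
                            'outcome': ['Blocked', 'Off Target']})

# A temporary directory stands in for the repository root in this sketch.
with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical folder layout below the repository root.
    out_dir = Path(tmp_dir) / 'data' / 'saved_dataframes' / 'data_cleaning'
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / 'cleaned_data.csv'

    # index=False avoids an extra unnamed index column when the file is read back.
    toy_cleaned.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())
```

In a real notebook the temporary directory would be replaced by a path relative to the project, so that the same cell runs on any checkout of the repository.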
