# **Expected Goals Classifier**

## Overview

Classification model for expected goals (xG) in women's club soccer, predicting the likelihood that a shot will score using data extracted from StatsBomb.

Project Github: [Expected Goals Classifier]()

# Data Cleaning Notebook

Continued from expected_goals_data_extraction_notebook

*Notebook 2 of 7*

### Index

1. [Data Extraction]()
2. [Data Cleaning]()
3. [Features Engineering]()
4. [Data Exploration]()
5. [Data Preprocessing]()
6. [Prediction Modeling]()
7. [Conclusions]()

<a id = 'packages'></a>
# Packages

In [963]:
# Drive  and IO to access saved files

from google.colab import drive, files
drive.mount('/content/drive')

import io

# Pathlib for file retrieval and assessment

import pathlib
from pathlib import Path as path

# warnings to ignore warnings

import warnings
warnings.filterwarnings('ignore')

# Pandas for dataframes

import pandas as pd

# Numpy for mathematical functions

import numpy as np

# Plotly for visualizations

import plotly.express as px

# Scipy for statistical functions

from scipy import stats

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Functions

In [964]:
def save_df (df,
             name):
  
  # Save dataframe to drive in parquet format

  df.to_parquet(str('/content/drive/MyDrive/expected_goals/data_cleaning/dataframes/' + name + '.parquet'))

  print(name,
        'Filesize:',
        path(str('/content/drive/MyDrive/expected_goals/data_cleaning/dataframes/' + name + '.parquet')).stat().st_size,
        'bytes')

In [965]:
def df_description(df):
  
  print("Total Events:",
        len(df),
        '\n',
        'Total Features:',
        df.shape[1])

In [966]:
def value_percent_v_total(feature,
                          target_value):
  print(feature,
        "Percent '",
        target_value,
        "':",
        round((((sum(extracted_data[feature] == target_value)) /
                (len(extracted_data))) *
               100),
              2),
        '%')

In [967]:
def compare_value_count(feature_1,
                        value,
                        feature_2):

  # Display value count for one defined feature when
  # a second defined feature has a defined value
  
  return extracted_data.loc[extracted_data[feature_1] == value][feature_2].value_counts(dropna = False)

# Data

In [968]:
# Import extracted_data from expected_goals_data_extraction_notebook

extracted_data = pd.read_parquet('/content/drive/MyDrive/expected_goals/data_extraction/dataframes/extracted_data.parquet')

In [969]:
extracted_data.head()

Unnamed: 0,id,index_x,period_x,timestamp_x,minute_x,second_x,type_x,possession_x,possession_team_x,play_pattern_x,team_x,player_x,position_x,location_x,duration_x,match_id_x,shot_statsbomb_xg,shot_end_location,shot_key_pass_id,shot_body_part,shot_technique,shot_type,shot_outcome,shot_freeze_frame,possession_team_id_x,player_id_x,shot_first_time,under_pressure_x,shot_one_on_one,shot_deflected,off_camera_x,shot_aerial_won,shot_open_goal,out_x,shot_redirect,shot_saved_off_target,shot_saved_to_post,shot_follows_dribble,index_y,period_y,...,position_y,location_y,duration_y,related_events_y,match_id_y,pass_recipient,pass_length,pass_angle,pass_height,pass_end_location,pass_body_part,pass_type,possession_team_id_y,player_id_y,under_pressure_y,pass_outcome,pass_assisted_shot_id,pass_shot_assist,pass_technique,pass_through_ball,pass_cross,pass_switch,pass_goal_assist,pass_aerial_won,pass_backheel,pass_deflected,counterpress,off_camera_y,pass_cut_back,pass_miscommunication,pass_outswinging,pass_straight,pass_inswinging,pass_no_touch,out_y,related_events_1,related_events_2,related_events_3,related_events_4,goal
0,d9c27699-dd12-4e55-96d6-4c95685e4c66,42,1,00:00:47.620,0,47,Shot,4,Chelsea FCW,From Counter,Chelsea FCW,Francesca Kirby,Right Center Forward,"[115.0, 25.0]",0.56,7298,0.092624,"[117.0, 34.0]",abb17b2a-1775-4226-9158-67efe68ee0c3,Right Foot,Normal,Open Play,Blocked,"[{'location': [115.0, 36.0], 'player': {'id': ...",971,4641,,,,,,,,,,,,,33.0,1.0,...,Center Back,"[44.0, 17.0]",3.453,[7fbb0f53-f758-4667-9993-062fde493f1c],7298.0,Francesca Kirby,51.351727,0.117109,High Pass,"[95.0, 23.0]",Right Foot,,971.0,4657.0,,,d9c27699-dd12-4e55-96d6-4c95685e4c66,True,Through Ball,True,,,,,,,,,,,,,,,,524c23dc-265a-4cfc-b367-bd37356c0185,f59f670e-cb52-4805-875a-bdd20adcd9f7,,,False
1,4eb844e2-9466-424a-abe3-1ba730afe716,237,1,00:05:12.780,5,12,Shot,15,Chelsea FCW,From Throw In,Chelsea FCW,Francesca Kirby,Right Center Forward,"[109.0, 51.0]",0.4,7298,0.041837,"[112.0, 44.0]",fcb92aad-eba5-4053-8606-e9d70856ddd9,Left Foot,Normal,Open Play,Blocked,"[{'location': [113.0, 46.0], 'player': {'id': ...",971,4641,,,,,,,,,,,,,231.0,1.0,...,Right Center Midfield,"[102.0, 45.0]",1.56,[9998635e-bd33-4f9e-aa46-81aa674b65b4],7298.0,Francesca Kirby,14.866069,0.737815,Low Pass,"[113.0, 55.0]",Head,,971.0,10395.0,,,4eb844e2-9466-424a-abe3-1ba730afe716,True,,,,,,,,,,,,,,,,,,1880a599-93d5-450a-ac3d-0dc6b7b0609a,a61f6f58-49b6-47f0-9b2e-f9d61f9da195,,,False
2,3e41e219-ac37-4e1c-97c3-eca7cd886484,243,1,00:05:41.940,5,41,Shot,16,Chelsea FCW,From Corner,Chelsea FCW,So-Yun Ji,Center Midfield,"[99.0, 52.0]",0.48,7298,0.017603,"[108.0, 51.0]",,Right Foot,Half Volley,Open Play,Blocked,"[{'location': [111.0, 44.0], 'player': {'id': ...",971,4647,True,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2fb3a5ea-a6cd-4231-ac7f-ea88e85c109a,94bbaf41-4573-415f-9576-0f6dac70493f,,,False
3,f1c6e6cb-ab00-4223-8980-09747ce40924,248,1,00:05:43.900,5,43,Shot,16,Chelsea FCW,From Corner,Chelsea FCW,Drew Spence,Left Center Midfield,"[107.0, 40.0]",0.16,7298,0.144138,"[112.0, 37.0]",,Right Foot,Normal,Open Play,Blocked,"[{'location': [99.0, 52.0], 'player': {'id': 4...",971,4638,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,9b005c5d-8621-4764-838b-dd7695925bee,d3d11c93-6603-455e-b031-aa27a4810617,,,False
4,33060f55-fd05-4a3b-bf81-4315b9cb6417,256,1,00:05:46.380,5,46,Shot,16,Chelsea FCW,From Corner,Chelsea FCW,Millie Bright,Right Center Back,"[108.0, 32.0]",1.48,7298,0.068946,"[120.0, 43.2, 2.0]",bd21493e-5b27-499b-a0ba-367c3b18a70e,Left Foot,Normal,Open Play,Goal,"[{'location': [103.0, 36.0], 'player': {'id': ...",971,4642,True,True,,,,,,,,,,,251.0,1.0,...,Center Back,"[104.0, 36.0]",0.96,[0f23d443-778e-4c00-9ac3-b23c8d0f6f3b],7298.0,Millie Bright,8.602325,-0.950547,Ground Pass,"[109.0, 29.0]",Right Foot,Recovery,971.0,4657.0,,,33060f55-fd05-4a3b-bf81-4315b9cb6417,,,,,,True,,,,,,,,,,,,,55631e56-7585-498c-923f-b0f9f47f790a,56f9c1ff-b529-479f-9c3e-00e706a6d242,5f05f74f-e4fe-4db5-9823-a479dc68e6a6,,True


# Target Feature

In [970]:
# xG measures the likelihood a shot will result in a goal

# The target outcome for eventual modeling will be
# if the shot resulted in a goal

# Currently, shot_outcome includes this description

In [971]:
# Display value counts for shot_outcome

extracted_data['shot_outcome'].value_counts(dropna = False)

Off T               1921
Saved               1539
Blocked             1472
Goal                 668
Wayward              337
Post                 136
Saved Off Target      24
Saved to Post         17
Name: shot_outcome, dtype: int64

In [972]:
# Update shot_outcome to boolean goal

extracted_data['goal'] = extracted_data['shot_outcome'].apply(lambda x: 1 if x == 'Goal' else 0)

extracted_data['goal'] = extracted_data['goal'].astype(bool)

In [973]:
extracted_data['goal'].head()

0    False
1    False
2    False
3    False
4     True
Name: goal, dtype: bool

In [974]:
extracted_data['goal'].value_counts()

False    5446
True      668
Name: goal, dtype: int64

In [975]:
value_percent_v_total('goal',
                      True)

goal Percent ' True ': 10.93 %


In [976]:
df_description(extracted_data)

Total Events: 6114 
 Total Features: 89


In [977]:
save_df(extracted_data,
        'extracted_data')

extracted_data Filesize: 2282149 bytes


# Irrelevant Data

In [978]:
# Current list of features

list(extracted_data.columns.values)

['id',
 'index_x',
 'period_x',
 'timestamp_x',
 'minute_x',
 'second_x',
 'type_x',
 'possession_x',
 'possession_team_x',
 'play_pattern_x',
 'team_x',
 'player_x',
 'position_x',
 'location_x',
 'duration_x',
 'match_id_x',
 'shot_statsbomb_xg',
 'shot_end_location',
 'shot_key_pass_id',
 'shot_body_part',
 'shot_technique',
 'shot_type',
 'shot_outcome',
 'shot_freeze_frame',
 'possession_team_id_x',
 'player_id_x',
 'shot_first_time',
 'under_pressure_x',
 'shot_one_on_one',
 'shot_deflected',
 'off_camera_x',
 'shot_aerial_won',
 'shot_open_goal',
 'out_x',
 'shot_redirect',
 'shot_saved_off_target',
 'shot_saved_to_post',
 'shot_follows_dribble',
 'index_y',
 'period_y',
 'timestamp_y',
 'minute_y',
 'second_y',
 'type_y',
 'possession_y',
 'possession_team_y',
 'play_pattern_y',
 'team_y',
 'player_y',
 'position_y',
 'location_y',
 'duration_y',
 'related_events_y',
 'match_id_y',
 'pass_recipient',
 'pass_length',
 'pass_angle',
 'pass_height',
 'pass_end_location',
 'pass_bo

### Duplicate Features

In [979]:
# Define duplicate features

duplicate_features = ['shot_saved_off_target',
                     'shot_saved_to_post',
                     'pass_outcome',
                     'pass_assisted_shot_id',
                     'pass_shot_assist',
                     'pass_goal_assist',
                     'index_y',
                     'period_y',
                     'timestamp_y',
                     'minute_x',
                     'second_x',
                     'minute_y',
                     'second_y',
                     'type_y',
                     'possession_y',
                     'possession_team_y',
                     'play_pattern_y',
                     'team_y',
                     'player_y',
                     'position_y',
                     'duration_y',
                     'related_events_y',
                     'match_id_y',
                     'under_pressure_y',
                     'off_camera_y',
                     'out_y']

In [980]:
# Drop duplicate features

extracted_data.drop(duplicate_features,
                    axis = 1,
                    inplace = True)

### Non Shot-Specific Features

In [981]:
# Define features unrelated to shots

non_shot_specific_features = ['id',
                              'index_x',
                              'type_x',
                              'possession_x',
                              'possession_team_x',
                              'possession_team_id_x',
                              'team_x',
                              'player_x',
                              'player_id_x',
                              'position_x',
                              'duration_x',
                              'related_events_1',
                              'related_events_2',
                              'related_events_3',
                              'related_events_4',
                              'match_id_x',
                              'shot_key_pass_id',
                              'shot_freeze_frame',
                              'out_x',
                              'off_camera_x',
                              'possession_team_id_y',
                              'player_id_y',
                              'pass_recipient',
                              'pass_body_part',
                              'pass_aerial_won',
                              'pass_deflected',
                              'pass_miscommunication',
                              'pass_no_touch']

In [982]:
# Drop features unrelated to shots

extracted_data.drop(non_shot_specific_features,
                    axis = 1,
                    inplace = True)

### Results

In [983]:
extracted_data.head()

Unnamed: 0,period_x,timestamp_x,play_pattern_x,location_x,shot_statsbomb_xg,shot_end_location,shot_body_part,shot_technique,shot_type,shot_outcome,shot_first_time,under_pressure_x,shot_one_on_one,shot_deflected,shot_aerial_won,shot_open_goal,shot_redirect,shot_follows_dribble,location_y,pass_length,pass_angle,pass_height,pass_end_location,pass_type,pass_technique,pass_through_ball,pass_cross,pass_switch,pass_backheel,counterpress,pass_cut_back,pass_outswinging,pass_straight,pass_inswinging,goal
0,1,00:00:47.620,From Counter,"[115.0, 25.0]",0.092624,"[117.0, 34.0]",Right Foot,Normal,Open Play,Blocked,,,,,,,,,"[44.0, 17.0]",51.351727,0.117109,High Pass,"[95.0, 23.0]",,Through Ball,True,,,,,,,,,False
1,1,00:05:12.780,From Throw In,"[109.0, 51.0]",0.041837,"[112.0, 44.0]",Left Foot,Normal,Open Play,Blocked,,,,,,,,,"[102.0, 45.0]",14.866069,0.737815,Low Pass,"[113.0, 55.0]",,,,,,,,,,,,False
2,1,00:05:41.940,From Corner,"[99.0, 52.0]",0.017603,"[108.0, 51.0]",Right Foot,Half Volley,Open Play,Blocked,True,,,,,,,,,,,,,,,,,,,,,,,,False
3,1,00:05:43.900,From Corner,"[107.0, 40.0]",0.144138,"[112.0, 37.0]",Right Foot,Normal,Open Play,Blocked,,,,,,,,,,,,,,,,,,,,,,,,,False
4,1,00:05:46.380,From Corner,"[108.0, 32.0]",0.068946,"[120.0, 43.2, 2.0]",Left Foot,Normal,Open Play,Goal,True,True,,,,,,,"[104.0, 36.0]",8.602325,-0.950547,Ground Pass,"[109.0, 29.0]",Recovery,,,,,,,,,,,True


In [984]:
df_description(extracted_data)

Total Events: 6114 
 Total Features: 35


In [985]:
# Updated list of features

list(extracted_data.columns.values)

['period_x',
 'timestamp_x',
 'play_pattern_x',
 'location_x',
 'shot_statsbomb_xg',
 'shot_end_location',
 'shot_body_part',
 'shot_technique',
 'shot_type',
 'shot_outcome',
 'shot_first_time',
 'under_pressure_x',
 'shot_one_on_one',
 'shot_deflected',
 'shot_aerial_won',
 'shot_open_goal',
 'shot_redirect',
 'shot_follows_dribble',
 'location_y',
 'pass_length',
 'pass_angle',
 'pass_height',
 'pass_end_location',
 'pass_type',
 'pass_technique',
 'pass_through_ball',
 'pass_cross',
 'pass_switch',
 'pass_backheel',
 'counterpress',
 'pass_cut_back',
 'pass_outswinging',
 'pass_straight',
 'pass_inswinging',
 'goal']

In [986]:
save_df(extracted_data,
        'extracted_data')

extracted_data Filesize: 328553 bytes


# Data Types

In [987]:
extracted_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6114 entries, 0 to 6113
Data columns (total 35 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   period_x              6114 non-null   int64  
 1   timestamp_x           6114 non-null   object 
 2   play_pattern_x        6114 non-null   object 
 3   location_x            6114 non-null   object 
 4   shot_statsbomb_xg     6114 non-null   float64
 5   shot_end_location     6114 non-null   object 
 6   shot_body_part        6114 non-null   object 
 7   shot_technique        6114 non-null   object 
 8   shot_type             6114 non-null   object 
 9   shot_outcome          6114 non-null   object 
 10  shot_first_time       1299 non-null   object 
 11  under_pressure_x      1052 non-null   object 
 12  shot_one_on_one       332 non-null    object 
 13  shot_deflected        64 non-null     object 
 14  shot_aerial_won       362 non-null    object 
 15  shot_open_goal       

## Boolean Features

In [988]:
# Define boolean features

boolean_features = ['shot_one_on_one',
                    'shot_aerial_won',
                    'shot_open_goal',
                    'shot_first_time',
                    'shot_redirect',
                    'shot_deflected',
                    'shot_follows_dribble',
                    'under_pressure_x',
                    'counterpress',
                    'pass_switch',
                    'pass_through_ball',
                    'pass_backheel',
                    'pass_cross',
                    'pass_cut_back',
                    'pass_inswinging',
                    'pass_straight',
                    'pass_outswinging']

In [989]:
# Convert boolean features to boolean data type

extracted_data[boolean_features] = extracted_data[boolean_features].astype(bool)

## Datetime

In [990]:
# Convert timestamp_x datatype to datetime

extracted_data['timestamp_x'] = extracted_data['timestamp_x'].astype(str)
extracted_data['timestamp_x'] = pd.to_datetime(extracted_data['timestamp_x'])

## Results

In [991]:
extracted_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6114 entries, 0 to 6113
Data columns (total 35 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   period_x              6114 non-null   int64         
 1   timestamp_x           6114 non-null   datetime64[ns]
 2   play_pattern_x        6114 non-null   object        
 3   location_x            6114 non-null   object        
 4   shot_statsbomb_xg     6114 non-null   float64       
 5   shot_end_location     6114 non-null   object        
 6   shot_body_part        6114 non-null   object        
 7   shot_technique        6114 non-null   object        
 8   shot_type             6114 non-null   object        
 9   shot_outcome          6114 non-null   object        
 10  shot_first_time       6114 non-null   bool          
 11  under_pressure_x      6114 non-null   bool          
 12  shot_one_on_one       6114 non-null   bool          
 13  shot_deflected    

In [992]:
# Note: location_x, location_y, shot_end_location, and
# pass_end_location datatypes will be addressed in a later step

In [993]:
save_df(extracted_data,
        'extracted_data')

extracted_data Filesize: 319156 bytes


# Missing Values

In [994]:
extracted_data.isnull().sum()

period_x                   0
timestamp_x                0
play_pattern_x             0
location_x                 0
shot_statsbomb_xg          0
shot_end_location          0
shot_body_part             0
shot_technique             0
shot_type                  0
shot_outcome               0
shot_first_time            0
under_pressure_x           0
shot_one_on_one            0
shot_deflected             0
shot_aerial_won            0
shot_open_goal             0
shot_redirect              0
shot_follows_dribble       0
location_y              1950
pass_length             1950
pass_angle              1950
pass_height             1950
pass_end_location       1950
pass_type               5148
pass_technique          5758
pass_through_ball          0
pass_cross                 0
pass_switch                0
pass_backheel              0
counterpress               0
pass_cut_back              0
pass_outswinging           0
pass_straight              0
pass_inswinging            0
goal          

## No Pass Shots

In [995]:
def feature_na(feature):

  # Identify total NA values for defined feature

  return (sum(extracted_data[feature].isna()))

def print_feature_na(feature):

    print(feature,
        'NA :',
        feature_na(feature))

In [996]:
print_feature_na('pass_length')
print_feature_na('pass_angle')
print_feature_na('pass_height')

pass_length NA : 1950
pass_angle NA : 1950
pass_height NA : 1950


In [997]:
def compare_feature_na(feature):
  
  # Compare NA values of defined feature with a
  # separate defined feature
  
  print('pass_length NA =/=',
        feature,
        'NA:',
        (feature_na('pass_length')) -
        (sum(extracted_data.loc[extracted_data['pass_length'].isna()][feature].isna())))

In [998]:
# Compare pass_length, pass_angle, and pass_height NA

compare_feature_na('pass_angle')
compare_feature_na('pass_height')

pass_length NA =/= pass_angle NA: 0
pass_length NA =/= pass_height NA: 0


In [999]:
# Note: pass_length, pass_angle, and pass_height each have 1942 missing values

# Note: The 1942 missing values for each pass_length, pass_angle, and pass_height
# are the same events

# Assume these missing values are shots which were not preceded by a pass ('No Pass Shots')

In [1000]:
print('Percent no pass shots:',
      (round((((sum(extracted_data['pass_length'].isna())) /
               (len(extracted_data))) *
              100),
             2)),
      '%')

Percent no pass shots: 31.89 %


In [1001]:
# 31.84% of values is too significant to drop

# Features are too significant to drop: No preceding pass is valuable in its
# description of the play preceeding the shot in a manner which may be
# significant for xG

### Functions

In [1002]:
def NA_percent_v_total(feature):

  # Calculate percent defined feature NA total v
  # total events

  print(feature,
        'Percent NA:',
        round((((sum(extracted_data[feature].isna())) /
                (len(extracted_data))) *
               100),
              2),
        '%')

In [1003]:
def fill_feature_na(feature,
                    value):
  
  # Fill defined feature NA with defined value

  extracted_data[feature].fillna(value,
                                 inplace = True)

In [1004]:
def fill_no_pass(feature,
                 value):

  # Fill defined feature values for events identified
  # as no pass shots with defined value
  
  extracted_data.loc[extracted_data['pass_length'] == 0,
                     [feature]] = value

### Numerical Features

In [1005]:
# Define no pass numerical features

no_pass_numerical = ['pass_length',
                     'pass_angle']

# Display current no_pass_numerical NA sums

extracted_data[no_pass_numerical].isnull().sum()

pass_length    1950
pass_angle     1950
dtype: int64

In [1006]:
# Fill numerical features no pass events with 0

fill_feature_na('pass_length',
                0)

fill_feature_na('pass_angle',
                0)

In [1007]:
# Display updated no_pass_numerical NA sums

extracted_data[no_pass_numerical].isnull().sum()

pass_length    0
pass_angle     0
dtype: int64

### Categorical Features

In [1008]:
# Define no pass categorical features

no_pass_categorical = ['pass_height',
                       'pass_type',
                       'pass_technique']

# Display current no_pass_categorical NA sums

extracted_data[no_pass_categorical].isnull().sum()

pass_height       1950
pass_type         5148
pass_technique    5758
dtype: int64

In [1009]:
# Compare no pass shots v pass_type values

compare_value_count('pass_length',
                    0,
                    'pass_type')

NaN    1951
Name: pass_type, dtype: int64

In [1010]:
# Compare no pass shots v pass_technique values

compare_value_count('pass_length',
                    0,
                    'pass_technique')

NaN    1951
Name: pass_technique, dtype: int64

In [1011]:
# Note: All events identified as no pass shots currently
# have NA values for pass_type and pass_technique

In [1012]:
fill_no_pass('pass_height',
             'No Pass')

fill_no_pass('pass_type',
             'No Pass')

fill_no_pass('pass_technique',
             'No Pass')

In [1013]:
# Display updated no_pass_categorical NA sums

extracted_data[no_pass_categorical].isnull().sum()

pass_height          0
pass_type         3197
pass_technique    3807
dtype: int64

In [1014]:
# Note: Additional pass_type and pass_technique NA
# will be addressed in later step

### Boolean Features

In [1015]:
# Define boolean pass features

boolean_pass_features = ['pass_switch',
                         'pass_through_ball',
                         'pass_backheel',
                         'pass_cross',
                         'pass_cut_back',
                         'pass_inswinging',
                         'pass_straight',
                         'pass_outswinging']

# Display current boolean_pass_features NA sums

extracted_data[boolean_pass_features].isnull().sum()

pass_switch          0
pass_through_ball    0
pass_backheel        0
pass_cross           0
pass_cut_back        0
pass_inswinging      0
pass_straight        0
pass_outswinging     0
dtype: int64

In [1016]:
# Compare shot events with no pass v boolean_pass_features values

extracted_data.loc[extracted_data['pass_length'] == 0][boolean_pass_features].value_counts()

pass_switch  pass_through_ball  pass_backheel  pass_cross  pass_cut_back  pass_inswinging  pass_straight  pass_outswinging
False        False              False          False       False          False            False          False               1951
dtype: int64

In [1017]:
# Note: No boolean pass features currently contain NA

# Note: All boolean pass feature values for no pass events
# are currently 'False'

# No changes required

## Additional Missing Values

In [1018]:
extracted_data.isnull().sum()

period_x                   0
timestamp_x                0
play_pattern_x             0
location_x                 0
shot_statsbomb_xg          0
shot_end_location          0
shot_body_part             0
shot_technique             0
shot_type                  0
shot_outcome               0
shot_first_time            0
under_pressure_x           0
shot_one_on_one            0
shot_deflected             0
shot_aerial_won            0
shot_open_goal             0
shot_redirect              0
shot_follows_dribble       0
location_y              1950
pass_length                0
pass_angle                 0
pass_height                0
pass_end_location       1950
pass_type               3197
pass_technique          3807
pass_through_ball          0
pass_cross                 0
pass_switch                0
pass_backheel              0
counterpress               0
pass_cut_back              0
pass_outswinging           0
pass_straight              0
pass_inswinging            0
goal          

### pass_type

In [1019]:
extracted_data['pass_type'].value_counts(dropna = False)

NaN             3197
No Pass         1951
Corner           402
Recovery         307
Free Kick        202
Throw-in          42
Interception      11
Kick Off           1
Goal Kick          1
Name: pass_type, dtype: int64

In [1020]:
# pass_type defined values are set-plays and defensive recoveries
# Assume missing values are from open play

# Fill pass_type missing values with 'Open Play'

fill_feature_na('pass_type',
                'Open Play')

In [1021]:
extracted_data['pass_type'].value_counts(dropna = False)

Open Play       3197
No Pass         1951
Corner           402
Recovery         307
Free Kick        202
Throw-in          42
Interception      11
Kick Off           1
Goal Kick          1
Name: pass_type, dtype: int64

### pass_technique

In [1022]:
extracted_data['pass_technique'].value_counts(dropna = False)

NaN             3807
No Pass         1951
Through Ball     199
Inswinging        76
Outswinging       55
Straight          26
Name: pass_technique, dtype: int64

In [1023]:
# pass_technique defined values are specialized passes
# Assume missing values are standard passes

# Fill pass_technique missing values with 'Standard'

fill_feature_na('pass_technique',
                'Standard')

In [1024]:
extracted_data['pass_technique'].value_counts(dropna = False)

Standard        3807
No Pass         1951
Through Ball     199
Inswinging        76
Outswinging       55
Straight          26
Name: pass_technique, dtype: int64

## Results

In [1025]:
extracted_data.isnull().sum()

period_x                   0
timestamp_x                0
play_pattern_x             0
location_x                 0
shot_statsbomb_xg          0
shot_end_location          0
shot_body_part             0
shot_technique             0
shot_type                  0
shot_outcome               0
shot_first_time            0
under_pressure_x           0
shot_one_on_one            0
shot_deflected             0
shot_aerial_won            0
shot_open_goal             0
shot_redirect              0
shot_follows_dribble       0
location_y              1950
pass_length                0
pass_angle                 0
pass_height                0
pass_end_location       1950
pass_type                  0
pass_technique             0
pass_through_ball          0
pass_cross                 0
pass_switch                0
pass_backheel              0
counterpress               0
pass_cut_back              0
pass_outswinging           0
pass_straight              0
pass_inswinging            0
goal          

In [1026]:
# Note: location_y and pass_end_location NA will 
# be addressed in a later step

In [1027]:
save_df(extracted_data,
        'extracted_data')

extracted_data Filesize: 326099 bytes


# Location Features

In [1028]:
# Define location features

location_features = ['location_x',
                     'shot_end_location',
                     'location_y',
                     'pass_end_location']

In [1029]:
extracted_data[location_features].head()

Unnamed: 0,location_x,shot_end_location,location_y,pass_end_location
0,"[115.0, 25.0]","[117.0, 34.0]","[44.0, 17.0]","[95.0, 23.0]"
1,"[109.0, 51.0]","[112.0, 44.0]","[102.0, 45.0]","[113.0, 55.0]"
2,"[99.0, 52.0]","[108.0, 51.0]",,
3,"[107.0, 40.0]","[112.0, 37.0]",,
4,"[108.0, 32.0]","[120.0, 43.2, 2.0]","[104.0, 36.0]","[109.0, 29.0]"


In [1030]:
# Currently location_x, location_y, and pass_end_location
# values are x and y-coordinates

# Currently shot_end_location values are x, y, z-coordinates

In [1031]:
print_feature_na('location_y')
print_feature_na('pass_end_location')

location_y NA : 1950
pass_end_location NA : 1950


In [1032]:
def compare_location_na(feature):
  
  # Compare NA values of defined feature with no pass events
  
  print('No Pass Events =/=',
        feature,
        'NA:',
        (sum(extracted_data[feature].isna())) -
        (sum(extracted_data.loc[extracted_data['pass_length'] == 0][feature].isna())))

In [1033]:
compare_location_na('location_y')
compare_location_na('pass_end_location')

No Pass Events =/= location_y NA: 0
No Pass Events =/= pass_end_location NA: 0


In [1034]:
# Currently location_y and pass_end_location contain NA

# location_y and pass_end_locatio NA are no pass events

## Functions

In [1035]:
def split_coordinates(feature):
  
  # Split location feature values into separate x and y-coordinates
  
  split_coordinates_df = pd.DataFrame(extracted_data[feature].tolist(),
                                      index = extracted_data.index)
  
  return split_coordinates_df

In [1036]:
def replace_location_feature(feature,
                             feature_x,
                             feature_y,
                             split_coordinates_df):
  
  # Drop previous location feature
  
  extracted_data.drop(feature,
                      axis = 1,
                      inplace = True)
  
  # Add separate x and y-coordinate features
  
  extracted_data[feature_x] = split_coordinates_df[0]
  extracted_data[feature_y] = split_coordinates_df[1]

## location_x

In [1037]:
# Split location_x into separate x and y-coordinates

split_location_x = split_coordinates('location_x')

In [1038]:
split_location_x.head()

Unnamed: 0,0,1
0,115.0,25.0
1,109.0,51.0
2,99.0,52.0
3,107.0,40.0
4,108.0,32.0


In [1039]:
# Replace location_x with shot_location_x and shot_location_y

replace_location_feature('location_x',
                         'shot_location_y',
                         'shot_location_x',
                         split_location_x)

In [1040]:
extracted_data[['shot_location_x',
                'shot_location_y']].head()

Unnamed: 0,shot_location_x,shot_location_y
0,25.0,115.0
1,51.0,109.0
2,52.0,99.0
3,40.0,107.0
4,32.0,108.0


In [1041]:
extracted_data['shot_location_x'].describe()

count    6114.000000
mean       40.303059
std         9.926676
min         0.700000
25%        33.000000
50%        40.100000
75%        47.100000
max        79.400000
Name: shot_location_x, dtype: float64

In [1042]:
extracted_data['shot_location_y'].describe()

count    6114.000000
mean      103.889401
std         9.039966
min        58.000000
25%        97.425000
50%       105.400000
75%       111.000000
max       120.000000
Name: shot_location_y, dtype: float64

## shot_end_location

In [1043]:
# Split shot_end_location into separate x and y-coordinates

split_shot_end_location = split_coordinates('shot_end_location')

In [1044]:
split_shot_end_location.head()

Unnamed: 0,0,1,2
0,117.0,34.0,
1,112.0,44.0,
2,108.0,51.0,
3,112.0,37.0,
4,120.0,43.2,2.0


In [1045]:
print('shot_end_location z-coordinate NA:',
      (sum(split_shot_end_location[2].isna())),
      '\n',
      'Percent shot_end_location z-coordinate NA:',
      (round((((sum(split_shot_end_location[2].isna())) /
      (len(extracted_data))) *
      100), 2)),
      '%')

shot_end_location z-coordinate NA: 1809 
 Percent shot_end_location z-coordinate NA: 29.59 %


In [1046]:
# Drop shot_end_location z-coordinate due to too many missing values

In [1047]:
# Replace shot_end_location with shot_end_location_x and
# shot_end_location_y

replace_location_feature('shot_end_location',
                         'shot_end_location_y',
                         'shot_end_location_x',
                         split_shot_end_location)

In [1048]:
extracted_data[['shot_end_location_x',
                'shot_end_location_y']].head()

Unnamed: 0,shot_end_location_x,shot_end_location_y
0,34.0,117.0
1,44.0,112.0
2,51.0,108.0
3,37.0,112.0
4,43.2,120.0


In [1049]:
extracted_data['shot_end_location_x'].describe()

count    6114.000000
mean       40.146402
std         6.302520
min         0.100000
25%        36.400000
50%        40.000000
75%        43.800000
max        80.000000
Name: shot_end_location_x, dtype: float64

In [1050]:
extracted_data['shot_end_location_y'].describe()

count    6114.000000
mean      116.012987
std         6.241731
min        84.000000
25%       115.000000
50%       119.000000
75%       120.000000
max       120.000000
Name: shot_end_location_y, dtype: float64

## location_y

In [1051]:
# Fill location_y NA with 00

fill_feature_na('location_y',
                '00')

In [1052]:
# Split location_y into separate x and y-coordinates

split_pass_location = split_coordinates('location_y')

In [1053]:
split_pass_location.head()

Unnamed: 0,0,1
0,44,17
1,102,45
2,0,0
3,0,0
4,104,36


In [1054]:
# Convert x and y-coordinates to float data type

split_pass_location = split_pass_location.astype(float)

In [1055]:
# Replace location_y with pass_location_x and
# pass_location_y

replace_location_feature('location_y',
                         'pass_location_y',
                         'pass_location_x',
                         split_pass_location)

In [1056]:
extracted_data[['pass_location_x',
                'pass_location_y']].head()

Unnamed: 0,pass_location_x,pass_location_y
0,17.0,44.0
1,45.0,102.0
2,0.0,0.0
3,0.0,0.0
4,36.0,104.0


In [1057]:
extracted_data['pass_location_x'].describe()

count    6114.000000
mean       28.212365
std        27.262018
min         0.000000
25%         0.000000
50%        23.000000
75%        52.000000
max        80.000000
Name: pass_location_x, dtype: float64

In [1058]:
extracted_data['pass_location_y'].describe()

count    6114.000000
mean       63.746729
std        46.964437
min         0.000000
25%         0.000000
50%        83.000000
75%       104.475000
max       120.600000
Name: pass_location_y, dtype: float64

In [1059]:
extracted_data[['pass_location_x',
                'pass_location_y']].isnull().sum()

pass_location_x    0
pass_location_y    0
dtype: int64

## pass_end_location

In [1060]:
# Fill pass_end_location NA with 00

fill_feature_na('pass_end_location',
                '00')

In [1061]:
# Split pass_end_location into separate x and y-coordinates

split_pass_end_location = split_coordinates('pass_end_location')

In [1062]:
split_pass_end_location.head()

Unnamed: 0,0,1
0,95,23
1,113,55
2,0,0
3,0,0
4,109,29


In [1063]:
# Convert x and y-coordinates to float data type

split_pass_end_location = split_pass_end_location.astype(float)

In [1064]:
# Replace pass_end_location with pass_end_location_x and
# pass_end_location_y

replace_location_feature('pass_end_location',
                         'pass_end_location_y',
                         'pass_end_location_x',
                         split_pass_location)

In [1065]:
extracted_data[['pass_end_location_x',
                'pass_end_location_y']].head()

Unnamed: 0,pass_end_location_x,pass_end_location_y
0,17.0,44.0
1,45.0,102.0
2,0.0,0.0
3,0.0,0.0
4,36.0,104.0


In [1066]:
extracted_data['pass_end_location_x'].describe()

count    6114.000000
mean       28.212365
std        27.262018
min         0.000000
25%         0.000000
50%        23.000000
75%        52.000000
max        80.000000
Name: pass_end_location_x, dtype: float64

In [1067]:
extracted_data['pass_end_location_y'].describe()

count    6114.000000
mean       63.746729
std        46.964437
min         0.000000
25%         0.000000
50%        83.000000
75%       104.475000
max       120.600000
Name: pass_end_location_y, dtype: float64

In [1068]:
extracted_data[['pass_end_location_x',
                'pass_end_location_y']].isnull().sum()

pass_end_location_x    0
pass_end_location_y    0
dtype: int64

## Results

In [1069]:
extracted_data.head()

Unnamed: 0,period_x,timestamp_x,play_pattern_x,shot_statsbomb_xg,shot_body_part,shot_technique,shot_type,shot_outcome,shot_first_time,under_pressure_x,shot_one_on_one,shot_deflected,shot_aerial_won,shot_open_goal,shot_redirect,shot_follows_dribble,pass_length,pass_angle,pass_height,pass_type,pass_technique,pass_through_ball,pass_cross,pass_switch,pass_backheel,counterpress,pass_cut_back,pass_outswinging,pass_straight,pass_inswinging,goal,shot_location_y,shot_location_x,shot_end_location_y,shot_end_location_x,pass_location_y,pass_location_x,pass_end_location_y,pass_end_location_x
0,1,2021-11-18 00:00:47.620,From Counter,0.092624,Right Foot,Normal,Open Play,Blocked,False,False,False,False,False,False,False,False,51.351727,0.117109,High Pass,Open Play,Through Ball,True,False,False,False,False,False,False,False,False,False,115.0,25.0,117.0,34.0,44.0,17.0,44.0,17.0
1,1,2021-11-18 00:05:12.780,From Throw In,0.041837,Left Foot,Normal,Open Play,Blocked,False,False,False,False,False,False,False,False,14.866069,0.737815,Low Pass,Open Play,Standard,False,False,False,False,False,False,False,False,False,False,109.0,51.0,112.0,44.0,102.0,45.0,102.0,45.0
2,1,2021-11-18 00:05:41.940,From Corner,0.017603,Right Foot,Half Volley,Open Play,Blocked,True,False,False,False,False,False,False,False,0.0,0.0,No Pass,No Pass,No Pass,False,False,False,False,False,False,False,False,False,False,99.0,52.0,108.0,51.0,0.0,0.0,0.0,0.0
3,1,2021-11-18 00:05:43.900,From Corner,0.144138,Right Foot,Normal,Open Play,Blocked,False,False,False,False,False,False,False,False,0.0,0.0,No Pass,No Pass,No Pass,False,False,False,False,False,False,False,False,False,False,107.0,40.0,112.0,37.0,0.0,0.0,0.0,0.0
4,1,2021-11-18 00:05:46.380,From Corner,0.068946,Left Foot,Normal,Open Play,Goal,True,True,False,False,False,False,False,False,8.602325,-0.950547,Ground Pass,Recovery,Standard,False,False,False,False,False,False,False,False,False,True,108.0,32.0,120.0,43.2,104.0,36.0,104.0,36.0


In [1070]:
extracted_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6114 entries, 0 to 6113
Data columns (total 39 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   period_x              6114 non-null   int64         
 1   timestamp_x           6114 non-null   datetime64[ns]
 2   play_pattern_x        6114 non-null   object        
 3   shot_statsbomb_xg     6114 non-null   float64       
 4   shot_body_part        6114 non-null   object        
 5   shot_technique        6114 non-null   object        
 6   shot_type             6114 non-null   object        
 7   shot_outcome          6114 non-null   object        
 8   shot_first_time       6114 non-null   bool          
 9   under_pressure_x      6114 non-null   bool          
 10  shot_one_on_one       6114 non-null   bool          
 11  shot_deflected        6114 non-null   bool          
 12  shot_aerial_won       6114 non-null   bool          
 13  shot_open_goal    

In [1071]:
df_description(extracted_data)

Total Events: 6114 
 Total Features: 39


In [1072]:
save_df(extracted_data,
        'extracted_data')

extracted_data Filesize: 320754 bytes


# Outliers

## Functions

In [1073]:
def box_plot(feature):

  # Vizualize distribution of defined feature
  
  fig = px.box(extracted_data,
               y = feature)
  fig.show()

In [1074]:
def iqr_outliers(feature):
  
  # Define outliers by interquartile range

  q1 = extracted_data[feature].quantile(0.25)
  q3 = extracted_data[feature].quantile(0.75)

  iqr = q3 - q1

  lower_limit = q1 - (1.5 * iqr)
  upper_limit = q3 + (1.5 * iqr)

  return lower_limit, upper_limit

In [1075]:
def outlier_values(feature,
                   lower_q,
                   upper_q):
  
  outliers_locations = extracted_data[(extracted_data[feature] <
                                       lower_q) |
                                      (extracted_data[feature] >
                                       upper_q)]
  
  return outliers_locations

In [1076]:
def print_lateral_outliers(feature,
                           lower_q,
                           upper_q,
                           outlier_locations):
  
  print(feature,
        'Outliers:',
        '\n',
        'Wider than',
        (round((40 - lower_q),
               2)),
        'Yards Left of Center',
        '\n',
        '-and-'
        '\n',
        'Wider than',
        (round((upper_q - 40),
               2)),
        'Yards Right of Center',
        '\n\n',
        'Total:',
        (len(outlier_locations)),
        '\n',
        'Percent:',
        (round((((len(outlier_locations)) /
                 (len(extracted_data))) *
                100),
               2)),
        '%')

In [1077]:
def print_vertical_outliers(feature,
                            lower_q,
                            upper_q,
                            outlier_locations):
  
  print(feature,
        'Outliers:',
        '\n',
        'Further than',
        (round((120 - lower_q),
               2)),
        'from the Endline',
        '\n',
        '-and-'
        '\n',
        'Closer than',
        (round((120 - upper_q),
               2)),
        'from the Endline',
        '\n\n',
        'Total:',
        (len(outlier_locations)),
        '\n',
        'Percent:',
        (round((((len(outlier_locations)) /
                 (len(extracted_data))) *
                100),
               2)),
        '%')

In [1078]:
def outlier_result(feature,
                   outlier_locations):
  
  print('Percent',
        feature,
        'outliers result in goal:',
        (round(((((sum(outlier_locations['shot_outcome'] == 'Goal'))) /
                 (len(outlier_locations))) *
                100),
               2)),
        '%',
        '\n\n',
        'Percent',
        feature,
        'outliers result on-target:',
        (round(((((sum(outlier_locations['shot_outcome'] == 'Goal')) +
                  (sum(outlier_locations['shot_outcome'] == 'Saved'))) /
                 (len(outlier_locations))) *
                100),
               2)),
        '%')

## shot_location_x

In [1079]:
box_plot('shot_location_x')

In [1080]:
# Define shot_location_x outliers

shot_location_x_lower_q, shot_location_x_upper_q = iqr_outliers('shot_location_x')

shot_location_x_outliers = outlier_values('shot_location_x',
                                          shot_location_x_lower_q,
                                          shot_location_x_upper_q)

In [1081]:
shot_location_x_outliers['shot_location_x']

99      69.0
369     78.6
1368    69.0
1413    75.0
1561    11.0
1706     5.0
2011     7.0
2555    11.8
2905    75.6
3033    76.5
3401     0.7
3955    75.0
4491    11.5
4522    10.1
4813    79.4
5786    71.0
Name: shot_location_x, dtype: float64

In [1082]:
print_lateral_outliers('shot_location_x',
                       shot_location_x_lower_q,
                       shot_location_x_upper_q,
                       shot_location_x_outliers)

shot_location_x Outliers: 
 Wider than 28.15 Yards Left of Center 
 -and-
 Wider than 28.25 Yards Right of Center 

 Total: 16 
 Percent: 0.26 %


In [1083]:
shot_location_x_outliers['shot_outcome'].value_counts()

Goal     7
Saved    6
Off T    3
Name: shot_outcome, dtype: int64

In [1084]:
outlier_result('shot_location_x',
               shot_location_x_outliers)

Percent shot_location_x outliers result in goal: 43.75 % 

 Percent shot_location_x outliers result on-target: 81.25 %


In [1085]:
# Retain outliers

# High percentage of outliers resulted in goal (target outcome)

## shot_location_y

In [1086]:
box_plot('shot_location_y')

In [1087]:
# Define shot_location_y outliers

shot_location_y_lower_q, shot_location_y_upper_q = iqr_outliers('shot_location_y')

shot_location_y_outliers = outlier_values('shot_location_y',
                                          shot_location_y_lower_q,
                                          shot_location_y_upper_q)

In [1088]:
shot_location_y_outliers['shot_location_y']

36      68.0
360     71.0
369     65.8
1200    76.0
1413    68.0
1486    63.0
1502    58.0
1533    76.0
2139    76.0
2509    74.9
2821    64.0
3755    68.2
3993    72.8
4285    75.2
4439    74.7
4481    71.3
4596    74.9
4824    70.4
5490    77.0
5538    61.0
5583    76.0
6032    77.0
6058    76.0
Name: shot_location_y, dtype: float64

In [1089]:
print_vertical_outliers('shot_location_y',
                        shot_location_y_lower_q,
                        shot_location_y_upper_q,
                        shot_location_y_outliers)

shot_location_y Outliers: 
 Further than 42.94 from the Endline 
 -and-
 Closer than -11.36 from the Endline 

 Total: 23 
 Percent: 0.38 %


In [1090]:
shot_location_y_outliers['shot_outcome'].value_counts()

Saved               11
Off T                7
Wayward              2
Blocked              1
Goal                 1
Saved Off Target     1
Name: shot_outcome, dtype: int64

In [1091]:
outlier_result('shot_location_y',
               shot_location_y_outliers)

Percent shot_location_y outliers result in goal: 4.35 % 

 Percent shot_location_y outliers result on-target: 52.17 %


In [1092]:
# Retain outliers

# High percentage of outliers were on-target, indicating
# an opportunity to score a goal

## shot_end_location_x

In [1093]:
box_plot('shot_end_location_x')

In [1094]:
# Define shot_end_location_x outliers

shot_end_location_x_lower_q, shot_end_location_x_upper_q = iqr_outliers('shot_end_location_x')

shot_end_location_x_outliers = outlier_values('shot_end_location_x',
                                              shot_end_location_x_lower_q,
                                              shot_end_location_x_upper_q)

In [1095]:
shot_end_location_x_outliers['shot_end_location_x']

23      22.0
37      78.0
141     24.0
153     57.0
239     22.1
        ... 
5981    11.0
5985    25.0
5998    23.0
6030    56.0
6046    61.0
Name: shot_end_location_x, Length: 199, dtype: float64

In [1096]:
print_lateral_outliers('shot_end_location_x',
                       shot_end_location_x_lower_q,
                       shot_end_location_x_upper_q,
                       shot_end_location_x_outliers)

shot_end_location_x Outliers: 
 Wider than 14.7 Yards Left of Center 
 -and-
 Wider than 14.9 Yards Right of Center 

 Total: 199 
 Percent: 3.25 %


In [1097]:
shot_end_location_x_outliers['shot_outcome'].value_counts()

Blocked    98
Wayward    73
Off T      28
Name: shot_outcome, dtype: int64

In [1098]:
# Retain outliers

# Poor shot end location does not indicate a poor shot opportunity

## shot_end_location_y

In [1099]:
box_plot('shot_end_location_y')

In [1100]:
# Define shot_end_location_y outliers

shot_end_location_y_lower_q, shot_end_location_y_upper_q = iqr_outliers('shot_end_location_y')

shot_end_location_y_outliers = outlier_values('shot_end_location_y',
                                              shot_end_location_y_lower_q,
                                              shot_end_location_y_upper_q)

In [1101]:
shot_end_location_y_outliers['shot_end_location_y']

43      107.0
48       97.0
50      105.0
53      101.0
54      101.0
        ...  
6090    105.0
6092     96.0
6098     97.0
6107    107.0
6112    106.0
Name: shot_end_location_y, Length: 737, dtype: float64

In [1102]:
print_vertical_outliers('shot_end_location_y',
                        shot_end_location_y_lower_q,
                        shot_end_location_y_upper_q,
                        shot_end_location_y_outliers)

shot_end_location_y Outliers: 
 Further than 12.5 from the Endline 
 -and-
 Closer than -7.5 from the Endline 

 Total: 737 
 Percent: 12.05 %


In [1103]:
shot_end_location_y_outliers['shot_outcome'].value_counts()

Blocked    687
Wayward     35
Saved       13
Off T        2
Name: shot_outcome, dtype: int64

In [1104]:
# Retain outliers

# Poor shot end location does not indicate a poor shot opportunity

## pass_location_x

In [1105]:
box_plot('pass_location_x')

In [1106]:
# Define pass_location_x outliers

pass_location_x_lower_q, pass_location_x_upper_q = iqr_outliers('pass_location_x')

pass_location_x_outliers = outlier_values('pass_location_x',
                                          pass_location_x_lower_q,
                                          pass_location_x_upper_q)

In [1107]:
print('pass_location_x Outliers:',
      len(pass_location_x_outliers))

pass_location_x Outliers: 0


##pass_location_y

In [1108]:
box_plot('pass_location_y')

In [1109]:
# Define pass_location_y outliers

pass_location_y_lower_q, pass_location_y_upper_q = iqr_outliers('pass_location_y')

pass_location_y_outliers = outlier_values('pass_location_y',
                                          pass_location_y_lower_q,
                                          pass_location_y_upper_q)

In [1110]:
print('pass_location_y Outliers:',
      len(pass_location_y_outliers))

pass_location_y Outliers: 0


## pass_end_location_x

In [1111]:
box_plot('pass_end_location_x')

In [1112]:
# Define pass_end_location_x outliers

pass_end_location_x_lower_q, pass_end_location_x_upper_q = iqr_outliers('pass_end_location_x')

pass_end_location_x_outliers = outlier_values('pass_end_location_x',
                                              pass_end_location_x_lower_q,
                                              pass_end_location_x_upper_q)

In [1113]:
print('pass_end_location_x Outliers:',
      len(pass_end_location_x_outliers))

pass_end_location_x Outliers: 0


## pass_end_location_y

In [1114]:
box_plot('pass_end_location_y')

In [1115]:
# Define pass_end_location_y outliers

pass_end_location_y_lower_q, pass_end_location_y_upper_q = iqr_outliers('pass_end_location_y')

pass_end_location_y_outliers = outlier_values('pass_end_location_y',
                                              pass_end_location_y_lower_q,
                                              pass_end_location_y_upper_q)

In [1116]:
print('pass_end_location_y Outliers:',
      len(pass_end_location_y_outliers))

pass_end_location_y Outliers: 0


## Results

In [1118]:
# Note: pass_length and pass_angle will be
# addressed in a later step

In [1117]:
# Following identification and assessment, no outliers were dropped

# No changes were made to the current data

# Combine Redundant Features

## Functions

In [1119]:
def combine_value(feature_1,
                  value,
                  feature_2,
                  new_value):
  
  # Combine defined value of defined feature into another feature
  # as a new defined value
  
  extracted_data.loc[extracted_data[feature_1] == value,
                     feature_2] = new_value

## shot_technique

In [1120]:
extracted_data['shot_technique'].value_counts()

Normal           5163
Half Volley       510
Volley            337
Lob                55
Backheel           20
Diving Header      15
Overhead Kick      14
Name: shot_technique, dtype: int64

In [1121]:
extracted_data['shot_redirect'].value_counts()

False    6092
True       22
Name: shot_redirect, dtype: int64

In [1122]:
# Compare shot_redirect values v shot_technique values

compare_value_count('shot_redirect',
                    True,
                    'shot_technique')

Normal         13
Volley          6
Backheel        2
Half Volley     1
Name: shot_technique, dtype: int64

In [1123]:
# No changes

# shot_redirect is a unique descriptor from shot_technique values

## play_pattern_x

In [1124]:
extracted_data['play_pattern_x'].value_counts()

Regular Play      2195
From Throw In     1221
From Corner       1013
From Free Kick     921
From Counter       345
From Goal Kick     215
Other               70
From Keeper         69
From Kick Off       65
Name: play_pattern_x, dtype: int64

### v shot_type

In [1125]:
extracted_data['shot_type'].value_counts()

Open Play    5868
Free Kick     191
Penalty        53
Corner          2
Name: shot_type, dtype: int64

In [1126]:
# Compare shot_type 'Free Kick' values v play_pattern_x values

compare_value_count('shot_type',
                    'Free Kick',
                    'play_pattern_x')

From Free Kick    191
Name: play_pattern_x, dtype: int64

In [1127]:
# Combine shot_type 'Free Kick' values into play_pattern_x as 'Direct Free Kick'

# Assume differentiation between 191 shot_type 'Free Kick' values and 918
# play_pattern_x 'From Free Kick' values is if the free kick was a direct
# shot or a pass leading to a shot

combine_value('shot_type',
              'Free Kick',
              'play_pattern_x',
              'Direct Free Kick')

In [1128]:
# Compare shot_type 'Penalty' values v play_pattern_x values

compare_value_count('shot_type',
                    'Penalty',
                    'play_pattern_x')

Other    53
Name: play_pattern_x, dtype: int64

In [1129]:
# Combine shot_type 'Penalty' values into play_pattern_x

combine_value('shot_type',
              'Penalty',
              'play_pattern_x',
              'Penalty')

In [1130]:
# Compare shot_type 'Corner' values v play_pattern_x values

compare_value_count('shot_type',
                    'Corner',
                    'play_pattern_x')

From Corner    2
Name: play_pattern_x, dtype: int64

In [1131]:
# shot_type 'Corner' values match play_pattern_x 'From Corner' values

# No change required

In [1132]:
# Drop shot_type

extracted_data.drop('shot_type',
                    axis = 1,
                    inplace = True)

In [1133]:
extracted_data['play_pattern_x'].value_counts()

Regular Play        2195
From Throw In       1221
From Corner         1013
From Free Kick       730
From Counter         345
From Goal Kick       215
Direct Free Kick     191
From Keeper           69
From Kick Off         65
Penalty               53
Other                 17
Name: play_pattern_x, dtype: int64

### v pass_type

In [1134]:
extracted_data['pass_type'].value_counts()

Open Play       3197
No Pass         1951
Corner           402
Recovery         307
Free Kick        202
Throw-in          42
Interception      11
Kick Off           1
Goal Kick          1
Name: pass_type, dtype: int64

In [1135]:
# Compare pass_type 'Corner' values v play_pattern_x values

compare_value_count('pass_type',
                    'Corner',
                    'play_pattern_x')

From Corner    402
Name: play_pattern_x, dtype: int64

In [1136]:
# pass_type 'Corner' values match play_pattern_x 'From Corner' values

# No change required

In [1137]:
# Compare pass_type 'Recovery' values v play_pattern_x values

compare_value_count('pass_type',
                    'Recovery',
                    'play_pattern_x')

Regular Play      118
From Throw In      51
From Corner        45
From Counter       40
From Free Kick     33
From Goal Kick     10
From Keeper         7
From Kick Off       2
Other               1
Name: play_pattern_x, dtype: int64

In [1138]:
# Compare pass_type 'Interception' values v play_pattern_x values

compare_value_count('pass_type',
                    'Interception',
                    'play_pattern_x')

Regular Play      6
From Throw In     2
From Counter      2
From Goal Kick    1
Name: play_pattern_x, dtype: int64

In [1139]:
# Combine pass_type 'Recovery' and 'Interception' values into play_pattern_x as 'Turnover'

# Assume pass_type 'Recovery' values are events in which the ball was recovered
# from the opposition's play described by play_pattern_x values

# Combine pass_type 'Interception' values with 'Recovery' values as interceptions are
# a type of recovery

combine_value('pass_type',
              'Recovery',
              'play_pattern_x',
              'Recovery')

combine_value('pass_type',
              'Interception',
              'play_pattern_x',
              'Recovery')

In [1140]:
# Compare pass_type 'Throw-in' values v play_pattern_x values

compare_value_count('pass_type',
                    'Throw-in',
                    'play_pattern_x')

From Throw In    42
Name: play_pattern_x, dtype: int64

In [1141]:
# pass_type 'Throw-in' values match play_pattern_x 'Throw In' values

# No change required

In [1142]:
# Compare pass_type 'Kick Off' values v play_pattern_x values

compare_value_count('pass_type',
                    'Kick Off',
                    'play_pattern_x')

From Kick Off    1
Name: play_pattern_x, dtype: int64

In [1143]:
# pass_type 'Kick Off' values match play_pattern_x 'From Kick Off' values

# No change required

In [1144]:
# Compare pass_type 'Goal Kick' values v play_pattern_x values

compare_value_count('pass_type',
                    'Goal Kick',
                    'play_pattern_x')

From Goal Kick    1
Name: play_pattern_x, dtype: int64

In [1145]:
# pass_type 'Goal Kick' values match play_pattern_x 'From Goal Kick' values

# No change required

In [1146]:
# Drop pass_type

# Values are now redundant v play_pattern_x values

extracted_data.drop('pass_type',
                    axis = 1,
                    inplace = True)

In [1147]:
extracted_data['play_pattern_x'].value_counts()

Regular Play        2071
From Throw In       1168
From Corner          968
From Free Kick       697
Recovery             318
From Counter         303
From Goal Kick       204
Direct Free Kick     191
From Kick Off         63
From Keeper           62
Penalty               53
Other                 16
Name: play_pattern_x, dtype: int64

### v counterpress

In [1148]:
extracted_data['counterpress'].value_counts()

False    6109
True        5
Name: counterpress, dtype: int64

In [1149]:
# Compare counterpress values v play_pattern_x values

compare_value_count('counterpress',
                    True,
                    'play_pattern_x')

Recovery    5
Name: play_pattern_x, dtype: int64

In [1150]:
# Combine counterpress into play_pattern_x

combine_value('counterpress',
              True,
              'play_pattern_x',
              'From Counterpress')

In [1151]:
# Drop counterpress

extracted_data.drop('counterpress',
                    axis = 1,
                    inplace = True)

In [1152]:
extracted_data['play_pattern_x'].value_counts()

Regular Play         2071
From Throw In        1168
From Corner           968
From Free Kick        697
Recovery              313
From Counter          303
From Goal Kick        204
Direct Free Kick      191
From Kick Off          63
From Keeper            62
Penalty                53
Other                  16
From Counterpress       5
Name: play_pattern_x, dtype: int64

### v shot_follows_dribble

In [1153]:
extracted_data['shot_follows_dribble'].value_counts()

False    6111
True        3
Name: shot_follows_dribble, dtype: int64

In [1154]:
# Compare shot_follows_dribble values v play_pattern_x values

compare_value_count('shot_follows_dribble',
                    True,
                    'play_pattern_x')

Direct Free Kick    1
Recovery            1
Regular Play        1
Name: play_pattern_x, dtype: int64

In [1155]:
# Combine shot_follows_dribble into play_pattern_x

combine_value('shot_follows_dribble',
              True,
              'play_pattern_x',
              'From Dribble')

In [1156]:
# Drop shot_follows_dribble

extracted_data.drop('shot_follows_dribble',
                    axis = 1,
                    inplace = True)

In [1157]:
extracted_data['play_pattern_x'].value_counts()

Regular Play         2070
From Throw In        1168
From Corner           968
From Free Kick        697
Recovery              312
From Counter          303
From Goal Kick        204
Direct Free Kick      190
From Kick Off          63
From Keeper            62
Penalty                53
Other                  16
From Counterpress       5
From Dribble            3
Name: play_pattern_x, dtype: int64

## pass_length

In [1158]:
# pass_length will be redundant following feature engineering phase

# Using pass_location_x and pass_location_y as a start-point with
# pass_end_location_x and pass_end_location_y as an end-point,
# a vector can be created to calculate the pass distance

In [1159]:
# Drop pass_length

extracted_data.drop('pass_length',
                    axis = 1,
                    inplace = True)

## pass_angle

In [1160]:
# pass_angle will be redundant following feature engineering phase

# Using pass_location_x and pass_location_y as a start-point with
# pass_end_location_x and pass_end_location_y as an end-point,
# a vector can be created to calculate the pass angle

In [1161]:
# Drop pass_angle

extracted_data.drop('pass_angle',
                    axis = 1,
                    inplace = True)

## pass_technique

In [1162]:
extracted_data['pass_technique'].value_counts()

Standard        3807
No Pass         1951
Through Ball     199
Inswinging        76
Outswinging       55
Straight          26
Name: pass_technique, dtype: int64

### v pass_through_ball

In [1163]:
extracted_data['pass_through_ball'].value_counts()

False    5915
True      199
Name: pass_through_ball, dtype: int64

In [1164]:
# Compare pass_through_ball values v pass_technique

compare_value_count('pass_through_ball',
                    True,
                    'pass_technique')

Through Ball    199
Name: pass_technique, dtype: int64

In [1165]:
# Drop pass_through_ball

# pass_through_ball True values match pass_technique 'Through Ball' values

extracted_data.drop('pass_through_ball',
                    axis = 1,
                    inplace = True)

### v pass_inswinging

In [1166]:
extracted_data['pass_inswinging'].value_counts()

False    6038
True       76
Name: pass_inswinging, dtype: int64

In [1167]:
# Compare pass_inswinging values v pass_technique

compare_value_count('pass_inswinging',
                    True,
                    'pass_technique')

Inswinging    76
Name: pass_technique, dtype: int64

In [1168]:
# Drop pass_inswinging

# pass_inswinging True values match pass_technique 'Inswinging' values

extracted_data.drop('pass_inswinging',
                    axis = 1,
                    inplace = True)

### v pass_outswinging

In [1169]:
extracted_data['pass_outswinging'].value_counts()

False    6059
True       55
Name: pass_outswinging, dtype: int64

In [1170]:
# Compare pass_outswinging values v pass_technique

compare_value_count('pass_outswinging',
                    True,
                    'pass_technique')

Outswinging    55
Name: pass_technique, dtype: int64

In [1171]:
# Drop pass_outswinging

# pass_outswinging True values match pass_technique 'Outswinging' values

extracted_data.drop('pass_outswinging',
                    axis = 1,
                    inplace = True)

### v pass_straight

In [1172]:
extracted_data['pass_straight'].value_counts()

False    6088
True       26
Name: pass_straight, dtype: int64

In [1173]:
# Compare pass_straight values v pass_technique

compare_value_count('pass_straight',
                    True,
                    'pass_technique')

Straight    26
Name: pass_technique, dtype: int64

In [1174]:
# Drop pass_straight

# pass_straight True values match pass_technique 'Straight' values

extracted_data.drop('pass_straight',
                    axis = 1,
                    inplace = True)

### v pass_cross

In [1175]:
extracted_data['pass_cross'].value_counts()

False    5358
True      756
Name: pass_cross, dtype: int64

In [1176]:
# Compare pass_cross values v pass_technique

compare_value_count('pass_cross',
                    True,
                    'pass_technique')

Standard        753
Through Ball      3
Name: pass_technique, dtype: int64

In [1177]:
# Combine pass_cross into pass_technique

combine_value('pass_cross',
              True,
              'pass_technique',
              'Cross')

In [1178]:
# Drop pass_cross

extracted_data.drop('pass_cross',
                    axis = 1,
                    inplace = True)

In [1179]:
extracted_data['pass_technique'].value_counts()

Standard        3054
No Pass         1951
Cross            756
Through Ball     196
Inswinging        76
Outswinging       55
Straight          26
Name: pass_technique, dtype: int64

### v pass_backheel

In [1180]:
extracted_data['pass_backheel'].value_counts()

False    6099
True       15
Name: pass_backheel, dtype: int64

In [1181]:
# Compare pass_backheel values v pass_technique

compare_value_count('pass_backheel',
                    True,
                    'pass_technique')

Standard    15
Name: pass_technique, dtype: int64

In [1182]:
# Combine pass_backheel into pass_technique

combine_value('pass_backheel',
              True,
              'pass_technique',
              'Backheel')

In [1183]:
# Drop pass_backheel

extracted_data.drop('pass_backheel',
                    axis = 1,
                    inplace = True)

In [1184]:
extracted_data['pass_technique'].value_counts()

Standard        3039
No Pass         1951
Cross            756
Through Ball     196
Inswinging        76
Outswinging       55
Straight          26
Backheel          15
Name: pass_technique, dtype: int64

### v pass_cut_back

In [1185]:
extracted_data['pass_cut_back'].value_counts()

False    6006
True      108
Name: pass_cut_back, dtype: int64

In [1186]:
# Compare pass_cut_back values v pass_technique

compare_value_count('pass_cut_back',
                    True,
                    'pass_technique')

Cross       63
Standard    43
Backheel     2
Name: pass_technique, dtype: int64

In [1187]:
def combine_value_defined(feature_1,
                          value_1,
                          feature_2,
                          value_2,
                          feature_3,
                          new_value):
  
  # Combine defined value of defined feature into another feature,
  # only when it is another defined value, as a new defined value
  
  extracted_data.loc[(extracted_data[feature_1] == value_1) &
                     (extracted_data[feature_2] == value_2),
                     feature_3] = new_value

In [1188]:
# Combine pass_cut_back into pass_technique for values of 'Standard'

combine_value_defined('pass_cut_back',
                      True,
                      'pass_technique',
                      'Standard',
                      'pass_technique',
                      'Cut Back')

In [1189]:
# Drop pass_cut_back

extracted_data.drop('pass_cut_back',
                    axis = 1,
                    inplace = True)

In [1190]:
extracted_data['pass_technique'].value_counts()

Standard        2996
No Pass         1951
Cross            756
Through Ball     196
Inswinging        76
Outswinging       55
Cut Back          43
Straight          26
Backheel          15
Name: pass_technique, dtype: int64

### v pass_switch

In [1191]:
extracted_data['pass_switch'].value_counts()

False    5790
True      324
Name: pass_switch, dtype: int64

In [1192]:
# Compare pass_switch values v pass_technique

compare_value_count('pass_switch',
                    True,
                    'pass_technique')

Standard        183
Cross            46
Inswinging       41
Outswinging      39
Straight         13
Through Ball      2
Name: pass_technique, dtype: int64

In [1193]:
# Combine pass_switch into pass_technique for values of 'Standard'

combine_value_defined('pass_switch',
                      True,
                      'pass_technique',
                      'Standard',
                      'pass_technique',
                      'Switch')

In [1194]:
# Drop pass_switch

extracted_data.drop('pass_switch',
                    axis = 1,
                    inplace = True)

In [1195]:
extracted_data['pass_technique'].value_counts()

Standard        2813
No Pass         1951
Cross            756
Through Ball     196
Switch           183
Inswinging        76
Outswinging       55
Cut Back          43
Straight          26
Backheel          15
Name: pass_technique, dtype: int64

### Consolidate 'Cross' value

In [1196]:
# Combine pass_technique values 'Switch', 'Inswinging', 'Outswinging', and
# 'Straight' into value 'Cross'

# Values are types of crosses
# Value 'Cross' is nonspecific, but the most frequent
# Combining into single, less specific, value could increase importance in modeling

extracted_data.loc[extracted_data['pass_technique'].isin(['Switch',
                                                          'Inswinging',
                                                          'Outswinging',
                                                          'Straight']),
                   'pass_technique'] = 'Cross'

In [1197]:
extracted_data['pass_technique'].value_counts()

Standard        2813
No Pass         1951
Cross           1096
Through Ball     196
Cut Back          43
Backheel          15
Name: pass_technique, dtype: int64

## Results

In [1198]:
extracted_data.head()

Unnamed: 0,period_x,timestamp_x,play_pattern_x,shot_statsbomb_xg,shot_body_part,shot_technique,shot_outcome,shot_first_time,under_pressure_x,shot_one_on_one,shot_deflected,shot_aerial_won,shot_open_goal,shot_redirect,pass_height,pass_technique,goal,shot_location_y,shot_location_x,shot_end_location_y,shot_end_location_x,pass_location_y,pass_location_x,pass_end_location_y,pass_end_location_x
0,1,2021-11-18 00:00:47.620,From Counter,0.092624,Right Foot,Normal,Blocked,False,False,False,False,False,False,False,High Pass,Through Ball,False,115.0,25.0,117.0,34.0,44.0,17.0,44.0,17.0
1,1,2021-11-18 00:05:12.780,From Throw In,0.041837,Left Foot,Normal,Blocked,False,False,False,False,False,False,False,Low Pass,Standard,False,109.0,51.0,112.0,44.0,102.0,45.0,102.0,45.0
2,1,2021-11-18 00:05:41.940,From Corner,0.017603,Right Foot,Half Volley,Blocked,True,False,False,False,False,False,False,No Pass,No Pass,False,99.0,52.0,108.0,51.0,0.0,0.0,0.0,0.0
3,1,2021-11-18 00:05:43.900,From Corner,0.144138,Right Foot,Normal,Blocked,False,False,False,False,False,False,False,No Pass,No Pass,False,107.0,40.0,112.0,37.0,0.0,0.0,0.0,0.0
4,1,2021-11-18 00:05:46.380,Recovery,0.068946,Left Foot,Normal,Goal,True,True,False,False,False,False,False,Ground Pass,Standard,True,108.0,32.0,120.0,43.2,104.0,36.0,104.0,36.0


In [1199]:
df_description(extracted_data)

Total Events: 6114 
 Total Features: 25


In [1200]:
# Updated list of features

list(extracted_data.columns.values)

['period_x',
 'timestamp_x',
 'play_pattern_x',
 'shot_statsbomb_xg',
 'shot_body_part',
 'shot_technique',
 'shot_outcome',
 'shot_first_time',
 'under_pressure_x',
 'shot_one_on_one',
 'shot_deflected',
 'shot_aerial_won',
 'shot_open_goal',
 'shot_redirect',
 'pass_height',
 'pass_technique',
 'goal',
 'shot_location_y',
 'shot_location_x',
 'shot_end_location_y',
 'shot_end_location_x',
 'pass_location_y',
 'pass_location_x',
 'pass_end_location_y',
 'pass_end_location_x']

# Update Feature Names

In [1201]:
extracted_data.rename(columns = {'period_x' : 'period',
                                 'timestamp_x' : 'timestamp',
                                 'play_pattern_x' :'play_pattern',
                                 'shot_statsbomb_xg' : 'statsbomb_xg',
                                 'under_pressure_x' : 'shot_under_pressure'},
                      inplace = True)

# Update Feature Order

In [1202]:
# Note: drop shot_outcome as no longer useful and redundant v goal

extracted_data = extracted_data[['goal',
                                 'period',
                                 'timestamp',
                                 'play_pattern',
                                 'shot_technique',
                                 'shot_body_part',
                                 'shot_first_time',
                                 'shot_under_pressure',
                                 'shot_one_on_one',
                                 'shot_aerial_won',
                                 'shot_deflected',
                                 'shot_redirect',
                                 'shot_open_goal',
                                 'shot_location_y',
                                 'shot_location_x',
                                 'shot_end_location_y',
                                 'shot_end_location_x',
                                 'pass_height',
                                 'pass_technique',
                                 'pass_location_y',
                                 'pass_location_x',
                                 'pass_end_location_y',
                                 'pass_end_location_x',
                                 'statsbomb_xg']]

# Cleaned Data

In [1203]:
cleaned_data = extracted_data

In [1204]:
cleaned_data.head()

Unnamed: 0,goal,period,timestamp,play_pattern,shot_technique,shot_body_part,shot_first_time,shot_under_pressure,shot_one_on_one,shot_aerial_won,shot_deflected,shot_redirect,shot_open_goal,shot_location_y,shot_location_x,shot_end_location_y,shot_end_location_x,pass_height,pass_technique,pass_location_y,pass_location_x,pass_end_location_y,pass_end_location_x,statsbomb_xg
0,False,1,2021-11-18 00:00:47.620,From Counter,Normal,Right Foot,False,False,False,False,False,False,False,115.0,25.0,117.0,34.0,High Pass,Through Ball,44.0,17.0,44.0,17.0,0.092624
1,False,1,2021-11-18 00:05:12.780,From Throw In,Normal,Left Foot,False,False,False,False,False,False,False,109.0,51.0,112.0,44.0,Low Pass,Standard,102.0,45.0,102.0,45.0,0.041837
2,False,1,2021-11-18 00:05:41.940,From Corner,Half Volley,Right Foot,True,False,False,False,False,False,False,99.0,52.0,108.0,51.0,No Pass,No Pass,0.0,0.0,0.0,0.0,0.017603
3,False,1,2021-11-18 00:05:43.900,From Corner,Normal,Right Foot,False,False,False,False,False,False,False,107.0,40.0,112.0,37.0,No Pass,No Pass,0.0,0.0,0.0,0.0,0.144138
4,True,1,2021-11-18 00:05:46.380,Recovery,Normal,Left Foot,True,True,False,False,False,False,False,108.0,32.0,120.0,43.2,Ground Pass,Standard,104.0,36.0,104.0,36.0,0.068946


In [1205]:
cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6114 entries, 0 to 6113
Data columns (total 24 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   goal                 6114 non-null   bool          
 1   period               6114 non-null   int64         
 2   timestamp            6114 non-null   datetime64[ns]
 3   play_pattern         6114 non-null   object        
 4   shot_technique       6114 non-null   object        
 5   shot_body_part       6114 non-null   object        
 6   shot_first_time      6114 non-null   bool          
 7   shot_under_pressure  6114 non-null   bool          
 8   shot_one_on_one      6114 non-null   bool          
 9   shot_aerial_won      6114 non-null   bool          
 10  shot_deflected       6114 non-null   bool          
 11  shot_redirect        6114 non-null   bool          
 12  shot_open_goal       6114 non-null   bool          
 13  shot_location_y      6114 non-nul

In [1206]:
df_description(cleaned_data)

Total Events: 6114 
 Total Features: 24


In [1207]:
save_df(cleaned_data,
        'cleaned_data')

cleaned_data Filesize: 247755 bytes


Continued in [expected_goals_feature engineering_notebook]()

*3 of 7*