# **Expected Goals Classifier**

## Overview

Build an Expected Goals (xG) classification model from historical match data and use it to produce actionable recommendations for technical and tactical analysis aimed at improving goal scoring.

Project details on GitHub: [Expected Goals Classifier]()

# Data Extraction Notebook

*Notebook 1 of 7*

## Index

1. Data extracted in [expected_goals_data_extraction_notebook]()
2. Data cleaned in [expected_goals_data_cleaning_notebook]()
3. Features engineered in [expected_goals_feature_engineering_notebook]()
4. Data explored in [expected_goals_data_exploration_notebook]()
5. Data preprocessed in [expected_goals_data_preprocessing_notebook]()
6. Predictions modeled in [expected_goals_model_fitting_notebook]()
7. Conclusions in [expected_goals_model_assessment_notebook]()

# Data

Data is sourced from [StatsBomb](https://statsbomb.com/), a United Kingdom-based football (soccer) data analytics company.

StatsBomb provides free access to its proprietary data via GitHub: [StatsBomb Open Data](https://github.com/statsbomb/open-data)

StatsBomb Open Data is organized in JSON files:
* **[Matches](https://github.com/statsbomb/open-data/tree/master/data/matches)**
  * Folders organized by competition (league or tournament)
    * Files organized by season (year) ID
    * Files contain nested dictionaries with descriptive data for each individual match
* **[Events](https://github.com/statsbomb/open-data/tree/master/data/events)**
  * Files organized by match ID
  * Files contain nested dictionaries with descriptive data for each event within each individual match
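
The layout above maps directly onto file paths. The following is a minimal sketch, assuming a local clone of the Open Data repository in which competition folders, season files, and event files are named by their numeric IDs. The `root` argument and the three helper names are hypothetical and are not part of any StatsBomb tooling.

```python
import json
from pathlib import Path


def matches_file(root, competition_id, season_id):
    """Path to the matches file for one competition season (assumed layout)."""
    return Path(root) / "data" / "matches" / str(competition_id) / f"{season_id}.json"


def events_file(root, match_id):
    """Path to the events file for one match (assumed layout)."""
    return Path(root) / "data" / "events" / f"{match_id}.json"


def load_json(path):
    """Read one Open Data file into Python lists and dictionaries."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

The notebook itself uses the `statsbombpy` package rather than reading these files directly, so the sketch only illustrates how the folders and files relate to the IDs used below.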

# Packages

In [1]:
# Drive and IO to access saved files
from google.colab import drive, files
drive.mount('/content/drive')

import io

# Pathlib for file retrieval
import pathlib
from pathlib import Path as path

# Suppress warnings in notebook output
import warnings
warnings.filterwarnings('ignore')

# Statsbombpy package for extracting StatsBomb data
!pip install statsbombpy
from statsbombpy import sb

# Pandas for dataframes
import pandas as pd

Mounted at /content/drive

# Matches Data

In [2]:
# Create dataframe from StatsBomb competitions

competitions_df = sb.competitions()

credentials were not supplied. open data access only


In [4]:
print('Available Competitions:',
      competitions_df['competition_name'].unique())

Available Competitions: ['Champions League' "FA Women's Super League" 'FIFA World Cup' 'La Liga'
 'NWSL' 'Premier League' 'UEFA Euro' "Women's World Cup"]


In [5]:
# Isolate target competitions from StatsBomb Open Data
# Women's competitions

target_comp_df = competitions_df.loc[competitions_df['competition_gender'] == 'female']

target_comp_ids = target_comp_df['competition_id'].unique()

target_season_ids = target_comp_df['season_id'].unique()

In [10]:
print("Women's Competitions:",
      target_comp_df['competition_name'].unique(),
      '\n',
      "Women's competition_ids:",
      target_comp_ids,
      '\n',
      "Women's Competition season_ids:",
      target_season_ids)

Women's Competitions: ["FA Women's Super League" 'NWSL'] 
 Women's competition_ids: [37 49] 
 Women's Competition season_ids: [42  4  3]


In [7]:
# Refine target competitions
# Women's club competitions

target_comp_df = competitions_df.loc[competitions_df['competition_id'].isin([37, 49])]

target_comp_ids = target_comp_df['competition_id'].unique()

target_season_ids = target_comp_df['season_id'].unique()

In [13]:
print("Women's Club Competitions:",
      target_comp_df['competition_name'].unique(),
      '\n',
      "Women's Club competition_ids:",
      target_comp_ids,
      '\n',
      "Women's Club Competition season_ids:",
      target_season_ids)

Women's Club Competitions: ["FA Women's Super League" 'NWSL'] 
 Women's Club competition_ids: [37 49] 
 Women's Club Competition season_ids: [42  4  3]


In [15]:
print("Number of Women's Club Seasons:",
      len(target_season_ids))

Number of Women's Club Seasons: 3


In [17]:
# Create dataframes for the matches in each season of the target competitions

matches_df_37_42 = sb.matches(competition_id = 37,
                              season_id = 42)

matches_df_37_4 = sb.matches(competition_id = 37,
                             season_id = 4)

matches_df_49_3 = sb.matches(competition_id = 49,
                             season_id = 3)

credentials were not supplied. open data access only
credentials were not supplied. open data access only
credentials were not supplied. open data access only


In [18]:
# Concatenate dataframes for the matches in each season of the target
# leagues into combined dataframe

matches_df = pd.concat([matches_df_37_42,
                        matches_df_37_4,
                        matches_df_49_3],
                       ignore_index = True)
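
The three hard-coded calls above could also be driven by the competition and season IDs already held in `target_comp_df`. The following is a minimal sketch, not the notebook's method: `collect_matches` is a hypothetical helper, and its `fetch` parameter stands in for `sb.matches` so the logic can be tried on toy data without network access.

```python
import pandas as pd


def collect_matches(comp_df, fetch):
    """Fetch matches for every (competition_id, season_id) row and stack them."""
    pairs = comp_df[["competition_id", "season_id"]].drop_duplicates()
    frames = [
        fetch(competition_id=int(row.competition_id), season_id=int(row.season_id))
        for row in pairs.itertuples(index=False)
    ]
    # Same stacking as the cell above: one combined frame with a fresh index
    return pd.concat(frames, ignore_index=True)
```

With `statsbombpy` imported as in this notebook, the call would be `collect_matches(target_comp_df, sb.matches)`. Adding a competition or season to `target_comp_df` then needs no further code changes.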

In [19]:
matches_df.head()

Unnamed: 0,match_id,match_date,kick_off,competition,season,home_team,away_team,home_score,away_score,match_status,match_status_360,last_updated,last_updated_360,match_week,competition_stage,stadium,referee,data_version,shot_fidelity_version,xy_fidelity_version
0,2275136,2019-09-07,16:00:00.000,England - FA Women's Super League,2019/2020,Manchester City WFC,Manchester United,1,0,available,scheduled,2021-05-29T16:47:06.782,2021-06-13T16:17:31.694,1,Regular Season,Etihad Stadium (Manchester),Rebecca Welch,1.1.0,2,2
1,2275154,2019-11-17,15:00:00.000,England - FA Women's Super League,2019/2020,Chelsea FCW,Manchester United,1,0,available,scheduled,2021-05-29T17:02:28.194,2021-06-13T16:17:31.694,6,Regular Season,The Cherry Red Records Stadium,Jack Packman,1.1.0,2,2
2,2275150,2019-12-01,16:00:00.000,England - FA Women's Super League,2019/2020,West Ham United LFC,Manchester United,3,2,available,scheduled,2021-06-01T12:37:46.754,2021-06-13T16:17:31.694,8,Regular Season,"The Rush Green Stadium (Romford, Greater London)",Amy Fearn,1.1.0,2,2
3,2275146,2019-12-08,13:00:00.000,England - FA Women's Super League,2019/2020,Manchester United,Everton LFC,3,1,available,scheduled,2021-06-01T12:42:40.738,2021-06-13T16:17:31.694,9,Regular Season,Leigh Sports Village Stadium,Joe Hull,1.1.0,2,2
4,2275142,2020-01-05,13:00:00.000,England - FA Women's Super League,2019/2020,Manchester United,Bristol City WFC,0,1,available,scheduled,2021-06-01T12:47:08.488,2021-06-13T16:17:31.694,11,Regular Season,Leigh Sports Village Stadium,Lucy Oliver,1.1.0,2,2


In [20]:
print("Total Women's Club Matches:",
      len(matches_df))

Total Women's Club Matches: 231


In [21]:
# Save matches_df

matches_df.to_parquet('/content/drive/MyDrive/expected_goals/data_extraction/dataframes/matches_df.parquet')

In [22]:
print('matches_df Filesize:',
      path('/content/drive/MyDrive/expected_goals/data_extraction/dataframes/matches_df.parquet').stat().st_size,
      'bytes')

matches_df Filesize: 21378 bytes


# Shot Events

In [23]:
# Create dataframes for the target events in each season of the target competitions
# Shots
# season_id 4 is 2018/2019 and season_id 42 is 2019/2020 (see matches_df above)

shots_df_37_4 = sb.competition_events(country = 'England',
                                      division = "FA Women's Super League",
                                      season = '2018/2019',
                                      gender = 'female',
                                      split = True)['shots']

shots_df_37_42 = sb.competition_events(country = 'England',
                                       division = "FA Women's Super League",
                                       season = '2019/2020',
                                       gender = 'female',
                                       split = True)['shots']

shots_df_49_3 = sb.competition_events(country = 'United States of America',
                                      division = 'NWSL',
                                      season = '2018',
                                      gender = 'female',
                                      split = True)['shots']

credentials were not supplied. open data access only

In [24]:
# Concatenate shot events dataframes into a combined dataframe

shots_df = pd.concat([shots_df_37_4,
                      shots_df_37_42,
                      shots_df_49_3],
                     ignore_index = True)
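
Each `sb.competition_events(..., split = True)` call above is indexed by a single event type (`['shots']`), and the same calls are repeated in later sections for passes, dribbles, and carries. The following is a minimal sketch of fetching each season once and keeping every table of interest. `collect_event_tables` is a hypothetical helper, and `fetch` is an injected stand-in for `sb.competition_events` with `split = True`, which is assumed to return a dictionary of dataframes keyed by event type, as the indexing in this notebook implies.

```python
def collect_event_tables(seasons, event_types, fetch):
    """Call `fetch` once per season and group the requested tables by event type."""
    tables = {event_type: [] for event_type in event_types}
    for season in seasons:
        split_events = fetch(**season)  # one download per season
        for event_type in event_types:
            tables[event_type].append(split_events[event_type])
    return tables


# Keyword arguments copied from the calls in this notebook
SEASONS = [
    {"country": "England", "division": "FA Women's Super League",
     "season": "2018/2019", "gender": "female"},
    {"country": "England", "division": "FA Women's Super League",
     "season": "2019/2020", "gender": "female"},
    {"country": "United States of America", "division": "NWSL",
     "season": "2018", "gender": "female"},
]
```

With `statsbombpy` installed, `fetch` could be `lambda **kw: sb.competition_events(split = True, **kw)`, and each list in the result could then be passed to `pd.concat(..., ignore_index = True)`. This would reduce twelve season downloads to three.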

In [25]:
shots_df.head()

Unnamed: 0,id,index,period,timestamp,minute,second,type,possession,possession_team,play_pattern,team,player,position,location,duration,related_events,match_id,shot_statsbomb_xg,shot_end_location,shot_key_pass_id,shot_body_part,shot_technique,shot_type,shot_outcome,shot_freeze_frame,shot_first_time,under_pressure,shot_one_on_one,shot_deflected,off_camera,shot_aerial_won,shot_open_goal,out,shot_redirect,shot_saved_off_target,shot_saved_to_post,shot_follows_dribble
0,d9c27699-dd12-4e55-96d6-4c95685e4c66,42,1,00:00:47.620,0,47,Shot,4,Chelsea FCW,From Counter,Chelsea FCW,Francesca Kirby,Right Center Forward,"[115.0, 25.0]",0.56,"[524c23dc-265a-4cfc-b367-bd37356c0185, f59f670...",7298,0.092624,"[117.0, 34.0]",abb17b2a-1775-4226-9158-67efe68ee0c3,Right Foot,Normal,Open Play,Blocked,"[{'location': [113.0, 32.0], 'player': {'id': ...",,,,,,,,,,,,
1,4eb844e2-9466-424a-abe3-1ba730afe716,237,1,00:05:12.780,5,12,Shot,15,Chelsea FCW,From Throw In,Chelsea FCW,Francesca Kirby,Right Center Forward,"[109.0, 51.0]",0.4,"[1880a599-93d5-450a-ac3d-0dc6b7b0609a, a61f6f5...",7298,0.041837,"[112.0, 44.0]",fcb92aad-eba5-4053-8606-e9d70856ddd9,Left Foot,Normal,Open Play,Blocked,"[{'location': [115.0, 44.0], 'player': {'id': ...",,,,,,,,,,,,
2,3e41e219-ac37-4e1c-97c3-eca7cd886484,243,1,00:05:41.940,5,41,Shot,16,Chelsea FCW,From Corner,Chelsea FCW,So-Yun Ji,Center Midfield,"[99.0, 52.0]",0.48,"[2fb3a5ea-a6cd-4231-ac7f-ea88e85c109a, 94bbaf4...",7298,0.017603,"[108.0, 51.0]",,Right Foot,Half Volley,Open Play,Blocked,"[{'location': [102.0, 46.0], 'player': {'id': ...",True,,,,,,,,,,,
3,f1c6e6cb-ab00-4223-8980-09747ce40924,248,1,00:05:43.900,5,43,Shot,16,Chelsea FCW,From Corner,Chelsea FCW,Drew Spence,Left Center Midfield,"[107.0, 40.0]",0.16,"[9b005c5d-8621-4764-838b-dd7695925bee, d3d11c9...",7298,0.144138,"[112.0, 37.0]",,Right Foot,Normal,Open Play,Blocked,"[{'location': [99.0, 52.0], 'player': {'id': 4...",,,,,,,,,,,,
4,33060f55-fd05-4a3b-bf81-4315b9cb6417,256,1,00:05:46.380,5,46,Shot,16,Chelsea FCW,From Corner,Chelsea FCW,Millie Bright,Right Center Back,"[108.0, 32.0]",1.48,"[55631e56-7585-498c-923f-b0f9f47f790a, 56f9c1f...",7298,0.068946,"[120.0, 43.2, 2.0]",bd21493e-5b27-499b-a0ba-367c3b18a70e,Left Foot,Normal,Open Play,Goal,"[{'location': [97.0, 52.0], 'player': {'id': 4...",True,True,,,,,,,,,,


In [29]:
print("Total Women's Club Competition Shot Events:",
      len(shots_df))

Total Women's Club Competition Shot Events: 6114


In [27]:
print('Total Shot Features:',
      shots_df.shape[1])

Total Shot Features: 37


In [None]:
# Save shots_df

shots_df.to_parquet('/content/drive/MyDrive/expected_goals/data_extraction/dataframes/shots_df.parquet')

In [None]:
print('shots_df Filesize:',
      path('/content/drive/MyDrive/expected_goals/data_extraction/dataframes/shots_df.parquet').stat().st_size,
      'bytes')

shots_df Filesize: 1588328 bytes


# Shot Key Pass Events

In [30]:
# Create dataframes for the target events in each season of the target competitions
# Passes
# season_id 4 is 2018/2019 and season_id 42 is 2019/2020 (see matches_df above)

passes_df_37_4 = sb.competition_events(country = 'England',
                                       division = "FA Women's Super League",
                                       season = '2018/2019',
                                       gender = 'female',
                                       split = True)['passes']

passes_df_37_42 = sb.competition_events(country = 'England',
                                        division = "FA Women's Super League",
                                        season = '2019/2020',
                                        gender = 'female',
                                        split = True)['passes']

passes_df_49_3 = sb.competition_events(country = 'United States of America',
                                       division = 'NWSL',
                                       season = '2018',
                                       gender = 'female',
                                       split = True)['passes']

credentials were not supplied. open data access only

In [31]:
# Concatenate pass event dataframes into a combined dataframe

passes_df = pd.concat([passes_df_37_4,
                       passes_df_37_42,
                       passes_df_49_3],
                      ignore_index = True)

In [32]:
passes_df.head()

Unnamed: 0,id,index,period,timestamp,minute,second,type,possession,possession_team,play_pattern,team,player,position,location,duration,related_events,match_id,pass_recipient,pass_length,pass_angle,pass_height,pass_end_location,pass_body_part,pass_type,under_pressure,pass_outcome,pass_assisted_shot_id,pass_shot_assist,pass_technique,pass_through_ball,pass_cross,pass_switch,pass_goal_assist,pass_aerial_won,pass_backheel,pass_deflected,counterpress,off_camera,pass_cut_back,pass_miscommunication,pass_outswinging,pass_straight,pass_inswinging,pass_no_touch,out
0,60d35b7c-3b85-42da-9af8-a74a21d8f7ca,5,1,00:00:00.100,0,0,Pass,2,Chelsea FCW,From Kick Off,Chelsea FCW,So-Yun Ji,Center Midfield,"[61.0, 40.0]",0.0,[23fcb90e-16ec-46af-b513-97750d74d58a],7298,Ramona Bachmann,3.605551,-0.982794,Ground Pass,"[63.0, 37.0]",Right Foot,Kick Off,,,,,,,,,,,,,,,,,,,,,
1,bbdcd0fe-1943-4b01-8b03-4eb9a22c7991,9,1,00:00:00.500,0,0,Pass,2,Chelsea FCW,From Kick Off,Chelsea FCW,Ramona Bachmann,Left Center Forward,"[69.0, 33.0]",1.64,"[2ca32e1b-1e10-4a21-9266-5871a12bac57, bdfe49c...",7298,Crystal Alyssia Dunn Soubrier,31.764761,-1.078987,Low Pass,"[84.0, 5.0]",Right Foot,,True,,,,,,,,,,,,,,,,,,,,
2,4bf93e42-bd64-46db-aa6e-891d7433d714,15,1,00:00:25.873,0,25,Pass,3,Manchester City WFC,From Goal Kick,Manchester City WFC,Ellie Roebuck,Goalkeeper,"[6.0, 43.0]",2.587,"[8735386e-0c17-47d3-b076-701bcdedd390, 9dac831...",7298,Nikita Parris,59.03389,0.456072,High Pass,"[59.0, 69.0]",Right Foot,Goal Kick,,Incomplete,,,,,,,,,,,,,,,,,,,
3,9dac8310-54c0-45f1-a4f3-d47df39a2edd,17,1,00:00:28.460,0,28,Pass,3,Manchester City WFC,From Goal Kick,Chelsea FCW,Magdalena Lilly Eriksson,Left Center Back,"[62.0, 12.0]",1.173,"[21351e9d-26a8-4df6-bcd3-69e0e06056c2, 4bf93e4...",7298,Crystal Alyssia Dunn Soubrier,18.110771,-0.110657,High Pass,"[80.0, 10.0]",Right Foot,Recovery,,Incomplete,,,,,,,,,,,,,,,,,,,
4,21351e9d-26a8-4df6-bcd3-69e0e06056c2,20,1,00:00:29.633,0,29,Pass,3,Manchester City WFC,From Goal Kick,Manchester City WFC,Esme Beth Morgan,Right Back,"[41.0, 71.0]",0.147,"[87576ac0-ad30-49fb-a8a4-c65020385165, 9dac831...",7298,,2.828427,0.785398,High Pass,"[43.0, 73.0]",Right Foot,Recovery,True,Incomplete,,,,,,,,,,,,,,,,,,,


In [33]:
print("Total Women's Club Competition Pass Events:",
      len(passes_df))

Total Women's Club Competition Pass Events: 209122


In [34]:
print('Total Pass Features:',
      passes_df.shape[1])

Total Pass Features: 45


In [35]:
# Save passes_df

passes_df.to_parquet('/content/drive/MyDrive/expected_goals/data_extraction/dataframes/passes_df.parquet')

In [36]:
print('passes_df Filesize:',
      path('/content/drive/MyDrive/expected_goals/data_extraction/dataframes/passes_df.parquet').stat().st_size,
      'bytes')

passes_df Filesize: 26135620 bytes


## Search Shot Events 'shot_key_pass_id' for Passes

In [37]:
# Create key_pass_events list from shot_key_pass_id values for shot events

key_pass_events = list(shots_df['shot_key_pass_id'])

In [40]:
# Search pass events for key_pass_events

passes_to_shots_df = passes_df[passes_df['id'].isin(key_pass_events)]

print("Pass Events Identified as 'shot_key_pass_id' for Shot Events:",
      len(passes_to_shots_df))

Pass Events Identified as 'shot_key_pass_id' for Shot Events: 4164


## Merge Pass Event Data with Shot Events

In [41]:
# Left-merge pass data from passes_df onto shots_df on shot_key_pass_id:
# every shot is kept, and pass columns are filled only where a key pass exists

passes_df2 = passes_df.rename(columns = {'id': 'shot_key_pass_id'})

extracted_data = pd.merge(shots_df, passes_df2, on = ['shot_key_pass_id'],
                          how = 'left')

In [44]:
extracted_data.head()

Unnamed: 0,id,index_x,period_x,timestamp_x,minute_x,second_x,type_x,possession_x,possession_team_x,play_pattern_x,team_x,player_x,position_x,location_x,duration_x,related_events_x,match_id_x,shot_statsbomb_xg,shot_end_location,shot_key_pass_id,shot_body_part,shot_technique,shot_type,shot_outcome,shot_freeze_frame,shot_first_time,under_pressure_x,shot_one_on_one,shot_deflected,off_camera_x,shot_aerial_won,shot_open_goal,out_x,shot_redirect,shot_saved_off_target,shot_saved_to_post,shot_follows_dribble,index_y,period_y,timestamp_y,...,second_y,type_y,possession_y,possession_team_y,play_pattern_y,team_y,player_y,position_y,location_y,duration_y,related_events_y,match_id_y,pass_recipient,pass_length,pass_angle,pass_height,pass_end_location,pass_body_part,pass_type,under_pressure_y,pass_outcome,pass_assisted_shot_id,pass_shot_assist,pass_technique,pass_through_ball,pass_cross,pass_switch,pass_goal_assist,pass_aerial_won,pass_backheel,pass_deflected,counterpress,off_camera_y,pass_cut_back,pass_miscommunication,pass_outswinging,pass_straight,pass_inswinging,pass_no_touch,out_y
0,d9c27699-dd12-4e55-96d6-4c95685e4c66,42,1,00:00:47.620,0,47,Shot,4,Chelsea FCW,From Counter,Chelsea FCW,Francesca Kirby,Right Center Forward,"[115.0, 25.0]",0.56,"[524c23dc-265a-4cfc-b367-bd37356c0185, f59f670...",7298,0.092624,"[117.0, 34.0]",abb17b2a-1775-4226-9158-67efe68ee0c3,Right Foot,Normal,Open Play,Blocked,"[{'location': [113.0, 32.0], 'player': {'id': ...",,,,,,,,,,,,,33.0,1.0,00:00:38.660,...,38.0,Pass,4.0,Chelsea FCW,From Counter,Chelsea FCW,Anita Amma Ankyewah Asante,Center Back,"[44.0, 17.0]",3.453,[7fbb0f53-f758-4667-9993-062fde493f1c],7298.0,Francesca Kirby,51.351727,0.117109,High Pass,"[95.0, 23.0]",Right Foot,,,,d9c27699-dd12-4e55-96d6-4c95685e4c66,True,Through Ball,True,,,,,,,,,,,,,,,
1,4eb844e2-9466-424a-abe3-1ba730afe716,237,1,00:05:12.780,5,12,Shot,15,Chelsea FCW,From Throw In,Chelsea FCW,Francesca Kirby,Right Center Forward,"[109.0, 51.0]",0.4,"[1880a599-93d5-450a-ac3d-0dc6b7b0609a, a61f6f5...",7298,0.041837,"[112.0, 44.0]",fcb92aad-eba5-4053-8606-e9d70856ddd9,Left Foot,Normal,Open Play,Blocked,"[{'location': [115.0, 44.0], 'player': {'id': ...",,,,,,,,,,,,,231.0,1.0,00:05:09.540,...,9.0,Pass,15.0,Chelsea FCW,From Throw In,Chelsea FCW,Maren Nævdal Mjelde,Right Center Midfield,"[102.0, 45.0]",1.56,[9998635e-bd33-4f9e-aa46-81aa674b65b4],7298.0,Francesca Kirby,14.866069,0.737815,Low Pass,"[113.0, 55.0]",Head,,,,4eb844e2-9466-424a-abe3-1ba730afe716,True,,,,,,,,,,,,,,,,,
2,3e41e219-ac37-4e1c-97c3-eca7cd886484,243,1,00:05:41.940,5,41,Shot,16,Chelsea FCW,From Corner,Chelsea FCW,So-Yun Ji,Center Midfield,"[99.0, 52.0]",0.48,"[2fb3a5ea-a6cd-4231-ac7f-ea88e85c109a, 94bbaf4...",7298,0.017603,"[108.0, 51.0]",,Right Foot,Half Volley,Open Play,Blocked,"[{'location': [102.0, 46.0], 'player': {'id': ...",True,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,f1c6e6cb-ab00-4223-8980-09747ce40924,248,1,00:05:43.900,5,43,Shot,16,Chelsea FCW,From Corner,Chelsea FCW,Drew Spence,Left Center Midfield,"[107.0, 40.0]",0.16,"[9b005c5d-8621-4764-838b-dd7695925bee, d3d11c9...",7298,0.144138,"[112.0, 37.0]",,Right Foot,Normal,Open Play,Blocked,"[{'location': [99.0, 52.0], 'player': {'id': 4...",,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,33060f55-fd05-4a3b-bf81-4315b9cb6417,256,1,00:05:46.380,5,46,Shot,16,Chelsea FCW,From Corner,Chelsea FCW,Millie Bright,Right Center Back,"[108.0, 32.0]",1.48,"[55631e56-7585-498c-923f-b0f9f47f790a, 56f9c1f...",7298,0.068946,"[120.0, 43.2, 2.0]",bd21493e-5b27-499b-a0ba-367c3b18a70e,Left Foot,Normal,Open Play,Goal,"[{'location': [97.0, 52.0], 'player': {'id': 4...",True,True,,,,,,,,,,,251.0,1.0,00:05:45.020,...,45.0,Pass,16.0,Chelsea FCW,From Corner,Chelsea FCW,Anita Amma Ankyewah Asante,Center Back,"[104.0, 36.0]",0.96,[0f23d443-778e-4c00-9ac3-b23c8d0f6f3b],7298.0,Millie Bright,8.602325,-0.950547,Ground Pass,"[109.0, 29.0]",Right Foot,Recovery,,,33060f55-fd05-4a3b-bf81-4315b9cb6417,,,,,,True,,,,,,,,,,,,


In [45]:
print('Updated Shot w/ Key Pass Features:',
      extracted_data.shape[1])

Updated Shot w/ Key Pass Features: 81
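
The 81 columns follow from the merge: 37 shot columns plus 45 pass columns, minus the shared `shot_key_pass_id` key, with `_x` and `_y` suffixes on names present in both frames. The following is a minimal sketch of the same left merge on toy data, with more descriptive suffixes and two of pandas' built-in checks. The toy frames and the `_shot`/`_pass` suffixes are illustrative assumptions, not the notebook's data or naming.

```python
import pandas as pd

# Toy stand-ins for shots_df and passes_df (values are made up)
toy_shots = pd.DataFrame({
    "id": ["s1", "s2", "s3"],
    "player": ["A", "B", "C"],
    "shot_key_pass_id": ["p1", None, "p3"],
})
toy_passes = pd.DataFrame({
    "id": ["p1", "p2", "p3"],
    "player": ["D", "E", "F"],
    "pass_length": [12.0, 30.5, 8.2],
})

merged = pd.merge(
    toy_shots,
    toy_passes.rename(columns={"id": "shot_key_pass_id"}),
    on="shot_key_pass_id",
    how="left",                    # keep every shot, with or without a key pass
    suffixes=("_shot", "_pass"),   # replaces the default _x / _y
    validate="many_to_one",        # raise if a pass ID appears more than once
    indicator=True,                # adds a _merge column: 'both' or 'left_only'
)

print(merged["_merge"].value_counts())
```

Rows marked `left_only` are shots without a matching key pass, which is why their pass columns are empty in the output above.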


# Shot Related Events

In [46]:
# Create list of 'related_events' values for shot events
# Each value is itself a list of event IDs, so this is a list of lists

related_events = list(shots_df['related_events'])

## Dribbles

In [47]:
# Create dataframes for the target events in each season of the target competitions
# Dribbles
# season_id 4 is 2018/2019 and season_id 42 is 2019/2020 (see matches_df above)

dribbles_df_37_4 = sb.competition_events(country = 'England',
                                         division = "FA Women's Super League",
                                         season = '2018/2019',
                                         gender = 'female',
                                         split = True)['dribbles']

dribbles_df_37_42 = sb.competition_events(country = 'England',
                                          division = "FA Women's Super League",
                                          season = '2019/2020',
                                          gender = 'female',
                                          split = True)['dribbles']

dribbles_df_49_3 = sb.competition_events(country = 'United States of America',
                                         division = 'NWSL',
                                         season = '2018',
                                         gender = 'female',
                                         split = True)['dribbles']

credentials were not supplied. open data access only

In [48]:
# Concatenate dribble event dataframes into a combined dataframe

dribbles_df = pd.concat([dribbles_df_37_4,
                         dribbles_df_37_42,
                         dribbles_df_49_3],
                        ignore_index = True)

In [49]:
dribbles_df.head()

Unnamed: 0,id,index,period,timestamp,minute,second,type,possession,possession_team,play_pattern,team,player,position,location,under_pressure,related_events,match_id,dribble_outcome,dribble_nutmeg,dribble_overrun,duration,out,dribble_no_touch
0,e8903265-c6a4-4e45-9dc3-d9399e9e9772,37,1,00:00:42.220,0,42,Dribble,4,Chelsea FCW,From Counter,Chelsea FCW,Francesca Kirby,Right Center Forward,"[98.0, 22.0]",True,[18a7dea2-ed54-46c7-9d0a-bc3c87b82241],7298,Complete,,,,,
1,54755692-1324-43bb-abc5-acc085fb6874,40,1,00:00:45.980,0,45,Dribble,4,Chelsea FCW,From Counter,Chelsea FCW,Francesca Kirby,Right Center Forward,"[118.0, 22.0]",True,[f16de089-97ad-443b-a022-a8476ab91aab],7298,Complete,,,,,
2,9acab966-b7d0-44d6-bb54-5b2a083576ed,78,1,00:02:07.500,2,7,Dribble,6,Manchester City WFC,Regular Play,Manchester City WFC,Nikita Parris,Right Wing,"[115.0, 59.0]",True,"[a943797d-d163-45b5-ab62-d8f0868716c9, d38ae3f...",7298,Incomplete,,,,,
3,b59849b3-1c91-4c39-a7d5-2144166ff690,101,1,00:03:06.020,3,6,Dribble,9,Manchester City WFC,Regular Play,Manchester City WFC,Julia Spetsmark,Left Wing,"[41.0, 7.0]",True,"[41f0bfc9-6c05-4754-9d98-d1bda99e19e0, fc1a87f...",7298,Incomplete,,,,,
4,4cf12e69-40a3-495a-9d92-89ac9d80d6c6,235,1,00:05:11.100,5,11,Dribble,15,Chelsea FCW,From Throw In,Chelsea FCW,Francesca Kirby,Right Center Forward,"[111.0, 54.0]",True,"[5fab108e-3fc1-4080-a432-5788dfa632dc, 9940802...",7298,Complete,,,,,


In [50]:
print("Total Women's Club Competition Dribble Events:",
      len(dribbles_df))

Total Women's Club Competition Dribble Events: 8187


In [51]:
print('Total Dribble Features:',
      dribbles_df.shape[1])

Total Dribble Features: 23


In [52]:
# Save dribbles_df

dribbles_df.to_parquet('/content/drive/MyDrive/expected_goals/data_extraction/dataframes/dribbles_df.parquet')

In [53]:
print('dribbles_df Filesize:',
      path('/content/drive/MyDrive/expected_goals/data_extraction/dataframes/dribbles_df.parquet').stat().st_size,
      'bytes')

dribbles_df Filesize: 882395 bytes


### Search Shot Events 'related_events' for Dribbles

In [54]:
# Search dribble events for related_events

dribbles_to_shots_df = dribbles_df[dribbles_df['id'].isin(related_events)]

print("Dribble Events Identified as 'related_events' for Shot Events:",
      len(dribbles_to_shots_df))

Dribble Events Identified as 'related_events' for Shot Events: 0
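
Because `related_events` holds one list of IDs per shot, the search above compares each dribble `id` (a single string) against whole lists, which can never be equal. A count of 0 therefore does not by itself show that no dribbles are related to shots. The following is a minimal sketch of flattening the lists first, assuming missing entries appear as `None` or `NaN`. `flatten_related_events` is a hypothetical helper and the IDs are made up.

```python
def flatten_related_events(values):
    """Return the set of event IDs found in a column of ID lists."""
    related_ids = set()
    for value in values:
        # Skip missing entries (None or NaN); keep only list-like cells
        if isinstance(value, (list, tuple)):
            related_ids.update(value)
    return related_ids


# Toy column shaped like shots_df['related_events']
toy_related = [["d1", "p7"], float("nan"), ["c3"], None]
related_ids = flatten_related_events(toy_related)

toy_dribble_ids = ["d1", "d2"]
matches = [event_id for event_id in toy_dribble_ids if event_id in related_ids]
print(matches)  # ['d1']
```

With the notebook's frames, the flattened set could be passed to `dribbles_df['id'].isin(...)` in place of `related_events`. The resulting count is not known here and would need to be checked by rerunning the cell.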


## Carries

In [55]:
# Create dataframes for the target events in each season of the target competitions
# Carries ('carrys' is the key used in the split events)
# season_id 4 is 2018/2019 and season_id 42 is 2019/2020 (see matches_df above)

carrys_df_37_4 = sb.competition_events(country = 'England',
                                       division = "FA Women's Super League",
                                       season = '2018/2019',
                                       gender = 'female',
                                       split = True)['carrys']

carrys_df_37_42 = sb.competition_events(country = 'England',
                                        division = "FA Women's Super League",
                                        season = '2019/2020',
                                        gender = 'female',
                                        split = True)['carrys']

carrys_df_49_3 = sb.competition_events(country = 'United States of America',
                                       division = 'NWSL',
                                       season = '2018',
                                       gender = 'female',
                                       split = True)['carrys']

credentials were not supplied. open data access only

In [56]:
# Concatenate carry event dataframes into a combined dataframe

carrys_df = pd.concat([carrys_df_37_4,
                       carrys_df_37_42,
                       carrys_df_49_3],
                      ignore_index = True)

In [57]:
carrys_df.head()

Unnamed: 0,id,index,period,timestamp,minute,second,type,possession,possession_team,play_pattern,team,player,position,location,duration,under_pressure,related_events,match_id,carry_end_location
0,eb965fcc-3962-4f91-8768-3756a4f4bba7,7,1,00:00:00.100,0,0,Carry,2,Chelsea FCW,From Kick Off,Chelsea FCW,Ramona Bachmann,Left Center Forward,"[63.0, 37.0]",0.4,True,"[23fcb90e-16ec-46af-b513-97750d74d58a, bbdcd0f...",7298,"[69.0, 33.0]"
1,1d969fe3-e11b-4da6-84c4-f584662b3c72,11,1,00:00:02.140,0,2,Carry,2,Chelsea FCW,From Kick Off,Chelsea FCW,Crystal Alyssia Dunn Soubrier,Left Midfield,"[84.0, 5.0]",4.6,True,"[1b07a297-b912-4eb4-8e7c-e1c19b3b852f, 2ca32e1...",7298,"[108.0, 10.0]"
2,289bf85b-39fc-42ec-8f02-c1c79466670d,25,1,00:00:32.673,0,32,Carry,3,Manchester City WFC,From Goal Kick,Manchester City WFC,Nikita Parris,Right Wing,"[58.0, 76.0]",0.12,,"[54fe20f0-4347-4873-875a-bd0c51dcd563, 7b5f1a3...",7298,"[58.0, 76.0]"
3,91bea69d-ad9d-430f-bbba-c1a28f0b74b5,29,1,00:00:34.299,0,34,Carry,3,Manchester City WFC,From Goal Kick,Manchester City WFC,Nadia Nadim,Center Forward,"[75.0, 71.0]",1.081,True,"[3fdef483-a518-4bd0-bbea-9d8432778b22, 9dcc0dd...",7298,"[79.0, 70.0]"
4,2bbbfa53-2958-4dc0-b413-7e8880b0e244,32,1,00:00:35.620,0,35,Carry,4,Chelsea FCW,From Counter,Chelsea FCW,Anita Amma Ankyewah Asante,Center Back,"[37.0, 10.0]",3.04,,"[010edf6c-0d7a-42a0-b613-aaabac2f01a3, abb17b2...",7298,"[44.0, 17.0]"


In [63]:
print("Total Women's Club Competition Carry Events:",
      len(carrys_df))

Total Women's Club Competition Carry Events: 168439


In [59]:
print('Total Carry Features:',
      carrys_df.shape[1])

Total Carry Features: 19


In [60]:
# Save carrys_df

carrys_df.to_parquet('/content/drive/MyDrive/expected_goals/data_extraction/dataframes/carrys_df.parquet')

In [61]:
print('carrys_df Filesize:',
      path('/content/drive/MyDrive/expected_goals/data_extraction/dataframes/carrys_df.parquet').stat().st_size,
      'bytes')

carrys_df Filesize: 25553176 bytes


### Search Shot Events 'related_events' for Carrys

In [62]:
# Keep carry events whose id appears among the shot events' related_events

carrys_to_shots_df = carrys_df[carrys_df['id'].isin(related_events)]

print("Carry Events Identified as 'related_events' for Shot Events:",
      len(carrys_to_shots_df))

Carry Events Identified as 'related_events' for Shot Events: 0
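
A count of 0 may be genuine, but one possible cause worth ruling out is that `related_events` holds one list of IDs per shot rather than a flat collection of ID strings, because `isin` compares each carry `id` against the elements it is given. The following is a minimal sketch, assuming (not confirmed by this notebook) that the IDs originate from a shot DataFrame column of lists. `shots_demo`, `carrys_demo`, and `flatten_related_events` are hypothetical names used only for illustration.

```python
import pandas as pd

def flatten_related_events(series):
    """Collect every ID from a column of ID lists into one flat set.

    Entries that are not lists or tuples (e.g. NaN/None) are skipped.
    """
    flat_ids = set()
    for ids in series:
        if isinstance(ids, (list, tuple)):
            flat_ids.update(ids)
    return flat_ids

# Hypothetical toy data standing in for shot and carry events
shots_demo = pd.DataFrame({
    'id': ['s1', 's2', 's3'],
    'related_events': [['c1', 'p1'], None, ['c3']],
})
carrys_demo = pd.DataFrame({'id': ['c1', 'c2', 'c3']})

# Flatten first, then filter carries by membership in the flat set
related_ids = flatten_related_events(shots_demo['related_events'])
carrys_to_shots_demo = carrys_demo[carrys_demo['id'].isin(related_ids)]

print('Matching carry events:',
      len(carrys_to_shots_demo))
```

If the flattened set still yields no matches on the real data, the zero count is more likely a property of the data than of the lookup.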


# Extracted Data

In [64]:
extracted_data.head()

Unnamed: 0,id,index_x,period_x,timestamp_x,minute_x,second_x,type_x,possession_x,possession_team_x,play_pattern_x,team_x,player_x,position_x,location_x,duration_x,related_events_x,match_id_x,shot_statsbomb_xg,shot_end_location,shot_key_pass_id,shot_body_part,shot_technique,shot_type,shot_outcome,shot_freeze_frame,shot_first_time,under_pressure_x,shot_one_on_one,shot_deflected,off_camera_x,shot_aerial_won,shot_open_goal,out_x,shot_redirect,shot_saved_off_target,shot_saved_to_post,shot_follows_dribble,index_y,period_y,timestamp_y,...,second_y,type_y,possession_y,possession_team_y,play_pattern_y,team_y,player_y,position_y,location_y,duration_y,related_events_y,match_id_y,pass_recipient,pass_length,pass_angle,pass_height,pass_end_location,pass_body_part,pass_type,under_pressure_y,pass_outcome,pass_assisted_shot_id,pass_shot_assist,pass_technique,pass_through_ball,pass_cross,pass_switch,pass_goal_assist,pass_aerial_won,pass_backheel,pass_deflected,counterpress,off_camera_y,pass_cut_back,pass_miscommunication,pass_outswinging,pass_straight,pass_inswinging,pass_no_touch,out_y
0,d9c27699-dd12-4e55-96d6-4c95685e4c66,42,1,00:00:47.620,0,47,Shot,4,Chelsea FCW,From Counter,Chelsea FCW,Francesca Kirby,Right Center Forward,"[115.0, 25.0]",0.56,"[524c23dc-265a-4cfc-b367-bd37356c0185, f59f670...",7298,0.092624,"[117.0, 34.0]",abb17b2a-1775-4226-9158-67efe68ee0c3,Right Foot,Normal,Open Play,Blocked,"[{'location': [113.0, 32.0], 'player': {'id': ...",,,,,,,,,,,,,33.0,1.0,00:00:38.660,...,38.0,Pass,4.0,Chelsea FCW,From Counter,Chelsea FCW,Anita Amma Ankyewah Asante,Center Back,"[44.0, 17.0]",3.453,[7fbb0f53-f758-4667-9993-062fde493f1c],7298.0,Francesca Kirby,51.351727,0.117109,High Pass,"[95.0, 23.0]",Right Foot,,,,d9c27699-dd12-4e55-96d6-4c95685e4c66,True,Through Ball,True,,,,,,,,,,,,,,,
1,4eb844e2-9466-424a-abe3-1ba730afe716,237,1,00:05:12.780,5,12,Shot,15,Chelsea FCW,From Throw In,Chelsea FCW,Francesca Kirby,Right Center Forward,"[109.0, 51.0]",0.4,"[1880a599-93d5-450a-ac3d-0dc6b7b0609a, a61f6f5...",7298,0.041837,"[112.0, 44.0]",fcb92aad-eba5-4053-8606-e9d70856ddd9,Left Foot,Normal,Open Play,Blocked,"[{'location': [115.0, 44.0], 'player': {'id': ...",,,,,,,,,,,,,231.0,1.0,00:05:09.540,...,9.0,Pass,15.0,Chelsea FCW,From Throw In,Chelsea FCW,Maren Nævdal Mjelde,Right Center Midfield,"[102.0, 45.0]",1.56,[9998635e-bd33-4f9e-aa46-81aa674b65b4],7298.0,Francesca Kirby,14.866069,0.737815,Low Pass,"[113.0, 55.0]",Head,,,,4eb844e2-9466-424a-abe3-1ba730afe716,True,,,,,,,,,,,,,,,,,
2,3e41e219-ac37-4e1c-97c3-eca7cd886484,243,1,00:05:41.940,5,41,Shot,16,Chelsea FCW,From Corner,Chelsea FCW,So-Yun Ji,Center Midfield,"[99.0, 52.0]",0.48,"[2fb3a5ea-a6cd-4231-ac7f-ea88e85c109a, 94bbaf4...",7298,0.017603,"[108.0, 51.0]",,Right Foot,Half Volley,Open Play,Blocked,"[{'location': [102.0, 46.0], 'player': {'id': ...",True,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,f1c6e6cb-ab00-4223-8980-09747ce40924,248,1,00:05:43.900,5,43,Shot,16,Chelsea FCW,From Corner,Chelsea FCW,Drew Spence,Left Center Midfield,"[107.0, 40.0]",0.16,"[9b005c5d-8621-4764-838b-dd7695925bee, d3d11c9...",7298,0.144138,"[112.0, 37.0]",,Right Foot,Normal,Open Play,Blocked,"[{'location': [99.0, 52.0], 'player': {'id': 4...",,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,33060f55-fd05-4a3b-bf81-4315b9cb6417,256,1,00:05:46.380,5,46,Shot,16,Chelsea FCW,From Corner,Chelsea FCW,Millie Bright,Right Center Back,"[108.0, 32.0]",1.48,"[55631e56-7585-498c-923f-b0f9f47f790a, 56f9c1f...",7298,0.068946,"[120.0, 43.2, 2.0]",bd21493e-5b27-499b-a0ba-367c3b18a70e,Left Foot,Normal,Open Play,Goal,"[{'location': [97.0, 52.0], 'player': {'id': 4...",True,True,,,,,,,,,,,251.0,1.0,00:05:45.020,...,45.0,Pass,16.0,Chelsea FCW,From Corner,Chelsea FCW,Anita Amma Ankyewah Asante,Center Back,"[104.0, 36.0]",0.96,[0f23d443-778e-4c00-9ac3-b23c8d0f6f3b],7298.0,Millie Bright,8.602325,-0.950547,Ground Pass,"[109.0, 29.0]",Right Foot,Recovery,,,33060f55-fd05-4a3b-bf81-4315b9cb6417,,,,,,True,,,,,,,,,,,,


In [65]:
print('Total Extracted Events:',
      len(extracted_data))

Total Extracted Events: 6114


In [66]:
print('Total Extracted Features:',
      extracted_data.shape[1])

Total Extracted Features: 81
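
The `_x` and `_y` suffixes in the `extracted_data` column names are the defaults pandas applies when two merged DataFrames share column names. As a minimal sketch, assuming the extracted data comes from a left merge of shot events onto their key passes (the merge itself is not shown in this section), descriptive suffixes can make the origin of each column easier to read. The toy frames and the `_shot`/`_pass` suffixes below are hypothetical.

```python
import pandas as pd

# Hypothetical toy shot events; the second shot has no key pass
shots_demo = pd.DataFrame({
    'id': ['s1', 's2'],
    'minute': [0, 5],
    'shot_key_pass_id': ['p1', None],
})

# Hypothetical toy pass events, keyed by the same column name as the shots
passes_demo = pd.DataFrame({
    'shot_key_pass_id': ['p1'],
    'minute': [0],
    'pass_length': [51.35],
})

# Left merge keeps every shot; overlapping column names get explicit suffixes
merged_demo = shots_demo.merge(passes_demo,
                               how = 'left',
                               on = 'shot_key_pass_id',
                               suffixes = ('_shot', '_pass'))

print(merged_demo.columns.tolist())
```

Shots without a key pass keep their row, with the pass columns left empty, which matches the pattern of blank `_y` values in the rows shown above.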


In [None]:
# Save extracted_data

extracted_data.to_parquet('/content/drive/MyDrive/expected_goals/data_extraction/dataframes/extracted_data.parquet')

In [None]:
print('extracted_data Filesize:',
      path('/content/drive/MyDrive/expected_goals/data_extraction/dataframes/extracted_data.parquet').stat().st_size,
      'bytes')

extracted_data Filesize: 2247593 bytes


Continued in [expected_goals_data_cleaning_notebook]()

*Notebook 2 of 7*