# **Expected Goals Classifier**

### Overview

Build an Expected Goals (xG) classification model from historical match data to produce actionable recommendations for technical and tactical analysis aimed at improving goal-scoring.

Project detailed on GitHub: [milwaukee_rampage_fc](https://github.com/wswager/milwaukee_rampage_fc)

# Data Extraction Notebook

*Notebook 1 of 8*

### Index

1. Data extracted in [expected_goals_data_extraction_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/data_extraction/expected_goals_data_extraction_notebook.ipynb)
2. Data organized in [expected_goals_data_organization_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/data_organization/expected_goals_data_organization_notebook.ipynb)
3. Features engineered in [expected_goals_feature_engineering_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/feature_engineering/expected_goals_feature_engineering_notebook.ipynb)
4. Data cleaned in [expected_goals_data_cleaning_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/data_cleaning/expected_goals_data_cleaning_notebook.ipynb)
5. Data explored in [expected_goals_data_exploration_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/data_exploration/expected_goals_data_exploration_notebook.ipynb)
6. Data preprocessed in [expected_goals_data_preprocessing_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/data_preprocessing/expected_goals_data_preprocessing_notebook.ipynb)
7. Modeling in [expected_goals_model_fitting_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/data_modeling/expected_goals_modeling_notebook.ipynb)
8. Conclusions in [expected_goals_model_assessment_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/conclusions/expected_conclusions_notebook.ipynb)

### Data

Data is sourced from [StatsBomb](https://statsbomb.com/), a United Kingdom-based football (soccer) data analytics company.

StatsBomb has provided free access to their proprietary dataset via GitHub: [StatsBomb Open Data](https://github.com/statsbomb/open-data)

StatsBomb Open Data is organized in JSON files:
* **[Matches](https://github.com/statsbomb/open-data/tree/master/data/matches)**
  * Folders organized by competition (league or tournament)
    * Files organized by season (year) ID
    * Files contain nested dictionaries with descriptive data for each individual match
* **[Events](https://github.com/statsbomb/open-data/tree/master/data/events)**
  * Files organized by match ID
  * Files contain nested dictionaries with descriptive data for each event within each individual match
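The folder layout described above can be sketched as a pair of path helpers. This is a minimal illustration, assuming a local clone of the open-data repository; the function names and the `root` argument are hypothetical and are not part of statsbombpy, which the rest of this notebook uses instead.

```python
import json
from pathlib import Path


def matches_path(root, competition_id, season_id):
    """Path to the matches file for one competition season (assumed local clone)."""
    return Path(root) / "matches" / str(competition_id) / f"{season_id}.json"


def events_path(root, match_id):
    """Path to the events file for one match (assumed local clone)."""
    return Path(root) / "events" / f"{match_id}.json"


def load_json(path):
    """Parse one open-data file into nested Python lists and dictionaries."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

With `root` pointing at the `data` directory of a clone, `load_json(matches_path(root, 37, 42))` would return the list of match dictionaries for that competition season.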

# Packages

In [1]:
# Drive and IO to access saved files
from google.colab import drive, files
drive.mount('/content/drive')

import io

# Pathlib for file retrieval
import pathlib
from pathlib import Path as path

# Statsbombpy package for extracting StatsBomb data
!pip install statsbombpy
from statsbombpy import sb

# Pandas for Dataframes
import pandas as pd

# Numpy for mathematical functions
import numpy as np

import warnings
warnings.filterwarnings('ignore')

Mounted at /content/drive

*See [statsbombpy](https://github.com/statsbomb/statsbombpy) package*

# Extract Data from StatsBomb Open Data

## Matches Data

In [4]:
# View competitions available through StatsBomb Open Data

sb.competitions().head()

credentials were not supplied. open data access only


| | competition_id | season_id | country_name | competition_name | competition_gender | season_name | match_updated | match_available |
|---|---|---|---|---|---|---|---|---|
| 0 | 16 | 4 | Europe | Champions League | male | 2018/2019 | 2021-05-19T08:38:06.515138 | 2021-05-19T08:38:06.515138 |
| 1 | 16 | 1 | Europe | Champions League | male | 2017/2018 | 2021-01-23T21:55:30.425330 | 2021-01-23T21:55:30.425330 |
| 2 | 16 | 2 | Europe | Champions League | male | 2016/2017 | 2020-08-26T12:33:15.869622 | 2020-07-29T05:00 |
| 3 | 16 | 27 | Europe | Champions League | male | 2015/2016 | 2020-08-26T12:33:15.869622 | 2020-07-29T05:00 |
| 4 | 16 | 26 | Europe | Champions League | male | 2014/2015 | 2020-08-26T12:33:15.869622 | 2020-07-29T05:00 |


In [5]:
print('Available Competitions:',
      sb.competitions()['competition_name'].unique())

credentials were not supplied. open data access only
Available Competitions: ['Champions League' "FA Women's Super League" 'FIFA World Cup' 'La Liga'
 'NWSL' 'Premier League' "Women's World Cup"]


In [33]:
# Isolate target competitions from StatsBomb Open Data
# Women's competitions

target_comp_df = sb.competitions().loc[sb.competitions()['competition_gender'] == 'female']

target_comp_ids = target_comp_df['competition_id'].unique()

target_season_ids = target_comp_df['season_id'].unique()

credentials were not supplied. open data access only
credentials were not supplied. open data access only


In [35]:
print('Target Competitions:',
      target_comp_df['competition_name'].unique(),
      '\n',
      'Target competition_ids:',
      target_comp_ids,
      '\n',
      'Target season_ids:',
      target_season_ids)

Target Competitions: ["FA Women's Super League" 'NWSL' "Women's World Cup"] 
 Target competition_ids: [37 49 72] 
 Target season_ids: [42  4  3 30]


In [36]:
# Refine target competitions
# Women's club competitions

target_comp_df = sb.competitions().loc[sb.competitions()['competition_id'].isin([37, 49])]

target_comp_ids = target_comp_df['competition_id'].unique()

target_season_ids = target_comp_df['season_id'].unique()

credentials were not supplied. open data access only
credentials were not supplied. open data access only


In [37]:
print('Target Competitions:',
      target_comp_df['competition_name'].unique(),
      '\n',
      'Target competition_ids:',
      target_comp_ids,
      '\n',
      'Target season_ids:',
      target_season_ids)

Target Competitions: ["FA Women's Super League" 'NWSL'] 
 Target competition_ids: [37 49] 
 Target season_ids: [42  4  3]


In [40]:
print('Number of Seasons:',
      len(target_season_ids))

Number of Seasons: 3
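Taking the unique `competition_id` and `season_id` values separately, as above, drops the link between a competition and its seasons. One way to keep that link is to collect (competition_id, season_id) pairs. The sketch below is plain Python over row dictionaries; the helper name is hypothetical, and the rows are illustrative stand-ins limited to the ID pairs that appear in this notebook (rows of this shape could presumably be obtained with `sb.competitions().to_dict('records')`).

```python
def competition_season_pairs(records, competition_ids):
    """Return sorted unique (competition_id, season_id) pairs for the given competitions."""
    wanted = set(competition_ids)
    pairs = {
        (row["competition_id"], row["season_id"])
        for row in records
        if row["competition_id"] in wanted
    }
    return sorted(pairs)


# Illustrative rows only: the ID pairs used in this notebook
records = [
    {"competition_id": 37, "season_id": 42},
    {"competition_id": 37, "season_id": 4},
    {"competition_id": 49, "season_id": 3},
    {"competition_id": 72, "season_id": 30},
]

pairs = competition_season_pairs(records, [37, 49])
print(pairs)  # [(37, 4), (37, 42), (49, 3)]
```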


In [51]:
# Create dataframes for the matches in each season of the target competitions

matches_df_37_42 = sb.matches(competition_id = 37,
                              season_id = 42)

matches_df_37_4 = sb.matches(competition_id = 37,
                             season_id = 4)

matches_df_49_3 = sb.matches(competition_id = 49,
                             season_id = 3)

credentials were not supplied. open data access only
credentials were not supplied. open data access only
credentials were not supplied. open data access only


In [53]:
# Combine dataframes for the matches in each season of the target leagues

matches_df = pd.concat([matches_df_37_42,
                        matches_df_37_4,
                        matches_df_49_3],
                       ignore_index = True)
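The per-season calls and the concatenation above could also be written as one loop. The following is a sketch under stated assumptions: `collect_matches` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `sb.matches` so the example runs without network access.

```python
import pandas as pd


def collect_matches(pairs, fetch):
    """Fetch one DataFrame per (competition_id, season_id) pair and stack them."""
    frames = [fetch(competition_id=c, season_id=s) for c, s in pairs]
    return pd.concat(frames, ignore_index=True)


# Stand-in for sb.matches (hypothetical): returns a one-row frame per season
def fake_fetch(competition_id, season_id):
    return pd.DataFrame({"competition_id": [competition_id],
                         "season_id": [season_id]})


demo = collect_matches([(37, 42), (37, 4), (49, 3)], fake_fetch)
print(len(demo))  # 3
```

In the notebook itself, passing `sb.matches` as `fetch` should give the same combined dataframe as the explicit version, since both use the `competition_id` and `season_id` keyword arguments.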

In [66]:
matches_df.head()

| | match_id | match_date | kick_off | competition | season | home_team | away_team | home_score | away_score | match_status | match_status_360 | last_updated | last_updated_360 | match_week | competition_stage | stadium | referee | data_version | shot_fidelity_version | xy_fidelity_version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2275054 | 2020-01-05 | 15:00:00.000 | England - FA Women's Super League | 2019/2020 | Brighton & Hove Albion WFC | Liverpool WFC | 1 | 0 | available | unscheduled | 2020-07-29T05:00 | | 11 | Regular Season | The People's Pension Stadium | A. Fearn | 1.1.0 | 2 | 2 |
| 1 | 2275072 | 2020-01-05 | 13:30:00.000 | England - FA Women's Super League | 2019/2020 | Chelsea FCW | Reading WFC | 3 | 1 | available | unscheduled | 2020-07-29T05:00 | | 11 | Regular Season | The Cherry Red Records Stadium | S. Pearson | 1.1.0 | 2 | 2 |
| 2 | 2275085 | 2020-01-05 | 15:00:00.000 | England - FA Women's Super League | 2019/2020 | Tottenham Hotspur Women | Manchester City WFC | 1 | 4 | available | unscheduled | 2020-07-29T05:00 | | 11 | Regular Season | The Hive Stadium | H. Conley | 1.1.0 | 2 | 2 |
| 3 | 2275113 | 2020-01-19 | 16:00:00.000 | England - FA Women's Super League | 2019/2020 | West Ham United LFC | Brighton & Hove Albion WFC | 2 | 1 | available | unscheduled | 2020-07-29T05:00 | | 13 | Regular Season | The Rush Green Stadium | Ryan Atkin | 1.1.0 | 2 | 2 |
| 4 | 2275142 | 2020-01-05 | 13:00:00.000 | England - FA Women's Super League | 2019/2020 | Manchester United | Bristol City WFC | 0 | 1 | available | unscheduled | 2020-10-20T18:35:32.568528 | | 11 | Regular Season | Leigh Sports Village Stadium | L. Oliver | 1.1.0 | 2 | 2 |


In [56]:
print('Total Matches:', len(matches_df))

Total Matches: 230


In [57]:
# Save matches_df

matches_df.to_parquet('/content/drive/MyDrive/flatiron/expected_goals/data_extraction/dataframes/matches_df.parquet')

In [58]:
print('matches_df Filesize:',
      path('/content/drive/MyDrive/flatiron/expected_goals/data_extraction/dataframes/matches_df.parquet').stat().st_size,
      'bytes')

matches_df Filesize: 20604 bytes
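Raw byte counts are hard to compare at a glance. A small formatter, sketched below, can render them in binary units; `format_bytes` is a hypothetical helper and not part of pathlib or pandas.

```python
def format_bytes(size):
    """Render a byte count with binary units (1 KiB = 1024 bytes)."""
    if size < 1024:
        return f"{size} bytes"
    value = float(size)
    for unit in ("KiB", "MiB", "GiB"):
        value /= 1024
        # Stop at the first unit that keeps the value below 1024 (GiB is the cap)
        if value < 1024 or unit == "GiB":
            return f"{value:.1f} {unit}"


print(format_bytes(20604))  # 20.1 KiB
```

It would be called on the same `path(...).stat().st_size` value that the cell above prints.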


## Events Data

In [62]:
# Create dataframes for the target events in each season of the target competitions
# Shots
# Note: the season_id suffixes on the two FA Women's Super League variables are
# swapped relative to the matches dataframes (season_id 42 is 2019/2020 and
# season_id 4 is 2018/2019); the combined events_df contains the same rows either way

events_df_37_42 = sb.competition_events(country = 'England',
                                        division = "FA Women's Super League",
                                        season = '2018/2019',
                                        gender = 'female',
                                        split = True)['shots']

events_df_37_4 = sb.competition_events(country = 'England',
                                       division = "FA Women's Super League",
                                       season = '2019/2020',
                                       gender = 'female',
                                       split = True)['shots']

events_df_49_3 = sb.competition_events(country = 'United States of America',
                                       division = 'NWSL',
                                       season = '2018',
                                       gender = 'female',
                                       split = True)['shots']

credentials were not supplied. open data access only

In [67]:
# Combine dataframes for the target events in each season of the target leagues

events_df = pd.concat([events_df_37_42,
                       events_df_37_4,
                       events_df_49_3],
                      ignore_index = True)

In [68]:
events_df.head()

Unnamed: 0,id,index,period,timestamp,minute,second,type,possession,possession_team,play_pattern,team,player,position,location,duration,under_pressure,related_events,match_id,shot_statsbomb_xg,shot_end_location,shot_key_pass_id,shot_technique,shot_outcome,shot_type,shot_body_part,shot_freeze_frame,shot_one_on_one,shot_aerial_won,shot_open_goal,shot_first_time,out,shot_redirect,shot_deflected,off_camera,shot_saved_off_target,shot_saved_to_post,shot_follows_dribble
0,8f5a3b7c-db0b-42ec-bac0-adc0bedca2ea,258,1,00:04:38.609,4,38,Shot,11,Chelsea FCW,Regular Play,Chelsea FCW,Francesca Kirby,Center Forward,"[109.0, 46.0]",0.2788,True,"[011167bc-9cbc-46a3-9b7b-28065eab7af1, 2c37831...",19743,0.266154,"[112.0, 45.0]",bf82ea91-c3e3-4d8c-b91d-c9d0ccd44f11,Normal,Blocked,Open Play,Left Foot,"[{'location': [104.0, 50.0], 'player': {'id': ...",,,,,,,,,,,
1,60ead7a6-4aa2-41ab-85a1-21357f50e4e0,542,1,00:11:45.046,11,45,Shot,24,Chelsea FCW,From Free Kick,Chelsea FCW,Bethany England,Left Midfield,"[113.0, 35.0]",0.25673,True,"[a4b77cbb-14d0-4bd3-ba8b-7312335098fe, b9b246c...",19743,0.093521,"[120.0, 32.9, 0.4]",b99082e1-812b-48dd-bf94-8856b1ff079b,Normal,Off T,Open Play,Head,"[{'location': [108.0, 45.0], 'player': {'id': ...",True,True,,,,,,,,,
2,f68deb6f-0711-4b9d-8081-122dc3722c55,614,1,00:18:03.461,18,3,Shot,29,Chelsea FCW,Regular Play,Chelsea FCW,Drew Spence,Left Defensive Midfield,"[94.0, 43.0]",1.147883,True,"[3c03553f-3bed-4d21-8096-ed4ef269da62, bb13e23...",19743,0.036171,"[120.0, 42.8, 0.5]",5022d0b3-ea32-42a8-bd41-b46cc244beb9,Normal,Saved,Open Play,Left Foot,"[{'location': [118.0, 41.0], 'player': {'id': ...",,,,,,,,,,,
3,f301190f-cc0a-4f16-8278-27e5279ea24e,877,1,00:23:11.935,23,11,Shot,43,Birmingham City WFC,From Goal Kick,Birmingham City WFC,Chloe Arthur,Right Back,"[86.0, 34.0]",2.161012,True,"[0bfe1b6c-d690-41a6-be3e-f9b6295ddd85, 570e15b...",19743,0.016625,"[119.0, 33.3, 0.5]",fdf4a564-4973-46e5-bc07-d84785f8c183,Normal,Off T,Open Play,Left Foot,"[{'location': [78.0, 58.0], 'player': {'id': 1...",,,,,,,,,,,
4,8558535e-b1ee-4f53-b003-1b5fba2712bd,892,1,00:23:45.810,23,45,Shot,44,Chelsea FCW,From Goal Kick,Chelsea FCW,Bethany England,Left Midfield,"[94.0, 33.0]",1.225187,,[1455cb46-43a3-4e6f-b845-171abcd344bc],19743,0.030716,"[120.0, 34.8, 0.5]",37712221-3b0b-4090-a30c-08a3ee6492be,Normal,Off T,Open Play,Right Foot,"[{'location': [117.0, 40.0], 'player': {'id': ...",,,,,,,,,,,


In [69]:
print('Total Events:',
      len(events_df))

Total Events: 6080


In [72]:
print('Total Features:',
      events_df.shape[1])

Total Features: 37


In [70]:
# Save events_df

events_df.to_parquet('/content/drive/MyDrive/flatiron/expected_goals/data_extraction/dataframes/events_df.parquet')

In [71]:
print('events_df Filesize:',
      path('/content/drive/MyDrive/flatiron/expected_goals/data_extraction/dataframes/events_df.parquet').stat().st_size,
      'bytes')

events_df Filesize: 1588328 bytes


Continued in [expected_goals_data_organization_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/data_organization/expected_goals_data_organization_notebook.ipynb)

*Notebook 2 of 8*