# **Expected Goals Classifier**

### Overview

Create an Expected Goals (xG) classification model from historical match data, and use it to produce actionable recommendations for technical and tactical analysis aimed at improving goal scoring.

Project detailed on GitHub: [milwaukee_rampage_fc](https://github.com/wswager/milwaukee_rampage_fc)

# Data Extraction Notebook

*Notebook 1 of 8*

### Index

1. Data extracted in [expected_goals_data_extraction_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/data_extraction/expected_goals_data_extraction_notebook.ipynb)
2. Data organized in [expected_goals_data_organization_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/data_organization/expected_goals_data_organization_notebook.ipynb)
3. Features engineered in [expected_goals_feature_engineering_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/feature_engineering/expected_goals_feature_engineering_notebook.ipynb)
4. Data cleaned in [expected_goals_data_cleaning_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/data_cleaning/expected_goals_data_cleaning_notebook.ipynb)
5. Data explored in [expected_goals_data_exploration_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/data_exploration/expected_goals_data_exploration_notebook.ipynb)
6. Data preprocessed in [expected_goals_data_preprocessing_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/data_preprocessing/expected_goals_data_preprocessing_notebook.ipynb)
7. Modeling in [expected_goals_model_fitting_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/data_modeling/expected_goals_modeling_notebook.ipynb)
8. Conclusions in [expected_goals_model_assessment_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/conclusions/expected_conclusions_notebook.ipynb)

### Data

Data sourced from [StatsBomb](https://statsbomb.com/), a United Kingdom-based football (soccer) data analytics company.

StatsBomb has provided free access to their proprietary dataset via GitHub: [StatsBomb Open Data](https://github.com/statsbomb/open-data)

StatsBomb Open Data is organized in JSON files:
* **[Matches](https://github.com/statsbomb/open-data/tree/master/data/matches)**
  * Folders organized by competition (league or tournament)
    * Files organized by season (year) ID
    * Files contain nested dictionaries with descriptive data for each individual match
* **[Events](https://github.com/statsbomb/open-data/tree/master/data/events)**
  * Files organized by match ID
  * Files contain nested dictionaries with descriptive data for each event within each individual match
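
The layout above implies that a Matches file is addressed by competition ID and season ID, and an Events file by match ID. A minimal sketch of that mapping follows; the helper names `matches_file` and `events_file` and the root folder `open-data/data` are assumptions for illustration, not names used in this notebook.

```python
from pathlib import Path

def matches_file(root, competition_id, season_id):
    """Path to the Matches file for one competition season (hypothetical helper)."""
    return Path(root) / "matches" / str(competition_id) / f"{season_id}.json"

def events_file(root, match_id):
    """Path to the Events file for one match (hypothetical helper)."""
    return Path(root) / "events" / f"{match_id}.json"

# Assumed location of a local copy of the dataset, not a path from this notebook
root = "open-data/data"

# Competition 37, season 4 and match 19743 all appear in the data extracted below
print(matches_file(root, 37, 4))
print(events_file(root, 19743))
```

Keeping the IDs as arguments rather than concatenated strings makes it easier to change the target leagues later.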

# Packages

In [11]:
# Drive and IO to access saved files
from google.colab import drive, files
drive.mount('/content/drive')

import io

# Pathlib for file retrieval
import pathlib
from pathlib import Path
path = Path  # lowercase alias, used by the file-size cells further down

# Pandas for Dataframes
import pandas as pd

# Numpy for mathematical functions
import numpy as np

import warnings
warnings.filterwarnings('ignore')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Extract Data from StatsBomb Open Data

## Matches Data

In [2]:
# Define path for stored StatsBomb Open Data Matches

matches_path = '/content/drive/MyDrive/flatiron/expected_goals/statsbomb_open_data/matches'

In [18]:
# Identify target league match data within StatsBomb Open Data
# 37 - FA Women's Super League
# 49 - NWSL

target_leagues_list = ['37',
                       '49']

In [19]:
# Create list of Matches files for target leagues

matches_path_list = []
for tl in target_leagues_list:
    matches_path_list.extend(list((Path((matches_path + '/' + tl + '/'))).glob('*.json')))

In [23]:
print('Total Seasons from Target Leagues:', len(matches_path_list))

Total Seasons from Target Leagues: 4


In [4]:
# Create dataframe from data contained in target league Matches files

matches_list = [] 
for mp in matches_path_list:
    matches_data = pd.read_json(mp)
    matches_list.append(matches_data)

matches_df = pd.concat(matches_list,
                       ignore_index = True)

In [5]:
matches_df.head()

Unnamed: 0,match_id,match_date,kick_off,competition,season,home_team,away_team,home_score,away_score,match_status,match_status_360,last_updated,last_updated_360,metadata,match_week,competition_stage,stadium,referee
0,19743,2018-10-21,13:30:00.000,"{'competition_id': 37, 'country_name': 'Englan...","{'season_id': 4, 'season_name': '2018/2019'}","{'home_team_id': 969, 'home_team_name': 'Birmi...","{'away_team_id': 971, 'away_team_name': 'Chels...",0,0,available,processing,2021-04-28T07:08:31.946271,,{'data_version': '1.0.3'},6,"{'id': 1, 'name': 'Regular Season'}","{'id': 5332, 'name': 'SportNation.bet Stadium'...","{'id': 898, 'name': 'A. Fearn', 'country': {'i..."
1,19740,2018-10-21,16:00:00.000,"{'competition_id': 37, 'country_name': 'Englan...","{'season_id': 4, 'season_name': '2018/2019'}","{'home_team_id': 972, 'home_team_name': 'West ...","{'away_team_id': 966, 'away_team_name': 'Liver...",0,1,available,unscheduled,2020-07-29T05:00,,{'data_version': '1.0.3'},6,"{'id': 1, 'name': 'Regular Season'}","{'id': 4062, 'name': 'The Rush Green Stadium',...","{'id': 568, 'name': 'J. Packman', 'country': {..."
2,19716,2018-09-09,15:00:00.000,"{'competition_id': 37, 'country_name': 'Englan...","{'season_id': 4, 'season_name': '2018/2019'}","{'home_team_id': 974, 'home_team_name': 'Readi...","{'away_team_id': 970, 'away_team_name': 'Yeovi...",4,0,available,unscheduled,2020-07-29T05:00,,{'data_version': '1.0.3'},1,"{'id': 1, 'name': 'Regular Season'}","{'id': 577, 'name': 'Adams Park', 'country': {...","{'id': 567, 'name': 'H. Conley', 'country': {'..."
3,19800,2019-03-14,20:30:00.000,"{'competition_id': 37, 'country_name': 'Englan...","{'season_id': 4, 'season_name': '2018/2019'}","{'home_team_id': 968, 'home_team_name': 'Arsen...","{'away_team_id': 973, 'away_team_name': 'Brist...",4,0,available,unscheduled,2020-08-24T14:34:34.401523,,{'data_version': '1.1.0'},18,"{'id': 1, 'name': 'Regular Season'}","{'id': 456, 'name': 'Meadow Park', 'country': ...","{'id': 915, 'name': 'R. Whitton', 'country': {..."
4,19739,2018-10-21,15:00:00.000,"{'competition_id': 37, 'country_name': 'Englan...","{'season_id': 4, 'season_name': '2018/2019'}","{'home_team_id': 965, 'home_team_name': 'Brigh...","{'away_team_id': 746, 'away_team_name': 'Manch...",0,6,available,unscheduled,2020-07-29T05:00,,{'data_version': '1.0.3'},6,"{'id': 1, 'name': 'Regular Season'}",,
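
Several columns in the output above (`competition`, `season`, `competition_stage`, and others) still hold nested dictionaries. One possible way to expand them into flat columns is `pd.json_normalize`; the sketch below is only an illustration on two hand-written records, trimmed to fields that appear in full in the output above, and is not necessarily how the later notebooks organize the data.

```python
import pandas as pd

# Two records trimmed to fields shown in full in the matches_df output
records = [
    {"match_id": 19743,
     "season": {"season_id": 4, "season_name": "2018/2019"},
     "competition_stage": {"id": 1, "name": "Regular Season"}},
    {"match_id": 19740,
     "season": {"season_id": 4, "season_name": "2018/2019"},
     "competition_stage": {"id": 1, "name": "Regular Season"}},
]

# json_normalize expands each nested dictionary into columns named
# <outer key><sep><inner key>, e.g. season_season_name
flat_df = pd.json_normalize(records, sep="_")
print(flat_df.columns.tolist())
```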


In [20]:
print('Total Matches:', len(matches_df))

Total Matches: 231


In [35]:
# Save matches_df

matches_df.to_parquet('/content/drive/MyDrive/flatiron/expected_goals/data_extraction/dataframes/matches_df.parquet')

In [36]:
print('matches_df Filesize:',
      path('/content/drive/MyDrive/flatiron/expected_goals/data_extraction/dataframes/matches_df.parquet').stat().st_size,
      'bytes')

matches_df Filesize: 39147 bytes


## Events Data

In [14]:
# Define path for stored StatsBomb Open Data Events

events_path = '/content/drive/MyDrive/flatiron/expected_goals/statsbomb_open_data/events'

In [17]:
# Identify match IDs from matches_df for matches from target leagues
# Note: matches_list is reused here and now holds match IDs as strings

matches_list = [str(mi) for mi in matches_df['match_id'].values]

In [24]:
# Create list of Events files for matches from target leagues

events_path_list = []
for m in matches_list:
    events_path_list.append(events_path + '/' + m + '.json')
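
The Events paths above are built from match IDs without checking that each file is actually present, so a missing file would only surface as an error when it is read. A small guard along the following lines could report gaps first; `split_existing` is a hypothetical helper, and the temporary files stand in for the real dataset.

```python
import tempfile
from pathlib import Path

def split_existing(paths):
    """Separate paths into those that exist on disk and those that do not."""
    found, missing = [], []
    for p in map(Path, paths):
        (found if p.is_file() else missing).append(p)
    return found, missing

# Stand-in files so the sketch runs without the real dataset
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "19743.json"
    present.write_text("[]")
    absent = Path(tmp) / "19740.json"  # deliberately never created
    found, missing = split_existing([str(present), str(absent)])

print("found:", [p.name for p in found])
print("missing:", [p.name for p in missing])
```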

In [26]:
# Create dataframe from data contained in target league Events files

events_list = [] 
for ep in events_path_list:
    events_data = pd.read_json(ep)
    events_list.append(events_data)

events_df = pd.concat(events_list,
                      ignore_index = True)

In [27]:
events_df.head()

Unnamed: 0,id,index,period,timestamp,minute,second,type,possession,possession_team,play_pattern,team,duration,tactics,related_events,player,position,location,pass,carry,under_pressure,ball_receipt,counterpress,duel,interception,dribble,shot,goalkeeper,off_camera,ball_recovery,50_50,foul_committed,substitution,foul_won,clearance,injury_stoppage,miscontrol,block,out,bad_behaviour,player_off,half_start,half_end
0,c9425423-18d0-4c75-bdf1-cac6ecfef8cd,1,1,2021-08-25 00:00:00.000,0,0,"{'id': 35, 'name': 'Starting XI'}",1,"{'id': 969, 'name': 'Birmingham City WFC'}","{'id': 1, 'name': 'Regular Play'}","{'id': 969, 'name': 'Birmingham City WFC'}",0.0,"{'formation': 4231, 'lineup': [{'player': {'id...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,0a9c1eba-633a-42de-8386-64af450d5d44,2,1,2021-08-25 00:00:00.000,0,0,"{'id': 35, 'name': 'Starting XI'}",1,"{'id': 969, 'name': 'Birmingham City WFC'}","{'id': 1, 'name': 'Regular Play'}","{'id': 971, 'name': 'Chelsea FCW'}",0.0,"{'formation': 42211, 'lineup': [{'player': {'i...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,d3bda43e-4172-42e0-8a29-0629fab2a5ac,3,1,2021-08-25 00:00:00.000,0,0,"{'id': 18, 'name': 'Half Start'}",1,"{'id': 969, 'name': 'Birmingham City WFC'}","{'id': 1, 'name': 'Regular Play'}","{'id': 969, 'name': 'Birmingham City WFC'}",0.0,,[1f0b713f-be11-4c49-8a21-d42f1f66ef87],,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,1f0b713f-be11-4c49-8a21-d42f1f66ef87,4,1,2021-08-25 00:00:00.000,0,0,"{'id': 18, 'name': 'Half Start'}",1,"{'id': 969, 'name': 'Birmingham City WFC'}","{'id': 1, 'name': 'Regular Play'}","{'id': 971, 'name': 'Chelsea FCW'}",0.0,,[d3bda43e-4172-42e0-8a29-0629fab2a5ac],,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,667dda2e-b35d-4d46-ad09-40b3f491f160,5,1,2021-08-25 00:00:01.324,0,1,"{'id': 30, 'name': 'Pass'}",2,"{'id': 971, 'name': 'Chelsea FCW'}","{'id': 9, 'name': 'From Kick Off'}","{'id': 971, 'name': 'Chelsea FCW'}",1.228695,,[8dc92bd7-d6a0-4d60-b24e-b0352d135b62],"{'id': 4641, 'name': 'Francesca Kirby'}","{'id': 23, 'name': 'Center Forward'}","[61.0, 41.0]","{'recipient': {'id': 15549, 'name': 'Sophie In...",,,,,,,,,,,,,,,,,,,,,,,,
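
The columns listed above do not include a match ID, so once the files are concatenated an event can no longer be traced back to its match from `events_df` alone. If that link is needed downstream, one option is to tag each file's rows while loading. The sketch below assumes this is wanted; `load_events` is a hypothetical helper and the tiny JSON files are stand-ins for the real Events files.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

def load_events(events_dir, match_ids):
    """Read one Events file per match ID and tag its rows with that ID (hypothetical helper)."""
    frames = []
    for match_id in match_ids:
        frame = pd.read_json(Path(events_dir) / f"{match_id}.json")
        frame["match_id"] = match_id  # taken from the file name
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

# Stand-in files: two matches with 2 and 3 events respectively
with tempfile.TemporaryDirectory() as tmp:
    for match_id, n_events in [(19743, 2), (19740, 3)]:
        events = [{"index": i + 1, "period": 1} for i in range(n_events)]
        (Path(tmp) / f"{match_id}.json").write_text(json.dumps(events))
    demo_df = load_events(tmp, [19743, 19740])

print(demo_df.groupby("match_id").size())
```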


In [37]:
# Save events_df

events_df.to_parquet('/content/drive/MyDrive/flatiron/expected_goals/data_extraction/dataframes/events_df.parquet')

In [38]:
print('events_df Filesize:',
      path('/content/drive/MyDrive/flatiron/expected_goals/data_extraction/dataframes/events_df.parquet').stat().st_size,
      'bytes')

events_df Filesize: 80222381 bytes
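
Raw byte counts like the ones printed above are hard to compare at a glance. As a convenience, they could be formatted in binary units; `human_size` below is a hypothetical helper, not part of the notebook.

```python
def human_size(n_bytes):
    """Format a byte count using binary units (hypothetical helper)."""
    size = float(n_bytes)
    for unit in ("bytes", "KiB", "MiB", "GiB"):
        # Stop at the first unit that keeps the number below 1024,
        # or at GiB, the largest unit handled here
        if size < 1024 or unit == "GiB":
            return f"{size:.1f} {unit}"
        size /= 1024

print(human_size(39147))     # matches_df.parquet -> 38.2 KiB
print(human_size(80222381))  # events_df.parquet  -> 76.5 MiB
```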


Continued in [expected_goals_data_organization_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/data_organization/expected_goals_data_organization_notebook.ipynb)

*2 of 8*