<a href="https://colab.research.google.com/github/wswager/expected_goals/blob/main/data_extraction/expected_goals_data_extraction_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Capstone Project Submission**

* Student Name: Wes Swager
* Student Pace: Full Time
* Instructor Name: Claude Fried
* Scheduled Project Review Date/Time
    * Unknown

# Data Extraction Notebook

<a id = 'proposal'></a>
### Proposal

**Problem Statement**

Create an expected goals (xG) metric from existing historical match data. The metric can then be used to analyze future matches and to give specific recommendations for subsequent training sessions that improve the likelihood of scoring.

**Supervised Learning Target**

A classification model that predicts the probability of a goal (expressed as a percentage), given features of the shot and of the play preceding it.

**Data Source**

[StatsBomb Open Data](https://github.com/statsbomb/open-data)

# Packages

In [None]:
# Drive and IO to access saved data
from google.colab import drive, files
drive.mount('/content/drive')

import io

# Pathlib for file retrieval
from pathlib import Path

# Pandas for DataFrames
import pandas as pd

# NumPy for mathematical functions
import numpy as np

# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Data

Data sourced from [StatsBomb Open Data](https://github.com/statsbomb/open-data)

# Extract Data from StatsBomb Open Data

## Matches Data

In [None]:
# Identify target league match data within StatsBomb Open Data
# 37 - FA Women's Super League
# 49 - NWSL

matches_path = '/content/drive/MyDrive/flatiron/expected_goals/statsbomb_open_data/matches/'

target_leagues = ['37/',
                  '49/']

# Collect every season file (*.json) in each target league folder
matches_path_list = []
for tl in target_leagues:
    matches_path_list.extend(Path(matches_path + tl).glob('*.json'))

In [None]:
matches_path_list

[PosixPath('/content/drive/MyDrive/flatiron/expected_goals/statsbomb_open_data/matches/37/4.json'),
 PosixPath('/content/drive/MyDrive/flatiron/expected_goals/statsbomb_open_data/matches/37/42.json'),
 PosixPath('/content/drive/MyDrive/flatiron/expected_goals/statsbomb_open_data/matches/37/90.json'),
 PosixPath('/content/drive/MyDrive/flatiron/expected_goals/statsbomb_open_data/matches/49/3.json')]
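
The file collection above can be sketched as a small reusable function. This is a minimal sketch, not the notebook's own code: the function name `collect_match_files` and the temporary directory tree are hypothetical stand-ins for the Drive folder, and the result is sorted because `glob` does not promise any particular order.

```python
import tempfile
from pathlib import Path


def collect_match_files(matches_root, competition_ids):
    """Return every season file under the given competition folders, sorted."""
    root = Path(matches_root)
    paths = []
    for competition_id in competition_ids:
        # The "/" operator joins path segments, so no trailing slashes are needed
        paths.extend((root / str(competition_id)).glob('*.json'))
    # glob() makes no ordering promise; sort for a reproducible list
    return sorted(paths)


# Throwaway directory tree standing in for the Drive folder (hypothetical layout)
with tempfile.TemporaryDirectory() as tmp:
    matches_root = Path(tmp) / 'matches'
    for rel in ('37/4.json', '37/42.json', '49/3.json', '43/3.json'):
        target = matches_root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text('[]')

    found = collect_match_files(matches_root, [37, 49])
    found_names = [p.relative_to(matches_root).as_posix() for p in found]
    print(found_names)
```

Only files under the requested competition folders are returned; the `43/` folder in the stand-in tree is ignored.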

In [None]:
# Create dataframe from target league match data

matches_list = [] 
for mpl in matches_path_list:
    matches_data = pd.read_json(mpl)
    matches_list.append(matches_data)

matches_df = pd.concat(matches_list,
                       ignore_index = True)

In [None]:
matches_df.head()

Unnamed: 0,match_id,match_date,kick_off,competition,season,home_team,away_team,home_score,away_score,match_status,match_status_360,last_updated,last_updated_360,metadata,match_week,competition_stage,stadium,referee
0,19743,2018-10-21,13:30:00.000,"{'competition_id': 37, 'country_name': 'Englan...","{'season_id': 4, 'season_name': '2018/2019'}","{'home_team_id': 969, 'home_team_name': 'Birmi...","{'away_team_id': 971, 'away_team_name': 'Chels...",0,0,available,processing,2021-04-28T07:08:31.946271,,{'data_version': '1.0.3'},6,"{'id': 1, 'name': 'Regular Season'}","{'id': 5332, 'name': 'SportNation.bet Stadium'...","{'id': 898, 'name': 'A. Fearn', 'country': {'i..."
1,19740,2018-10-21,16:00:00.000,"{'competition_id': 37, 'country_name': 'Englan...","{'season_id': 4, 'season_name': '2018/2019'}","{'home_team_id': 972, 'home_team_name': 'West ...","{'away_team_id': 966, 'away_team_name': 'Liver...",0,1,available,unscheduled,2020-07-29T05:00,,{'data_version': '1.0.3'},6,"{'id': 1, 'name': 'Regular Season'}","{'id': 4062, 'name': 'The Rush Green Stadium',...","{'id': 568, 'name': 'J. Packman', 'country': {..."
2,19716,2018-09-09,15:00:00.000,"{'competition_id': 37, 'country_name': 'Englan...","{'season_id': 4, 'season_name': '2018/2019'}","{'home_team_id': 974, 'home_team_name': 'Readi...","{'away_team_id': 970, 'away_team_name': 'Yeovi...",4,0,available,unscheduled,2020-07-29T05:00,,{'data_version': '1.0.3'},1,"{'id': 1, 'name': 'Regular Season'}","{'id': 577, 'name': 'Adams Park', 'country': {...","{'id': 567, 'name': 'H. Conley', 'country': {'..."
3,19800,2019-03-14,20:30:00.000,"{'competition_id': 37, 'country_name': 'Englan...","{'season_id': 4, 'season_name': '2018/2019'}","{'home_team_id': 968, 'home_team_name': 'Arsen...","{'away_team_id': 973, 'away_team_name': 'Brist...",4,0,available,unscheduled,2020-08-24T14:34:34.401523,,{'data_version': '1.1.0'},18,"{'id': 1, 'name': 'Regular Season'}","{'id': 456, 'name': 'Meadow Park', 'country': ...","{'id': 915, 'name': 'R. Whitton', 'country': {..."
4,19739,2018-10-21,15:00:00.000,"{'competition_id': 37, 'country_name': 'Englan...","{'season_id': 4, 'season_name': '2018/2019'}","{'home_team_id': 965, 'home_team_name': 'Brigh...","{'away_team_id': 746, 'away_team_name': 'Manch...",0,6,available,unscheduled,2020-07-29T05:00,,{'data_version': '1.0.3'},6,"{'id': 1, 'name': 'Regular Season'}",,


In [None]:
matches_df.to_csv('/content/drive/MyDrive/flatiron/expected_goals/data_extraction/matches_df.csv')
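
One point to keep in mind when the saved CSV is read back: nested columns such as `season` are written as their string form, so they return as strings rather than dictionaries. The sketch below illustrates this with the standard `csv` module on a hypothetical one-row stand-in for `matches_df`; it assumes the stored strings are valid Python literals, which `ast.literal_eval` can parse safely.

```python
import ast
import csv
import io

# Hypothetical one-row stand-in for matches_df, with one nested column
rows = [{'match_id': 19743,
         'season': {'season_id': 4, 'season_name': '2018/2019'}}]

# Write to an in-memory CSV; the nested dict is stored as its string form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['match_id', 'season'])
writer.writeheader()
writer.writerows(rows)

# Read it back: every value is now a plain string
buffer.seek(0)
restored = list(csv.DictReader(buffer))
season_text = restored[0]['season']

# ast.literal_eval parses the Python-literal string back into a dict
season = ast.literal_eval(season_text)
print(type(season_text).__name__, '->', type(season).__name__)
print(season['season_name'])
```

Empty cells (for example a missing `stadium`) are not valid literals, so a real loader would need to skip them before parsing.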

## Events Data

In [None]:
# Identify target league match events data within StatsBomb Open Data

# Convert each match_id to a string to build the events file names
matches_int = matches_df['match_id'].values

matches = []
for match_id in matches_int:
    matches.append(str(match_id))


events_path = '/content/drive/MyDrive/flatiron/expected_goals/statsbomb_open_data/events/'

events_matches_path = []
for m in matches:
    events_matches_path.append(events_path + m + '.json')

In [None]:
events_matches_path

['/content/drive/MyDrive/flatiron/expected_goals/statsbomb_open_data/events/19743.json',
 '/content/drive/MyDrive/flatiron/expected_goals/statsbomb_open_data/events/19740.json',
 '/content/drive/MyDrive/flatiron/expected_goals/statsbomb_open_data/events/19716.json',
 '/content/drive/MyDrive/flatiron/expected_goals/statsbomb_open_data/events/19800.json',
 '/content/drive/MyDrive/flatiron/expected_goals/statsbomb_open_data/events/19739.json',
 '/content/drive/MyDrive/flatiron/expected_goals/statsbomb_open_data/events/19734.json',
 '/content/drive/MyDrive/flatiron/expected_goals/statsbomb_open_data/events/19748.json',
 '/content/drive/MyDrive/flatiron/expected_goals/statsbomb_open_data/events/19822.json',
 '/content/drive/MyDrive/flatiron/expected_goals/statsbomb_open_data/events/19766.json',
 '/content/drive/MyDrive/flatiron/expected_goals/statsbomb_open_data/events/19785.json',
 '/content/drive/MyDrive/flatiron/expected_goals/statsbomb_open_data/events/19749.json',
 '/content/drive/MyDr
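
Because the event paths are built from match IDs rather than discovered on disk, a missing file would only surface later as a read error. The following is a minimal sketch of a check that separates existing files from missing ones; the function name `event_paths` and the temporary folder are hypothetical.

```python
import tempfile
from pathlib import Path


def event_paths(events_root, match_ids):
    """Split match ids into existing event files and ids with no file."""
    root = Path(events_root)
    present, missing = [], []
    for match_id in match_ids:
        path = root / f'{match_id}.json'
        if path.is_file():
            present.append(path)
        else:
            missing.append(match_id)
    return present, missing


# Throwaway folder standing in for the events directory (hypothetical contents)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / '19743.json').write_text('[]')
    (Path(tmp) / '19740.json').write_text('[]')

    present, missing = event_paths(tmp, [19743, 19740, 19716])
    present_names = [p.name for p in present]
    print(present_names, missing)
```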

In [None]:
# Create dataframe from target league match event data

events_list = [] 
for emp in events_matches_path:
    events_data = pd.read_json(emp)
    events_list.append(events_data)

events_df = pd.concat(events_list,
                      ignore_index = True)
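
The event columns shown in this notebook do not include a match identifier, so once the files are concatenated an event can no longer be traced to its match. A minimal sketch of one way to keep that link, assuming each file is named `<match_id>.json` as in the paths above; `load_events` and the sample file are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_events(path):
    """Load one events file and stamp each event with its match id."""
    path = Path(path)
    match_id = int(path.stem)  # file name is "<match_id>.json"
    with path.open(encoding='utf-8') as f:
        events = json.load(f)
    # Copy each event and add the id taken from the file name
    return [{**event, 'match_id': match_id} for event in events]


# Hypothetical two-event file
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / '19743.json'
    sample.write_text(json.dumps([
        {'index': 258, 'type': {'id': 16, 'name': 'Shot'}},
        {'index': 542, 'type': {'id': 16, 'name': 'Shot'}},
    ]), encoding='utf-8')

    tagged = load_events(sample)
    print([(e['index'], e['match_id']) for e in tagged])
```

In the pandas loop, the equivalent step would be assigning `events_data['match_id'] = int(Path(emp).stem)` before appending to `events_list`.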

In [None]:
# Keep only shot events (rows where the 'shot' column is populated)

events_shots_df = events_df[events_df['shot'].notna()]

In [None]:
events_shots_df.head()

Unnamed: 0,id,index,period,timestamp,minute,second,type,possession,possession_team,play_pattern,team,duration,tactics,related_events,player,position,location,pass,carry,under_pressure,ball_receipt,counterpress,duel,interception,dribble,shot,goalkeeper,off_camera,ball_recovery,50_50,foul_committed,substitution,foul_won,clearance,injury_stoppage,miscontrol,block,out,bad_behaviour,player_off,half_start,half_end
257,8f5a3b7c-db0b-42ec-bac0-adc0bedca2ea,258,1,2021-06-11 00:04:38.609,4,38,"{'id': 16, 'name': 'Shot'}",11,"{'id': 971, 'name': 'Chelsea FCW'}","{'id': 1, 'name': 'Regular Play'}","{'id': 971, 'name': 'Chelsea FCW'}",0.2788,,"[011167bc-9cbc-46a3-9b7b-28065eab7af1, 2c37831...","{'id': 4641, 'name': 'Francesca Kirby'}","{'id': 23, 'name': 'Center Forward'}","[109.0, 46.0]",,,1.0,,,,,,"{'statsbomb_xg': 0.26615402, 'end_location': [...",,,,,,,,,,,,,,,,
541,60ead7a6-4aa2-41ab-85a1-21357f50e4e0,542,1,2021-06-11 00:11:45.046,11,45,"{'id': 16, 'name': 'Shot'}",24,"{'id': 971, 'name': 'Chelsea FCW'}","{'id': 3, 'name': 'From Free Kick'}","{'id': 971, 'name': 'Chelsea FCW'}",0.25673,,"[a4b77cbb-14d0-4bd3-ba8b-7312335098fe, b9b246c...","{'id': 15550, 'name': 'Bethany England'}","{'id': 16, 'name': 'Left Midfield'}","[113.0, 35.0]",,,1.0,,,,,,"{'one_on_one': True, 'statsbomb_xg': 0.0935205...",,,,,,,,,,,,,,,,
613,f68deb6f-0711-4b9d-8081-122dc3722c55,614,1,2021-06-11 00:18:03.461,18,3,"{'id': 16, 'name': 'Shot'}",29,"{'id': 971, 'name': 'Chelsea FCW'}","{'id': 1, 'name': 'Regular Play'}","{'id': 971, 'name': 'Chelsea FCW'}",1.147883,,"[3c03553f-3bed-4d21-8096-ed4ef269da62, bb13e23...","{'id': 4638, 'name': 'Drew Spence'}","{'id': 11, 'name': 'Left Defensive Midfield'}","[94.0, 43.0]",,,1.0,,,,,,"{'statsbomb_xg': 0.036171142, 'end_location': ...",,,,,,,,,,,,,,,,
876,f301190f-cc0a-4f16-8278-27e5279ea24e,877,1,2021-06-11 00:23:11.935,23,11,"{'id': 16, 'name': 'Shot'}",43,"{'id': 969, 'name': 'Birmingham City WFC'}","{'id': 7, 'name': 'From Goal Kick'}","{'id': 969, 'name': 'Birmingham City WFC'}",2.161012,,"[0bfe1b6c-d690-41a6-be3e-f9b6295ddd85, 570e15b...","{'id': 10193, 'name': 'Chloe Arthur'}","{'id': 2, 'name': 'Right Back'}","[86.0, 34.0]",,,1.0,,,,,,"{'statsbomb_xg': 0.016625367000000002, 'end_lo...",,,,,,,,,,,,,,,,
891,8558535e-b1ee-4f53-b003-1b5fba2712bd,892,1,2021-06-11 00:23:45.810,23,45,"{'id': 16, 'name': 'Shot'}",44,"{'id': 971, 'name': 'Chelsea FCW'}","{'id': 7, 'name': 'From Goal Kick'}","{'id': 971, 'name': 'Chelsea FCW'}",1.225187,,[1455cb46-43a3-4e6f-b845-171abcd344bc],"{'id': 15550, 'name': 'Bethany England'}","{'id': 16, 'name': 'Left Midfield'}","[94.0, 33.0]",,,,,,,,,"{'statsbomb_xg': 0.030716168000000002, 'end_lo...",,,,,,,,,,,,,,,,
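
The supervised learning target described in the proposal can be derived from the nested `shot` column. The sketch below is illustrative only: it assumes the shot dictionary carries an `outcome` entry whose `name` is `'Goal'` for scored shots, and the three events are hypothetical, not rows from the data above.

```python
import math


def goal_label(event):
    """Return 1 for a goal, 0 for any other shot, None for a non-shot event."""
    shot = event.get('shot')
    # Non-shot rows hold NaN (a float) in pandas, or lack the key in raw JSON
    if not isinstance(shot, dict):
        return None
    outcome = shot.get('outcome', {}).get('name')
    return int(outcome == 'Goal')


# Hypothetical events: one non-shot, one goal, one saved shot
events = [
    {'type': {'name': 'Pass'}, 'shot': math.nan},
    {'type': {'id': 16, 'name': 'Shot'}, 'shot': {'outcome': {'name': 'Goal'}}},
    {'type': {'id': 16, 'name': 'Shot'}, 'shot': {'outcome': {'name': 'Saved'}}},
]

labels = [goal_label(event) for event in events]
shot_labels = [label for label in labels if label is not None]
print(labels, shot_labels)
```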


In [None]:
events_shots_df.to_csv('/content/drive/MyDrive/flatiron/expected_goals/data_extraction/events_shots_df.csv')

Continued in [expected_goals_data_organization_notebook](https://github.com/wswager/expected_goals/blob/main/data_organization/expected_goals_data_organization_notebook.ipynb)