### Capstone Project Submission

* Student Name: Wes Swager
* Student Pace: Full Time
* Instructor Name: Claude Fried
* Scheduled Project Review Date/Time
    * Unknown

# Data Extraction Notebook

<a id = 'proposal'></a>
### Proposal

**Problem Statement**

Create an expected goals (xG) metric from existing historical match data. The metric can then be used to analyze future match data and provide specific recommendations for training to improve the likelihood of scoring.

**Supervised Learning Target**

A classification model that predicts the probability of a goal (as a percentage) given features specific to the shot and the play preceding it.

**Data Source**

[StatsBomb Open Data](https://github.com/statsbomb/open-data)

# Contents

* **[Proposal](#proposal)**
* **[Packages](#packages)**
* **[Data](#data)**
* **[Extract Data from StatsBomb Open Data](#extract_data)**
    * **[Matches Data](#matches_data)**
    * **[Events Data](#events_data)**
* **[Extract Shot-Specific Data](#shot_data)**
* **[Extract Features from Nested Dictionaries](#extract_features)**
    * **[Shot-Specific Features](#shot_features)**
    * **[Assist-Specific Features](#assist_features)**
* **[Extracted Data](#extracted_data)**

<a id = 'packages'></a>
# Packages

In [1]:
# Glob for file retrieval
import glob

# Pandas for Dataframes
import pandas as pd

# NumPy for mathematical functions
import numpy as np

import warnings
warnings.filterwarnings('ignore')

<a id = 'data'></a>
# Data

Data sourced from [StatsBomb Open Data](https://github.com/statsbomb/open-data)

<a id = 'extract_data'></a>
# Extract Data from StatsBomb Open Data

<a id = 'matches_data'></a>
## Matches Data

In [2]:
# Identify target league match data within StatsBomb Open Data
# 37 - FA Women's Super League
# 49 - NWSL

matches_path = 'C:/Users/westi/Documents/github/expected_goals/data/statsbomb_open_data/matches/'

target_leagues = ['37',
                  '49']

matches_path_list = []
for tl in target_leagues:
    matches_path_list.extend(glob.glob(matches_path + tl + '/*.json'))

In [3]:
matches_path_list

['C:/Users/westi/Documents/github/expected_goals/data/statsbomb_open_data/matches/37\\4.json',
 'C:/Users/westi/Documents/github/expected_goals/data/statsbomb_open_data/matches/37\\42.json',
 'C:/Users/westi/Documents/github/expected_goals/data/statsbomb_open_data/matches/37\\90.json',
 'C:/Users/westi/Documents/github/expected_goals/data/statsbomb_open_data/matches/49\\3.json']
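
The listing above mixes `/` and `\\` separators because the paths are built by string concatenation on Windows. A minimal sketch of the same lookup with `pathlib`, which keeps the paths portable, is shown here. The helper name `list_match_files` and the temporary directory standing in for the StatsBomb checkout are assumptions for illustration, not part of the original notebook.

```python
import tempfile
from pathlib import Path


def list_match_files(matches_dir, league_ids):
    """Return the JSON files found under matches_dir/<league_id>/ for each league."""
    matches_dir = Path(matches_dir)
    paths = []
    for league_id in league_ids:
        # Path.glob yields Path objects; sort by file name for a stable order
        league_files = (matches_dir / str(league_id)).glob('*.json')
        paths.extend(sorted(league_files, key=lambda p: p.name))
    return paths


# Toy directory tree standing in for .../statsbomb_open_data/matches/
with tempfile.TemporaryDirectory() as tmp:
    for league, season in [('37', '4'), ('37', '42'), ('49', '3')]:
        folder = Path(tmp) / league
        folder.mkdir(exist_ok=True)
        (folder / f'{season}.json').write_text('[]')

    found = list_match_files(tmp, ['37', '49'])
    # as_posix() renders forward slashes on every platform
    relative = [p.relative_to(tmp).as_posix() for p in found]

print(relative)
```

`pd.read_json` accepts the resulting `Path` objects directly, so the rest of the loading loop would not need to change.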

In [4]:
# Create dataframe from target league match data

matches_list = [] 
for mpl in matches_path_list:
    matches_data = pd.read_json(mpl)
    matches_list.append(matches_data)

matches_df = pd.concat(matches_list,
                       ignore_index = True)

In [5]:
matches_df.head()

|  | match_id | match_date | kick_off | competition | season | home_team | away_team | home_score | away_score | match_status | match_status_360 | last_updated | last_updated_360 | metadata | match_week | competition_stage | stadium | referee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19743 | 2018-10-21 | 13:30:00.000 | {'competition_id': 37, 'country_name': 'Englan... | {'season_id': 4, 'season_name': '2018/2019'} | {'home_team_id': 969, 'home_team_name': 'Birmi... | {'away_team_id': 971, 'away_team_name': 'Chels... | 0 | 0 | available | processing | 2021-04-28T07:08:31.946271 |  | {'data_version': '1.0.3'} | 6 | {'id': 1, 'name': 'Regular Season'} | {'id': 5332, 'name': 'SportNation.bet Stadium'... | {'id': 898, 'name': 'A. Fearn', 'country': {'i... |
| 1 | 19740 | 2018-10-21 | 16:00:00.000 | {'competition_id': 37, 'country_name': 'Englan... | {'season_id': 4, 'season_name': '2018/2019'} | {'home_team_id': 972, 'home_team_name': 'West ... | {'away_team_id': 966, 'away_team_name': 'Liver... | 0 | 1 | available | unscheduled | 2020-07-29T05:00 |  | {'data_version': '1.0.3'} | 6 | {'id': 1, 'name': 'Regular Season'} | {'id': 4062, 'name': 'The Rush Green Stadium',... | {'id': 568, 'name': 'J. Packman', 'country': {... |
| 2 | 19716 | 2018-09-09 | 15:00:00.000 | {'competition_id': 37, 'country_name': 'Englan... | {'season_id': 4, 'season_name': '2018/2019'} | {'home_team_id': 974, 'home_team_name': 'Readi... | {'away_team_id': 970, 'away_team_name': 'Yeovi... | 4 | 0 | available | unscheduled | 2020-07-29T05:00 |  | {'data_version': '1.0.3'} | 1 | {'id': 1, 'name': 'Regular Season'} | {'id': 577, 'name': 'Adams Park', 'country': {... | {'id': 567, 'name': 'H. Conley', 'country': {'... |
| 3 | 19800 | 2019-03-14 | 20:30:00.000 | {'competition_id': 37, 'country_name': 'Englan... | {'season_id': 4, 'season_name': '2018/2019'} | {'home_team_id': 968, 'home_team_name': 'Arsen... | {'away_team_id': 973, 'away_team_name': 'Brist... | 4 | 0 | available | unscheduled | 2020-08-24T14:34:34.401523 |  | {'data_version': '1.1.0'} | 18 | {'id': 1, 'name': 'Regular Season'} | {'id': 456, 'name': 'Meadow Park', 'country': ... | {'id': 915, 'name': 'R. Whitton', 'country': {... |
| 4 | 19739 | 2018-10-21 | 15:00:00.000 | {'competition_id': 37, 'country_name': 'Englan... | {'season_id': 4, 'season_name': '2018/2019'} | {'home_team_id': 965, 'home_team_name': 'Brigh... | {'away_team_id': 746, 'away_team_name': 'Manch... | 0 | 6 | available | unscheduled | 2020-07-29T05:00 |  | {'data_version': '1.0.3'} | 6 | {'id': 1, 'name': 'Regular Season'} |  |  |
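
Several of the columns above (`season`, `competition_stage`, and others) hold nested dictionaries. As a sketch of one way to flatten them, `pd.json_normalize` can expand a column of dictionaries into ordinary columns. The two-row `matches_sample` frame below is a hand-built stand-in using values visible in the output above, and the `competition_stage_` prefix is an arbitrary naming choice.

```python
import pandas as pd

# Hand-built stand-in for two rows of matches_df (values copied from the output above)
matches_sample = pd.DataFrame([
    {'match_id': 19743,
     'season': {'season_id': 4, 'season_name': '2018/2019'},
     'competition_stage': {'id': 1, 'name': 'Regular Season'},
     'home_score': 0,
     'away_score': 0},
    {'match_id': 19740,
     'season': {'season_id': 4, 'season_name': '2018/2019'},
     'competition_stage': {'id': 1, 'name': 'Regular Season'},
     'home_score': 0,
     'away_score': 1},
])

# Expand each dictionary column into its own small dataframe
season_flat = pd.json_normalize(matches_sample['season'].tolist())
stage_flat = (pd.json_normalize(matches_sample['competition_stage'].tolist())
              .add_prefix('competition_stage_'))

# Replace the nested columns with their flattened versions
flat_matches = pd.concat(
    [matches_sample.drop(columns=['season', 'competition_stage']),
     season_flat,
     stage_flat],
    axis=1)

print(flat_matches)
```

Columns with missing entries, such as `stadium` and `referee` in row 4 above, would likely need their missing values handled before being passed through the same step.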


In [6]:
matches_df.to_csv(r'C:\Users\westi\Documents\github\expected_goals\data\saved_dataframes\data_extraction\matches_df.csv')
%store matches_df

Stored 'matches_df' (DataFrame)


<a id = 'events_data'></a>
## Events Data

In [7]:
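# Identify target league match events data within StatsBomb Open Data

matches_int = matches_df['match_id'].values

# Convert match IDs to strings for building file paths
# (the loop variable is not named `int`, which would shadow the built-in)
matches = []
for match_id in matches_int:
    matches.append(str(match_id))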

events_path = 'C:/Users/westi/Documents/github/expected_goals/data/statsbomb_open_data/events/'

events_matches_path = []
for m in matches:
    events_matches_path.append(events_path + m + '.json')

In [8]:
events_matches_path

['C:/Users/westi/Documents/github/expected_goals/data/statsbomb_open_data/events/19743.json',
 'C:/Users/westi/Documents/github/expected_goals/data/statsbomb_open_data/events/19740.json',
 'C:/Users/westi/Documents/github/expected_goals/data/statsbomb_open_data/events/19716.json',
 'C:/Users/westi/Documents/github/expected_goals/data/statsbomb_open_data/events/19800.json',
 'C:/Users/westi/Documents/github/expected_goals/data/statsbomb_open_data/events/19739.json',
 'C:/Users/westi/Documents/github/expected_goals/data/statsbomb_open_data/events/19734.json',
 'C:/Users/westi/Documents/github/expected_goals/data/statsbomb_open_data/events/19748.json',
 'C:/Users/westi/Documents/github/expected_goals/data/statsbomb_open_data/events/19822.json',
 'C:/Users/westi/Documents/github/expected_goals/data/statsbomb_open_data/events/19766.json',
 'C:/Users/westi/Documents/github/expected_goals/data/statsbomb_open_data/events/19785.json',
 'C:/Users/westi/Documents/github/expected_goals/data/statsb

In [9]:
# Create dataframe from target league match event data

events_list = [] 
for file in events_matches_path:
    data = pd.read_json(file)
    events_list.append(data)

events_df = pd.concat(events_list,
                      ignore_index = True)

In [10]:
# Drop events not associated with shots from events_df

events_shots_df = events_df[events_df['shot'].notna()]
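
After concatenation, the shot rows no longer record which match they came from. The sketch below shows one way to keep that link, assuming the event records themselves carry no match ID (the notebook derives each file name from `match_id`). `load_events` is a hypothetical helper, the `match_id` column name is an assumption, and the two temporary JSON files are toy stand-ins for real event files.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_events(paths):
    """Read each events file and tag its rows with the match ID from the file name."""
    frames = []
    for path in map(Path, paths):
        frame = pd.read_json(path)
        frame['match_id'] = int(path.stem)  # e.g. '19743.json' -> 19743
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Toy event files: only events that are shots carry a 'shot' dictionary
toy_events = {
    '19743': [{'id': 'evt-1', 'minute': 4},
              {'id': 'evt-2', 'minute': 4, 'shot': {'statsbomb_xg': 0.25}}],
    '19740': [{'id': 'evt-3', 'minute': 11, 'shot': {'statsbomb_xg': 0.1}}],
}

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for match_id, records in toy_events.items():
        path = Path(tmp) / f'{match_id}.json'
        path.write_text(json.dumps(records))
        paths.append(path)
    events = load_events(paths)

# Same filter as the notebook: keep only rows with shot data
shots = events[events['shot'].notna()]
print(shots[['id', 'match_id']])
```

With a `match_id` column in place, shots could later be joined back to `matches_df` on that key.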

In [11]:
events_shots_df.head()

|  | id | index | period | timestamp | minute | second | type | possession | possession_team | play_pattern | ... | foul_won | clearance | injury_stoppage | miscontrol | block | out | bad_behaviour | player_off | half_start | half_end |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 257 | 8f5a3b7c-db0b-42ec-bac0-adc0bedca2ea | 258 | 1 | 2021-06-05 00:04:38.609 | 4 | 38 | {'id': 16, 'name': 'Shot'} | 11 | {'id': 971, 'name': 'Chelsea FCW'} | {'id': 1, 'name': 'Regular Play'} | ... |  |  |  |  |  |  |  |  |  |  |
| 541 | 60ead7a6-4aa2-41ab-85a1-21357f50e4e0 | 542 | 1 | 2021-06-05 00:11:45.046 | 11 | 45 | {'id': 16, 'name': 'Shot'} | 24 | {'id': 971, 'name': 'Chelsea FCW'} | {'id': 3, 'name': 'From Free Kick'} | ... |  |  |  |  |  |  |  |  |  |  |
| 613 | f68deb6f-0711-4b9d-8081-122dc3722c55 | 614 | 1 | 2021-06-05 00:18:03.461 | 18 | 3 | {'id': 16, 'name': 'Shot'} | 29 | {'id': 971, 'name': 'Chelsea FCW'} | {'id': 1, 'name': 'Regular Play'} | ... |  |  |  |  |  |  |  |  |  |  |
| 876 | f301190f-cc0a-4f16-8278-27e5279ea24e | 877 | 1 | 2021-06-05 00:23:11.935 | 23 | 11 | {'id': 16, 'name': 'Shot'} | 43 | {'id': 969, 'name': 'Birmingham City WFC'} | {'id': 7, 'name': 'From Goal Kick'} | ... |  |  |  |  |  |  |  |  |  |  |
| 891 | 8558535e-b1ee-4f53-b003-1b5fba2712bd | 892 | 1 | 2021-06-05 00:23:45.810 | 23 | 45 | {'id': 16, 'name': 'Shot'} | 44 | {'id': 971, 'name': 'Chelsea FCW'} | {'id': 7, 'name': 'From Goal Kick'} | ... |  |  |  |  |  |  |  |  |  |  |


In [12]:
events_shots_df.to_csv(r'C:\Users\westi\Documents\github\expected_goals\data\saved_dataframes\data_extraction\events_shots_df.csv')
%store events_shots_df

Stored 'events_shots_df' (DataFrame)


<a id = 'shot_data'></a>
# Extract Shot-Specific Data

In [13]:
# Select the columns holding shot-specific data
# (several are nested dictionaries) from events_shots_df

shots_df = events_shots_df[['index',
                            'timestamp',
                            'shot',
                            'location',
                            'player',
                            'possession_team']]

In [14]:
shots_df.head()

|  | index | timestamp | shot | location | player | possession_team |
|---|---|---|---|---|---|---|
| 257 | 258 | 2021-06-05 00:04:38.609 | {'statsbomb_xg': 0.26615402, 'end_location': [... | [109.0, 46.0] | {'id': 4641, 'name': 'Francesca Kirby'} | {'id': 971, 'name': 'Chelsea FCW'} |
| 541 | 542 | 2021-06-05 00:11:45.046 | {'one_on_one': True, 'statsbomb_xg': 0.0935205... | [113.0, 35.0] | {'id': 15550, 'name': 'Bethany England'} | {'id': 971, 'name': 'Chelsea FCW'} |
| 613 | 614 | 2021-06-05 00:18:03.461 | {'statsbomb_xg': 0.036171142, 'end_location': ... | [94.0, 43.0] | {'id': 4638, 'name': 'Drew Spence'} | {'id': 971, 'name': 'Chelsea FCW'} |
| 876 | 877 | 2021-06-05 00:23:11.935 | {'statsbomb_xg': 0.016625367000000002, 'end_lo... | [86.0, 34.0] | {'id': 10193, 'name': 'Chloe Arthur'} | {'id': 969, 'name': 'Birmingham City WFC'} |
| 891 | 892 | 2021-06-05 00:23:45.810 | {'statsbomb_xg': 0.030716168000000002, 'end_lo... | [94.0, 33.0] | {'id': 15550, 'name': 'Bethany England'} | {'id': 971, 'name': 'Chelsea FCW'} |


In [15]:
shots_df.to_csv(r'C:\Users\westi\Documents\github\expected_goals\data\saved_dataframes\data_extraction\shots_df.csv')
%store shots_df

Stored 'shots_df' (DataFrame)


<a id = 'extract_features'></a>
# Extract Features from Nested Dictionaries

<a id = 'shot_features'></a>
## Shot-Specific Features

In [16]:
# Define and extract shot-specific features from
# shots_df nested dictionaries

# Shot location
location_list = []
location_list.extend(list(shots_df['location'].values))

# Create dataframe of shot features
extracted_data = pd.DataFrame(location_list)
extracted_data.columns = ['location_x',
                          'location_y']

# Shot timestamp
time_list = []
time_list.extend(list(shots_df['timestamp'].values))
extracted_data['time'] = time_list

# StatsBomb's xG metric
statsbomb_xg_list = []
for i in range(0, len(shots_df)):
    statsbomb_xg_list.append(shots_df.iloc[i]['shot']['statsbomb_xg'])
extracted_data['statsbomb_xg'] = statsbomb_xg_list

# Outcome of shot
outcome_list = []
for i in range(0, len(shots_df)):
    outcome_list.append(shots_df.iloc[i]['shot']['outcome']['name'])
extracted_data['outcome'] = outcome_list
        
# Player who shot
player_shot_list = []
for i in range(0, len(shots_df)):
    player_shot_list.append(shots_df.iloc[i]['player']['name'])
extracted_data['player_shot'] = player_shot_list
        
# Team of the player who shot
team_list = []
for i in range(0, len(shots_df)):
    team_list.append(shots_df.iloc[i]['possession_team']['name'])
extracted_data['team'] = team_list
        
# Body part used to shoot
bodypart_list = []
for i in range(0, len(shots_df)):
    bodypart_list.append(shots_df.iloc[i]['shot']['body_part']['name'])
extracted_data['bodypart'] = bodypart_list
        
# Technique used for shot
technique_list = []
for i in range(0, len(shots_df)):
    technique_list.append(shots_df.iloc[i]['shot']['technique']['name'])
extracted_data['technique'] = technique_list
        
# Whether the shot was taken with the player's first touch
# (the key is missing for some shots, which are treated as False)
first_time_list = []
for i in range(0, len(shots_df)):
    try:
        first_time_list.append(shots_df.iloc[i]['shot']['first_time'])
    except KeyError:
        first_time_list.append(False)
extracted_data['first_time'] = first_time_list
        
# State of play
state_of_play_list = []
for i in range(0, len(shots_df)):
    state_of_play_list.append(shots_df.iloc[i]['shot']['type']['name'])
extracted_data['state_of_play'] = state_of_play_list

In [17]:
extracted_data.head()

|  | location_x | location_y | time | statsbomb_xg | outcome | player_shot | team | bodypart | technique | first_time | state_of_play |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 109.0 | 46.0 | 2021-06-05 00:04:38.609 | 0.266154 | Blocked | Francesca Kirby | Chelsea FCW | Left Foot | Normal | False | Open Play |
| 1 | 113.0 | 35.0 | 2021-06-05 00:11:45.046 | 0.093521 | Off T | Bethany England | Chelsea FCW | Head | Normal | False | Open Play |
| 2 | 94.0 | 43.0 | 2021-06-05 00:18:03.461 | 0.036171 | Saved | Drew Spence | Chelsea FCW | Left Foot | Normal | False | Open Play |
| 3 | 86.0 | 34.0 | 2021-06-05 00:23:11.935 | 0.016625 | Off T | Chloe Arthur | Birmingham City WFC | Left Foot | Normal | False | Open Play |
| 4 | 94.0 | 33.0 | 2021-06-05 00:23:45.810 | 0.030716 | Off T | Bethany England | Chelsea FCW | Right Foot | Normal | False | Open Play |
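
The feature cell above repeats one loop per feature. As a sketch of a more compact alternative, a single helper can walk a chain of keys through each nested dictionary and fall back to a default when a key is missing. `nested_get` is a hypothetical helper, not part of pandas or the original notebook, and `shots_sample` is toy data with placeholder player names.

```python
import pandas as pd


def nested_get(series, *keys, default=None):
    """Follow `keys` through each dictionary in `series`; use `default` if a key is missing."""
    def walk(value):
        for key in keys:
            if not isinstance(value, dict) or key not in value:
                return default
            value = value[key]
        return value
    return series.apply(walk)


# Toy stand-in for shots_df: the second shot has no 'first_time' key
shots_sample = pd.DataFrame({
    'shot': [
        {'statsbomb_xg': 0.25, 'outcome': {'name': 'Blocked'}, 'first_time': True},
        {'statsbomb_xg': 0.10, 'outcome': {'name': 'Saved'}},
    ],
    'player': [{'name': 'Player A'}, {'name': 'Player B'}],
})

features = pd.DataFrame({
    'statsbomb_xg': nested_get(shots_sample['shot'], 'statsbomb_xg'),
    'outcome': nested_get(shots_sample['shot'], 'outcome', 'name'),
    'first_time': nested_get(shots_sample['shot'], 'first_time', default=False),
    'player_shot': nested_get(shots_sample['player'], 'name'),
})

print(features)
```

Because the helper returns Series that keep the input index, applying it to `shots_df` would preserve the original event index (257, 541, ...); calling `reset_index(drop = True)` first would reproduce the 0-based index of `extracted_data`.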


<a id = 'assist_features'></a>
## Assist-Specific Features

In [18]:
# Define and extract features specific to
# the pass which led to the shot from
# shots_df nested dictionaries

# Add pass features to dataframe

# Height of the pass (e.g. Ground Pass, High Pass) which led to the shot
assist_list = []

for i in range(0, len(shots_df)):
    try:
        # Define 'key pass' within shots_df and events_df
        key_pass = events_df['id'] == shots_df.iloc[i]['shot']['key_pass_id']
        
        # Define assist in events_df
        assist_id = events_df[key_pass].dropna(axis = 'columns')['pass']
        
        assist_list.append(assist_id.iloc[0]['height']['name'])
        
    except KeyError:
        assist_list.append(np.nan)
        
extracted_data['assist'] = assist_list

# Second alternative source for type of pass
# which led to the shot (pass technique)
assist2_list = []

for i in range(0, len(shots_df)):
    try:
        # Define 'key pass' within shots_df and events_df
        key_pass = events_df['id'] == shots_df.iloc[i]['shot']['key_pass_id']
        
        # Define assist in events_df
        assist_id = events_df[key_pass].dropna(axis = 'columns')['pass']
        
        assist2_list.append(assist_id.iloc[0]['technique']['name'])
        
    except KeyError:
        assist2_list.append(np.nan)

extracted_data['assist2'] = assist2_list

# Third alternative source for type of pass
# which led to the shot (cross, cut back, or through ball)
assist3_list = []

for i in range(0, len(shots_df)):
    try:
        # Define 'key pass' within shots_df and events_df
        key_pass = events_df['id'] == shots_df.iloc[i]['shot']['key_pass_id']
        
        # Define assist in events_df
        assist_id = events_df[key_pass].dropna(axis = 'columns')['pass']
        
        if 'cross' in assist_id.iloc[0]:
            assist3_list.append('Cross')
        
        elif 'cut_back' in assist_id.iloc[0]:
            assist3_list.append('Cut Back')
        
        elif 'through_ball' in assist_id.iloc[0]:
            assist3_list.append('Through Ball')
        
        else:
            assist3_list.append(np.nan)
        
    except KeyError:
        assist3_list.append(np.nan)

extracted_data['assist3'] = assist3_list

# State of play for pass which led to the shot
assist_state_of_play_list = []
for i in range(0, len(shots_df)):
    try:
        # Define 'key pass' within shots_df and events_df
        key_pass = events_df['id'] == shots_df.iloc[i]['shot']['key_pass_id']
        
        # Define assist in events_df
        assist_play_id = events_df[key_pass]['play_pattern']
        
        assist_state_of_play_list.append(assist_play_id.iloc[0]['name'])

    except KeyError:
        assist_state_of_play_list.append(np.nan)

extracted_data['assist_state_of_play'] = assist_state_of_play_list
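
Each loop above rebuilds a boolean mask over all of `events_df` for every shot. A hedged sketch of an alternative is to build an `id` to pass-dictionary lookup once and reuse it for every feature. The names `pass_by_id`, `PASS_FLAGS`, and `assist_features` are illustrative, and `events_sample` is toy data rather than StatsBomb output.

```python
import numpy as np
import pandas as pd

# Toy events: two passes and three shots (the last shot has no key pass)
events_sample = pd.DataFrame({
    'id': ['pass-1', 'pass-2', 'shot-1', 'shot-2', 'shot-3'],
    'pass': [{'height': {'name': 'Ground Pass'}},
             {'height': {'name': 'High Pass'}, 'cross': True},
             np.nan, np.nan, np.nan],
    'shot': [np.nan, np.nan,
             {'key_pass_id': 'pass-1'},
             {'key_pass_id': 'pass-2'},
             {'statsbomb_xg': 0.05}],
})

# Flags checked in the same order as the notebook's if/elif chain
PASS_FLAGS = {'cross': 'Cross',
              'cut_back': 'Cut Back',
              'through_ball': 'Through Ball'}

# Build the lookup once: event id -> pass dictionary
pass_by_id = events_sample.set_index('id')['pass'].dropna().to_dict()


def assist_features(shot):
    """Return pass height and pass type for the key pass of one shot dictionary."""
    key_pass = pass_by_id.get(shot.get('key_pass_id'))
    if key_pass is None:
        return {'assist': np.nan, 'assist3': np.nan}
    height = key_pass.get('height', {}).get('name', np.nan)
    pass_type = next((label for flag, label in PASS_FLAGS.items()
                      if flag in key_pass), np.nan)
    return {'assist': height, 'assist3': pass_type}


shot_dicts = events_sample.loc[events_sample['shot'].notna(), 'shot']
assists = pd.DataFrame([assist_features(shot) for shot in shot_dicts])
print(assists)
```

The dictionary lookup replaces one full scan of the events table per shot and per feature, which should matter more as additional matches are loaded.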

<a id = 'extracted_data'></a>
# Extracted Data

In [19]:
extracted_data.head()

|  | location_x | location_y | time | statsbomb_xg | outcome | player_shot | team | bodypart | technique | first_time | state_of_play | assist | assist2 | assist3 | assist_state_of_play |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 109.0 | 46.0 | 2021-06-05 00:04:38.609 | 0.266154 | Blocked | Francesca Kirby | Chelsea FCW | Left Foot | Normal | False | Open Play | Ground Pass |  |  | Regular Play |
| 1 | 113.0 | 35.0 | 2021-06-05 00:11:45.046 | 0.093521 | Off T | Bethany England | Chelsea FCW | Head | Normal | False | Open Play | High Pass |  |  | From Free Kick |
| 2 | 94.0 | 43.0 | 2021-06-05 00:18:03.461 | 0.036171 | Saved | Drew Spence | Chelsea FCW | Left Foot | Normal | False | Open Play | Ground Pass |  |  | Regular Play |
| 3 | 86.0 | 34.0 | 2021-06-05 00:23:11.935 | 0.016625 | Off T | Chloe Arthur | Birmingham City WFC | Left Foot | Normal | False | Open Play | Ground Pass |  |  | From Goal Kick |
| 4 | 94.0 | 33.0 | 2021-06-05 00:23:45.810 | 0.030716 | Off T | Bethany England | Chelsea FCW | Right Foot | Normal | False | Open Play | Ground Pass |  |  | From Goal Kick |


In [20]:
extracted_data.to_csv(r'C:\Users\westi\Documents\github\expected_goals\data\saved_dataframes\data_extraction\extracted_data.csv')
%store extracted_data

Stored 'extracted_data' (DataFrame)
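
The proposal's supervised learning target is whether a shot results in a goal. A minimal sketch of deriving that binary label from the `outcome` column follows, assuming scored shots are labeled `'Goal'` (no goal appears in the five rows shown above). The `goal` column name and the toy rows are illustrative.

```python
import pandas as pd

# Toy stand-in for the extracted 'outcome' column
sample = pd.DataFrame({
    'outcome': ['Blocked', 'Off T', 'Saved', 'Goal'],
})

# 1 when the shot was scored, 0 otherwise (assumes the label 'Goal')
sample['goal'] = (sample['outcome'] == 'Goal').astype(int)

# Share of shots scored, a quick check on class balance
goal_rate = sample['goal'].mean()
print(sample)
print(goal_rate)
```

Checking the goal rate early is useful because goals are typically a small share of all shots, which affects how a classifier should be evaluated.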
