**Capstone Project Submission**

* Student Name: Wes Swager
* Student Pace: Full Time
* Instructor Name: Claude Fried
* Scheduled Project Review Date/Time
    * Friday, June 11, 2021, 2:30pm CST
    * Tuesday, June 15, 2021, 2:30pm CST
    * Thursday, June 17, 2021, 4:30pm CST

# **Expected Goals Classifier**

## Overview

Create an Expected Goals (xG) classification model from historical match data, and use it to produce actionable recommendations for technical and tactical analysis aimed at improving goal-scoring.

Project detailed on Github: [milwaukee_rampage_fc](https://github.com/wswager/milwaukee_rampage_fc)

# Data Organization Notebook

*Notebook 2 of 8*

### Index

1. Data extracted in [expected_goals_data_extraction_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/data_extraction/expected_goals_data_extraction_notebook.ipynb)
2. Data organized in [expected_goals_data_organization_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/data_organization/expected_goals_data_organization_notebook.ipynb)
3. Features engineered in [expected_goals_feature_engineering_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/feature_engineering/expected_goals_feature_engineering_notebook.ipynb)
4. Data cleaned in [expected_goals_data_cleaning_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/data_cleaning/expected_goals_data_cleaning_notebook.ipynb)
5. Data explored in [expected_goals_data_exploration_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/data_exploration/expected_goals_data_exploration_notebook.ipynb)
6. Data preprocessed in [expected_goals_data_preprocessing_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/data_preprocessing/expected_goals_data_preprocessing_notebook.ipynb)
7. Modeling in [expected_goals_modeling_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/data_modeling/expected_goals_modeling_notebook.ipynb)
8. Conclusions in [expected_goals_conclusions_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/conclusions/expected_conclusions_notebook.ipynb)

### Data

Data is sourced from [StatsBomb](https://statsbomb.com/), a United Kingdom-based football (soccer) data analytics company.

StatsBomb have provided free access to their proprietary dataset via GitHub: [StatsBomb Open Data](https://github.com/statsbomb/open-data)

In [None]:
# Import events_df from expected_goals_data_extraction_notebook
# (run the Packages cell first: it imports pandas and mounts Drive)

events_df = pd.read_csv('/content/drive/MyDrive/flatiron/expected_goals/data_extraction/events_df.csv')

In [None]:
events_df.head()

Unnamed: 0.1,Unnamed: 0,id,index,period,timestamp,minute,second,type,possession,possession_team,play_pattern,team,duration,tactics,related_events,player,position,location,pass,carry,under_pressure,ball_receipt,counterpress,duel,interception,dribble,shot,goalkeeper,off_camera,ball_recovery,50_50,foul_committed,substitution,foul_won,clearance,injury_stoppage,miscontrol,block,out,bad_behaviour,player_off,half_start,half_end
0,0,c9425423-18d0-4c75-bdf1-cac6ecfef8cd,1,1,2021-06-12 00:00:00.000,0,0,"{'id': 35, 'name': 'Starting XI'}",1,"{'id': 969, 'name': 'Birmingham City WFC'}","{'id': 1, 'name': 'Regular Play'}","{'id': 969, 'name': 'Birmingham City WFC'}",0.0,"{'formation': 4231, 'lineup': [{'player': {'id...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,1,0a9c1eba-633a-42de-8386-64af450d5d44,2,1,2021-06-12 00:00:00.000,0,0,"{'id': 35, 'name': 'Starting XI'}",1,"{'id': 969, 'name': 'Birmingham City WFC'}","{'id': 1, 'name': 'Regular Play'}","{'id': 971, 'name': 'Chelsea FCW'}",0.0,"{'formation': 42211, 'lineup': [{'player': {'i...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2,d3bda43e-4172-42e0-8a29-0629fab2a5ac,3,1,2021-06-12 00:00:00.000,0,0,"{'id': 18, 'name': 'Half Start'}",1,"{'id': 969, 'name': 'Birmingham City WFC'}","{'id': 1, 'name': 'Regular Play'}","{'id': 969, 'name': 'Birmingham City WFC'}",0.0,,['1f0b713f-be11-4c49-8a21-d42f1f66ef87'],,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,3,1f0b713f-be11-4c49-8a21-d42f1f66ef87,4,1,2021-06-12 00:00:00.000,0,0,"{'id': 18, 'name': 'Half Start'}",1,"{'id': 969, 'name': 'Birmingham City WFC'}","{'id': 1, 'name': 'Regular Play'}","{'id': 971, 'name': 'Chelsea FCW'}",0.0,,['d3bda43e-4172-42e0-8a29-0629fab2a5ac'],,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,4,667dda2e-b35d-4d46-ad09-40b3f491f160,5,1,2021-06-12 00:00:01.324,0,1,"{'id': 30, 'name': 'Pass'}",2,"{'id': 971, 'name': 'Chelsea FCW'}","{'id': 9, 'name': 'From Kick Off'}","{'id': 971, 'name': 'Chelsea FCW'}",1.228695,,['8dc92bd7-d6a0-4d60-b24e-b0352d135b62'],"{'id': 4641, 'name': 'Francesca Kirby'}","{'id': 23, 'name': 'Center Forward'}","[61.0, 41.0]","{'recipient': {'id': 15549, 'name': 'Sophie In...",,,,,,,,,,,,,,,,,,,,,,,,
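
Because `events_df` is reloaded from a CSV file, nested columns such as `shot`, `location`, and `player` may come back as plain strings rather than dictionaries and lists, in which case the key lookups used later in this notebook would fail. The sketch below shows one way to restore them with `ast.literal_eval`. The helper name `parse_nested` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import ast
import io

import pandas as pd


def parse_nested(value):
    """Hypothetical helper: turn a stringified dict/list back into a Python object."""
    if isinstance(value, str):
        return ast.literal_eval(value)
    # Missing values (NaN) and already-parsed objects pass through unchanged
    return value


# Toy frame with one shot event and one non-shot event (illustrative values only)
original = pd.DataFrame({
    'id': ['a1', 'b2'],
    'location': [[109.0, 46.0], None],
    'shot': [{'statsbomb_xg': 0.27, 'outcome': {'name': 'Blocked'}}, None],
})

# Round trip through CSV: the nested values are written as their string form
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Parse only the columns that hold nested data
for column in ['location', 'shot']:
    reloaded[column] = reloaded[column].apply(parse_nested)
```

After parsing, `reloaded.loc[0, 'shot']['outcome']['name']` can be read as a nested dictionary lookup again, while rows without a shot stay missing.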


# Packages

In [None]:
# Drive and IO to access saved data
from google.colab import drive, files
drive.mount('/content/drive')

import io

# Pathlib for file retrieval
import pathlib
from pathlib import Path

# Pandas for Dataframes
import pandas as pd

# Numpy for mathematical functions
import numpy as np

import math
from math import atan2

# Shapely for geometric functions
import shapely
from shapely import wkt
from shapely.geometry import Point, Polygon, LineString, GeometryCollection

import warnings
warnings.filterwarnings('ignore')

Mounted at /content/drive


# Extract Shot-Specific Data

In [None]:
# Drop events not associated with shots from events_df

shots_df = events_df[events_df['shot'].notna()]

In [None]:
# Keep only the columns that hold shot-specific data

shots_df = shots_df[['index',
                     'timestamp',
                     'shot',
                     'location',
                     'player',
                     'possession_team']]

In [None]:
shots_df.head()

Unnamed: 0,index,timestamp,shot,location,player,possession_team
257,258,2021-06-13 00:04:38.609,"{'statsbomb_xg': 0.26615402, 'end_location': [...","[109.0, 46.0]","{'id': 4641, 'name': 'Francesca Kirby'}","{'id': 971, 'name': 'Chelsea FCW'}"
541,542,2021-06-13 00:11:45.046,"{'one_on_one': True, 'statsbomb_xg': 0.0935205...","[113.0, 35.0]","{'id': 15550, 'name': 'Bethany England'}","{'id': 971, 'name': 'Chelsea FCW'}"
613,614,2021-06-13 00:18:03.461,"{'statsbomb_xg': 0.036171142, 'end_location': ...","[94.0, 43.0]","{'id': 4638, 'name': 'Drew Spence'}","{'id': 971, 'name': 'Chelsea FCW'}"
876,877,2021-06-13 00:23:11.935,"{'statsbomb_xg': 0.016625367000000002, 'end_lo...","[86.0, 34.0]","{'id': 10193, 'name': 'Chloe Arthur'}","{'id': 969, 'name': 'Birmingham City WFC'}"
891,892,2021-06-13 00:23:45.810,"{'statsbomb_xg': 0.030716168000000002, 'end_lo...","[94.0, 33.0]","{'id': 15550, 'name': 'Bethany England'}","{'id': 971, 'name': 'Chelsea FCW'}"


# Extract Features from Nested Dictionaries

## Shot-Specific Features

In [None]:
# Define and extract shot-specific features from shots_df nested dictionaries

# Shot location

location_list = []
location_list.extend(list(shots_df['location'].values))

# Create dataframe of shot features

extracted_data = pd.DataFrame(location_list)
extracted_data.columns = ['location_x',
                          'location_y']

# Shot timestamp

time_list = []
time_list.extend(list(shots_df['timestamp'].values))
extracted_data['time'] = time_list

# StatsBomb's xG metric

statsbomb_xg_list = []
for i in range(0, len(shots_df)):
    statsbomb_xg_list.append(shots_df.iloc[i]['shot']['statsbomb_xg'])
extracted_data['statsbomb_xg'] = statsbomb_xg_list

# Outcome of shot

outcome_list = []
for i in range(0, len(shots_df)):
    outcome_list.append(shots_df.iloc[i]['shot']['outcome']['name'])
extracted_data['outcome'] = outcome_list
        
# Player who shot

player_list = []
for i in range(0, len(shots_df)):
    player_list.append(shots_df.iloc[i]['player']['name'])
extracted_data['player'] = player_list
        
# Team of the player who shot

team_list = []
for i in range(0, len(shots_df)):
    team_list.append(shots_df.iloc[i]['possession_team']['name'])
extracted_data['team'] = team_list
        
# Bodypart used to shoot

bodypart_list = []
for i in range(0, len(shots_df)):
    bodypart_list.append(shots_df.iloc[i]['shot']['body_part']['name'])
extracted_data['bodypart'] = bodypart_list
        
# Technique used for shot

technique_list = []
for i in range(0, len(shots_df)):
    technique_list.append(shots_df.iloc[i]['shot']['technique']['name'])
extracted_data['technique'] = technique_list
        
# If the shot was taken with the player's first touch
# (shot directly from receiving the pass, without any preceding touches)

first_touch_list = []
for i in range(0, len(shots_df)):
    try:
        first_touch_list.append(shots_df.iloc[i]['shot']['first_time'])
    # A missing 'first_time' key is treated as False
    except KeyError:
        first_touch_list.append(False)
extracted_data['first_touch'] = first_touch_list
        
# State of play
# If the play was open play or a set piece

state_of_play_list = []
for i in range(0, len(shots_df)):
    state_of_play_list.append(shots_df.iloc[i]['shot']['type']['name'])
extracted_data['state_of_play'] = state_of_play_list

In [None]:
extracted_data.head()

Unnamed: 0,location_x,location_y,time,statsbomb_xg,outcome,player,team,bodypart,technique,first_touch,state_of_play
0,109.0,46.0,2021-06-13 00:04:38.609,0.266154,Blocked,Francesca Kirby,Chelsea FCW,Left Foot,Normal,False,Open Play
1,113.0,35.0,2021-06-13 00:11:45.046,0.093521,Off T,Bethany England,Chelsea FCW,Head,Normal,False,Open Play
2,94.0,43.0,2021-06-13 00:18:03.461,0.036171,Saved,Drew Spence,Chelsea FCW,Left Foot,Normal,False,Open Play
3,86.0,34.0,2021-06-13 00:23:11.935,0.016625,Off T,Chloe Arthur,Birmingham City WFC,Left Foot,Normal,False,Open Play
4,94.0,33.0,2021-06-13 00:23:45.810,0.030716,Off T,Bethany England,Chelsea FCW,Right Foot,Normal,False,Open Play
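
As an alternative to one loop per feature, the nested `shot` dictionaries can be flattened in a single step with `pd.json_normalize`. This is a minimal sketch on toy records that mirror the layout shown above. The values are illustrative, and the approach is a suggestion rather than the method used in this notebook.

```python
import pandas as pd

# Toy shot records (illustrative values only)
shots = [
    {'statsbomb_xg': 0.27,
     'outcome': {'name': 'Blocked'},
     'body_part': {'name': 'Left Foot'},
     'first_time': True},
    {'statsbomb_xg': 0.09,
     'outcome': {'name': 'Off T'},
     'body_part': {'name': 'Head'}},  # no 'first_time' key recorded
]

# Nested keys become dotted column names, e.g. 'outcome.name'
flat = pd.json_normalize(shots)

# reindex guarantees every expected column exists, even if no record had the key
flat = flat.reindex(columns=['statsbomb_xg',
                             'outcome.name',
                             'body_part.name',
                             'first_time'])

features = pd.DataFrame({
    'statsbomb_xg': flat['statsbomb_xg'],
    'outcome': flat['outcome.name'],
    'bodypart': flat['body_part.name'],
    # A missing 'first_time' is treated as False, as in the loop above
    'first_touch': flat['first_time'].eq(True),
})
```

The resulting `features` frame uses the same column names as `extracted_data`, so the two approaches can be compared directly.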


## Assist-Specific Features

In [None]:
# Define and extract assist-specific features from events_df nested dictionaries
# The assist is the pass to the player who shot

# Due to subjectivity in data recording, assist data is recorded in a
# variety of locations and methods, so three sources are collected

# Each shot is matched to its own pass through the shot's 'key_pass_id'

assist_list = []                # 1st source for type of pass (height)
assist2_list = []               # 2nd alternative source (technique)
assist3_list = []               # 3rd alternative source (pass flags)
assist_state_of_play_list = []  # State of play for pass

for i in range(0, len(shots_df)):
    # Define 'key pass' within shots_df and events_df
    key_pass_id = shots_df.iloc[i]['shot'].get('key_pass_id')
    key_pass = events_df[events_df['id'] == key_pass_id]

    # Shots without a recorded key pass have no assist features
    if key_pass.empty:
        assist_list.append(np.nan)
        assist2_list.append(np.nan)
        assist3_list.append(np.nan)
        assist_state_of_play_list.append(np.nan)
        continue

    # Define assist in events_df
    assist_id = key_pass.iloc[0]['pass']

    # 1st source for type of pass
    assist_list.append(assist_id.get('height', {}).get('name', np.nan))

    # 2nd alternative source for type of pass
    assist2_list.append(assist_id.get('technique', {}).get('name', np.nan))

    # 3rd alternative source for type of pass
    if 'cross' in assist_id:
        assist3_list.append('Cross')
    elif 'cut_back' in assist_id:
        assist3_list.append('Cut Back')
    elif 'through_ball' in assist_id:
        assist3_list.append('Through Ball')
    else:
        assist3_list.append(np.nan)

    # State of play for pass
    assist_state_of_play_list.append(key_pass.iloc[0]['play_pattern']['name'])

# Add assist features to dataframe
extracted_data['assist'] = assist_list
extracted_data['assist2'] = assist2_list
extracted_data['assist3'] = assist3_list
extracted_data['assist_state_of_play'] = assist_state_of_play_list
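
Filtering `events_df` once per shot scans every event for each lookup. A possible alternative, sketched here on toy data, is to build a dictionary of passes keyed by event `id` and look each shot's `key_pass_id` up in it. The names `passes_by_id` and `pass_height` and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy events: two passes followed by three shots (illustrative values only)
events = pd.DataFrame({
    'id': ['p1', 'p2', 's1', 's2', 's3'],
    'pass': [{'height': {'name': 'Ground Pass'}, 'cross': True},
             {'height': {'name': 'High Pass'}},
             None, None, None],
    'shot': [None, None,
             {'key_pass_id': 'p1', 'statsbomb_xg': 0.27},
             {'key_pass_id': 'p2', 'statsbomb_xg': 0.09},
             {'statsbomb_xg': 0.05}],  # shot with no key pass
})

# Map each pass event id to its nested pass dictionary
passes_by_id = {event_id: event_pass
                for event_id, event_pass in zip(events['id'], events['pass'])
                if isinstance(event_pass, dict)}


def pass_height(shot):
    """Hypothetical helper: height of the key pass for one shot, or NaN."""
    key_pass = passes_by_id.get(shot.get('key_pass_id'))
    if key_pass is None:
        return np.nan
    return key_pass.get('height', {}).get('name', np.nan)


shots = events[events['shot'].notna()]
heights = shots['shot'].apply(pass_height).tolist()
```

Each shot resolves its own pass here, so a shot without a `key_pass_id` yields a missing value rather than inheriting another shot's pass.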

# Organized Data

In [None]:
organized_data = extracted_data
extracted_data.head()

Unnamed: 0,location_x,location_y,time,statsbomb_xg,outcome,player,team,bodypart,technique,first_touch,state_of_play,assist,assist2,assist3,assist_state_of_play
0,109.0,46.0,2021-06-13 00:04:38.609,0.266154,Blocked,Francesca Kirby,Chelsea FCW,Left Foot,Normal,False,Open Play,Ground Pass,,Cross,From Free Kick
1,113.0,35.0,2021-06-13 00:11:45.046,0.093521,Off T,Bethany England,Chelsea FCW,Head,Normal,False,Open Play,High Pass,,Cross,From Free Kick
2,94.0,43.0,2021-06-13 00:18:03.461,0.036171,Saved,Drew Spence,Chelsea FCW,Left Foot,Normal,False,Open Play,Ground Pass,,Cross,From Free Kick
3,86.0,34.0,2021-06-13 00:23:11.935,0.016625,Off T,Chloe Arthur,Birmingham City WFC,Left Foot,Normal,False,Open Play,Ground Pass,,Cross,From Free Kick
4,94.0,33.0,2021-06-13 00:23:45.810,0.030716,Off T,Bethany England,Chelsea FCW,Right Foot,Normal,False,Open Play,Ground Pass,,Cross,From Free Kick


In [None]:
organized_data.to_csv('/content/drive/MyDrive/flatiron/expected_goals/data_organization/organized_data.csv')
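
The `Unnamed: 0` and `Unnamed: 0.1` columns in the `events_df` preview are consistent with a row index being written on save and read back as an ordinary column on load. The sketch below shows that behaviour on a toy frame and how `index=False` avoids it. The helper `round_trip` is hypothetical, and whether the later notebooks rely on the extra column is not known from this notebook.

```python
import io

import pandas as pd

# Toy frame standing in for organized_data (illustrative values only)
frame = pd.DataFrame({'location_x': [109.0, 113.0],
                      'location_y': [46.0, 35.0]})


def round_trip(df, **to_csv_kwargs):
    """Hypothetical helper: write df to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    df.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)


# Default: the index is saved and returns as an 'Unnamed: 0' column
with_index = round_trip(frame)

# index=False: only the data columns are saved
without_index = round_trip(frame, index=False)
```

Passing `index_col=0` to `pd.read_csv` is another option when the file has already been written with its index.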

Continued in [expected_goals_feature_engineering_notebook](https://github.com/wswager/milwaukee_rampage_fc/blob/main/feature_engineering/expected_goals_feature_engineering_notebook.ipynb)

*Notebook 3 of 8*