# **Expected Goals Classifier**

## Overview

Create an Expected Goals (xG) classification model using existing historical match data to produce actionable recommendations which can be utilized in technical and tactical analysis to improve goal-scoring.

Project detailed on Github: [Expected Goals Classifier]()

# Feature Engineering Notebook

Continued from [expected_goals_data_exploration_notebook]()

*Notebook 3 of 7*

### Index

1. Data extracted in [expected_goals_data_extraction_notebook]()
2. Data cleaned in [expected_goals_data_cleaning_notebook]()
3. Data explored in [expected_goals_data_exploration_notebook]()
4. Features engineered in [expected_goals_feature_engineering_notebook]()
5. Data preprocessed in [expected_goals_data_preprocessing_notebook]()
6. Modeling in [expected_goals_model_fitting_notebook]()
7. Conclusions in [expected_goals_model_assessment_notebook]()

# Packages

In [17]:
# rpy2 to run R
%load_ext rpy2.ipython

# Drive  and IO to access saved files
from google.colab import drive, files
drive.mount('/content/drive')

import io

# Pathlib for file retrieval
import pathlib
from pathlib import Path as path

# PyPy to improve speed
!apt-get install pypy

# warnings to ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Pandas for dataframes
import pandas as pd

# Numpy for mathematical functions
import numpy as np

import math
from math import atan2

# Shapely for geometric functions
import shapely
from shapely import wkt
from shapely.geometry import Point, Polygon, LineString, GeometryCollection

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Reading package lists... Done
Building dependency tree       
Reading state information... Done
pypy is already the newest version (5.10.0+dfsg-3build2).
0 upgraded, 0 newly installed, 0 to remove and 37 not upgraded.


### Data

Data sourced from [StatsBomb](https://statsbomb.com/), a United Kingdom based football (soccer) data analytics company.

StatsBomb have provided free access to their proprietary dataset via GitHub: [StatsBomb Open Data](https://github.com/statsbomb/open-data)

In [3]:
# Import cleaned_data from expected_goals_data_extraction_notebook

cleaned_data = pd.read_parquet('/content/drive/MyDrive/flatiron/expected_goals/data_cleaning/dataframes/cleaned_data.parquet')

In [4]:
cleaned_data.head()

Unnamed: 0,period_x,timestamp_x,play_pattern_x,under_pressure_x,shot_statsbomb_xg,shot_end_location,shot_technique,goal,shot_type,shot_body_part,shot_one_on_one,shot_aerial_won,shot_open_goal,shot_first_time,shot_redirect,shot_deflected,shot_follows_dribble,play_pattern_y,pass_length,pass_angle,pass_height,pass_body_part,pass_type,pass_switch,pass_through_ball,pass_technique,pass_backheel,pass_cross,counterpress,pass_cut_back,pass_deflected,pass_inswinging,pass_straight,pass_outswinging,pass_no_touch,shot_location_y,shot_location_x
0,1,00:04:38.609,Regular Play,True,0.266154,45.0,Normal,False,Open Play,Left Foot,False,False,False,False,False,False,False,Regular Play,11.18034,0.463648,Ground Pass,Left Foot,Open Play,False,False,Standard,False,False,False,False,False,False,False,False,False,109.0,46.0
1,1,00:11:45.046,From Free Kick,True,0.093521,32.9,Normal,False,Open Play,Head,True,True,False,False,False,False,False,From Free Kick,37.735924,-0.558599,High Pass,Right Foot,Free Kick,False,False,Standard,False,False,False,False,False,False,False,False,False,113.0,35.0
2,1,00:18:03.461,Regular Play,True,0.036171,42.8,Normal,False,Open Play,Left Foot,False,False,False,False,False,False,False,Regular Play,11.18034,-2.034444,Ground Pass,Right Foot,Open Play,False,False,Standard,False,False,False,False,False,False,False,False,False,94.0,43.0
3,1,00:23:11.935,From Goal Kick,True,0.016625,33.3,Normal,False,Open Play,Left Foot,False,False,False,False,False,False,False,From Goal Kick,13.892444,2.098871,Ground Pass,Right Foot,Open Play,False,False,Standard,False,False,False,False,False,False,False,False,False,86.0,34.0
4,1,00:23:45.810,From Goal Kick,False,0.030716,34.8,Normal,False,Open Play,Right Foot,False,False,False,False,False,False,False,From Goal Kick,14.56022,1.292497,Ground Pass,Left Foot,Open Play,False,False,Standard,False,False,False,False,False,False,False,False,False,94.0,33.0


# Distance

In [5]:
# Define goal center
# Note: Field coordinates for events measured for in-possession team

goal_center = (120, 40)

In [6]:
# Use location_x and location_y to define shot coordinates

shot_location_list = []
for i in range(0, len(cleaned_data)):
  shot_location_list.append((cleaned_data.iloc[i]['shot_location_x'],
                             cleaned_data.iloc[i]['shot_location_y']))

In [13]:
# Calculate distance from shot_location to shot_end_location

shot_distance_list = []
for sl in shot_location_list:
  shot_distance_list.append(Point(sl).distance(Point((cleaned_data.iloc[i]['shot_end_location'],
                                                      120))))

In [14]:
# Create new feature in cleaned_data for shot_distance

cleaned_data['shot_distance'] = shot_distance_list

In [15]:
cleaned_data['shot_distance'].describe()

count    6080.000000
mean       20.617300
std         9.187915
min         1.400000
25%        13.300376
50%        19.798990
75%        27.221356
max        71.478668
Name: shot_distance, dtype: float64

# Angle

In [19]:
# Calculate angle between the shot location and shot_end_location

shot_angle_list = []
for i in range(0, len(cleaned_data)):
  shot_angle_list.append(round(math.degrees(math.atan2(((cleaned_data.iloc[i]['shot_end_location']) - (cleaned_data.iloc[i]['shot_location_x'])),
                                                       (120 - (cleaned_data.iloc[i]['shot_location_y'])))), 2))

In [20]:
# Create new feature in cleaned_data for shot_angle

cleaned_data['shot_angle'] = shot_angle_list

In [22]:
pd.DataFrame(cleaned_data['shot_angle'].describe())

Unnamed: 0,shot_angle
count,6080.0
mean,-0.806775
std,31.290013
min,-90.0
25%,-20.6925
50%,0.0
75%,19.25
max,90.0


# Drop Location Features

In [25]:
# Location features no longer necessary after shot_distance and shot_angle engineered

cleaned_data.drop(['shot_location_x',
                   'shot_location_y',
                   'shot_end_location'],
                  axis = 1,
                  inplace = True)

# Data with Engineered Features

In [26]:
data_with_engineered_features = cleaned_data

In [27]:
data_with_engineered_features.head()

Unnamed: 0,period_x,timestamp_x,play_pattern_x,under_pressure_x,shot_statsbomb_xg,shot_technique,goal,shot_type,shot_body_part,shot_one_on_one,shot_aerial_won,shot_open_goal,shot_first_time,shot_redirect,shot_deflected,shot_follows_dribble,play_pattern_y,pass_length,pass_angle,pass_height,pass_body_part,pass_type,pass_switch,pass_through_ball,pass_technique,pass_backheel,pass_cross,counterpress,pass_cut_back,pass_deflected,pass_inswinging,pass_straight,pass_outswinging,pass_no_touch,shot_distance,shot_angle
0,1,00:04:38.609,Regular Play,True,0.266154,Normal,False,Open Play,Left Foot,False,False,False,False,False,False,False,Regular Play,11.18034,0.463648,Ground Pass,Left Foot,Open Play,False,False,Standard,False,False,False,False,False,False,False,False,False,17.804494,-5.19
1,1,00:11:45.046,From Free Kick,True,0.093521,Normal,False,Open Play,Head,True,True,False,False,False,False,False,From Free Kick,37.735924,-0.558599,High Pass,Right Foot,Free Kick,False,False,Standard,False,False,False,False,False,False,False,False,False,7.615773,-16.7
2,1,00:18:03.461,Regular Play,True,0.036171,Normal,False,Open Play,Left Foot,False,False,False,False,False,False,False,Regular Play,11.18034,-2.034444,Ground Pass,Right Foot,Open Play,False,False,Standard,False,False,False,False,False,False,False,False,False,28.231188,-0.44
3,1,00:23:11.935,From Goal Kick,True,0.016625,Normal,False,Open Play,Left Foot,False,False,False,False,False,False,False,From Goal Kick,13.892444,2.098871,Ground Pass,Right Foot,Open Play,False,False,Standard,False,False,False,False,False,False,False,False,False,34.058773,-1.18
4,1,00:23:45.810,From Goal Kick,False,0.030716,Normal,False,Open Play,Right Foot,False,False,False,False,False,False,False,From Goal Kick,14.56022,1.292497,Ground Pass,Left Foot,Open Play,False,False,Standard,False,False,False,False,False,False,False,False,False,26.019224,3.96


In [28]:
print('Total Events:',
      len(data_with_engineered_features))

Total Events: 6080


In [29]:
print('Total Features:',
      data_with_engineered_features.shape[1])

Total Features: 36


In [30]:
# Save cleaned_data

cleaned_data.to_parquet('/content/drive/MyDrive/flatiron/expected_goals/feature_engineering/dataframes/data_with_engineered_features.parquet')

In [31]:
print('extracted_data Filesize:',
      path('/content/drive/MyDrive/flatiron/expected_goals/feature_engineering/dataframes/data_with_engineered_features.parquet').stat().st_size,
      'bytes')

extracted_data Filesize: 308442 bytes


Continued in [expected_goals_data_exploration_notebook]()

*4 of 7*