**Capstone Project Submission**

* Student Name: Wes Swager
* Student Pace: Full Time
* Instructor Name: Claude Fried
* Scheduled Project Review Date/Time
    * Friday, June 11, 2021, 2:30pm CST
    * Monday, June 13, 2021, 2:30pm CST

# **Expected Goals Classifier**

# Overview

Build an expected goals (xG) classification model from existing historical match data. The model supports analysis of future matches and provides actionable recommendations that can be used in training to improve goal scoring.

Project detailed on Github: [milwaukee_fc](https://github.com/wswager/milwaukee_fc)

# Data Preprocessing Notebook



*Notebook 6 of 8*

### Index

1. Data extracted in [expected_goals_data_extraction_notebook]()
2. Data organized in [expected_goals_data_organization_notebook]()
3. Features engineered in [expected_goals_feature_engineering_notebook]()
4. Data cleaned in [expected_goals_data_cleaning_notebook]()
5. Data explored in [expected_goals_data_exploration_notebook]()
6. Data preprocessed in [expected_goals_data_preprocessing_notebook]()
7. Model fitting and refinement in [expected_goals_model_fitting_notebook]()
8. Model assessment in [expected_goals_model_assessment_notebook]()

### Data

Data is sourced from [StatsBomb](https://statsbomb.com/), a United Kingdom-based football (soccer) data analytics company.

StatsBomb have provided free access to their proprietary dataset via GitHub: [StatsBomb Open Data](https://github.com/statsbomb/open-data)

In [2]:
# Import cleaned_data from expected_goals_data_cleaning_notebook

cleaned_data = pd.read_csv('/content/drive/MyDrive/flatiron/expected_goals/data_cleaning/cleaned_data.csv')

In [3]:
# Drop the first column (the unnamed index column written when the CSV was saved)

cleaned_data = cleaned_data.iloc[: , 1:]

In [4]:
cleaned_data.head()

| Unnamed: 0 | statsbomb_xg | goal | time | player | team | shot_distance | inside_18 | shot_angle | bodypart | bodypart_angle | technique | first_touch | assist | state_of_play |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.266154 | False | 4 | Francesca Kirby | Chelsea FCW | 12.529964 | True | 118.61 | Left Foot | Right - Inside Foot | Ground | False | Ground Pass | Open Play |
| 1 | 0.093521 | False | 11 | Bethany England | Chelsea FCW | 8.602325 | True | 54.46 | Head | Left - Head | Ground | False | High Pass | Set Piece - Free Kick |
| 2 | 0.036171 | False | 18 | Drew Spence | Chelsea FCW | 26.172505 | False | 96.58 | Left Foot | Right - Inside Foot | Ground | False | Ground Pass | Open Play |
| 3 | 0.016625 | False | 23 | Chloe Arthur | Birmingham City WFC | 34.525353 | False | 79.99 | Left Foot | Left - Outside Foot | Ground | False | Ground Pass | Set Piece - Goal Kick |
| 4 | 0.030716 | False | 23 | Bethany England | Chelsea FCW | 26.925824 | False | 74.93 | Right Foot | Left - Inside Foot | Ground | False | Ground Pass | Set Piece - Goal Kick |


<a id = 'packages'></a>
# Packages

In [1]:
# Drive and IO to access saved data
from google.colab import drive, files
drive.mount('/content/drive')

import io

# Pandas for Dataframes
import pandas as pd

# Numpy for mathematical functions
import numpy as np

# Import Scikit-learn for modeling
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

import warnings
warnings.filterwarnings('ignore')

Mounted at /content/drive


# Drop Features Not Relevant to Modeling

In [5]:
# Drop identifying variables, 'player' and 'team'

# Drop statsbomb_xg, since the model will generate its own xG

cleaned_data.drop(['statsbomb_xg',
                   'player',
                   'team'],
                  axis = 1,
                  inplace = True)

# Separate Target Variable

In [6]:
y = cleaned_data['goal']
X = cleaned_data.drop('goal',
                      axis = 1)

In [23]:
y = pd.DataFrame(y)

# Train/Test Split

In [7]:
# Split data into train and test sets, stratified on the target

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    stratify = y,
                                                    random_state = 13)
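
The effect of `stratify = y` in the split above can be illustrated with a minimal sketch. The toy data and the `toy_` names are hypothetical and are not part of the project data; the sketch only assumes the same `train_test_split` arguments used in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 shots, 10 of which are goals
toy_X = pd.DataFrame({'shot_distance': np.arange(100, dtype=float)})
toy_y = pd.Series([True] * 10 + [False] * 90, name='goal')

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y,
    test_size=0.2,
    stratify=toy_y,
    random_state=13)

# With stratification, the share of goals is preserved in both subsets
train_goal_rate = toy_y_train.mean()
test_goal_rate = toy_y_test.mean()
print(train_goal_rate, test_goal_rate)
```

Because goals are much rarer than non-goals in shot data, stratifying keeps the test set from ending up with an unrepresentative share of goals.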

# Preprocessing

## Label Encode Target Variable

In [8]:
le = LabelEncoder()

In [9]:
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

In [10]:
# Convert encoded target variables to dataframe

y_train = pd.DataFrame(y_train)
y_test = pd.DataFrame(y_test)
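
As a small illustration of what the label encoding above does to a boolean target, the sketch below uses hypothetical toy arrays (the `toy_` names are assumptions, not project variables). `LabelEncoder` sorts the classes it sees, so `False` maps to 0 and `True` maps to 1.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy targets standing in for the 'goal' column
toy_goal_train = np.array([False, True, False, False])
toy_goal_test = np.array([True, False])

toy_le = LabelEncoder()

# Fit on the training target only, then reuse the mapping for the test target
encoded_train = toy_le.fit_transform(toy_goal_train)
encoded_test = toy_le.transform(toy_goal_test)

# inverse_transform recovers the original boolean labels
decoded_test = toy_le.inverse_transform(encoded_test)

print(toy_le.classes_, encoded_train, encoded_test, decoded_test)
```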

## Variable Encoding

### Label Encode Boolean Variables

In [11]:
# Define boolean variables

boolean_variables = ['inside_18',
                     'first_touch']

In [12]:
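# Label encode boolean variables

# Copy the slices so the assignments below do not trigger SettingWithCopyWarning
le_train = X_train[boolean_variables].copy()
le_test = X_test[boolean_variables].copy()
le_X = X[boolean_variables].copy()

# Fit on the training data only, then apply the same mapping to test and full data
for i in boolean_variables:
  le_train[i] = le.fit_transform(le_train[i])
  le_test[i] = le.transform(le_test[i])
  le_X[i] = le.transform(le_X[i])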

le_train = pd.DataFrame(le_train)
le_test = pd.DataFrame(le_test)
le_X = pd.DataFrame(le_X)

### One Hot Encode Categorical Variables

In [13]:
# Define categorical variables

categorical_variables = ['bodypart',
                         'technique',
                         'assist',
                         'state_of_play']

In [14]:
ohe = OneHotEncoder(categories = 'auto',
                    handle_unknown = 'ignore')

In [15]:
# One Hot Encode categorical variables

# Fit once on all categorical columns of the training data; fitting inside a
# per-column loop would overwrite the result and keep only the last column
ohe_train_features = ohe.fit_transform(X_train[categorical_variables]).toarray()

# Prefix each category with its column name so the labels stay unique
ohe_labels = [f'{column}_{category}'
              for column, categories in zip(categorical_variables, ohe.categories_)
              for category in categories]

ohe_train = pd.DataFrame(ohe_train_features,
                         columns = ohe_labels)

# Apply the fitted encoder to the test data and the full data
ohe_test = pd.DataFrame(ohe.transform(X_test[categorical_variables]).toarray(),
                        columns = ohe_labels)
ohe_X = pd.DataFrame(ohe.transform(X[categorical_variables]).toarray(),
                     columns = ohe_labels)
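
The encoder above is created with `handle_unknown = 'ignore'`. A minimal sketch of what that setting does, using hypothetical toy frames (the `toy_` names and the unseen `'Other'` category are assumptions for illustration): a category that was not present when the encoder was fitted is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; 'Other' appears only in the test frame
toy_train = pd.DataFrame({'bodypart': ['Left Foot', 'Right Foot', 'Head']})
toy_test = pd.DataFrame({'bodypart': ['Head', 'Other']})

toy_ohe = OneHotEncoder(categories='auto',
                        handle_unknown='ignore')

# Fit on the training frame only
toy_ohe.fit(toy_train)

# Categories are stored in sorted order: Head, Left Foot, Right Foot
encoded = toy_ohe.transform(toy_test).toarray()

print(toy_ohe.categories_[0])
print(encoded)
```

The first test row gets a 1 in the `Head` column, while the unseen `Other` row is all zeros.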

## Scale Numerical Variables

In [16]:
# Define numerical variables

numerical_variables = ['time',
                       'shot_distance',
                       'shot_angle']

In [17]:
ss = StandardScaler()

In [18]:
ct = ColumnTransformer([('ss', ss, numerical_variables)])

In [19]:
# Scale numerical variables

ss_train = ct.fit_transform(X_train)
ss_test = ct.transform(X_test)
ss_X = ct.transform(X)

ss_train = pd.DataFrame(ss_train,
                        columns = numerical_variables)
ss_test = pd.DataFrame(ss_test,
                        columns = numerical_variables)
ss_X = pd.DataFrame(ss_X,
                    columns = numerical_variables)
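
The scaling step above fits on the training data and only transforms the test data. The sketch below shows the consequence with hypothetical toy values (the `toy_` names are assumptions): the training column ends up with mean 0 and standard deviation 1, while the test column is shifted and scaled by the training statistics and so generally does not.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy shot distances
toy_train = np.array([[10.0], [20.0], [30.0]])
toy_test = np.array([[20.0], [40.0]])

toy_ss = StandardScaler()

# Mean and standard deviation are learned from the training data only
scaled_train = toy_ss.fit_transform(toy_train)

# The test data reuses the training mean (20) and population standard deviation
scaled_test = toy_ss.transform(toy_test)

print(scaled_train.ravel())
print(scaled_test.ravel())
```

Fitting on the training data alone keeps information about the test set from leaking into the preprocessing.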

# Create New Dataframe from Preprocessed Data

In [20]:
le_train.reset_index(drop = True,
                     inplace = True)
ohe_train.reset_index(drop = True,
                     inplace = True)
ss_train.reset_index(drop = True,
                     inplace = True)

le_test.reset_index(drop = True,
                     inplace = True)
ohe_test.reset_index(drop = True,
                     inplace = True)
ss_test.reset_index(drop = True,
                     inplace = True)

le_X.reset_index(drop = True,
                     inplace = True)
ohe_X.reset_index(drop = True,
                     inplace = True)
ss_X.reset_index(drop = True,
                     inplace = True)

In [22]:
X_train = pd.concat([le_train, ohe_train, ss_train],
                    axis = 1)

X_test = pd.concat([le_test, ohe_test, ss_test],
                   axis = 1)

X = pd.concat([le_X, ohe_X, ss_X],
                   axis = 1)
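
The index resets before the concatenation matter because `pd.concat` with `axis = 1` aligns rows by index label, not by position. A minimal sketch with hypothetical toy frames (the `left` and `right` names are assumptions): one frame keeps the shuffled index from a train/test split, the other was rebuilt from an array and has a fresh 0..n-1 index.

```python
import pandas as pd

# Frame that kept the shuffled index labels from a split
left = pd.DataFrame({'inside_18': [1, 0, 1]},
                    index=[7, 2, 5])

# Frame rebuilt from an array, so it has a default 0..n-1 index
right = pd.DataFrame({'shot_distance': [0.5, -1.2, 0.7]})

# Without a reset, rows are matched by label and padded with NaN
misaligned = pd.concat([left, right], axis=1)

# After a reset, rows are matched by position
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape, aligned.shape)
```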

In [24]:
X_train.to_csv('/content/drive/MyDrive/flatiron/expected_goals/data_preprocessing/X_train.csv')
X_test.to_csv('/content/drive/MyDrive/flatiron/expected_goals/data_preprocessing/X_test.csv')
X.to_csv('/content/drive/MyDrive/flatiron/expected_goals/data_preprocessing/X.csv')

y_train.to_csv('/content/drive/MyDrive/flatiron/expected_goals/data_preprocessing/y_train.csv')
y_test.to_csv('/content/drive/MyDrive/flatiron/expected_goals/data_preprocessing/y_test.csv')
y.to_csv('/content/drive/MyDrive/flatiron/expected_goals/data_preprocessing/y.csv')
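
By default, `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. This is why the `iloc[: , 1:]` step at the top of this notebook drops the first column after loading. A minimal sketch of that round trip, using a hypothetical toy frame and an in-memory buffer in place of a Drive path:

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for a saved dataset
toy = pd.DataFrame({'goal': [False, True],
                    'time': [4, 11]})

# Write to an in-memory buffer; the index is written as an unnamed first column
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Reading the CSV back turns the saved index into an 'Unnamed: 0' column
reloaded = pd.read_csv(buffer)

# Dropping the first column restores the original columns
trimmed = reloaded.iloc[:, 1:]

print(list(reloaded.columns))
print(list(trimmed.columns))
```

Passing `index = False` to `to_csv` would avoid the extra column, but any notebook that already drops the first column on load would then need to be updated to match.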

Continued in [expected_goals_data_modeling_notebook](https://github.com/wswager/expected_goals/blob/main/data_modeling/expected_goals_data_modeling_notebook.ipynb)