# Problem
This is a binary classification problem: the target is whether a shot results in a goal or not.

# Where Can We Use This Model in the Real World?

We can use this model to analyze player performance, understand player habits on the field, decide which players to substitute during a game, or find out in advance, during transfer seasons, whether a new player will be compatible with the team.

# Imports

In [186]:
import pandas as pd
import numpy as np
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
import rpy2.robjects.packages as rpackages
from rpy2.robjects.conversion import localconverter
from imblearn.over_sampling import SMOTE
import warnings
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import confusion_matrix
from numpy import ravel
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
import h2o
from h2o.automl import H2OAutoML
import pickle

In [187]:
utils = rpackages.importr('utils')
utils.chooseCRANmirror(ind=1)

<rpy2.rinterface_lib.sexp.NULLType object at 0x1304db350> [0]

# Getting Dataset Over R from Python

In [188]:
pandas2ri.activate()
ro.r('''
        library("worldfootballR")
        laliga <- load_understat_league_shots(league = "La liga")
     ''')
laliga = pandas2ri.rpy2py(ro.r['laliga'])
laliga.drop('league', axis=1, inplace=True)
data=laliga[(laliga['date'] > '2020-06-10') & (laliga['date'] < '2024-06-10')]

→ Data last updated 2024-05-30 18:34:46.012307882309 UTC


We are going to use the La Liga shots dataset for dates between 2020-06-10 and 2024-06-10 (both bounds exclusive).

# Data Manipulations and Fixes

The data has NaN values and duplicate features (the same feature stored under two column names). We fix both problems by manipulating the data.

In [189]:
warnings.filterwarnings('ignore')

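def fixDataNaN(df):
    # Convert the frame to an R object; the result is not returned to the caller.
    with localconverter(ro.default_converter + pandas2ri.converter):
        df = ro.conversion.py2rpy(df)

# Pairs of duplicate columns: [column to drop, column to keep]
pairs = [['x','X'],['y','Y'],['x_g','xG'],['h_a','home_away'],['shot_type','shotType'],['last_action','lastAction']]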

def camel_case_columns(df):
    def camel_case(column_name):
        parts = column_name.split('_')
        return str(parts[0] + ''.join(x.title() for x in parts[1:]))
    
    new_columns = []
    for column in df.columns:
        if '_' in column:
            new_columns.append(camel_case(column))
        else:
            new_columns.append(str(column))
    
    df.columns = new_columns
    return df

def fixMergeColumns(dataList, pairs):
    for targetData in dataList:
        for pair in pairs:
            if pair[0] in targetData.columns and pair[1] in targetData.columns:
                # Fill gaps in the kept column from its duplicate, then drop the duplicate
                targetData[pair[1]] = targetData[pair[1]].fillna(targetData[pair[0]])
                targetData.drop(columns=[pair[0]], inplace=True)
        targetData = camel_case_columns(targetData)
        fixDataNaN(targetData)

fixMergeColumns([data], pairs)
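
The merge of duplicate columns can be illustrated on a toy frame. This is a minimal sketch, assuming two duplicate columns `x` and `X` that each hold part of the values; the numbers are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): 'x' and 'X' hold the same feature,
# each with gaps where the other one has data.
toy = pd.DataFrame({
    'x': [0.71, np.nan, 0.90],
    'X': [np.nan, 0.85, np.nan],
})

# Fill the gaps in the kept column from its duplicate, then drop the duplicate.
toy['X'] = toy['X'].fillna(toy['x'])
toy = toy.drop(columns=['x'])

print(toy['X'].tolist())
```

After the merge only `X` remains, and it has no missing values in this toy case.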

We are going to train our model to predict whether a shot ends in a goal or not, so we need to convert the shot outcomes to binary labels.

In [190]:
replacement_dict = {
    'Goal': 1,
    'BlockedShot': 0,
    'MissedShots': 0,
    'SavedShot': 0,
    'ShotOnPost': 0,
    'OwnGoal': 0
}

data['result'] = data['result'].map(replacement_dict)
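
One thing to keep in mind with a dictionary mapping is that `Series.map` returns NaN for any value that is missing from the dictionary. The sketch below shows a possible check for such unmapped outcomes; the label `UnknownOutcome` is made up for the example and is not a value from the dataset.

```python
import pandas as pd

replacement_dict = {'Goal': 1, 'BlockedShot': 0, 'MissedShots': 0,
                    'SavedShot': 0, 'ShotOnPost': 0, 'OwnGoal': 0}

# Toy outcomes; 'UnknownOutcome' is a made-up label used only to show the check.
outcomes = pd.Series(['Goal', 'SavedShot', 'UnknownOutcome'])

labels = outcomes.map(replacement_dict)   # unmapped values become NaN
unmapped = outcomes[labels.isna()].unique().tolist()
print(unmapped)
```

An empty list means every outcome was converted to 0 or 1.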

We convert the remaining string (categorical) features to numerical values with one-hot encoding so that the training works.

In [191]:
X = data.drop('result', axis=1)
Y = data['result']
X = pd.get_dummies(X)
data = pd.concat([pd.DataFrame(Y),pd.DataFrame(X)], axis=1)
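
To show what `pd.get_dummies` does to the features, here is a small sketch on a toy frame. The column values are illustrative placeholders rather than rows from the dataset.

```python
import pandas as pd

# Toy frame; the 'shotType' values here are illustrative placeholders.
toy = pd.DataFrame({
    'minute': [53, 87, 12],
    'shotType': ['RightFoot', 'Head', 'RightFoot'],
})

# Numeric columns pass through; each category becomes its own indicator column.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

Each distinct string value gets its own column, so a column with many distinct values, such as `player`, produces one indicator column per value.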

We shuffle the data and split it into training data (three quarters) and validation data (one quarter).

In [192]:
data_shuffled = data.sample(frac=1, random_state=42)
quarter_length = len(data_shuffled) // 4
df_half_val = data_shuffled.iloc[:quarter_length]
df_half_train = data_shuffled.iloc[quarter_length:]

data = df_half_train
validation_data = df_half_val

We split both the validation data and the training data into features (X) and target (Y).

In [193]:
Y_validation = validation_data['result']
x_validation = validation_data.drop('result', axis=1)

X = pd.DataFrame(data.drop('result', axis=1))
Y = pd.DataFrame(data['result'])

# Exploring Data

In [194]:
data.head(2)

| | result | id | minute | X | Y | xG | playerId | season | matchId | homeGoals | ... | lastAction_Save | lastAction_ShieldBallOpp | lastAction_Standard | lastAction_Start | lastAction_SubstitutionOn | lastAction_Tackle | lastAction_TakeOn | lastAction_Throughball | homeAway_a | homeAway_h |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 78415 | 0 | 519766.0 | 53.0 | 0.719 | 0.671 | 0.014868 | 6917.0 | 2022.0 | 19245.0 | 0.0 | ... | False | False | False | False | False | False | False | False | False | True |
| 62159 | 1 | 422675.0 | 87.0 | 0.9 | 0.666 | 0.097758 | 4146.0 | 2020.0 | 15144.0 | 1.0 | ... | False | False | False | False | False | False | False | False | True | False |


### Target Variable

Our target variable is "result", which represents whether a shot is a goal or not.

### Feature Variables

The feature variables are what the model learns from in order to predict the target variable.

In [195]:
print(data.drop('result',axis=1).columns.tolist())

['id', 'minute', 'X', 'Y', 'xG', 'playerId', 'season', 'matchId', 'homeGoals', 'awayGoals', 'player_Aarón Martín', 'player_Abdelkabir Abqar', 'player_Abderrahmane Rebbach', 'player_Abdessamad Ezzalzouli', 'player_Abdoulaye Diaby', 'player_Abdul Mumin', 'player_Abdón Prats', 'player_Abner', 'player_Adama Traoré', 'player_Adnan Januzaj', 'player_Adrià Alti', 'player_Adrià Pedrosa', 'player_Adrián', 'player_Adrián Embarba', 'player_Adrián Marín', 'player_Aiham Ousou', 'player_Aihen Muñoz', 'player_Aimar Oroz', 'player_Aingeru Olabarrieta', 'player_Aissa Mandi', 'player_Aitor Fernández', 'player_Aitor Paredes', 'player_Aitor Ruibal', 'player_Alberto Mari', 'player_Alberto Moleiro', 'player_Alberto Moreno', 'player_Alberto Perea', 'player_Alberto Risco', 'player_Alberto Rodríguez', 'player_Alberto Soro', 'player_Aleix García', 'player_Aleix Vidal', 'player_Alejandro Asensio', 'player_Alejandro Cantero', 'player_Alejandro Catena', 'player_Alejandro Gomez', 'player_Alejandro Pozo', 'player_Al

# Class Imbalance

Our data is imbalanced: goals are a small minority of the shots. This might cause the model to learn the majority class better than the minority class, which leads to inaccurate predictions for the minority class.

In [196]:
print(data['result'].value_counts())

result
0    25655
1     2980
Name: count, dtype: int64
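
The size of the gap can be worked out directly from the counts printed above. This short sketch only repeats that arithmetic.

```python
# Class counts as printed above for the training data.
counts = {0: 25655, 1: 2980}

total = sum(counts.values())
goal_share = counts[1] / total          # fraction of shots that are goals
ratio = counts[0] / counts[1]           # non-goals per goal

print(f"goal share: {goal_share:.3f}, non-goals per goal: {ratio:.1f}")
# goal share: 0.104, non-goals per goal: 8.6
```

With roughly one goal per 8.6 non-goals, a model that always predicts "no goal" would reach about 0.90 plain accuracy while having a balanced accuracy of only 0.5, which is why balanced accuracy is the metric used in this notebook.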


### SMOTE

We tried the SMOTE method for oversampling to close the gap between the minority and majority classes. SMOTE creates new minority observations by interpolating between an original minority observation and one of its nearest minority-class neighbors.

However, SMOTE did not work well on our data: it caused overfitting, because the synthetically generated minority observations did not represent our minority class well.

The block below is commented out, since SMOTE is not our primary method and skipping it saves processing time.

In [197]:
# sm = SMOTE(random_state=42,n_jobs=-1)
# x_res, y_res = sm.fit_resample(X, Y)

# data_res = pd.concat([pd.DataFrame(y_res), pd.DataFrame(x_res)], axis=1)
# print(y_res.value_counts())

# scaler = MinMaxScaler()

# x_res = scaler.fit_transform(x_res)

# X_validation = scaler.fit_transform(x_validation)

# Y_validation = validation_data['result']
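
To make the idea behind SMOTE more concrete, here is a simplified sketch of how one synthetic observation can be generated by interpolation. It is an illustration written for this notebook with a hypothetical helper `smote_like_sample` and toy points; it is not the `imblearn` implementation.

```python
import numpy as np

def smote_like_sample(minority, k=2, seed=0):
    """Create one synthetic point between a random minority sample and
    one of its k nearest minority-class neighbours (simplified sketch)."""
    rng = np.random.default_rng(seed)
    i = rng.integers(len(minority))
    base = minority[i]
    # Distances from the base point to every minority sample.
    dists = np.linalg.norm(minority - base, axis=1)
    # Skip position 0 of the sorted order: it is the base point itself.
    neighbours = np.argsort(dists)[1:k + 1]
    neighbour = minority[rng.choice(neighbours)]
    lam = rng.random()                      # interpolation weight in [0, 1)
    return base + lam * (neighbour - base)

# Toy minority-class points (made-up coordinates).
minority = np.array([[0.90, 0.50], [0.88, 0.45], [0.95, 0.55], [0.70, 0.30]])
synthetic = smote_like_sample(minority)
print(synthetic)
```

Every synthetic point lies on a line segment between two real minority points, so it cannot add information that the original minority observations do not already contain.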

### Undersampling

We used random undersampling to solve the imbalance problem: it randomly samples the majority class down to the size of the minority class.
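
The idea can be sketched in a few lines of NumPy. The helper `random_undersample` below is hypothetical and written only for illustration on toy labels; the notebook itself relies on `RandomUnderSampler` from `imblearn`.

```python
import numpy as np

def random_undersample(y, seed=42):
    """Return sorted row indices that keep every row of the smallest class
    and an equally sized random subset of each larger class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    kept = [rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
            for c in classes]
    return np.sort(np.concatenate(kept))

y = np.array([0] * 9 + [1] * 3)          # toy labels: 9 non-goals and 3 goals
idx = random_undersample(y)
print(np.bincount(y[idx]))               # [3 3]
```

The price of this balance is that most majority-class rows are discarded.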

In [198]:
rus = RandomUnderSampler(random_state=42)
X_rus, Y_rus = rus.fit_resample(X, Y)
x_rus = pd.DataFrame(X_rus)
y_rus = pd.DataFrame(Y_rus)

print(y_rus.value_counts())

scaler = MinMaxScaler()

x_rus = scaler.fit_transform(x_rus)

X_validation = scaler.fit_transform(x_validation)

Y_validation = validation_data['result']

result
0         2980
1         2980
Name: count, dtype: int64
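
Min-max scaling maps each feature to the range [0, 1] using that feature's minimum and maximum. A common variant computes these statistics on the training data only and reuses them for the validation data, so that both sets are scaled in the same way. The sketch below shows that variant on toy arrays; it differs from the cell above, which fits the scaler again on the validation data.

```python
import numpy as np

train = np.array([[10.0, 0.2], [20.0, 0.4], [30.0, 0.8]])   # toy training features
valid = np.array([[15.0, 0.5], [40.0, 0.1]])                # toy validation features

# Statistics come from the training data only.
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

train_scaled = (train - col_min) / col_range
valid_scaled = (valid - col_min) / col_range   # may fall outside [0, 1]

print(valid_scaled)
```

With scikit-learn, the same pattern would be `scaler.fit_transform(...)` on the training features followed by `scaler.transform(...)` on the validation features.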


# Model Comparison by Sampling Method

We tried both SMOTE and undersampling. SMOTE did not work effectively on our data and caused overfitting, so we chose undersampling.

In [199]:
x_train_rus, x_test_rus, y_train_rus, y_test_rus = train_test_split(x_rus, y_rus, test_size=0.25, random_state=42)
# x_train_res, x_test_res, y_train_res, y_test_res = train_test_split(x_res, y_res, test_size=0.25, random_state=42)

### Model with SMOTE

In [200]:
# from sklearn.ensemble import GradientBoostingClassifier


# model_res = GradientBoostingClassifier(random_state=42)
# model_res.fit(x_train_res, y_train_res)

# y_pred_res = model_res.predict(x_test_res)

# y_train_res = y_train_res.astype(y_pred_res.dtype)

5-Fold Cross Validation score:

In [201]:
# print(cross_val_score(model_res, x_train_res, y_train_res, cv=5, scoring='balanced_accuracy').mean())

### Model with Undersampling

In [202]:
from sklearn.ensemble import GradientBoostingClassifier


model_rus = GradientBoostingClassifier(random_state=42)
model_rus.fit(x_train_rus, y_train_rus)

y_pred_rus = model_rus.predict(x_test_rus)

y_train_rus = y_train_rus.astype(y_pred_rus.dtype)

5-Fold Cross Validation score:

In [203]:
print(cross_val_score(model_rus, x_train_rus, y_train_rus, cv=5, scoring='balanced_accuracy').mean())

0.7857051326801918
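
The score above is the mean over five folds, where each fold serves once as the test part while the model is trained on the other four. The sketch below shows how such folds can be formed with a hypothetical helper `k_fold_indices`. It uses plain contiguous folds for simplicity, whereas scikit-learn uses stratified folds for classifiers when `cv` is an integer.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) lists for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print(test_idx)
```

Every sample appears in exactly one test fold, so each observation is used for evaluation once and for training four times.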


# Checking Overfitting

### Model with SMOTE

With SMOTE, the model overfits because the synthetically generated minority observations do not represent the minority class well.

In [204]:
# y_validation_pred_res = model_res.predict(X_validation)

# print(confusion_matrix(Y_validation, y_validation_pred_res))

# print(balanced_accuracy_score(Y_validation, y_validation_pred_res))

### Model with Undersampling

Undersampling works well: the balanced accuracy score on the validation data is good, and there is no sign of overfitting, because the 5-fold cross-validation score on the training data (about 0.786) and the validation score (about 0.790) are close to each other. It also fixes the class imbalance.

In [205]:
y_validation_pred_rus = model_rus.predict(X_validation)

print(confusion_matrix(Y_validation, y_validation_pred_rus))

print(balanced_accuracy_score(Y_validation, y_validation_pred_rus))

[[7007 1519]
 [ 247  771]]
0.7896032337465845
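
Balanced accuracy is the mean of the per-class recalls, so it can be recomputed by hand from the confusion matrix printed above (rows are true classes, columns are predicted classes). The helper name `balanced_accuracy` is chosen only for this sketch.

```python
def balanced_accuracy(cm):
    """Mean of per-class recall; cm[i][j] = count of true class i predicted as j."""
    recalls = [row[i] / sum(row) for i, row in enumerate(cm)]
    return sum(recalls) / len(recalls)

cm = [[7007, 1519],
      [247, 771]]
print(round(balanced_accuracy(cm), 4))   # 0.7896
```

Here the recall is about 0.82 for non-goals (7007 of 8526) and about 0.76 for goals (771 of 1018), and their mean gives the reported score.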


# Best Model and Hyperparameter Tuning with AutoML

We are going to use H2O AutoML to find the best model and hyperparameters.

In [206]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321. connected.


| | |
|---|---|
| H2O_cluster_uptime: | 1 hour 35 mins |
| H2O_cluster_timezone: | Europe/Istanbul |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.46.0.2 |
| H2O_cluster_version_age: | 27 days |
| H2O_cluster_name: | H2O_from_python_sezaiufukoral_e8ahb6 |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 889 Mb |
| H2O_cluster_total_cores: | 8 |
| H2O_cluster_allowed_cores: | 8 |


In [208]:
new_data = pd.DataFrame(x_train_rus)
new_data['result'] = y_train_rus.values

In [209]:
hf = h2o.H2OFrame(new_data)

train_hf, test_hf = hf.split_frame(ratios=[0.75], seed = 1)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [211]:
aml = H2OAutoML(max_models=12,
                balance_classes=True,
                seed=1,
                max_runtime_secs=220,
                verbosity='none')

aml.train(training_frame = train_hf, y='result')

AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%


| key | value |
|---|---|
| Stacking strategy | blending |
| Number of base models (used / total) | 4/5 |
| # GBM base models (used / total) | 1/1 |
| # XGBoost base models (used / total) | 1/1 |
| # DRF base models (used / total) | 2/2 |
| # GLM base models (used / total) | 0/1 |
| Metalearner algorithm | GLM |
| Metalearner fold assignment scheme | AUTO |
| Metalearner nfolds | 0 |
| Metalearner fold_column | |


# Choosing the Final Model

We are going to test our models on new, unseen data and compare their balanced accuracy scores and confusion matrices to choose our final model.

In [212]:
y_validation_pred = model_rus.predict(X_validation)

print(confusion_matrix(Y_validation, y_validation_pred))

print(balanced_accuracy_score(Y_validation, y_validation_pred))

[[7007 1519]
 [ 247  771]]
0.7896032337465845


In [213]:
warnings.filterwarnings('ignore')

validation_x_hf = h2o.H2OFrame(pd.DataFrame(X_validation))
validation_y_hf = h2o.H2OFrame(pd.DataFrame(Y_validation))

validation_y_hf.columns = ['result']

preds_validation = aml.leader.predict(validation_x_hf)

y_validation_test_se = validation_y_hf['result'].as_data_frame().values.flatten()
y_validation_pred_se = preds_validation['predict'].round().as_data_frame().values.flatten()

balanced_acc = balanced_accuracy_score(y_validation_test_se, y_validation_pred_se)
print("Balanced Accuracy Score: ", balanced_acc)

print(confusion_matrix(y_validation_test_se, y_validation_pred_se))

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%
Balanced Accuracy Score:  0.7877924084748051
[[7085 1441]
 [ 260  758]]


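The Stacked Ensemble and the vanilla Gradient Boosting Machine show similar performance (balanced accuracy of about 0.788 and 0.790, respectively). However, according to domain knowledge, correctly identifying the shots that result in a goal is the most important information we want to obtain, and the Gradient Boosting model identifies slightly more of the goals in the validation data (771 versus 758 out of 1,018). For this reason, and for the sake of processing time, we choose the vanilla Gradient Boosting Machine as our final model.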
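
The comparison for the goal class can be read off the two confusion matrices printed above. The sketch below computes recall and precision for the goal class; the helper name `goal_metrics` is chosen for illustration.

```python
def goal_metrics(cm):
    """Recall and precision for the goal class (label 1) of a 2x2 matrix
    with rows = true class and columns = predicted class."""
    false_neg, true_pos = cm[1]
    false_pos = cm[0][1]
    recall = true_pos / (true_pos + false_neg)
    precision = true_pos / (true_pos + false_pos)
    return recall, precision

gbm = [[7007, 1519], [247, 771]]        # Gradient Boosting, from the output above
ensemble = [[7085, 1441], [260, 758]]   # Stacked Ensemble, from the output above

for name, cm in [('GBM', gbm), ('Stacked Ensemble', ensemble)]:
    recall, precision = goal_metrics(cm)
    print(f"{name}: recall={recall:.3f}, precision={precision:.3f}")
```

The Gradient Boosting model has the slightly higher recall for goals, while the Stacked Ensemble has the slightly higher precision, so the choice depends on which of the two errors matters more.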

In [215]:
filename = 'model.pkl'
# Use a context manager so the file is closed after the model is written
with open(filename, 'wb') as f:
    pickle.dump(model_rus, f)
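
A saved model can be loaded back with `pickle.load`. The sketch below shows the save and load round trip with a stand-in dictionary instead of the fitted model, and it writes to a temporary directory so that it does not touch `model.pkl`.

```python
import os
import pickle
import tempfile

# Stand-in object; any picklable object (such as a fitted model) works the same way.
model_stub = {'name': 'demo', 'threshold': 0.5}

path = os.path.join(tempfile.mkdtemp(), 'model.pkl')

with open(path, 'wb') as f:      # write in binary mode
    pickle.dump(model_stub, f)

with open(path, 'rb') as f:      # read back in binary mode
    restored = pickle.load(f)

print(restored == model_stub)    # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.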