# Problem
This is a binary classification problem: the target is whether a shot results in a goal or not.

# Where Can We Use This Model in the Real World?

We can use this model to analyze player performance, understand player habits on the field, decide which players to substitute during a game, or find out in advance, during transfer seasons, whether a new player will be compatible with the team.

# Imports

In [166]:
import pandas as pd
import numpy as np
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
import rpy2.robjects.packages as rpackages
from rpy2.robjects.conversion import localconverter
from imblearn.over_sampling import SMOTE
from collections import Counter
import warnings
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV
import h2o
from h2o.automl import H2OAutoML
import pickle

In [167]:
utils = rpackages.importr('utils')
utils.chooseCRANmirror(ind=1)

<rpy2.rinterface_lib.sexp.NULLType object at 0x13faad3d0> [0]

# Getting Dataset Over R from Python

In [168]:
pandas2ri.activate()
ro.r('''
        library("worldfootballR")
        laliga <- load_understat_league_shots(league = "La liga")
     ''')
laliga = pandas2ri.rpy2py(ro.r['laliga'])
laliga.drop('league', axis=1, inplace=True)

→ Data last updated 2024-05-30 18:34:46.012307882309 UTC


We are going to use the La Liga dataset between 2023-01-01 and 2024-06-10.

We are going to validate our model's performance with data from the same league over a different date range (2020-01-01 to 2020-01-10).

In [169]:
# Take explicit copies so that later in-place edits do not act on views of `laliga`
data = laliga[(laliga['date'] > '2023-01-01') & (laliga['date'] < '2024-06-10')].copy()
validation_data = laliga[(laliga['date'] >= '2020-01-01') & (laliga['date'] <= '2020-01-10')].copy()

# Exploring Data

In [170]:
data.head(2)

Unnamed: 0,id,minute,result,X,Y,xG,player,h_a,player_id,situation,...,shotType,match_id,home_team,away_team,home_goals,away_goals,date,player_assisted,lastAction,home_away
75128,502616.0,12.0,SavedShot,0.893,0.693,0.300541,Roger,,2566.0,OpenPlay,...,LeftFoot,19118.0,Elche,Celta Vigo,0.0,1.0,2023-01-06 17:30:00,Lucas Boyé,Pass,h
75129,502618.0,18.0,BlockedShot,0.798,0.503,0.045027,Pere Milla,,4175.0,OpenPlay,...,LeftFoot,19118.0,Elche,Celta Vigo,0.0,1.0,2023-01-06 17:30:00,Roger,Pass,h


### Target Variable

Our target variable is "result". It records the outcome of each shot (for example Goal, SavedShot or BlockedShot), from which we derive whether the shot is a goal or not.

### Feature Variables

The remaining columns are our feature variables, which the model learns from to predict the target variable.

In [171]:
print(data.drop('result',axis=1).columns.tolist())

['id', 'minute', 'X', 'Y', 'xG', 'player', 'h_a', 'player_id', 'situation', 'season', 'shotType', 'match_id', 'home_team', 'away_team', 'home_goals', 'away_goals', 'date', 'player_assisted', 'lastAction', 'home_away']


# Data Manipulations and Fixes

The data contains NaN values and duplicate features (the same information stored under two column names). We are going to fix these problems by manipulating the data.

In [172]:
warnings.filterwarnings('ignore')

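def fixDataNaN(df):
    # Convert the pandas DataFrame to an R data.frame. The converted object is
    # not returned, so the caller's DataFrame is left unchanged by this call.
    with localconverter(ro.default_converter + pandas2ri.converter):
        df = ro.conversion.py2rpy(df)

# Pairs of [duplicate column, column to keep]; the duplicate fills gaps in the kept column
pairs = [['x','X'],['y','Y'],['x_g','xG'],['h_a','home_away'],['shot_type','shotType'],['last_action','lastAction']]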

def camel_case_columns(df):
    def camel_case(column_name):
        parts = column_name.split('_')
        return str(parts[0] + ''.join(x.title() for x in parts[1:]))
    
    new_columns = []
    for column in df.columns:
        if '_' in column:
            new_columns.append(camel_case(column))
        else:
            new_columns.append(str(column))
    
    df.columns = new_columns
    return df

def fixMergeColumns(dataList, pairs):
    for targetData in dataList:
        for pair in pairs:
            if pair[0] in targetData.columns and pair[1] in targetData.columns:
                # Fill gaps in the kept column from its duplicate, then drop the duplicate
                targetData[pair[1]] = targetData[pair[1]].fillna(targetData[pair[0]])
                targetData.drop(columns=[pair[0]], inplace=True)
        # Rename the remaining snake_case columns to camelCase (modifies the frame in place)
        targetData = camel_case_columns(targetData)
        fixDataNaN(targetData)

fixMergeColumns([data,validation_data], pairs)

We are going to train our model to predict whether a shot ends in a goal or not, so we need to convert the shot outcomes into two labels: "Goal" and "No Goal".

In [173]:
replacement_dict = {
    'Goal': 'Goal',
    'BlockedShot': 'No Goal',
    'MissedShots': 'No Goal',
    'SavedShot': 'No Goal',
    'ShotOnPost': 'No Goal',
    'OwnGoal': 'No Goal'
}

data['result'] = data['result'].map(replacement_dict)
validation_data['result'] = validation_data['result'].map(replacement_dict)

# Class Imbalance

Our data is imbalanced: there are far more "No Goal" shots than "Goal" shots. This might cause our model to learn the majority class better than the minority class, which can lead to inaccurate predictions for goals.

In [174]:
print(data['result'].value_counts())

result
No Goal    13477
Goal        1531
Name: count, dtype: int64
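
To make the effect of this imbalance concrete, the short sketch below uses the class counts printed above to score a hypothetical model that always predicts "No Goal". The variable names are illustrative and are not used elsewhere in the notebook.

```python
# Class counts taken from the value_counts() output above
counts = {"No Goal": 13477, "Goal": 1531}
total = sum(counts.values())

# Share of shots that are goals
goal_share = counts["Goal"] / total

# A hypothetical model that always predicts the majority class "No Goal"
majority_accuracy = counts["No Goal"] / total

# Balanced accuracy is the mean of per-class recall:
# recall is 1.0 for "No Goal" and 0.0 for "Goal" under this model
majority_balanced_accuracy = (1.0 + 0.0) / 2

print(f"Goal share: {goal_share:.3f}")
print(f"Majority-class accuracy: {majority_accuracy:.3f}")
print(f"Majority-class balanced accuracy: {majority_balanced_accuracy:.3f}")
```

Plain accuracy looks high for this useless model, while balanced accuracy does not, which is why balanced accuracy is reported alongside accuracy later on.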


We are going to use the SMOTE method for oversampling to close the gap between the minority and majority classes. This method creates new minority-class observations by interpolating between an original minority observation and one of its nearest minority-class neighbours.
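
The core step of SMOTE, building one synthetic point on the line segment between a minority observation and one of its nearest minority neighbours, can be sketched without any library. This is a simplified illustration, not the imbalanced-learn implementation; `smote_like_sample` and the toy points are hypothetical.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point from a list of minority-class feature vectors."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)

    # Squared Euclidean distance from the base point to another point
    def dist2(point):
        return sum((a - b) ** 2 for a, b in zip(base, point))

    # The k nearest minority neighbours of the base point (excluding itself)
    neighbours = sorted((p for p in minority if p is not base), key=dist2)[:k]
    neighbour = rng.choice(neighbours)

    # Interpolate at a random position between the base point and the neighbour
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

# Toy minority class: the four corners of the unit square
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_like_sample(minority, k=2, rng=random.Random(42))
print(synthetic)
```

Because every synthetic point lies between two real minority points, the new observations stay inside the region already occupied by the minority class.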

In [175]:
Y = data['result']
x = data.drop('result', axis=1)

X = pd.get_dummies(x)
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

print('Original dataset shape %s' % Counter(Y))
sm = SMOTE(random_state=42,n_jobs=-1)
x_res, y_res = sm.fit_resample(X, Y)
print('Resampled dataset shape %s' % Counter(y_res))

Original dataset shape Counter({'No Goal': 13477, 'Goal': 1531})
Resampled dataset shape Counter({'No Goal': 13477, 'Goal': 13477})


In [176]:
Y_validation = validation_data['result']
x_validation = validation_data.drop('result', axis=1)

X_validation = pd.get_dummies(x_validation)
X_validation = scaler.fit_transform(X_validation)

# Model

In [177]:
x_train, x_test, y_train, y_test = train_test_split(x_res, y_res, test_size=0.25, random_state=42, stratify=y_res)

After comparing Decision Tree, Logistic Regression, XGBoost and Random Forest, we found Random Forest to be the best model to use.

In [178]:
model = RandomForestClassifier()
model.fit(x_train, y_train)

y_pred = model.predict(x_test)

5-Fold Cross Validation score:

In [179]:
print(cross_val_score(model, x_train, y_train, cv=5, scoring='accuracy').mean())

0.9562206282463517


### Train - Test Accuracy Comparison to Check Overfitting

Our model seems to have learned the train set perfectly (train accuracy is 1.0). We would suspect overfitting much more strongly if the test scores were bad, but accuracy and balanced accuracy on the test set (about 0.96) are close to the train scores. Our model is good to go, but we are going to check whether hyperparameter tuning or AutoML can improve our performance.

In [180]:
test_accuracy = accuracy_score(y_test, y_pred)
test_balanced_accuracy = balanced_accuracy_score(y_test, y_pred)
test_confusion_matrix = confusion_matrix(y_test, y_pred)

train_accuracy = accuracy_score(y_train, model.predict(x_train))
train_balanced_accuracy = balanced_accuracy_score(y_train, model.predict(x_train))
train_confusion_matrix = confusion_matrix(y_train, model.predict(x_train))

In [181]:
print('Test Accuracy: {}\nTrain Accuracy: {}\n'.format(test_accuracy, train_accuracy))
print('Test Balanced Accuracy: {}\nTrain Balanced Accuracy: {}\n'.format(test_balanced_accuracy, train_balanced_accuracy))
print('Test Confusion Matrix: \n{}\n\nTrain Confusion Matrix: \n{}'.format(test_confusion_matrix, train_confusion_matrix))

Test Accuracy: 0.9615669980709304
Train Accuracy: 1.0

Test Balanced Accuracy: 0.9615675036750684
Train Balanced Accuracy: 1.0

Test Confusion Matrix: 
[[3229  141]
 [ 118 3251]]

Train Confusion Matrix: 
[[10107     0]
 [    0 10108]]
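
As a check on how these numbers relate, the sketch below recomputes accuracy and balanced accuracy by hand from the test confusion matrix printed above. It assumes the scikit-learn convention that rows are true classes and columns are predicted classes; the helper name is hypothetical.

```python
def balanced_accuracy_from_confusion(matrix):
    """Mean of per-class recall; rows are true classes, columns are predictions."""
    recalls = [row[i] / sum(row) for i, row in enumerate(matrix)]
    return sum(recalls) / len(recalls)

# Test confusion matrix copied from the output above
test_matrix = [[3229, 141],
               [118, 3251]]

correct = sum(test_matrix[i][i] for i in range(len(test_matrix)))
total = sum(sum(row) for row in test_matrix)

accuracy = correct / total
balanced_accuracy = balanced_accuracy_from_confusion(test_matrix)

print("Accuracy:", accuracy)
print("Balanced accuracy:", balanced_accuracy)
```

The two scores are almost identical here because the resampled test set contains nearly the same number of observations in each class.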


### Hyperparameter Tuning

We are going to use the Halving Random Search method and 5-Fold Cross Validation together to find the best hyperparameters, evaluating our iterations by balanced accuracy scores.

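The Halving Random Search method is based on successive elimination: many randomly sampled hyperparameter sets are first evaluated with a small amount of resources (here, training samples), only the best-scoring fraction (one in three with factor=3) advances to the next round with more resources, and this repeats until the best candidates remain. Because weak candidates are discarded early, it is usually faster than a plain Random Search over the same number of candidates.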
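
The elimination rounds can be sketched as a simple schedule. This is a simplified illustration of successive halving, not scikit-learn's exact bookkeeping; `halving_schedule` and the starting values (27 candidates, 10 samples) are assumptions chosen for the example.

```python
import math

def halving_schedule(n_candidates, factor, min_resources, max_resources):
    """Return (candidates, resources per candidate) for each elimination round."""
    schedule = []
    resources = min_resources
    while resources <= max_resources:
        schedule.append((n_candidates, resources))
        if n_candidates == 1:
            break
        # Keep roughly the best 1/factor of the candidates...
        n_candidates = math.ceil(n_candidates / factor)
        # ...and give each survivor `factor` times more resources
        resources *= factor
    return schedule

for round_index, (candidates, resources) in enumerate(
    halving_schedule(n_candidates=27, factor=3, min_resources=10, max_resources=1000)
):
    print(f"Round {round_index}: {candidates} candidates, {resources} samples each")
```

Most candidates are only ever trained on a small sample, which is where the time saving over a plain random search comes from.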

In [182]:
param_grid = {
    'n_estimators': [int(x) for x in np.linspace(start=50, stop=5000, num=10)],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [int(x) for x in np.linspace(2, 50)],
    'min_samples_split': [int(x) for x in np.linspace(2, 5)],
    'min_samples_leaf': [int(x) for x in np.linspace(2, 5)],
    'bootstrap': [True, False],
}

halving = HalvingRandomSearchCV(model, param_grid, factor=3, resource='n_samples', max_resources=1000, random_state=42, verbose=0,scoring='balanced_accuracy', n_jobs=-1)
halving.fit(x_train, y_train)
print("Best Params:{}\nBest Balanced Accuracy:{}".format(halving.best_params_, halving.best_score_))

KeyboardInterrupt: 

We did not get what we wanted from hyperparameter tuning: our balanced accuracy decreased, and spending more time on random search might not be worthwhile, so let's see how AutoML performs.

In [None]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321. connected.


0,1
H2O_cluster_uptime:,2 hours 36 mins
H2O_cluster_timezone:,Europe/Istanbul
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.46.0.2
H2O_cluster_version_age:,27 days
H2O_cluster_name:,H2O_from_python_sezaiufukoral_x1b2rc
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,943 Mb
H2O_cluster_total_cores:,8
H2O_cluster_allowed_cores:,8


In [None]:
hf = h2o.H2OFrame(pd.concat([pd.DataFrame(x_res), pd.DataFrame(y_res)], axis=1))

train_hf, test_hf = hf.split_frame(ratios=[0.75], seed = 1)

### AutoML

We are going to use the H2O library because of its ease of use. We are going to set the maximum runtime to 300 seconds so our process won't consume too much time. AutoML uses 5-fold cross-validation by default and ranks the candidate models on a leaderboard of cross-validated metrics (such as AUC, log loss and mean per-class error) to choose the best model for our data.

In [None]:
aml = H2OAutoML(max_models=20,
                balance_classes=True,
                seed=1,
                max_runtime_secs=300,
                verbosity='none')

aml.train(training_frame = train_hf, y = 'result')

AutoML found the best model to be a Gradient Boosting Machine (GBM). This algorithm builds an ensemble of weak learners, typically decision trees, in a sequential manner. Each new model attempts to correct the errors of the previous models.

Our new model's performance is almost the same as the vanilla Random Forest. However, AutoML checked more metrics to validate that this is the right model for our data. Both models are black-box models, so the choice between them does not affect interpretability. We can place more trust in the generalizability of the model that AutoML found, because it checked more metrics when building the model. So, we are good to go now.
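
The sequential error-correcting idea can be sketched in its simplest form: squared-error regression on one feature with decision stumps as weak learners. This is an illustration only, not H2O's GBM, which handles classification and many features; `fit_stump`, `boost` and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split tree on a 1-D feature by minimising squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Return the training mean squared error after each boosting round."""
    predictions = [sum(ys) / len(ys)] * len(xs)  # start from the mean
    errors = []
    for _ in range(n_rounds):
        # Each new stump is fitted to the errors left by the current ensemble
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return errors

xs = list(range(10))
ys = [0, 0, 0, 1, 1, 1, 3, 3, 3, 3]
errors = boost(xs, ys)
print("MSE after first round:", errors[0])
print("MSE after last round:", errors[-1])
```

Each round reduces the training error a little further, which is the sense in which later models correct the mistakes of earlier ones.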

In [None]:
lb = aml.leaderboard
lb.head(rows=lb.nrows)

In [None]:
warnings.filterwarnings('ignore')
preds = aml.leader.predict(test_hf)

y_test_gbm = test_hf['result'].as_data_frame().values.flatten()
y_pred_gbm = preds['predict'].as_data_frame().values.flatten()

balanced_acc = balanced_accuracy_score(y_test_gbm, y_pred_gbm)
print("Balanced Accuracy Score: ", balanced_acc)

AttributeError: 'NoneType' object has no attribute 'predict'

# Validation of the Model

We are going to test our model on new, unseen data.

In [None]:
warnings.filterwarnings('ignore')

validation_x_hf = h2o.H2OFrame(pd.DataFrame(X_validation))
validation_y_hf = h2o.H2OFrame(pd.DataFrame(Y_validation))

                               
preds_validation = aml.leader.predict(validation_x_hf)

y_validation_test_gbm = validation_y_hf['result'].as_data_frame().values.flatten()
y_validation_pred_gbm = preds_validation['predict'].as_data_frame().values.flatten()

balanced_acc = balanced_accuracy_score(y_validation_test_gbm, y_validation_pred_gbm)
print("Balanced Accuracy Score: ", balanced_acc)

print(confusion_matrix(y_validation_test_gbm, y_validation_pred_gbm))

Balanced Accuracy Score:  0.6909547738693467
[[192   7]
 [116  83]]


In [183]:
y_pred = model.predict(X_validation)

print(confusion_matrix(Y_validation, y_pred))

ValueError: X has 293 features, but RandomForestClassifier is expecting 1882 features as input.
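
This error arises because one-hot encoding the validation data on its own produces a different set of columns (293) than the training data (1882), since the two date ranges contain different players, teams and dates. One common remedy is to align the validation columns to the training columns and to reuse the scaler fitted on the training data with `transform` rather than `fit_transform`. The library-free sketch below shows only the alignment idea on toy rows; all names are hypothetical, and the training column names would have to be captured before the scaler turns the frame into an array.

```python
def one_hot(rows):
    """Encode each row (a dict of categorical values) as {'<column>_<value>': 1}."""
    return [{f"{col}_{val}": 1 for col, val in row.items()} for row in rows]

def align(encoded_rows, columns):
    """Order each row like `columns`; missing columns become 0, extra ones are dropped."""
    return [[row.get(col, 0) for col in columns] for row in encoded_rows]

# Toy data: the validation set has a category the training set never saw
train_rows = [{"situation": "OpenPlay"}, {"situation": "Penalty"}, {"situation": "SetPiece"}]
validation_rows = [{"situation": "OpenPlay"}, {"situation": "FromCorner"}]

train_encoded = one_hot(train_rows)
train_columns = sorted({col for row in train_encoded for col in row})

X_train = align(train_encoded, train_columns)
X_valid = align(one_hot(validation_rows), train_columns)

# pandas equivalent of the alignment step:
#   pd.get_dummies(x_validation).reindex(columns=train_columns, fill_value=0)
print(train_columns)
print(X_valid)
```

After alignment, both matrices have the same number of columns in the same order, which is what the fitted model expects at prediction time.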