# EPL Result Classification

For this project, I will try to predict the full-time result (home win, draw, or away win) of English Premier League matches.

The data comes from a Kaggle dataset (https://www.kaggle.com/irkaal/english-premier-league-results), and I will use the "results.csv" file.

In [1]:
import pandas as pd
import numpy as np

from   sklearn.compose         import ColumnTransformer
from   sklearn.impute          import SimpleImputer
from   sklearn.preprocessing   import LabelEncoder, OneHotEncoder, StandardScaler
from   sklearn.ensemble        import RandomForestClassifier
from   sklearn.neighbors       import KNeighborsClassifier
from   sklearn.pipeline        import Pipeline
from   sklearn.model_selection import train_test_split
from   sklearn.model_selection import RandomizedSearchCV
from   sklearn.metrics         import f1_score
from   sklearn.metrics         import confusion_matrix

import warnings
warnings.filterwarnings('ignore')

## Load and inspect the data

In [2]:
data = pd.read_csv('results.csv', parse_dates=['DateTime'])

In [3]:
data.head()

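| Unnamed: 0 | Season | DateTime | HomeTeam | AwayTeam | FTHG | FTAG | FTR | HTHG | HTAG | HTR | ... | HST | AST | HC | AC | HF | AF | HY | AY | HR | AR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1993-94 | 1993-08-14 00:00:00+00:00 | Arsenal | Coventry | 0 | 3 | A | | | | ... | | | | | | | | | | |
| 1 | 1993-94 | 1993-08-14 00:00:00+00:00 | Aston Villa | QPR | 4 | 1 | H | | | | ... | | | | | | | | | | |
| 2 | 1993-94 | 1993-08-14 00:00:00+00:00 | Chelsea | Blackburn | 1 | 2 | A | | | | ... | | | | | | | | | | |
| 3 | 1993-94 | 1993-08-14 00:00:00+00:00 | Liverpool | Sheffield Weds | 2 | 0 | H | | | | ... | | | | | | | | | | |
| 4 | 1993-94 | 1993-08-14 00:00:00+00:00 | Man City | Leeds | 1 | 1 | D | | | | ... | | | | | | | | | | |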


In [4]:
# The 1993-2000 seasons contain only results (no match statistics),
# so keep only the rows from the 2000-01 season onward
data = data.iloc[2824:].reset_index(drop=True)
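The offset 2824 is tied to the row order of this particular CSV. As a minimal sketch of an alternative, assuming the match-statistics columns (here `HS`, home shots, as a stand-in) are empty for the early seasons and the rows are sorted by date, the cut-off can be found from the data itself. The toy frame below is invented for illustration and is not the Kaggle data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for results.csv (values are invented for illustration).
toy = pd.DataFrame({
    "Season": ["1993-94", "1999-00", "2000-01", "2000-01", "2001-02"],
    "FTR":    ["A", "H", "D", "H", "A"],
    "HS":     [np.nan, np.nan, 14.0, 9.0, 11.0],  # shots: missing in early seasons
})

# Index label of the first row that has a shot count.
first_with_stats = toy["HS"].first_valid_index()

# Keep everything from that row on, mirroring data.iloc[2824:].reset_index(drop=True).
recent = toy.loc[first_with_stats:].reset_index(drop=True)
print(recent)
```

This only works if statistics never go missing again after they first appear; filtering on the `Season` column would be another option.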

In [5]:
# Extract month from DateTime column
data['Month'] = data['DateTime'].dt.month

In [6]:
# set the target
y = data[['FTR']].rename(columns={'FTR':'label'})

In [7]:
# Drop unnecessary columns
X = data.drop(['Season', 'DateTime', 'FTR', 'FTHG', 'FTAG'], axis=1)

In [8]:
# Encode the target
le = LabelEncoder()
le.fit(y['label'])
y = le.transform(y['label'])
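To read the encoded target later (for example in a confusion matrix), it helps to know which integer stands for which result. `LabelEncoder` sorts the labels, so the sketch below, using a separate demo encoder and a made-up list of results, shows the mapping that the `A`/`D`/`H` values of `FTR` should produce.

```python
from sklearn.preprocessing import LabelEncoder

# Separate demo encoder so the notebook's `le` is left untouched.
le_demo = LabelEncoder()
codes = le_demo.fit_transform(["H", "A", "D", "H"])

# classes_ is sorted, so position i in classes_ is the label encoded as i.
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)          # {'A': 0, 'D': 1, 'H': 2}
print(codes.tolist())   # [2, 0, 1, 2]
```

In the notebook itself, `le.classes_` gives the same ordering, and `le.inverse_transform` turns predictions back into letters.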

## Fit scikit-learn model

In [9]:
categorical = ['HomeTeam', 'AwayTeam', 'HTR', 'Month', 'Referee']
# Note: scikit-learn 1.2 renamed OneHotEncoder's `sparse` argument to `sparse_output`
cat_pipe = Pipeline(steps = [('imputer', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
                             ('encoder', OneHotEncoder(handle_unknown='ignore', sparse=False))])

numerical = ['HTHG','HTAG','HS','AS','HST','AST','HC','AC','HF','AF','HY','AY','HR','AR']
num_pipe = Pipeline(steps = [('imputer', SimpleImputer(missing_values=np.nan, strategy='mean')),
                             ('scaler',  StandardScaler(with_mean=False))])

preprocessor = ColumnTransformer(transformers=[('cat', cat_pipe, categorical),
                                               ('num', num_pipe, numerical)],
                                 remainder='passthrough')
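Because `remainder='passthrough'` forwards every column that is not listed in `categorical` or `numerical`, it is worth checking what that leaves over. The helper below is a hypothetical name introduced for this sketch, and the column list is an assumption written out by hand (the `head()` output elides some columns); in the notebook, `list(X.columns)` would be the real input.

```python
def passthrough_columns(all_columns, *used_groups):
    """Return the columns that no transformer claims, in their original order."""
    used = {name for group in used_groups for name in group}
    return [name for name in all_columns if name not in used]


categorical = ['HomeTeam', 'AwayTeam', 'HTR', 'Month', 'Referee']
numerical = ['HTHG', 'HTAG', 'HS', 'AS', 'HST', 'AST', 'HC', 'AC',
             'HF', 'AF', 'HY', 'AY', 'HR', 'AR']

# Assumed column list for X; replace with list(X.columns) in the notebook.
assumed_columns = ['Unnamed: 0'] + categorical + numerical

leftover = passthrough_columns(assumed_columns, categorical, numerical)
print(leftover)
```

If the saved row index `Unnamed: 0` does turn out to be left over, it reaches the models as an unscaled numeric feature. Dropping it in the `X = data.drop(...)` step, or using `remainder='drop'`, would avoid that.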

In [10]:
# split train set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
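The split above is random and unseeded, so the scores further down change from run to run. A minimal sketch of a reproducible, stratified variant on invented toy data (the 30/20/50 class mix is an assumption for illustration, not the real distribution):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with labels 0/1/2 in a 30/20/50 mix.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 30 + [1] * 20 + [2] * 50)

# random_state fixes the shuffle; stratify keeps the class mix in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print(np.bincount(y_tr), np.bincount(y_te))
```

Both parts keep the same class proportions as the full toy set, and rerunning the cell gives the same rows each time.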

### KNeighborsClassifier

In [53]:
pipe = Pipeline(steps = [('pre', preprocessor),
                         ('kn', KNeighborsClassifier())])

In [54]:
hyperparameters = dict(kn__algorithm        = ['auto', 'ball_tree', 'kd_tree', 'brute'],
                       kn__weights          = ['uniform', 'distance'],
                       kn__leaf_size        = range(1, 50),
                       kn__n_neighbors      = range(1, 100))

kn_rand_cv = RandomizedSearchCV(estimator=pipe, 
                                 param_distributions=hyperparameters, 
                                 n_iter=25,
                                 cv=5, 
                                 n_jobs=-1,
                                 verbose=False)

In [55]:
kn_rand_cv.fit(X_train, y_train)

RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('pre',
                                              ColumnTransformer(remainder='passthrough',
                                                                transformers=[('cat',
                                                                               Pipeline(steps=[('imputer',
                                                                                                SimpleImputer(strategy='most_frequent')),
                                                                                               ('encoder',
                                                                                                OneHotEncoder(handle_unknown='ignore',
                                                                                                              sparse=False))]),
                                                                               ['HomeTeam',
                                        

In [56]:
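# RandomizedSearchCV already refits best_estimator_ on X_train (refit=True by default),
# so this extra fit is redundant but harmless
kn_rand_cv.best_estimator_.fit(X_train, y_train)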
y_pred   = kn_rand_cv.best_estimator_.predict(X_test)
f1_test  = f1_score(y_test, y_pred, average='micro')

In [57]:
round(f1_test,3)

0.632
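A note on the metric: for single-label multiclass predictions, micro-averaged F1 is the same number as accuracy, so the score above can be read as the share of matches predicted correctly. The sketch below checks this on a small made-up example and shows macro-averaged F1 for contrast, which weights each class equally.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels for illustration (0 = A, 1 = D, 2 = H).
y_true_demo = [0, 1, 2, 2, 1, 0, 2, 2]
y_pred_demo = [0, 2, 2, 2, 1, 1, 2, 0]

micro = f1_score(y_true_demo, y_pred_demo, average='micro')
macro = f1_score(y_true_demo, y_pred_demo, average='macro')
acc = accuracy_score(y_true_demo, y_pred_demo)

print(f"micro F1 = {micro:.3f}, accuracy = {acc:.3f}, macro F1 = {macro:.3f}")
```

Macro-averaged F1 is the more sensitive choice when one class, such as draws, is predicted noticeably worse than the others.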

### RandomForestClassifier

In [45]:
pipe = Pipeline(steps = [('pre', preprocessor),
                         ('rf', RandomForestClassifier())])

In [46]:
hyperparameters = dict(rf__max_depth        = range(1, 100),
                       rf__criterion        = ['gini', 'entropy'],
                       rf__min_samples_leaf = range(1, 15),
                       rf__n_estimators     = [20, 50, 80, 100, 200])

rf_rand_cv = RandomizedSearchCV(estimator=pipe, 
                                 param_distributions=hyperparameters, 
                                 n_iter=25,
                                 cv=5, 
                                 n_jobs=-1,
                                 verbose=False)

In [47]:
rf_rand_cv.fit(X_train, y_train)

RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('pre',
                                              ColumnTransformer(remainder='passthrough',
                                                                transformers=[('cat',
                                                                               Pipeline(steps=[('imputer',
                                                                                                SimpleImputer(strategy='most_frequent')),
                                                                                               ('encoder',
                                                                                                OneHotEncoder(handle_unknown='ignore',
                                                                                                              sparse=False))]),
                                                                               ['HomeTeam',
                                        

In [49]:
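# best_estimator_ is already refit on X_train by the search (refit=True by default);
# fitting again retrains the forest with a fresh random seed
rf_rand_cv.best_estimator_.fit(X_train, y_train)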
y_pred   = rf_rand_cv.best_estimator_.predict(X_test)
f1_test  = f1_score(y_test, y_pred, average='micro')

In [51]:
round(f1_test,3)

0.657

RandomForestClassifier achieves a higher micro-averaged F1 score (0.657) than KNeighborsClassifier (0.632).

## Result

RandomForestClassifier predicts better than KNeighborsClassifier on the test set (micro-averaged F1 of 0.657 versus 0.632). One reason, I think, is that the data contains a large number of categorical values, which become many one-hot columns, and these seem to be among the most important features in the model. The search selected max_depth=93 for the forest, which is high, so the trees are deep enough to split on many of those columns, whereas KNeighborsClassifier has to measure distances across all of them at once.

The hyperparameters used are summarized below. The random forest search chose n_estimators=200, the largest value I offered. This is expected: with more trees, the forest averages over more bootstrap samples of the training set, which generally reduces variance. KNeighborsClassifier chose n_neighbors=93, which is high considering that the default is 5. I think this is also a consequence of the many categorical features.

Overall, I expected RandomForestClassifier to work better than KNeighborsClassifier, and in this exercise it did. However, a score of 0.657 is not good for this dataset; because the F1 score is micro-averaged over single-label predictions, it is the same number as accuracy. In my opinion, the features do not determine the result strongly enough. Class imbalance is a common reason for a low F1 score, but I do not think it is the main problem here: the classes are only moderately skewed (the test set in the confusion matrix below has 439 away wins, 378 draws, and 750 home wins). The confusion matrix also shows that the draw class has relatively many false positives and false negatives compared with home wins and away wins. Predicting a draw is harder, and that lowers the accuracy of the model.

In [58]:
# hyperparameters used
print(kn_rand_cv.best_estimator_.get_params()['kn'])
print(rf_rand_cv.best_estimator_.get_params()['rf'])

KNeighborsClassifier(algorithm='brute', leaf_size=19, n_neighbors=93,
                     weights='distance')
RandomForestClassifier(max_depth=93, min_samples_leaf=2, n_estimators=200)


In [95]:
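# Rows are true classes and columns are predicted classes, in LabelEncoder order (A, D, H).
# y_pred holds the predictions of whichever model cell ran most recently.
confusion_matrix(y_test, y_pred)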

array([[299,  96,  44],
       [111, 158, 109],
       [ 59, 145, 546]])
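To put numbers on how much harder draws are, per-class precision and recall can be computed directly from the matrix above. This is a small sketch with NumPy only; the class order A, D, H follows from `LabelEncoder` sorting the labels, and the descriptive names in `labels` are added here for readability.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true, columns = predicted).
cm = np.array([[299,  96,  44],
               [111, 158, 109],
               [ 59, 145, 546]])
labels = ['A (away win)', 'D (draw)', 'H (home win)']

true_pos = np.diag(cm)
recall = true_pos / cm.sum(axis=1)     # correct / all matches that truly had this result
precision = true_pos / cm.sum(axis=0)  # correct / all matches predicted as this result
accuracy = true_pos.sum() / cm.sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name:13s} precision={p:.3f} recall={r:.3f}")
print(f"accuracy={accuracy:.3f}")
```

Draws come out at roughly 0.40 precision and 0.42 recall, well below the other two classes. The overall accuracy of this matrix is about 0.640, which matches neither reported score exactly, so `y_pred` here presumably came from a separate run.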