### Selected Dataset 

**Dataset** https://www.kaggle.com/taranguyen/form-in-bundesliga-la-liga-mls-premier-league

**Context**

These datasets were created as part of a class project on Soccer Competitiveness, which aimed to compare the competitiveness of four different soccer leagues: the Bundesliga (in Germany), La Liga (in Spain), the Premier League (in the U.K.), and Major League Soccer (in the U.S.).

**Content**

Each of the four datasets contains form tables for five seasons, from Season 2015/16 to Season 2019/20. Match results are denoted as "W" for a win, "D" for a draw, or "L" for a loss.

**Acknowledgements**

All data came from https://www.transfermarkt.us/.

**Inspiration**

- Which soccer league is the most competitive?
- How competitive is Major League Soccer compared to the other leagues, all of which are well known for their competitiveness (as well as prestige)?

### Predicting Competitiveness in Soccer Leagues 

**Abstract**: As part of the Data Science course project, a Kaggle dataset was selected and modeled with classification algorithms. The datasets were created for a soccer competitiveness project that aimed to compare four soccer leagues: the Bundesliga (in Germany), La Liga (in Spain), the Premier League (in the U.K.), and Major League Soccer (in the U.S.). Using these datasets, we ask: “How early in the season can the teams finishing in 1st, 2nd, and 3rd position be predicted?” Since each league has its own dynamics, the datasets are processed independently of each other. After evaluating several binary classification models, La Liga was found to be the most competitive league, with the lowest balanced accuracy during the season (a lower score means the top three are harder to predict), followed by the Bundesliga, the Premier League, and Major League Soccer, in that order. However, when we examined the final scores of the teams and the distribution of the differences between them, we observed that Major League Soccer was more competitive than the others on this measure, similar to La Liga.

**Keywords**: soccer competitiveness, binary classification, predictive analysis

**Programming Language**: Python 3.7

**Introduction**: Predicting the competitiveness of football leagues is an important problem in the football industry, since it can affect the interests and emotions of fans, the ambitions of players, and the betting odds. All teams start with zero points and earn points according to the result of each game: 3 points for a win, 1 point for a draw, and 0 points for a loss. In each season all teams play each other, and once all rounds are completed the team with the most points finishes in 1st position. Many factors may influence the level of competitiveness, given the different dynamics of the leagues, but here we focus on the features provided by the datasets, described below:

Each of the four datasets contains form tables for five seasons, from Season 2015/16 to Season 2019/20. Match results are denoted as "W" for a win, "D" for a draw, or "L" for a loss. All data came from https://www.transfermarkt.us/. The final positions, the total goals scored, and the goals conceded during the season are also provided by the datasets.

All match results, which are given as categorical features, are converted to their point values using the mapping `M = {'W': 3, 'D': 1, 'L': 0}` and then accumulated into a running total of points for each week. For example, in week 24 the table holds 24 match results (W, D, or L), and that week's point column is the cumulative sum of the corresponding points. In addition, the cumulative numbers of wins, draws, and losses are calculated for each week.
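A minimal sketch of this encoding on a made-up three-match table. The team names and results are hypothetical; only the `Match1`, `Point_1`, `Win_1` naming follows the notebook's convention.

```python
import pandas as pd

# Hypothetical form table: one row per team, one column per match week
form = pd.DataFrame(
    {"Match1": ["W", "D"], "Match2": ["L", "W"], "Match3": ["W", "W"]},
    index=["Team A", "Team B"],
)

points_map = {"W": 3, "D": 1, "L": 0}

# Points earned in each individual match
per_match = form.apply(lambda col: col.map(points_map))

# Running total across the weeks (axis=1 accumulates from left to right)
cum_points = per_match.cumsum(axis=1)
cum_points.columns = [f"Point_{i + 1}" for i in range(cum_points.shape[1])]

# Running count of wins, built the same way from a 0/1 indicator
cum_wins = (form == "W").astype(int).cumsum(axis=1)
cum_wins.columns = [f"Win_{i + 1}" for i in range(cum_wins.shape[1])]

print(cum_points)
print(cum_wins)
```

Draw and loss counts can be obtained the same way by comparing against `"D"` or `"L"`.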

Point differences may be an indicator of competitiveness in a league: if the teams' points in a given week are close to each other, prediction becomes harder because the uncertainty is higher. If the points are more spread out, with larger gaps between teams, prediction becomes easier for that league. To observe the effect of point differences in a given week, after the cumulative totals were calculated, the Point Difference attributes were created by ranking the teams' points for each week and taking the differences between consecutively ranked teams.
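The ranking-and-differencing step can be sketched as follows for a single week. The scores are invented for illustration, and setting the leader's gap to 0 is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical cumulative points after some week, one value per team
points = pd.Series({"Team A": 50, "Team C": 44, "Team B": 47, "Team D": 30})

# Rank the teams from the highest to the lowest score
ranked = points.sort_values(ascending=False)

# Gap between each team and the team ranked directly above it;
# the leader has nobody above, so its gap is set to 0 here
gap_to_team_above = (-ranked.diff()).fillna(0).astype(int)

print(gap_to_team_above)
```

Small gaps throughout the table indicate a tight league in that week, while large gaps indicate that the ranking is already well separated.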

The target column is defined as a binary outcome: 1 for teams that finish in 1st, 2nd, or 3rd position, and 0 for the rest. Our predictions are therefore about the top three positions in a given league and season.
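As a small illustration, the binary target can be derived from a final-position column like this. The five-team table is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical final table for one league and season
table = pd.DataFrame({"Team": ["A", "B", "C", "D", "E"], "Position": [1, 2, 3, 4, 5]})

# 1 if the team finished in the top three, 0 otherwise
table["target"] = np.where(table["Position"].isin([1, 2, 3]), 1, 0)

# Share of positive labels in this toy table
positive_rate = table["target"].mean()
print(table)
print(positive_rate)
```

In a real 20-team league only 3 of 20 rows per season are positive, so the classes are imbalanced.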

Since each dataset covers five seasons, the first four seasons form the training set and the last season forms the test set. The training sets are used for grid search and the base models, after which the final evaluation is performed on the test sets.
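A sketch of this season-based split on a toy feature table. The column `Point_24` and all values are placeholders chosen for illustration.

```python
import pandas as pd

seasons = ["2015/16", "2016/17", "2017/18", "2018/19", "2019/20"]

# Hypothetical feature table with two teams per season
data = pd.DataFrame({
    "Season": [s for s in seasons for _ in range(2)],
    "Point_24": [50, 30, 48, 33, 51, 29, 47, 35, 49, 31],
    "target": [1, 0] * 5,
})

# Hold out the most recent season and train on everything before it
is_last = data["Season"] == seasons[-1]
train = data[~is_last].drop(columns="Season")
test = data[is_last].drop(columns="Season")

print(len(train), len(test))
```

Splitting by season rather than at random keeps all rows of one season on the same side of the split.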

**Project Objective**: The minimum requirements of this project are “Model Selection”, “Application of at least four machine learning algorithms”, “Feature selection”, “Overfit/Underfit analysis”, and “Hyperparameter optimization”. Using these data science steps, we aim to measure the competitiveness of the given leagues through the predictability scores of classification models.
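The abstract reports balanced accuracy as the predictability score. The hand-written sketch below, on invented labels for a hypothetical 20-team league, shows why it suits a target with only three positives per season. For the binary case it is the mean of the recall on each class, which is what scikit-learn's `balanced_accuracy_score` computes.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on the positive class and the recall on the negative class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)

# Hypothetical 20-team league: 3 top-three teams, 17 others
y_true = [1, 1, 1] + [0] * 17

# A model that always predicts "not top three"
always_zero = [0] * 20

# A model that finds two of the three top teams and raises one false alarm
some_model = [1, 1, 0] + [1] + [0] * 16

plain_accuracy = sum(t == p for t, p in zip(y_true, always_zero)) / len(y_true)

print(plain_accuracy)
print(balanced_accuracy(y_true, always_zero))
print(balanced_accuracy(y_true, some_model))
```

The always-negative model looks strong on plain accuracy but scores only 0.5 on balanced accuracy, so the latter is the more informative measure of how predictable the top three are.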


In [None]:
import os
os.getcwd()
print(os.listdir("../input"))

### Solution and Methods 


In [None]:
! pip install openpyxl

In [None]:
#Importing required libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

from scipy.stats import norm

from sklearn.metrics import classification_report, precision_recall_fscore_support, balanced_accuracy_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score , cross_validate, GridSearchCV, RepeatedStratifiedKFold, StratifiedKFold, RandomizedSearchCV
from sklearn.compose import make_column_selector,make_column_transformer, ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import SelectFromModel , mutual_info_classif,RFECV

from collections import Counter

from xgboost import XGBClassifier

from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN, RandomOverSampler
from imblearn.combine import SMOTEENN
from imblearn.pipeline import Pipeline as imbpipeline
from imblearn.pipeline import make_pipeline as imbmake_pipeline
from imblearn.base import BaseSampler
from imblearn.under_sampling import (ClusterCentroids, RandomUnderSampler,
                                     NearMiss,
                                     InstanceHardnessThreshold,
                                     CondensedNearestNeighbour,
                                     EditedNearestNeighbours,
                                     RepeatedEditedNearestNeighbours,
                                     AllKNN,
                                     NeighbourhoodCleaningRule,
                                     OneSidedSelection)

import tqdm
import time
def hms_string(sec_elapsed):
        h = int(sec_elapsed / (60 * 60))
        m = int((sec_elapsed % (60 * 60)) / 60)
        s = sec_elapsed % 60
        return "{}:{:>02}:{:>05.2f}".format(h, m, s)

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_colwidth', 1000)

### Read & Merge all datasets

In [None]:
#Read Datasets. Here we are dropping Matches, because we will merge with all leagues' dataframes 
df = pd.read_excel('../input/form-in-bundesliga-la-liga-mls-premier-league/all-leaguetables.xlsx')
df['Position'] = df['Position'].astype(int)
df.sort_values(by=['Season','Position','Team'], inplace=True)


#Find the Final Points differences between the winner and each team, and between consecutively ranked teams
news = []

for i in list(df.League.unique()):
    temp = df[df['League']==i]
    for j in list(temp.Season.unique()):
        #Copy the slice so the new columns are added to an independent dataframe
        temp_ = temp[temp['Season']==j].copy()
        #Points of the team in Position 1 for this league and season
        winner_points = temp_[temp_['Position']==1]['FinalPoints'].values[0]
        temp_['FinalPointsDiff_Winner'] = winner_points - temp_['FinalPoints']
        #Rows are sorted by Position, so diff(-1).shift() gives the gap to the team ranked directly above (0 for the leader)
        temp_['FinalPoints_Diff'] = temp_.FinalPoints.diff(-1).shift().fillna(0)
        news.append(temp_)
        
df = pd.concat(news)

# Read all separated dataframes 
bundesliga = pd.read_csv('../input/form-in-bundesliga-la-liga-mls-premier-league/form-bundesliga.csv').drop('Matches', axis=1)
epl = pd.read_csv('../input/form-in-bundesliga-la-liga-mls-premier-league/form-epl.csv').drop('Matches', axis=1)
laliga = pd.read_csv('../input/form-in-bundesliga-la-liga-mls-premier-league/form-laliga.csv').drop('Matches', axis=1)
mls = pd.read_csv('../input/form-in-bundesliga-la-liga-mls-premier-league/form-mls.csv').drop('Matches', axis=1)

#Merge separated League datasets with the all league table. 
bundesliga_ = df[df['League']=='Bundesliga'].merge(bundesliga, on=['Season','Team'], how='left')
laliga_ = df[df['League']=='La Liga'].merge(laliga, on=['Season','Team'], how='left')
epl_ = df[df['League']=='Premier League'].merge(epl, on=['Season','Team'], how='left')
mls_ = df[df['League']=='Major League Soccer'].merge(mls, on=['Season','Team'], how='left')

#The Position column holds the final ranking of the team; it is converted to integer so that the rows can be sorted by it.
# All categorical and integer typed columns are extracted in a list format to be handled during pipelines. 
bundesliga_['Position'] = bundesliga_['Position'].astype(int)
laliga_['Position'] = laliga_['Position'].astype(int)
epl_['Position'] = epl_['Position'].astype(int)
mls_['Position'] = mls_['Position'].astype(int)

#Sorting all dataframe values by 'Season','Position','Team' 
bundesliga_.sort_values(by=['Season','Position','Team'], inplace=True)
laliga_.sort_values(by=['Season','Position','Team'], inplace=True)
epl_.sort_values(by=['Season','Position','Team'], inplace=True)
mls_.sort_values(by=['Season','Position','Team'], inplace=True)

#MLS dataframe season column is different than other dataframes. So, here we map all the values to commonly used values. 
mapper = dict(zip(list(mls_.Season.unique()),list(bundesliga_.Season.unique())))
mls_['Season'] = mls_['Season'].map(mapper)

#Concat all prepared dataframes
df_ = pd.concat([bundesliga_,epl_,laliga_,mls_], axis=0)


display(df.head(5))        

#Select all numerical and categorical columns
cat_selector = make_column_selector(dtype_exclude=np.number)
num_selector = make_column_selector(dtype_include=np.number)

cats = cat_selector(df)
nums = num_selector(df)

print(f'categorical columns: {cats}')
print(f'numerical columns: {nums}')

display(df_.head(5))  

### Missing Values

In [None]:
all_dfs = [df,bundesliga_,epl_,laliga_,mls_]
labels = ['all_dataframes','bundesliga','epl','laliga','mls']

plt.figure(figsize=(20,10))
#plt.subplots_adjust(wspace=0.05, hspace=0.5)
a = 3  # number of rows
b = 2  # number of columns
c = 1  # initialize plot counter

for t,z in enumerate(zip(labels,all_dfs)):
    plt.subplot(a, b, c+t)
    sns.heatmap(z[1].isnull())
    plt.title(z[0])
    plt.axis('off')

plt.show()    


There are many missing values, especially in the La Liga dataframe. Checking the team names shows that they are written differently in the two dataframes being merged. We therefore harmonize the team names and then merge the two dataframes again.
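One way to spot such mismatches is the `indicator` option of `DataFrame.merge`, followed by a substring-based rename. The sketch below uses invented team names and is not the notebook's own data.

```python
import pandas as pd

# Hypothetical league table and form table with inconsistent team names
league_table = pd.DataFrame({
    "Season": ["2019/20", "2019/20"],
    "Team": ["FC Example City", "Sample United"],
})
form_table = pd.DataFrame({
    "Season": ["2019/20", "2019/20"],
    "Team": ["Example City", "Sample United"],
    "Match1": ["W", "L"],
})

# indicator=True adds a `_merge` column that marks rows found only on the left side
check = league_table.merge(form_table, on=["Season", "Team"], how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "Team"].tolist()

# Substring-based harmonization: rename the short form to the longer league-table name
for short in form_table["Team"].unique():
    for full in unmatched:
        if short in full:
            form_table["Team"] = form_table["Team"].replace(short, full)

fixed = league_table.merge(form_table, on=["Season", "Team"], how="left")
print(unmatched)
print(fixed)
```

Substring matching is a heuristic: if one short name is contained in several full names it can assign the wrong team, so the result is worth checking by eye.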

In [None]:
to_bundesliga_ = list(bundesliga_[(bundesliga_['Match1'].isna())].Team.unique())
from_bundesliga = list(bundesliga['Team'].unique())

to_laliga_ = list(laliga_[(laliga_['Match1'].isna())].Team.unique())
from_laliga = list(laliga['Team'].unique())

for i in from_bundesliga:
    for j in to_bundesliga_:
        if i in j:
            bundesliga.replace(i,j, inplace=True)
            
for i in from_laliga:
    for j in to_laliga_:
        if i in j:
            laliga.replace(i,j, inplace=True)    

In [None]:
bundesliga_ = df[df['League']=='Bundesliga'].merge(bundesliga, on=['Season','Team'], how='left')
laliga_ = df[df['League']=='La Liga'].merge(laliga, on=['Season','Team'], how='left')

all_dfs = [df,bundesliga_,epl_,laliga_,mls_]
labels = ['all_dataframes','bundesliga','epl','laliga','mls']

plt.figure(figsize=(20,10))
#plt.subplots_adjust(wspace=0.05, hspace=0.5)
a = 3  # number of rows
b = 2  # number of columns
c = 1  # initialize plot counter

for t,z in enumerate(zip(labels,all_dfs)):
    plt.subplot(a, b, c+t)
    sns.heatmap(z[1].isnull())
    plt.title(z[0])
    plt.axis('off')

plt.show()

In [None]:
for t,z in enumerate(zip(labels,all_dfs)):
    for i in z[1].columns:
        null_rate = z[1][i].isna().sum() / len(z[1]) * 100 
        if null_rate > 0 :
            print("{} dataframe,{}'s null rate :{}%".format(z[0],i,round(null_rate,2)))


With this check we confirm that there are no missing values left.

In [None]:
#Concat all prepared dataframes
df_ = pd.concat([bundesliga_,epl_,laliga_,mls_], axis=0);df_.head()

### Correlation Matrix among all Numeric Features

In [None]:
all_dfs = [bundesliga_,epl_,laliga_,mls_]
labels = ['bundesliga','epl','laliga','mls']

plt.figure(figsize=(15,15))
plt.subplots_adjust(wspace=0.5, hspace=0.5)
a = 2  # number of rows
b = 2  # number of columns
c = 1  # initialize plot counter

for t,z in enumerate(zip(labels,all_dfs)):
    plt.subplot(a, b, c+t)
    #Correlate the numeric columns only; text columns such as Team or the match results cannot be correlated
    corr = z[1].select_dtypes(include=np.number).corr()
    mask = np.triu(np.ones_like(corr, dtype=bool))
    cmap = sns.diverging_palette(230, 20, as_cmap=True)
    sns.heatmap(corr,annot=True, mask=mask, cmap=cmap, vmax=1.0, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
    plt.title(z[0],fontweight="bold")
    #plt.axis('off')

plt.show()

From the graphs above, we observe strong correlations between some of the features. This is expected, as many features are derived from the teams' final standings at the end of the season.

For example:

The Wins column holds all wins of the season. It is highly correlated with Position, the team's final rank: the more wins a team has, the closer its position is to 1 for that season. The same reasoning explains the relationship between Final Points and Position.


#### Distribution between the winner and the rest of the teams 

The distribution of final points is one indicator of the level of uncertainty. To observe it, we plot the distributions from four points of view in the "Distribution of League Final Points" figure below. The charts are prepared per league, and the colors correspond to the seasons. The first charts, titled "Overall", show the distribution of the teams' final points for each season. MLS has the most skewed distribution of the four leagues, which suggests that MLS is the most competitive league in terms of final scores.

The second set of charts covers only the top three teams. Competition among these leading teams is a strong factor, as is competition among the remaining teams in the league. The shapes of the distributions change between leagues and seasons. For example, La Liga was more competitive in the 2015/16 season than in its other seasons, similar to MLS. We can therefore conclude that the level of competition changes from season to season and is not fixed.

The third factor is the final point difference to the winner, which shows how far the leading team pulls away from the rest. These results differ slightly from the first "Overall" charts. Finally, the last charts show the final point differences between consecutively ranked teams. They confirm that the MLS distribution is more skewed than the others.

In [None]:
g = sns.displot(data=df_, x="FinalPoints", hue="Season", row="League",kind="kde",  height=3, aspect=4)
g.set_axis_labels("Final Points", "Density") 
g.set_titles("{row_name}")


In [None]:
fig = plt.figure(figsize=(20,15))
plt.subplots_adjust(wspace=0.2, hspace=0.5)
gs = fig.add_gridspec(4,4)

# One row per league: (name used in the "Overall" title, name used in the other titles, dataframe)
leagues = [('Premier League', 'Premier League', epl_),
           ('Bundesliga', 'Bundesliga', bundesliga_),
           ('La Liga', 'La Liga', laliga_),
           ('Major League Soccer', 'MLS', mls_)]

# One column per view: (column to plot, title suffix, restrict to the top three teams?)
panels = [('FinalPoints', 'Overall', False),
          ('FinalPoints', 'Final Points - First 3 Teams', True),
          ('FinalPointsDiff_Winner', 'Final Points Diff Winner', False),
          ('FinalPoints_Diff', 'Final Points Differences', False)]

ax00 = None
for r, (full_name, short_name, league_df) in enumerate(leagues):
    for c, (col, suffix, top3_only) in enumerate(panels):
        # All panels share the x-axis of the first panel
        ax = fig.add_subplot(gs[r, c], sharex=ax00)
        if ax00 is None:
            ax00 = ax
        data = league_df[league_df['Position'].isin([1, 2, 3])] if top3_only else league_df
        name = full_name if c == 0 else short_name
        # fill=True replaces the deprecated shade=True argument
        sns.kdeplot(x=col, hue='Season', data=data, fill=True, alpha=.5, ax=ax, palette='gist_gray_r').set(title=f'{name} | {suffix}')

sns.despine()

# Title & Subtitle    
fig.text(0.4, 0.93, 'Distribution of League Final Points', fontsize=20, fontweight='bold', fontfamily='serif', ha='left') 
plt.show();
fig.savefig('Distributionof.LeagueFinalPoints.jpeg', dpi=350, bbox_inches='tight')



From the charts above, we can observe that La Liga had strong competition among the top three teams in the 2015/16 season, in contrast to the overall distribution for the same season.

In the 2016/17 season, MLS had strong competition among the top three teams.

For the other leagues, the picture is more dispersed, both for the top three teams and for the league as a whole.

The MLS distributions are the most skewed of the four leagues. A first observation is therefore that MLS may be the most competitive league.

Final point gaps are tighter in MLS than in the other leagues, meaning that consecutively ranked teams finish close to each other. In this sense, we can conclude that the league is competitive.
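The "tightness" of the gaps read off the charts can also be summarized numerically. The sketch below computes the average gap between consecutively ranked teams for two invented leagues; the numbers are made up and are not taken from the dataset.

```python
import pandas as pd

# Made-up final points for two hypothetical leagues
final = pd.DataFrame({
    "League": ["Tight"] * 4 + ["Spread"] * 4,
    "FinalPoints": [60, 58, 57, 55, 90, 70, 45, 20],
})

def mean_gap(points):
    """Average point gap between consecutively ranked teams."""
    ranked = points.sort_values(ascending=False)
    return (-ranked.diff()).dropna().mean()

# One value per league; a smaller mean gap means a tighter final table
gaps = final.groupby("League")["FinalPoints"].apply(mean_gap)
print(gaps)
```

A summary like this is only a rough complement to the density plots, since a single average hides where in the table the gaps occur.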

### Feature Preparation & Encoding

Here we encode the categorical match results as points: a win earns 3 points, a draw earns 1 point, and a loss earns no points (0).

- W:3 
- D:1
- L:0

Numerical features are scaled to the range [-1, 1].

- Although the **GoalsScored** and **GoalsConceded** columns carry information about a team's success in a particular season, we cannot add them to the forecasting model. Our approach predicts the final positions from the matches played so far, and these columns only hold end-of-season totals. Adding end-of-season information would leak the outcome into the model and distort its predictions, since it is highly correlated with the final position. Since **GoalDiff** is derived from these two columns, we drop it before the modeling phase as well.


- The end-of-season totals **Wins**, **Draws** and **Losses** are already contained in the sequential Match1, Match2, etc. columns, so they do not need to be added to the model. This information is rebuilt as cumulative counts for the prediction model.


- The **Matches** column is constant within a given league, so we drop it since it does not contribute to the final prediction.


- **Position** gives the team's final rank, so this column is the basis of our target. Because we frame the problem as finishing in the top 3 positions or not, we encode this column into a binary target.


- **League**: Positions are predicted separately per league, with a different model for each league, so this column is not encoded.
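The [-1, 1] scaling mentioned above can be sketched by hand as follows. The feature names and values are placeholders, and fitting the range on the training data only is an assumption of this sketch. scikit-learn's `MinMaxScaler(feature_range=(-1, 1))` performs the same linear mapping.

```python
import pandas as pd

# Hypothetical numeric features for the training seasons and the test season
train = pd.DataFrame({"Point_24": [20, 35, 50], "Win_24": [4, 9, 16]})
test = pd.DataFrame({"Point_24": [42], "Win_24": [10]})

# Learn the per-column range on the training data only
col_min = train.min()
col_max = train.max()

def scale_to_unit_range(frame):
    """Map each column linearly so the training minimum becomes -1 and the training maximum +1."""
    return 2 * (frame - col_min) / (col_max - col_min) - 1

train_scaled = scale_to_unit_range(train)
test_scaled = scale_to_unit_range(test)

print(train_scaled)
print(test_scaled)
```

Test values can fall slightly outside [-1, 1] when the held-out season contains values beyond the training range.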

In [None]:
#Create Score columns 
for i,j in enumerate(list(df_.loc[:,'Match1':'Match38'].columns)):
    df_['Point_'+str(i+1)] = df_['Match'+str(i+1)].copy()
    df_['Point_'+str(i+1)] = df_['Point_'+str(i+1)].map({'W':3,'D':1,'L':0})

#Cumsum all Point scores
df_.loc[:,'Point_1':'Point_38']= df_.loc[:,'Point_1':'Point_38'].cumsum(axis = 1, skipna = True).astype(pd.Int64Dtype())

for i,j in enumerate(list(df_.loc[:,'Match1':'Match38'].columns)):    
    df_['Win_'+str(i+1)] = df_['Match'+str(i+1)].copy()
    df_['Win_'+str(i+1)] = df_['Win_'+str(i+1)].map({'W':1,'D':0,'L':0})

#Cumsum all Win scores
df_.loc[:,'Win_1':'Win_38']= df_.loc[:,'Win_1':'Win_38'].cumsum(axis = 1, skipna = True).astype(pd.Int64Dtype())

for i,j in enumerate(list(df_.loc[:,'Match1':'Match38'].columns)):
    df_['Loss_'+str(i+1)] = df_['Match'+str(i+1)].copy()
    df_['Loss_'+str(i+1)] = df_['Loss_'+str(i+1)].map({'W':0,'D':0,'L':1})

#Cumsum all Loss scores
df_.loc[:,'Loss_1':'Loss_38']= df_.loc[:,'Loss_1':'Loss_38'].cumsum(axis = 1, skipna = True).astype(pd.Int64Dtype())
    
for i,j in enumerate(list(df_.loc[:,'Match1':'Match38'].columns)):
    df_['Draw_'+str(i+1)] = df_['Match'+str(i+1)].copy()
    df_['Draw_'+str(i+1)] = df_['Draw_'+str(i+1)].map({'W':0,'D':1,'L':0})

#Cumsum all Draw scores     
df_.loc[:,'Draw_1':'Draw_38']= df_.loc[:,'Draw_1':'Draw_38'].cumsum(axis = 1, skipna = True).astype(pd.Int64Dtype())

#Create difference columns by ranking the teams on each score column, then taking the differences between consecutive scores. 
dists = []
for leg in list(df_.League.unique()):
    temp = df_[df_['League']==leg]
    for ses in list(temp.Season.unique()):
        #Copy the slice so the new columns are added to an independent dataframe
        temp_ = temp[temp['Season']==ses].copy()
        for i,j in enumerate(list(temp_.loc[:,'Point_1':'Point_38'].columns)):
            #Rank the teams by their points after match i+1 and store the gap to the team ranked directly above
            temp_['PointDiff_'+str(i+1)] = temp_.sort_values(by=[j], ascending=False)[j].diff(-1).shift()
        #Append once per league and season, after all PointDiff columns have been added
        dists.append(temp_)
            
df_ = pd.concat(dists).drop_duplicates().sort_values(by=['League','Season','Position','Team'])
df_ = df_.reset_index().drop('index', axis=1)

#Prepare target column 
df_['target'] = np.where(df_['Position'].isin([1,2,3]), 1, 0) 

display(df_.head())

In [None]:
#Create all separated dataframes corresponding to the Leagues
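#.copy() makes each league dataframe independent of df_, so the in-place operations below are safe
bundesliga_ = df_[df_['League']=='Bundesliga'].copy()
epl_ = df_[df_['League']=='Premier League'].copy()
laliga_ = df_[df_['League']=='La Liga'].copy()
mls_ = df_[df_['League']=='Major League Soccer'].copy()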

#Create below lists to be able to iterate over 
all_dfs = [bundesliga_,epl_,laliga_,mls_]

### Select the columns which will be in the model and drop all missing values .
for i in all_dfs:
    #i.drop(removed_cols, axis=1, inplace=True)
    i.dropna(axis=1, how='all',inplace=True)
    i.fillna(0, inplace=True)


seasons = ['2015/16', '2016/17', '2017/18', '2018/19', '2019/20']

#removed_cols are selected to be removed since they are highly correlated with the final position and final points. Also since we will predict the positions of each team before the season will be finalized, we cannot include them in the model. 
removed_cols = ['League','Position','Matches','Wins','Draws','Losses','GoalsScored','GoalsConceded','GoalDiff','FinalPoints','FinalPointsDiff_Winner','FinalPoints_Diff']

### Building the model with the first 70% of the matches

In [None]:
# Decide on percentage and prep related columns 
percentage = 0.7
train_number = round(bundesliga_.Matches.unique()[0]*percentage)

#Remove columns not to be included in the model 
bundesliga_.drop(removed_cols, axis=1, inplace=True)

bundesliga_.head()

In [None]:
# Here we will prepare dataframe columns which we will be using during model phase 
# Create columns index name for the given dataset

cols = dict()
for i,j in enumerate(list(bundesliga_.columns)):
    cols[j] = i

# Names of the columns to keep: the cumulative counts at the cut-off week, the boundaries of the PointDiff range, and the target

a=['Season','Team','Win_'+str(train_number),'Loss_'+str(train_number),'Draw_'+str(train_number) , 'Point_'+str(train_number),'PointDiff_1','PointDiff_'+str(train_number+1),'target']

# Positions of those columns, in the order in which they appear in the dataframe
search = [j for k,j in cols.items() if k in a]


# Define the new dataframe with the selected order.
# search[6]:search[7] selects PointDiff_1 up to, but not including, PointDiff_{train_number+1}
bundesliga_ = bundesliga_.iloc[:, np.r_[search[0],search[1],search[2],search[3],search[4],search[5], search[6]:search[7],search[8]]]



bundesliga_.head()

### Train Test Split

For each dataset, the last season is held out as the test set used to validate the model.

In [None]:
#Train Test Split
bundesliga_train = bundesliga_[bundesliga_['Season']!=seasons[-1]].drop('Season', axis=1)
bundesliga_test = bundesliga_[bundesliga_['Season']==seasons[-1]].drop('Season', axis=1)

### Create prep function to apply all datasets

Here we define a function that applies the preparation steps above to all datasets, so that train and test sets can be created for each of them.

In [None]:

path_df = '../input/form-in-bundesliga-la-liga-mls-premier-league/all-leaguetables.xlsx'
path_bundesliga = '../input/form-in-bundesliga-la-liga-mls-premier-league/form-bundesliga.csv'
path_epl = '../input/form-in-bundesliga-la-liga-mls-premier-league/form-epl.csv'
path_laliga = '../input/form-in-bundesliga-la-liga-mls-premier-league/form-laliga.csv'
path_mls = '../input/form-in-bundesliga-la-liga-mls-premier-league/form-mls.csv'

def prep_step_1(path_df,path_bundesliga,path_epl,path_laliga,path_mls):
    #Read Datasets. Here we are dropping Matches, because we will merge with the all leagues dataframes 
    df = pd.read_excel(path_df)
    df['Position'] = df['Position'].astype(int)
    df.sort_values(by=['Season','Position','Team'], inplace=True)


    #Find all Final Points differences between winner and the teams & differences between all teams 
    news = []

    for i in list(df.League.unique()):
        temp = df[df['League']==i]
        for j in list(temp.Season.unique()):
            temp_ = temp[temp['Season']==j].copy()
            temp_['winner'] = temp_[temp_['Position']==1]['FinalPoints'].values[0]
            temp_['FinalPointsDiff_Winner'] = temp_['winner'] - temp_['FinalPoints']
            temp_.drop(['winner'], axis=1, inplace=True)
            temp_['FinalPoints_Diff'] = temp_.FinalPoints.diff(-1).shift().fillna(0)
            news.append(temp_)

    df = pd.concat(news)

    # Read all separated dataframes 
    bundesliga = pd.read_csv(path_bundesliga).drop('Matches', axis=1)
    epl = pd.read_csv(path_epl).drop('Matches', axis=1)
    laliga = pd.read_csv(path_laliga).drop('Matches', axis=1)
    mls = pd.read_csv(path_mls).drop('Matches', axis=1)

    #Merge separated League datasets with the all league table. 
    bundesliga_ = df[df['League']=='Bundesliga'].merge(bundesliga, on=['Season','Team'], how='left')
    laliga_ = df[df['League']=='La Liga'].merge(laliga, on=['Season','Team'], how='left')
    epl_ = df[df['League']=='Premier League'].merge(epl, on=['Season','Team'], how='left')
    mls_ = df[df['League']=='Major League Soccer'].merge(mls, on=['Season','Team'], how='left')

    #Position column indicates the final ranking of the team, to be able to sort , this column converted to integer.
    # All categorical and integer typed columns are extracted in a list format to be handled during pipelines. 
    bundesliga_['Position'] = bundesliga_['Position'].astype(int)
    laliga_['Position'] = laliga_['Position'].astype(int)
    epl_['Position'] = epl_['Position'].astype(int)
    mls_['Position'] = mls_['Position'].astype(int)

    #Sorting all dataframe values by 'Season','Position','Team' 
    bundesliga_.sort_values(by=['Season','Position','Team'], inplace=True)
    laliga_.sort_values(by=['Season','Position','Team'], inplace=True)
    epl_.sort_values(by=['Season','Position','Team'], inplace=True)
    mls_.sort_values(by=['Season','Position','Team'], inplace=True)

    #MLS dataframe season column is different than other dataframes. So, here we map all the values to commonly used values. 
    mapper = dict(zip(list(mls_.Season.unique()),list(bundesliga_.Season.unique())))
    mls_['Season'] = mls_['Season'].map(mapper)

    to_bundesliga_ = list(bundesliga_[(bundesliga_['Match1'].isna())].Team.unique())
    from_bundesliga = list(bundesliga['Team'].unique())

    to_laliga_ = list(laliga_[(laliga_['Match1'].isna())].Team.unique())
    from_laliga = list(laliga['Team'].unique())

    for i in from_bundesliga:
        for j in to_bundesliga_:
            if i in j:
                bundesliga.replace(i,j, inplace=True)

    for i in from_laliga:
        for j in to_laliga_:
            if i in j:
                laliga.replace(i,j, inplace=True)    

    # Merge again now that the team names are aligned
    bundesliga_ = df[df['League']=='Bundesliga'].merge(bundesliga, on=['Season','Team'], how='left')
    laliga_ = df[df['League']=='La Liga'].merge(laliga, on=['Season','Team'], how='left')

    all_dfs = [bundesliga_,epl_,laliga_,mls_]
    labels = ['bundesliga','epl','laliga','mls']

    # Concatenate all prepared dataframes
    df_ = pd.concat([bundesliga_,epl_,laliga_,mls_], axis=0)

    #Create Score columns 
    for i,j in enumerate(list(df_.loc[:,'Match1':'Match38'].columns)):
        df_['Point_'+str(i+1)] = df_['Match'+str(i+1)].copy()
        df_['Point_'+str(i+1)] = df_['Point_'+str(i+1)].map({'W':3,'D':1,'L':0})

    #Cumsum all Point scores
    df_.loc[:,'Point_1':'Point_38']= df_.loc[:,'Point_1':'Point_38'].cumsum(axis = 1, skipna = True).astype(pd.Int64Dtype())

    for i,j in enumerate(list(df_.loc[:,'Match1':'Match38'].columns)):    
        df_['Win_'+str(i+1)] = df_['Match'+str(i+1)].copy()
        df_['Win_'+str(i+1)] = df_['Win_'+str(i+1)].map({'W':1,'D':0,'L':0})

    #Cumsum all Win scores
    df_.loc[:,'Win_1':'Win_38']= df_.loc[:,'Win_1':'Win_38'].cumsum(axis = 1, skipna = True).astype(pd.Int64Dtype())

    for i,j in enumerate(list(df_.loc[:,'Match1':'Match38'].columns)):
        df_['Loss_'+str(i+1)] = df_['Match'+str(i+1)].copy()
        df_['Loss_'+str(i+1)] = df_['Loss_'+str(i+1)].map({'W':0,'D':0,'L':1})

    #Cumsum all Loss scores
    df_.loc[:,'Loss_1':'Loss_38']= df_.loc[:,'Loss_1':'Loss_38'].cumsum(axis = 1, skipna = True).astype(pd.Int64Dtype())

    for i,j in enumerate(list(df_.loc[:,'Match1':'Match38'].columns)):
        df_['Draw_'+str(i+1)] = df_['Match'+str(i+1)].copy()
        df_['Draw_'+str(i+1)] = df_['Draw_'+str(i+1)].map({'W':0,'D':1,'L':0})

    #Cumsum all Draw scores     
    df_.loc[:,'Draw_1':'Draw_38']= df_.loc[:,'Draw_1':'Draw_38'].cumsum(axis = 1, skipna = True).astype(pd.Int64Dtype())

    # Build the PointDiff_k columns: within each league and season, sort the teams by cumulative points after match k
    # and take the gap between consecutive teams.
    dists = []
    for leg in list(df_.League.unique()):
        temp = df_[df_['League']==leg]
        for ses in list(temp.Season.unique()):
            # Work on a copy so the new columns are not written to a slice of df_
            temp_ = temp[temp['Season']==ses].copy()
            for i,j in enumerate(list(temp_.loc[:,'Point_1':'Point_38'].columns)):
                temp_['PointDiff_'+str(i+1)] = temp_.sort_values(by=[j], ascending=False)[j].diff(-1).shift()
            # Append once per league-season, after all PointDiff columns are filled
            dists.append(temp_)

    df_ = pd.concat(dists).drop_duplicates().sort_values(by=['League','Season','Position','Team'])
    df_ = df_.reset_index().drop('index', axis=1)

    # Prepare the target column: 1 if the team finished in the top three, otherwise 0
    df_['target'] = np.where(df_['Position'].isin([1,2,3]), 1, 0) 

    #Create all separated dataframes corresponding to the Leagues
    bundesliga_ = df_[df_['League']=='Bundesliga']
    epl_ = df_[df_['League']=='Premier League']
    laliga_ = df_[df_['League']=='La Liga']
    mls_ = df_[df_['League']=='Major League Soccer']

    #Create below lists to be able to iterate over 
    all_dfs = [bundesliga_,epl_,laliga_,mls_]
    
    return all_dfs

In [None]:
def prep_step_2 (df, percentage=0.7):
    df.dropna(axis=1, how='all',inplace=True)
    df.fillna(0, inplace=True)
    
    seasons = ['2015/16', '2016/17', '2017/18', '2018/19', '2019/20']
    # removed_cols are dropped because they describe the end-of-season outcome and are highly correlated with the final position and final points.
    # Since we predict each team's position before the season is finished, they cannot be included in the model.
    removed_cols = ['League','Position','Matches','Wins','Draws','Losses','GoalsScored','GoalsConceded','GoalDiff','FinalPoints','FinalPointsDiff_Winner','FinalPoints_Diff']
    
    # Number of matches played at the chosen fraction of the season (e.g. 70% of 34 matches -> 24)
    train_number = round(df.Matches.unique()[0]*percentage)
    
    #Remove columns not to be included in the model 
    df.drop(removed_cols, axis=1, inplace=True)
    
    # Here we will prepare dataframe columns which we will be using during model phase 
    # Create columns index name for the given dataset
    cols = dict()
    for i,j in enumerate(list(df.columns)):
        cols[j] = i    
        
    # Prepare a list of related indexes extracted from the above dictionary    
    a=['Season','Team','Win_'+str(train_number),'Loss_'+str(train_number),'Draw_'+str(train_number) , 'Point_'+str(train_number),'PointDiff_1','PointDiff_'+str(train_number+1),'target']

    search = []
    for k,j in cols.items():
        for s in a:
            if k==s:
                search.append(j)   
                
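    # Keep Season, Team, the cumulative Win/Loss/Draw/Point counts at match train_number,
    # PointDiff_1 through PointDiff_<train_number>, and the target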
    df = df.iloc[:, np.r_[search[0],search[1],search[2],search[3],search[4],search[5], search[6]:search[7],search[8]]]

    #Train Test Split
    df_train = df[df['Season']!=seasons[-1]].drop('Season', axis=1)
    df_test = df[df['Season']==seasons[-1]].drop('Season', axis=1)
    return df_train, df_test

### Dealing with imbalanced dataset

- Paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3648438/
- As stated in the paper, feature selection should be performed before SMOTE, because oversampling the minority class with SMOTE violates the independence assumption.

The target column contains only three positive values per season (the top three teams), so all datasets are imbalanced. To deal with this, we try different over- and under-sampling techniques, including several SMOTE variants.

The ones that gave the best results across the trials below are:

- BorderlineSMOTE
- SVMSMOTE
- OverSampling (SMOTE) + UnderSampling (InstanceHardnessThreshold) 
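
To make the oversampling idea concrete, here is a minimal, simplified sketch of the interpolation step that SMOTE-style methods build on: a synthetic minority sample is placed on the line segment between a minority point and one of its nearest minority neighbours. It is plain Python for illustration only and is not the imbalanced-learn implementation used in the cells below; the function name `smote_like_oversample` and the toy points are hypothetical.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority
    points and their k nearest minority neighbours (simplified SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the base point, excluding itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position on the segment between base and neighbour
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class: three "top-3" teams described by two scaled features
minority_points = [(1.0, 1.2), (1.4, 0.9), (0.8, 1.5)]
new_points = smote_like_oversample(minority_points, n_new=5)
print(len(new_points), "synthetic samples created")
```

Because every synthetic point lies between two existing minority points, the new samples are not independent of the originals, which is the reason given above for selecting features before resampling.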

In [None]:
# Make an identity sampler
class FakeSampler(BaseSampler):

    _sampling_type = 'bypass'

    def _fit_resample(self, X, y):
        return X, y
    
def plot_resampling(X, y, sampling, ax):
    X_res, y_res = sampling.fit_resample(X, y)
    # summarize the new class distribution
    counter = Counter(y_res)
    print(f'Target distribution: {sampling.__class__.__name__}, {counter}')  

    # scatter plot of examples by class label
    for label, _ in counter.items():
        row_ix = np.where(y_res == label)[0]
        ax.scatter(X_res[row_ix, 0], X_res[row_ix, 1], label=str(label))
    ax.legend()
    #ax.title('Target distribution after Resampling')
    #ax.show()
    return Counter(y_res)

In [None]:
# Oversampling for imbalanced datasets
cat_selector = make_column_selector(dtype_exclude=np.number)
num_selector = make_column_selector(dtype_include=np.number)
cats = cat_selector(bundesliga_train)
nums = num_selector(bundesliga_train)
nums.remove('target')
cat_linear_processor = OneHotEncoder(handle_unknown="ignore")
num_linear_processor = make_pipeline(StandardScaler())

linear_preprocessor = make_column_transformer(
    (num_linear_processor, nums), (cat_linear_processor, cats))

# Features and Target are defined 
X = bundesliga_train.drop('target', axis=1)
y = bundesliga_train['target'].values

steps_ = [(('linear_preprocessor', linear_preprocessor))]
pipeline_sp_ = Pipeline(steps=steps_)

X = pipeline_sp_.fit_transform(X)  


fig, ((ax1,ax2,ax3, ax4), (ax5, ax6,ax7,ax8), (ax9, ax10, ax11, ax12), (ax13, ax14, ax15, ax16), (ax17, ax18, ax19,ax20)) = plt.subplots(5, 4, figsize=(20, 20))
sampler = FakeSampler()
clf = imbmake_pipeline(sampler, LinearSVC())
plot_resampling(X, y, sampler, ax1)
ax1.set_title('Original data - y={}'.format(Counter(y)))

# Over + Under Sampling with RandomUnderSampler
over = SMOTE(k_neighbors=5,n_jobs=-1) #sampling_strategy=0.1,
under = RandomUnderSampler() #sampling_strategy=0.5
steps_sp = [('o', over), ('u', under)]
pipeline_sp = imbpipeline(steps=steps_sp)
X_sp, y_sp = pipeline_sp.fit_resample(X, y)
counter = Counter(y_sp)
print(f'target distribution with Over + Under Sampling: {counter}')
for label, _ in counter.items():
    row_ix = np.where(y_sp == label)[0]
    ax2.scatter(X_sp[row_ix, 0], X_sp[row_ix, 1], label=str(label))
ax2.legend()
ax2.set_title('Over (SMOTE) + Under(RandomUnderSampler) Sampling')


ax_arr = (ax3, ax4,ax5, ax6, ax7, ax8, ax9, ax10, ax11, ax12, ax13, ax14, ax15, ax16, ax17, ax18, ax19)

for ax, sampler in zip(ax_arr,   
                               (
                                BorderlineSMOTE(k_neighbors=5,n_jobs=-1),
                                SVMSMOTE(n_jobs=-1),
                                RandomOverSampler(random_state=0),
                                SMOTE(random_state=0),
                                ADASYN(random_state=0),
                                SMOTEENN(),
                                
                                ClusterCentroids(),
                                NearMiss(version=1),
                                NearMiss(version=2),
                                NearMiss(version=3),
                                EditedNearestNeighbours(),
                                RepeatedEditedNearestNeighbours(),
                                AllKNN(allow_minority=True),
                                CondensedNearestNeighbour(random_state=0),
                                OneSidedSelection(random_state=0),
                                NeighbourhoodCleaningRule(),
                                InstanceHardnessThreshold(
    random_state=0, estimator=LogisticRegression(solver='lbfgs',
                                                 multi_class='auto'))
                               )
                      ):
    clf = imbmake_pipeline(sampler, LinearSVC())
    clf.fit(X, y)
    plot_resampling(X, y, sampler, ax)
    ax.set_title('Resampling using {}'.format(sampler.__class__.__name__))
    
# Over + Under Sampling with InstanceHardnessThreshold
over = SMOTE(k_neighbors=5,n_jobs=-1) #sampling_strategy=0.1,
underr = InstanceHardnessThreshold(
    random_state=0, estimator=LogisticRegression(solver='lbfgs',
                                                 multi_class='auto'))
steps_sp = [('o', over), ('u', underr)]
pipeline_sp = imbpipeline(steps=steps_sp)
clf = imbmake_pipeline(over,underr, LinearSVC())
clf.fit(X, y)
X_res, y_res = pipeline_sp.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y_res)
print(f'Target distribution: Over (SMOTE) + Under(InstanceHardnessThreshold) Sampling, {counter}')  

# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = np.where(y_res == label)[0]
    ax20.scatter(X_res[row_ix, 0], X_res[row_ix, 1], label=str(label))
ax20.legend()
ax20.set_title('Over (SMOTE) + Under(InstanceHardnessThreshold) Sampling')

fig.tight_layout()

In [None]:
# Features and Target are defined 
X = bundesliga_train.drop('target', axis=1)
y = bundesliga_train['target'].values

steps_ = [(('linear_preprocessor', linear_preprocessor))]
pipeline_sp_ = Pipeline(steps=steps_)

X = pipeline_sp_.fit_transform(X)  
fig, ((ax1,ax2),(ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 15))

sampler = FakeSampler()
clf = imbmake_pipeline(sampler, LinearSVC())
plot_resampling(X, y, sampler, ax1)
ax1.set_title('Original data - y={}'.format(Counter(y)))

# Over + Under Sampling with InstanceHardnessThreshold
over = SMOTE(k_neighbors=5,n_jobs=-1) #sampling_strategy=0.1,
underr = InstanceHardnessThreshold(
    random_state=0, estimator=LogisticRegression(solver='lbfgs',
                                                 multi_class='auto'))
steps_sp = [('o', over), ('u', underr)]
pipeline_sp = imbpipeline(steps=steps_sp)
clf = imbmake_pipeline(over,underr, LinearSVC(class_weight="balanced"))
clf.fit(X, y)
X_res, y_res = pipeline_sp.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y_res)
print(f'Target distribution: Over (SMOTE) + Under(InstanceHardnessThreshold) Sampling, {counter}')  

# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = np.where(y_res == label)[0]
    ax2.scatter(X_res[row_ix, 0], X_res[row_ix, 1], label=str(label))
ax2.legend()
ax2.set_title('Over (SMOTE) + Under(InstanceHardnessThreshold) Sampling')

ax_arr = (ax3, ax4)

for ax, sampler in zip(ax_arr,   
                               (
                                BorderlineSMOTE(k_neighbors=5,n_jobs=-1),
                                SVMSMOTE(n_jobs=-1)
                                
                               )
                      ):
    clf = imbmake_pipeline(sampler, LinearSVC())
    clf.fit(X, y)
    plot_resampling(X, y, sampler, ax)
    ax.set_title('Resampling using {}'.format(sampler.__class__.__name__))
    
fig.tight_layout()

In [None]:
# The plots below compare the 'borderline-1' and 'borderline-2' variants of BorderlineSMOTE; based on them we select 'borderline-1'
X = bundesliga_train.drop('target', axis=1)
y = bundesliga_train['target'].values

steps_ = [(('linear_preprocessor', linear_preprocessor))]
pipeline_sp_ = Pipeline(steps=steps_)

X = pipeline_sp_.fit_transform(X)  
fig, ((ax1,ax2, ax3)) = plt.subplots(1, 3, figsize=(12, 5))

sampler = FakeSampler()
clf = imbmake_pipeline(sampler, LinearSVC())
plot_resampling(X, y, sampler, ax1)
ax1.set_title('Original data - y={}'.format(Counter(y)))

ax_arr = (ax2,ax3)

for ax, sampler in zip(ax_arr,   
                               (
                                BorderlineSMOTE(k_neighbors=5,n_jobs=-1,random_state=0, kind='borderline-1'),
                                BorderlineSMOTE(k_neighbors=5,n_jobs=-1,random_state=0, kind='borderline-2')
                                
                               )
                      ):
    clf = imbmake_pipeline(sampler, LinearSVC())
    clf.fit(X, y)
    plot_resampling(X, y, sampler, ax)
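    # Include the variant in the title so the two BorderlineSMOTE panels can be told apart
    ax.set_title('Resampling using {} ({})'.format(sampler.__class__.__name__, sampler.kind))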
    
fig.tight_layout()

### Feature Selection 

Here we try several feature selection methods in order to keep the most important features, drop the rest, and reduce the size of the dataset.

The output of each feature selection method is recorded in a data frame as a 0/1 vote per feature. The votes are then summed per feature, and the features whose vote count reaches a chosen threshold are kept.

This lets us combine different feature selection strategies and keep the features that several of them agree on.
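
The voting idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the selector outputs and the helper name `vote_features` are hypothetical, and the threshold of 3 mirrors the `Voting>=3` filter applied later in `baseline_models`.

```python
from collections import Counter


def vote_features(selections, threshold=3):
    """Count how many selectors picked each feature and keep those
    with at least `threshold` votes."""
    votes = Counter()
    for selected in selections.values():
        # Each selector contributes at most one vote per feature
        votes.update(set(selected))
    kept = sorted(f for f, v in votes.items() if v >= threshold)
    return votes, kept


# Hypothetical outputs of four selectors
selections = {
    "RF": ["Point_24", "Win_24", "PointDiff_1"],
    "XGB": ["Point_24", "Win_24"],
    "MI": ["Point_24", "Loss_24", "Win_24"],
    "RFECV": ["Point_24", "Draw_24"],
}
votes, kept = vote_features(selections, threshold=3)
print(kept)
```

Raising the threshold keeps only the features that more selectors agree on; lowering it retains more of the dataset.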

In [None]:
def feature_selection(X,y,feature_names, RF, PIRF,XGB,PIXGB,SFMRF,SFMXGB, MI,RFECVLinearSVC):
    #Define a Voting dataframe
    feature_voting = pd.DataFrame(columns=['Feature_names','RF', 'PI+RF','XGB','PI+XGB', 'SFM+RF', 'SFM+XGB', 'MI', 'RFECV+LinearSVC', 'Voting'])
    feature_voting['Feature_names'] = feature_names

    #Model Based Selectors 
    #Feature Selection with  RandomForestClassifier
    clf = RandomForestClassifier()
    clf.fit(X, y)
    if RF:
        print("\nRF accuracy: %0.3f" % clf.score(X, y))
        print(f'\nTotal # of features found with RandomForestClassifier: {len([i for i in list(clf.feature_importances_) if i>0])}')

        y_ticks = np.arange(0, len(feature_names))
        fig, ax = plt.subplots(figsize=(5,5))
        sorted_idx = clf.feature_importances_.argsort()
        ax.barh(y_ticks, clf.feature_importances_[sorted_idx])
        ax.set_yticks(y_ticks)
        ax.set_yticklabels(feature_names[sorted_idx])
        ax.set_title("Random Forest Feature Importances")
        fig.tight_layout()
        plt.show()
    selection = list(feature_names[(np.asarray([True if i >0 else False for i in list(clf.feature_importances_)]))])
    feature_voting['RF'] = np.where(feature_voting['Feature_names'].isin(selection), 1, 0)


    #Feature Selection with  permutation_importance
    result = permutation_importance(clf, X, y, n_repeats=10,
                                    random_state=42, n_jobs=2)
    if PIRF:
        print(f'\nTotal # of features found with RandomForestClassifier + Permutation_importance: {len([i for i in list(result.importances_mean) if i>0])}')
        y_ticks = np.arange(0, len(feature_names))
        fig, ax = plt.subplots(figsize=(5,5))
        sorted_idx = result.importances_mean.argsort()
        ax.barh(y_ticks, result.importances_mean[sorted_idx])
        ax.set_yticks(y_ticks)
        ax.set_yticklabels(feature_names[sorted_idx])
        ax.set_title("RandomForestClassifier + Permutation Importances Feature Importances")
        fig.tight_layout()
        plt.show()
    selection = list(feature_names[(np.asarray([True if i >0 else False for i in list(result.importances_mean)]))])
    feature_voting['PI+RF'] = np.where(feature_voting['Feature_names'].isin(selection), 1,0)


    #Feature Selection with  XGBClassifier

    xgb = XGBClassifier()
    xgb.fit(X, y)
    if XGB:
        print(f'\nTotal # of features found with XGBClassifier: {len([i for i in list(xgb.feature_importances_) if i>0])}')
        y_ticks = np.arange(0, len(feature_names))
        fig, ax = plt.subplots(figsize=(5,5))
        sorted_idx = xgb.feature_importances_.argsort()
        ax.barh(y_ticks, xgb.feature_importances_[sorted_idx])
        ax.set_yticks(y_ticks)
        ax.set_yticklabels(feature_names[sorted_idx])
        ax.set_title("XGB Feature Importances")
        fig.tight_layout()
        plt.show()
    selection = list(feature_names[(np.asarray([True if i >0 else False for i in list(xgb.feature_importances_)]))])
    feature_voting['XGB'] = np.where(feature_voting['Feature_names'].isin(selection), 1,0)

    #Feature Selection with  permutation_importance
    result = permutation_importance(xgb, X, y, n_repeats=10,
                                    random_state=42, n_jobs=2)
    if PIXGB:
        print(f'\nTotal # of features found with XGBClassifier + Permutation_importance: {len([i for i in list(result.importances_mean) if i>0])}')
        y_ticks = np.arange(0, len(feature_names))
        fig, ax = plt.subplots(figsize=(5,5))
        sorted_idx = result.importances_mean.argsort()
        ax.barh(y_ticks, result.importances_mean[sorted_idx])
        ax.set_yticks(y_ticks)
        ax.set_yticklabels(feature_names[sorted_idx])
        ax.set_title("XGBClassifier + Permutation Importances Feature Importances")
        fig.tight_layout()
        plt.show()
    selection = list(feature_names[(np.asarray([True if i >0 else False for i in list(result.importances_mean)]))])
    feature_voting['PI+XGB'] = np.where(feature_voting['Feature_names'].isin(selection), 1,0)

    #Feature Selection with SelectFromModel RandomForestClassifier

    rf = RandomForestClassifier(n_estimators=100,class_weight ="balanced", random_state=42)
    select = SelectFromModel(rf)  #max_features=20
    selection = select.fit(X, y)
    selected_feat=feature_names[(selection.get_support())]
    if SFMRF:
        print(f'\nOptimal # features with SelectFromModel RandomForestClassifier:{len(selected_feat)}')
        print(f'\nSelected features from SelectFromModel with RandomForestClassifier{selected_feat}')
    feature_voting['SFM+RF'] = np.where(feature_voting['Feature_names'].isin(selected_feat), 1,0)

    #Feature Selection with SelectFromModel XGBClassifier

    rf = XGBClassifier(n_jobs=-1)
    select = SelectFromModel(rf)  #max_features=20
    selection = select.fit(X, y)
    selected_feat=feature_names[(selection.get_support())]
    if SFMXGB:
        print(f'\nOptimal # features SelectFromModel XGBClassifier:{len(selected_feat)}')
        print(f'\nSelected features from SelectFromModel with XGBClassifier {selected_feat}')
    feature_voting['SFM+XGB'] = np.where(feature_voting['Feature_names'].isin(selected_feat), 1,0)
   

    #Filter Methods
    #Feature Selection with  mutual_info_classif
    mutual_information = mutual_info_classif(X, y)
    if MI:
        print(f'\nTotal # of features found with mutual_info_classif: {len([i for i in list(mutual_information) if i>0])}')
        plt.subplots(1, figsize=(26, 1))
        sns.heatmap(mutual_information[:, np.newaxis].T, cmap='Blues', cbar=False, linewidths=1, annot=True)
        plt.yticks([], [])
        plt.gca().set_xticklabels(feature_names, rotation=45, ha='right', fontsize=12)
        plt.suptitle("Variable Importance (mutual_info_classif)", fontsize=18, y=1.2)
        plt.gcf().subplots_adjust(wspace=0.2)
        plt.show()
        pass
    selection = list(feature_names[(np.asarray([True if i >0 else False for i in list(mutual_information)]))])
    feature_voting['MI'] = np.where(feature_voting['Feature_names'].isin(selection), 1,0)

    #Wrapper 
    #Feature Selection with  RFECV + LinearSVC
    min_features_to_select = 1  # Minimum number of features to consider
    rfecv = RFECV(estimator=LinearSVC(class_weight='balanced'), step=1, cv=StratifiedKFold(2),
                  scoring='accuracy',
                  min_features_to_select=min_features_to_select)
    rfecv.fit(X, y)
    if RFECVLinearSVC:
        print("\nOptimal number of features with RFECV + LinearSVC: %d" % rfecv.n_features_)
        # rfecv.support_ is True for the features RFECV kept (ranking_ == 1)
        print(f'\nSelected Features with RFECV + LinearSVC : {list(feature_names[rfecv.support_])}')
        # Plot number of features VS. cross-validation scores
        # (grid_scores_ exists only in older scikit-learn; newer releases expose cv_results_['mean_test_score'] instead)
        plt.figure()
        plt.xlabel("Number of features selected")
        plt.ylabel("Cross validation score (accuracy)")
        plt.plot(range(min_features_to_select,
                       len(rfecv.grid_scores_) + min_features_to_select),
                 rfecv.grid_scores_)
        plt.show()
    
    # Vote for the features RFECV selected (ranking_ == 1), not the eliminated ones
    selection = list(feature_names[rfecv.support_])
    feature_voting['RFECV+LinearSVC'] = np.where(feature_voting['Feature_names'].isin(selection), 1,0)

    feature_voting['Voting'] = feature_voting.iloc[:,1:].sum(axis=1).astype(int)
    feature_voting.sort_values(by='Voting', ascending=False, inplace=True)
    return feature_voting

In [None]:
#Prepare X and y 
cat_selector = make_column_selector(dtype_exclude=np.number)
num_selector = make_column_selector(dtype_include=np.number)
cats = cat_selector(bundesliga_train)
nums = num_selector(bundesliga_train)
nums.remove('target')

cat_linear_processor = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])
num_linear_processor = Pipeline(steps=[('scaler', StandardScaler())])

linear_preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_linear_processor, nums),
        ('cat', cat_linear_processor, cats)])

# Features and Target are defined 
X = bundesliga_train.drop('target', axis=1)
X = X[cats + nums]
y = bundesliga_train['target'].values

steps_ = [(('linear_preprocessor', linear_preprocessor))]
pipeline_sp_ = Pipeline(steps=steps_)

X = pipeline_sp_.fit_transform(X)

#Define feature names in the order the ColumnTransformer outputs them: numeric columns first ('num'), then the one-hot columns ('cat')
feature_names =  np.r_[nums, pipeline_sp_.named_steps['linear_preprocessor'].transformers_[1][1].named_steps['onehot'].get_feature_names(cats)]

df_ = pd.DataFrame(X, columns=feature_names)
df_['target'] = y

feature_voting = feature_selection(X,y,feature_names,True, True,True,True,True,True, True,True)

In [None]:
feature_voting

### Baseline Models:

Before running grid search and hyperparameter optimization, all models are evaluated on the training sets through a pipeline that one-hot encodes the categorical features and standard-scales the numerical features. Feature selection is performed before oversampling, as recommended in the paper cited above.
The models compared, with results kept in a data frame, are: 'Decision Tree', 'Neural Network', 'Logistic Regression', 'Linear SVC', 'K-Nearest Neighbors', 'Gradient Boosting Classifier', 'Extra Tree Classifier', 'Adaptive Boosting Classifier', 'Random Forest', 'Extreme Gradient Boosting' and lastly 'Stacking Classifier' as an ensemble method. To evaluate these models, we use shuffled, stratified 5-fold cross-validation with ROC-AUC as the scoring metric. Each fold's results are saved in a data frame. We then take the mean and standard deviation of each model's fold scores and build a final table, ranked by mean test ROC-AUC and its standard deviation. The results are shown below.
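
The ranking rule can be illustrated without any modelling code. The sketch below, in plain Python, mirrors the `test_rank` computation in `baseline_models`: models are ranked by high mean test score and by low standard deviation, the two ranks are summed, and the sums are ranked again and truncated to integers. The scores are made up for illustration, and `average_ranks` and `combined_rank` are hypothetical helper names that imitate average ranking of ties.

```python
def average_ranks(values, ascending=True):
    """Rank values starting at 1; tied values share the average of their ranks."""
    order = sorted(values, reverse=not ascending)
    return [order.index(v) + (order.count(v) + 1) / 2 for v in values]


def combined_rank(test_means, test_stds):
    """Rank models by high mean score and low standard deviation combined."""
    mean_rank = average_ranks(test_means, ascending=False)
    std_rank = average_ranks(test_stds, ascending=True)
    totals = [m + s for m, s in zip(mean_rank, std_rank)]
    # Truncate to int, as the notebook does with .astype(int)
    return [int(r) for r in average_ranks(totals, ascending=True)]


models = ["Decision Tree", "Logistic Regression", "Random Forest"]
test_means = [0.80, 0.92, 0.95]   # hypothetical mean ROC-AUC per model
test_stds = [0.10, 0.03, 0.05]    # hypothetical standard deviation per model
ranks = combined_rank(test_means, test_stds)
for name, rank in sorted(zip(models, ranks), key=lambda pair: pair[1]):
    print(rank, name)
```

Note that two models can share the same rank: here the model with the best mean and the model with the lowest spread tie, so the rule does not always produce a unique winner.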


In [None]:
def baseline_models(df):

    print(f'Shape of the dataset:{df.shape}')
    #Here we are defining categoric and numerical variables, Target column is removed from numerical list
    cat_selector = make_column_selector(dtype_exclude=np.number)
    num_selector = make_column_selector(dtype_include=np.number)
    cats = cat_selector(df)
    nums = num_selector(df)
    nums.remove('target')

    #print('Categorical Variables:',cats)
    #print('Numerical Variables:',nums)

    # Features and Target are defined 
    X = df.drop('target', axis=1)
    y = df['target'].values
    # summarize the new class distribution
    counter = Counter(y)
    print(f'Target distribution: {counter}')

    # To prevent data leakage during cross-validation, preprocessing is wrapped in a pipeline so that it is fitted on each training split only.
    # Categorical columns are one-hot encoded and numerical columns are standard scaled.

    cat_linear_processor = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])
    num_linear_processor = Pipeline(steps=[('scaler', StandardScaler())])
        
    linear_preprocessor = ColumnTransformer(
        transformers=[
            ('num', num_linear_processor, nums),
            ('cat', cat_linear_processor, cats)])  

    # Fit the preprocessor on the whole training set to obtain the transformed features used for feature selection, before cross-validation
    steps_ = [(('linear_preprocessor', linear_preprocessor))]
    pipeline_sp_ = Pipeline(steps=steps_)
    X = pipeline_sp_.fit_transform(X)
    
    #Define feature names in the order the ColumnTransformer outputs them: numeric columns first ('num'), then the one-hot columns ('cat')
    feature_names =  np.r_[nums, pipeline_sp_.named_steps['linear_preprocessor'].transformers_[1][1].named_steps['onehot'].get_feature_names(cats)]

    df_ = pd.DataFrame(X, columns=feature_names)
    df_['target'] = y
    
    #Feature Selection 
    feature_voting = feature_selection(X,y,feature_names,False, False,False,False,False,False, False,False)
    selected_feature = list(feature_voting[feature_voting['Voting']>=3].Feature_names.unique()) + ['target']
    
    # Prepare a dataframe consist of selected features 
    df_ = df_[selected_feature]
    cats = cat_selector(df_)
    nums = num_selector(df_)
    nums.remove('target')
    
    linear_preprocessor = ColumnTransformer(
        transformers=[
            ('num', num_linear_processor, nums),
            ('cat', cat_linear_processor, cats)])  
    
    # Prep X-Features and y-Target
    X = df_.drop('target', axis=1)
    #X = X[cats + nums]
    y = df_['target'].values
    #########################################
    # To observe target distribution prep X_ 
    steps_ = [(('linear_preprocessor', linear_preprocessor))]
    pipeline_sp_ = Pipeline(steps=steps_)

    X_ = pipeline_sp_.fit_transform(X)
    
    # SMOTE
    sampler = BorderlineSMOTE(k_neighbors=5,n_jobs=-1,random_state=123, kind='borderline-1')
    
    # Extract Target distribution from selected SMOTE result  
    clf_ = imbmake_pipeline(sampler, LinearSVC())
    clf_.fit(X_, y) 
    stepss = [(('sampler', sampler))]
    pipeline_sp = imbpipeline(steps=stepss)
    X_res, y_res = pipeline_sp.fit_resample(X_, y)
    counter = Counter(y_res)
    print(f'Target distribution -> BorderlineSMOTE Sampling, {counter}')
    print(f'Dataset Shape after -> Feature Selection and BorderlineSMOTE Sampling, {X_res.shape}')
    #########################################
    def get_stacking():
        # define the base models
        level0 = list()
        level0.append(('lr', LogisticRegression()))
        level0.append(('knn', KNeighborsClassifier()))
        level0.append(('cart', DecisionTreeClassifier()))
        level0.append(('svm', LinearSVC()))
        level0.append(('bayes', GaussianNB()))
        level0.append(('rf', RandomForestClassifier()))
        level0.append(('gbc', GradientBoostingClassifier()))
        level0.append(('ec', ExtraTreesClassifier()))
        level0.append(('abc', AdaBoostClassifier(DecisionTreeClassifier())))
        level0.append(('xgb', XGBClassifier()))
        # define meta learner model
        level1 = XGBClassifier()
        # define the stacking ensemble
        model = StackingClassifier(estimators=level0, final_estimator=level1, cv=5)
        return model
    

    #For each classification model we prepare a pipeline, then combine them in a list. 
    dt_pipeline = imbpipeline(
        steps=[('linear_preprocessor', linear_preprocessor),
                                    ('sampler', sampler),
                                    ('clf', DecisionTreeClassifier(random_state=42))])

    nn_pipeline = imbpipeline(
        steps=[('linear_preprocessor', linear_preprocessor),
                                    ('sampler', sampler),
                                    ('clf', MLPClassifier(random_state=42))])

    lg_pipeline = imbpipeline(
        steps=[('linear_preprocessor', linear_preprocessor),
                                    ('sampler', sampler),
                                    ('clf', LogisticRegression(random_state=42))])

    lsvc_pipeline = imbpipeline(
        steps=[('linear_preprocessor', linear_preprocessor),
                                    ('sampler', sampler),
                                    ('clf', LinearSVC(class_weight="balanced",random_state=42))])

    knn_pipeline = imbpipeline(
        steps=[('linear_preprocessor', linear_preprocessor),
                                    ('sampler', sampler),
                                    ('clf', KNeighborsClassifier())])
    
    gbc_pipeline = imbpipeline(
        steps=[('linear_preprocessor', linear_preprocessor),
                                    ('sampler', sampler),
                                    ('clf', GradientBoostingClassifier())])
    ec_pipeline = imbpipeline(
        steps=[('linear_preprocessor', linear_preprocessor),
                                    ('sampler', sampler),
                                    ('clf', ExtraTreesClassifier())])
    abc_pipeline = imbpipeline(
        steps=[('linear_preprocessor', linear_preprocessor),
                                    ('sampler', sampler),
                                    ('clf', AdaBoostClassifier(DecisionTreeClassifier()))])
    
    xgb_pipeline = imbpipeline(
        steps=[('linear_preprocessor', linear_preprocessor),
                                    ('sampler', sampler),
                                    ('clf', XGBClassifier())])
    
    rf_pipeline = imbpipeline(
        steps=[('linear_preprocessor', linear_preprocessor),
                                    ('sampler', sampler),
                                    ('clf', RandomForestClassifier())])
    
    Stacked = imbpipeline(
        steps=[('linear_preprocessor', linear_preprocessor),
                                    ('sampler', sampler),
                                    ('clf',  get_stacking())])


    pp = [dt_pipeline,nn_pipeline,lg_pipeline,lsvc_pipeline,knn_pipeline,gbc_pipeline,ec_pipeline,abc_pipeline,xgb_pipeline,rf_pipeline,Stacked]
    pp_dict = {0: 'Decision Tree', 1: 'Neural Network',2: 'Logistic Regression',3:'LinearSVC',4:'KNN', 5:'Gradient BC', 6:'ExtraTree C', 7:'AdaBoost C',8:'XGB',9:'Random Forest', 10:'Stacked'}

    # 5 folds are defined with shuffling. We select stratified to have the same distribution in target 
    kf=StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    
    clfs = []
    train_all = []
    trains = []
    trains_std = []
    tests = []
    test_all = []
    tests_std = []

    for idx, gs in enumerate(pp):
        results = cross_validate(gs, X, y,cv=kf, return_train_score=True, scoring='roc_auc',n_jobs=-1)
        clfs.append(pp_dict[idx])
        trains.append(results['train_score'].mean())
        train_all.append(results['train_score'])
        trains_std.append(results['train_score'].std())
        tests.append(results['test_score'].mean())
        test_all.append(results['test_score'])
        tests_std.append(results['test_score'].std())

    folds = pd.DataFrame(columns=clfs)
    for i,j in zip(clfs,test_all):
        folds[i] = j  
        
    display(folds)    
        
    tables = pd.DataFrame()
    tables['model'] = clfs
    tables['train'] = trains
    tables['train_std'] = trains_std
    tables['test'] = tests
    tables['test_std'] = tests_std
    tables['test_rank'] = (tables['test'].rank(ascending=False) + tables['test_std'].rank(ascending=True)).rank(ascending=True).astype(int)
    tables['type'] = 'baseline'
    tables.sort_values(by='test_rank', inplace=True)
    display(tables)
    # plot model performance for comparison
    plt.boxplot(test_all, labels=clfs, showmeans=True)
    plt.title('Test Results')
    plt.xticks(rotation=45)
    plt.show()

    return tables


In [None]:
tables = baseline_models(bundesliga_train)
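
The `test_rank` column built inside `baseline_models` combines two rankings: models with a higher mean test score rank better, and models with a smaller spread across folds rank better. A minimal sketch of that idea follows. The scores are illustrative, not results from this notebook, and `method="min"` is an assumption added here so that ties stay whole numbers.

```python
import pandas as pd

# Illustrative cross-validation summaries (made-up numbers).
tables = pd.DataFrame({
    "model": ["A", "B", "C"],
    "test": [0.90, 0.85, 0.80],       # mean ROC AUC over the folds
    "test_std": [0.05, 0.01, 0.03],   # standard deviation over the folds
})

# Rank 1 = highest mean test score.
mean_rank = tables["test"].rank(ascending=False)
# Rank 1 = smallest spread across folds.
std_rank = tables["test_std"].rank(ascending=True)

# Add the two ranks, then rank the sums (assumption: method="min" for ties).
tables["test_rank"] = (mean_rank + std_rank).rank(method="min").astype(int)

print(tables.sort_values("test_rank"))
```

In this toy case model B wins: it has only the second-best mean score but is the most stable across folds.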

**Hyperparameter Optimization, Overfit/Underfit Analysis**: For each model, hyperparameter optimization is performed over the given parameter grids using grid search with cross-validation. The selected models are kept in a dictionary, as shown below, and the top-ranked one is then used to predict the validation set, which was defined at the beginning of the split as the last season. Model accuracies are then combined, including balanced accuracy and the precision, recall, and F1 scores from the classification report for the true and false classes. An example from one of the reports is shown below. This final result table is used to analyze overfitting and to create a leaderboard for the given leagues.
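
Balanced accuracy is reported alongside the other metrics because the target is imbalanced: only the top three teams of a season belong to the true class. The following sketch computes it by hand for a toy 18-team season (the predictions are an assumption for illustration) to show why plain accuracy would be misleading here.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls for a binary 0/1 target."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

# Toy season: 3 top-three teams (1) among 18, and a model that
# predicts "not top three" (0) for every team.
y_true = [1, 1, 1] + [0] * 15
y_pred = [0] * 18

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                      # 15 of 18 correct
print(balanced_accuracy(y_true, y_pred))   # recall 1.0 and 0.0 averaged
```

The trivial model looks good on plain accuracy, yet its balanced accuracy is only 0.5, the same as guessing.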

In [None]:
def hyp_optimization_model_selection(df, df_test):

    print(f'Shape of the dataset:{df.shape}')
    #Here we are defining categoric and numerical variables, Target column is removed from numerical list
    cat_selector = make_column_selector(dtype_exclude=np.number)
    num_selector = make_column_selector(dtype_include=np.number)
    cats = cat_selector(df)
    nums = num_selector(df)
    nums.remove('target')

    #print('Categorical Variables:',cats)
    #print('Numerical Variables:',nums)

    # Features and Target are defined 
    X = df.drop('target', axis=1)
    y = df['target'].values
    # summarize the new class distribution
    counter = Counter(y)
    print(f'Target distribution: {counter}')

    # To prevent data leakage during cross validation we prepare a pipeline to handle each splits. Categorical columns are transformed with OneHotEncoding, and numerical columns are standard scaled. 

    cat_linear_processor = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])
    num_linear_processor = Pipeline(steps=[('scaler', StandardScaler())])
        
    linear_preprocessor = ColumnTransformer(
        transformers=[
            ('num', num_linear_processor, nums),
            ('cat', cat_linear_processor, cats)])  

    #Create to observe resulted features, targets before cv
    steps_ = [(('linear_preprocessor', linear_preprocessor))]
    pipeline_sp_ = Pipeline(steps=steps_)
    X = pipeline_sp_.fit_transform(X)
    
    #Define feature names in the same order as the ColumnTransformer output: numerical columns first, then one-hot columns
    feature_names =  np.r_[nums, pipeline_sp_.named_steps['linear_preprocessor'].transformers_[1][1].named_steps['onehot'].get_feature_names(cats)]

    df_ = pd.DataFrame(X, columns=feature_names)
    df_['target'] = y
    
    #Feature Selection 
    feature_voting = feature_selection(X,y,feature_names,False, False,False,False,False,False, False,False)
    selected_feature = list(feature_voting[feature_voting['Voting']>=3].Feature_names.unique()) + ['target']
    
    # Prepare a dataframe consist of selected features 
    df_ = df_[selected_feature]
    cats = cat_selector(df_)
    nums = num_selector(df_)
    nums.remove('target')
    
    linear_preprocessor = ColumnTransformer(
        transformers=[
            ('num', num_linear_processor, nums),
            ('cat', cat_linear_processor, cats)])  
    
    # Prep X-Features and y-Target
    X = df_.drop('target', axis=1)
    #X = X[cats + nums]
    y = df_['target'].values
 #########################################
    # To observe target distribution prep X_ 
    steps_ = [(('linear_preprocessor', linear_preprocessor))]
    pipeline_sp_ = Pipeline(steps=steps_)

    X_ = pipeline_sp_.fit_transform(X)
    
    # SMOTE
    sampler = BorderlineSMOTE(k_neighbors=5,n_jobs=-1,random_state=123, kind='borderline-1')
    
    # Extract Target distribution from selected SMOTE result  
    clf_ = imbmake_pipeline(sampler, LinearSVC())
    clf_.fit(X_, y)  
    stepss = [(('sampler', sampler))]
    pipeline_sp = imbpipeline(steps=stepss)  
    X_res, y_res = pipeline_sp.fit_resample(X_, y)
    counter = Counter(y_res)
    print(f'Target distribution -> BorderlineSMOTE Sampling, {counter}')
    print(f'Dataset Shape after -> Feature Selection and BorderlineSMOTE Sampling, {X_res.shape}')
 #########################################

    
    def get_stacking():
        # define the base models
        level0 = list()
        level0.append(('lr', LogisticRegression()))
        level0.append(('knn', KNeighborsClassifier()))
        level0.append(('cart', DecisionTreeClassifier()))
        level0.append(('svm', LinearSVC()))
        level0.append(('bayes', GaussianNB()))
        level0.append(('rf', RandomForestClassifier()))
        level0.append(('gbc', GradientBoostingClassifier()))
        level0.append(('ec', ExtraTreesClassifier()))
        level0.append(('xgb', XGBClassifier()))
        # define meta learner model
        level1 = XGBClassifier()
        # define the stacking ensemble
        model = StackingClassifier(estimators=level0, final_estimator=level1, cv=5)
        return model


    
    grid_params_dt = [ {"clf__criterion" : ["gini", "entropy"],
                  "clf__splitter" :   ["best", "random"],
                  "clf__max_depth" :[None,2,4,6,8,10,12]} ]    
    
    
    grid_params_nn = [{'clf__hidden_layer_sizes': [(10,10,10), (10,10,10,10), (10,10,10,10,10), (10,10,10,10,10,10)], 'clf__alpha': list(10.0 ** -np.arange(1, 6))}]
    
    grid_params_lr = [{"clf__C": [1.0],  'clf__class_weight':[None,'balanced']}] 

    
    grid_params_svc= [{'clf__penalty': ['l2','l1'], 
                       'clf__class_weight': [None,'balanced'],
                       'clf__dual': [True, False],
                       'clf__C': [0.01,0.1,1.0,10],
                       'clf__tol':[0.001,0.0008,0.0009,0.0011]}]
    
    
    grid_params_knn= [{'clf__n_neighbors': [4,5,6,7], 
                  'clf__leaf_size': [10,20,30,40]}]
    
    
    grid_params_gbc = [{'clf__loss' : ["deviance"],
                 'clf__n_estimators' : [450,460,500],
                 'clf__learning_rate': [0.1,0.11],
                 'clf__max_depth': [7,8],
                 'clf__min_samples_leaf': [30,40],
                 'clf__max_features': [0.1,0.4,0.6]}]    

    
    grid_params_ec = [{"clf__max_depth": [3, 4, 5],
                 "clf__max_features": [3, 10, 15],
                 "clf__min_samples_split": [2, 3, 4],
                 "clf__min_samples_leaf": [1, 2],
                 "clf__bootstrap": [False,True],
                 "clf__n_estimators" :[100,200,300],
                 "clf__criterion": ["gini","entropy"]} ] 

    
    grid_params_xgb = [{'clf__learning_rate': [0.1,0.04,0.01], 
                  'clf__max_depth': [3,4,5,6,7],
                  'clf__n_estimators': [100,350,400,450,2000], 
                  'clf__gamma': [0,1,5,8],
                  'clf__subsample': [0.8,0.95,1.0]}]
    
    grid_params_rf = [{'clf__max_depth': [50, 150, 250], 'clf__min_samples_split':[2,3,4], 'clf__min_samples_leaf':[1,2,3,4], "clf__bootstrap": [True],
           "clf__n_estimators" :[50,80],
           "clf__criterion": ["gini","entropy"],
           "clf__max_leaf_nodes":[26,28],
           "clf__min_impurity_decrease":[0.0],
           "clf__min_weight_fraction_leaf":[0.0]}]

    grid_params_s = [{'clf__passthrough':[True,False]}] 

    
    #For each classification model we prepare a pipeline, then combine them in a list. 
    dt_pipeline = imbpipeline(
        steps=[('linear_preprocessor', linear_preprocessor),
                                    ('sampler', sampler),
                                    ('clf', DecisionTreeClassifier(random_state=42))])

    nn_pipeline = imbpipeline(
        steps=[('linear_preprocessor', linear_preprocessor),
                                    ('sampler', sampler),
                                    ('clf', MLPClassifier(random_state=42))])

    lg_pipeline = imbpipeline(
        steps=[('linear_preprocessor', linear_preprocessor),
                                    ('sampler', sampler),
                                    ('clf', LogisticRegression(random_state=42))])

    lsvc_pipeline = imbpipeline(
        steps=[('linear_preprocessor', linear_preprocessor),
                                    ('sampler', sampler),
                                    ('clf', LinearSVC(class_weight="balanced",random_state=42))])

    knn_pipeline = imbpipeline(
        steps=[('linear_preprocessor', linear_preprocessor),
                                    ('sampler', sampler),
                                    ('clf', KNeighborsClassifier())])
    
    gbc_pipeline = imbpipeline(
        steps=[('linear_preprocessor', linear_preprocessor),
                                    ('sampler', sampler),
                                    ('clf', GradientBoostingClassifier())])
    ec_pipeline = imbpipeline(
        steps=[('linear_preprocessor', linear_preprocessor),
                                    ('sampler', sampler),
                                    ('clf', ExtraTreesClassifier())])
    
    xgb_pipeline = imbpipeline(
        steps=[('linear_preprocessor', linear_preprocessor),
                                    ('sampler', sampler),
                                    ('clf', XGBClassifier())])

    rf_pipeline = imbpipeline(
        steps=[('linear_preprocessor', linear_preprocessor),
                                    ('sampler', sampler),
                                    ('clf', RandomForestClassifier())])
    
    Stacked = imbpipeline(
        steps=[('linear_preprocessor', linear_preprocessor),
                                    ('sampler', sampler),
                                    ('clf',  get_stacking())])
    
    pp = [dt_pipeline, nn_pipeline, lg_pipeline, lsvc_pipeline, knn_pipeline, gbc_pipeline,  xgb_pipeline, rf_pipeline, Stacked] #ec_pipeline,

    # 5 folds are defined with shuffling. We select stratified to have the same distribution in target 
    kf=StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    


    scoring='roc_auc'
    gs_dt = GridSearchCV(estimator=dt_pipeline, param_grid=grid_params_dt, scoring=scoring, cv=kf)
    gs_nn = GridSearchCV(estimator=nn_pipeline, param_grid=grid_params_nn, scoring=scoring, cv=kf)
    gs_lr = GridSearchCV(estimator=lg_pipeline, param_grid=grid_params_lr, scoring=scoring, cv=kf)
    gs_svc = GridSearchCV(estimator=lsvc_pipeline, param_grid=grid_params_svc, scoring=scoring, cv=kf)
    gs_knn = GridSearchCV(estimator=knn_pipeline, param_grid=grid_params_knn, scoring=scoring, cv=kf)
    gs_gbc = GridSearchCV(estimator=gbc_pipeline, param_grid=grid_params_gbc, scoring=scoring, cv=kf)
    gs_ec = GridSearchCV(estimator=ec_pipeline, param_grid=grid_params_ec, scoring=scoring, cv=kf) 
    gs_xgb = GridSearchCV(estimator=xgb_pipeline, param_grid=grid_params_xgb, scoring=scoring, cv=kf)
    gs_rf = GridSearchCV(estimator=rf_pipeline, param_grid=grid_params_rf, scoring=scoring, cv=kf)
    gs_s = GridSearchCV(estimator=Stacked, param_grid=grid_params_s, scoring=scoring, cv=kf)

    # List of pipelines for ease of iteration
    grids = [gs_dt, gs_nn, gs_lr, gs_svc, gs_knn, gs_gbc,gs_xgb, gs_rf, gs_s] #

    # Dictionary of pipelines and classifier types for ease of reference
    grid_dict = {0: 'Decision Tree', 1: 'Neural Network',2: 'Logistic Regression',3:'LinearSVC',4:'KNN', 5:'Gradient BC', 6:'XGB',7:'Random Forest', 8:'Stacked'}  #6:'ExtraTree C', 7:'XGB',8:'Random Forest', 9:'Stacked'

    clfs = []
    train_all = []
    trains = []
    trains_std = []
    tests = []
    test_all = []
    tests_std = []
    best_params_results = dict()    
    
    start_time=time.time()
    for idx, gs in enumerate(grids):
        start_time_=time.time()
        print('\nTask: %s is processing!' % grid_dict[idx])    

        # Non_nested parameter search and scoring
        gs.fit(X, y)
        best_regressor = gs.best_estimator_
        best_params = gs.best_params_
        print('Best parameters for {} = {}'.format(grid_dict[idx],best_params))
        print(f'Best score with {grid_dict[idx]} is {gs.best_score_}')
        best_params_results[grid_dict[idx]] = best_regressor
        results  = cross_validate(best_regressor, X, y, cv=kf, return_train_score=True, scoring=scoring, n_jobs=-1)
        clfs.append(grid_dict[idx])
        trains.append(results['train_score'].mean())
        train_all.append(results['train_score'])
        trains_std.append(results['train_score'].std())
        tests.append(results['test_score'].mean())
        test_all.append(results['test_score'])
        tests_std.append(results['test_score'].std())
        print('\n')
        time_took = time.time() - start_time_
        print(f"Total runtime for {grid_dict[idx]} Grid Search: {hms_string(time_took)}")
    
    time_took = time.time() - start_time
    print(f"Total runtime for Grid Search: {hms_string(time_took)}")


    folds = pd.DataFrame(columns=clfs)
    for i,j in zip(clfs,test_all):
        folds[i] = j  
        
    display(folds)    
        
    tables = pd.DataFrame()
    tables['model'] = clfs
    tables['train'] = trains
    tables['train_std'] = trains_std
    tables['test'] = tests
    tables['test_std'] = tests_std
    tables['test_rank'] = (tables['test'].rank(ascending=False) + tables['test_std'].rank(ascending=True)).rank(ascending=True).astype(int)
    tables['type'] = 'gridsearch'
    tables.sort_values(by='test_rank', inplace=True)
    display(tables)
    # plot model performance for comparison
    plt.boxplot(test_all, labels=clfs, showmeans=True)
    plt.title('Test Results')
    plt.xticks(rotation=45)
    plt.show()
    
    #Get ranked 1 model and predict testvalue 
    winner_model = tables['model'].values[0] #tables[tables['test_rank']==1]['model'].values[0]
    model = best_params_results[winner_model]
    
    # Prep X-Features and y-Target
    test_results = df_test.copy()
    test_results = test_results[['Team','target']]

    print(f'Shape of the test dataset:{df_test.shape}')

    cats = cat_selector(df_test)
    nums = num_selector(df_test)
    nums.remove('target')

    # Features and Target are defined 
    X = df_test.drop('target', axis=1)
    #X = [cats+nums]
    y = df_test['target'].values
    # summarize the new class distribution
    counter = Counter(y)
    print(f'Target distribution of Test Set: {counter}')

    cat_linear_processor = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])
    num_linear_processor = Pipeline(steps=[('scaler', StandardScaler())])


    linear_preprocessor = ColumnTransformer(
        transformers=[
            ('num', num_linear_processor, nums),
            ('cat', cat_linear_processor, cats)]) 

    #Create to observe resulted features, targets before cv
    steps_ = [(('linear_preprocessor', linear_preprocessor))]
    pipeline_sp_ = Pipeline(steps=steps_)
    X = pipeline_sp_.fit_transform(X)

    #Define feature names in the same order as the ColumnTransformer output: numerical columns first, then one-hot columns
    feature_names =  np.r_[nums, pipeline_sp_.named_steps['linear_preprocessor'].transformers_[1][1].named_steps['onehot'].get_feature_names(cats)]

    df_test = pd.DataFrame(X, columns=feature_names)
    df_test['target'] = y

    for i in list(df_test.columns):
        if i not in selected_feature:
            df_test.drop(i, axis=1, inplace=True)

    for i in selected_feature:
        if i not in list(df_test.columns):
            df_test[i] = 0

    df_test=df_test[selected_feature]

    # Prep X-Features and y-Target
    X = df_test.drop('target', axis=1)
    #X = X[cats + nums]
    y = df_test['target'].values
    
    preds = model.predict(X)
    test_results["Predicted"] = list(preds)
    print('Predicted Test Set Results:')
    display(test_results)

    print('Classification report from test results:\n',classification_report(y, preds))

    balanced_accuracy = balanced_accuracy_score(y, preds)
    # Compute the per-class metrics once: index 0 is the false class, index 1 is the true class
    precision, recall, fscore, _ = precision_recall_fscore_support(y, preds)
    precision_fc, precision_tc = precision[0], precision[1]
    recall_fc, recall_tc = recall[0], recall[1]
    fscore_fc, fscore_tc = fscore[0], fscore[1]



    all_results = {}
    all_results['winner_model_name'] = [winner_model]
    all_results['time_for_model_selection'] = [hms_string(time_took)]
    all_results['balanced_accuracy'] = balanced_accuracy
    all_results['precision_tc'] = precision_tc
    all_results['recall_tc'] = recall_tc
    all_results['fscore_tc'] = fscore_tc
    all_results['precision_fc'] = precision_fc
    all_results['recall_fc'] = recall_fc
    all_results['fscore_fc'] = fscore_fc
    
    
    all_results = pd.DataFrame.from_dict(all_results)
    
    return tables, test_results, all_results, winner_model





In [None]:
tables, test_results, all_results, winner_model = hyp_optimization_model_selection(bundesliga_train,bundesliga_test)
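
The grid search pattern used in the function above can be reduced to a few lines. The sketch below is a simplified stand-in, not the notebook's pipeline: it uses synthetic data from `make_classification`, a plain scikit-learn `Pipeline`, and omits the SMOTE sampler step so that it runs without `imblearn`. It shows how the `clf__` prefix routes each grid entry to the pipeline step named `clf`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for a league table (assumption).
X, y = make_classification(n_samples=200, n_features=8,
                           weights=[0.8, 0.2], random_state=0)

pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("clf", DecisionTreeClassifier(random_state=42))])

# "clf__<name>" sets parameter <name> on the step called "clf".
param_grid = {"clf__criterion": ["gini", "entropy"],
              "clf__max_depth": [None, 2, 4]}

# Same fold definition as in the notebook: stratified, shuffled, 5 splits.
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
gs = GridSearchCV(pipe, param_grid=param_grid, scoring="roc_auc", cv=kf)
gs.fit(X, y)

# best_estimator_ is the pipeline refit on all data with the best parameters.
best_model = gs.best_estimator_
print(gs.best_params_, round(gs.best_score_, 3))
```

Because scaling happens inside the pipeline, it is refit on the training part of every fold, which is the leakage protection the notebook's comments refer to.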

### Model Selection: 

As described in the motivation of the project, to measure competitiveness we prepared a list of percentages from 0.3 to 0.9 in steps of 0.1, cut each dataset at these levels, and predicted the target for each league. For example, 0.3 means we use the first 30% of the matches played during the season to predict the target. Preparing the final results for all leagues took 13h39m; the results were then saved in CSV format. Model prediction results based on the baseline models, the grid search, and the balanced accuracy score are presented below.
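
To make the percentage cut concrete, here is a minimal sketch of truncating one team's form string to the first part of the season and counting results. The helper `first_fraction` and the 10-match form string are hypothetical; the notebook's own `prep_step_2` is not shown in this section and may work differently.

```python
from collections import Counter

def first_fraction(form, percentage):
    """Keep the first `percentage` of a team's match results (hypothetical helper)."""
    n_keep = int(round(len(form) * percentage))
    return form[:n_keep]

# Illustrative 10-match season, using the dataset's "W"/"D"/"L" notation.
form = list("WWDLWDLLWW")

# 0.3 ... 0.9 built from integers to avoid floating-point drift.
percentages = [p / 10 for p in range(3, 10)]

early = first_fraction(form, 0.3)
counts = Counter(early)
features = {result: counts.get(result, 0) for result in "WDL"}

print(early)      # the first 30% of the matches
print(features)   # wins, draws, and losses seen so far
print([len(first_fraction(form, p)) for p in percentages])
```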

### Model pipelines for all datasets 


- We examine the prediction results of each league separately to find out how competitive each league is.
- If the final ranking of a league can already be predicted early in the season, this suggests that the league's level of competition is low.
- The prediction results of each winning model are recorded in a table so that they can be compared with the grid search and baseline model results.
- Similarly, the accuracy of the winning models on the validation sets (the last season, held out in the previous steps) is also recorded in this table.



In [None]:
#Create below lists to be able to iterate over 

columns = ['League','Percentage','winner_model_name','time_for_model_selection','time_for_percentage_evaluation','Baseline_model_accuracy','Gridsearch_model_accuracy','balanced_accuracy','precision_tc', 'recall_tc','fscore_tc','precision_fc','recall_fc','fscore_fc']

leaderboard =[]
percentages = list(np.round(np.arange(0.3,1.0,0.1),2))

start_time_=time.time()
for i in percentages:
    all_dfs = prep_step_1(path_df,path_bundesliga,path_epl,path_laliga,path_mls)
    labels = ['bundesliga','epl','laliga','mls']
    for j, k  in zip(labels,all_dfs):
        start_time=time.time()
        print(f'{j} with {round(i*100)}% is processing!')
        df_train, df_test = prep_step_2(k, percentage=i)
        print(f'{j} baseline models started!')
        tables_baseline = baseline_models(df_train)
        print(f'{j} model selection with grid search started!')
        tables, test_results, all_results, winner_model =  hyp_optimization_model_selection(df_train,df_test)
        time_took = time.time() - start_time
        print(f"Total runtime for processing {j}: {hms_string(time_took)}")
        all_results['time_for_percentage_evaluation'] = hms_string(time_took)
        all_results['League'] = j
        all_results['Percentage'] = round(i*100)
        all_results['Baseline_model_accuracy'] = tables_baseline[tables_baseline['model']==winner_model]['test'].values[0]
        all_results['Gridsearch_model_accuracy'] =tables[tables['model']==winner_model]['test'].values[0]
        leaderboard.append(all_results)
        
time_took = time.time() - start_time_
print(f"Total runtime for all iteration: {hms_string(time_took)}")

leaderboard = pd.concat(leaderboard)
leaderboard = leaderboard[columns]
leaderboard.to_csv('leaderboard.csv', index=0)


In [None]:
leaderboard = pd.read_csv('../input/model-selection-outputs/leaderboard.csv');leaderboard

### Result evaluations

According to the result table obtained, the following examinations were made:

- **Winner Model Distributions**: The distribution of the selected models (those that gave the best results among all algorithms tested) across the leagues, broken down by the percentage of the season at which the predictions were made. As shown in the graph below, grid search chose Linear SVC as the prediction model for all leagues except La Liga, for which it mostly chose Logistic Regression and Neural Networks.


- **Model Accuracy Results**: The success rates of all model results, per league, at given percentages of the matches played.



In [None]:
all_dfs = [leaderboard[leaderboard['League']=='bundesliga'],
          leaderboard[leaderboard['League']=='epl'],
          leaderboard[leaderboard['League']=='laliga'],
          leaderboard[leaderboard['League']=='mls'],]
labels = ['bundesliga','epl','laliga','mls']

fig = plt.figure(figsize=(20,10))
plt.suptitle('Winner Model Distributions', fontsize= 20,fontweight= 'bold')
plt.subplots_adjust(wspace=0.1, hspace=0.3)
a = 2 
b = 2 
c = 1 

for t,z in enumerate(zip(labels,all_dfs)):
    plt.subplot(a, b, c+t)
    sns.histplot(binwidth=0.5, x="winner_model_name", hue="Percentage", data=z[1], stat="count", multiple="stack")
    plt.title(z[0], fontdict={'fontsize': 15,
        'fontweight': 'bold'})
    #plt.axis('off')

plt.show() 
fig.savefig('Winner.Model.Distributions.jpeg', dpi=350, bbox_inches='tight')

In the Balanced accuracy row (the first row of the graph below), we see that Major League Soccer (MLS) stays stable for weeks and then becomes more predictable as the end of the season approaches. Based on this observation, we can say that it maintains its competitiveness up to about 70% of the season. Of all the leagues, only the Bundesliga shows a sharp decline in the middle of the season, whereas at the beginning of the season it was at a predictable level of competitiveness. In contrast, La Liga's predictability continues to increase up to 60% of the season and then drops sharply, so we conclude that its competitiveness increases rapidly towards the end of the season. The Premier League starts with a high prediction accuracy and continues with small ups and downs throughout the season, closing the season with an upward trend. Therefore, we can say that its competitiveness is low compared to the others.

In [None]:
#Define a dataframe for Viz

all_ = []

for j in list(leaderboard.League.unique()):
    b_ = leaderboard[leaderboard['League']==j]
    b_ = pd.DataFrame(b_.loc[:,'Baseline_model_accuracy':].T).reset_index().rename(columns={'index':'Accuracy_Type'})
    b_.columns=['Accuracy_Type'] + list(leaderboard.Percentage.unique())

    finals = []
    for i in (b_.iloc[:,1:].columns):
        temp = b_.copy()
        temp = temp[['Accuracy_Type',i]]
        temp['Percentage'] = i
        temp.rename(columns={i:'Accuracy'}, inplace=True)
        finals.append(temp)
    finals = pd.concat(finals)  
    finals['League'] = j
    all_.append(finals)
    
all_ = pd.concat(all_)
all_.replace('_model_accuracy','', inplace=True, regex=True)
all_.replace('balanced_accuracy','Balanced', inplace=True, regex=True)
all_.replace('_tc',' true class', inplace=True,regex=True)
all_.replace('_fc',' false class', inplace=True, regex=True)
all_.sort_values(by='Accuracy_Type', inplace=True)
all_.head()
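
The wide-to-long reshaping done by the loop above can also be expressed with `DataFrame.melt`. The sketch below is an alternative under stated assumptions, not the notebook's method: the leaderboard rows are made-up values and only two of the metric columns are included.

```python
import pandas as pd

# Illustrative leaderboard rows (made-up numbers, not notebook results).
leaderboard = pd.DataFrame({
    "League": ["bundesliga", "bundesliga", "epl", "epl"],
    "Percentage": [30, 40, 30, 40],
    "Baseline_model_accuracy": [0.80, 0.82, 0.85, 0.86],
    "balanced_accuracy": [0.60, 0.65, 0.70, 0.75],
})

# One row per (League, Percentage, metric) instead of one column per metric.
long_df = leaderboard.melt(
    id_vars=["League", "Percentage"],
    var_name="Accuracy_Type",
    value_name="Accuracy",
)

# Shorten the metric names in the same way as the replace() calls above.
long_df["Accuracy_Type"] = long_df["Accuracy_Type"].replace(
    {"Baseline_model_accuracy": "Baseline", "balanced_accuracy": "Balanced"}
)

print(long_df.sort_values("Accuracy_Type"))
```

The long format is what `sns.FacetGrid` expects when one column selects the row facet and another the column facet.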

In [None]:
g = sns.FacetGrid(all_[all_['Accuracy_Type'].isin(['Balanced','Baseline','Gridsearch'])], col="League", row="Accuracy_Type", margin_titles=True, despine=False)
g.map_dataframe(sns.lineplot, x="Percentage", y="Accuracy")
g.set_axis_labels("Percentage of Season", "Accuracy")
g.set_titles(row_template = '{row_name}', col_template = '{col_name}',fontsize=12,fontweight='bold',)
g.fig.subplots_adjust(wspace=0, hspace=0)
g.fig.suptitle( x=0.4, y=1.02, t="Model Accuracy Results",
                  fontsize=14,fontweight='bold',  ha='left')
g.savefig('Model.Accuracy.Results.jpeg', dpi=350, bbox_inches='tight')

A comparison of the classification metrics is shown below. In these graphs, we can observe how the true-class and false-class performance changes and how accurate the predictions are for each league. For example, the high predictability of the Premier League during the season, when both precision and recall are taken into account, confirms that this league has low competitiveness. In contrast, La Liga shows values of 50 percent or less on these metrics, meaning that the models largely failed to predict the teams that finished the season in the top 3 positions, and therefore the competition is high.

In [None]:
g = sns.FacetGrid(all_[all_['Accuracy_Type'].isin([ 'fscore false class',
 'fscore true class',
 'precision false class',
 'precision true class',
 'recall false class',
 'recall true class'])], col="League", row="Accuracy_Type", margin_titles=True, despine=False)
g.map_dataframe(sns.lineplot, x="Percentage", y="Accuracy")
g.set_axis_labels("Percentage of Season", "Accuracy")
g.set_titles(row_template = '{row_name}', col_template = '{col_name}',fontsize=12,fontweight='bold',)
g.fig.subplots_adjust(wspace=0, hspace=0)
g.fig.suptitle( x=0.4, y=1.02, t="Model Classification Results",
                  fontsize=14,fontweight='bold',  ha='left')
g.savefig('Model.Classification.Results.jpeg', dpi=350, bbox_inches='tight')

### Competitiveness Ranking

Considering the balanced accuracy results, we rank the leagues within each percentage from the lowest accuracy to the highest, with #1 corresponding to the lowest accuracy.

From this point of view, a league with low prediction accuracy is more competitive than a league with high accuracy, so it ranks 1st.

**Comments**: By ranking the accuracy results in descending order, we can observe how competitiveness is distributed over the season. Although the prediction accuracies of La Liga were on an upward trend until the last quarter of the season, when the results are compared with the other leagues it is the most difficult league to predict throughout the season. Therefore, we conclude that it is the league with the highest competition. The Premier League and Bundesliga alternate, overtaking and falling behind each other, until 60 percent of the season's games have been played. In this respect, we can say that both leagues share second and third place. However, in the last quarter of the season, the competitiveness of the Bundesliga increases sharply. Since the Bundesliga's accuracy shows a stronger downward trend, we place it second in competitiveness. The MLS is the least competitive league until 60 percent of the season's matches have been played; although its competitiveness increases sharply at some point, it decreases rapidly after 70 percent of the matches have been played. From this point of view, we can say that it has the lowest level of competition of the four.

In [None]:
ranks = []
for i in list(leaderboard.Percentage.unique()):
    # Copy the slice so the new column is added without a SettingWithCopyWarning
    temp_ = leaderboard[leaderboard['Percentage']==i].copy()
    temp_['competitiveness_rank'] = temp_['Baseline_model_accuracy'].rank(ascending=False).astype(int)
    ranks.append(temp_)
    
ranks = pd.concat(ranks)
ranks = ranks.pivot(index='League',columns='Percentage', values='competitiveness_rank').reset_index()
ranks
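
The ranking rule stated in the text (within each percentage, the league with the lowest balanced accuracy gets rank 1) can be sketched with a grouped rank. The accuracy values below are invented for illustration, and `method="min"` is an assumption that keeps tied ranks as integers.

```python
import pandas as pd

# Made-up balanced accuracies for two cut-off percentages (illustration only).
leaderboard = pd.DataFrame({
    "League": ["bundesliga", "epl", "laliga", "mls"] * 2,
    "Percentage": [30] * 4 + [40] * 4,
    "balanced_accuracy": [0.70, 0.80, 0.55, 0.65, 0.75, 0.78, 0.60, 0.72],
})

# Rank inside each percentage: lowest accuracy -> rank 1 (most competitive).
leaderboard["competitiveness_rank"] = (
    leaderboard.groupby("Percentage")["balanced_accuracy"]
    .rank(ascending=True, method="min")
    .astype(int)
)

# One row per league, one column per percentage, as in the table above.
ranks = leaderboard.pivot(index="League", columns="Percentage",
                          values="competitiveness_rank")
print(ranks)
```

A grouped rank replaces the explicit loop over percentages and avoids assigning into a filtered slice.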

In [None]:
from pandas.plotting import parallel_coordinates
fig = plt.figure(figsize=(12,5))
parallel_coordinates(ranks, 'League',colormap=plt.get_cmap("tab20b_r") ) #color=sns.color_palette()  
plt.title("Competitiveness levels of the Leagues", fontsize=12, fontweight='bold')
plt.xlabel("Played Percentage of the Season")
plt.show()
fig.savefig('Competitiveness.levels.Leagues.jpeg', dpi=350, bbox_inches='tight')

### Competitiveness Ranking in the Training Set

In order to evaluate the success rates of the models during grid search, we examine the Grid Search Model Accuracy results by ranking them with the same approach.

In [None]:
ranks = []
for i in list(leaderboard.Percentage.unique()):
    # Copy the slice so the new column is added without a SettingWithCopyWarning
    temp_ = leaderboard[leaderboard['Percentage']==i].copy()
    temp_['competitiveness_rank_for_train_set'] = temp_['Gridsearch_model_accuracy'].rank(ascending=False).astype(int)
    ranks.append(temp_)
    
ranks = pd.concat(ranks)
ranks = ranks.pivot(index='League',columns='Percentage', values='competitiveness_rank_for_train_set').reset_index()
ranks

In [None]:
fig = plt.figure(figsize=(12,5))
parallel_coordinates(ranks, 'League',colormap=plt.get_cmap("tab20b_r") ) #color=sns.color_palette()  
plt.title("Competitiveness levels of the Leagues", fontsize=12, fontweight='bold')
plt.xlabel("Played Percentage of the Season")
plt.show()

**Conclusions**: We have presented the competitiveness levels of the leagues by comparing several models suitable for binary classification. With our approach, **La Liga** turned out to be the most competitive league of the four, maintaining the lowest balanced accuracy score until the end of the season. Beyond this work, the problem can be explored further by incorporating other aspects of match outcomes over the course of the season, or by applying different models, for example by recasting the task as a multi-class classification problem that predicts the final position of each team.