![](http://www.sabcnews.com/sabcnews/wp-content/uploads/2018/10/SABC-News-UEFA_UEFA-website.png)


In this study, I will try to predict match outcomes, and from them the group standings, in the 2019-2020 UEFA Champions League season by developing a statistical model that uses player (in practice, team) ratings to determine the outcome of each match.

**1. Databases to be used:** 
* FIFA 19 player dataset
* FIFA 20 player dataset
* 2018-2019 match results from top European Leagues (~3000 matches)

**2. Models to be developed:** 
* A predictor model will be developed using 2018-2019 match results and the FIFA 19 player dataset (in practice, we will construct a team dataset from player attributes)
* Once the model is complete, it will be used to predict Champions League match outcomes using the FIFA 20 player dataset
* I treat this as a classification of match outcomes (Home, Away, Draw), so I will test KNN; other algorithms are out of scope for this project

**3. Expected results:** 
* At the end of this study, we will have group standings predictions for each group in UEFA Champions League 2019/2020 season


Let's begin to see if your favorite team will be able to make the second round this year! 

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as mat
# Import the FIFA 19 and FIFA 20 datasets. We need to identify the columns shared between the two datasets to make sure that they match exactly
df= pd.read_csv('../input/fifa19/data.csv')
fifa20 = pd.read_csv("../input/fifa-20-complete-player-dataset/fifa20_data.csv")

Take a brief look at the data, using the columns attribute and the head function

In [None]:
fifa20.columns   

In [None]:
df.columns

In [None]:
fifa20.head()

In [None]:
df.head()

As you can see above, there are many attributes attached to a player. I will use those attributes to determine clubs' attributes by taking the average of each attribute over a club's players. This is straightforward but basic; a further study could look at better ways to determine a club's attributes.

Firstly, for the sake of simplicity, I will remove the columns that I do not plan to use in my model. Secondly, I need to make sure that the columns in the FIFA 19 and FIFA 20 datasets match, so I will only keep the columns that exist in both tables.

In [None]:
df = df.drop(columns=['ID','Unnamed: 0','Value','Height','Weight','Wage','Weak Foot','Special','Preferred Foot','Skill Moves','Work Rate','Body Type','Photo','Nationality','Flag','Club Logo','Real Face','Jersey Number','Joined','Loaned From','Contract Valid Until','Release Clause'])

In [None]:
difcol20 = fifa20.columns.difference(df.columns)
difcol19 = df.columns.difference(fifa20.columns)

In [None]:
difcol20

In [None]:
difcol19

So we see that there are two different cases here:
* Some columns that exist in FIFA 19 do not exist in FIFA 20 (and vice versa)
* Some columns are named differently in the two databases

Thus, we first rename the FIFA 20 columns to match the FIFA 19 dataset, then drop the columns that still differ

In [None]:
fifa20.rename(columns={'Ball Control': 'BallControl', 'FK Accuracy': 'FKAccuracy','GK Diving':'GKDiving','GK Handling':'GKHandling','GK Positioning':'GKPositioning','GK Reflexes':'GKReflexes','Heading Accuracy':'HeadingAccuracy','Short Passing':'ShortPassing','Shot Power':'ShotPower','Sliding Tackle':'SlidingTackle','Sprint Speed':'SprintSpeed','Standing Tackle':'StandingTackle','Long Passing':'LongPassing','Long Shots':'LongShots'}, inplace=True)
# Drop the columns that appear in only one of the two datasets
difcol20 = fifa20.columns.difference(df.columns)
fifa20 = fifa20.drop(columns=difcol20)
difcol19 = df.columns.difference(fifa20.columns)
df = df.drop(columns=difcol19)
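
The rename-then-trim step can be illustrated on two toy frames. This is only a minimal sketch: the frames `a` and `b` and their columns are made up for illustration and are not the FIFA data.

```python
import pandas as pd

# Toy stand-ins for the two player tables (hypothetical columns)
a = pd.DataFrame({'Name': ['x'], 'BallControl': [80], 'Wage': [10]})
b = pd.DataFrame({'Name': ['y'], 'Ball Control': [75], 'Team': ['t']})

# 1) Rename so that equivalent columns share one spelling
b = b.rename(columns={'Ball Control': 'BallControl'})

# 2) Keep only the columns present in both frames, in a's order
shared = [c for c in a.columns if c in b.columns]
a, b = a[shared], b[shared]

print(list(a.columns))
print(list(b.columns))
```

Selecting the shared columns by name also puts both frames in the same column order, which matters later when the two years are fed to the same model.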

In [None]:
fifa20.columns

In [None]:
df.columns

In [None]:
difcol20 = fifa20.columns.difference(df.columns)
difcol20

In [None]:
difcol19 = df.columns.difference(fifa20.columns)
difcol19

I would like to keep goalkeeper statistics separate when calculating the clubs' overall attributes. Outfield players also have GK skill ratings, but I do not want those to count towards the team. Thus, I will categorise players into:
- Goalkeepers
- Others

Then, each player will have a new position, and we will later remove the stats that do not belong to that position.

In [None]:
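# Label every player as 'Other', then overwrite goalkeepers with 'GK'.
# .loc avoids the chained assignment that can silently fail to update the frame.
df['New Position'] = 'Other'
df.loc[df['Position'] == 'GK', 'New Position'] = 'GK'
fifa20['New Position'] = 'Other'
fifa20.loc[fifa20['Position'] == 'GK', 'New Position'] = 'GK'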

Now I have the attributes that I plan to use in my models. Next we need to group the player stats by team to build the team attributes. For this, I will use the .mean() function to get each team's average attributes for each position group.

I will split the players into goalkeepers and non-goalkeepers in two separate tables, then merge them into one

In [None]:
gk19 = df[df['New Position']=='GK']
gk20 = fifa20[fifa20['New Position']=='GK']

In [None]:
gk20.head()

In [None]:
gk19.head()

I'll need to eliminate the attributes that are not related to goalkeeping

In [None]:
# Outfield attributes that are not relevant for goalkeepers
outfield_cols = ['Name','Crossing','Finishing','HeadingAccuracy','ShortPassing','Volleys','Dribbling','Curve','FKAccuracy',
                 'LongPassing','BallControl','Acceleration','SprintSpeed', 'Agility', 'Balance', 'ShotPower',
                 'LongShots','Interceptions','Positioning','Vision','Penalties','Marking','StandingTackle','SlidingTackle',
                 'Aggression','Stamina']

gk19 = gk19.drop(columns=outfield_cols)
gk20 = gk20.drop(columns=outfield_cols)

I also want to keep keeper statistics distinct from those of the other players, so I will append 'GK' to the end of each attribute name

In [None]:
gk19.columns = [str(col) + " GK" for col in gk19.columns]
gk19['Club']=gk19['Club GK']
gk19=gk19.drop(['Club GK'],axis=1)
gk20.columns = [str(col) + " GK" for col in gk20.columns]
gk20['Club']=gk20['Club GK']
gk20=gk20.drop(['Club GK'],axis=1)

In [None]:
gk19.head()

In [None]:
gk20.head()

It seems like we have chosen the right attributes for the goalkeepers. Now it is time to group the players by team and take the mean for each team.

In [None]:
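# numeric_only=True skips text columns, which recent pandas versions refuse to average
gk19teams = gk19.groupby('Club').mean(numeric_only=True).sort_values('Overall GK',ascending=False)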
gk19teams.head()

In [None]:
gk20teams = gk20.groupby('Club').mean(numeric_only=True).sort_values('Overall GK',ascending=False)
gk20teams.head()
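
To show what the group-and-average step produces, here is a minimal sketch on three made-up goalkeepers. The clubs and ratings are invented for illustration; only the column naming follows the tables above.

```python
import pandas as pd

# Three hypothetical keepers at two hypothetical clubs
keepers = pd.DataFrame({
    'Club': ['A', 'A', 'B'],
    'Overall GK': [80, 70, 78],
    'GKReflexes GK': [82, 74, 79],
})

# One row per club, each attribute averaged over that club's keepers
team_gk = (keepers.groupby('Club')
                  .mean(numeric_only=True)
                  .sort_values('Overall GK', ascending=False))
print(team_gk)
```

Club A ends up below club B because its two keepers average 75 overall. A plain mean treats a reserve keeper the same as the first choice, which is one of the simplifications of this model.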

Here we have our subframe for teams. Now we need to create a new subset for the players who are not GKs

In [None]:
notgk19 = df[df['New Position']!='GK']
notgk20 = fifa20[fifa20['New Position']!='GK']

Similar to what we have done for the keepers, we will now delete the columns for GK stats from this table

In [None]:
notgk19.columns

In [None]:
notgk19 = notgk19.drop(['Name','GKDiving','GKHandling','GKPositioning','GKReflexes'],axis=1)
notgk20 = notgk20.drop(['Name','GKDiving','GKHandling','GKPositioning','GKReflexes'],axis=1)

In [None]:
notgk19.head()

It seems like we have chosen the right attributes for the other players. Now it is time to group the players by team and take the mean for each team.

In [None]:
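# numeric_only=True skips text columns, which recent pandas versions refuse to average
notgk19teams = notgk19.groupby('Club').mean(numeric_only=True).sort_values('Overall',ascending=False)
notgk20teams = notgk20.groupby('Club').mean(numeric_only=True).sort_values('Overall',ascending=False)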
notgk19teams

Now it is time to merge two tables into one using the Club as the key

In [None]:
teams19 = pd.merge(notgk19teams, gk19teams, how='right', on='Club')
teams20 = pd.merge(notgk20teams, gk20teams, how='right', on='Club')
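
A right join keeps every club that appears in the goalkeeper table, even if it has no outfield rows. The sketch below shows the effect on made-up tables; it joins on the index explicitly with `left_index`/`right_index`, which is an assumption made for clarity rather than the exact call used above.

```python
import pandas as pd

# Hypothetical team tables indexed by club
outfield = pd.DataFrame({'Overall': [80.0, 75.0]},
                        index=pd.Index(['A', 'B'], name='Club'))
gk = pd.DataFrame({'Overall GK': [78.0, 70.0, 65.0]},
                  index=pd.Index(['A', 'B', 'C'], name='Club'))

# Right join: all clubs from gk survive, missing outfield values become NaN
teams = pd.merge(outfield, gk, how='right', left_index=True, right_index=True)
print(teams)
```

Club C survives the join with a missing `Overall` value. Rows like this would have to be filled or dropped before they reach the scaler and the classifier.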


I also decided to drop some minor attributes that I believe do not affect the overall performance of the team

In [None]:
minor_cols = ["Potential GK","Jumping GK","GKHandling GK","GKPositioning GK","Reactions GK","Composure GK","GKDiving GK","Volleys","Curve","FKAccuracy","Jumping","LongShots","Penalties"]
teams19 = teams19.drop(columns=minor_cols)
teams20 = teams20.drop(columns=minor_cols)

Now we have the aggregate stats for 651 teams in FIFA 19

We have the results from the following leagues. I chose these leagues because relevant statistics were available and because most Champions League participants come from these countries:
* English Premier League
* La Liga
* Serie A
* Bundesliga
* Belgium Pro League
* France Ligue 1
* Eredivisie
* Primeira Liga
* Turkish Super Lig

First, we will read the data into dataframes and manipulate it. The team names in the following databases have already been changed to their FIFA names to ensure uniqueness (for example, 'Man United' in the results database was changed to 'Manchester United' as it appears in the FIFA 19 database).

In [None]:
uk = pd.read_csv('../input/europe-top-leagues-1819-results/UK.csv',sep=';',encoding='latin-1')
es = pd.read_csv('../input/europe-top-leagues-1819-results/ES.csv',sep=';',encoding='latin-1')
it = pd.read_csv('../input/europe-top-leagues-1819-results/IT.csv',sep=';',encoding='latin-1')
de = pd.read_csv('../input/europe-top-leagues-1819-results/DE.csv',sep=';',encoding='latin-1')
be = pd.read_csv('../input/europe-top-leagues-1819-results/BE.csv',sep=';',encoding='latin-1')
fr = pd.read_csv('../input/europe-top-leagues-1819-results/FR.csv',sep=';',encoding='latin-1')
ne = pd.read_csv('../input/europe-top-leagues-1819-results/NE.csv',sep=';',encoding='latin-1')
pt = pd.read_csv('../input/europe-top-leagues-1819-results/PO.csv',sep=';',encoding='latin-1')
tr = pd.read_csv('../input/europe-top-leagues-1819-results/TR.csv',sep=';',encoding='latin-1')

In [None]:
uk.head()

In [None]:
es.head()

In [None]:
tr.head()

Looking at three leagues, we get a sense of the structure of the data. I will only use HomeTeam, AwayTeam, FTHG (Full Time Home Goals), FTAG (Full Time Away Goals) and FTR (Full Time Result). Thus, I will drop all the remaining columns from the database.

In [None]:
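# DataFrame.append was removed in pandas 2.0; pd.concat stacks the league tables instead
allres = pd.concat([uk,be,de,tr,es,ne,fr,pt,it], ignore_index=True)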
allres['Div'].unique()

So all the divisions are there, let's take out the other columns that we will not use

In [None]:
allres=allres[['HomeTeam','AwayTeam','FTHG','FTAG','FTR']]
allres.head()

In [None]:
allres['HomeTeam'].describe()

Here we can see that 2984 matches were played in these leagues in the 2018/2019 season, which is a good sample size. Now I want to add the attributes from the FIFA 19 database to the match results dataframe. The difficulty is that team names differ between the two sources, so the names in the results need to be converted to the FIFA 19 team names. I have done this manually in my results database, so both databases should use exactly the same names.

My plan is to add the team attributes for the home team first and then for the away team. Thus I need to make a copy of the FIFA 19 team dataset for each and add a 'Home' or 'Away' prefix to every column.

In [None]:
HomeStats = teams19
HomeStats = HomeStats.add_prefix('Home ')
HomeStats = HomeStats.reset_index()
AwayStats = teams19
AwayStats = AwayStats.add_prefix('Away ')
AwayStats = AwayStats.reset_index()

In [None]:
HomeStats.head()

In [None]:
AwayStats.head()

In [None]:
res1 = pd.merge(allres,HomeStats,'left',left_on='HomeTeam',right_on='Club')
res1.head()
alltable = pd.merge(res1, AwayStats, 'left',left_on='AwayTeam',right_on='Club')

Check whether there are any NaN values and, if so, what causes them

In [None]:
nan = alltable[alltable['Club_x'].isna()]
nan['HomeTeam'].unique()
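
An alternative way to spot unmatched names is the `indicator` option of `merge`, which records where each row came from. A minimal sketch, assuming made-up team tables and ratings:

```python
import pandas as pd

# Hypothetical results and club tables; 'Sociedad' has no exact match
results = pd.DataFrame({'HomeTeam': ['Real Sociedad', 'Sociedad', 'SPAL']})
clubs = pd.DataFrame({'Club': ['Real Sociedad', 'SPAL'],
                      'Home Overall': [77.0, 72.0]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = results.merge(clubs, how='left', left_on='HomeTeam',
                       right_on='Club', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'HomeTeam'].unique().tolist()
print(unmatched)
```

This lists the names that need fixing directly, without relying on NaN values in a particular column.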

So we see that the names of some teams do not match the FIFA database. We need to make sure that they have the same name in both databases, so we amend their names in the results table

In [None]:
allres['HomeTeam'] = allres['HomeTeam'].replace('FC Schalke 04 04', 'FC Schalke 04')
allres['AwayTeam'] =  allres['AwayTeam'].replace('FC Schalke 04 04', 'FC Schalke 04')
allres['HomeTeam'] =  allres['HomeTeam'].replace('Medipol Baþakþehir FK', 'Medipol Başakşehir FK')
allres['AwayTeam'] =  allres['AwayTeam'].replace('Medipol Baþakþehir FK', 'Medipol Başakşehir FK')
allres['HomeTeam'] = allres['HomeTeam'].replace('Beþiktaþ JK', 'Beşiktaş JK')
allres['AwayTeam'] = allres['AwayTeam'].replace('Beþiktaþ JK', 'Beşiktaş JK')
allres['HomeTeam'] = allres['HomeTeam'].replace('Sociedad', 'Real Sociedad')
allres['AwayTeam'] = allres['AwayTeam'].replace('Sociedad', 'Real Sociedad')
allres['HomeTeam'] = allres['HomeTeam'].replace('Spal', 'SPAL')
allres['AwayTeam'] = allres['AwayTeam'].replace('Spal', 'SPAL')
allres['HomeTeam'] = allres['HomeTeam'].replace('Kasimpaþa SK', 'Kasimpaşa SK')
allres['AwayTeam'] = allres['AwayTeam'].replace('Kasimpaþa SK', 'Kasimpaşa SK')

In [None]:
res1 = pd.merge(allres,HomeStats,'left',left_on='HomeTeam',right_on='Club')
alltable2 = pd.merge(res1, AwayStats, 'left',left_on='AwayTeam',right_on='Club')

Another run to see if there is any NaN value left

In [None]:
nan2 = alltable2[alltable2['Club_x'].isna()]
# Show the home teams that still have no match (Club_x itself would only show NaN)
nan2['HomeTeam'].unique()

In [None]:
nan2 = alltable2[alltable2['Club_y'].isna()]
# Show the away teams that still have no match
nan2['AwayTeam'].unique()

Phew! Finally the data seems clean and good to go. Here is a description of the data

In [None]:
alltable2.info()

In [None]:
alltable2.describe()

Now remove the club names from the table to have just the pure data for the remaining part of the process

In [None]:
table = alltable2.drop(columns=['HomeTeam','AwayTeam','Club_x','Club_y'])
table.head()

Categorisation of match results: Home Win as 1, Away Win as 2, Draw as 0

In [None]:
table['FTR']= table['FTR'].replace(['H','A','D'],[1,2,0])
table.head()

**FUN PART BEGINS!!!**

Now that we have all the match data ready (around 3000 matches and 61 columns attributed to each match), it is time to build our first model. 

**Model selection:** I want to build a k-nearest neighbors model since the target is categorical (Home, Away, Draw). Other models could be used in a further study, such as linear regression to estimate how many goals each team would score, or decision tree models. However, I will focus only on the k-nearest neighbors model.

I will drop the goals scored columns (FTHG and FTAG) since they directly determine the outcome of the match (surprise!)

In [None]:
tablek=table.iloc[:,2:]
tablek.info()

Let's first standardise the data for all the columns with numbers

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(tablek.iloc[:,1:])
scaled_feat=scaler.transform(tablek.iloc[:,1:])
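# The second positional argument of pd.DataFrame is the index, so pass the feature names as columns
tablek_feat=pd.DataFrame(scaled_feat,columns=tablek.columns[1:])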
X = tablek_feat
y=tablek['FTR']
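
For reference, standardising a column means subtracting its mean and dividing by its standard deviation. The sketch below does this by hand on made-up ratings using only the standard library; the numbers are illustrative and not taken from the dataset.

```python
import statistics

# Hypothetical team ratings for a single attribute
ratings = [70.0, 75.0, 80.0, 85.0]

mean = statistics.fmean(ratings)
# Population standard deviation (divide by n), which is what StandardScaler uses
std = statistics.pstdev(ratings)

z = [(r - mean) / std for r in ratings]
print([round(v, 3) for v in z])
```

After this transformation every attribute has mean 0 and unit variance, so no single attribute dominates the distance calculation in k-NN just because it is measured on a wider scale.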

As a next step we will split the data as train and test, using a test size of 0.3

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state=8)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train,y_train)
pred=knn.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,pred))
print(classification_report(y_test,pred))
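
To make the idea behind the classifier concrete, here is a small from-scratch sketch of k-nearest-neighbour voting. It illustrates the principle only and is not how scikit-learn implements `KNeighborsClassifier`; the function name `knn_predict` and the 2-D points are made up for this example.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k):
    """Return the majority label among the k training points closest to query."""
    # Pair every training point's Euclidean distance with its label, nearest first
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Made-up points: label 1 = home win, 2 = away win, 0 = draw
train_X = [(2.0, 0.0), (1.5, 0.2), (-2.0, 0.1), (-1.8, -0.3), (0.1, 0.0)]
train_y = [1, 1, 2, 2, 0]

print(knn_predict(train_X, train_y, (1.8, 0.1), k=3))
```

With k=3 the query point's neighbours carry the labels 1, 1 and 0, so the vote goes to a home win. A larger k smooths the decision, while a very small k lets a single unusual match decide the prediction.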

With k arbitrarily set to 3, the accuracy is 46%. Such a small neighbourhood is sensitive to noise, however. A common rule of thumb is to set k close to the square root of the number of samples.
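
As a quick worked example of that rule of thumb, the sketch below computes the suggested k from the 2984 matches reported earlier and from the 70% training share. The variable names are mine, and the rounding of the split sizes is my understanding of how `train_test_split` handles a fractional `test_size`.

```python
import math

n_matches = 2984          # number of matches reported above
test_size = 0.3

# The test share is rounded up, the remainder is used for training
n_test = math.ceil(n_matches * test_size)
n_train = n_matches - n_test

k_all = round(math.sqrt(n_matches))    # rule of thumb on all samples
k_train = round(math.sqrt(n_train))    # rule of thumb on training samples only
print(n_train, k_all, k_train)
```

Both values are far above 3, which suggests scanning a wide range of k rather than stopping at small values.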

Rather than rely on the rule of thumb alone, I would like to find a near-optimal value for k by evaluating the error rate. I will plot the error rate for k values from 1 to 49 and see which one gives the lowest error.

In [None]:
error_rate=[]

for i in range(1,50):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

mat.figure(figsize=(10,6))
mat.plot(range(1,50),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
mat.title('Error Rate vs. K Value')
mat.xlabel('K')
mat.ylabel('Error Rate')

As we can see above, the lowest error rate occurs at k=15. Let's also take a look at the accuracy of our model for each value of k:

In [None]:
from sklearn import metrics
k_range= range(1,50)

scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_pred))

print(scores)

mat.plot(k_range, scores)
mat.xlabel('Value of K for KNN')
mat.ylabel('Testing Accuracy')
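
Instead of reading the best k off the plot, it can also be picked programmatically from the list of scores. A minimal sketch, using made-up accuracies rather than the values produced by the loop above:

```python
# Hypothetical accuracies for k = 1..5 (illustrative numbers only)
k_range = range(1, 6)
scores = [0.46, 0.48, 0.47, 0.52, 0.50]

# Position of the highest score, mapped back to its k value
best_k = k_range[scores.index(max(scores))]
print(best_k)
```

If several k values tie for the best score, this picks the smallest one.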

As expected, the highest accuracy occurs at k=15. Thus, I will fit my model using k=15:

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train,y_train)
pred=knn.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,pred))
print(classification_report(y_test,pred))

From the classification report above, we see that our model's accuracy rose from 46% to 54%.

Now, it is time to read the 2019/2020 groups into the dataset. Unfortunately, Red Star FC and Zenit are not in the FIFA 20 dataset, thus their CL groups consist of three teams each in this study.

In [None]:
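cl = pd.read_excel('../input/champions-league-groups-1920/clgroups1920.xlsx',header=0)
# Use the FIFA 20 team attributes here, with columns ordered as in the FIFA 19 training data
home20 = teams20[teams19.columns].add_prefix('Home ').reset_index()
away20 = teams20[teams19.columns].add_prefix('Away ').reset_index()
table1=pd.merge(cl, home20, how='left', left_on='HomeTeam', right_on='Club')
clmatches=pd.merge(table1, away20, how='left', left_on='AwayTeam', right_on='Club')
clmatches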

In [None]:
clmatches.info()

In [None]:
clmatches=clmatches.drop(['Club_x','Club_y'],axis=1)
clmatches.info()

Normalize the data:

In [None]:
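# Select the features by name, in the same order as the training features
cl_feat = clmatches[tablek.columns[1:]]
# Reuse the scaler fitted on the league matches; refitting on the CL fixtures
# would put them on a different scale from the data the model was trained on
scaled_feat=scaler.transform(cl_feat)
tablecl_feat=pd.DataFrame(scaled_feat,columns=cl_feat.columns)
Xcl = tablecl_feat
predcl=knn.predict(Xcl)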

Predict the outcomes and take a look at group A

In [None]:
clmatches['Results']=predcl
# Work on a copy so that the assignments below modify clresults itself
clresults=clmatches[['Group ','HomeTeam','AwayTeam','Results']].copy()
clresults['Homepts']=0
clresults['Awaypts']=0
# 3 points for a win, 1 point each for a draw
clresults.loc[clresults['Results']==1,'Homepts']=3
clresults.loc[clresults['Results']==2,'Awaypts']=3
clresults.loc[clresults['Results']==0,['Homepts','Awaypts']]=1
clresults[clresults['Group ']=='A']

In [None]:
# Sum only the points columns so that the team-name columns are not aggregated
hpts=clresults.groupby(['Group ','HomeTeam'])[['Homepts']].sum()
apts=clresults.groupby(['Group ','AwayTeam'])[['Awaypts']].sum()

In [None]:
hpts.reset_index(inplace=True)
apts.reset_index(inplace=True)
# hpts and apts list the same (group, team) pairs in the same order, so they can be placed side by side
clpred = pd.concat([hpts,apts],axis=1)
clpred['Total Points']=clpred['Homepts']+clpred['Awaypts']
clpred=clpred.drop(columns=['Homepts','Awaypts','AwayTeam'])
clpred=clpred.iloc[:,~clpred.columns.duplicated()]
clpred=clpred.groupby(['Group ','HomeTeam']).sum()
# Sort groups alphabetically and teams by points within each group
clpred.sort_values(['Group ','Total Points'],ascending=[True,False]).groupby('Group ').head(4)
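
The side-by-side concat above relies on every team appearing both as a home and as an away side. A sketch of an alternative that does not depend on row order is shown below: it stacks home and away points into one long table before summing. The fixtures, team names and the `Group` column (without trailing space) are made up for illustration.

```python
import pandas as pd

# Hypothetical group with predicted results: 1 = home win, 2 = away win, 0 = draw
fixtures = pd.DataFrame({
    'Group': ['A'] * 6,
    'HomeTeam': ['X', 'Y', 'X', 'Z', 'Y', 'Z'],
    'AwayTeam': ['Y', 'X', 'Z', 'X', 'Z', 'Y'],
    'Results': [1, 0, 1, 2, 2, 1],
})

# Points earned by each side in every match
home_pts = fixtures['Results'].map({1: 3, 0: 1, 2: 0})
away_pts = fixtures['Results'].map({2: 3, 0: 1, 1: 0})

# One row per (team, match), whether the team played at home or away
long = pd.concat([
    pd.DataFrame({'Group': fixtures['Group'], 'Team': fixtures['HomeTeam'], 'Points': home_pts}),
    pd.DataFrame({'Group': fixtures['Group'], 'Team': fixtures['AwayTeam'], 'Points': away_pts}),
])

standings = (long.groupby(['Group', 'Team'], as_index=False)['Points'].sum()
                 .sort_values(['Group', 'Points'], ascending=[True, False]))
print(standings)
```

Because the points are matched to teams by name rather than by position, a group with a missing team or an uneven fixture list is still summed correctly.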

Finally, here is the list of teams predicted to advance to the second round from each group, with their estimated points:

In [None]:
clpred.sort_values(['Group ','Total Points'],ascending=[True,False]).groupby('Group ').head(2)

**Conclusion:**

***What was my aim in this study?***

* I wanted to build a statistical model using match results and team attributes, and use this model to predict the Champions League 2019/2020 group standings by predicting the outcome of each game

***How was the model constructed?***

* I used the FIFA 19 player dataset to determine the overall abilities of each team in many dimensions (dribbling, shooting, etc.), averaging each player's attributes to obtain team ratings
* I used 2018-2019 football results from major European leagues and the FIFA 19 dataset to train the model (around 3000 matches) 
* I created fixtures for the teams in the CL and merged this table with the FIFA 20 dataset so that each team's attributes are calculated
* Then I used the k-NN algorithm to build the model and predict the outcomes

***What was the outcome of the model?***

* The model reached 54% accuracy, which is close to the levels achieved by some academic studies as well (https://www.imperial.ac.uk/media/imperial-college/faculty-of-engineering/computing/public/1718-ug-projects/Corentin-Herbinet-Using-Machine-Learning-techniques-to-predict-the-outcome-of-profressional-football-matches.pdf)

***What could be further steps and development areas?***

* Developing a different ML model: A linear regression model to estimate how many goals each team would score in each game, or a decision tree model, could be applied to compare the outcomes and the accuracy
* Data manipulation: More data regarding the form of each team, starting elevens, injuries, condition etc. could be added to the initial data for (hopefully) better accuracy. The model could then be run just before each game to get a more accurate prediction

Please let me know in the comments whether your team will make it to the next round, or share your thoughts on the model itself!