In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

I collected data on Premier League player performances for the current season (2021-22) from [fbref.com](https://fbref.com/en/). It consists of 83 different player metrics. The website has data on many more metrics if you want to check them out. The idea is to use these metrics to bucket players according to their playing style. A generic way to do so would be to simply differentiate players according to their positions; however, that's not interesting. If you're a football fanatic and have been following the game for a while, you'll be aware of how often positions in modern football overlap. Many full-backs engage extensively in attacks, often sacrificing defensive responsibilities, while central midfielders cover for them at the back. I want to see whether the clustering algorithm, based on these 83 metrics, can capture this trend.

PS: Check the dataset description if you want to know what these metrics are.

In [None]:
df=pd.read_csv('../input/202122-epl-playerstats/2021-22_EPL_PlayerStats.csv',index_col=0)
df.sample(10,random_state=42)

In [None]:
#Maximum games played by a player
plt.subplots(1,1,figsize=(10,5))
sns.histplot(df,x='MP',kde=True,binwidth=1)
plt.title('Frequency of Matches Played',size=15)
plt.xlabel('Matches Played',size=15)
plt.show()

As the data covers the current season up to matchday 20, the maximum number of appearances by any player is 20. For the analysis, I am going to ignore players who have played 10 or fewer games. Also, if you look at the dataframe above, a few players have multiple positions. I'll separate them into primary and secondary positions.

A few metrics are NULL for certain players, and I'll replace them with 0.
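To make the position split and the NULL handling concrete, here is a minimal sketch on a toy dataframe. The players and the `Gls` values are made up for illustration and are not taken from the dataset; only the `Pos` column name mirrors the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three made-up players, one with two positions
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C'],
    'Pos': ['DF', 'MF,FW', 'GK'],
    'Gls': [0.0, 3.0, np.nan],   # a missing metric value
})

# Split "MF,FW" into ['MF', 'FW']; single positions give a one-element list
pos_parts = toy['Pos'].str.split(',')
toy['Primary_pos'] = pos_parts.str[0]
# .str[1] is NaN when there is no second position, so label it 'NA'
toy['Secondary_pos'] = pos_parts.str[1].fillna('NA')

# Replace the missing metric with 0
toy['Gls'] = toy['Gls'].fillna(0)
print(toy)
```

Player B ends up with `MF` as primary and `FW` as secondary position, while the other two get `NA` as their secondary position.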

In [None]:
df[df.isna().any(axis=1)].head(5)

In [None]:
df=df[df['MP']>10].copy() #copy so that the column assignments below don't act on a slice of the original frame
df['Primary_pos']=df['Pos'].str.split(',').str[0]
df['Secondary_pos']=df['Pos'].str.split(',').str[1]
df['Secondary_pos']=df['Secondary_pos'].fillna('NA')
df.drop(columns=['Pos'],inplace=True)
df=df.fillna(0)

# Checking for Correlation

In [None]:
tot=0
c=0
feats=df.columns[2:-2]
corr=df.iloc[:,2:-2].corr() # [2:-2] are the metric columns
for i,c1 in enumerate(corr.index):
    for j in range(i+1,len(corr)):
        if abs(corr.loc[c1,corr.columns[j]]) > 0.5: #checking whether absolute correlation > 0.5
            tot+=1
        c+=1
tot

Out of 3403 possible unique variable pairs (83*82/2), 687 have an absolute correlation of more than 0.5. Hence, I'll use PCA to transform the variables and extract only the important information.
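The same pair count can also be obtained without explicit loops. The sketch below is one possible vectorized version, shown on synthetic data; the helper name `count_correlated_pairs` and the demo columns are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def count_correlated_pairs(frame, threshold=0.5):
    """Return (pairs above threshold, total unique pairs) for a numeric frame."""
    corr = frame.corr().to_numpy()
    # Indices of the strict upper triangle: each unordered pair exactly once
    rows, cols = np.triu_indices_from(corr, k=1)
    pair_values = np.abs(corr[rows, cols])
    return int((pair_values > threshold).sum()), pair_values.size

# Synthetic demo: 'b' is almost a copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    'a': base,
    'b': base + 0.1 * rng.normal(size=200),
    'c': rng.normal(size=200),
})
n_high, n_pairs = count_correlated_pairs(demo)
print(n_high, n_pairs)
```

With 83 metric columns the number of unique pairs is 83*82/2 = 3403, which matches the total quoted above.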

# PCA

In [None]:
#PCA
x=df.iloc[:,2:-2].values
x=StandardScaler().fit_transform(x)
pca=PCA()
pc=pca.fit_transform(x) #Matrix of principal components

In [None]:
#Cumulative percentage of variance explained by the first k principal components
dic={}
s=0
for i,v in enumerate(pca.explained_variance_):
    s=s+v
    dic[i+1]=s*100/pca.explained_variance_.sum()
plt.subplots(1,1,figsize=(15,8))
plt.bar(dic.keys(),dic.values())
plt.title('Cumulative explained variance of principal components',size=15)
plt.xlabel('Number of principal components',size=15)
plt.ylabel('Cumulative percentage',size=15)
plt.show()

The first 2 principal components (PCs) explain around 53% of the variation in the data, while the first 5 PCs explain more than 70%. Clearly, there was no point in carrying all 83 metrics. I'll consider only the first two PCs going forward. Let's dig into them to see what information they relay.
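As a side note, the cumulative percentages can also be read directly from scikit-learn's `explained_variance_ratio_` attribute. The sketch below uses synthetic data (the `demo_*` names are made up for illustration) and finds how many components are needed to reach a 70% threshold.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 6 variables, the second nearly duplicates the first
rng = np.random.default_rng(0)
demo_x = rng.normal(size=(100, 6))
demo_x[:, 1] = demo_x[:, 0] + 0.1 * rng.normal(size=100)

demo_pca = PCA().fit(StandardScaler().fit_transform(demo_x))

# Running total of the per-component variance shares, in percent
cumulative = np.cumsum(demo_pca.explained_variance_ratio_) * 100

# Smallest number of components whose cumulative share reaches 70%
n_needed = int(np.searchsorted(cumulative, 70) + 1)
print(cumulative.round(1), n_needed)
```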

# Relating principal components with original variables
PCs in isolation are neither fascinating nor intuitive unless one is able to relate them to the original variables. Essentially, each PC is a mixture (a linear combination) of the original variables, and it'd be useful to determine how much of each variable is present in a particular PC. How to do that? Well, some simple math will help.

Consider a data-space with n variables. The first principal component can then be written as:
$$PC1=a_{11}X_{1}+a_{12}X_{2}+\dots+a_{1n}X_{n}$$
I won't deep-dive into the math behind PCA, but if you're familiar with it, you'd know that the vector $[a_{11}\,a_{12}\,a_{13}\,\dots\,a_{1n}]$ is the axis which explains the most variation in the data and that $\sum_{k=1}^{n}a_{1k}^{2}=1$. Now, as PC1 is a dot product of this vector and the data-space (x), the amount of a variable 'k' in PC1 can be ascertained from the coefficient $a_{1k}$. Essentially, you can think of these coefficients as the weights of the original variables in PC1. Let's see how this works in code.

In [None]:
axes=pca.components_ #Principal axes (eigenvectors) obtained from PCA
pc.T[0] #First principal component

The above first principal component can also be obtained as the dot product of x (the standardized original variables) and the first principal axis.
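A quick way to check this claim is to compute the dot product and compare it with the PCA output. The sketch below does so on synthetic standardized data; the `demo_*` names are made up for illustration and stand in for `x`, `pca` and `pc` from the cells above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized metric matrix x
rng = np.random.default_rng(0)
demo_x = StandardScaler().fit_transform(rng.normal(size=(50, 4)))

demo_pca = PCA()
demo_pc = demo_pca.fit_transform(demo_x)   # matrix of principal components
first_axis = demo_pca.components_[0]       # first principal axis

# Project every row onto the first axis
projected = demo_x @ first_axis

# Compare with the first column returned by fit_transform
matches = np.allclose(projected, demo_pc[:, 0])
print(matches)
```

The comparison relies on the data already having zero mean after standardization, because `PCA` subtracts the column means before projecting.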

In [None]:
#Function to extract top 10 variables in a PC
def feat_pc(axes,feats,n):
    ind=np.argsort(abs(axes[n]))[-10:]
    df=pd.DataFrame()
    df['variables']=feats[ind][::-1]
    df['coefficient']=axes[n][ind][::-1]
    df['percentage']=100*np.square(df['coefficient'])
    return df

Coefficients, as discussed above, are the weights of each variable. A negative coefficient implies a negative correlation of the variable with the PC. Percentage is calculated as the square of each coefficient times 100. Since the squared coefficients of a principal axis sum to 1, this is the percentage share of each variable in the PC.
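The percentage interpretation works because every principal axis has unit length. A minimal check on synthetic data (the `demo_*` names are hypothetical) could look like this:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized metrics
rng = np.random.default_rng(1)
demo_x = StandardScaler().fit_transform(rng.normal(size=(80, 5)))
demo_axes = PCA().fit(demo_x).components_

# Squared coefficients of the first axis, expressed as percentages
percentages = 100 * np.square(demo_axes[0])
total = percentages.sum()
print(percentages.round(2), round(total, 6))
```

The shares of all variables in one PC add up to 100, so the top 10 rows returned by `feat_pc` show how much of the PC those ten variables account for together.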

# First Principal Component

In [None]:
#First principal component
feat_pc(axes,feats,0)

Evidently, the first PC is associated mostly with the goals and assists made by players. It's also negatively associated with touches in the defensive third. Hence, this component should help differentiate attacking and defensive players. Let's verify this by performing clustering on just the first principal component.

In [None]:
#Utility functions
def pos_infor(df,l,names=False):
    df['cl']=l
    for i in df['cl'].unique():
        print(i)
        print(df.groupby('cl').get_group(i)['Primary_pos'].value_counts())
        if names:
            print(df.groupby('cl').get_group(i)['Player'].head(20).values)
    df.drop(columns=['cl'],inplace=True)

def kmeans_scores(pc,ul,c1,c2=0):
    scores=[]
    for i in range(1,ul+1):
        kmeans=KMeans(i)
        if c2==0:
            kmeans.fit(pc.T[c1-1].reshape(-1,1))
        else:
            kmeans.fit(np.vstack((pc.T[c1-1],pc.T[c2-1])).T)
        scores.append(kmeans.inertia_)
    plt.plot(range(1,ul+1),scores) #use ul rather than the leaked loop variable
    plt.xlabel('Number of Clusters')
    plt.title('Elbow Method')
    plt.show()

def cluster_plot(pc,i,c1):
    kmeans=KMeans(i)
    l=kmeans.fit_predict(pc.T[c1-1].reshape(-1,1))
    plt.subplots(1,1,figsize=(14,6))
    plt.scatter(np.arange(len(l)),pc.T[c1-1],c=l) #one point per player, no hard-coded count
    plt.xlabel('player index')
    plt.ylabel('PC'+str(c1))
    plt.show()
    return l

In [None]:
kmeans_scores(pc,6,1)

According to the above elbow plot, two clusters are sufficient for adequate grouping based on just the first principal component.

In [None]:
l=cluster_plot(pc,2,1)

The KMeans algorithm has clustered the data into groups based on negative and positive PC1 values. Let's now see the majority primary position of these clusters.

In [None]:
#Primary player positions in the above clusters
pos_infor(df,l)

Cluster '0' consists mainly of defenders and all the goalkeepers, while cluster '1' has all the forward players and very few defenders. This division conforms with the variable composition of PC1 which we discussed above. As PC1 gives more weight to variables related to goals and assists, it helped differentiate between attacking and defensive players. It'll be interesting to know more about the 9 defenders belonging to cluster '1'. But first, let's repeat the exercise for some more principal components.

# Second Principal Component

In [None]:
#2nd principal component
feat_pc(axes,feats,1)

You can check the data description to find out what each of these variables tracks. SPA, for instance, is the total number of short passes attempted. So, the second principal component takes into account the type of passes made by the player. Carries, TotalDistCarry and ProgPass quantify how much a player is responsible for progressing the ball forward (towards the opposition half). SCPass are passes that create a shot. In essence, the second principal component relays information about the playing style of a player. It is correlated with players who make a lot of good passes and are heavily involved in teamplay.

In [None]:
kmeans_scores(pc,6,2)

In [None]:
l=cluster_plot(pc,3,2)

The second principal component divides the data well into three clusters.

In [None]:
pos_infor(df,l,names=True)

Based on this component, the KMeans algorithm has, I believe, clustered players according to their creativity and ball progression actions. If you watch football, you'll recognize that players in cluster '0' progress the ball a lot and are heavily involved in teamplay in the attacking third. In contrast, cluster '1' has rather immobile players who either solely score goals (forwards) or stick to their defensive duties.

# Tenth Principal Component
Now, let's look at a PC that does not explain a lot of variation in the data. PC 10 explains only around 1.5% of the variation.

In [None]:
feat_pc(axes,feats,9) #n is a zero-based index, so 9 selects PC 10

From the variable composition, it's not at all clear what information this component relays. As opposed to the first two PCs, which had a majority contribution from similar variables (attacking output and passing types respectively), PC 10 is composed of many uncorrelated variables (yellow cards, goal errors and left-foot passes have little in common).

In [None]:
kmeans_scores(pc,6,10) #c1 is one-based here, so 10 selects PC 10

In [None]:
l=cluster_plot(pc,3,10) #c1 is one-based here, so 10 selects PC 10

Notice how the y range (the range of the principal component) has shrunk. It varied from -10 to 20 for PC2. This observation is in accordance with PCA theory: as later components explain less variation, the data points are concentrated in a narrower range. This also results in inefficient clustering. The reduction in the elbow score (inertia) from a single cluster to 3 clusters is much smaller than for PC1 and PC2.
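The shrinking range can be tied back to the explained variance: the sample variance of each column of the component matrix equals the corresponding entry of scikit-learn's `explained_variance_`, and those entries are sorted in decreasing order. A small sketch on synthetic data (the `demo_*` names are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data with some correlated columns so the PCs differ in spread
rng = np.random.default_rng(2)
demo_x = rng.normal(size=(120, 5))
demo_x[:, 1] = demo_x[:, 0] + 0.2 * rng.normal(size=120)
demo_x[:, 3] = demo_x[:, 2] + 0.5 * rng.normal(size=120)
demo_x = StandardScaler().fit_transform(demo_x)

demo_pca = PCA()
demo_pc = demo_pca.fit_transform(demo_x)

# Sample variance (ddof=1) of each principal component column
score_var = demo_pc.var(axis=0, ddof=1)
same = np.allclose(score_var, demo_pca.explained_variance_)
print(score_var.round(3), same)
```

Later columns therefore have a smaller spread by construction, which is why a low-variance component leaves KMeans with little structure to separate.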

# Combining PC1 and PC2
Finally, let's see how the first two principal components, in combination, cluster the data.

In [None]:
kmeans_scores(pc,12,1,2)

Four clusters seem to be the right choice here.

In [None]:
kmeans=KMeans(4)
l=kmeans.fit_predict(pc[:,:2]) #cluster on the first two principal components
cen=kmeans.cluster_centers_
df['pc1']=pc.T[0]
df['pc2']=pc.T[1]
df['cluster']=l

In [None]:
sns.relplot(data=df, x='pc1', y='pc2', hue='cluster', palette='tab10', kind='scatter',height=8.27, aspect=11.7/8.27)
plt.show()

In [None]:
pos=[['GK','DF'],['MF','FW']]
f,ax=plt.subplots(2,2,figsize=(20,10))
for i in range(0,2):
    for j in range(0,2):
        r=df[df['Primary_pos']==pos[i][j]]
        sns.scatterplot(ax=ax[i,j],data=r, x='pc1', y='pc2', hue='cluster', palette='tab10',s=40)
        ax[i][j].set_title(pos[i][j],size=15)

Remember that PC1 is associated with attacking output (goals, assists etc.) while PC2 is associated with ball progression and involvement in the attacking third. Based on this, let's analyze the four clusters. Below, I have created a function to see the names belonging to each cluster.

**Cluster 0**
1. All players in this cluster have negative values for both PC1 and PC2
2. These are mainly defensive players. All goalkeepers, quite understandably, belong to this cluster. The rest of this cluster consists mainly of defenders, and a few midfielders who I believe must be the ones responsible for screening the back line.

**Cluster 1**
1. Positive values for both PC1 and PC2
2. This cluster mainly includes midfielders and forwards. There are only 3 defenders in this cluster. They must be the modern, highly active full-backs who shuttle up and down the flank providing assistance in the attacking third.
3. Players in this cluster are perhaps the chief creators in their teams. Mohamed Salah, Trent, Bruno Fernandes etc. fall in this cluster.

**Cluster 2**
1. Positive PC1 but negative PC2
2. These are players who have good attacking numbers (maybe not as good as the ones in cluster 1), but aren't involved a lot in the build-up play in the attacking half. As it mainly includes forwards, I believe these are typical goal poachers.

**Cluster 3**
1. Negative PC1 but positive PC2
2. Highly active players who cover a lot of distance and progress the ball, but don't have high attacking numbers. Mainly composed of defenders and midfielders like Henderson, Fabinho, Declan Rice etc.

In [None]:
def player_cluster(pos,cl,df):
    db=df.groupby('cluster').get_group(cl)
    return db[db['Primary_pos']==pos]

In [None]:
db=player_cluster('MF',1,df)
db