# <p style="background-color: #6D3FCA; font-family: Blippo, fantasy; line-height: 1.3; font-size: 26px; letter-spacing: 3px; text-align: center; color: #DBCBFA">Beginners Hierarchical Clustering: FIFA 21</p>

![](https://enter21st.com/wp-content/uploads/2020/04/FIFA-21-Release-date-Career-Mode-Ratings-Delays-PS5-Xbox.jpg)

# <span style="font-family: Blippo, fantasy; font-size: 18px; font-weight: bold; letter-spacing: 3px; color: #6D3FCA">INTRODUCTION</span>
This notebook presents a clustering algorithm. Using a dataset from the field of sports & games, I want to show how <b>hierarchical clustering</b> can be performed quickly and easily.
For that purpose, 2 features will be considered:
 - 'passing'
 - 'mentality_vision'
 
The objective is to find different groups of players based on their 'passing' and 'mentality_vision' values.


# <span style="font-family: Blippo, fantasy; font-size: 18px; font-weight: bold; letter-spacing: 3px; color: #6D3FCA">WHAT IS A CLUSTER?</span>
 - A cluster refers to a group of items with similar characteristics
 


# <span style="font-family: Blippo, fantasy; font-size: 18px; font-weight: bold; letter-spacing: 3px; color: #6D3FCA">WHAT IS CLUSTERING?</span>
- Clustering is the process through which items with similar characteristics are grouped

<hr style="height: 10px; border: 0; background-color: #6D3FCA">

**Importing libraries**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster.vq import whiten
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import warnings
warnings.filterwarnings('ignore')

**Reading data**

In [None]:
df = pd.read_csv('../input/fifa-21-complete-player-dataset/players_21.csv')
print(df.shape)
df.head()

**Features**

In [None]:
# Filter the 'passing' & 'mentality_vision' features and check for missing values
df[['passing', 'mentality_vision']].isnull().sum()

In [None]:
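# Keep only the rows with no missing values in either feature.
# .copy() gives an independent dataframe, so new columns can be added to it safely later on.
df_notna = df.dropna(subset=['passing', 'mentality_vision']).copy()
df_notna[['passing', 'mentality_vision']].isnull().sum()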

In [None]:
# Explore visually the relationship between 'passing' and 'mentality_vision' through a scatterplot
fig, ax = plt.subplots(figsize=(15,8))
ax.scatter(df_notna['passing'], df_notna['mentality_vision'], color='violet')
ax.set_xlabel('Passing')
ax.set_ylabel('Mentality vision')
plt.show()

# <span style="font-family: Blippo, fantasy; font-size: 18px; font-weight: bold; letter-spacing: 3px; color: #6D3FCA">FEATURES NORMALIZATION</span>
 - Normalization, as used here, is the process of rescaling each feature to a standard deviation (STD) of 1 by dividing its values by the feature's STD

# <span style="font-family: Blippo, fantasy; font-size: 18px; font-weight: bold; letter-spacing: 3px; color: #6D3FCA">WHY IS NORMALIZATION REQUIRED FOR CLUSTERING?</span>
- The features might have incomparable units: weight (kg/lbs) vs value (dollar/euro)
- Data in raw form can lead to bias
- Clusters might be highly dependent on one feature
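
To make the points above concrete, here is a minimal sketch of what `whiten` does, assuming a small made-up array (the `toy` values are hypothetical and not taken from the FIFA dataset). It suggests that `whiten` divides each column by its standard deviation without subtracting the mean, so both columns end up on a comparable scale.

```python
import numpy as np
from scipy.cluster.vq import whiten

# Hypothetical toy data: column 0 is a weight in kg, column 1 is a value in euros
toy = np.array([
    [70.0, 1_000_000.0],
    [80.0, 5_000_000.0],
    [75.0, 20_000_000.0],
    [90.0, 2_000_000.0],
])

# whiten rescales each column by its standard deviation (no mean subtraction)
scaled = whiten(toy)

# The same rescaling done by hand, for comparison
manual = toy / toy.std(axis=0)

print(scaled.std(axis=0))           # standard deviation of each scaled column
print(np.allclose(scaled, manual))  # whether both approaches agree
```

Without this step, the euro column would dominate any distance computed between rows simply because its numbers are larger.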



In [None]:
# Normalize the features
passing = df_notna['passing'].tolist()
mentality_vision = df_notna['mentality_vision'].tolist()

scaled_passing = whiten(passing)
scaled_mvision = whiten(mentality_vision)

In [None]:
# Create columns with the scaled features in the dataframe
df_notna['scaled_passing'] = scaled_passing
df_notna['scaled_mvision'] = scaled_mvision

In [None]:
# Create plots illustrating the features before & after normalization.
fig, ax = plt.subplots(1,2,figsize=(12,6))

ax[0].plot(df_notna['passing'], label='original', color='wheat')
ax[0].plot(df_notna['scaled_passing'], label='scaled', color='purple')
ax[0].set_title('Passing')
ax[0].legend()
ax[1].plot(df_notna['mentality_vision'], label='original', color='wheat')
ax[1].plot(df_notna['scaled_mvision'], label='scaled', color='purple')
ax[1].set_title('Mentality vision')
ax[1].legend()
plt.show()

# <span style="font-family: Blippo, fantasy; font-size: 18px; font-weight: bold; letter-spacing: 3px; color: #6D3FCA">WHAT IS A DENDROGRAM?</span>
- A diagram that shows how clusters are formed, merge by merge
- A visual method for deciding how many clusters to use

AXES LABELS
 - y-axis: height of the dendrogram, showing the distance between merging clusters
 - x-axis: the individual observations (here, players) that are merged into clusters
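
As a small illustration of how a dendrogram is built, the sketch below runs Ward linkage on four made-up points (the `points` array is hypothetical) and inspects the result without drawing it. Each row of the linkage matrix records one merge, and the merge distance in its third column is what the dendrogram height represents.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical toy data: two tight pairs of points that lie far apart from each other
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

# Each row of Z describes one merge:
# [index of cluster a, index of cluster b, merge distance, size of the new cluster]
Z = linkage(points, method='ward')
print(Z)

# no_plot=True returns the dendrogram layout as a dict instead of drawing it
layout = dendrogram(Z, no_plot=True)

# 'ivl' holds the leaf labels placed along the horizontal axis, one per observation
print(layout['ivl'])
```

With n observations there are always n - 1 merges, and the last row of `Z` joins everything into a single cluster.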



In [None]:
# Run hierarchical clustering (Ward linkage) on the scaled features
dm = linkage(df_notna[['scaled_passing', 'scaled_mvision']], method='ward')

# Create a dendrogram
plt.figure(figsize=(15,8))
dendr = dendrogram(dm)
plt.show()

Cutting the dendrogram at a height of around 70 yields 3 clusters.

The lower the cut height, the greater the number of clusters.
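
The relationship between cut height and number of clusters can be checked directly with `fcluster` and `criterion='distance'`. The following is a minimal sketch on made-up points (the `points` array and the chosen heights are hypothetical), not a rerun of the FIFA data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: three well-separated groups of two points each
points = np.array([
    [0.0, 0.0], [0.0, 1.0],
    [10.0, 0.0], [10.0, 1.0],
    [30.0, 0.0], [30.0, 1.0],
])
Z = linkage(points, method='ward')

# Cut the tree at increasing heights and count the clusters each cut produces
heights = [0.5, 5.0, 20.0, 100.0]
counts = []
for height in heights:
    labels = fcluster(Z, t=height, criterion='distance')
    counts.append(len(set(labels)))
    print(height, counts[-1])
```

The count never increases as the cut moves up the tree. Here a very low cut leaves every point on its own, and a very high cut puts everything into one cluster.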

In [None]:
# Assign cluster label to each row
df_notna['labels'] = fcluster(dm, 3, criterion = 'maxclust')

In [None]:
# Print the center of each cluster (mean of the scaled features)
print(df_notna.groupby('labels')[['scaled_passing', 'scaled_mvision']].mean())

In [None]:
# Create the scatterplot
plt.figure(figsize=(15,8))
sns.scatterplot(x='passing', y='mentality_vision', hue='labels', data=df_notna)
plt.show()

**Players in different clusters**

In [None]:
# Identify 5 players of each cluster
for cluster in df_notna['labels'].unique():
    print(cluster, df_notna[df_notna['labels']==cluster]['short_name'].values[:5])

The players from cluster 2 have the highest passing and mentality vision attributes, while the players from cluster 1 have the lowest values on these attributes.

Cluster 3 consists of players somewhere in between.
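
One way to back up a reading like this is to rank the clusters by their mean attribute values in the original units. The sketch below assumes a small made-up dataframe standing in for `df_notna` (the values and labels in `toy` are hypothetical, chosen only to mirror the ordering described above).

```python
import pandas as pd

# Hypothetical stand-in for df_notna with made-up values and cluster labels
toy = pd.DataFrame({
    'passing': [85, 88, 40, 45, 65, 62],
    'mentality_vision': [90, 86, 35, 42, 60, 66],
    'labels': [2, 2, 1, 1, 3, 3],
})

# Mean of each feature per cluster, sorted from the strongest to the weakest passers
summary = (
    toy.groupby('labels')[['passing', 'mentality_vision']]
       .mean()
       .sort_values('passing', ascending=False)
)
print(summary)
```

Applied to the real dataframe, the same `groupby` shows the cluster averages on the familiar rating scale rather than on the scaled one.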