# A study of Unsupervised Learning techniques on the FIFA 20 dataset.
![FIFA20](https://media.contentapi.ea.com/content/dam/ea/fifa/fifa-20/common/nav/fifa20-nav-clubpacks.png)

In this project we explore the FIFA 20 player data with unsupervised learning, using principal component analysis (PCA) and clustering. One goal is to describe the variation between different types of players, which in turn gives insight into how to choose players for a team. In high-dimensional data it is often difficult to develop an intuition for the features, so we reduce the dimensionality of the dataset until the relationships between the features and the clusters can be visualized. We start with 104 features and bring this down to 28 by selecting key features using domain knowledge and by removing features that can be predicted from the others using regression; PCA then reduces these to two principal components. We visualize the data on these two components, cluster it, and interpret the resulting clusters.

# Dataset

FIFA 20 features more than 30 official leagues, over 700 clubs and over 17,000 players. Included for the first time is the Romanian Liga I and its 14 teams, as well as Emirati club Al Ain, who were added following extensive requests from the fans in the region.    
With this much player and team data available in the game, the dataset offers a detailed breakdown of every recorded attribute a player can have. Football also gives the player data a rich structure, because the game is dynamic and a player may occupy any of several positions. We explore this structure in this project through unsupervised learning techniques.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from IPython.display import HTML, display
import seaborn as sns

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
pd.set_option('display.max_columns', 500)
data = pd.read_csv('/kaggle/input/fifa-20-complete-player-dataset/players_20.csv') 

In [None]:
display(data.head())
print(f'The Data has {data.shape[0]} rows and {data.shape[1]} features')

## Data Understanding and cleaning

This dataset provides the complete statistics available at the player level in the FIFA 20 game. It was scraped from sofifa.com and is very clean in terms of structure and expected data types. We will just transform the data to the format we need and keep only the features that might be useful in our analysis.    
Many features in the data carry information that is already captured by other variables.

Some key observations in the dataset:
-	`potential` and `overall` are highly correlated. `potential` is basically an integer greater than or equal to `overall`.
-	`overall` is a computation based on all the other skill ratings of a player such as `shooting`, `passing`, etc.
-	Unless a player plays in the goalkeeper position (`GK`), all his goalkeeper statistics are `NaN`.
-	The columns `ls`, `st`, `rs`, `lw` etc. are playing positions in the game, and the data in these columns is basically the maximum potential of a player if he were to play in that position. We will assume a player only plays in his preferred position and drop all these columns.
-	We will drop further columns as and when we find them unnecessary for our analysis. For now, all descriptive columns like `sofifa_id`, `player_url`, `nationality` etc. will be dropped.
-	`player_positions` are the preferred positions of the player. We will keep only the first playing position for our analysis.
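
The observations above can be checked quickly before any column is dropped. The block below is a minimal sketch on a hypothetical three-row frame (the values are made up, only the column names mirror the dataset); the same expressions should apply to `data` once it is loaded.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame mimicking a few columns of players_20.csv
toy = pd.DataFrame({
    "player_positions": ["ST, CF", "GK", "CB"],
    "overall": [85, 80, 78],
    "potential": [88, 80, 82],
    "gk_diving": [np.nan, 82.0, np.nan],
})

# potential should never be lower than overall
potential_ok = bool((toy["potential"] >= toy["overall"]).all())

# Keep only the first listed position for each player
first_position = toy["player_positions"].str.split(",").str[0]

# Goalkeeper-only columns should be NaN for every outfield player
outfield = toy[first_position != "GK"]
gk_nan_for_outfield = bool(outfield["gk_diving"].isna().all())

print(potential_ok, gk_nan_for_outfield, list(first_position))
```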


In [None]:
data['player_positions'] = data['player_positions'].str.split(',').str[0]

Before dropping the columns discussed above, let's take a copy of the data. Since we are interested in the clusters the data forms, it will be useful to draw samples from various playing positions and see where they land in our clusters.  

In [None]:
original_data = data.copy()
data.head()

In [None]:
def generate_samples(positions = ['CAM', 'RM', 'CDM', 'LM', 'CM'], n_samples = 10):
    '''
    positions = ['RW', 'ST', 'LW', 'GK', 'CAM', 'CB', 'CM', 'CDM', 'CF', 'LB', 'RB','RM', 'LM', 'LWB', 'RWB']
    '''
    samples = original_data[original_data.player_positions.isin(positions) & (original_data.overall>=70)].sample(n_samples)
    return samples.index.values

Let's take a look at the player ratings.

In [None]:
plt.figure(figsize=(20,10))
ax = sns.histplot(data.overall, bins=20, kde=True, stat='density');  # distplot is deprecated in recent seaborn
ax.set_title('Distributions of Player Ratings')
ax.set_xlabel('Overall Ratings');

Since `overall` summarizes the quality of a player, its histogram shows how players are distributed. The ratings follow a roughly normal distribution.

For the purpose of this notebook, we keep only players with an `overall` rating of 70 or higher. This is a soft criterion; 70 is generally a decent rating in FIFA.

In [None]:
data = data[data.overall>=70]

In [None]:
pd.DataFrame(data.overall.value_counts().sort_index())

In [None]:
plt.figure(figsize=(20,10))
ax = sns.histplot(data.overall, bins=20);  # counts only, no density curve
ax.set_title('Distributions of Player Ratings')
ax.set_xlabel('Overall Ratings');

In [None]:
plt.figure(figsize=(20,10))
ax = sns.scatterplot(x='age', y='overall', hue='player_positions', data=data);  # x and y must be keywords in recent seaborn
ax.set_title('Player Ages vs Overall Rating')
ax.set_xlabel('Ages')
ax.set_ylabel('Overall Rating')

def label_point(x, y, val, ax):
    a = pd.concat({'x': x, 'y': y, 'val': val}, axis=1)
    for i, point in a.iterrows():
        if (point['y'] >=90) :
            ax.text(point['x']+.1, point['y']+.1, str(point['val']),fontsize=12)

label_point( data.age,data.overall, data.short_name, plt.gca())  

We see an obvious yet interesting insight here: a player usually reaches his highest overall rating in the middle of his career, between roughly 25 and 32 years of age. Since we suspect the features to be highly correlated, we will look at a heatmap of the correlations in the data below.

In [None]:
to_drop = ['sofifa_id','player_url','long_name','potential','dob',\
           'work_rate','body_type','real_face','release_clause_eur','player_tags',\
           'team_position','team_jersey_number','loaned_from','joined','contract_valid_until',\
           'nation_position','nation_jersey_number','player_traits',\
           'ls','st','rs','lw','lf','cf','rf','rw','lam','cam','ram','lm','lcm','cm','rcm',\
           'rm','lwb','ldm','cdm','rdm','rwb','lb','lcb','cb','rcb','rb', 'value_eur','wage_eur']
data = data.drop(to_drop, axis=1)

In [None]:
pd.DataFrame(data.dtypes).T

Let's look at the correlations in the dataset.

In [None]:
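# Create the absolute correlation matrix from the numeric columns only
# (recent pandas versions raise an error if string columns are included)
corr_matrix = data.select_dtypes(include='number').corr().abs()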

In [None]:
plt.figure(figsize=(20,10))
ax = sns.heatmap(corr_matrix)

As we can see, the goalkeeper-related features are very highly correlated with one another, as seen by the white squares in the heatmap. Goalkeepers are a separate group, and none of the main outfield skills apply to them. We will treat goalkeepers as a cluster of their own and remove them from the dataset, dropping first the goalkeeper features and then the goalkeeper rows. PCA also requires numeric, continuous features, so we will drop all categorical features as well. The reason is that PCA looks for the directions of maximum variance in the data, and variance is not meaningful for category labels or coarse ordinal codes.

In [None]:
goalkeeper_features = ['gk_handling','gk_reflexes','gk_positioning','gk_diving','gk_kicking','gk_speed',\
                       'goalkeeping_diving','goalkeeping_handling','goalkeeping_kicking','goalkeeping_positioning','goalkeeping_reflexes']
data = data.drop(goalkeeper_features, axis = 1)

Next we drop the rows whose `player_positions` value is `GK`, which removes all goalkeepers from the dataset.

In [None]:
data = data[data.player_positions !='GK']

Now let's drop all the categorical features

In [None]:
categorical_features = ['short_name','nationality','club','preferred_foot','player_positions','international_reputation','weak_foot','skill_moves']
data = data.drop(categorical_features, axis =1)
data = data.fillna(0)
data.shape

# Feature Relevance
Now we are left with over `39` variables, and we still need to check our initial assumption that `overall` and the other summary skills are explained by the remaining variables.    

A simple way to check this is to fit a regression model with each feature in turn as the response and all other features as predictors.    

Let us use a `DecisionTreeRegressor` for this. We create a function that runs the regression with one feature as the response and all other features as predictors, and prints only the features whose `R2` score on the held-out test set is `0.95` or higher. We will remove these features: their variance can be explained by the remaining variables, so they add little further information to our analysis.
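
To make the idea concrete before applying it to the player data, here is a small sketch on synthetic data. The frame `toy`, its column names and the helper `r2_for` are all hypothetical; `summary` is built as the average of two other columns, so a regressor should recover it well, while the independent `noise` column should not be predictable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "a": rng.uniform(40, 99, 3000),
    "b": rng.uniform(40, 99, 3000),
    "noise": rng.uniform(40, 99, 3000),
})
# A derived "summary" feature, fully determined by a and b
toy["summary"] = (toy["a"] + toy["b"]) / 2

def r2_for(frame, target, seed=0):
    """Test-set R2 when predicting `target` from all other columns."""
    X = frame.drop(columns=target)
    y = frame[target]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = DecisionTreeRegressor(random_state=seed).fit(X_tr, y_tr)
    return model.score(X_te, y_te)

scores = {col: r2_for(toy, col) for col in toy.columns}
for col, score in scores.items():
    print(f"{col}: R2 = {score:.3f}")
```

A high score marks a feature as redundant given the others; a low or negative score marks it as carrying information of its own.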

In [None]:
def model_features(data, feature, random_state, hc_feat):
    new_data = data.drop(feature, axis = 1)
    
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(new_data, data[feature], test_size = 0.20, random_state = random_state)

    from sklearn.tree import DecisionTreeRegressor
    regressor = DecisionTreeRegressor(random_state=random_state)
    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)

    score =  regressor.score(X_test, y_test)
    if score >= 0.95:
        hc_feat.append(feature)
        print("R2 Score for feature {} is {}".format(feature, round(score,3) ))
    else:
        pass
    return hc_feat

In [None]:
hc_feat = []
for key in data:
    model_features(data, key, random_state=13263600,hc_feat=hc_feat)

As we can see, the features `pace`, `shooting`, `passing`, `dribbling`, `defending` and `physic` have very high R2 scores. These are the summary statistics of the players and are calculated from the other, more specific features.    
We also see that the features `attacking_finishing`, `skill_dribbling`, `movement_acceleration`, `movement_sprint_speed` and `defending_standing_tackle` have R2 scores over 0.95.
For these reasons we drop these features from our analysis as well.

In [None]:
data = data.drop(hc_feat, axis = 1)
print(data.shape)
data.head()

# Feature Scaling

PCA is sensitive to the scale of the features, so we transform the dataset with several scaling techniques. The original distribution of the dataset can be seen in the figure below.

In [None]:
data.describe()

In [None]:
plt.figure(figsize=(20,6))
for col in data.columns:
    # fill replaces the deprecated shade argument; label feeds the legend
    sns.kdeplot(data[col], fill=True, label=col)
plt.legend(loc='best');

In [None]:
%%html
<style>
  table {margin-left: 0 !important;}
</style>

We can notice that the features are on different scales and most of them are slightly left-skewed. We try the following scaling techniques and choose the one for which the first two principal components explain the most variance:    

| Scaling Technique |
|------|
|   Log Scaling  |
|   Standard Scaling  |
|   Min Max Scaling  |
|   Log normal Scaling (log transform followed by standard scaling)  |
## Transforming the data for PCA

In [None]:
from sklearn.decomposition import PCA
def pca_results(data, n_components=8):
    #PCA model
    pca = PCA(n_components=n_components, random_state=1).fit(data)
    
    #DataFrame creation
    dimensions = ['Dimension {}'.format(i) for i in range(1,len(pca.components_)+1)]
    components = pd.DataFrame(np.round(pca.components_, 4), columns = list(data.columns))
    components.index = dimensions

    # PCA explained variance
    ratios = pca.explained_variance_ratio_.reshape(len(pca.components_), 1)
    variance_ratios = pd.DataFrame(np.round(ratios, 4), columns = ['Explained Variance'])
    variance_ratios.index = dimensions

    # Create a bar plot visualization
    fig, ax = plt.subplots(figsize = (25,10))

    # Plot the feature weights as a function of the components
    components.plot(ax = ax, kind = 'bar');
    ax.set_ylabel("Feature Weights")
    ax.set_xticklabels(dimensions, rotation=0)
    plt.legend(loc='upper right')

    # Display the explained variance ratios
    for i, ev in enumerate(pca.explained_variance_ratio_):
        ax.text(i-0.40, ax.get_ylim()[1] + 0.05, "Explained Variance\n%.4f"%(ev))

    # Return a concatenated DataFrame
    df = pd.concat([variance_ratios, components], axis = 1)
    print(f'Total Variance Explained by the first 2 dimensions: {df.iloc[:2,0].sum()}')
    return df

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

log_data = np.log(data)

scaler = StandardScaler()
minmax = MinMaxScaler()

minmax.fit(data)
scaler.fit(data)

scaled_data = scaler.transform(data)
scaled_data = pd.DataFrame(scaled_data, columns=data.columns)

minmax_data = minmax.transform(data)
minmax_data = pd.DataFrame(minmax_data, columns=data.columns)

scaler.fit(log_data)
log_normal_data = scaler.transform(log_data)
log_normal_data = pd.DataFrame(log_normal_data, columns=data.columns)

In [None]:
def plot_transformed_data(data):
    plt.figure(figsize=(20,6))
    for col in data.columns:
        # fill replaces the deprecated shade argument; label feeds the legend
        sns.kdeplot(data[col], fill=True, label=col)
    plt.legend(loc='best');

In [None]:
#Data
plot_transformed_data(data)

In [None]:
#Log
plot_transformed_data(log_data)

In [None]:
plot_transformed_data(scaled_data)

In [None]:
plot_transformed_data(minmax_data)

In [None]:
plot_transformed_data(log_normal_data)

In [None]:
pca_log = pca_results(log_data, 5)

In [None]:
pca_scaled = pca_results(scaled_data, 5)

In [None]:
pca_minmax = pca_results(minmax_data)

In [None]:
pca_log_normal = pca_results(log_normal_data,5)

The scaling techniques and variance explained by the first two principal components are summarized in the table below:

| Scaling Technique | Variance explained by PC1 and PC2 |
|------|------|
|   Log Scaling  |    68.4%    |
|   Standard Scaling  |    52.5%    |
|   Min Max Scaling  |    57.2%    |
|   Log normal Scaling  |    50.8%    |


# PCA Analysis

When using principal component analysis, one of the main goals is to reduce the dimensionality of the data — in effect, reducing the complexity of the problem.    
However, dimensionality reduction comes at a cost as fewer dimensions used implies less of the total variance in the data is being explained. Because of this, the cumulative explained variance ratio is extremely important for knowing how many dimensions are necessary for the problem. Additionally, if a significant amount of variance is explained by only two or three dimensions, the reduced data can be visualized afterwards.    
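
As a sketch of how the cumulative explained variance ratio guides the choice of dimensions, the block below fits PCA on synthetic data generated from two latent factors. The data, the `0.90` target and the variable names are assumptions for illustration only; they are not taken from the player dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic data: 500 rows, 6 features driven by 2 latent factors plus small noise
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 6))

# Keep all components so the full variance profile is visible
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the assumed target
target = 0.90
n_needed = int(np.searchsorted(cumulative, target) + 1)

print(np.round(cumulative, 3), n_needed)
```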

Since the PCs describe variation and account for the varied influences of the original features, we can plot the PCs to find out which features produce the differences among clusters.    
To do this we plot the loadings: vectors representing each feature, drawn on the PC plot from (0, 0), whose direction and length show how much influence each feature has on the PCs. The angle between two vectors also indicates the correlation between the corresponding features, with a small angle denoting high correlation. A plot that visualizes this information is called a Biplot.
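
The link between the angle of two loading vectors and the correlation of the features can be illustrated on a toy example. In this sketch the three columns are synthetic and their names are only borrowed for readability; `tackle` and `marking` share a common factor while `finishing` is independent. The reading holds best when the first two PCs capture most of the variance, as they do here.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
z = rng.normal(size=1000)  # shared latent factor
toy = pd.DataFrame({
    "tackle": z + 0.2 * rng.normal(size=1000),
    "marking": z + 0.2 * rng.normal(size=1000),
    "finishing": rng.normal(size=1000),
})

scaled = StandardScaler().fit_transform(toy)
pca = PCA(n_components=2).fit(scaled)

# One row per feature: its coordinates in the (PC1, PC2) plane
loadings = pd.DataFrame(pca.components_.T, index=toy.columns,
                        columns=["PC1", "PC2"])

def cosine(u, v):
    """Cosine of the angle between two loading vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

cos_close = cosine(loadings.loc["tackle"], loadings.loc["marking"])
cos_far = cosine(loadings.loc["tackle"], loadings.loc["finishing"])
print(round(cos_close, 3), round(cos_far, 3))
print(toy.corr().round(3))
```

A cosine near 1 corresponds to a small angle and a high correlation; a cosine near 0 corresponds to nearly perpendicular vectors and weakly related features.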


In [None]:
def make_pca(data, sample_ids):
    # Fit on the data passed in, then project the full data and the chosen samples
    pca = PCA(n_components=2).fit(data)
    reduced_data = pca.transform(data)
    pca_samples = pca.transform(data[data.index.isin(sample_ids)])
    reduced_data = pd.DataFrame(reduced_data, columns = ['Dimension 1', 'Dimension 2'])
    return pca, reduced_data, pca_samples


In [None]:
pca, reduced_data, pca_samples = make_pca(log_data, sample_ids = generate_samples(['CAM','CM']))

In [None]:
def biplot(log_data, reduced_data, pca):
    '''
    Produce a biplot that shows a scatterplot of the reduced
    data and the projections of the original features.
    
    log_data: the data the PCA was fitted on (here the log-transformed data).
               Needs to be a pandas dataframe with valid column names
    reduced_data: the reduced data (the first two dimensions are plotted)
    pca: pca object that contains the components_ attribute
    return: a matplotlib AxesSubplot object (for any additional customization)
    
    This procedure is inspired by the script:
    https://github.com/teddyroland/python-biplot
    '''

    fig, ax = plt.subplots(figsize = (10,10))
    # scatterplot of the reduced data    
    ax.scatter(x=reduced_data.loc[:, 'Dimension 1'], y=reduced_data.loc[:, 'Dimension 2'], 
        facecolors='b', edgecolors='b', s=70, alpha=0.5)
    
    feature_vectors = pca.components_.T

    # we use scaling factors to make the arrows easier to see
    arrow_size, text_pos = 4.0, 5.0

    # projections of the original features
    for i, v in enumerate(feature_vectors):
        ax.arrow(0, 0, arrow_size*v[0], arrow_size*v[1], 
                  head_width=0.2, head_length=0.2, linewidth=2, color='red')
        ax.text(v[0]*text_pos, v[1]*text_pos, log_data.columns[i], color='black', 
                 ha='center', va='center', fontsize=18)

    ax.set_xlabel("Dimension 1", fontsize=14)
    ax.set_ylabel("Dimension 2", fontsize=14)
    ax.set_title("PC plane with original feature projections.", fontsize=16);
    return ax
    

The Biplot for our dataset is shown below. As we can see, the features `mentality_interceptions`, `defending_sliding_tackle` and `defending_marking` lie close together, and they strongly influence both PC1 and PC2.  

In [None]:
biplot(log_data, reduced_data, pca);

# Clustering

In this section, we use the K-Means clustering algorithm to identify the player segments hidden in the data.    
The main advantage of K-Means is speed: each iteration only assigns points to the nearest cluster center and recomputes the centers, which makes training fast.    
One drawback is that K-Means only makes hard assignments and does not give a probability of cluster membership.     

Based on the data, K-Means should do a good job, assuming that the players are well segmented and each player fills a specific role.    
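
The difference between hard and soft assignments can be seen on a toy example. The sketch below contrasts K-Means with a Gaussian mixture model, which is not used in this notebook and is shown only for comparison; the two synthetic blobs stand in for PCA-reduced data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic 2-D blobs standing in for PCA-reduced player data
X = np.vstack([
    rng.normal(loc=[-3, 0], scale=1.0, size=(200, 2)),
    rng.normal(loc=[3, 0], scale=1.0, size=(200, 2)),
])

# K-Means: exactly one integer label per point
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
hard_labels = kmeans.labels_

# Gaussian mixture: one membership probability per cluster and point
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft_probs = gmm.predict_proba(X)

# A point between the blobs gets a single label from K-Means,
# while the mixture model reports how uncertain that assignment is
midpoint = np.array([[0.0, 0.0]])
print(kmeans.predict(midpoint), gmm.predict_proba(midpoint).round(2))
```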

Depending on the problem, the number of clusters in the data may not be known in advance, so we cannot be sure that a given number of clusters is the best choice for our data. Since we do not know the structure present in the data, we measure the “goodness” of a clustering by calculating each point’s **silhouette coefficient.**    

The silhouette coefficient for a data point measures how similar it is to its assigned cluster from -1 (dissimilar) to 1 (similar). Calculating the mean silhouette coefficient provides for a simple scoring method of a given clustering.    

The Silhouette Coefficient is defined for each sample and is composed of two scores (a and b):

a.	The mean distance between a sample and all other points in the same cluster.
b.	The mean distance between a sample and all other points in the next nearest cluster.    

The Silhouette Coefficient s for a single sample is then given as:

$$ s = \frac{b-a}{\max(a, b)}$$
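
As a worked example of this formula, take four hypothetical one-dimensional points split into two clusters and compute `s` by hand for the first point, then compare it with scikit-learn's `silhouette_samples`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Four 1-D points in two hypothetical clusters: {0, 1} and {10, 11}
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# Manual computation for the first point (x = 0)
a = abs(0.0 - 1.0)                            # mean distance within its own cluster
b = (abs(0.0 - 10.0) + abs(0.0 - 11.0)) / 2   # mean distance to the nearest other cluster
s_manual = (b - a) / max(a, b)                # (10.5 - 1.0) / 10.5

# Library computation for the same point
s_sklearn = silhouette_samples(X, labels)[0]

print(s_manual, s_sklearn)
```

Both values equal 9.5 / 10.5, roughly 0.905, which reflects a point that sits much closer to its own cluster than to the other one.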


The Silhouette Coefficient for a set of samples is the mean of the Silhouette Coefficients of the individual samples. In our analysis, the highest Silhouette Score, about `0.53`, is obtained with three clusters. Another popular method for choosing the number of clusters is the **Elbow Method**. Here we choose the value of `K` that lies at the elbow of the curve of the number of clusters against the inertia, that is, the sum of squared distances between each point and its centroid. As the plot below shows, the elbow appears at 3 clusters, which agrees with the Silhouette score.

In [None]:
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

def cluster(reduced_data, n_clusters, pca_samples=None):
    # Fit K-Means on the reduced data and label every point
    clusterer = KMeans(n_clusters=n_clusters, random_state=123).fit(reduced_data)
    preds = clusterer.predict(reduced_data)
    centers = clusterer.cluster_centers_
    # Label the sample points only when they are supplied
    sample_preds = clusterer.predict(pca_samples) if pca_samples is not None else None
    return preds, centers, sample_preds

def silhouette_scorer(reduced_data, n_clusters):
    # Mean silhouette coefficient of a K-Means clustering with n_clusters
    preds, _, _ = cluster(reduced_data, n_clusters)
    score = silhouette_score(reduced_data, preds)
    return score

for n_clusters in range(2,10):
    score = silhouette_scorer(reduced_data, n_clusters)
    print("Silhouette Score for {} clusters is {}".format(n_clusters, score))


In [None]:
inertia = []
clusters = range(2,10)
for n_clusters in clusters:
    clusterer = KMeans(n_clusters=n_clusters, random_state=123).fit(reduced_data)
    preds = clusterer.predict(reduced_data)
    inertia.append(clusterer.inertia_)

plt.plot(clusters, inertia)
plt.ylabel('Inertia')
plt.xlabel('n_clusters')
plt.title('Elbow Method');


Both methods point to three clusters, so we proceed with `K = 3`.

In [None]:
def cluster_results(reduced_data, preds, centers, pca_samples):
    '''
    Visualizes the PCA-reduced cluster data in two dimensions
    Adds cues for cluster centers and student-selected sample data
    '''
    import matplotlib.cm as cm
    predictions = pd.DataFrame(preds, columns = ['Cluster'])
    plot_data = pd.concat([predictions, reduced_data], axis = 1)

    # Generate the cluster plot
    fig, ax = plt.subplots(figsize = (14,8))

    # Color map
    cmap = plt.get_cmap('gist_rainbow')  # cm.get_cmap was removed in recent matplotlib

    # Color the points based on assigned cluster
    for i, cluster in plot_data.groupby('Cluster'):   
        cluster.plot(ax = ax, kind = 'scatter', x = 'Dimension 1', y = 'Dimension 2', \
                     color = cmap((i)*1.0/(len(centers)-1)), label = 'Cluster %i'%(i), s=30);

    # Plot centers with indicators
    for i, c in enumerate(centers):
        ax.scatter(x = c[0], y = c[1], color = 'white', edgecolors = 'black', \
                   alpha = 1, linewidth = 2, marker = 'o', s=200);
        ax.scatter(x = c[0], y = c[1], marker='$%d$'%(i), alpha = 1, s=100);

    # Plot transformed sample points 
    ax.scatter(x = pca_samples[:,0], y = pca_samples[:,1], \
               s = 150, linewidth = 4, color = 'black', marker = 'x');

    # Set plot title
    ax.set_title("Cluster Learning on PCA-Reduced Data - Centroids Marked by Number\nTransformed Sample Data Marked by Black Cross");

In [None]:
original_data.player_positions.unique()

![](https://www.fifauteam.com/wp-content/uploads/2012/08/A046-1.jpg)

In [None]:
sample_ids = generate_samples(['CB','LB','RB'], 10)

_, _, pca_samples = make_pca(log_data, sample_ids)
preds, centers, sample_preds = cluster(reduced_data, 3, pca_samples)
cluster_results(reduced_data, preds, centers, pca_samples)

In [None]:
sample_ids = generate_samples(['ST','CF'], 10)

_, _, pca_samples = make_pca(log_data, sample_ids)
preds, centers, sample_preds = cluster(reduced_data, 3, pca_samples)
cluster_results(reduced_data, preds, centers, pca_samples)

In [None]:
sample_ids = generate_samples(['CM','RM','LM','CAM','CDM'], 10)

_, _, pca_samples = make_pca(log_data, sample_ids)
preds, centers, sample_preds = cluster(reduced_data, 3, pca_samples)
cluster_results(reduced_data, preds, centers, pca_samples)

# Results

It appears from the results above that the clusters correspond to the three main playing positions in the game.

- Forwards (Green cluster or Cluster 1)
- Midfielders (Pink cluster or Cluster 2)
- Defenders (Red cluster or Cluster 0)
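
One way to check a mapping like this beyond random samples is to cross-tabulate cluster labels against broad position groups. The block below is a sketch on eight hypothetical players: the labels in `toy` are invented, and the `role_of` grouping of positions into three roles is an assumption. On the real data, the cluster labels would first have to be aligned with each player's first playing position.

```python
import pandas as pd

# Hypothetical cluster labels and first playing positions for eight players
toy = pd.DataFrame({
    "position": ["ST", "CF", "CM", "CAM", "CDM", "CB", "LB", "RB"],
    "cluster":  [1, 1, 2, 2, 2, 0, 0, 0],
})

# Assumed grouping of positions into the three broad roles
role_of = {
    "ST": "Forward", "CF": "Forward", "LW": "Forward", "RW": "Forward",
    "CM": "Midfielder", "CAM": "Midfielder", "CDM": "Midfielder",
    "LM": "Midfielder", "RM": "Midfielder",
    "CB": "Defender", "LB": "Defender", "RB": "Defender",
    "LWB": "Defender", "RWB": "Defender",
}
toy["role"] = toy["position"].map(role_of)

# Rows are clusters, columns are roles, cells are player counts
table = pd.crosstab(toy["cluster"], toy["role"])
print(table)
```

A table dominated by one role per row would support the interpretation; rows spread across several roles would indicate that the clusters mix positions.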

## Interpretation

To understand why K-Means returned these clusters, we go back to the Biplot above.    
The Biplot maps the original features as vectors onto the principal components, and comparing the clusters with the feature vectors makes the reason clear.    

For example, take Cluster 0, the Red cluster. Based on the random samples, we inferred that it is a cluster of defenders. In the Biplot, the feature vectors with the strongest influence along the direction of this cluster are `defending_marking`, `mentality_interceptions` and `defending_sliding_tackle`.    

Similarly, the most important features along Cluster 2 are `skill_long_passing`, `attacking_short_passing`, `power_stamina`, `mentality_composure` etc. We can infer that midfielders are the players who possess these traits, and we can identify the strongest players in this cluster for a midfielder role.

Finally, a forward's main job is to score goals, and naturally the key features along this direction include `attacking_volleys`, `mentality_positioning` and `attacking_heading_accuracy`.

# References

1.	FIFA 20 - Wikipedia. https://en.wikipedia.org/wiki/FIFA_20
2.	FIFA 20 complete player dataset - Kaggle, 26 Sep. 2019. https://www.kaggle.com/stefanoleone992/fifa-20-complete-player-dataset
3.	Dimensionality Reduction - scikit-learn. https://scikit-learn.org/stable/modules/decomposition.html
4.	Silhouette Score - scikit-learn 0.22.2 ... http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html
5.	Elbow method (clustering) - Wikipedia. https://en.wikipedia.org/wiki/Elbow_method_(clustering)
6.	Clustering visualization - Udacity. https://github.com/udacity/mlnd
7.	How to read PCA biplots and scree plots - Linh Ngo. https://blog.bioturing.com/2018/06/18/how-to-read-pca-biplots-and-scree-plots/
8.	Sofifa.com - source of the data scraped to Kaggle. sofifa.com
