In [0]:
import pandas as pd
import csv
import numpy as np
import nltk
import sklearn
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

file_data = pd.read_csv("N_Fifa.csv")
file_data.head()

Here we specify the player-statistic columns. They cover the majority of the player statistics found in the dataset and are used in the analysis that follows.

In [0]:
skill_cols = ['Crossing',
       'Finishing', 'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling',
       'Curve', 'FKAccuracy', 'LongPassing', 'BallControl', 'Acceleration',
       'SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower',
       'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression',
       'Interceptions', 'Positioning', 'Vision', 'Penalties', 'Composure',
       'Marking', 'StandingTackle', 'SlidingTackle', 'GKDiving', 'GKHandling',
       'GKKicking', 'GKPositioning', 'GKReflexes']
dfskills = file_data[skill_cols]
len(dfskills.columns)

A heatmap of the entire dataset shows that several variables are correlated with one another. This leads us to believe that correlated variables may have a similar impact on player worth. We need a method that identifies the variables carrying the most weight so that we can further narrow our scope.

In [0]:
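hist_data = pd.read_csv("Fifa_hist.csv")
# Correlate numeric columns only; text columns such as Nationality cannot be correlated
cordf = hist_data.select_dtypes(include='number').corr()
sns.heatmap(cordf, annot=False, cmap='coolwarm')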

We run a clustermap on our data to further visualize which variables are the most correlated. From the map, it seems that Crossing, BallControl, Special, Vision, Curve, Skill Moves, and Positioning are the most positively correlated with one another. Correlated inputs may be redundant in our neural network model, so we now need a way to limit the scope of the project to a select few variables.

In [0]:
sns.clustermap(cordf,cmap='coolwarm',annot=False,standard_scale=1)
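
Reading the strongest pairs off a heatmap by eye is error-prone, so the same information can also be listed as a sorted table. The following is a minimal sketch on synthetic data, not our actual dataset: the `toy` frame and its values are made up for illustration, and only the column names are borrowed from the skill list.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the player table (values are made up for illustration).
rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "Crossing": base + rng.normal(scale=0.3, size=500),
    "Curve": base + rng.normal(scale=0.3, size=500),
    "Strength": rng.normal(size=500),
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
print(pairs.head())
```

Applied to `cordf`, the same masking step would list the variable pairs behind the brightest cells of the heatmap.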

In order to reduce the dimensions of our dataset and observe which variables are the most important in predicting player worth, principal component analysis is run with 5 components. The weights of the original variables in these components will then allow us to rank the variables and obtain the top 5 most important ones.

In [0]:
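from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas_profiling

# Select the skill columns from the data frame loaded above and profile them
dfskills = file_data[skill_cols]
pandas_profiling.ProfileReport(dfskills)

# Drop rows with missing values (reassigning avoids modifying a slice in place)
dfskills = dfskills.dropna()

# Scaling
scaler = StandardScaler()
scaled_data = scaler.fit_transform(dfskills)

# PCA must be fitted before it can transform the scaled data
pca = PCA(n_components=5)
x_pca = pca.fit_transform(scaled_data)

plt.scatter(x_pca[:, 0], x_pca[:, 1])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')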

Comparing the shapes before and after the transformation confirms that we reduce the dimensions of our dataset from 34 to 5.

In [0]:
scaled_data.shape

In [0]:
x_pca.shape
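
Reducing 34 dimensions to 5 discards some information, and the fitted PCA object reports how much variance is kept through its `explained_variance_ratio_` attribute. The sketch below illustrates this check on random placeholder data (the array `X` is an assumption standing in for the 34 skill columns), so the printed fraction says nothing about our actual dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random placeholder for a table with 34 skill columns (not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 34))

X_scaled = StandardScaler().fit_transform(X)
pca_demo = PCA(n_components=5).fit(X_scaled)
X_reduced = pca_demo.transform(X_scaled)

# Fraction of the total variance retained by the 5 components.
retained = pca_demo.explained_variance_ratio_.sum()
print(X_scaled.shape, X_reduced.shape, round(retained, 3))
```

On the real data, the same sum computed from `pca.explained_variance_ratio_` would indicate whether 5 components are enough.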

Although the scatter plot is hard to interpret, since each component combines many of the original dimensions, we are able to obtain the weights of every feature in each of the PCA components.


In [0]:
# Makes data frame with weights for all the features in the PCA components 
df_comp = pd.DataFrame(pca.components_, columns = skill_cols)
df_comp

Now, a weighted average is taken with the feature weights obtained in the previous step in order to determine the top five most important variables. Each feature's absolute weight in a component is multiplied by that component's explained variance ratio, and the results are summed across components. This weighting is used because the principal components explain different amounts of variance, so a single component cannot represent a feature's importance entirely. We find that StandingTackle, SlidingTackle, LongShots, Finishing, and Interceptions are the most important variables. These will be used as inputs in the neural network.

In [0]:
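# Share of total variance explained by each principal component
vals = pca.explained_variance_ratio_
# Weight each component's feature loadings by that share (row-wise)
df_adjusted = df_comp.mul(vals, axis=0)
# Use absolute values so positive and negative loadings do not cancel
absolute_df_adjusted = df_adjusted.abs()
# Sum over components to get one importance score per feature
rankings = absolute_df_adjusted.sum(axis=0)
rankings.sort_values(ascending=False)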
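
To see what this ranking rewards, it helps to run the same steps on data whose structure is known in advance. The following is a small sketch under stated assumptions: the columns `a` through `e` are hypothetical, with `a`, `b`, and `c` driven by one shared latent factor and `d` and `e` being independent noise.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical features: a, b, c share a latent factor; d and e are pure noise.
rng = np.random.default_rng(42)
n = 1000
latent = rng.normal(size=n)
toy = pd.DataFrame({
    "a": latent + rng.normal(scale=0.3, size=n),
    "b": latent + rng.normal(scale=0.3, size=n),
    "c": latent + rng.normal(scale=0.3, size=n),
    "d": rng.normal(size=n),
    "e": rng.normal(size=n),
})

scaled = StandardScaler().fit_transform(toy)
toy_pca = PCA(n_components=2).fit(scaled)

# Same ranking as above: |loading| weighted by explained variance ratio, summed per feature.
loadings = pd.DataFrame(toy_pca.components_, columns=toy.columns)
weighted = loadings.mul(toy_pca.explained_variance_ratio_, axis=0).abs()
toy_rankings = weighted.sum(axis=0).sort_values(ascending=False)
print(toy_rankings)
```

In this toy setting the three correlated features rank highest, which suggests the score favors variables that belong to large correlated groups rather than variables that are unique.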

We were curious about how these variables differ across the acclaimed top five FIFA teams of 2019. Violin plots are therefore made to compare the distributions and means of players from Belgium, France, Brazil, England, and Croatia. For standing tackles, we observe that Brazil, Croatia, and France are well above the average, whereas Belgium and Croatia are approximately average.

In [0]:
top_teams = ['Belgium', 'France', 'Brazil', 'England', 'Croatia']
newdf = hist_data[hist_data['Nationality'].isin(top_teams)]
sns.set_style('darkgrid')
sns.violinplot(x='Nationality',y='StandingTackle',data=newdf)
newdf['StandingTackle'].mean()
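
The violin plot shows each team's distribution, while the cell above prints only the pooled mean of the five teams. A per-team comparison against that pooled mean can be tabulated with `groupby`. The sketch below uses a tiny made-up table (the values are illustrative and are not taken from our dataset).

```python
import pandas as pd

# Made-up example values, only to illustrate the calculation.
toy = pd.DataFrame({
    "Nationality": ["Brazil", "Brazil", "France", "France", "England", "England"],
    "StandingTackle": [70, 60, 50, 40, 30, 20],
})

# Pooled mean over all rows, then each team's mean relative to it.
overall_mean = toy["StandingTackle"].mean()
by_country = toy.groupby("Nationality")["StandingTackle"].mean()
diff_from_overall = (by_country - overall_mean).sort_values(ascending=False)
print(diff_from_overall)
```

Replacing `toy` with `newdf` would give the same table for the five teams, and the column name can be swapped for any of the other statistics.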

For sliding tackles, we observe the same pattern as for standing tackles. All of these distributions appear to be bimodal.

In [0]:
sns.violinplot(x='Nationality',y='SlidingTackle',data=newdf)
newdf['SlidingTackle'].mean()

It seems that Brazil and Croatia are dominant in heading accuracy. The distributions all appear to be left-skewed and slightly bimodal as well.

In [0]:
sns.violinplot(x='Nationality',y='HeadingAccuracy',data=newdf)
newdf['HeadingAccuracy'].mean()

For long shots, it seems that Brazil and Belgium are the strongest, sitting the furthest above average compared to the other teams. The distributions also appear to be much more uniform across these 5 teams.

In [0]:
sns.violinplot(x='Nationality',y='LongShots',data=newdf)
newdf['LongShots'].mean()

The finishing statistic seems to follow the same distribution and ranking pattern as long shots. Overall, we found it interesting that, within each plot, the distribution of the player statistic was similar across all teams.

In [0]:
sns.violinplot(x='Nationality',y='Finishing',data=newdf)
newdf['Finishing'].mean()