# FIFA 21
#### An exploration by Pranjal Timsina

This dataset is a collection of various attributes of the players in the game FIFA 21 by EA Sports. It is not a large dataset by any means, since only a limited number of players qualify as professionals. This notebook is an interpretation of the data by a follower of this beautiful game.
<a id="section-one"></a>
# 1. The Big Picture

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

sns.set_theme()

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
df = pd.read_csv('/kaggle/input/fifa-21-complete-player-dataset/players_21.csv')

It is always a good idea to get to know the dataset in general before any visualization.

In [None]:
df.head(5)

The dataset is sorted in descending order with "overall" as the key, so it is no surprise that Lionel Messi and Cristiano Ronaldo are the first two names on the list. Let's have a look at the dataset with a statistical eye.

In [None]:
df.describe()

In [None]:
print(f"This dataset has {df.shape[0]} rows, and {df.shape[1]} columns.")

With ~20000 rows and ~100 columns, this dataset is not humongous, but it isn't small either.

The sheer number of columns could be a blessing, or perhaps a curse. Let's find out.

In [None]:
for index, col in enumerate(df.columns):
    print(f"{col:<26} | ", end="")
    if (index % 3 == 2):
        print("\n", end="")

There seems to be a fair share of both categorical and numerical data. Most of the columns seem to be of use to us, some do not. Fields like sofifa_id, dob and a few others are of no use to us, but let's not get rid of them immediately.

In [None]:
df.nunique(axis=0)

pandas truncates this output by default, so not every column is shown here.
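One way to show every row is to lift the `display.max_rows` limit temporarily with `pd.option_context`. This is a minimal sketch on a stand-in frame (`toy` and its column names are made up for illustration), not the real dataset.

```python
import pandas as pd

# Stand-in for the real dataset: 100 columns, so the nunique() summary has 100 rows
toy = pd.DataFrame({f"col_{i}": range(5) for i in range(100)})
counts = toy.nunique(axis=0)

# Inside the context manager, pandas renders every row instead of truncating
with pd.option_context("display.max_rows", None):
    full_text = repr(counts)

# Outside the context, the usual display limit applies again
short_text = repr(counts)
print(len(full_text.splitlines()), len(short_text.splitlines()))
```

In the notebook, calling `print(df.nunique(axis=0))` inside the same `with` block should list every column.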
<a id="section-two"></a>
# 2. Distribution of Physical Attributes

In [None]:
physical_attributes = [
    "height_cm",
    "weight_kg",
    "power_strength",
    "pace",
    "movement_sprint_speed",
    "physic"
    ]
df[physical_attributes].describe()

In [None]:
f = plt.figure(figsize=(20, 9))
gs = f.add_gridspec(2, 3)  # six attributes laid out as 2 rows x 3 columns

with sns.axes_style("ticks"):
    for i, attr in enumerate(physical_attributes):
        # Fill the grid row by row
        ax = f.add_subplot(gs[i // 3, i % 3])
        sns.kdeplot(data=df, x=attr, cut=0, fill=True, linewidth=0, alpha=.5, ax=ax)
        ax.set_title(f"Distribution of {attr}")
        # Mark the mean of the attribute (missing values are skipped)
        ax.axvline(x=df[attr].mean(), c='red', label=f'Mean {attr}')
        ax.set_xlabel(attr)
        ax.set_ylabel("Density")  # a KDE shows density, not raw frequency
        ax.legend(loc="upper left")
        sns.despine(ax=ax, trim=True, offset=5)

f.tight_layout()

Here, height is the closest to a normal distribution, which isn't surprising at all. Players actively train to attain a specific weight, strength, pace and physique which brings skewness in the distribution. 
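Skewness can also be put into a number. The sketch below uses synthetic data (the two series are invented stand-ins, not columns from the dataset) to show how `Series.skew()` behaves for a roughly symmetric attribute and for one with a long tail of low values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins, not the real columns:
# a roughly symmetric attribute and one with a long tail of low values
symmetric = pd.Series(rng.normal(loc=181, scale=7, size=5000))
left_tailed = pd.Series(90 - rng.exponential(scale=10, size=5000))

# Series.skew() is near 0 for symmetric data and negative for a long left tail
print(f"symmetric:   {symmetric.skew():.2f}")
print(f"left-tailed: {left_tailed.skew():.2f}")
```

On the real data, `df[physical_attributes].skew()` would give one such number per attribute.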

One hypothesis is that the negative skewness of pace, strength, physic and sprint speed is driven by the older players.

We can look at the average pace, strength and other attributes in different age groups to test this hypothesis.

But before we get into that, let us have a look at the distribution of age.

In [None]:
df["age"].describe()

In [None]:
sns.set_style("white")
sns.set_style("ticks")
plt.figure(figsize=(10, 5))
sns.boxplot(data=df, x="age", color="lightblue", width=0.2);
plt.title("Boxplot of the age of players")
sns.despine(left=True);

In [None]:
unique_ages = sorted(df["age"].unique())
paces = []
counts = []
for age in unique_ages:
    age_pace = df.loc[df["age"] == age, "pace"]
    paces.append(age_pace.mean())  # mean() skips missing pace values
    counts.append(len(age_pace))   # count every player of this age, even those with missing pace
sns.set_style("white")
plt.figure(figsize=(8, 8));
sns.scatterplot(x=unique_ages, y=counts, color="darkblue", size=counts);
sns.despine()
plt.title("Frequency of Ages");
plt.xlabel("Ages");
plt.ylabel("Count");

I don't know if we can really call this unbalanced: from ages 20 to 35 the number of players is fairly even, and given that players tend to retire between ages 30 and 35, the steady drop-off in frequency makes sense. Let us now have a look at how the average pace varies with age.

In [None]:
plt.figure(figsize=(8, 8))
sns.scatterplot(x=unique_ages, y=paces, size=paces, color="darkblue");
sns.despine()
plt.title("Age vs Average Pace");
plt.xlabel("Ages");
plt.ylabel("Average Pace");

This is not surprising at all.

But, for the sake of formality: "Evidence suggests players get slower as they get older."
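The same check can be made per age group instead of per individual age. This is a rough sketch on made-up numbers; the bin edges are an arbitrary choice, and `sample` stands in for `df[["age", "pace"]]`.

```python
import pandas as pd

# Made-up sample; the real notebook would use df[["age", "pace"]]
sample = pd.DataFrame({
    "age":  [18, 21, 24, 27, 29, 31, 33, 36, 38],
    "pace": [72, 74, 73, 71, 69, 66, 63, 55, 50],
})

# Bucket ages into bands; the bin edges are an arbitrary choice for illustration
bands = pd.cut(sample["age"], bins=[15, 20, 25, 30, 35, 45])

# Mean pace and number of players per age band
summary = sample.groupby(bands, observed=True)["pace"].agg(["mean", "count"])
print(summary)
```

Reporting the count next to the mean helps to spot bands that contain too few players to say much about.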

As uninteresting as it may sound, the distribution of height and weight also deserves a look.

The two fancy plots below illustrate the distribution of height and weight in the dataset.

In [None]:
plt.figure(figsize=(8, 8))
sns.kdeplot(
    data=df,
    x="weight_kg",
    y="height_cm",
    fill=True,
    cmap="mako",
    thresh=0, levels=100,
);
plt.title("Weight-Height Distribution");
with sns.axes_style('white'):
    sns.jointplot( data=df, x="weight_kg", y="height_cm", kind='hex');

We can deduce that if you weigh around 75 kg and stand around 180 cm tall, you're likely to be a footballer.

No, of course I'm kidding. It just means that the weight and height of footballers tend to cluster around 75 kg and 180 cm respectively.

<a id="section-three"></a>
# 3. Skill Moves
Skill moves are exciting, aren't they?

Let's get the basic information about skill moves out of the way.

In [None]:
df["skill_moves"].describe()

In [None]:
df["skill_moves"].value_counts()

In [None]:
sns.set_style("white")
sns.set_style("ticks")
plt.figure(figsize=(10, 5))
sns.boxplot(x=df["skill_moves"], color="lightblue", width=0.2);
plt.title("Boxplot of skill moves")
sns.despine(left=True);

Players with 5 star skill moves are not common - so much so that they are considered outliers!
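Why a box plot flags them can be worked out by hand: by default, points further than 1.5 times the interquartile range from the quartiles are drawn as outliers. The counts below are hypothetical, chosen only to mimic the shape of the distribution; they are not the real value counts.

```python
import numpy as np
import pandas as pd

# Hypothetical counts per skill-move rating, chosen only to mimic the shape above
skill = pd.Series(np.repeat([1, 2, 3, 4, 5], [2000, 9000, 6500, 1300, 50]))

q1, q3 = skill.quantile([0.25, 0.75])
iqr = q3 - q1

# Box-plot whiskers reach at most 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outlier_values = sorted(skill[(skill < lower_fence) | (skill > upper_fence)].unique())
print(q1, q3, upper_fence, outlier_values)
```

With quartiles at 2 and 3, the upper fence sits at 4.5, so only the rating 5 falls outside it in this toy example.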

Perhaps a pie chart would illustrate this better.

In [None]:
pie, ax = plt.subplots(figsize=[15,10])
labels = df["skill_moves"].value_counts().keys()
plt.pie(x=df["skill_moves"].value_counts(), autopct="%.1f%%", labels=labels, explode=[0.05]*5, pctdistance=0.5)
plt.legend()
plt.title("Proportion of players by skill moves", fontsize=14);

We've all heard of Joga Bonito - Brazil's way of playing beautiful football. Let's see if this is true.
Let us take players who have either 4 or 5 star skill moves.

In [None]:
skillers = df[(df["skill_moves"] == 4) | (df["skill_moves"] == 5)]
skiller_nations = skillers["nationality"].value_counts(normalize=True)
rest = skiller_nations[10:].sum()
skiller_nations = skiller_nations[:10]
skiller_nations["Other"] = rest
pie, ax = plt.subplots(figsize=[12,12])
labels = skiller_nations.keys()
plt.pie(x=skiller_nations, autopct="%.1f%%", labels=labels, pctdistance=0.5, explode=[0.05]*11);
plt.legend(loc="upper right")
plt.title("4 and 5 star skill moves by country", fontsize=14);

Let's have a look at the distribution of players strictly having 5 star skill moves.

In [None]:
skillers = df[(df["skill_moves"] == 5)]
skiller_nations = skillers["nationality"].value_counts(normalize=True)
rest = skiller_nations[10:].sum()
skiller_nations = skiller_nations[:10]
skiller_nations["Other"] = rest
pie, ax = plt.subplots(figsize=[12,12])
labels = skiller_nations.keys()
plt.pie(x=skiller_nations, autopct="%.1f%%", labels=labels, pctdistance=0.5, explode=[0.05]*11)
plt.legend(loc="upper right")
plt.title("5 star skill moves by country", fontsize=14);

Very suspicious from Spain: if we take players with either 4 star or 5 star skill moves, Spain is second in the number of players; however, when we consider only the players with 5* skill moves, they are nowhere to be seen.

Also, yes, Brazilian players are quite skilled.
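The two pie charts above repeat the same "top 10 plus Other" aggregation. A small helper could hold that logic in one place; `top_n_with_other` is a hypothetical name, and the nationalities below are made up for illustration.

```python
import pandas as pd

def top_n_with_other(values: pd.Series, n: int = 10) -> pd.Series:
    """Return the n most common values as shares, with the remainder summed into 'Other'."""
    shares = values.value_counts(normalize=True)
    top = shares.iloc[:n].copy()
    top["Other"] = shares.iloc[n:].sum()
    return top

# Made-up nationalities for illustration
nations = pd.Series(["Brazil"] * 5 + ["Spain"] * 3 + ["France"] * 1 + ["Nepal"] * 1)
shares = top_n_with_other(nations, n=2)
print(shares)
```

In the notebook this would be called as `top_n_with_other(skillers["nationality"])`, and the result passed straight to `plt.pie`.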

Let's have a look at which club loves skillful players.

The graph below shows the number of 4/5* skillers per club.

In [None]:
skillers = df[(df["skill_moves"] == 4) | (df["skill_moves"] == 5)]
skiller_clubs = skillers["club_name"].value_counts()
# Name the columns explicitly; the default names from value_counts() differ between pandas versions
skiller_clubs = skiller_clubs.rename_axis("Club").reset_index(name="Count")[:10]
sns.set(font_scale=4.5)
sns.set_style("white")
plt.figure(figsize=(100, 30))
sns.barplot(data=skiller_clubs, x="Club", y="Count", palette="pastel");
sns.despine(trim=True)

The one below shows the number of 5* skillers only.

In [None]:
skillers = df[(df["skill_moves"] == 5)]
skiller_clubs = skillers["club_name"].value_counts()
# Name the columns explicitly; the default names from value_counts() differ between pandas versions
skiller_clubs = skiller_clubs.rename_axis("Club").reset_index(name="Count")[:10]
sns.set(font_scale=5)
sns.set_style("white")
plt.figure(figsize=(100, 30))
sns.barplot(data=skiller_clubs, x="Club", y="Count", palette="pastel");
sns.set(font_scale=1)
sns.despine(trim=True)

Juventus has the most players with 5* skill moves, but when it comes to 4* and 5* combined, La Liga giants Barcelona and Real Madrid seem to shine.

<a id="section-four"></a>
# 4. Loaners?

In [None]:
loaners = df["loaned_from"].value_counts()
loaners = loaners[loaners > 7]
# Name the columns explicitly; the default names from value_counts() differ between pandas versions
loaners = loaners.rename_axis("Club").reset_index(name="Count")[:10]
loaners

The table above shows what we want, but it isn't visually appealing. A bar plot is much better.

In [None]:
sns.set_theme(style="white")
plt.figure(figsize=(30,6))
sns.barplot(data=loaners, x="Club", y="Count", palette="pastel");
plt.title("Bar plot of the number of players a club loans out")
sns.set(font_scale=1)
sns.despine(trim=True, offset=5)

Chelsea and Manchester City love to loan out their players. +1 reason why I don't like them.

<a id="section-five"></a>
# 5. Primary Attributes

There are 6 primary attributes for a player in FIFA: pace, dribbling, shooting, passing, defending and physical.
One could hypothesize that there is a correlation between dribbling, shooting and passing as attackers tend to be good at these three things. Defending and physical are, perhaps, correlated, too, as for defenders, defending and strength are highly sought after attributes in the modern game.

In [None]:
primaries = ["pace" , "defending", "shooting", "dribbling", "passing", "physic"] 
primary_df = df[primaries]
primary_df.head()

Replacing NaN values with averages would distort the data, so dropping those rows is the better idea.
I presume the correlation between the variables will be interesting.

In [None]:
primary_df = primary_df.dropna()
corr = primary_df.corr()
corr

In [None]:
corr_mat = primary_df.corr().stack().reset_index(name="correlation")

sns.set_style("white")
g = sns.relplot(
    data=corr_mat,
    x="level_0", y="level_1", hue="correlation", size="correlation",
    palette="vlag", hue_norm=(-1, 1), edgecolor=".7",
    height=10, sizes=(50, 250), size_norm=(-.2, .8),
)
plt.title("Correlation of different primary attributes")
# Tweak the figure to finalize
g.set(xlabel="", ylabel="", aspect="equal")
g.despine(left=True, bottom=True)
g.ax.margins(.02)
for label in g.ax.get_xticklabels():
    label.set_rotation(90)
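# Matplotlib 3.7 renamed legendHandles to legend_handles; support both spellings
legend = g.legend
handles = legend.legend_handles if hasattr(legend, "legend_handles") else legend.legendHandles
for artist in handles:
    artist.set_edgecolor(".7")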

Well, dribbling, passing, and shooting seem to have some correlation - the most notable one being between dribbling and passing with 0.834.

Unfortunately, a player's physical attributes and defending (0.55) are not as strongly correlated as we thought, but it isn't too bad, is it?

We can see a little correlation between pace, shooting and dribbling, too.
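For reference, the numbers in the matrix are Pearson correlation coefficients, which is the default method of `DataFrame.corr()`. The sketch below recomputes one coefficient by hand on made-up ratings (not taken from the dataset) and compares it with the pandas result.

```python
import numpy as np
import pandas as pd

# Made-up ratings for six players; not taken from the dataset
toy = pd.DataFrame({
    "dribbling": [85, 78, 60, 72, 90, 55],
    "passing":   [80, 75, 58, 70, 86, 60],
})

x = toy["dribbling"].to_numpy(dtype=float)
y = toy["passing"].to_numpy(dtype=float)

# Pearson r: covariance of x and y divided by the product of their standard deviations
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# DataFrame.corr() uses the Pearson method by default
r_pandas = toy.corr().loc["dribbling", "passing"]
print(round(r_manual, 4), round(r_pandas, 4))
```

Pearson's r only captures linear association, so a moderate value such as 0.55 does not rule out a stronger non-linear relationship.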

<a id="section-six"></a>
# 6. Goalkeepers
We've seen skill moves, and the primary attributes of outfield players. The goalkeepers deserve some love too.

In [None]:
mean_gk_age = df[df["team_position"] == "GK"]["age"].mean()
mean_outfield_age = df[df["team_position"] != "GK"]["age"].mean()
mean_overall = df["age"].mean()
median_overall = df["age"].median()
print(f"The mean age for goalkeepers is {mean_gk_age}\nThe mean age for outfield players is {mean_outfield_age}.\nThe overall mean age is {mean_overall}")

Initial findings show that goalkeepers are, on average, about 3 years older than outfield players. This deserves more exploration.

In [None]:
plt.figure(figsize=(25, 10))
sns.set(font_scale=1)
sns.set_style("white")
graph = sns.boxplot(data=df, x="team_position", y="age", palette="vlag");
sns.despine()
plt.title("Box plot of ages by position")
graph.axhline(mean_overall, color="yellow", label="mean");
graph.axhline(median_overall, color="black", label="median");
graph.axhline(mean_gk_age, color="blue", label="goalkeepers")
graph.legend();

The graphs confirm that goalkeepers are as old as time itself.

In [None]:
goalkeeper_attributes = [
    "goalkeeping_diving",
    "goalkeeping_handling",
    "goalkeeping_kicking",
    "goalkeeping_positioning",
    "goalkeeping_reflexes"  
]
gk_attributes = [
    "short_name",
    "gk_diving",
    "gk_handling",
    "gk_kicking",
    "gk_reflexes",
    "gk_speed",
    "gk_positioning",
    "overall",
    "potential",
    "height_cm",
    "weight_kg"
]

We seem to have a little problem here: there are two sets of goalkeeping columns, gk_* and goalkeeping_*.

In [None]:
df["gk_diving"]

In [None]:
df["goalkeeping_diving"]

It is pretty clear that gk_* attributes are NaN for outfield players, while goalkeeping_* attributes are the goalkeeping attributes any player would have if they were played as a goalkeeper. 
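If that reading is right, a missing value in a gk_* column can double as a goalkeeper flag. The sketch below uses a tiny invented frame (names and numbers are made up). On the real data, this selection may differ from filtering on `team_position` if some goalkeepers carry another value there, which is an assumption not checked here.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame; names and numbers are invented
toy = pd.DataFrame({
    "short_name": ["Keeper A", "Striker B", "Keeper C", "Defender D"],
    "gk_diving": [87.0, np.nan, 80.0, np.nan],
    "goalkeeping_diving": [87, 10, 80, 14],
})

# gk_diving is only filled in for goalkeepers, so notna() separates the two groups
is_goalkeeper = toy["gk_diving"].notna()
keepers = toy.loc[is_goalkeeper, "short_name"].tolist()
outfield = toy.loc[~is_goalkeeper, "short_name"].tolist()
print(keepers)
print(outfield)
```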

Let's look at the stats of the highest rated goalkeeper.

In [None]:
goalkeepers = df[df["team_position"] == "GK"]
gk_attr_df = goalkeepers[gk_attributes]
highest_rated = gk_attr_df.loc[gk_attr_df['overall'] == gk_attr_df['overall'].max()]
highest_rated

No surprise that it is Jan Oblak. 

Let's have a look at the primary attributes of goalkeepers now.

In [None]:
plt.figure(figsize=(13, 8))
sns.set_style("white")
plot = sns.boxplot(data=gk_attr_df, width=0.25)
plt.title("Box plot of different gk attributes")
sns.despine()
plt.show()

This is not very interesting. Let's do some head-to-head comparisons.

In [None]:
gk_attr_df.head(10)

### Round 1 Hugo Lloris vs. Jan Oblak
![Lloris](http://tot-tmp.azureedge.net/media/31714/firstteam_hugolloris.png?anchor=center&mode=crop&width=500)
![Oblak](https://www.pngitem.com/pimgs/m/538-5387295_jan-oblak-png-high-quality-image-jan-oblak.png)

In [None]:
labels=np.array([
    "gk_diving",
    "gk_handling",
    "gk_kicking",
    "gk_reflexes",
    "gk_speed",
    "gk_positioning",
    "overall",
    "potential",
])
# Row label 2 is J. Oblak (labels are carried over from the original dataset)
stats = gk_attr_df.loc[2, labels].values

fig = go.Figure()

fig.add_trace(go.Scatterpolar(
      r=stats,
      theta=labels,
      fill='toself',
      name='J. Oblak'
))

# Row label 36 is H. Lloris
stats = gk_attr_df.loc[36, labels].values

fig.add_trace(go.Scatterpolar(
      r=stats,
      theta=labels,
      fill='toself',
      name='H. Lloris'
))

fig.update_layout(
    autosize=False,
    width=500,
    height=500,)

fig.show()

As we can see here, Oblak clearly won the fight. Jan is a bit slower than Hugo Lloris, but on all other fronts, he is clearly better.
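The comparison above picks the two goalkeepers by hard-coded row labels, which breaks silently if the data is re-sorted or filtered. A lookup by name is one alternative. `stats_for` is a hypothetical helper, and the frame below holds invented numbers; only the column names mirror the notebook.

```python
import pandas as pd

# Invented numbers; only the column names mirror the notebook
toy = pd.DataFrame({
    "short_name": ["Keeper A", "Keeper B"],
    "gk_diving": [87, 85],
    "gk_speed": [49, 62],
}, index=[2, 36])

def stats_for(frame: pd.DataFrame, name: str, columns: list) -> list:
    """Return the chosen columns for the single player called `name`."""
    rows = frame.loc[frame["short_name"] == name, columns]
    if len(rows) != 1:
        raise ValueError(f"expected exactly one player named {name!r}, found {len(rows)}")
    return rows.iloc[0].tolist()

result = stats_for(toy, "Keeper B", ["gk_diving", "gk_speed"])
print(result)
```

Raising an error when the name matches zero or several rows avoids plotting the wrong player by accident.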
<a id="section-seven"></a>
# 7. The End?

People can take different approaches to a dataset; what I've presented above is the dataset through my eyes. Of course, this notebook hasn't done justice to the 106 columns of this dataset, but we still discovered a few useful things. ~Some~ Most of what I've presented was rather obvious, but I think it is beautiful when data backs our intuition. Here is a summary of our findings.

1. The negative skewness of the physical attributes may be attributable to the older players, as the drop in average pace with age suggests. Perhaps excluding players over 35 years of age would bring the distributions closer to the normal distribution.
2. The weight of players tends to be around 75 kg and their height around 180 cm.
3. Players with 5 star skill moves are very rare.
4. Brazilians are the most skilled. Among players with 4 or 5 star skill moves, Spain comes second, but it's sad that they are nowhere near the top 10 when considering 5 star skill moves alone.
5. Juventus loves hoarding players with 5 star skill moves.
6. Lille, Chelsea and Manchester City love to loan out their players. 
7. Dribbling, passing, and shooting seem to have mid to high positive correlation (dribbling and passing have the highest correlation with 0.834).
8. The correlation between a player's physic and defending (0.55) was not as high as I expected it to be.
9. We can see some correlation between pace, shooting and dribbling, too.
10. Goalkeepers tend to be older than outfield players.
11. Oblak is a really good goalkeeper.

I hope these newfound insights will help you while playing FIFA 21 Career mode.

Also, thank you for taking the time to go through my work.
