**In this notebook, we are going to analyse different features of the FIFA 20 dataset and visualise patterns, trends and relations between them.**

**Let's get started!**

# Load Libraries

Let us load the libraries we are using for Exploratory Data Analysis.

In [None]:
#Load Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.graph_objs as go
import plotly.express as px

import warnings
warnings.filterwarnings('ignore')

## Load Data

Let us now load the data and store it as a pandas DataFrame.
Here we display the first 5 rows of the DataFrame.

In [None]:
data = pd.read_csv("../input/fifa-20-complete-player-dataset/players_20.csv")
data.head()

# Data exploration

In [None]:
data.shape

From the shape of the data, we observe that there are:
- 18,278 observations
- 104 features

Now let us print the column names and find out what the 104 features are.

In [None]:
list(data.columns)

In [None]:
data.info()

From the output above, we can read off the data types of our features.
- data types: float64(16), int64(45), object(43)
<br><br>
Note that the object columns will need to be converted, since most models expect numeric input.
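
As a minimal sketch of what handling object columns could look like, the cell below encodes two text columns on a tiny made-up frame. Only the column names mirror the dataset; the rows, the `toy` and `encoded` names and the choice of encodings are assumptions for illustration, not a step this notebook performs.

```python
import pandas as pd

# Toy frame with made-up rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "preferred_foot": ["Left", "Right", "Right", "Left"],
    "work_rate": ["High/Medium", "Medium/Medium", "High/Low", "Medium/Medium"],
    "overall": [80, 75, 70, 65],
})

# Two-level feature: turn it into a single 0/1 indicator.
toy["preferred_foot_right"] = (toy["preferred_foot"] == "Right").astype(int)

# Multi-level feature: one indicator column per level.
work_rate_dummies = pd.get_dummies(toy["work_rate"], prefix="work_rate", dtype=int)

# Keep only numeric columns in the encoded frame.
encoded = pd.concat(
    [toy[["overall", "preferred_foot_right"]], work_rate_dummies], axis=1
)
print(encoded.dtypes)
```

Every column of `encoded` is numeric, at the cost of one extra column per `work_rate` level.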

Now let us analyse the statistical description of our data.

In [None]:
data.describe()

We can make the following **observations**:
- Columns such as age, value_eur, wage_eur and international_reputation contain outliers.
- The features are on very different value scales.
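
To illustrate the point about value scales, here is a small sketch of min-max scaling, which maps every column to the range 0 to 1. The numbers are made up and the scaling method is an assumption; this notebook does not rescale the data.

```python
import pandas as pd

# Made-up values on very different scales (years vs euros).
toy = pd.DataFrame({
    "age": [18, 25, 32, 39],
    "wage_eur": [1000, 20000, 150000, 565000],
})

# Min-max scaling: subtract each column's minimum, divide by its range.
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```

After scaling, both columns span exactly 0 to 1, so neither dominates purely because of its units.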

## Data Cleaning

Let us remove some redundant features that won't be required for data analysis.

In [None]:
#removing redundant columns
redundant_columns = ['sofifa_id','player_url','long_name','dob','nation_jersey_number','loaned_from']
data = data.drop(redundant_columns, axis = 1)

Now let us count the missing values in each column (feature).

In [None]:
null_data = data.isna().sum().sort_values(ascending=False)
null_data = null_data.reset_index(drop = False)
null_data = null_data.rename(columns={"index": "Columns", 0:"Value"})
null_data['proportion'] = (null_data['Value']/len(data))*100
null_data.head()
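
The same percentages can be obtained more compactly, because the mean of a boolean mask is the fraction of `True` values. The sketch below uses a made-up three-column frame; `toy` and `missing_pct` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up rows; NaN/None mark missing entries.
toy = pd.DataFrame({
    "release_clause_eur": [1.0, np.nan, 3.0, np.nan],
    "joined": ["2018-07-01", None, "2019-01-15", "2017-08-10"],
    "age": [20, 25, 30, 35],
})

# isna() gives a boolean frame; its column means are the missing fractions.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```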

Now let us plot the percentage of missing values to understand it better.

In [None]:
missing = null_data[null_data['proportion']>10]
fig = px.pie(missing, names='Columns', values='proportion',
             color_discrete_sequence=px.colors.sequential.Viridis_r,
             title='Percentage of Missing values in Columns')
fig.update_traces(textposition='inside', textinfo='label')
fig.show()

Let us look at the revised data now after removing redundant features:

In [None]:
data.shape

In [None]:
data.head()

# Exploratory Data Analysis

Since our data contains 98 features, it will be very confusing to compare all features at the same time. Hence, we take subsets of the dataset by selecting some important features and analyse them.

In [None]:
#Taking subsets of data for analysis
dataset1 = data[['short_name','age','height_cm','weight_kg','nationality','club','overall','potential','value_eur',
                 'wage_eur','player_positions','preferred_foot','international_reputation','weak_foot',
                 'skill_moves','work_rate','body_type','real_face','release_clause_eur','joined']]

I made a separate DataFrame for the numeric columns of dataset1 to analyse them statistically later on.

In [None]:
numeric_features = ['age','height_cm','weight_kg','overall','potential','value_eur',
                 'wage_eur','international_reputation','weak_foot',
                 'skill_moves','release_clause_eur']
numeric_dataset1 = data[numeric_features]

In [None]:
dataset1.describe()

In [None]:
dataset1.isna().sum().sort_values(ascending=False)

We observe:
- only joined and release_clause_eur contain missing values.
- all other features of dataset1 have 0 null values.

Now, we will plot boxplots of the numerical columns to identify outliers in our dataset, which can be seen as dots or circles below.

In [None]:
#boxplot for outliers
for col in numeric_dataset1.columns:
    sns.boxplot(x = col, data = dataset1) 
    plt.show()

We observe:
- Features such as age and weight_kg have few outliers.
- Features such as wage_eur and value_eur have many outliers, as their boxplots show.
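
The dots in a boxplot are, with seaborn's default settings, the points lying more than 1.5 times the interquartile range beyond the quartiles. A rough sketch of counting them directly is given below; the helper `count_iqr_outliers` and the two series are hypothetical and use made-up numbers.

```python
import pandas as pd


def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(mask.sum())


# Made-up values: one extreme wage, evenly spread ages.
wages = pd.Series([1000, 2000, 2000, 3000, 3000, 4000, 5000, 565000])
ages = pd.Series([19, 21, 23, 25, 27, 29, 31, 33])

print(count_iqr_outliers(wages), count_iqr_outliers(ages))
```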



Since we are only analysing and visualising data in this notebook, we don't need to handle outliers at present.

Now let us plot their histograms to look at their distributions.

In [None]:
#histograms of numerical features
dataset1.hist(bins='auto', figsize=(14, 10))

We infer:
- Most players are in the 20-25 age group.
- The peaks and general shapes of the other features can be read from the histograms too.
- The distributions of value_eur and wage_eur are highly right-skewed.
- international_reputation, skill_moves and weak_foot take only a few levels, so they behave like ordinal categories.
- All other features take discrete values over a wider range.
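
For heavily right-skewed money columns, a log transform is a common way to reduce skew before modelling. The sketch below measures skewness before and after `np.log1p` on a made-up series that doubles at each step; it is an illustration only and is not applied to the data in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up values that double each step, giving a long right tail.
value_eur = pd.Series([1000 * 2**k for k in range(9)], dtype="float64")

raw_skew = value_eur.skew()            # strongly positive
log_skew = np.log1p(value_eur).skew()  # close to zero after the transform

print("raw:", raw_skew, "log1p:", log_skew)
```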

# Exploring Features

Now, let us pick features from dataset1 and observe trends and patterns in them.

## Age

In [None]:
plt.figure(figsize=(14,5))
plt.title('Age Distribution FIFA 20')
sns.histplot(x=dataset1['age'], kde=True, bins=20, stat='density')  # replaces the deprecated sns.distplot
plt.axvline(x=np.mean(dataset1['age']),c='orange',label='Mean Age of All Players')
plt.legend()

In [None]:
#Count of players of different ages
plt.figure(figsize= (14,8))

ax = sns.countplot(x='age', data=dataset1)
ax.set_title(label='Count of Players by age', fontsize=20)

ax.set_xlabel(xlabel='Age')
ax.set_ylabel(ylabel='Count')

plt.show()

In [None]:
#Oldest Player
print("Oldest Players: ")
data.loc[dataset1['age'] == data['age'].max()]

In [None]:
#Youngest Player
print("Youngest Players: ")
data.loc[data['age'] == dataset1['age'].min()]

In [None]:
#Skewness
dataset1['age'].skew()

### Impact of Age on Overall rating

In [None]:
fig = go.Figure()

fig = go.Figure(data=go.Scatter(
    x = dataset1['age'],
    y = dataset1['overall'],
    mode='markers',
    marker=dict(
        color=dataset1['overall'], 
        showscale=True
    ),
    text= dataset1['short_name'],
))

fig.update_layout(title='Age vs Overall Rating',
                  xaxis_title='Age',
                  yaxis_title='Overall Rating')
fig.show()

We infer:
- No definite pattern is visible, since the points are widely spread.
- While age doesn't linearly affect rating, some trends can be observed.
- The minimum rating generally increases with age.
- The maximum rating first increases and then decreases, peaking at age 32 (Messi).
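
Trends in the minimum and maximum rating per age can also be tabulated instead of read off the scatter plot. Below is a minimal sketch with `groupby`; the rows are made up purely to exercise the code, and `by_age` and `peak_age` are illustrative names.

```python
import pandas as pd

# Made-up rows; in the notebook these columns would come from dataset1.
toy = pd.DataFrame({
    "age": [18, 18, 25, 25, 32, 32, 38, 38],
    "overall": [50, 62, 58, 85, 66, 94, 68, 80],
})

# Minimum, mean and maximum overall rating for each age.
by_age = toy.groupby("age")["overall"].agg(["min", "mean", "max"])

# Age at which the maximum rating is highest.
peak_age = by_age["max"].idxmax()
print(by_age)
print("Peak age:", peak_age)
```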

## Height

In [None]:
plt.figure(figsize=(14,5))
plt.title('Height Distribution FIFA 20')
sns.histplot(x=dataset1['height_cm'], kde=True, bins=20, stat='density')  # replaces the deprecated sns.distplot
plt.axvline(x=np.mean(dataset1['height_cm']),c='orange',label='Mean Height of All Players')
plt.legend()

In [None]:
#Tallest Player
print("Tallest Players: ")
data.loc[data['height_cm'] == dataset1['height_cm'].max()]

In [None]:
#Shortest Player
print("Shortest Player: ")
data.loc[data['height_cm'] == dataset1['height_cm'].min()]

In [None]:
#Skewness
dataset1['height_cm'].skew()

In [None]:
plt.figure()
x=data.head(20)['height_cm']
y=data.head(20)['pace']

sns.regplot(x=x, y=y)  # keyword arguments are required by recent seaborn versions
plt.title('Height v Pace')
plt.xlabel('Height')
plt.ylabel('Pace')
plt.show()

We observe:
- Among the first 20 rows plotted above, pace tends to decrease as height increases.
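
Since the regression plot uses only `data.head(20)`, a numeric check over a whole frame can be a useful complement. The sketch below computes a Pearson correlation with `Series.corr`, which ignores rows where either value is missing. The numbers are made up, and the `None` entries only stand in for rows that might lack a pace value.

```python
import pandas as pd

# Made-up rows; None stands in for rows without a pace value.
toy = pd.DataFrame({
    "height_cm": [165, 170, 175, 180, 185, 190, 195],
    "pace": [90, 88, 80, 78, None, 65, None],
})

# Pearson correlation over the rows where both values are present.
r = toy["height_cm"].corr(toy["pace"])

# Number of rows that actually entered the calculation.
n_used = toy[["height_cm", "pace"]].dropna().shape[0]
print(f"r = {r:.2f} from {n_used} rows")
```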

## Weight

In [None]:
plt.figure(figsize=(14,5))
plt.title('Weight Distribution FIFA 20')
sns.histplot(x=dataset1['weight_kg'], kde=True, bins=20, stat='density')  # replaces the deprecated sns.distplot
plt.axvline(x=np.mean(dataset1['weight_kg']),c='orange',label='Mean Weight of All Players')
plt.legend()

In [None]:
plt.figure()
x=data.head(20)['weight_kg']
y=data.head(20)['pace']

sns.regplot(x=x, y=y)  # keyword arguments are required by recent seaborn versions
plt.title('Weight v Pace')
plt.xlabel('Weight')
plt.ylabel('Pace')
plt.show()

We observe:

- Among the first 20 rows plotted above, pace tends to decrease as weight increases.

In [None]:
plt.figure()
x=data.head(20)['height_cm']
y=data.head(20)['weight_kg']

sns.regplot(x=x, y=y)  # keyword arguments are required by recent seaborn versions
plt.title('Height v Weight')
plt.xlabel('Height')
plt.ylabel('Weight')
plt.show()

We observe:
- Among the first 20 rows plotted above, weight increases roughly linearly with height.

## BMI

Let us now create a new feature BMI from weight and height, and analyse its effect on performance ratings.

In [None]:
# BMI = weight (kg) / height (m)^2; floor division rounds each value down to a whole number
data['bmi'] = data['weight_kg'] // (data['height_cm']/100)**2
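
Note that `//` is floor division, so the stored BMI is rounded down to a whole number. A small worked example for a hypothetical player of 72 kg and 170 cm shows the difference; the `bmi` helper is an illustrative name, not part of the notebook.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight in kg divided by squared height in metres."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


# Hypothetical player: 72 kg, 170 cm.
exact = bmi(72, 170)                    # about 24.9
floored = 72 // (170 / 100) ** 2        # what the floor-division formula stores

print(round(exact, 2), floored)
```

Flooring keeps the count plot readable, but two players with BMI 24.0 and 24.9 end up in the same bin.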

In [None]:
plt.figure(figsize= (14, 7))

ax = sns.countplot(x='bmi', data=data, order=data.bmi.value_counts().iloc[:20].index)
ax.set_title(label='Count of Players on Basis of BMI(Body Mass Index) in FIFA 20')

ax.set_xlabel(xlabel='BMI(Body Mass Index)')
ax.set_ylabel(ylabel='Count')

plt.show()

- 22 is the most common BMI

In [None]:
fig = go.Figure()

fig = go.Figure(data=go.Scatter(
    x = data['bmi'],
    y = dataset1['overall'],
    mode='markers',
    marker=dict(
        color=dataset1['overall'], 
        showscale=True
    ),
    text= dataset1['short_name'],
))

fig.update_layout(title='BMI vs Overall Rating',
                  xaxis_title='BMI',
                  yaxis_title='Overall Rating')
fig.show()

We observe:
- Highest rating for BMI 24
- No direct relation 
- Maximum of rating first increases and then decreases.

## Nationality

In [None]:
plt.figure(figsize = (20,7))
dataset1['nationality'].value_counts().head(50).plot.bar(color = 'purple')
plt.title('Players from different countries present in FIFA 20')
plt.xlabel('Country')
plt.ylabel('Count')
plt.show()

We observe:
- England, Germany and Spain have the highest numbers of players.
- Among the 50 countries shown, Venezuela and Slovenia have the fewest players.

## Nationality vs Overall Rating

In [None]:
fig = go.Figure()

fig = go.Figure(data=go.Scatter(
    x = dataset1['nationality'],
    y = dataset1['overall'],
    mode='markers',
    marker=dict(
        color=dataset1['overall'], 
        showscale=True
    ),
    text= dataset1['short_name'],
))

fig.update_layout(title='Nationality vs Overall Rating',
                  xaxis_title='Nationality',
                  yaxis_title='Overall Rating')
fig.show()

We observe:
- Players from Argentina have the highest maximum rating.
- South Sudan has the lowest maximum rating.
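
With this many nationalities on one axis, the highest and lowest maxima are easier to read from a table than from the scatter plot. A minimal sketch with placeholder names and made-up ratings:

```python
import pandas as pd

# Placeholder players and countries with made-up ratings.
toy = pd.DataFrame({
    "short_name": ["P1", "P2", "P3", "P4", "P5"],
    "nationality": ["Country A", "Country A", "Country B", "Country B", "Country C"],
    "overall": [94, 70, 88, 91, 60],
})

# Best overall rating per nationality, highest first.
max_by_nation = (
    toy.groupby("nationality")["overall"].max().sort_values(ascending=False)
)
print(max_by_nation)
```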

## Overall

In [None]:
plt.figure(figsize=(14, 7))
sns.countplot(x=dataset1['overall'])
plt.title("Overall Rating")
plt.show()

We observe:
- Most players have rating in range 60-70.
- Very few players have rating > 80.

Now let us find the min, mean and max of the overall rating.

In [None]:
print("Min: ", dataset1['overall'].min())
print("Mean: ", dataset1['overall'].mean())
print("Max: ", dataset1['overall'].max())

In [None]:
#Player with highest rating
data.loc[data['overall'] == dataset1['overall'].max()]

## Top Rated Players

Now let us find top rated players and analyse their origins.

In [None]:
#Top 10 Players
top_rated = data.sort_values(by=["overall"], ascending=False)
top_rated.head(10)

In [None]:
fig = px.pie(top_rated.head(25), names='club',
             title='Percentage of Clubs in top 25 players')
fig.show()

In [None]:
fig = px.pie(top_rated.head(25), names='nationality',
             title='Percentage of Nations in top 25 players')
fig.show()

## Potential

In [None]:
plt.figure(figsize=(14, 7))
sns.countplot(x=dataset1['potential'])
plt.title("Potential")
plt.show()

In [None]:
sns.relplot(x='potential',y='overall',hue='age',palette = 'viridis', aspect=2.5,data=data)
plt.title('Potential v Overall',fontsize = 20)
plt.xlabel('Potential')
plt.ylabel('Overall Rating')
plt.show()

We notice:
- Overall rating increases roughly linearly with potential.

## Value

In [None]:
plt.figure(figsize=(14,5))
plt.title('Value Distribution FIFA 20')
sns.histplot(x=dataset1['value_eur'], kde=True, bins=100, stat='density')  # replaces the deprecated sns.distplot
plt.axvline(x=np.mean(dataset1['value_eur']),c='orange',label='Mean Value of All Players')
plt.legend()

We observe:
- The distribution is highly skewed.

In [None]:
#Skewness
dataset1['value_eur'].skew()

In [None]:
#Value of Top 10 players
plt.figure(figsize=(14, 7))
sns.barplot(x=top_rated['short_name'].head(10), y=top_rated['value_eur'].head(10)).set_title("Value of top 10 players")

In [None]:
#Most Valuable
data.loc[data['value_eur'] == dataset1['value_eur'].max()]

In [None]:
#Most Valued Players
top_valued = data.sort_values(by=["value_eur"], ascending=False)
top_valued.head(10)

In [None]:
fig = px.pie(top_valued.head(25), names='club',
             title='Percentage of Clubs in top 25 most valued players')
fig.show()

In [None]:
fig = px.pie(top_valued.head(25), names='nationality',
             title='Percentage of Nations in top 25 most valued players')
fig.show()

In [None]:
#Top 10 Valued players
plt.figure(figsize=(14, 7))
sns.barplot(x=top_valued['short_name'].head(10), y=top_valued['value_eur'].head(10)).set_title("Top 10 Valued players")

## Preferred Foot

In [None]:
sns.countplot(x = dataset1['preferred_foot'])

In [None]:
foot = ['Left', 'Right']
foot_data = data.query('preferred_foot in @foot')    
fig = px.pie(foot_data, names='preferred_foot',
             title='Preferred foot %')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

In [None]:
print("The average overall score of preferred Right foot", dataset1.loc[dataset1['preferred_foot'] == 'Right']['overall'].mean())
print("The average overall score of preferred Left foot", dataset1.loc[dataset1['preferred_foot'] == 'Left']['overall'].mean())

We observe:
- Preferred foot doesn't have a significant impact on the overall score.
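
The two means above can also be produced in one step with `groupby`, which additionally shows how many players fall in each group. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows; in the notebook these columns come from dataset1.
toy = pd.DataFrame({
    "preferred_foot": ["Left", "Right", "Right", "Left", "Right"],
    "overall": [70, 72, 68, 66, 70],
})

# Group size and mean overall rating per preferred foot.
foot_summary = toy.groupby("preferred_foot")["overall"].agg(["count", "mean"])
print(foot_summary)
```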

## International Reputation

In [None]:
sns.countplot(x = dataset1['international_reputation'])

## Skill Moves

In [None]:
sns.countplot(x = dataset1['skill_moves'])

## Work Rate

In [None]:
plt.figure(figsize=(14,7))
sns.countplot(x = dataset1['work_rate'])

## Team Position

In [None]:
plt.figure(figsize=(14, 7))
sns.countplot(x='team_position', data=data, palette='bright', order=data.team_position.value_counts().index)

Team positions can be grouped into three categories: attack, defence and midfield.<br>
Let us explore them now.

In [None]:
attack = ['LW', 'RW', 'ST', 'LF', 'RF', 'CF', 'LS', 'RS']
attack_data = data.query('team_position in @attack')    
fig = px.pie(attack_data, names='team_position',
             color_discrete_sequence=px.colors.sequential.Inferno ,
             title='Percentages of Player Attacking Positions')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

In [None]:
defence = ['LWB', 'RWB', 'CB', 'LB', 'RB', 'LCB', 'RCB']
defence_data = data.query('team_position in @defence')    
fig = px.pie(defence_data, names='team_position',
             color_discrete_sequence=px.colors.sequential.Inferno ,
             title='Percentages of Player Defending Positions')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

In [None]:
mid = ['CM', 'RCM', 'LCM', 'RM', 'LM', 'CAM', 'RDM', 'LDM', 'CDM', 'RAM', 'LAM']
mid_data = data.query('team_position in @mid')    
fig = px.pie(mid_data, names='team_position',
             color_discrete_sequence=px.colors.sequential.Inferno ,
             title='Percentages of Player Midfield Positions')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()
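
The three position lists can also be combined into a single lookup, so that every player gets a group label in one column. The sketch below reuses the lists from the cells above; the `position_group` column, the `"Other"` fallback and the toy rows are assumptions for illustration.

```python
import pandas as pd

attack = ['LW', 'RW', 'ST', 'LF', 'RF', 'CF', 'LS', 'RS']
defence = ['LWB', 'RWB', 'CB', 'LB', 'RB', 'LCB', 'RCB']
mid = ['CM', 'RCM', 'LCM', 'RM', 'LM', 'CAM', 'RDM', 'LDM', 'CDM', 'RAM', 'LAM']

# One dictionary mapping each position code to its group.
position_group = {pos: "Attack" for pos in attack}
position_group.update({pos: "Defence" for pos in defence})
position_group.update({pos: "Midfield" for pos in mid})

# Toy rows; codes not in any list fall back to "Other".
toy = pd.DataFrame({"team_position": ["ST", "CB", "CAM", "GK", "SUB"]})
toy["position_group"] = toy["team_position"].map(position_group).fillna("Other")
print(toy)
```

A column like this would allow a single count plot or pie chart of the groups instead of three separate queries.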

## Release Clause

In [None]:
high_release = data.sort_values(by=['release_clause_eur'], ascending=False)
high_release.head(10)

In [None]:
#Top Release clauses (sorted by release clause, not by value)
plt.figure(figsize=(14, 7))
sns.barplot(x=high_release['short_name'].head(10), y=high_release['release_clause_eur'].head(10)).set_title("Top Release clauses")

In [None]:
print("Min: ", dataset1['release_clause_eur'].min())
print("Mean: ", dataset1['release_clause_eur'].mean())
print("Max: ", dataset1['release_clause_eur'].max())

## Wages

In [None]:
high_wage = data.sort_values(by=['wage_eur'], ascending=False)
high_wage.head(10)

In [None]:
#Highest Wages
plt.figure(figsize=(14, 7))
sns.barplot(x=high_wage['short_name'].head(10), y=high_wage['wage_eur'].head(10)).set_title("Highest Wages")

In [None]:
print("Min: ", dataset1['wage_eur'].min())
print("Mean: ", dataset1['wage_eur'].mean())
print("Max: ", dataset1['wage_eur'].max())

In [None]:
#Wage of top 10 rated
plt.figure(figsize=(14, 7))
sns.barplot(x=top_rated['short_name'].head(10), y=top_rated['wage_eur'].head(10)).set_title("Wage of top 10 players")

In [None]:
sns.relplot(x='wage_eur',y='overall',hue='age',palette = 'viridis', aspect=2.5,data=data)
plt.title('Wage v Overall',fontsize = 20)
plt.xlabel('Wage in Euros')
plt.ylabel('Overall Rating')
plt.show()

We observe:
- Wage increases with increase in Overall Rating as depicted by the curve.

In [None]:
sns.boxplot(y='wage_eur',x='international_reputation',data=data)
plt.title('Wage v Reputation')
plt.ylabel('Wage in Euros')
plt.xlabel('international_reputation')
plt.show()

We observe:
- The range and distribution of wages vary across reputation levels.
- The median wage (the line inside each box) increases with reputation.
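
A boxplot draws the median rather than the mean, and with skewed wages the two can differ considerably. The sketch below tabulates both per reputation level on made-up numbers; `wage_stats` is an illustrative name.

```python
import pandas as pd

# Made-up rows; reputation 1 contains one unusually high wage.
toy = pd.DataFrame({
    "international_reputation": [1, 1, 1, 2, 2, 3],
    "wage_eur": [1000, 2000, 30000, 10000, 20000, 100000],
})

# Median and mean wage per reputation level.
wage_stats = toy.groupby("international_reputation")["wage_eur"].agg(["median", "mean"])
print(wage_stats)
```

For reputation 1 in this toy frame, a single high wage pulls the mean far above the median.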

Let us now plot the relation between potential, overall rating and wage for the top 20 rated players in 3D.

In [None]:
fig = px.scatter_3d(top_rated.head(20), x='potential', y='overall', z='wage_eur',
              color='short_name')
fig.update_layout(title='3D Plot of Potential, Overall and Wage')
fig.show()

We observe:
- wage increases with increase in overall and potential.


Now, let us plot a correlation heatmap to find relations between the features.

## Correlation Heatmap

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(numeric_dataset1.corr(), annot=True, cbar=True)  # numeric columns only; text columns cannot be correlated

We observe that there is high correlation between the following features:
- wage and value
- release clause and value
- release clause and age


In such cases, we should remove one of the features of these pairs when using the data for training to get better results.
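
A rough sketch of how one feature from each highly correlated pair could be dropped is shown below. The 0.9 threshold, the toy numbers and the names `upper`, `to_drop` and `reduced` are assumptions for illustration, not values taken from the heatmap.

```python
import numpy as np
import pandas as pd

# Made-up rows: wage_eur closely tracks value_eur, age does not.
toy = pd.DataFrame({
    "value_eur": [1.0, 2.0, 3.0, 4.0, 5.0],
    "wage_eur": [2.0, 4.1, 5.9, 8.2, 10.0],
    "age": [30, 22, 27, 19, 25],
})

# Absolute correlations, keeping only the upper triangle so each pair appears once.
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of any pair whose correlation exceeds the threshold.
threshold = 0.9  # assumed cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)

print("Dropped:", to_drop)
```

Which member of a pair to keep is a modelling choice; this sketch simply keeps whichever column comes first.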

**I hope you enjoyed reading this notebook!**

**Any suggestions would be highly appreciated.**