# FIFA 22 EDA with Linear Regression using L1 Regularization (Lasso)
![alt text](FIFA-22-cover.jpg)

## Contents:
* [1. Introduction](#1)
* [2. Understanding Football](#2)
* [3. Goal Statement](#3)
* [4. Data Collection](#4)
    * [4.1 Importing Libraries](#4.1)
    * [4.2 Loading Dataset](#4.2)
* [5. Data Cleaning](#5)
* [6. Exploratory Data Analysis](#6)
    * [6.1 Data Insights](#6.1)
    * [6.2 Data Analysis](#6.2)
* [7. Train-Valid-Test Split](#7)
* [8. Model Selection & Hyperparameter Tuning](#8)
    * [8.1 Linear Regression](#8.1)
    * [8.2 Least Absolute Shrinkage and Selection Operator (LASSO)(L1)](#8.2)
    * [8.3 Hyperparameter Tuning](#8.3)
* [9. Model Performance on Test](#9)
* [10. Final Observations](#10)
* [11. Conclusion](#11)
* [12. References](#12)

- - - 
## 1. Introduction <a class="anchor" id="1"></a>

> FIFA 22 is a simulation video game based on **Football**. In the game, **Overall Rating** is the common metric used to rank football players. Being a football fan myself, I've wondered what influences this metric and how it can be predicted from a player's other attributes. In this notebook, I demonstrate and explain the workflow used to tackle that problem.


- - -
## 2. Understanding Football <a class="anchor" id="2"></a>

> - Football is an 11v11 game in which each team fields `10 outfield players and 1 goalkeeper`.
> - Each team coordinates passes and tries to kick the ball into the opposition's net (guarded by the goalkeeper) to score 1 goal (point).
> - Each game lasts about 90 minutes, with a 15-minute break after 45 minutes.
> - The team with the highest number of goals at the end wins the game.
> - Fouls have consequences: the player can be warned or sent off.
> - 3-5 substitutions are allowed, depending on the league.
> - The players can be divided into 4 groups based on their positions:
>> - Attackers
>> - Midfielders
>> - Defenders
>> - Goalkeepers
> - Some players fall under more than one group.
> - These are only the basics of football; there is much more to this beautiful game.



![alt text](Lineup.jpg)

- - -
## 3. Goal Statement <a class="anchor" id="3"></a>

> Predict the overall rating of outfield players using a supervised learning method (with a tunable hyper-parameter) and compare the results of the models considered.



- - -
## 4. Data Collection <a class="anchor" id="4"></a>


### 4.1 Importing Libraries <a class="anchor" id="4.1"></a>

In [None]:
# Imported all the necessary libraries
import eli5
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from eli5.sklearn import PermutationImportance
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

- - -
### 4.2 Loading Data Set <a class="anchor" id="4.2"></a>

In [None]:
fifa_original=pd.read_csv('players_22.csv')

In [None]:
# displayed the head of the data set 
fifa_original.head(10)

- - -
## 5. Data Cleaning <a class="anchor" id="5"></a>

In [None]:
# Displayed the info of the data set to understand the number of rows and columns present (including their data types)
fifa_original.info()

In [None]:
# Dropped all the unnecessary columns that do not contribute to the process
fifa_original = fifa_original.drop(['sofifa_id',
                                   'player_url',
                                   'long_name',
                                   'league_level',
                                   'player_positions',
                                   'body_type',
                                   'real_face',
                                   'release_clause_eur',
                                   'player_tags',
                                   'club_jersey_number',
                                   'club_loaned_from',
                                   'club_contract_valid_until',
                                   'club_joined',
                                   'nation_position',
                                   'nation_jersey_number',
                                   'player_traits',
                                   'defending_marking_awareness',
                                   'dob',
                                   'nation_team_id',
                                   'nationality_id',
                                   'club_team_id'],axis=1)

In [None]:
# Checking and displaying columns containing null values(NaN) in the data set 
fifa_original.columns[fifa_original.isna().any()].tolist()

In [None]:
# Dropped players that do not have any "value_eur" in the market, i.e. unregistered players
fifa_original = fifa_original[fifa_original.value_eur.notna()]

# Doing this also eliminated players without any club name, league name, wage and club position
# Displaying columns that still contain null values
fifa_original.columns[fifa_original.isna().any()].tolist()

In [None]:
# It is evident from the output that goalkeepers are the reason for most of the null(NaN) values 
fifa_GK = fifa_original[fifa_original["club_position"].isin(['GK'])]
fifa_GK[['pace','shooting','passing','dribbling','defending','physic']]
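The null check above can also be summarized per column and per club position. The sketch below uses a small made-up frame (the `toy` data is hypothetical, not taken from `players_22.csv`) that only mimics the pattern described here: goalkeepers have no values in the outfield attribute columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the FIFA data): goalkeepers lack outfield attributes
toy = pd.DataFrame({
    "short_name": ["A", "B", "C", "D"],
    "club_position": ["GK", "ST", "CB", "SUB"],
    "pace": [np.nan, 88.0, 70.0, np.nan],
    "shooting": [np.nan, 85.0, 40.0, np.nan],
})

# Number of missing values per column, keeping only columns that have any
null_counts = toy.isna().sum()
null_counts = null_counts[null_counts > 0]
print(null_counts)

# Club positions of the rows that contain at least one missing value
null_positions = toy.loc[toy.isna().any(axis=1), "club_position"].value_counts()
print(null_positions)
```

Counting nulls by position makes it easier to see whether the missing values really come from one group of players before dropping rows.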

In [None]:
# Since the goal is to predict the overall of the "Outfield players" only, we can drop the goalkeepers
fifa_original = fifa_original.drop(fifa_original[fifa_original["club_position"].isin(['GK'])].index)
# Dropped all the goal keepers from the data set

In [None]:
# Still some remaining null values in the below columns 
fifa_original.columns[fifa_original.isna().any()].tolist()

In [None]:
# The remaining null values are also goalkeepers who are in the substitute and reserve positions 
fifa_original_null = fifa_original[fifa_original.isna().any(axis=1)]
fifa_original_null[['short_name',"club_position"]]

In [None]:
# Dropped all the remaining rows which contain null values 
fifa_original = fifa_original.dropna()

In [None]:
# No null values present 
fifa_original.columns[fifa_original.isna().any()].tolist()

In [None]:
# Checked for unnecessary columns 
fifa_original.columns


In [None]:
# Dropped the columns related to goalkeeper attributes(which are of no use now)
fifa_original = fifa_original.drop(['goalkeeping_diving',
                                   'goalkeeping_handling',
                                    'goalkeeping_kicking',
                                    'goalkeeping_positioning',
                                    'goalkeeping_reflexes'],axis=1)

- - -
## 6. Exploratory Data Analysis  <a class="anchor" id="6"></a>

### 6.1 Data Insights <a class="anchor" id="6.1"></a>

In [None]:
# Displayed top 5 countries with highest number of players 
print('Total number of countries : {0}'.format(fifa_original['nationality_name'].nunique()))
print(fifa_original['nationality_name'].value_counts().head(5))

In [None]:
# Displayed top 5 players with the highest overall ratings 
fifa_original.groupby(['short_name'])['overall'].max().sort_values(ascending = False).head()

In [None]:
# Displayed top 5 players with the highest potential 
fifa_original.groupby(['short_name'])['potential'].max().sort_values(ascending = False).head()

In [None]:
# Displayed top 5 players with the highest value 
fifa_original.groupby(['short_name'])['value_eur'].max().sort_values(ascending = False).head()

In [None]:
# Displayed top 5 players with the highest weekly wage
fifa_original.groupby(['short_name'])['wage_eur'].max().sort_values(ascending = False).head()

> **<font size="3"> Takeaways</font>**
> - `England` is the country with the highest number of football players, followed by `Germany`, `France`, `Spain` and `Argentina`
> - `Lionel Messi` (L.Messi) is the highest rated football player
> - `Kylian Mbappe` (K.Mbappe) is the player with the highest potential
> - Although `L.Messi` is the highest rated player, `K.Mbappe` has the highest value
> - `Kevin De Bruyne` (K.De Bruyne) earns the highest weekly wage

- - -
### 6.2 Data Analysis <a class="anchor" id="6.2"></a>

In [None]:
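# Generated a lineplot that displays bivariate analysis of player age and player potential
# Figure size and style are set first so that they apply to this plot
sns.set(rc = {'figure.figsize':(15,8)})
sns.set_style("whitegrid")
# x and y are passed as keywords; recent seaborn versions do not accept them positionally
sns.lineplot(x=fifa_original['age'],y=fifa_original['potential'],color='red').set(title='Age vs Potential')
plt.xlabel("Age")
plt.ylabel("Potential")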


In [None]:
# Generated a lineplot that displays bivariate analysis of player age and player pace
sns.set(rc = {'figure.figsize':(15,8)})
sns.set_style("whitegrid")
sns.lineplot(x=fifa_original['age'],y=fifa_original['pace'],color='darkorange').set(title='Age vs Pace')
plt.xlabel("Age")
plt.ylabel("Pace")

In [None]:
# Generated a lineplot that displays bivariate analysis of player age and player physical
sns.set(rc = {'figure.figsize':(15,8)})
sns.set_style("whitegrid")
sns.lineplot(x=fifa_original['age'],y=fifa_original['physic'],color='blue').set(title='Age vs Physical')
plt.xlabel("Age")
plt.ylabel("Physical")

In [None]:
# Generated a lineplot that displays bivariate analysis of player overall ratings and player potential
sns.set(rc = {'figure.figsize':(15,8)})
sns.set_style("whitegrid")
sns.lineplot(x=fifa_original['overall'],y=fifa_original['potential'],color='black').set(title='Overall vs Potential')
plt.xlabel("Overall")
plt.ylabel("Potential")

In [None]:
# Generated a jointplot that displays bivariate analysis of player weight and player height
sns.set(rc = {'figure.figsize':(15,8)})
sns.set_style("white")
g = sns.jointplot(x=fifa_original['weight_kg'], y=fifa_original['height_cm'], kind="hex", color="darkblue")
# jointplot builds its own figure, so the axis labels are set on the returned JointGrid
g.set_axis_labels("Weight", "Height")

In [None]:
# Generated a histplot that displays univariate analysis of player preferred foot
sns.histplot(fifa_original["preferred_foot"],color='black').set(title='Preferred Foot - Left Vs Right')
sns.set(rc = {'figure.figsize':(15,8)})
sns.set_style("whitegrid")
plt.xlabel("Preferred Foot")

In [None]:
# Generated a histplot that displays univariate analysis of player work rate with player preferred foot 
sns.histplot(x='work_rate',data=fifa_original,hue=fifa_original['preferred_foot']).set(title='Work Rate - Left Vs Right foot')
sns.set(rc = {'figure.figsize':(15,8)})
sns.set_style("whitegrid")
plt.xlabel("Work Rate")

In [None]:
# Generated a pointplot that displays bivariate analysis of player skill moves and player value 
sns.pointplot(x='skill_moves',y='value_eur',data=fifa_original,color='darkorange').set(title = "Skill moves vs Value")
sns.set(rc = {'figure.figsize':(15,8)})
sns.set_style("whitegrid")
plt.xlabel("Skill Moves")
plt.ylabel("Value")

> **<font size="3"> Takeaways</font>**
> - A player's `potential` gradually decreases as they grow older 
> - A player's `pace` decreases significantly as they grow older 
> - A player's `physical` rating increases steadily and then gradually declines once they reach about 30 years of age 
> - A player's `overall` increases sharply with their `potential`
> - Most of the players are in the 70 kg to 85 kg `weight` range and the 175 cm to 185 cm `height` range
> - There are about 3 times as many `right footed` players as `left footed` players
> - Approximately `6000` `right footed` and `2000` `left footed` players have a `medium/medium` work rate, followed by the `high/medium` work rate with about half as many players 
> - The `value` of a player increases with their `skill moves` rating

In [None]:
# Dropped columns unnecessary to the model (mostly categorical) and assigned the result to a new data frame "fifa_model"
fifa_model = fifa_original.drop(['short_name',
                                'nationality_name',
                                'club_name',
                                'league_name',
                                'club_position',
                                'value_eur',
                                'wage_eur',
                                'preferred_foot',
                                'work_rate',
                                'international_reputation'],axis=1)

In [None]:
# Generated a plot to observe the correlation between target variable - overall and others
correlation_overall = pd.DataFrame(fifa_model.corr().overall).reset_index().sort_values(by = 'overall',ascending = False)
sns.barplot(x = 'overall',y = 'index',data = correlation_overall)
sns.set(rc = {'figure.figsize':(23,23)})

> **<font size="3">Observations</font>**
> - Every column has a positive correlation with `overall`
> - `reactions` and `composure` have the highest correlation with `overall`, while `balance` and `height` have the lowest 

In [None]:
# Checked for any unnecessary columns - there are none 
fifa_model.columns

- - -
## 7. Train-Valid-Test Split <a class="anchor" id="7"></a>

In [None]:
# "fifa_target" consists of the target column - overall
fifa_target = fifa_model.overall

# "fifa_train" consists of all the columns excluding the target column
fifa_train = fifa_model.drop(['overall'], axis = 1)

# Performed Train:Rem split of 60:40 
X_train, X_rem, y_train, y_rem = train_test_split(fifa_train,fifa_target, train_size=0.6)

# Split the remaining 40% in half to get a Valid:Test split of 20:20 
X_valid, X_test, y_valid, y_test = train_test_split(X_rem,y_rem, test_size=0.5)

# Displayed the shape of Train, Valid, Test after split
print(X_train.shape, y_train.shape)
print(X_valid.shape, y_valid.shape)
print(X_test.shape, y_test.shape)
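The two-step split can be checked on synthetic data. This is a minimal sketch, not the notebook's actual split: the arrays are made up, and `random_state=0` is an assumption added here only so that the split is repeatable between runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and target (not the FIFA data)
X_all = np.arange(200).reshape(100, 2)
y_all = np.arange(100)

# Step 1: 60% train, 40% remainder (random_state is an assumption for repeatability)
X_tr, X_rest, y_tr, y_rest = train_test_split(X_all, y_all, train_size=0.6, random_state=0)

# Step 2: split the remainder in half, giving 20% validation and 20% test overall
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

split_sizes = (len(X_tr), len(X_val), len(X_te))
print(split_sizes)
```

Without a fixed `random_state`, every run draws a different split, so the scores reported later can shift slightly from run to run.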

- - -
## 8. Model Selection & Hyperparameter Tuning <a class="anchor" id="8"></a>

### 8.1 Linear Regression<a class="anchor" id="8.1"></a>

> **<font size="3">Definition</font>**
> - Linear Regression is a regression model that estimates the relationship between one or more independent variables and one dependent variable using a linear function (a straight line when there is a single independent variable)

> **<font size="3">Overview</font>**
> - Fit linear regression on the training set and predict on our validation set
> - Observe R2 value
> - Observe Train and Validation accuracy 

> **<font size="3">Formula</font>**
$$
Y_{i} =  f(X_{i},\beta)+e_{i} 
$$
> - $Y_{i}$ = Dependent Variable 
> - $f$ = Function
> - $X_{i}$  = Independent Variable
> - $\beta$ = Unknown Parameters
> - $e_{i}$ = Error Terms
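For a linear $f$, the unknown parameters $\beta$ can be estimated by ordinary least squares. The sketch below uses synthetic data with made-up coefficients (all names are illustrative, not from the notebook) and shows that solving the least-squares problem directly with NumPy gives the same estimate as `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 4 + 2*x1 - 1*x2 + 0.5*x3 + noise (coefficients are made up)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = 4.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Add a column of ones for the intercept and solve the least-squares problem
X_with_ones = np.column_stack([np.ones(len(X_demo)), X_demo])
beta_hat, *_ = np.linalg.lstsq(X_with_ones, y_demo, rcond=None)

# Fit the same model with scikit-learn for comparison
ols_demo = LinearRegression().fit(X_demo, y_demo)

print("lstsq  :", beta_hat)
print("sklearn:", ols_demo.intercept_, ols_demo.coef_)
```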

In [None]:
# Fitted Linear Regression on the training set and predicted on the validation set
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_valid)

#Computed the r2 score 
print('r2 score: '+str(r2_score(y_valid, predictions)))

> - The R squared score is the proportion of the variation in the dependent variable that is predictable from the independent variables
> - `Linear Regression` achieved a good R squared score
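The definition above corresponds to $R^2 = 1 - SS_{res}/SS_{tot}$. As a small worked example with made-up ratings (not values from the data set), the manual calculation matches `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted ratings, for illustration only
y_true = np.array([60.0, 65.0, 70.0, 75.0, 80.0])
y_pred = np.array([61.0, 64.0, 71.0, 73.0, 81.0])

# Residual sum of squares: 1 + 1 + 1 + 4 + 1 = 8
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares around the mean (70): 100 + 25 + 0 + 25 + 100 = 250
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot  # 1 - 8/250 = 0.968
print(r2_manual, r2_score(y_true, y_pred))
```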

In [None]:
# Displayed Train R2 score
print("Train R2 score for Linear Regression")
print(model.score(X_train,y_train))

print("--------------------------------------------------------")

# Displayed Validation R2 score
print("Validation R2 Score for Linear Regression")
print(model.score(X_valid,y_valid))

print("--------------------------------------------------------")


# Displayed Coefficients of the model 
print ("Linear model:", (model.coef_))

- - -
### 8.2 Least Absolute Shrinkage and Selection Operator (LASSO)(L1)<a class="anchor" id="8.2"></a>

> **<font size="3">Definition</font>**
> - Lasso Regression is a type of linear regression that uses shrinkage. Here, shrinkage means that the model coefficients are shrunk towards zero, and some of them can become exactly zero.

> **<font size="3">Regularization ?</font>**
> - Regularization is the process of penalizing large coefficients so that they shrink towards zero, which helps reduce overfitting
> - L1 Regularization adds a penalty proportional to the sum of the absolute values of the coefficients

> **<font size="3">Overview</font>**
> - Fit Lasso regression (L1 regularization) on the training set using different values of alpha (hyper-parameter) and evaluate on our validation set
> - Obtain a good value for the hyper-parameter using GridSearchCV
> - Observe Train and Validation accuracy 
> - Observe R2 value

> **<font size="3">Formula</font>**
$$
\sum_{i=1}^{n}\left ( y_{i}-\sum_{j}x_{ij}\beta_{j}\right )^2+ \lambda\sum_{j=1}^{p}\left | \beta_{j} \right |
$$
> - $\lambda$ = Amount of shrinkage

> **<font size="3">Hyper-parameter</font>**
> - λ is the hyper-parameter for L1 regularization (`alpha` in the code)
> - The bias increases with increase in λ
> - Variance increases with decrease in λ
> - Therefore finding an optimal value for λ is important for the accuracy of the model
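The effect of λ can be illustrated on synthetic data. In this sketch (made-up data and coefficients, not the FIFA features) only 3 of 10 features influence the target, and the number of non-zero coefficients and their L1 norm are recorded for increasing `alpha`. Note that scikit-learn's `Lasso` divides the squared-error term by $2n$, so its `alpha` is on a different scale from the λ in the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first 3 of 10 features affect the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
true_beta = np.zeros(10)
true_beta[:3] = [3.0, -2.0, 1.0]
y_demo = X_demo @ true_beta + rng.normal(scale=0.5, size=200)

# Record (alpha, number of non-zero coefficients, L1 norm of the coefficients)
shrinkage = []
for alpha in [0.001, 0.1, 1.0, 10.0]:
    coef = Lasso(alpha=alpha).fit(X_demo, y_demo).coef_
    shrinkage.append((alpha, int(np.count_nonzero(coef)), float(np.abs(coef).sum())))

for alpha, n_nonzero, l1_norm in shrinkage:
    print(f"alpha={alpha:<6} non-zero coefficients={n_nonzero:<3} L1 norm={l1_norm:.3f}")
```

As `alpha` grows, the coefficients shrink and eventually all become zero, which is the high-bias end of the trade-off described above.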

- - -
### 8.3 Hyperparameter Tuning <a class="anchor" id="8.3"></a>
> <font size="3">Experiment - 1</font>
> - Alpha = 0

In [None]:
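# Fitted Lasso on the training set and predicted on the validation set 
# Alpha = 0 means no penalty; scikit-learn warns that this is not advised and that LinearRegression should be used instead
lasso = Lasso(alpha=0) # Alpha = 0 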
lasso.fit(X_train,y_train)
predictions_l = lasso.predict(X_valid)

# Displayed Train R2 Score
print("Train R2 Score for LASSO")
print(lasso.score(X_train, y_train))

print("--------------------------------------------------------")

# Displayed Validation R2 Score 
print("Validation R2 Score for LASSO")
print(lasso.score(X_valid, y_valid))

print("--------------------------------------------------------")

# Displayed Coefficients of the model 
print ("Lasso model:", (lasso.coef_))


> <font size="3">Experiment - 2</font>
> - Alpha = 10

In [None]:
# Fitted Lasso on the training set and predicted on the validation set 
lasso = Lasso(alpha=10) # Alpha = 10
lasso.fit(X_train,y_train)
predictions_l = lasso.predict(X_valid)

# Displayed Train R2 Score
print("Train R2 Score for LASSO")
print(lasso.score(X_train, y_train))

print("--------------------------------------------------------")

# Displayed Validation R2 Score 
print("Validation R2 Score for LASSO")
print(lasso.score(X_valid, y_valid))

print("--------------------------------------------------------")

# Displayed Coefficients of the model 
print ("Lasso model:", (lasso.coef_))

> <font size="3">Experiment - 3</font>
> - Alpha = optimal

In [None]:
# note(1) - This block of code takes approx. 1 minute 30 seconds to run

# Computed the best value for alpha 
params = {'alpha': (np.logspace(-8, 8, 100))} # It will check 100 values from 1e-08 to 1e+08
# note(2) - the `normalize` parameter was removed from Lasso in scikit-learn 1.2, so this cell needs an older version
lasso = Lasso(normalize=True)
lasso_model = GridSearchCV(lasso, params, cv = 10)
lasso_model.fit(X_train, y_train)
print(lasso_model.best_params_)
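On scikit-learn versions where `Lasso(normalize=True)` is no longer available, one possible alternative is to scale the features inside a pipeline and tune `alpha` through it. The sketch below runs on synthetic data; the pipeline, the grid range and `cv=5` are assumptions for illustration, not the notebook's settings. `StandardScaler` divides by the standard deviation, whereas `normalize=True` divided by the L2 norm, so the `alpha` values found this way are not directly comparable to the one reported in this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with features on very different scales (not the FIFA data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 5)) * np.array([1, 10, 100, 1, 1])
y_demo = 2.0 * X_demo[:, 0] + 0.3 * X_demo[:, 1] + rng.normal(scale=0.5, size=150)

# Scaling happens inside the pipeline, so each CV fold is scaled on its own training part
pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10_000))

# make_pipeline names the step "lasso", hence the "lasso__alpha" parameter key
param_grid = {"lasso__alpha": np.logspace(-4, 1, 20)}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["lasso__alpha"]
print(best_alpha, search.best_score_)
```

Because the fitted `search` object keeps the scaler and the model together, the same preprocessing is applied automatically when predicting on validation or test data.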

In [None]:
# Fitted Lasso on the training set and predicted on the validation set 
# Note: this fit does not use normalize=True, unlike the grid search above
lasso = Lasso(alpha=0.000024)
lasso.fit(X_train,y_train)
predictions_l = lasso.predict(X_valid)

# Displayed Train R2 score
print("Train R2 score for LASSO")
print(lasso.score(X_train, y_train))

print("--------------------------------------------------------")

# Displayed Validation R2 score 
print("Validation R2 score for LASSO")
print(lasso.score(X_valid, y_valid))

print("--------------------------------------------------------")

# Displayed Coefficients of the model 
print ("Lasso model:", (lasso.coef_))

In [None]:
# Displayed the permutation importance (weight) of each feature for the fitted Lasso model
perm = PermutationImportance(lasso, random_state=1).fit(X_valid, y_valid)
eli5.show_weights(perm, feature_names = X_valid.columns.tolist())
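If `eli5` is not available, scikit-learn's own `permutation_importance` gives a comparable ranking. The following is a minimal sketch on synthetic data with hypothetical feature names (`f0` to `f3`); it is an alternative to the cell above, not the method used in this notebook.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Lasso

# Synthetic data: f0 matters most, f1 a little, f2 and f3 not at all
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 5.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names

estimator = Lasso(alpha=0.01).fit(X_demo, y_demo)

# Shuffle each feature 10 times and measure the average drop in R2
result = permutation_importance(estimator, X_demo, y_demo, n_repeats=10, random_state=1)

# Rank features from most to least important
ranking = sorted(zip(feature_names, result.importances_mean), key=lambda t: t[1], reverse=True)
for name, importance in ranking:
    print(f"{name}: {importance:.4f}")
```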

> **<font size="3">Observations</font>**
> - After the 3 experiments with different hyper-parameter values, these are the observations:
>> - When alpha = 0, the estimate is similar to the one found in linear regression
>> - When alpha = 10, almost all coefficients are shrunk to 0, implying most of the features are not considered
>> - When GridSearchCV is performed for an optimal value of alpha, the estimate differs from that of linear regression and the shrinkage is moderate. There is no scope for dropping features here as no coefficient was shrunk to 0.
> - Therefore, we choose the optimal value of alpha for our test set
> - Potential, age and reactions are the 3 features with the largest weights 
> - The train R2 score decreased slightly, which suggests slightly less overfitting 
> - The R2 score has been noted for comparison

- - -
## 9. Model Performance on Test<a class="anchor" id="9"></a>

> **<font size="3">Overview</font>**
> - Use the optimal value of alpha selected on the validation set
> - Evaluate the lasso regression on the test set
> - Observe the train and test accuracy


In [None]:
# Fitted Lasso on the training set and predicted on the test set 
lasso = Lasso(alpha=0.000024)
lasso.fit(X_train,y_train)
predictions_l_test = lasso.predict(X_test)

print("--------------------------------------------------------")

# Displayed Train R2 Score 
print("Train R2 Score for LASSO")
print(lasso.score(X_train, y_train))

print("--------------------------------------------------------")

# Displayed Test R2 Score
print("Test R2 Score for LASSO")
print(lasso.score(X_test, y_test))

In [None]:
# Displayed the permutation importance (weight) of each feature for the fitted Lasso model
perm = PermutationImportance(lasso, random_state=1).fit(X_test, y_test)
eli5.show_weights(perm, feature_names = X_test.columns.tolist())

> **<font size="3">Observations</font>**
> - Overfitting decreased
> - As on the validation set, potential, age and reactions are the features with the largest weights, but the values of the weights are different 
> - The R2 score has been noted for comparison

- - -
## 10. Final Observations <a class="anchor" id="10"></a>

In [None]:
# Collected the observed R2 scores (in %) of each model on the train, validation and test sets
results = {1: ["Linear Regression", 93.74, 93.64,"-"],
     2: ["Lasso Regression", 93.73,93.63,"-"],
     3: ["Lasso Regression", 93.73,'-',93.76],
     }
 
print ("{:<20} {:<10} {:<15} {:<15}".format('Model','Train %', 'Validation %', 'Test %'))
 
# Displayed all the observed scores, one row per model
for key, value in results.items():
    Model, Train, Validation,Test = value
    print ("{:<20} {:<10} {:<15} {:<15}".format(Model, Train, Validation,Test))

> - Although the validation R2 score of Lasso Regression dropped slightly compared to Linear Regression (93.63% vs 93.64%), the train R2 score dropped as well (93.73% vs 93.74%), which suggests marginally less overfitting. 
> - Based on the hyper-parameter tuning, I selected an optimal value for alpha.
> - The Lasso model is chosen.
> - The R2 score observed on the test set is good: `93.76%`
> - The preference of the model is user dependent:
>> - <font size="3">Linear Regression:</font>
>>> - High R2 score
>>> - Tends to overfit the data
>> - <font size="3">Lasso Regression:</font>
>>> - Slightly lower R2 score
>>> - Decreases overfitting of the data by shrinkage
>>> - If the data has many features, it can remove the less important ones by shrinking their coefficients to 0 (feature selection)

- - -
## 11. Conclusion <a class="anchor" id="11"></a>
> No single machine learning model is tied to one type of problem. The selection of the model is user dependent and is based on what they are trying to achieve.
>> <font size="3">Learnings:</font>
>> - Understood the importance of hyper-parameter tuning in the model evaluation stage
>> - Understood how L1 regularization works
>> - Gained some valuable insights from the data set using EDA

>> <font size="3">Improvements:</font>
>> - More models with different complexities can be tried and compared.

> Successfully predicted the overall rating of the outfield players using a supervised learning method (with a tunable hyper-parameter), `with a good R2 score`, and compared the results of the models considered

- - -
## 12. References  <a class="anchor" id="12"></a>

> - [FIFA 22 data set](https://www.kaggle.com/stefanoleone992/fifa-22-complete-player-dataset?select=players_22.csv)
> - [Stack Overflow](https://stackoverflow.com)
> - [SheCanCode](https://shecancode.io/blog/univariate-and-bivariate-analysis-usingseaborn)
> - [Medium](https://medium.com)
> - [Analytics Vidhya](https://www.analyticsvidhya.com)
> - [Towards Data Science](https://towardsdatascience.com)

# Thank You