![image.png](attachment:image.png)

FIFA 19 is a football simulation video game developed by EA Vancouver as part of Electronic Arts' FIFA series. It is the 26th installment in the FIFA series, and was released on 28 September 2018 for PlayStation 3, PlayStation 4, Xbox 360, Xbox One, Nintendo Switch, and Microsoft Windows.

As with FIFA 18, Cristiano Ronaldo featured as the cover athlete of the regular edition; however, following his unanticipated transfer from Spanish club Real Madrid to Italian side Juventus, new cover art was released. He also appeared with Neymar on the cover of the Champions edition. From February 2019, an updated version featured Neymar, Kevin De Bruyne and Paulo Dybala on the cover of the regular edition.

The game features the UEFA club competitions for the first time, including the UEFA Champions League, UEFA Europa League and UEFA Super Cup. Martin Tyler and Alan Smith return as regular commentators, while the new commentary team of Derek Rae and Lee Dixon feature in the UEFA competitions mode. Composer Hans Zimmer and rapper Vince Staples recorded a new remix of the UEFA Champions League anthem specifically for the game.

The character Alex Hunter, who first appeared in FIFA 17, returns for the third and final installment of "The Journey", entitled, "The Journey: Champions". In June 2019, a free update added the FIFA Women's World Cup as a separate game mode. It is the last FIFA game to be available on a seventh-generation console, and the last known game to be physically available for the PlayStation 3 worldwide.

## Importing Required Libraries

In [1]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import missingno as msno
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.impute import KNNImputer

## Reading Dataset

In [1]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

data = pd.read_csv('../input/fifa19/data.csv')
data.shape

In [1]:
data.head()

In [1]:
# Required Features....
working_col = ['Name', 'Age', 'Nationality', 'Overall', 'Potential', 'Club', 'Value',
                'Wage', 'Special', 'Preferred Foot', 'International Reputation', 'Weak Foot',
                'Skill Moves', 'Work Rate', 'Body Type', 'Position', 'Height', 'Weight','Crossing',
                'Finishing', 'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling',
                'Curve', 'FKAccuracy', 'LongPassing', 'BallControl', 'Acceleration',
                'SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower',
                'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression',
                'Interceptions', 'Positioning', 'Vision', 'Penalties', 'Composure',
                'Marking', 'StandingTackle', 'SlidingTackle', 'GKDiving', 'GKHandling',
                'GKKicking', 'GKPositioning', 'GKReflexes', 'Release Clause']

In [1]:
# Working DataFrame....
df = data[working_col].copy()  # copy so later column assignments don't touch a view of `data`
df.head()

## Data Cleaning

In [1]:
# Cleaning the Value column
def clean_money(column):
    values = []
    for value in data[column]:
        if value[-1]=='M':
            money = 1000000
            money *= float(value[1:-1])
        elif value[-1]=='K':
            money = 1000
            money *= float(value[1:-1])
        else: 
            money = 0
        values.append(money/1000000)
    return values

# Cleaning Weight column
def clean_weight():
    weights = []
    for weight in data['Weight'].fillna(''):
        if weight != '':
            weights.append(int(weight[:-3]))
        else:
            weights.append(np.nan)
    return weights

# Cleaning Height Column
def clean_height():
    heights = []
    for height in data['Height'].fillna(''):
        if height != '':
            feet, inches = height.split("'")  # e.g. "5'10" -> ("5", "10")
            height = int(feet)*12 + int(inches)  # total height in inches
            heights.append(height)
        else:
            heights.append(np.nan)
    return heights

# Cleaning Release Clause
def clean_release_clause():
    release_clause = []
    for clause in data['Release Clause'].fillna(''):
        if clause == '':
            money=0.0
        elif clause[-1]=='M':
            money = 1000000
            money *= float(clause[1:-1])
        elif clause[-1]=='K':
            money = 1000
            money *= float(clause[1:-1])
        else: 
            money = 0
        release_clause.append(money/1000000)
    return release_clause

In [1]:
df['Value'] =  clean_money('Value')
df['Wage'] = clean_money('Wage')
df['Weight'] = clean_weight()
df['Height'] = clean_height()
df['Release Clause'] = clean_release_clause()
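
To make the string formats concrete, here is a minimal standalone sketch of the same conversions applied to single values. The helper names `parse_money`, `parse_height` and `parse_weight` are hypothetical (they are not part of this notebook), and the sample strings are assumed to follow the `€110.5M`, `5'10` and `159lbs` patterns found in the dataset.

```python
def parse_money(text):
    """Convert strings such as '€110.5M' or '€565K' to millions of euros."""
    # Assumption: anything that is not a non-empty string (e.g. NaN) counts as 0.
    if not isinstance(text, str) or not text:
        return 0.0
    multipliers = {"M": 1_000_000, "K": 1_000}
    suffix = text[-1]
    if suffix in multipliers:
        # text[1:-1] drops the currency symbol and the suffix letter
        return float(text[1:-1]) * multipliers[suffix] / 1_000_000
    return 0.0  # e.g. '€0', mirroring the fallback used in clean_money


def parse_height(text):
    """Convert a feet'inches string such as "5'10" to total inches."""
    feet, inches = text.split("'")
    return int(feet) * 12 + int(inches)


def parse_weight(text):
    """Convert a string such as '159lbs' to an integer number of pounds."""
    return int(text[:-3])


print(parse_money("€110.5M"))  # 110.5
print(parse_height("5'10"))    # 70
print(parse_weight("159lbs"))  # 159
```

Splitting the height on the apostrophe, rather than indexing single characters, keeps two-digit inch values such as `5'10` and `5'11` intact.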

In [1]:
df.isna().sum()

## Separating Categorical and Numerical Values

In [1]:
numerical_features =['Age', 'Overall', 'Potential', 'Value', 'Wage', 'Special', 'Height',
                   'Weight', 'Crossing', 'Finishing', 'HeadingAccuracy', 'ShortPassing',
                   'Volleys', 'Dribbling', 'Curve', 'FKAccuracy', 'LongPassing',
                   'BallControl', 'Acceleration', 'SprintSpeed', 'Agility', 'Reactions',
                   'Balance', 'ShotPower', 'Jumping', 'Stamina', 'Strength', 'LongShots',
                   'Aggression', 'Interceptions', 'Positioning', 'Vision', 'Penalties',
                   'Composure', 'Marking', 'StandingTackle', 'SlidingTackle', 'GKDiving',
                   'GKHandling', 'GKKicking', 'GKPositioning', 'GKReflexes','Release Clause']

categorical_features = ['Name','Nationality', 'Club', 'Preferred Foot', 'Work Rate','Body Type', 
                        'Position','International Reputation', 'Weak Foot', 'Skill Moves']

In [1]:
df[numerical_features].describe().T

* The standard deviation varies considerably from feature to feature in the dataset

## Data Visualisation

* Missing Values Visualisation

In [1]:
fig = plt.figure(figsize=(15,7))
sns.heatmap(df.isna(), yticklabels=False, cmap='YlGnBu')

Club has no relation to the other features, and imputing it with any real club name would introduce bias, so I am imputing it with 'No Club'.
We can also see a line of missing values running across the same rows of several features, both categorical and numerical.
So let's view the categorical and numerical variables separately.

* Missing Categorical and Numerical Values Visualisation

In [1]:
fig = plt.figure(figsize=(100,100))
fig.subplots_adjust(hspace=0.4, wspace=0.1)

ax = fig.add_subplot(7, 7, 1)
sns.heatmap(df[categorical_features].isna(), yticklabels=False, cmap='YlGnBu')

ax = fig.add_subplot(7, 7, 2)
sns.heatmap(df[numerical_features].isna(), yticklabels=False, cmap='YlGnBu')

We can clearly see that some rows have missing values in the numerical features, and the same rows are also missing values in the categorical features.
* **So the question is: how can we impute these missing values?**
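
The visual impression that the same rows are missing across several columns can also be checked numerically. The following is a small sketch on a made-up frame (the `toy` data and the threshold of three missing columns are assumptions for illustration only), counting missing cells per row:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the situation: rows 1 and 3 lack several columns at once
toy = pd.DataFrame({
    "Crossing": [60, np.nan, 75, np.nan],
    "Finishing": [55, np.nan, 80, np.nan],
    "Position": ["ST", None, "CM", None],
    "Club": ["A", "B", None, "C"],
})

# Number of missing cells in each row
missing_per_row = toy.isna().sum(axis=1)

# Rows that are missing three or more columns together
shared = missing_per_row[missing_per_row >= 3].index.tolist()
print(shared)  # [1, 3]
```

On the real data, the same idea would be `df.isna().sum(axis=1).value_counts()`, which shows how many rows share each count of missing columns.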

In [1]:
fig = plt.figure(figsize=(20,30))
fig.subplots_adjust(hspace=0.4, wspace=0.4)
count=1
for feature in numerical_features:
    ax = fig.add_subplot(len(numerical_features)//4+1, 4, count)
    sns.boxplot(x=df[feature])
    count +=1

* If we impute these missing values with simple statistics, the imputation will be biased for many features.
* Many features have outliers, so an imputer model fitted on them may give incorrect results.
* If we remove or correct these outliers, the models should give more meaningful results.
* But this raises another question: should we remove the outliers or correct them?
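
To put a number on the outliers seen in the box plots, one option is the 1.5 × IQR rule, which is also the default whisker rule of seaborn's box plots. This is only a sketch: the helper `iqr_outlier_mask` and the `wages` values are made up for illustration and are not part of the dataset.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical skewed feature: one value far above the rest
wages = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 60])
mask = iqr_outlier_mask(wages)

print(wages[mask].tolist())  # [60]
```

Applying the mask per feature, for example `iqr_outlier_mask(df[feature]).sum()`, gives a count that can inform whether removing or correcting the outliers is worth it.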

In [1]:
corr_ = df[numerical_features].corr()

f,ax = plt.subplots(figsize=(25, 10))
sns.heatmap(corr_,annot=True, linewidths=0.5, cmap="YlGnBu", fmt= '.1f',ax=ax)
plt.show()

* The heatmap shows that the 'Special' feature is fairly strongly correlated with many of the features that have missing values:
    * 'Crossing', 'Finishing', 'HeadingAccuracy', 'ShortPassing','Volleys', 'Dribbling', 'Curve', 
      'FKAccuracy', 'LongPassing','BallControl', 'Acceleration', 'SprintSpeed', 'Agility', 
      'Reactions','Balance', 'ShotPower','Stamina','LongShots','Aggression', 'Interceptions', 
      'Positioning', 'Vision', 'Penalties','Composure', 'Marking','GKDiving','GKHandling', 'GKKicking', 
      'GKPositioning', 'GKReflexes'
* **Height** and **Weight** don't correlate well with most other features. We will impute Height with a simple statistical technique after looking at its distribution below.
* Weight does have one usable correlation (about 0.7 in magnitude), with:
    * 'Balance'
* Marking is strongly correlated (about 0.9) with:
    * 'StandingTackle', 'SlidingTackle'
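
Reading the strongly correlated features off a large heatmap is error-prone, so the same list can be extracted programmatically. The sketch below uses a made-up `toy` frame and an assumed cut-off of 0.7 for the absolute correlation; neither comes from the notebook.

```python
import pandas as pd

# Hypothetical miniature dataset: BallControl tracks Special, Height does not
toy = pd.DataFrame({
    "Special": [1200, 1500, 1800, 2000, 2200],
    "BallControl": [40, 55, 68, 75, 85],
    "Height": [70, 74, 69, 72, 71],
})

# Absolute Pearson correlation of every other column with 'Special'
corr_with_special = (
    toy.corr()["Special"].drop("Special").abs().sort_values(ascending=False)
)

# Keep only the features above the assumed threshold
strong = corr_with_special[corr_with_special >= 0.7].index.tolist()
print(strong)  # ['BallControl']
```

On the real data the equivalent would start from `df[numerical_features].corr()["Special"]`.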

## So let's visualise these features against their respective correlated features

* 'Special' Vs
        * 'Crossing', 'Finishing', 'HeadingAccuracy', 'ShortPassing','Volleys', 'Dribbling', 'Curve', 
          'FKAccuracy', 'LongPassing','BallControl', 'Acceleration', 'SprintSpeed', 'Agility', 
          'Reactions','Balance', 'ShotPower','Stamina','LongShots','Aggression', 'Interceptions', 
          'Positioning', 'Vision', 'Penalties','Composure', 'Marking','GKDiving','GKHandling', 'GKKicking', 
          'GKPositioning', 'GKReflexes'

In [1]:
features =  ['Crossing', 'Finishing', 'HeadingAccuracy', 'ShortPassing','Volleys', 'Dribbling', 'Curve', 'FKAccuracy', 'LongPassing','BallControl', 
            'Acceleration', 'SprintSpeed', 'Agility', 'Reactions','Balance', 'ShotPower','Stamina','LongShots','Aggression', 'Interceptions', 
            'Positioning', 'Vision', 'Penalties','Composure', 'Marking','GKDiving','GKHandling', 'GKKicking', 'GKPositioning', 'GKReflexes']

fig = plt.figure(figsize=(20,30))
fig.subplots_adjust(hspace=0.4, wspace=0.4)
count=1
for feature in features:
    ax = fig.add_subplot(len(features)//4+1, 4, count)
    sns.scatterplot(x=df['Special'], y=df[feature])
    count +=1

> We don't see any outliers affecting these relationships.

* Let's visualise Balance and Weight

In [1]:
sns.scatterplot(x=df['Balance'], y=df['Weight'])  # x/y must be passed as keywords in recent seaborn

* Let's visualise Marking vs (StandingTackle and SlidingTackle)

In [1]:
fig = plt.figure(figsize=(15,5))
fig.subplots_adjust(hspace=0.4, wspace=0.4)

ax = fig.add_subplot(1, 3, 1)
sns.scatterplot(x=df['Marking'], y=df['StandingTackle'])

ax = fig.add_subplot(1, 3, 2)
sns.scatterplot(x=df['Marking'], y=df['SlidingTackle'])

ax = fig.add_subplot(1, 3, 3)
sns.scatterplot(x=df['StandingTackle'], y=df['SlidingTackle'])

* Let's visualise Height

In [1]:
fig = plt.figure(figsize=(5,5))
sns.histplot(x=df['Height'], kde=True)  # distplot is deprecated in recent seaborn

## So let's now impute the features using Linear Imputer and KNN Imputer

* **Before proceeding let's list down what Imputer will be used in which feature**

    1. In the scatter plots above, most features show a roughly linear relationship with 'Special'; the exceptions are ('GKDiving','GKHandling', 'GKKicking', 'GKPositioning', 'GKReflexes'). The features with a linear relationship can be imputed by Linear Imputation (the code below fits a degree-2 polynomial regression on 'Special').
    2. The features ('GKDiving','GKHandling', 'GKKicking', 'GKPositioning', 'GKReflexes') form two groups, so we can impute them by KNN imputation.
    3. We can impute Weight from Balance using Linear Imputation.
    4. We can impute StandingTackle and SlidingTackle from Marking using Linear Imputation.
    5. Height will be imputed with a simple statistical imputation technique.
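
The regression-based steps in the plan above all follow the same pattern: fit a model on rows where the target is known, then predict it where it is missing. Here is a minimal sketch of that pattern on made-up data. The helper `regression_impute` and the `toy` frame are hypothetical; the degree-2 pipeline mirrors the one this notebook builds with `make_pipeline`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def regression_impute(frame, target, predictor, degree=2):
    """Fill missing `target` values with rounded predictions from `predictor`."""
    out = frame.copy()
    known = out[target].notna()
    if known.all():
        return out  # nothing to impute
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # Train only on rows where the target is observed
    model.fit(out.loc[known, [predictor]], out.loc[known, target])
    # Predict the target for the rows where it is missing
    out.loc[~known, target] = np.round(model.predict(out.loc[~known, [predictor]]))
    return out


# Hypothetical data: ShortPassing grows linearly with Special, one value missing
toy = pd.DataFrame({
    "Special": [1000, 1200, 1400, 1600, 1800, 2000],
    "ShortPassing": [30.0, 40.0, np.nan, 60.0, 70.0, 80.0],
})
filled = regression_impute(toy, "ShortPassing", "Special")
print(filled.loc[2, "ShortPassing"])
```

Working on a copy and assigning through `.loc` avoids modifying a slice of the original frame, which is where pandas would otherwise raise a `SettingWithCopyWarning`.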

###  Splitting train and test data

* Let's create a training and testing set for Linear Imputation and use a LinearRegression model to predict the missing feature values

In [1]:
train_df = df[['Special']+features[:-5]].dropna()
test_df = df[df[['Special']+features[:-5]].isnull().any(axis=1)].copy()  # copy: predictions are written into it below

In [1]:
for feature in features[:-5]:
    
    polyreg=make_pipeline(PolynomialFeatures(2),LinearRegression())
    polyreg.fit(X = train_df[['Special']], y = train_df[feature])
    
    predicted_output = polyreg.predict(test_df[['Special']])
    test_df[feature] = np.round(predicted_output)
    df[feature] = df[feature].fillna(test_df[feature])  # assign back instead of chained inplace fill

In [1]:
fig = plt.figure(figsize=(20,30))
fig.subplots_adjust(hspace=0.4, wspace=0.4)
count=1
for feature in features[:-5]:
    ax = fig.add_subplot(len(features[:-5])//4+1, 4, count)
    sns.scatterplot(x=train_df[feature], y=train_df['Special'])
    sns.scatterplot(x=test_df[feature], y=test_df['Special'])
    count +=1

> As we can see, the imputation is quite good. So let's now do the KNN imputation for the remaining numerical variables.

* Let's create a training and testing set for KNN Imputation and use the KNNImputer model to predict the missing feature values

In [1]:
train_df = df[['Special']+features[-5:]].dropna()
test_df = df[df[features[-5:]].isnull().any(axis=1)][['Special']+features[-5:]]


In [1]:
imputer = KNNImputer(n_neighbors=1)
imputer.fit(train_df)
predicted_df = pd.DataFrame(np.round(imputer.transform(test_df)), columns=test_df.columns,index=test_df.index)

In [1]:
for col in test_df.columns[1:]:
    df[col] = df[col].fillna(predicted_df[col])  # assign back instead of chained inplace fill

In [1]:
fig = plt.figure(figsize=(20, 10))
fig.subplots_adjust(hspace=0.4, wspace=0.4)
count=1
for feature in features[-5:]:
    ax = fig.add_subplot(2, 3, count)
    sns.scatterplot(x=train_df[feature], y=train_df['Special'])
    sns.scatterplot(x=predicted_df[feature], y=predicted_df['Special'])
    count +=1
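
The nearest-neighbour step above can be illustrated on a tiny example. This is a sketch on made-up numbers (the `train` and `test` frames are assumptions): with `n_neighbors=1`, each missing goalkeeper attribute is copied from the training row whose 'Special' value is closest.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical training rows forming two groups: goalkeepers and outfield players
train = pd.DataFrame({
    "Special": [1000, 1050, 1900, 2000],
    "GKDiving": [70.0, 72.0, 10.0, 12.0],
})

# Rows whose GKDiving value is missing
test = pd.DataFrame(
    {"Special": [1040, 1960], "GKDiving": [np.nan, np.nan]},
    index=[10, 11],
)

imputer = KNNImputer(n_neighbors=1)
imputer.fit(train)

# transform() fills each NaN from the closest fitted row, measured on 'Special'
filled = pd.DataFrame(imputer.transform(test), columns=test.columns, index=test.index)
print(filled["GKDiving"].tolist())  # [72.0, 12.0]
```

Because the distance is computed on the unscaled 'Special' column only, the choice of neighbour depends entirely on that one feature; with more predictors, scaling them first would matter.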

* Let's impute Weight feature by Balance using LinearRegression

In [1]:
train_df = df[['Weight', 'Balance']].dropna()
test_df = df[df['Weight'].isna()][['Weight', 'Balance']].copy()  # copy: predictions are written into it below

In [1]:
polyreg=make_pipeline(PolynomialFeatures(2),LinearRegression())
polyreg.fit(X = train_df[['Balance']], y = train_df['Weight'])

test_df['Weight'] = np.round(polyreg.predict(test_df[['Balance']]))

In [1]:
df['Weight'] = df['Weight'].fillna(test_df['Weight'])

In [1]:
sns.scatterplot(x=train_df['Weight'], y=train_df['Balance'])  # observed values
sns.scatterplot(x=test_df['Weight'], y=test_df['Balance'])    # imputed values

* Let's impute StandingTackle and SlidingTackle feature by Marking using LinearRegression

In [1]:
train_df = df[['StandingTackle', 'SlidingTackle', 'Marking']].dropna()
test_df = df[df['SlidingTackle'].isna()][['StandingTackle', 'SlidingTackle', 'Marking']].copy()

In [1]:
for feature in ['StandingTackle', 'SlidingTackle']:
    polyreg=make_pipeline(PolynomialFeatures(2),LinearRegression())
    polyreg.fit(X = train_df[['Marking']], y = train_df[feature])

    test_df[feature] = np.round(polyreg.predict(test_df[['Marking']]))
    df[feature] = df[feature].fillna(test_df[feature])  # assign back instead of chained inplace fill

In [1]:
fig = plt.figure(figsize=(20, 5))
fig.subplots_adjust(hspace=0.4, wspace=0.4)
count=1
for feature in ['StandingTackle', 'SlidingTackle']:
    ax = fig.add_subplot(1, 2, count)
    sns.scatterplot(x=train_df['Marking'], y=train_df[feature])  # observed values
    sns.scatterplot(x=test_df['Marking'], y=test_df[feature])    # imputed values
    count +=1



* Let's impute Height, Jumping and Strength using a statistical imputation technique


In [1]:
# Height: most frequent value; Jumping and Strength: mean
df['Height'] = df['Height'].fillna(df['Height'].mode()[0])
df['Jumping'] = df['Jumping'].fillna(df['Jumping'].mean())
df['Strength'] = df['Strength'].fillna(df['Strength'].mean())
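
As a small illustration of statistical imputation, the sketch below fills one column with its mode and another with its mean. The `toy` values are made up; computing the statistic from the column itself, rather than typing in a constant, keeps the fill value in step with whatever cleaning was applied earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical columns with one missing value each
toy = pd.DataFrame({
    "Height": [70.0, 72.0, 72.0, np.nan, 68.0],
    "Jumping": [60.0, np.nan, 70.0, 80.0, 50.0],
})

# Mode for a discrete-valued feature, mean for a roughly symmetric one
toy["Height"] = toy["Height"].fillna(toy["Height"].mode()[0])
toy["Jumping"] = toy["Jumping"].fillna(toy["Jumping"].mean())

print(toy.loc[3, "Height"], toy.loc[1, "Jumping"])  # 72.0 65.0
```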

## As all the numerical features have been imputed, let's fill the categorical features

### Let's first visualise the category features distribution.

In [1]:
df[categorical_features].isna().sum()

In [1]:
sns.countplot(x=df['International Reputation'])

In [1]:
sns.countplot(x=df['Skill Moves'])

In [1]:
sns.countplot(x=df['Preferred Foot'])

In [1]:
sns.countplot(x=df['Weak Foot'])

In [1]:
fig = plt.figure(figsize=(15,5))
sns.countplot(x=df['Position'])

In [1]:
fig = plt.figure(figsize=(15,5))
sns.countplot(x=df['Body Type'])

In [1]:
fig = plt.figure(figsize=(15, 5))
sns.countplot(x=df['Work Rate'])

### Imputing the categorical variables based on the graphs above

In [1]:
# Assign back instead of chained inplace fills, which recent pandas versions warn about
df['Club'] = df['Club'].fillna('No Club')
df['Preferred Foot'] = df['Preferred Foot'].fillna('Right')
df['Weak Foot'] = df['Weak Foot'].fillna(3.0)
df['International Reputation'] = df['International Reputation'].fillna(1.0)
df['Body Type'] = df['Body Type'].fillna('Normal')
df['Work Rate'] = df['Work Rate'].fillna('Medium/Medium')
df['Position'] = df['Position'].fillna('NA')
df['Skill Moves'] = df['Skill Moves'].fillna(2.0)
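
The fill values above were read off the count plots. As a hedged alternative, the most frequent category can be looked up with `Series.mode()`, while a column such as Club keeps an explicit placeholder. The `toy` frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical categorical columns with missing entries
toy = pd.DataFrame({
    "Preferred Foot": ["Right", "Left", "Right", None],
    "Club": ["A", None, "B", "C"],
})

# Most frequent category, the same value one would read off a count plot
most_common_foot = toy["Preferred Foot"].mode()[0]
toy["Preferred Foot"] = toy["Preferred Foot"].fillna(most_common_foot)

# Club gets a neutral placeholder rather than a real club name
toy["Club"] = toy["Club"].fillna("No Club")

print(most_common_foot)  # Right
```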

### After all imputations let's visualise the Dataset using Heatmap

In [1]:
fig = plt.figure(figsize=(15,7))
sns.heatmap(df.isna(), yticklabels=False, cmap='YlGnBu')