This notebook is for initial exploration and model explanation using several state-of-the-art techniques: Permutation Importance, Partial Dependence Plots, and SHAP values [the coolest of them all ;)]. Links to a reasonable explanation of each technique are given in the corresponding sections.

## Feature Explanations

DBNOs - Number of enemy players knocked.

assists - Number of enemy players this player damaged that were killed by teammates.

boosts - Number of boost items used.

damageDealt - Total damage dealt. Note: Self inflicted damage is subtracted.

headshotKills - Number of enemy players killed with headshots.

heals - Number of healing items used.

killPlace - Ranking in match of number of enemy players killed.

killPoints - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.)

killStreaks - Max number of enemy players killed in a short amount of time.

kills - Number of enemy players killed.

longestKill - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.

matchId - Integer ID to identify match. There are no matches that are in both the training and testing set.

revives - Number of times this player revived teammates.

rideDistance - Total distance traveled in vehicles measured in meters.

roadKills - Number of kills while in a vehicle.

swimDistance - Total distance traveled by swimming measured in meters.

teamKills - Number of times this player killed a teammate.

vehicleDestroys - Number of vehicles destroyed.

walkDistance - Total distance traveled on foot measured in meters.

weaponsAcquired - Number of weapons picked up.

winPoints - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.)

groupId - Integer ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.

numGroups - Number of groups we have data for in the match.

maxPlace - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.

winPlacePerc - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
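
To make the target concrete, here is a minimal sketch of how a percentile placement of this kind could be computed from a final rank. The formula and the function name `win_place_perc` are assumptions for illustration, chosen only to match the description above (1 for first place, 0 for last, scaled by `maxPlace`); the dataset's actual computation is not given here.

```python
def win_place_perc(place: int, max_place: int) -> float:
    """Map a final rank (1 = winner) to a 0..1 percentile placement.

    Assumed formula for illustration, not taken from the dataset documentation.
    """
    if max_place < 2:
        raise ValueError("max_place must be at least 2")
    if not 1 <= place <= max_place:
        raise ValueError("place must lie in 1..max_place")
    return (max_place - place) / (max_place - 1)


# A hypothetical match with 5 recorded placements
for place in range(1, 6):
    print(place, win_place_perc(place, 5))
```

Because the scale is driven by `maxPlace`, a match with skipped placements would still map its best rank to 1 and its worst recorded rank to 0 under this assumed formula.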

The notebook will discuss the sections in the following order:
<html>
 I. Importing the dataset and initial preprocessing <br>
II. Let's explore <br>
III. Fit a linear model <br>
IV. Permutation Importance <br>
V. Partial Dependence Plot <br>
VI. SHAP Values <br>
VII. What does XGBoost Say <br>
VIII. Permutation Importance + PDP + Shap on XGBoost Results <br>
   </html>


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
#from plotly.offline import init_notebook_mode, iplot
#init_notebook_mode(connected = True)
#import plotly.graph_objs as go
import warnings
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight')

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

## Section I. Importing the dataset and initial preprocessing 

In [None]:
train = pd.read_csv('../input/train_V2.csv')

In [None]:
test = pd.read_csv('../input/test_V2.csv')

In [None]:
train.describe()

In [None]:
train.info()

In [None]:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    #start_mem = df.memory_usage().sum() / 1024**2
    #print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':  # Pick the smallest integer type that holds the column's range
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    #end_mem = df.memory_usage().sum() / 1024**2
    #print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    #print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df
train = reduce_mem_usage(train)
test = reduce_mem_usage(test)

In [None]:
train.info()
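
The effect of this kind of downcasting can be checked on a small frame. The sketch below is not the function above: it uses pandas' built-in `pd.to_numeric(..., downcast=...)` on made-up columns. It also shows why `float16` deserves some care, since it keeps only about three significant decimal digits.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names mirror the dataset for readability only
toy = pd.DataFrame({
    "kills": np.arange(1000, dtype=np.int64) % 30,
    "walkDistance": np.arange(1000, dtype=np.float64) * 2.5,
})

before = toy.memory_usage(deep=True).sum()

small = toy.copy()
# Values 0..29 fit into the smallest signed integer type
small["kills"] = pd.to_numeric(small["kills"], downcast="integer")
# pandas never goes below float32 when downcasting floats
small["walkDistance"] = pd.to_numeric(small["walkDistance"], downcast="float")

after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print("bytes before: {}, after: {}".format(before, after))

# float16 cannot hold a value like 4321.7 exactly; it is rounded to a nearby value
exact = 4321.7
rounded = float(np.float16(exact))
print(exact, rounded)
```

If precision matters for distance-like columns, stopping at `float32` is a common compromise.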

# Let's Explore

## Validating IID
#### Univariate Distribution [Train]

In [None]:
#Univariate Distribution
for col in train.columns:
    if 'Id' in col:
        pass
    elif train[col].dtype == 'float16' or len(train[col].unique()) > 100:
        print(col)
        plt.figure(figsize = (15, 8))
        sns.kdeplot(train[col])
        plt.show()
    else:
        print(col)
        plt.figure(figsize = (15, 8))
        train[col].value_counts().plot(kind = 'bar')
        plt.show()

#### Univariate Distribution [Test]

In [None]:

for col in test.columns:
    if 'Id' in col:
        pass
    elif test[col].dtype == 'float16' or len(test[col].unique()) > 100:
        print(col)
        plt.figure(figsize = (15, 8))
        sns.kdeplot(test[col])
        plt.show()
    else:
        print(col)
        plt.figure(figsize = (15, 8))
        test[col].value_counts().plot(kind = 'bar')
        plt.show()

## Bivariate Distribution with target [Relationship with target]

In [None]:
for col in train.columns:
    if 'Id' in col or col == 'winPlacePerc':
        pass
    elif train[col].dtype == 'float16':
        print(col)
        sns.jointplot(x = col, y = 'winPlacePerc', data = train, height = 10, ratio = 3)
        plt.show()
    else:
        print(col)
        sns.catplot(x = col, y = 'winPlacePerc', data=train, kind = 'boxen', aspect=3)
        plt.show()

In [None]:
#Correlation map of variables in the dataset
plt.figure(figsize = (15, 8))
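# Use numeric columns only so that any non-numeric columns do not break corr()
sns.heatmap(train.select_dtypes(include='number').corr())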

This dataset tells quite a story. For instance, a better killPlace [numerically lower] and a larger walkDistance go along with a higher winPlacePerc (and thus a Chicken Dinner). Many of the variables are inter-related, so some of them should be dropped to avoid multicollinearity. Here, we can see that:

1. The higher the walk distance, the lower the value of killPlace, which means a higher number of kills.
2. Heals and boosts, along with more weapons, keep you up and running, which is a good explanation for a higher number of kills.
3. Some of the variables follow a Pareto-like distribution, so applying either a log transformation or a Box-Cox transformation should be beneficial.
4. Most of the variables have a non-linear relationship with the win placement.

Let's check whether this intuition holds by fitting a simple model.
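
As a quick check of the third point, the sketch below draws a heavy-tailed sample and compares its skewness before and after a `log1p` transform. The data is synthetic (a Pareto-type draw from NumPy), not a column of the dataset, so it only illustrates the effect. Box-Cox would need strictly positive inputs, which is why columns containing zeros are usually shifted first.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic heavy-tailed, non-negative sample standing in for a skewed column
raw = pd.Series(rng.pareto(a=2.0, size=10_000))

# log1p(x) = log(1 + x) is defined at zero, unlike a plain log
logged = np.log1p(raw)

skew_raw = raw.skew()
skew_logged = logged.skew()
print("skewness before: {:.2f}, after log1p: {:.2f}".format(skew_raw, skew_logged))
```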

Feature engineering (FE) ideas: total distance covered (ride + swim + walk), and heals + boosts.
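
The two engineered features named above could be built as follows. This is a sketch on a made-up three-row frame; the new column names `totalDistance` and `healsAndBoosts` are illustrative choices, not names from the dataset.

```python
import pandas as pd

# Made-up rows using the dataset's column names
toy = pd.DataFrame({
    "rideDistance": [0.0, 1200.5, 300.0],
    "swimDistance": [0.0, 0.0, 45.5],
    "walkDistance": [150.0, 2300.0, 980.0],
    "heals": [0, 3, 1],
    "boosts": [0, 2, 4],
})

# Total distance covered, regardless of how the player travelled
toy["totalDistance"] = toy[["rideDistance", "swimDistance", "walkDistance"]].sum(axis=1)

# Combined count of healing and boost items used
toy["healsAndBoosts"] = toy["heals"] + toy["boosts"]

print(toy[["totalDistance", "healsAndBoosts"]])
```

Once combined columns like these are added, dropping the originals is one way to reduce the multicollinearity noted earlier.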

In [None]:
#Drop Null Vals
train.dropna(inplace = True)

## Fit a Linear Model [Lasso or Linear Regression]

In [None]:
from sklearn.linear_model import LinearRegression, Lasso
def fit_linear_model(train, y):
    lm = Lasso(alpha = 0.0000001)
    lm.fit(train, y)
    return lm

In [None]:
import xgboost

def fit_tree_model(train, y):
    # train XGBoost model
    model = xgboost.train({"learning_rate": 0.05}, xgboost.DMatrix(train, label=y), 100)
    return model

In [None]:
from sklearn.model_selection import train_test_split
cols = ['assists', 'boosts', 'damageDealt', 'DBNOs',
           'headshotKills', 'heals', 'killPlace', 'killPoints', 'kills',
           'killStreaks', 'longestKill', 'maxPlace', 'numGroups', 'revives',
           'rideDistance', 'roadKills', 'swimDistance', 'teamKills',
           'vehicleDestroys', 'walkDistance', 'weaponsAcquired', 'winPoints']
trainX, valX, trainY, valY = train_test_split(train[cols], train['winPlacePerc'], test_size = 0.2)
lm = fit_linear_model(trainX, trainY)
print('Score: {:.4f} & RMSE: {:.4f}'.format(lm.score(valX, valY), np.sqrt(mean_squared_error(valY, lm.predict(valX)))))

Results:
1. R² score: 0.8148 & RMSE: 0.1323 [with the given attributes]
2. R² score: 0.8071 & RMSE: 0.1295 [with Box-Cox transformation]
3. R² score: 0.8447 & RMSE: 0.1213 [median of given values (to evaluate relative standing within a game) + Box-Cox transformation]
4. R² score: 0.87 & RMSE: 0.11 [polynomial regression on the original values]
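
The code for the polynomial variant is not shown in this notebook. The sketch below is one way such a model might be set up with scikit-learn, on synthetic data with a curved relationship. The degree of 2 and the pipeline layout are assumptions, and scores are computed on the training sample for brevity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic data with a curved relationship, standing in for the real features
X = rng.uniform(-2.0, 2.0, size=(500, 2))
y = 0.5 * X[:, 0] ** 2 - X[:, 1] + rng.normal(scale=0.1, size=500)

# Plain linear fit for comparison
linear = LinearRegression().fit(X, y)

# degree=2 is an assumption; the degree behind the result above is not stated
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

r2_linear = linear.score(X, y)
r2_poly = poly.score(X, y)
print("R^2 linear: {:.3f}, R^2 degree-2 polynomial: {:.3f}".format(r2_linear, r2_poly))
```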

## Permutation Importance [Inspecting Black Box Models]

This method ranks the features by how much the model relies on them: each feature is shuffled in turn and the resulting drop in score is measured. With uncorrelated variables it behaves much like sequential backward selection (SBS) or sequential forward selection (SFS), but it behaves differently when variables are correlated. I feel this method is more robust for describing how the features relate to the model and to its performance. You can get an overview <a href = "https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html"> here</a>. 

In [None]:
from eli5.sklearn import PermutationImportance
perm = PermutationImportance(lm, random_state=1).fit(train[cols], train['winPlacePerc'])
eli5.show_weights(perm, feature_names = cols)

We can see that Permutation Importance marks numGroups and maxPlace, rather than killPlace and walkDistance, as the features that hurt the model most when permuted. My best guess is that killPlace and walkDistance depend heavily on other variables, so permuting them has less effect, while numGroups and maxPlace depend less on other variables, so permuting them has more effect. We will try more advanced models and compare the results.
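
To show what the permutation step does under the hood, here is a from-scratch sketch in plain NumPy. It does not use eli5 and runs on synthetic data with a least-squares fit, so the numbers say nothing about this dataset. For brevity it scores on the same sample it was fitted on, whereas held-out data is usually preferable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic data: y depends strongly on x0, weakly on x1, not at all on x2
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Ordinary least squares with an intercept column
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)


def r2(features, target):
    """Coefficient of determination of the fitted linear model."""
    pred = np.column_stack([np.ones(len(features)), features]) @ coef
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


baseline = r2(X, y)
importances = []
for j in range(X.shape[1]):
    shuffled = X.copy()
    shuffled[:, j] = rng.permutation(shuffled[:, j])  # break the link between column j and y
    importances.append(baseline - r2(shuffled, y))    # drop in score = importance

print(importances)
```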

## Partial Dependence Plot

So far we have seen how sensitive our model is to the variables it was trained on. Let's see how its predictions change when we vary one variable at a time and keep everything else constant. This technique is known as a Partial Dependence Plot (PDP). You can browse the documentation <a href='https://pdpbox.readthedocs.io/en/latest/index.html'>here</a>.

In [None]:
from matplotlib import pyplot as plt
from pdpbox import pdp, get_dataset, info_plots

for feat_name in cols:
    pdp_dist = pdp.pdp_isolate(model=lm, dataset=valX, model_features=cols, feature=feat_name)
    pdp.pdp_plot(pdp_dist, feat_name)
    plt.show()

The graphs match the intuition we gained from the correlations in the heatmap. Since we have fitted a very simple model, these graphs can only depict linear relationships.
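
The reason a linear model produces straight-line plots can be seen in a small from-scratch sketch of partial dependence. It does not use pdpbox; the data, coefficients, and helper names (`predict`, `partial_dependence`) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features and a fixed linear model, standing in for the fitted regressor
X = rng.normal(size=(500, 3))
coef = np.array([2.0, -1.0, 0.5])
intercept = 0.3


def predict(features):
    """Prediction of the illustrative linear model."""
    return features @ coef + intercept


def partial_dependence(features, column, grid):
    """Average prediction when `column` is forced to each grid value."""
    averages = []
    for value in grid:
        modified = features.copy()
        modified[:, column] = value  # overwrite one feature for every row
        averages.append(predict(modified).mean())
    return np.array(averages)


grid = np.linspace(-2.0, 2.0, 5)
pd_curve = partial_dependence(X, column=0, grid=grid)

# For a linear model the curve is a straight line whose slope is the coefficient
slope = (pd_curve[-1] - pd_curve[0]) / (grid[-1] - grid[0])
print(pd_curve, slope)
```

With a non-linear model such as a tree ensemble, the same averaging procedure can produce curved or stepped lines.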

## SHAP [SHapley Additive exPlanations]

This is a very cool finding from last year. You can find the paper <a href = 'https://arxiv.org/abs/1706.06060'>here</a> and the implementation <a href = 'https://github.com/slundberg/shap'>here</a>. In a nutshell, this method explains why a model gives a certain prediction. It helps to answer questions like:

1. If a model decides to release a patient from hospital, which factor contributed most to that decision?
2. If a model decides to approve a loan for a customer, which positive aspects did the model find for that candidate?

This makes such decisions much easier to explain to upper management. This method helps us crack the black boxes wide open.
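
For a linear model with features treated as independent, the per-feature contributions have a simple closed form, which the sketch below works through in plain NumPy. It does not call the shap library; the data and coefficients are made up. Each contribution is the coefficient times the feature's deviation from its average, and the contributions plus the average prediction add up to the model's output for that row.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic background data and a fixed linear model (illustrative values)
background = rng.normal(loc=1.0, size=(1000, 3))
coef = np.array([2.0, -1.0, 0.5])
intercept = 0.3


def predict(features):
    """Prediction of the illustrative linear model."""
    return features @ coef + intercept


# Expected model output over the background data
expected_value = predict(background).mean()

# One row to explain
x = np.array([2.0, 0.0, 1.0])

# contribution_i = coef_i * (x_i - mean of feature i)
shap_values = coef * (x - background.mean(axis=0))

print(shap_values)
print(expected_value + shap_values.sum(), predict(x))
```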

In [None]:
import shap
# load JS visualization code to notebook
shap.initjs()

# explain the linear model's predictions using SHAP values
explainer = shap.LinearExplainer(lm, data=trainX)
shap_values = explainer.shap_values(valX.iloc[:1000,:])

# visualize the first prediction's explanation
#shap.force_plot(explainer.expected_value, shap_values[0,:], valX.iloc[0,:])
# visualize the explanations of the first 1000 validation rows
# (no logit link: the model predicts winPlacePerc directly, not log-odds)
shap.force_plot(explainer.expected_value, shap_values, valX.iloc[:1000,:])

In [None]:
explainer = shap.LinearExplainer(lm, data=trainX)
shap_values = explainer.shap_values(valX)
shap.summary_plot(shap_values, valX)

## What does XGBoost Say

In [None]:
tree = fit_tree_model(trainX, trainY)
print('RMSE: {:.4f}'.format(np.sqrt(mean_squared_error(tree.predict(data=xgboost.DMatrix(valX[cols])), valY))))

## Permutation Importance + PDP + Shap on XGBoost Results 

In [None]:
# load JS visualization code to notebook
shap.initjs()

# explain the model's predictions using SHAP values
# (same syntax works for LightGBM, CatBoost, and scikit-learn models)
explainer = shap.TreeExplainer(tree)
shap_values = explainer.shap_values(valX)

# visualize the first prediction's explanation
shap.force_plot(explainer.expected_value, shap_values[0,:], valX.iloc[0,:])

In [None]:
shap.force_plot(explainer.expected_value, shap_values[1,:], valX.iloc[1,:])

In [None]:
explainer = shap.TreeExplainer(tree)
shap_values = explainer.shap_values(valX)
shap.summary_plot(shap_values, valX)

In [None]:
shap_values = explainer.shap_values(valX.iloc[:1000,:])
# no logit link: the model predicts winPlacePerc directly, not log-odds
shap.force_plot(explainer.expected_value, shap_values, valX.iloc[:1000,:])

## Aha! Pretty Beautiful. 
These graphs tell us how much impact a particular variable has on the prediction for a datapoint. From these graphs we can say that the following variables have a direct relationship with the prediction for a datapoint:

-- killPlace

-- walkDistance

-- boosts

-- numGroups

On the other hand, the following variables have an interactive relationship with the model:

-- numGroups

-- DBNOs

-- maxPlace

-- killStreaks

-- weaponsAcquired

All the other variables have almost zero impact on the model, which narrows our search for meaningful variables.
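
One way to put a number on "almost zero impact" is to rank features by their mean absolute SHAP value. The sketch below uses a made-up SHAP matrix, not values from the model above; with the real `shap_values` array and `cols` list the same few lines would apply.

```python
import numpy as np

# Made-up SHAP matrix: rows are samples, columns are features
feature_names = ["killPlace", "walkDistance", "boosts", "roadKills"]
shap_matrix = np.array([
    [-0.20, 0.15, 0.03, 0.00],
    [0.25, -0.10, 0.02, 0.00],
    [-0.30, 0.20, -0.04, 0.01],
])

# Global importance: average magnitude of each feature's contribution
mean_abs = np.abs(shap_matrix).mean(axis=0)

# Sort features from most to least influential
order = np.argsort(mean_abs)[::-1]
ranking = [(feature_names[i], float(mean_abs[i])) for i in order]
for name, value in ranking:
    print("{:<14s} {:.3f}".format(name, value))
```

Features whose mean absolute value sits near zero are candidates for removal, though a feature that matters only for a few rows can still be worth keeping.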
