# My Approach to Tabular May 2021 Competition

Hi, in this notebook I have performed extensive EDA and used SHAP to find the most impactful and the least useful features in this data, to help us make better predictions.

Please upvote if you like it!

# Table of Contents
* [Importing Libraries](#section-one)
* [Reading the data files](#section-two)
* [Overview](#section-three)
* [Exploratory Data Analysis (EDA)](#section-four)
    - [Correlation Check](#subsection-fourtwo)
    - [Scaling](#subsection-fourone)
    - [Outlier Treatment](#subsection-fourthree)
* [Modeling](#section-six)
* [Model Explainability using SHAP](#section-seven)

<a id="section-one"></a>
# Importing Libraries

In [None]:
#Importing Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import shap
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import f1_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
from sklearn.metrics import log_loss
from statistics import mean

import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_columns', 100)
sns.set_palette("coolwarm_r", 4)

<a id="section-two"></a>
# Reading the data files

In [None]:
#Reading the data files

train = pd.read_csv('../input/tabular-playground-series-may-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-may-2021/test.csv')
sample = pd.read_csv('../input/tabular-playground-series-may-2021/sample_submission.csv')

<a id="section-three"></a>
# Overview

In [None]:
print(f'Shape of train data: {train.shape}')
print(f'Missing values count: {train.isna().sum().sum()}')

train.head()

In [None]:
train.info()
print ("*"*40)
train.nunique()

* Training data has 100000 records and 50 features. 
* Column 'id' is the primary key.
* It's a multiclass classification problem and 'target' is our target variable.
* All the features are numerical in this data.
* There are no missing values in the data.
* The numerical features look discrete, since each has only a small number of unique values.

In [None]:
print(f'Shape of test data: {test.shape}')
print(f'Missing values count: {test.isna().sum().sum()}')

test.head()

In [None]:
test.info()
print ("*"*40)
test.nunique()

* Test data has 50000 records and 50 features. 
* Column 'id' is the primary key.
* All the features are numerical in this data.
* There are no missing values in the data.
* The numerical features look discrete, since each has only a small number of unique values.

In [None]:
sample.head()

* We need to submit the predicted probability values for each id in the test data.

<a id="section-four"></a>
# Exploratory Data Analysis (EDA)

In [None]:
# Setting index as 'id'
train = train.set_index('id')
test = test.set_index('id')

In [None]:
#Checking if there is any difference between the behaviour of train and test data
train.describe() - test.describe()

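Apart from the `count` row, the summary statistics of the train and test features differ only slightly. This is a good sign: it suggests both sets come from the same distribution, so local validation should be a reliable guide to test performance.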
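The raw difference of two `describe()` tables is hard to read when features live on different scales. A minimal sketch of one way to make it comparable is shown below, using synthetic data: the frames `train_demo` and `test_demo` and the column names `f0`..`f2` are made up for illustration and are not the competition data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the train/test feature frames (assumption: same numeric columns)
rng = np.random.default_rng(0)
train_demo = pd.DataFrame(rng.poisson(1.0, size=(1000, 3)), columns=['f0', 'f1', 'f2'])
test_demo = pd.DataFrame(rng.poisson(1.0, size=(500, 3)), columns=['f0', 'f1', 'f2'])

# Skip the 'count' row: it only reflects the different number of rows
stats = ['mean', 'std', 'min', 'max']
diff = train_demo.describe().loc[stats] - test_demo.describe().loc[stats]

# Express the gap between means in units of the training standard deviation
scaled_mean_gap = (diff.loc['mean'] / train_demo.std()).abs()
print(scaled_mean_gap.sort_values(ascending=False))
```

Features with a large scaled gap would be the first candidates to inspect for train/test drift.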

In [None]:
num_columns = train.select_dtypes(exclude=['object']).columns
num_columns = [i for i in num_columns if i != 'target']

cat_columns = train.select_dtypes(include=['object']).columns

In [None]:
train.describe().T.style.bar(subset=['mean'], color='royalblue')\
                            .background_gradient(subset=['std'], cmap='coolwarm_r')\
                            .background_gradient(subset=['50%'], cmap='coolwarm_r')\
                            .background_gradient(subset=['min'], cmap='coolwarm_r')\
                            .background_gradient(subset=['max'], cmap='coolwarm_r')

**Observations**

* Most of the features have a median of 0, i.e. at least half of their values are 0.
* Only feature 14 and feature 38 have a non-zero median.
* Only a handful of features have negative values. It will be interesting to see their importance in prediction.

#### Target Feature

In [None]:
sorted(train['target'].unique())

In [None]:
#Checking the distribution of target variable

target3 = train['target'].value_counts()['Class_4']
target2 = train['target'].value_counts()['Class_3']
target1 = train['target'].value_counts()['Class_2']
target0 = train['target'].value_counts()['Class_1']
target3per = target3 / train.shape[0] * 100
target2per = target2 / train.shape[0] * 100
target1per = target1 / train.shape[0] * 100
target0per = target0 / train.shape[0] * 100

print('{} of {} records have target 1 and it is the {:.2f}% of the training set.'.format(target0, train.shape[0], target0per))
print('{} of {} records have target 2 and it is the {:.2f}% of the training set.'.format(target1, train.shape[0], target1per))
print('{} of {} records have target 3 and it is the {:.2f}% of the training set.'.format(target2, train.shape[0], target2per))
print('{} of {} records have target 4 and it is the {:.2f}% of the training set.\n'.format(target3, train.shape[0], target3per))

plt.figure(figsize=(8,6))
# Pass the column as x explicitly; recent seaborn versions no longer accept it positionally
sns.countplot(x = train['target'], palette = 'coolwarm_r', order = sorted(train['target'].unique()))

plt.xlabel('Target', size=12, labelpad=15)
plt.ylabel('Count', size=12, labelpad=15)
plt.xticks((0, 1, 2, 3), ['1 ({0:.2f}%)'.format(target0per), '2 ({0:.2f}%)'.format(target1per), '3 ({0:.2f}%)'.format(target2per), '4 ({0:.2f}%)'.format(target3per)])
plt.tick_params(axis='x', labelsize=12)
plt.tick_params(axis='y', labelsize=12)

plt.title('Training Set Target Distribution', size=15, y=1.05)

plt.show()

**Observations**

* The class distribution is imbalanced.
* More than 50% of the records belong to class 2.
* The smallest class is class 1, with only 8.5% of the records.
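The class shares can also be computed in a single step. The sketch below uses a small made-up series, `target_demo`, in the same label format as the competition target; the values are illustrative only.

```python
import pandas as pd

# Toy target column in the same label format as the competition data
target_demo = pd.Series(['Class_2'] * 5 + ['Class_1'] * 1 + ['Class_3'] * 2 + ['Class_4'] * 2)

# normalize=True returns fractions instead of counts; sort_index orders by class label
share = target_demo.value_counts(normalize=True).sort_index() * 100
print(share.round(2))
```

On the real data, `train['target'].value_counts(normalize=True)` should give the same percentages as the four separate computations above.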

In [None]:
# Label Encoding the classes

train.loc[train['target'] == 'Class_1', 'target'] = '1'
train.loc[train['target'] == 'Class_2', 'target'] = '2'
train.loc[train['target'] == 'Class_3', 'target'] = '3'
train.loc[train['target'] == 'Class_4', 'target'] = '4'

train['target'] = train['target'].astype(int)

#### Continuous Features

In [None]:
len(num_columns)

All 50 features in this data are numerical.

In [None]:
# Checking the distribution of continuous features
from tqdm import tqdm

i = 1
fig, ax = plt.subplots(10,5, figsize=(40,30))

for feature in tqdm(num_columns):
    plt.subplot(10,5, i)
    # `vertical` is deprecated; with it set, seaborn drew the feature on the x-axis, so pass x directly
    sns.kdeplot(data = train, x = feature)
    plt.xlabel(f'{feature}- Skew: {round(train[feature].skew(), 2)}', size=20)
    i += 1

fig.tight_layout()
plt.show()

**Observations**

* All the features show a large peak at 0.
* The features are sparse, much like one-hot encoded columns.
* All the features are skewed, but let's not treat the skewness since the values in this data are discrete rather than continuous.

In [None]:
# Checking how the four most frequent values of each feature are split across the target classes
from tqdm import tqdm

i = 1
fig, ax = plt.subplots(10,5, figsize=(50,30))

for feature in tqdm(num_columns):
    plt.subplot(10,5, i)
    sns.countplot(data = train, x = feature, order = train[feature].value_counts()[:4].index, hue = 'target', palette = 'coolwarm_r')
    plt.xlabel(feature, size=25)
    plt.legend(loc='upper right', prop={'size': 15})
    i += 1

fig.tight_layout()
plt.show()

**Observations**

* There is no clear insight here, since each value is split across the classes in almost the same proportion as the target variable itself.
* Clearly, 0 alone won't help the model in classification.
* It will be interesting to see if the model can pick up any value other than 0 that helps in classification.

### Analyzing Zeros

Since 0 covers most of the cell values in this data, let's check if there is any interesting pattern with zeros in this data.

In [None]:
zero_data = ((train.drop('target', axis = 1)==0).sum() / len(train) * 100)[::-1]
fig, ax = plt.subplots(1,1,figsize=(10, 19))

ax.barh(zero_data.index, 100, color='lightgrey', height=0.6)
barh = ax.barh(zero_data.index, zero_data, height=0.6, color='royalblue')
ax.bar_label(barh, fmt='%.01f %%')
ax.spines[['left', 'bottom', 'right']].set_visible(False)

ax.set_xticks([])

ax.set_title('% of Zeros (by feature)', loc='center', fontweight='bold', fontsize=15)
plt.show()

**Observations**

* Some features are more than 90% zeros. These features are very sparse and may not help the model much.
* Features 15, 27 and 38 look the most promising since they have a variety of negative, positive and zero values.

<a id="subsection-fourtwo"></a>
### Correlation Check

In [None]:
num_columns = train.select_dtypes(exclude=['object']).columns
num_columns = [i for i in num_columns if i != 'target']

cat_columns = train.select_dtypes(include=['object']).columns

In [None]:
#Let's check how the features are inter-related to each other and with target variable
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(60,60))
ax.set_title("Correlation Matrix", fontsize=30)

corr = train[num_columns + ['target']].corr().abs()
# np.bool was removed in recent NumPy releases; the built-in bool works everywhere
mask = np.triu(np.ones_like(corr, dtype=bool))

sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap='coolwarm_r',
            cbar_kws={"shrink": .8}, vmin=0, vmax=1)

# tick.label is no longer available in recent matplotlib; tick_params sets size and rotation directly
ax.tick_params(axis='x', labelsize=20, labelrotation=90)
ax.tick_params(axis='y', labelsize=20, labelrotation=0)
    
plt.show()

**Observations**

* No pair of features shows a strong linear correlation, and no feature is strongly linearly correlated with the target variable.

<a id="subsection-fourone"></a>
### Scaling

In [None]:
train.describe()

**Observations**

* There are some high max values in this data. Let's get them to a standard scale.

In [None]:
#Scaling the data using standard scaler

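# Fit the scaler on the training data only and reuse it for the test data,
# so both sets are mapped to the same scale
scaler = StandardScaler()
train[num_columns] = scaler.fit_transform(train[num_columns])
test[num_columns] = scaler.transform(test[num_columns])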

<a id="subsection-fourthree"></a>
### Outlier Treatment

In [None]:
# OUTLIERS

iqr_factor = [3]
list1, list2 = [], []

for factor in iqr_factor:
    count = 0
    print(f'Outliers for {factor} IQR :')
    print('-------------------------------------')
    for col in num_columns:
    
        IQR = train[col].quantile(0.75) - train[col].quantile(0.25)
        lower_lim = train[col].quantile(0.25) - factor*IQR
        upper_lim = train[col].quantile(0.75) + factor*IQR
    
        cond = train[(train[col] < lower_lim) | (train[col] > upper_lim)].shape[0]
        
        if cond > 0 and factor == 1.5:
            list1.append(train[(train[col] < lower_lim) | (train[col] > upper_lim)].index.tolist())
        elif cond > 0 and factor == 3:
            list2.append(train[(train[col] < lower_lim) | (train[col] > upper_lim)].index.tolist())
        
        if cond > 0: print(f'{col:<30} : ', cond); count += cond
    print(f'\nTOTAL OUTLIERS FOR {factor} IQR : {count}')
    print('')

**Observations**

* The above table shows the number of outliers in each feature. 
* However, most of these are not real outliers: since the data is very sparse, almost every value other than 0 gets flagged as an outlier.
* Let's keep these values as they are, since they are the ones that will help the model in classification.
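Why the IQR rule flags so many values on sparse data can be seen on a toy column. In the sketch below, `col` is a made-up series with 80% zeros: both quartiles are 0, so the IQR is 0 and the limits collapse onto 0 regardless of the IQR factor.

```python
import pandas as pd

# Toy sparse column: 80% zeros, the rest small positive counts
col = pd.Series([0] * 80 + [1] * 10 + [2] * 6 + [5] * 4)

q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1                                  # both quartiles are 0, so the IQR is 0
lower, upper = q1 - 3 * iqr, q3 + 3 * iqr      # both limits are 0 as well

flagged = col[(col < lower) | (col > upper)]
print(f'IQR={iqr}, flagged={len(flagged)}, non-zero={int((col != 0).sum())}')
```

Every non-zero value ends up flagged, which matches the pattern in the counts above.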

<a id="section-six"></a>
# Modeling

Let's try different ML models and see which performs best.

In [None]:
train = train.reset_index(drop = True)

In [None]:
# Storing the target variable separately

X_train = train.drop('target', axis = 1)
X_test = test
y_train = train['target']

print('X_train shape: {}'.format(X_train.shape))
print('y_train shape: {}'.format(y_train.shape))
print('X_test shape: {}'.format(X_test.shape))

In [None]:
#Stratified K fold Cross Validation

def train_and_validate(model, N):
    
    scores = []
    regex = r'^[^\(]+'  # raw string avoids the invalid escape sequence warning
    match = re.findall(regex, str(model))
    print(f'Running {N} Fold CV with {match[0]} Model.')
    
    preds = np.zeros((test.shape[0],4))

    importances = pd.DataFrame(np.zeros((X_train.shape[1], N)), columns=['Fold_{}'.format(i) for i in range(1, N + 1)], index=train.drop('target', axis = 1).columns)

    skf = StratifiedKFold(n_splits=N, random_state=N, shuffle=True)

    for fold, (trn_idx, val_idx) in enumerate(skf.split(X_train, y_train), 1):
        print('Fold {}\n'.format(fold))
        
        # Fitting the model
        # Use positional indexing for the target too, so it always matches X_train.iloc
        model.fit(X_train.iloc[trn_idx], y_train.iloc[trn_idx])

        # Computing Train logloss score
        trn_logloss_score = log_loss(y_train.iloc[trn_idx], model.predict_proba(X_train.iloc[trn_idx]))
        # Computing Validation logloss score
        val_logloss_score = log_loss(y_train.iloc[val_idx], model.predict_proba(X_train.iloc[val_idx]))

        scores.append((trn_logloss_score, val_logloss_score))

        preds += model.predict_proba(X_test)/skf.n_splits
        importances.iloc[:, fold - 1] = model.feature_importances_
        
        print(scores[-1])
    
    trlogloss = mean([i[0] for i in scores])
    cvlogloss = mean([i[1] for i in scores])
    
    print(f'Average Training logloss: {trlogloss}, Average CV logloss: {cvlogloss}')
    print ("*"*40)
    print ("\n")
    
    return trlogloss, cvlogloss, importances, preds, model

In [None]:
#Testing multiple ML models using stratified K fold CV

df_row = []
N = 3

for i in [DecisionTreeClassifier(),
    LGBMClassifier(),
    RandomForestClassifier(n_estimators = 10, max_depth = 10)]:
    
    trlogloss, cvlogloss, importances, preds, model = train_and_validate(i, N)
    
    regex = r'^[^\(]+'
    match = re.findall(regex, str(i))
    
    df_row.append([match[0], trlogloss, cvlogloss])

df = pd.DataFrame(df_row, columns = ['Model', f'{N} Fold Training logloss', f'{N} Fold CV logloss'])
df

**Observations**

* The LGBM model has the lowest CV logloss.
* RandomForest, however, has the smallest gap between training logloss and CV logloss, so it overfits the least of the three models.
* On that basis Random Forest generalizes well here, although the CV logloss itself remains the more direct estimate of performance on unseen data.
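To look at both criteria at once, the CV logloss and the train/CV gap can be placed side by side. The sketch below uses made-up scores and placeholder model names (`model_a`, `model_b`, `model_c`); they are not the output of this notebook.

```python
import pandas as pd

# Hypothetical scores for illustration only, NOT the notebook's actual results
results = pd.DataFrame({
    'Model': ['model_a', 'model_b', 'model_c'],
    'train_logloss': [0.10, 1.05, 1.09],
    'cv_logloss': [9.50, 1.10, 1.11],
})

# A small gap means little overfitting; a low cv_logloss means good expected performance
results['gap'] = results['cv_logloss'] - results['train_logloss']
print(results.sort_values('cv_logloss'))
```

In this toy table the model with the smallest gap is not the one with the lowest CV logloss, so the two criteria can point to different models.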

In [None]:
#Plotting the RandomForest importances

importances['Mean_Importance'] = importances.mean(axis=1)
importances.sort_values(by='Mean_Importance', inplace=True, ascending=False)

plt.figure(figsize=(8,8))
sns.barplot(x='Mean_Importance', y=importances.head(15).index, data=importances.head(15), palette = 'coolwarm_r')

plt.xlabel('')
plt.tick_params(axis='x', labelsize=10)
plt.tick_params(axis='y', labelsize=10)
plt.title('Top 15 features', size=10)

plt.show()

**Observations**

* As expected from the EDA, features 38 and 14 are among the most important features, since they behaved differently from all other features in our data.
* It's interesting to see features 2, 15 and 6 at the top of the importance list. Let's explore these features further using SHAP.

Let's try making a submission with RandomForest model and see the performance on leaderboard.

In [None]:
#Creating the submission with Random Forest Model

model = RandomForestClassifier(n_estimators = 10, max_depth = 10)
trlogloss, cvlogloss, importances, preds, _ = train_and_validate(model, 5)

sample.iloc[:, 1:] = preds
sample.to_csv('submission.csv', index = False)

We got a logloss of 1.104 on leaderboard on submitting the above csv. Let's check if LGBM model gets a better score.

In [None]:
#Creating the submission with LGBM Model

model = LGBMClassifier()
trlogloss, cvlogloss, importances, preds, _ = train_and_validate(model, 5)

sample.iloc[:, 1:] = preds
sample.to_csv('submission.csv', index = False)

Great! LGBM scored a logloss of 1.088 which is an improvement over the Random Forest Model.
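For context on these scores, it helps to know what an uninformed prediction would get. A minimal sketch, assuming four classes labelled 1 to 4 as after the label encoding above (the array `y_true` is a toy example): predicting 0.25 for every class gives a logloss of ln(4).

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy labels with four classes (1..4), as after label encoding
y_true = np.array([1, 2, 2, 3, 4, 2, 2, 1])

# Every row predicts a probability of 0.25 for each of the four classes
uniform = np.full((len(y_true), 4), 0.25)

baseline = log_loss(y_true, uniform, labels=[1, 2, 3, 4])
print(round(baseline, 4))  # ln(4), about 1.3863
```

Any useful model should score below this value; how far below depends on how much signal the features carry.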

Let's try to understand more about the behaviour of the features using SHAP.

<a id="section-seven"></a>
# Model Explainability using SHAP

In [None]:
#Fitting the SHAP on our model and training data

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)

In [None]:
#Plotting the SHAP summary. (Note: class 0 in the analysis below corresponds to class 1 in the data, and so on.)

shap.summary_plot(shap_values, X_train, color=plt.get_cmap("tab20c"))

**Observations**

* Features 31, 24, 16 have a major impact in predicting class 3.
* Features 15, 11, 2, 1, 33 help the most in detecting class 2.
* Features 37, 25, 28 have the greatest impact in deciding class 0.
* Features 6, 37, 28, 34 have a major impact in predicting class 1.

In [None]:
shap.summary_plot(shap_values[0], X_train, show = False, cmap = 'coolwarm_r')
plt.gcf().axes[-1].set_aspect(100)
plt.gcf().axes[-1].set_box_aspect(100)

**Observations**

* The red dots clustered in the center indicate that 0 values have no impact on predicting class 1.
* Features 6, 15, 41 are positively correlated with class 1.
* Features 25, 19 show a slight negative correlation with class 1.
* Features 17, 1 have the least impact on class 1.

In [None]:
shap.summary_plot(shap_values[1], X_train, show = False, cmap = 'coolwarm_r')
plt.gcf().axes[-1].set_aspect(100)
plt.gcf().axes[-1].set_box_aspect(100)

**Observations**

* The red dots clustered in the center indicate that 0 values have no impact on predicting class 2.
* Features 19, 35, 29, 14, 28 show a positive correlation with class 2.
* Features 15, 6, 10, 42, 30 show a negative correlation with class 2.
* Feature 2 has the least impact on class 2.

In [None]:
shap.summary_plot(shap_values[2], X_train, show = False, cmap = 'coolwarm_r')
plt.gcf().axes[-1].set_aspect(100)
plt.gcf().axes[-1].set_box_aspect(100)

**Observations**

* The red dots are not completely in the center, which indicates that zeros have a bit of impact on class 3.
* Features 43, 14, 42 show a positive correlation with class 3.
* Features 15, 38, 11, 0 show a negative correlation with class 3.
* Feature 32 has the least impact on class 3.

In [None]:
shap.summary_plot(shap_values[3], X_train, show = False, cmap = 'coolwarm_r')
plt.gcf().axes[-1].set_aspect(100)
plt.gcf().axes[-1].set_box_aspect(100)

**Observations**

* Zero values in features 31, 14, 24 have a bit of impact on class 4.
* No feature shows a strong positive correlation with class 4.
* Features 31, 14, 24, 16, 23, 7 show a negative correlation with class 4.
* Features 22, 17, 32, 9 have the least impact on class 4.

## Most useless features in this data

Based on the above analysis, feature_32, feature_17 and feature_1 appear to be the least informative features for predicting any class in this data.
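A numeric ranking can back up what the plots suggest. The sketch below is an illustration on synthetic values, not the notebook's SHAP output: it assumes `shap_values` is a list with one `(n_samples, n_features)` array per class, which is how it is indexed above (some shap versions may return a single 3-D array instead). The names `shap_demo`, `feature_scale` and `feature_names` are made up.

```python
import numpy as np

# Synthetic stand-in for shap_values: one (n_samples, n_features) array per class
rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 200, 5, 4
feature_scale = np.array([1.0, 0.01, 0.5, 0.02, 0.8])  # made-up per-feature magnitudes
shap_demo = [rng.normal(size=(n_samples, n_features)) * feature_scale
             for _ in range(n_classes)]
feature_names = [f'feature_{i}' for i in range(n_features)]

# Mean absolute SHAP value per class and feature -> shape (n_classes, n_features)
mean_abs = np.stack([np.abs(sv).mean(axis=0) for sv in shap_demo])

# Sum over classes and sort ascending: the least influential features come first
overall = mean_abs.sum(axis=0)
ranking = [feature_names[i] for i in np.argsort(overall)]
print(ranking[:2])
```

Applied to the real SHAP values, the first entries of such a ranking would be the candidates for removal.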

# The End!

Thank you for reading this notebook. I have learnt a lot from this exercise, and I hope you have learnt something too.
Please share feedback if you find any flaws or have a better approach.

Please upvote the notebook if you liked it!

Thank you!