# My Approach to TPS March 2021 Competition

# Table of Contents
* [Importing Libraries](#section-one)
* [Reading the data files](#section-two)
* [Exploring the data](#section-three)
* [Exploratory Data Analysis (EDA)](#section-four)
    - [Scaling](#subsection-fourone)
    - [Correlation Check](#subsection-fourtwo)
    - [Outlier Treatment](#subsection-fourthree)
* [Feature Engineering](#section-five)
* [Modeling](#section-six)
    - [LGBM Hyperparameter Tuning with Optuna](#subsection-sixone)

<a id="section-one"></a>
# Importing Libraries

In [None]:
#Importing Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import f1_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder, StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
from sklearn.metrics import roc_curve, auc, roc_auc_score
from statistics import mean
from imblearn.over_sampling import SMOTE

from sklearn.mixture import GaussianMixture

import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_columns', None)

sns.set_palette("muted")

<a id="section-two"></a>
# Reading the data files

In [None]:
#Reading the data files

train = pd.read_csv('../input/tabular-playground-series-mar-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-mar-2021/test.csv')
sample = pd.read_csv('../input/tabular-playground-series-mar-2021/sample_submission.csv')

<a id="section-three"></a>
# Exploring the data

In [None]:
print(f'Shape of train data: {train.shape}')
print(f'Missing values count: {train.isna().sum().sum()}')

train.head()

In [None]:
train.info()
print ("*"*40)
train.nunique()

* Training data has 300000 records and 32 columns (including 'id' and 'target').
* Column 'id' is the primary key.
* It's a binary classification problem since we need to predict the binary 'target' feature.
* There are 11 numerical features, which are already scaled, and 19 categorical features in the data.
* There are no missing values in the data.

In [None]:
print(f'Shape of test data: {test.shape}')
print(f'Missing values count: {test.isna().sum().sum()}')

test.head()

In [None]:
test.info()
print ("*"*40)
test.nunique()

* Test data has 200000 records and 31 columns (including 'id').
* Column 'id' is the primary key.
* There are 11 numerical features, which are already scaled, and 19 categorical features in the data.
* There are no missing values in the data.

In [None]:
sample.head()

* We need to submit the predicted probability values for each id in the test data.

<a id="section-four"></a>
# Exploratory Data Analysis (EDA)

In [None]:
# Setting index as 'id'
train = train.set_index('id')
test = test.set_index('id')

In [None]:
#Checking if there is any difference between the behaviour of train and test data
train.describe() - test.describe()

The summary statistics of train and test differ only marginally across all features. This is a good sign: validation scores computed on the training data should carry over to the test data.
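Subtracting two `describe()` tables only compares a handful of summary numbers. As a rough complement, here is a minimal sketch on synthetic data (the frames `demo_train` and `demo_test` and the helper `compare_distributions` are hypothetical, not part of this notebook) that runs a two-sample Kolmogorov-Smirnov test per numeric feature:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Synthetic stand-ins for the numeric columns of train and test
demo_train = pd.DataFrame({'cont0': rng.normal(0.5, 0.2, 5000),
                           'cont1': rng.uniform(0, 1, 5000)})
demo_test = pd.DataFrame({'cont0': rng.normal(0.5, 0.2, 3000),
                          'cont1': rng.uniform(0, 1, 3000)})

def compare_distributions(a, b):
    """Return the KS statistic and p-value for every column shared by a and b."""
    rows = []
    for col in a.columns.intersection(b.columns):
        stat, p_value = ks_2samp(a[col], b[col])
        rows.append({'feature': col, 'ks_stat': stat, 'p_value': p_value})
    return pd.DataFrame(rows).set_index('feature')

drift_report = compare_distributions(demo_train, demo_test)
print(drift_report)
```

A small KS statistic means the two empirical distributions are close. With hundreds of thousands of rows even tiny differences can get a small p-value, so the statistic itself is usually the more useful number to look at.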

In [None]:
train.shape, train.nunique()

Features cat5, cat7, cat8, cat10 have high cardinality.

In [None]:
num_columns = train.select_dtypes(exclude=['object']).columns
num_columns = [i for i in num_columns if i != 'target']

cat_columns = train.select_dtypes(include=['object']).columns

#### Target Feature

In [None]:
#Let's check the distribution of target variable

target1 = train['target'].value_counts()[1]
target0 = train['target'].value_counts()[0]
target1per = target1 / train.shape[0] * 100
target0per = target0 / train.shape[0] * 100

print('{} of {} records have target 1 and it is the {:.2f}% of the training set.'.format(target1, train.shape[0], target1per))
print('{} of {} records have target 0 and it is the {:.2f}% of the training set.'.format(target0, train.shape[0], target0per))

plt.figure(figsize=(10, 8))
sns.countplot(x=train['target'])

plt.xlabel('Target', size=12, labelpad=15)
plt.ylabel('Count', size=12, labelpad=15)
plt.xticks((0, 1), ['0 ({0:.2f}%)'.format(target0per), '1 ({0:.2f}%)'.format(target1per)])
plt.tick_params(axis='x', labelsize=12)
plt.tick_params(axis='y', labelsize=12)

plt.title('Training Set Target Distribution', size=15, y=1.05)

plt.show()

The target variable is imbalanced. We can try oversampling the minority class with synthetic samples using SMOTE.

#### Continuous Features

In [None]:
# Checking the distribution of continuous features

i = 1
fig, ax = plt.subplots(4, 3, figsize=(14, 14))

for feature in num_columns:
    plt.subplot(4, 3, i)
    sns.kdeplot(data=train, x=feature, hue='target', legend=True, fill=True)
    plt.xlabel(f'{feature}- Skew: {round(train[feature].skew(), 2)}')
    i += 1

fig.tight_layout()

fig.delaxes(ax[3,2])

plt.show()

* No feature is highly skewed.
* All continuous features are multimodal in nature.
* We can observe differences in the peaks between target 1 and target 0. This should help the model separate the two classes.

#### Categorical Features

In [None]:
train.head()

In [None]:
# Checking the distribution of categorical features

fig, axs = plt.subplots(ncols=5, nrows=4, figsize=(20, 20))
plt.subplots_adjust(right=1.5, top=1.25)

for i, feature in enumerate(cat_columns, 1):    
    plt.subplot(4, 5, i)
    sns.countplot(x=feature, hue='target', data=train)
    
    plt.xlabel('{}'.format(feature), size=20, labelpad=5)
    plt.ylabel('Count', size=20, labelpad=15)    
    plt.tick_params(axis='x', labelsize=20)
    plt.tick_params(axis='y', labelsize=20)
    
    plt.legend(['0', '1'], loc='upper right', prop={'size': 18})

plt.show()

* We can observe that in several features a few categories dominate while the rest are very rare. The rare categories carry little information for the models.

* Let's club the insignificant categories (those covering less than 1% of the training records) to reduce the cardinality.

In [None]:
#Clubbing the insignificant categories together

for i in cat_columns:
    x = train[i].value_counts()*100/train.shape[0]
    for j in x[x<1].index:
        train.loc[train[i] == j, i] = 'Clubbed'
        test.loc[test[i] == j, i] = 'Clubbed'
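The nested loop above can be slow on high-cardinality columns. A minimal vectorized sketch of the same idea, on toy data (the helper `club_rare_categories` and the `demo_*` series are hypothetical names used only for illustration):

```python
import pandas as pd

def club_rare_categories(train_col, test_col, min_pct=1.0, label='Clubbed'):
    """Replace categories below min_pct percent of the training rows with one label."""
    share = train_col.value_counts(normalize=True) * 100
    rare = share[share < min_pct].index
    return (train_col.where(~train_col.isin(rare), label),
            test_col.where(~test_col.isin(rare), label))

demo_train = pd.Series(['A'] * 150 + ['B'] * 49 + ['C'])   # 'C' covers 0.5% of rows
demo_test = pd.Series(['A', 'C', 'B', 'D'])                # 'D' never appears in train

new_train, new_test = club_rare_categories(demo_train, demo_test)
print(new_train.value_counts().to_dict())
print(new_test.tolist())
```

As in the loop, the rare set is derived from the training data only, so a category that exists only in test (like 'D' here) is left untouched. It may be worth checking for such categories before encoding.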

In [None]:
# Checking the distribution of categorical features after clubbing

fig, axs = plt.subplots(ncols=5, nrows=4, figsize=(20, 20))
plt.subplots_adjust(right=1.5, top=1.25)

for i, feature in enumerate(cat_columns, 1):    
    plt.subplot(4, 5, i)
    sns.countplot(x=feature, hue='target', data=train)
    
    plt.xlabel('{}'.format(feature), size=20, labelpad=5)
    plt.ylabel('Count', size=20, labelpad=15)    
    plt.tick_params(axis='x', labelsize=20)
    plt.tick_params(axis='y', labelsize=20)
    
    plt.legend(['0', '1'], loc='upper right', prop={'size': 18})

plt.show()

<a id="subsection-fourone"></a>
### Scaling

In [None]:
train.describe()

All continuous features are already scaled in the dataset.

<a id="subsection-fourtwo"></a>
### Correlation Check

In [None]:
num_columns = train.select_dtypes(exclude=['object']).columns
num_columns = [i for i in num_columns if i != 'target']

cat_columns = train.select_dtypes(include=['object']).columns

In [None]:
#Let's check how the features are inter-related to each other and with target variable
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 10))
ax.set_title("Correlation Matrix", fontsize=16)

corr = train[num_columns + ['target']].corr().abs()
mask = np.triu(np.ones_like(corr, dtype=bool))

sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap='coolwarm',
            cbar_kws={"shrink": .8}, vmin=0, vmax=1)

ax.tick_params(axis='x', labelsize=12, labelrotation=90)
ax.tick_params(axis='y', labelsize=12, labelrotation=0)
    
plt.show()

* The pairs (cont1, cont2), (cont0, cont10), (cont7, cont10) and (cont0, cont7) are highly correlated.
* None of the features shows a strong correlation with the target.
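Reading pairs off a heatmap is error-prone, so the same check can be done programmatically. A minimal sketch on synthetic data, assuming a cut-off of 0.8 (the threshold, the helper `high_corr_pairs` and the column names are my own choices for illustration):

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, |corr|) for every pair above the threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    pairs = pairs[pairs > threshold].sort_values(ascending=False)
    return [(a, b, round(v, 3)) for (a, b), v in pairs.items()]

rng = np.random.default_rng(42)
base = rng.normal(size=1000)
demo = pd.DataFrame({
    'f0': base,
    'f1': base + rng.normal(scale=0.1, size=1000),  # nearly a copy of f0
    'f2': rng.normal(size=1000),                    # independent of the others
})
pairs = high_corr_pairs(demo)
print(pairs)
```

On the real data, the call would be `high_corr_pairs(train[num_columns])`, and the resulting list can guide which feature of each pair to drop.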

In [None]:
# Removing the correlated variables

train = train.drop(['cont2', 'cont10'], axis = 1)
test = test.drop(['cont2', 'cont10'], axis = 1)

In [None]:
num_columns = train.select_dtypes(exclude=['object']).columns
num_columns = [i for i in num_columns if i != 'target']

cat_columns = train.select_dtypes(include=['object']).columns

<a id="subsection-fourthree"></a>
### Outlier Treatment

In [None]:
#Checking for mild outliers (numeric features only)
Q1_train = train[num_columns].quantile(0.25)
Q3_train = train[num_columns].quantile(0.75)
IQR_train = Q3_train - Q1_train

((train[num_columns] < Q1_train - 1.5*IQR_train) | (train[num_columns] > Q3_train + 1.5*IQR_train)).agg(['sum', 'mean', 'count'])

In [None]:
#Checking for extreme outliers (numeric features only)
Q1_train = train[num_columns].quantile(0.25)
Q3_train = train[num_columns].quantile(0.75)
IQR_train = Q3_train - Q1_train

((train[num_columns] < Q1_train - 3*IQR_train) | (train[num_columns] > Q3_train + 3*IQR_train)).agg(['sum', 'mean', 'count'])

* The data contains no extreme outliers (beyond 3 x IQR), but it does have some mild ones (beyond 1.5 x IQR).

* Let's replace the mild outliers with the median value.

In [None]:
#Replacing outliers with median value

def replace_outliers(data):
    data = data.copy()  # work on a copy to avoid chained-assignment warnings
    for col in data.columns:
        Q1 = data[col].quantile(0.25)
        Q3 = data[col].quantile(0.75)
        IQR = Q3 - Q1
        median_ = data[col].median()
      
        data.loc[((data[col] < Q1 - 1.5*IQR) | (data[col] > Q3 + 1.5*IQR)), col] = median_
    return data

train[num_columns] = replace_outliers(train[num_columns])
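To make the rule concrete, here is a tiny worked example of the 1.5 x IQR fences on five made-up values (not taken from the competition data):

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

q1, q3 = values.quantile(0.25), values.quantile(0.75)   # 2.0 and 4.0
iqr = q3 - q1                                           # 2.0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr           # -1.0 and 7.0

# Anything outside [lower, upper] is a mild outlier and gets the median (3.0)
is_outlier = (values < lower) | (values > upper)
cleaned = values.mask(is_outlier, values.median())

print(cleaned.tolist())  # [1.0, 2.0, 3.0, 4.0, 3.0]
```

Only the value 100 falls outside the fences, so it is the only one replaced. Note that the notebook applies this treatment to the training set only.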

<a id="section-five"></a>
# Feature Engineering

#### Continuous Features

In [None]:
# Splitting and labelencoding the multimodal continuous variables

tr_size = len(train)
df_full = pd.concat([train, test])

for i in num_columns:
    df_full[i] = pd.qcut(df_full[i], 7)
    df_full[i] = LabelEncoder().fit_transform(df_full[i])
    
train = df_full[:tr_size]
test = df_full[tr_size:]

In [None]:
# Checking the distribution of continuous features

fig, axs = plt.subplots(4, 3, figsize=(14,14))
plt.subplots_adjust(right=1.5, top=1.25)

for i, feature in enumerate(num_columns, 1):    
    plt.subplot(4, 3, i)
    sns.countplot(x=feature, hue='target', data=train)
    
    plt.xlabel('{}'.format(feature), size=12, labelpad=5)
    plt.ylabel('Count', size=12, labelpad=15)    
    plt.tick_params(axis='x', labelsize=12)
    plt.tick_params(axis='y', labelsize=12)
    
    plt.legend(['0', '1'], loc='upper right', prop={'size': 12})

fig.delaxes(axs[3,0])
fig.delaxes(axs[3,1])
fig.delaxes(axs[3,2])

plt.show()

* We have turned the multimodal continuous features into ordinal categorical features.
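For intuition, quantile binning gives every bin roughly the same number of rows. A minimal sketch on 14 made-up values, using `labels=False` so that `pd.qcut` returns the ordinal bin index directly (a shortcut I am assuming is equivalent here to label-encoding the intervals):

```python
import pandas as pd

demo = pd.Series(range(14), dtype=float)

# labels=False returns the ordinal bin index (0..6) instead of Interval objects
codes = pd.qcut(demo, 7, labels=False)

print(codes.tolist())
print(codes.value_counts().sort_index().tolist())
```

With 14 values and 7 bins, each bin receives exactly two values, and the bin index preserves the original ordering of the feature.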

#### Categorical Features

In [None]:
#Applying one hot encoding to categorical features

tr_size = len(train)
df_all = pd.concat([train, test])
df_all = pd.get_dummies(df_all, columns=cat_columns)

train = df_all[:tr_size]
test = df_all[tr_size:]

In [None]:
test = test.drop('target', axis = 1, errors = 'ignore')

In [None]:
train.shape, test.shape
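Train and test were concatenated before one-hot encoding so that both end up with identical dummy columns. When the two sets have to be encoded separately, a possible alternative is to align the test columns to the training columns afterwards. A minimal sketch on toy frames (`demo_train` and `demo_test` are hypothetical):

```python
import pandas as pd

demo_train = pd.DataFrame({'cat0': ['A', 'B', 'A']})
demo_test = pd.DataFrame({'cat0': ['A', 'C']})   # 'C' is absent from train, 'B' from test

train_enc = pd.get_dummies(demo_train, columns=['cat0'])
test_enc = pd.get_dummies(demo_test, columns=['cat0'])

# Force the test matrix onto the training columns: missing dummies become 0,
# dummies for categories unseen in training are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_aligned.columns))
```

After the `reindex`, both frames expose the same columns in the same order, which is what the models expect at prediction time.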

<a id="section-six"></a>
# Modeling

Let's try different ML models and see which performs best.

In [None]:
train = train.reset_index(drop = True)

In [None]:
# Storing the target variable separately

X_train = train.drop('target', axis = 1)
X_test = test
y_train = train['target']

print('X_train shape: {}'.format(X_train.shape))
print('y_train shape: {}'.format(y_train.shape))
print('X_test shape: {}'.format(X_test.shape))

In [None]:
#Stratified K fold split Cross Validation

def train_and_validate(model, N):
    
    regex = r'^[^\(]+'
    match = re.findall(regex, str(model))
    print(f'Running {N} Fold CV with {match[0]} Model.')
    
    probs = pd.DataFrame(np.zeros((len(X_test), N * 2)), columns=['Fold_{}_Prob_{}'.format(i, j) for i in range(1, N + 1) for j in range(2)])
    importances = pd.DataFrame(np.zeros((X_train.shape[1], N)), columns=['Fold_{}'.format(i) for i in range(1, N + 1)], index=train.drop('target', axis = 1).columns)
    fprs, tprs, scores = [], [], []

    skf = StratifiedKFold(n_splits=N, random_state=N, shuffle=True)

    for fold, (trn_idx, val_idx) in enumerate(skf.split(X_train, y_train), 1):
        print('Fold {}\n'.format(fold))
        
        # Fitting the model
        model.fit(X_train.iloc[trn_idx], y_train[trn_idx])

        # Computing Train AUC score
        trn_fpr, trn_tpr, trn_thresholds = roc_curve(y_train[trn_idx], model.predict_proba(X_train.iloc[trn_idx])[:, 1])
        trn_auc_score = auc(trn_fpr, trn_tpr)
        # Computing Validation AUC score
        val_fpr, val_tpr, val_thresholds = roc_curve(y_train[val_idx], model.predict_proba(X_train.iloc[val_idx])[:, 1])
        val_auc_score = auc(val_fpr, val_tpr)  

        scores.append((trn_auc_score, val_auc_score))
        fprs.append(val_fpr)
        tprs.append(val_tpr)

        # X_test probabilities
        probs.loc[:, 'Fold_{}_Prob_0'.format(fold)] = model.predict_proba(X_test)[:, 0]
        probs.loc[:, 'Fold_{}_Prob_1'.format(fold)] = model.predict_proba(X_test)[:, 1]
        importances.iloc[:, fold - 1] = model.feature_importances_
        
        print(scores[-1])    
    
    trauc = mean([i[0] for i in scores])
    cvauc = mean([i[1] for i in scores])
    print(f'Average Training AUC: {trauc}, Average CV AUC: {cvauc}')
    print ("*"*40)
    print ("\n")
    
    return trauc, cvauc, importances, probs
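If only the train and validation AUC per fold are needed, scikit-learn's `cross_validate` covers the scoring part of the function above in a few lines. A minimal sketch on synthetic data, with `LogisticRegression` standing in for the real models (the data and variable names are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, mildly imbalanced binary problem
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     weights=[0.7, 0.3], random_state=0)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=3)
cv_result = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo, y_demo,
    cv=skf,
    scoring='roc_auc',
    return_train_score=True,   # also report AUC on the training part of each fold
)

train_auc = cv_result['train_score'].mean()
cv_auc = cv_result['test_score'].mean()
print(f'Average Training AUC: {train_auc:.4f}, Average CV AUC: {cv_auc:.4f}')
```

Unlike `train_and_validate`, this shortcut does not collect test-set probabilities or feature importances, so the custom loop is still needed for the submission.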

In [None]:
#Testing multiple ML models using stratified K fold CV

df_row = []
N = 3

for i in [
    LGBMClassifier(),
    RandomForestClassifier(n_estimators = 10, max_depth = 30),
    XGBClassifier(verbosity = 0)]:
    
    trauc, cvauc, importances, probs = train_and_validate(i, N)
    
    regex = r'^[^\(]+'
    match = re.findall(regex, str(i))
    
    df_row.append([match[0], trauc, cvauc])

df = pd.DataFrame(df_row, columns = ['Model', f'{N} Fold Training AUC', f'{N} Fold CV AUC'])
df

* Best performing model: LGBM, since the gap between training AUC and CV AUC is smaller for LGBM than for XGBoost, which suggests less overfitting.

In [None]:
#Plotting the XGBoost importances

importances['Mean_Importance'] = importances.mean(axis=1)
importances.sort_values(by='Mean_Importance', inplace=True, ascending=False)

plt.figure(figsize=(8,8))
sns.barplot(x='Mean_Importance', y=importances.head(15).index, data=importances.head(15))

plt.xlabel('')
plt.tick_params(axis='x', labelsize=10)
plt.tick_params(axis='y', labelsize=10)
plt.title('Classifier Mean Feature Importance Between Folds', size=10)

plt.show()

Let's try tuning the LGBM parameters using Optuna.

<a id="subsection-sixone"></a>
## LGBM Hyperparameter Tuning using Optuna

In [None]:
## Install optuna library
# !pip install optuna

In [None]:
#Importing optuna library
import optuna

In [None]:
#Function for hyperparameter tuning using optuna

def objective(trial, data=X_train, target=y_train):
    seed = 2021
    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)

    for train_index, valid_index in split.split(data, target):
        X_train = data.iloc[train_index]
        y_train = target.iloc[train_index]
        X_valid = data.iloc[valid_index]
        y_valid = target.iloc[valid_index]


    lgbm_params = {
        'reg_alpha': trial.suggest_float('reg_alpha', 0.001, 10.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.001, 10.0),
        'num_leaves': trial.suggest_int('num_leaves', 11, 333),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'max_depth': trial.suggest_int('max_depth', 5, 30),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.005, 0.01, 0.02, 0.05, 0.1]),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1, 0.5),
        'n_estimators': trial.suggest_int('n_estimators', 100, 5000),
        'random_state': seed,
        'boosting_type': 'gbdt',
        'metric': 'auc',
        #'device': 'gpu'
    }
    

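    model = LGBMClassifier(**lgbm_params)  
    
    # Early stopping is passed as a callback; recent LightGBM releases no longer
    # accept early_stopping_rounds/verbose as fit() arguments.
    from lightgbm import early_stopping
    model.fit(
            X_train,
            y_train,
            eval_set=[(X_valid, y_valid)],
            callbacks=[early_stopping(stopping_rounds=100, verbose=False)]
        )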

    y_valid_pred = model.predict_proba(X_valid)[:,1]
    
    roc_auc = roc_auc_score(y_valid, y_valid_pred)
    
    return roc_auc
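The objective validates on a single stratified hold-out of 20%. A minimal sketch on a made-up target (not the competition data) showing that `StratifiedShuffleSplit` keeps the class ratio identical in both parts:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic imbalanced target: 75% zeros, 25% ones
y_demo = np.array([0] * 750 + [1] * 250)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=2021)
# With n_splits=1 the generator yields exactly one (train, valid) pair
train_index, valid_index = next(split.split(X_demo, y_demo))

print(len(train_index), len(valid_index))
print(y_demo[train_index].mean(), y_demo[valid_index].mean())
```

Because the seed is fixed, every Optuna trial is scored on the same hold-out rows, which makes the trials comparable with each other. The flip side is that the chosen parameters can end up tuned to that one split.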

In [None]:
#Hyperparameter tuning to maximize the validation ROC AUC

study = optuna.create_study(direction = 'maximize')
study.optimize(objective, n_trials = 10)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Best value:', study.best_value)

In [None]:
#Checking the best set of hyperparameters

print(f"\tBest value (AUC): {study.best_value:.5f}")
print(f"\tBest params:")

for key, value in study.best_params.items():
    print(f"\t\t{key}: {value}")

In [None]:
#Storing final parameters

params=study.best_params

In [None]:
#Training the best model
trauc, cvauc, importances, probs = train_and_validate(LGBMClassifier(**params), 3)

In [None]:
#Creating the submission
cols = [i for i in probs.columns if i.endswith('Prob_1')]

probs = probs[cols]

# Average the class-1 probabilities over the folds
sample['target'] = probs.mean(axis = 1).values
sample.to_csv('submission.csv', index = False)
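Averaging the positive-class columns with `mean` keeps the blended value a valid probability regardless of the number of folds. A minimal sketch with made-up numbers for 3 folds and 2 test rows (`demo_probs` is hypothetical):

```python
import pandas as pd

# Hypothetical class-probability columns for 3 folds and 2 test rows
demo_probs = pd.DataFrame({
    'Fold_1_Prob_0': [0.9, 0.2], 'Fold_1_Prob_1': [0.1, 0.8],
    'Fold_2_Prob_0': [0.7, 0.4], 'Fold_2_Prob_1': [0.3, 0.6],
    'Fold_3_Prob_0': [0.8, 0.3], 'Fold_3_Prob_1': [0.2, 0.7],
})

# Keep only the class-1 probability of each fold and average across folds
pos_cols = [c for c in demo_probs.columns if c.endswith('Prob_1')]
blended = demo_probs[pos_cols].mean(axis=1)

print(blended.round(3).tolist())
```

Since AUC depends only on the ranking of the predictions, dividing the sum by a different constant would not change the score, but the mean is the value that can actually be read as a probability.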

Awesome! We got a leaderboard score of 0.89634 after tuning the LGBM Classifier.

However, it can be improved further by stacking the models together.
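I have not tried stacking in this notebook. As a starting point, here is a minimal sketch with scikit-learn's `StackingClassifier` on synthetic data. The base learners are small stand-ins chosen to keep the example light; on the real data one would plug in the tuned LGBM and the other models compared above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('tree', DecisionTreeClassifier(max_depth=5, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method='predict_proba',  # the meta-model is trained on out-of-fold probabilities
    cv=3,
)
stack.fit(X_tr, y_tr)

stack_auc = roc_auc_score(y_va, stack.predict_proba(X_va)[:, 1])
print(f'Stacked validation AUC: {stack_auc:.4f}')
```

Whether stacking actually beats the single tuned LGBM on this competition would have to be verified on the leaderboard.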

# What did not work:

* The continuous features are multimodal in nature but still Gaussian Mixture Modeling didn't improve the score.
* Standard scaling didn't help in improving the score.
* Applying SMOTE didn't improve the leaderboard score.

# The End!

Thank you for reading this notebook. I have learnt a lot from this exercise, and I hope you have learnt something too.
Please share feedback if you find any flaw or have a better approach.

Please upvote the notebook if you liked it!

Thank you!