# Table of Contents
<a id="table-of-contents"></a>
* [1. Introduction](#1)
* [2. General](#2)
    * [2.1. Numbers of rows and columns](#2.1)
    * [2.2. Numbers of missing values](#2.2)
    * [2.3. First 5 rows](#2.3)
    * [2.4. Basic statistics on continuous features](#2.4)
    * [2.5. Categorical features proportion](#2.5)
    * [2.6. Target](#2.6)
* [3. Features & Target Relations](#3)
    * [3.1. Continuous features](#3.1)
    * [3.2. Categorical features](#3.2)
* [4. Baseline Model](#4)
    * [4.1 Preparation](#4.1)
    * [4.2 Catboost baseline model](#4.2)
    * [4.3 XGBoost baseline model](#4.3)
    * [4.4 LGBM baseline model](#4.4)
    * [4.5 Average baseline model](#4.5)
* [5. Tuned Model](#5)
    * [5.1 Catboost tuned model](#5.1)
    * [5.2 XGBoost tuned model](#5.2)
    * [5.3 LGBM tuned model](#5.3)
    * [5.4 Average tuned model](#5.4)
* [6. Winners Solutions](#6)

[back to top](#table-of-contents)
<a id="1"></a>
# 1. Introduction

Kaggle competitions are incredibly fun and rewarding, but they can also be intimidating for people who are relatively new in their data science journey. In the past, Kaggle has launched many Playground competitions that are more approachable than Featured competitions, and thus more beginner-friendly.

The goal of these competitions is to provide a fun, but less challenging, tabular dataset. These competitions will be great for people looking for something in between the Titanic Getting Started competition and a Featured competition.

The dataset used for this competition is synthetic, but it is based on a real dataset and was generated using a CTGAN. The original dataset deals with predicting the amount of an insurance claim. Although the features are anonymized, they have properties relating to real-world features.

[back to top](#table-of-contents)
<a id="2"></a>
# 2. General

**Observations:**

* The `train` set has 300,000 rows and the `test` set has 200,000 rows.
* There are 30 features in total: 19 categorical features (`cat0` to `cat18`) and 11 continuous features (`cont0` to `cont10`).
* There are no missing values in the train or test dataset.
* `Continuous` features in the train and test datasets range from -0.03 to 1, have multimodal distributions, and are similar between train and test.
* Correlation between `continuous` features:
    * `cont1` and `cont2` have the highest correlation, at 0.9.
    * `cont10` has a high correlation of 0.8 with `cont0` and `cont7`.
    * Pairs of `continuous` features with a correlation of 0.7 are:
        * `cont0` and `cont7`
        * `cont8` and `cont1`
        * `cont8` and `cont2`
* `Categorical` features in `train` and `test` are similar to each other. The categorical features with more than 50 categories are:
    * `cat5` has 84 categories and is dominated by `BI`, with a proportion of around 79%, followed by `AB` at around 14%.
    * `cat7` has 51 categories. The top 3 are `AH`, `E` and `AS`, with proportions of around 15%, 13% and 8% respectively. The top 20 categories together account for around 87%.
    * `cat8` has 61 categories. The top 3 are `BM`, `AE` and `AX`, with proportions of 14.1%, 8.1% and 7.4% respectively. The top 20 categories together also account for around 87%, the same as `cat7`.
    * `cat10` has 299 categories. The top 3 are `DJ`, `HK` and `DP`, with proportions of 10.5%, 10.3% and 7.9% respectively. The top 20 categories together account for around 73%.
* The categories of `cat10` differ between `train` and `test`: `train` has 299 categories and `test` has 295. The categories `BS`, `JF`, `CH`, `MW`, `AW`, `FW`, `MO`, `MK`, `IL`, `GH`, `CX`, `LK` are not found in `test`, and `KM`, `BW`, `EJ`, `BU`, `CA`, `JM`, `DG`, `KE` are not found in `train`.
* The `target` variable is imbalanced: target `0` makes up 73.5% of the rows and target `1` makes up 26.5%. This should be treated carefully, especially when setting up cross validation.
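
The `cat10` mismatch between `train` and `test` noted above can be checked with plain set differences. The following is a minimal sketch on toy frames; `category_diff` is a hypothetical helper name that does not appear in this notebook, and the toy values are made up for illustration.

```python
import pandas as pd


def category_diff(train, test, col):
    """Return (only_in_train, only_in_test) for one categorical column."""
    train_values = set(train[col].unique())
    test_values = set(test[col].unique())
    return sorted(train_values - test_values), sorted(test_values - train_values)


# Toy data (illustrative only, not the competition data)
toy_train = pd.DataFrame({"cat10": ["DJ", "HK", "BS", "DJ"]})
toy_test = pd.DataFrame({"cat10": ["DJ", "HK", "KM"]})

only_train, only_test = category_diff(toy_train, toy_test, "cat10")
print("Only in train:", only_train)
print("Only in test:", only_test)
```

On the real data, the equivalent call would be `category_diff(train_df, test_df, 'cat10')`, run before any categories are grouped into `Others`.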

In [None]:
import os
import joblib
import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2021/train.csv')
test_df = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2021/test.csv')

In [None]:
cat_features = [feature for feature in train_df.columns if 'cat' in feature]
cont_features = [feature for feature in train_df.columns if 'cont' in feature]

[back to top](#table-of-contents)
<a id="2.1"></a>
## 2.1. Numbers of rows and columns

In [None]:
print('Rows and Columns in train dataset:', train_df.shape)
print('Rows and Columns in test dataset:', test_df.shape)

[back to top](#table-of-contents)
<a id="2.2"></a>
## 2.2. Numbers of missing values

In [None]:
print('Missing values in train dataset:', sum(train_df.isnull().sum()))
print('Missing values in test dataset:', sum(test_df.isnull().sum()))

[back to top](#table-of-contents)
<a id="2.3"></a>
## 2.3. First 5 rows

**First 5 rows in the train dataset**

In [None]:
train_df.head()

**First 5 rows in the test dataset**

In [None]:
test_df.head()

[back to top](#table-of-contents)
<a id="2.4"></a>
## 2.4. Basic statistics on continuous features
**Train dataset**

In [None]:
fig = plt.figure(figsize=(15, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(4, 4)
gs.update(wspace=0.2, hspace=0.05)

background_color = "#f6f5f5"

run_no = 0
for col in range(0, 4):
    for row in range(0, 3):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        locals()["ax"+str(run_no)].set_yticklabels([])
        locals()["ax"+str(run_no)].tick_params(axis='y', which=u'both',length=0)
        for s in ["top","right", 'left']:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

ax0.text(-0.15, 4.5, 'Continuous Features Distribution on Train Dataset', fontsize=20, fontweight='bold', fontfamily='serif')
ax0.text(-0.15, 4, 'Continuous features have multimodal distributions', fontsize=13, fontweight='light', fontfamily='serif')

run_no = 0
for col in cont_features:
    sns.kdeplot(train_df[col], ax=locals()["ax"+str(run_no)], shade=True, color='#f088b7', alpha=0.9, zorder=2)
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='gray', linestyle=':', dashes=(1,5))
    locals()["ax"+str(run_no)].set_ylabel(col, fontsize=10, fontweight='bold').set_rotation(0)
    locals()["ax"+str(run_no)].yaxis.set_label_coords(1, 0)
    locals()["ax"+str(run_no)].set_xlim(-0.2, 1.2)
    locals()["ax"+str(run_no)].set_xlabel('')
    run_no += 1
    
ax11.remove()

In [None]:
train_df[cont_features].describe()

**Test dataset**

In [None]:
fig = plt.figure(figsize=(15, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(4, 4)
gs.update(wspace=0.2, hspace=0.05)

background_color = "#f6f5f5"

run_no = 0
for col in range(0, 4):
    for row in range(0, 3):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        locals()["ax"+str(run_no)].set_yticklabels([])
        locals()["ax"+str(run_no)].tick_params(axis='y', which=u'both',length=0)
        for s in ["top","right", 'left']:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

ax0.text(-0.15, 4.5, 'Continuous Features Distribution on Test Dataset', fontsize=20, fontweight='bold', fontfamily='serif')
ax0.text(-0.15, 4, 'Continuous features on test dataset are similar to train dataset', fontsize=13, fontweight='light', fontfamily='serif')

run_no = 0
for col in cont_features:
    sns.kdeplot(test_df[col], ax=locals()["ax"+str(run_no)], shade=True, color='#f088b7', alpha=0.9, zorder=2)
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='gray', linestyle=':', dashes=(1,5))
    locals()["ax"+str(run_no)].set_ylabel(col, fontsize=10, fontweight='bold').set_rotation(0)
    locals()["ax"+str(run_no)].yaxis.set_label_coords(1, 0)
    locals()["ax"+str(run_no)].set_xlim(-0.2, 1.2)
    locals()["ax"+str(run_no)].set_xlabel('')
    run_no += 1
    
ax11.remove()

In [None]:
test_df[cont_features].describe()

**Correlation between continuous variables**

In [None]:
background_color = "#f6f5f5"

fig = plt.figure(figsize=(18, 8), facecolor=background_color)
gs = fig.add_gridspec(1, 2)
ax0 = fig.add_subplot(gs[0, 0])
ax1 = fig.add_subplot(gs[0, 1])
colors = ["#f088b7", "#f6f5f5","#f088b7"]
colormap = matplotlib.colors.LinearSegmentedColormap.from_list("", colors)

ax0.set_facecolor(background_color)
ax0.text(0, -1, 'Features Correlation on Train Dataset', fontsize=20, fontweight='bold', fontfamily='serif')
ax0.text(0, -0.4, 'Highest correlation in the dataset is 0.9', fontsize=13, fontweight='light', fontfamily='serif')

ax1.set_facecolor(background_color)
ax1.text(-0.1, -1, 'Features Correlation on Test Dataset', fontsize=20, fontweight='bold', fontfamily='serif')
ax1.text(-0.1, -0.4, 'Features in test dataset are similar with features in train dataset ', 
         fontsize=13, fontweight='light', fontfamily='serif')

sns.heatmap(train_df[cont_features].corr(), ax=ax0, vmin=-1, vmax=1, annot=True, square=True, 
            cbar_kws={"orientation": "horizontal"}, cbar=False, cmap=colormap, fmt='.1g')

sns.heatmap(test_df[cont_features].corr(), ax=ax1, vmin=-1, vmax=1, annot=True, square=True, 
            cbar_kws={"orientation": "horizontal"}, cbar=False, cmap=colormap, fmt='.1g')

plt.show()
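
The heatmaps are convenient for a visual check, but the strongest pairs can also be listed directly from the correlation matrix. The sketch below is one possible way to do that, assuming a threshold of 0.7 as used in the observations; `top_correlated_pairs` is a hypothetical helper and the toy frame is synthetic.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df, threshold=0.7):
    """List feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr().abs()
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # Comparisons with NaN are False, so masked entries never pass the filter
    return pairs[pairs >= threshold].sort_values(ascending=False)


# Toy data (illustrative only): b is an exact multiple of a, c is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": 2 * a, "c": rng.normal(size=200)})

result = top_correlated_pairs(toy, threshold=0.7)
print(result)
```

On the notebook's data, the call would be `top_correlated_pairs(train_df[cont_features])`.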

[back to top](#table-of-contents)
<a id="2.5"></a>
## 2.5. Categorical features proportion
**Train dataset**

In [None]:
cat5_category = list(pd.DataFrame(train_df['cat5'].value_counts()/len(train_df['cat5'])*100)[:2].index)
cat7_category = list(pd.DataFrame(train_df['cat7'].value_counts()/len(train_df['cat7'])*100)[:13].index)
cat8_category = list(pd.DataFrame(train_df['cat8'].value_counts()/len(train_df['cat8'])*100)[:13].index)
cat10_category = list(pd.DataFrame(train_df['cat10'].value_counts()/len(train_df['cat10'])*100)[:13].index)
train_df['cat5'] = np.where(~train_df['cat5'].isin(cat5_category), 'Others', train_df['cat5'])
train_df['cat7'] = np.where(~train_df['cat7'].isin(cat7_category), 'Others', train_df['cat7'])
train_df['cat8'] = np.where(~train_df['cat8'].isin(cat8_category), 'Others', train_df['cat8'])
train_df['cat10'] = np.where(~train_df['cat10'].isin(cat10_category), 'Others', train_df['cat10'])

In [None]:
background_color = "#f6f5f5"

fig = plt.figure(figsize=(20, 15), facecolor=background_color)
gs = fig.add_gridspec(7, 3)
gs.update(wspace=0.2, hspace=0.2)

run_no = 0
for row in range(0, 7):
    for col in range(0, 3):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right", 'left']:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

ax0.text(-0.8, 115, 'Categorical Features Proportion on Train Dataset (%)', fontsize=20, fontweight='bold', fontfamily='serif')
ax0.text(-0.8, 100, 'Some features are dominated by one category', fontsize=13, fontweight='light', fontfamily='serif')        

run_no = 0
for col in cat_features:
    chart_df = pd.DataFrame(train_df[col].value_counts() / len(train_df) * 100)
    sns.barplot(x=chart_df.index, y=chart_df[col], ax=locals()["ax"+str(run_no)], color='#f088b7', zorder=2)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='gray', linestyle=':', dashes=(1,5))
    run_no += 1
    
ax19.remove()
ax20.remove()

**Test dataset**

In [None]:
cat5_category = list(pd.DataFrame(test_df['cat5'].value_counts()/len(test_df['cat5'])*100)[:2].index)
cat7_category = list(pd.DataFrame(test_df['cat7'].value_counts()/len(test_df['cat7'])*100)[:13].index)
cat8_category = list(pd.DataFrame(test_df['cat8'].value_counts()/len(test_df['cat8'])*100)[:13].index)
cat10_category = list(pd.DataFrame(test_df['cat10'].value_counts()/len(test_df['cat10'])*100)[:13].index)
test_df['cat5'] = np.where(~test_df['cat5'].isin(cat5_category), 'Others', test_df['cat5'])
test_df['cat7'] = np.where(~test_df['cat7'].isin(cat7_category), 'Others', test_df['cat7'])
test_df['cat8'] = np.where(~test_df['cat8'].isin(cat8_category), 'Others', test_df['cat8'])
test_df['cat10'] = np.where(~test_df['cat10'].isin(cat10_category), 'Others', test_df['cat10'])

In [None]:
background_color = "#f6f5f5"

fig = plt.figure(figsize=(20, 15), facecolor=background_color)
gs = fig.add_gridspec(7, 3)
gs.update(wspace=0.2, hspace=0.2)

run_no = 0
for row in range(0, 7):
    for col in range(0, 3):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right", 'left']:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

ax0.text(-0.8, 115, 'Categorical Features Proportion on Test Dataset (%)', fontsize=20, fontweight='bold', fontfamily='serif')
ax0.text(-0.8, 100, 'Categorical features on test dataset similar to train dataset', 
         fontsize=13, fontweight='light', fontfamily='serif')        

run_no = 0
for col in cat_features:
    chart_df = pd.DataFrame(test_df[col].value_counts() / len(test_df) * 100)
    sns.barplot(x=chart_df.index, y=chart_df[col], ax=locals()["ax"+str(run_no)], color='#f088b7', zorder=2)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='gray', linestyle=':', dashes=(1,5))
    run_no += 1
    
ax19.remove()
ax20.remove()

[back to top](#table-of-contents)
<a id="2.6"></a>
## 2.6. Target

In [None]:
# reset train and test dataset
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2021/train.csv')
test_df = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2021/test.csv')

print('Proportion of target variable: 0 is', len(train_df[train_df['target']==0])/len(train_df))
print('Proportion of target variable: 1 is', len(train_df[train_df['target']==1])/len(train_df))
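
Because the classes are imbalanced, one option when building cross-validation folds is to stratify on the target so that every fold keeps roughly the same class ratio. The sketch below uses scikit-learn's `StratifiedKFold` on a toy target with a 75/25 split (made up for illustration). Note that the baseline models in this notebook use plain `KFold`, so this is an alternative to consider, not the notebook's method.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced target (illustrative only): 75 zeros and 25 ones
y_toy = np.array([0] * 75 + [1] * 25)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_positive_rates = []
for fold, (train_idx, valid_idx) in enumerate(skf.split(X_toy, y_toy)):
    # Share of target 1 among the validation rows of this fold
    rate = y_toy[valid_idx].mean()
    fold_positive_rates.append(rate)
    print(f"Fold {fold}: share of target 1 in validation = {rate:.2f}")
```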

[back to top](#table-of-contents)
<a id="3"></a>
# 3. Features & Target Relations

**Observations:**
* `Continuous` features:
    * Target variable `0` is shown in <span style='color:#facd00' > Yellow </span>, while target variable `1` is shown in <span style='color:#f088b7' > Pink </span>.
    * In general, the distributions of the `continuous` features do not differ clearly between target `0` and target `1`.
* `Categorical` features:
    * `cat0`: `A`, `cat5`: `BI`, `cat6`: `A`, `cat9`: `A`, `cat11`: `A`, `cat12`: `A`, `cat13`: `A`, `cat14`: `A`, `cat15`: `B`, `cat16`: `D`, `cat17`: `D` and `cat18`: `B` are categories whose rows with target `0` make up more than 40% of the total train dataset.
    * No category in the `categorical` features has rows with target `1` that make up more than 25% of the total train dataset.

[back to top](#table-of-contents)
<a id="3.1"></a>
## 3.1 Continuous features

In [None]:
fig = plt.figure(figsize=(15, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(4, 4)
gs.update(wspace=0.2, hspace=0.05)

background_color = "#f6f5f5"

run_no = 0
for col in range(0, 4):
    for row in range(0, 3):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        locals()["ax"+str(run_no)].set_yticklabels([])
        locals()["ax"+str(run_no)].tick_params(axis='y', which=u'both',length=0)
        for s in ["top","right", 'left']:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

ax0.text(-0.15, 4.5, 'Continuous Features Distribution & Target Variables', fontsize=20, fontweight='bold', fontfamily='serif')
ax0.text(-0.15, 4, 'Continuous features distribution of target 0 and 1 are similar', fontsize=13, fontweight='light', fontfamily='serif')        

run_no = 0
for col in cont_features:
    sns.kdeplot(train_df.loc[train_df['target'] == 0, col], ax=locals()["ax"+str(run_no)], 
                shade=True, color='#ffd514', alpha=0.9, zorder=2)
    sns.kdeplot(train_df.loc[train_df['target'] == 1, col], ax=locals()["ax"+str(run_no)], 
                shade=True, color='#f088b7', alpha=0.9, zorder=2)
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='gray', linestyle=':', dashes=(1,5))
    locals()["ax"+str(run_no)].set_ylabel(col, fontsize=10, fontweight='bold').set_rotation(0)
    locals()["ax"+str(run_no)].yaxis.set_label_coords(1, 0)
    locals()["ax"+str(run_no)].set_xlim(-0.2, 1.2)
    locals()["ax"+str(run_no)].set_xlabel('')
    run_no += 1
    
ax11.remove()

[back to top](#table-of-contents)
<a id="3.2"></a>
## 3.2 Categorical features
**Target: 0**

In [None]:
# reset train and test dataset
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2021/train.csv')
test_df = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2021/test.csv')

In [None]:
cat5_category = list(pd.DataFrame(train_df.loc[train_df['target']==0, 'cat5'].value_counts()/len(train_df['cat5'])*100)[:2].index)
cat7_category = list(pd.DataFrame(train_df.loc[train_df['target']==0, 'cat7'].value_counts()/len(train_df['cat7'])*100)[:13].index)
cat8_category = list(pd.DataFrame(train_df.loc[train_df['target']==0, 'cat8'].value_counts()/len(train_df['cat8'])*100)[:13].index)
cat10_category = list(pd.DataFrame(train_df.loc[train_df['target']==0, 'cat10'].value_counts()/len(train_df['cat10'])*100)[:13].index)
train_df['cat5'] = np.where(~train_df['cat5'].isin(cat5_category), 'Others', train_df['cat5'])
train_df['cat7'] = np.where(~train_df['cat7'].isin(cat7_category), 'Others', train_df['cat7'])
train_df['cat8'] = np.where(~train_df['cat8'].isin(cat8_category), 'Others', train_df['cat8'])
train_df['cat10'] = np.where(~train_df['cat10'].isin(cat10_category), 'Others', train_df['cat10'])

In [None]:
background_color = "#f6f5f5"

fig = plt.figure(figsize=(20, 15), facecolor=background_color)
gs = fig.add_gridspec(7, 3)
gs.update(wspace=0.2, hspace=0.2)

run_no = 0
for row in range(0, 7):
    for col in range(0, 3):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right", 'left']:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

ax0.text(-0.8, 70, 'Categorical Features & Target Variables: 0', fontsize=20, fontweight='bold', fontfamily='serif')
ax0.text(-0.8, 60, 'Categorical features compared with target: 0 as a percentage of total train dataset', 
         fontsize=13, fontweight='light', fontfamily='serif')        

run_no = 0
for col in cat_features:
    chart_df = pd.DataFrame(train_df.loc[train_df['target']==0, col].value_counts() / len(train_df) * 100)
    sns.barplot(x=chart_df.index, y=chart_df[col], ax=locals()["ax"+str(run_no)], color='#f088b7', zorder=2)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='gray', linestyle=':', dashes=(1,5))
    run_no += 1
    
ax19.remove()
ax20.remove()

**Target: 1**

In [None]:
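# reset train dataset so the top categories are not computed on the 'Others' buckets created in the target 0 cell
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2021/train.csv')
cat5_category = list(pd.DataFrame(train_df.loc[train_df['target']==1, 'cat5'].value_counts()/len(train_df['cat5'])*100)[:2].index)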
cat7_category = list(pd.DataFrame(train_df.loc[train_df['target']==1, 'cat7'].value_counts()/len(train_df['cat7'])*100)[:13].index)
cat8_category = list(pd.DataFrame(train_df.loc[train_df['target']==1, 'cat8'].value_counts()/len(train_df['cat8'])*100)[:13].index)
cat10_category = list(pd.DataFrame(train_df.loc[train_df['target']==1, 'cat10'].value_counts()/len(train_df['cat10'])*100)[:13].index)
train_df['cat5'] = np.where(~train_df['cat5'].isin(cat5_category), 'Others', train_df['cat5'])
train_df['cat7'] = np.where(~train_df['cat7'].isin(cat7_category), 'Others', train_df['cat7'])
train_df['cat8'] = np.where(~train_df['cat8'].isin(cat8_category), 'Others', train_df['cat8'])
train_df['cat10'] = np.where(~train_df['cat10'].isin(cat10_category), 'Others', train_df['cat10'])

In [None]:
background_color = "#f6f5f5"

fig = plt.figure(figsize=(20, 15), facecolor=background_color)
gs = fig.add_gridspec(7, 3)
gs.update(wspace=0.2, hspace=0.2)

run_no = 0
for row in range(0, 7):
    for col in range(0, 3):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right", 'left']:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

ax0.text(-0.8, 34, 'Categorical Features & Target Variables: 1', fontsize=20, fontweight='bold', fontfamily='serif')
ax0.text(-0.8, 29, 'Categorical features compared with target: 1 as a percentage of total train dataset', 
         fontsize=13, fontweight='light', fontfamily='serif')

run_no = 0
for col in cat_features:
    chart_df = pd.DataFrame(train_df.loc[train_df['target']==1, col].value_counts() / len(train_df) * 100)
    sns.barplot(x=chart_df.index, y=chart_df[col], ax=locals()["ax"+str(run_no)], color='#f088b7', zorder=2)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='gray', linestyle=':', dashes=(1,5))
    run_no += 1
    
ax19.remove()
ax20.remove()

[back to top](#table-of-contents)
<a id="4"></a>
# 4. Baseline Model

This section evaluates the performance of `Catboost`, `XGBoost` and `LGBM` on 3 datasets: `all features`, `categorical features only` and `continuous features only`. At the end, `Voting Classifiers` will be used to ensemble all of the models.

**Observations:**
* `LGBM` has the highest OOF AUC on the `all features` dataset (`0.89156`) and on the `categorical features only` dataset (`0.88602`), but the results are close to those of `XGBoost` and `Catboost`.
* `XGBoost` performs best on the `continuous features only` dataset with an OOF AUC of `0.81767`, and again the result is close to the other models.
* Ensembling the 3 models by averaging them beats the individual baseline models, with an OOF AUC of `0.89360` for `all features`, `0.89017` for `categorical features only` and `0.86194` for `continuous features only`.
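
Averaging here means taking the row-wise mean of each model's out-of-fold (OOF) probabilities and scoring that mean with AUC. The following is a minimal sketch of that step only. The labels and probabilities are toy values made up for illustration, the column names mirror the ones stored in `average_all_df`, and the sketch does not reproduce the scores quoted above.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Toy labels and per-model OOF probabilities (illustrative only)
y_true = pd.Series([0, 0, 1, 1])
oof_df = pd.DataFrame({
    "catboost": [0.1, 0.6, 0.5, 0.9],
    "xgboost":  [0.2, 0.3, 0.4, 0.8],
    "lgbm":     [0.3, 0.6, 0.5, 0.7],
})

# Simple average of the three model columns, row by row
oof_df["average"] = oof_df[["catboost", "xgboost", "lgbm"]].mean(axis=1)

# Score every column, including the averaged one
for name in oof_df.columns:
    print(name, roc_auc_score(y_true, oof_df[name]))
```

Whether the average beats the individual models depends on the data; on these toy values it does not have to.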

[back to top](#table-of-contents)
<a id="4.1"></a>
## 4.1 Preparation

**Steps:**
1. Load `packages` for performing label encoding, cross validation, modeling and AUC measurement.
2. Combine the `train` and `test` datasets so that categories missing from either one are still handled when performing label encoding.
3. Label encode all the `categorical` features.
4. Split the label-encoded `combine` dataset back into `train` and `test` datasets.
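
To see why step 2 matters, the toy sketch below fits a `LabelEncoder` on the concatenation of two small frames in which each frame has a category the other lacks. The frames and the names `toy_train`, `toy_test` and `toy_combine` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames (illustrative only): "C" appears only in test, "B" only in train
toy_train = pd.DataFrame({"cat0": ["A", "B", "A"]})
toy_test = pd.DataFrame({"cat0": ["A", "C"]})

# Fit on train + test so every category receives an integer code
toy_combine = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)
encoder = LabelEncoder()
toy_combine["cat0"] = encoder.fit_transform(toy_combine["cat0"])

# Split back using the original train length
encoded_train = toy_combine.iloc[:len(toy_train)].copy()
encoded_test = toy_combine.iloc[len(toy_train):].copy()

print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
print(encoded_train["cat0"].tolist(), encoded_test["cat0"].tolist())
```

An encoder fitted on `toy_train` alone would raise a `ValueError` when asked to transform the unseen label `C`, which is the situation the combined fit avoids.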

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import roc_auc_score

In [None]:
train_df = pd.read_csv('../input/tabular-playground-series-mar-2021/train.csv')
test_df = pd.read_csv('../input/tabular-playground-series-mar-2021/test.csv')
combine_df = pd.concat([train_df, test_df], axis=0)
average_all_df = pd.DataFrame()
average_cat_df = pd.DataFrame()
average_cont_df = pd.DataFrame()

le = LabelEncoder()
for col in cat_features:
    combine_df[col] = le.fit_transform(combine_df[col])
train_df = combine_df.iloc[:len(train_df), :]
test_df = combine_df.iloc[len(train_df):, :]
test_df = test_df.drop('target', axis=1)

folds = 10
kf = KFold(n_splits=folds, shuffle=True, random_state=42)

features = [feature for feature in train_df.columns if feature not in ['id', 'target']]

[back to top](#table-of-contents)
<a id="4.2"></a>
## 4.2 Catboost baseline model

`CatBoostClassifier` is used as the baseline model in this notebook, without hyperparameter tuning, using 10-fold cross validation. 

**Observations**:
* Using `all features` gives the best AUC compared with using `categorical` or `continuous` features only.
* The AUC gap between `continuous features only` and `categorical features only` is large.
* Below are the OOF AUC results:
    * `all features` gives an OOF AUC of `0.88977`.
    * `categorical features only` gives an OOF AUC of `0.88512`.
    * `continuous features only` gives an OOF AUC of `0.81459`.
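
The OOF AUC values above come from a loop in which each row is predicted by the one fold model that did not train on it. The sketch below shows that pattern in a reusable form. It is an illustration under stated assumptions: `oof_predict` is a hypothetical helper, the data is synthetic, and `LogisticRegression` stands in for the gradient boosting models so the example runs on CPU without extra packages.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold


def oof_predict(make_model, X, y, n_splits=10, seed=42):
    """Return out-of-fold positive-class probabilities for every row of X."""
    oof = np.zeros(len(X))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in kf.split(X):
        model = make_model()  # fresh, unfitted model for each fold
        model.fit(X[train_idx], y[train_idx])
        # Each row is predicted by the one model that never saw it during training
        oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    return oof


# Synthetic data stands in for the competition data (illustrative only)
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
oof = oof_predict(lambda: LogisticRegression(max_iter=1000), X_toy, y_toy)
print("OOF AUC:", roc_auc_score(y_toy, oof))
```

Passing a factory such as `lambda: LogisticRegression(max_iter=1000)` instead of a fitted model keeps the folds independent, since every fold starts from a new estimator.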


**Note:** Remove `task_type="GPU"` and `devices="0"` to use CPU only.

**Using all features**

In [None]:
train_oof = np.zeros((300000,))

for fold, (train_idx, valid_idx) in enumerate(kf.split(train_df[features], train_df['target'])):
    train, valid = train_df.iloc[train_idx], train_df.iloc[valid_idx]
    
    X_train = train[features]
    X_valid = valid[features]
    y_train = train['target']
    y_valid = valid['target']

    model = CatBoostClassifier(verbose=0,
                                eval_metric="AUC",
                                random_state=42,
                                cat_features=[x for x in range(len(cat_features))],
                                task_type="GPU",
                                devices="0")

    model =  model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])
    temp_oof = model.predict_proba(X_valid)[:,1]
    train_oof[valid_idx] = temp_oof
    print(f'Fold {fold} AUC: ', roc_auc_score(y_valid, temp_oof))
    
print(f'OOF AUC - All features: ', roc_auc_score(train_df['target'], train_oof))
average_all_df['catboost'] = train_oof 

**Using categorical features only**

In [None]:
train_oof = np.zeros((300000,))

for fold, (train_idx, valid_idx) in enumerate(kf.split(train_df[features], train_df['target'])):
    train, valid = train_df.iloc[train_idx], train_df.iloc[valid_idx]
    
    X_train = train[cat_features]
    X_valid = valid[cat_features]
    y_train = train['target']
    y_valid = valid['target']

    model = CatBoostClassifier(verbose=0,
                                eval_metric="AUC",
                                random_state=42,
                                cat_features=[x for x in range(len(cat_features))],
                                task_type="GPU",
                                devices="0")

    model =  model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])
    temp_oof = model.predict_proba(X_valid)[:,1]
    train_oof[valid_idx] = temp_oof
    print(f'Fold {fold} AUC: ', roc_auc_score(y_valid, temp_oof))
    
print(f'OOF AUC - Categorical features only: ', roc_auc_score(train_df['target'], train_oof))
average_cat_df['catboost'] = train_oof 

**Using continuous features only**

In [None]:
train_oof = np.zeros((300000,))

for fold, (train_idx, valid_idx) in enumerate(kf.split(train_df[features], train_df['target'])):
    train, valid = train_df.iloc[train_idx], train_df.iloc[valid_idx]
    
    X_train = train[cont_features]
    X_valid = valid[cont_features]
    y_train = train['target']
    y_valid = valid['target']

    model = CatBoostClassifier(verbose=0,
                                eval_metric="AUC",
                                random_state=42,
                                task_type="GPU",
                                devices="0")

    model =  model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])
    temp_oof = model.predict_proba(X_valid)[:,1]
    train_oof[valid_idx] = temp_oof
    print(f'Fold {fold} AUC: ', roc_auc_score(y_valid, temp_oof))
    
print(f'OOF AUC - Continuous features only: ', roc_auc_score(train_df['target'], train_oof))
average_cont_df['catboost'] = train_oof 

[back to top](#table-of-contents)
<a id="4.3"></a>
## 4.3 XGBoost baseline model

`XGBClassifier` is used as the baseline model in this notebook, without hyperparameter tuning, using 10-fold cross validation. 

**Observations**:
* Using `all features` gives the best AUC compared with using `categorical` or `continuous` features only.
* The AUC gap between `continuous features only` and `categorical features only` is large.
* Below are the OOF AUC results:
    * `all features` gives an OOF AUC of `0.88986`.
    * `categorical features only` gives an OOF AUC of `0.88412`.
    * `continuous features only` gives an OOF AUC of `0.81767`.


**Note:** Remove `tree_method="gpu_hist"` and `gpu_id="0"` to use CPU only.

**Using all features**

In [None]:
train_oof = np.zeros((300000,))

for fold, (train_idx, valid_idx) in enumerate(kf.split(train_df[features], train_df['target'])):
    train, valid = train_df.iloc[train_idx], train_df.iloc[valid_idx]
    
    X_train = train[features]
    X_valid = valid[features]
    y_train = train['target']
    y_valid = valid['target']

    model = XGBClassifier(eval_metric="auc",
                          random_state=42,
                          tree_method="gpu_hist",
                          gpu_id="0",
                          use_label_encoder=False,)

    model =  model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=0)
    temp_oof = model.predict_proba(X_valid)[:,1]
    train_oof[valid_idx] = temp_oof
    print(f'Fold {fold} AUC: ', roc_auc_score(y_valid, temp_oof))
    
print(f'OOF AUC - All features: ', roc_auc_score(train_df['target'], train_oof))
average_all_df['xgboost'] = train_oof 

**Using categorical features only**

In [None]:
train_oof = np.zeros((300000,))

for fold, (train_idx, valid_idx) in enumerate(kf.split(train_df[features], train_df['target'])):
    train, valid = train_df.iloc[train_idx], train_df.iloc[valid_idx]
    
    X_train = train[cat_features]
    X_valid = valid[cat_features]
    y_train = train['target']
    y_valid = valid['target']

    model = XGBClassifier(eval_metric="auc",
                          random_state=42,
                          tree_method="gpu_hist",
                          gpu_id="0",
                          use_label_encoder=False,)

    model =  model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=0)
    temp_oof = model.predict_proba(X_valid)[:,1]
    train_oof[valid_idx] = temp_oof
    print(f'Fold {fold} AUC: ', roc_auc_score(y_valid, temp_oof))
    
print(f'OOF AUC - Categorical features only: ', roc_auc_score(train_df['target'], train_oof))
average_cat_df['xgboost'] = train_oof 

**Using continuous features only**

In [None]:
train_oof = np.zeros((300000,))

for fold, (train_idx, valid_idx) in enumerate(kf.split(train_df[features], train_df['target'])):
    train, valid = train_df.iloc[train_idx], train_df.iloc[valid_idx]
    
    X_train = train[cont_features]
    X_valid = valid[cont_features]
    y_train = train['target']
    y_valid = valid['target']

    model = XGBClassifier(eval_metric="auc",
                          random_state=42,
                          tree_method="gpu_hist",
                          gpu_id="0",
                          use_label_encoder=False,)

    model =  model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=0)
    temp_oof = model.predict_proba(X_valid)[:,1]
    train_oof[valid_idx] = temp_oof
    print(f'Fold {fold} AUC: ', roc_auc_score(y_valid, temp_oof))
    
print(f'OOF AUC - Continuous features only: ', roc_auc_score(train_df['target'], train_oof))
average_cont_df['xgboost'] = train_oof 

[back to top](#table-of-contents)
<a id="4.4"></a>
## 4.4 LGBM baseline model

`LGBMClassifier` is used as the baseline model in this notebook, without hyperparameter tuning, using 10-fold cross validation. 

**Observations**:
* Using `all features` gives the best AUC compared with using `categorical` or `continuous` features only.
* The AUC gap between `continuous features only` and `categorical features only` is large.
* Below are the OOF AUC results:
    * `all features` gives an OOF AUC of `0.89156`.
    * `categorical features only` gives an OOF AUC of `0.88602`.
    * `continuous features only` gives an OOF AUC of `0.81028`.


**Note:** Remove `device='gpu'` to use CPU only.

**Using all features**

In [None]:
train_oof = np.zeros((300000,))

for fold, (train_idx, valid_idx) in enumerate(kf.split(train_df[features], train_df['target'])):
    train, valid = train_df.iloc[train_idx], train_df.iloc[valid_idx]
    
    X_train = train[features]
    X_valid = valid[features]
    y_train = train['target']
    y_valid = valid['target']

    model = LGBMClassifier(metric="auc",
                          random_state=42,
                          cat_feature=[x for x in range(len(cat_features))],
                          device='gpu')

    model =  model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=0)
    temp_oof = model.predict_proba(X_valid)[:,1]
    train_oof[valid_idx] = temp_oof
    print(f'Fold {fold} AUC: ', roc_auc_score(y_valid, temp_oof))
    
print(f'OOF AUC - All features: ', roc_auc_score(train_df['target'], train_oof))
average_all_df['lgbm'] = train_oof 

**Using categorical features only**

In [None]:
train_oof = np.zeros((300000,))

for fold, (train_idx, valid_idx) in enumerate(kf.split(train_df[features], train_df['target'])):
    train, valid = train_df.iloc[train_idx], train_df.iloc[valid_idx]
    
    X_train = train[cat_features]
    X_valid = valid[cat_features]
    y_train = train['target']
    y_valid = valid['target']

    model = LGBMClassifier(metric="auc",
                          random_state=42,
                          cat_feature=[x for x in range(len(cat_features))],
                          device='gpu')

    model =  model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=0)
    temp_oof = model.predict_proba(X_valid)[:,1]
    train_oof[valid_idx] = temp_oof
    print(f'Fold {fold} AUC: ', roc_auc_score(y_valid, temp_oof))
    
print(f'OOF AUC - Categorical features only: ', roc_auc_score(train_df['target'], train_oof))
average_cat_df['lgbm'] = train_oof 

**Using continuous features only**

In [None]:
train_oof = np.zeros((300000,))

for fold, (train_idx, valid_idx) in enumerate(kf.split(train_df[features], train_df['target'])):
    train, valid = train_df.iloc[train_idx], train_df.iloc[valid_idx]
    
    X_train = train[cont_features]
    X_valid = valid[cont_features]
    y_train = train['target']
    y_valid = valid['target']

    model = LGBMClassifier(metric="auc",
                          random_state=42,
                          device='gpu')

    model =  model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=0)
    temp_oof = model.predict_proba(X_valid)[:,1]
    train_oof[valid_idx] = temp_oof
    print(f'Fold {fold} AUC: ', roc_auc_score(y_valid, temp_oof))
    
print(f'OOF AUC - Continuous features only: ', roc_auc_score(train_df['target'], train_oof))
average_cont_df['lgbm'] = train_oof 

[back to top](#table-of-contents)
<a id="4.5"></a>
## 4.5 Average baseline model

The out-of-fold predictions of `Catboost`, `XGBoost` and `LGBM` are averaged with equal weights (the same idea as soft voting) to form the ensemble.

**Observations**:
* As expected, using `all features` still gives the best AUC compared with using `categorical` or `continuous` features only.
* The AUC gap between `continuous features only` and `categorical features only` is large (about `0.071`).
* Below are the OOF AUC results:
    * `all features` gives an OOF AUC of `0.89330`.
    * `categorical features only` gives an OOF AUC of `0.88726`.
    * `continuous features only` gives an OOF AUC of `0.81651`.
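
To make the averaging step concrete, here is a minimal sketch on made-up numbers. The labels and per-model probabilities are toy values invented for illustration, not notebook output, and `roc_auc` is a small pure-Python stand-in for `roc_auc_score` that counts correctly ranked (positive, negative) pairs.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 0, 1, 0, 1, 0]  # toy targets
oof = {                      # toy OOF probabilities, one list per model
    "catboost": [0.9, 0.4, 0.3, 0.2, 0.8, 0.6],
    "xgboost":  [0.6, 0.1, 0.7, 0.5, 0.4, 0.3],
    "lgbm":     [0.7, 0.6, 0.6, 0.1, 0.9, 0.2],
}

# Row-wise mean across models, the counterpart of DataFrame.mean(axis=1).
average = [sum(row) / len(row) for row in zip(*oof.values())]

for name, scores in oof.items():
    print(name, round(roc_auc(labels, scores), 4))
print("average", round(roc_auc(labels, average), 4))
```

In this toy case each model misranks different rows, so the average ranks better than any single model. That is the intuition behind the ensemble gain reported above, although averaging is not guaranteed to help on every dataset.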

In [None]:
average_all_df['average'] = average_all_df.mean(axis=1)
average_cat_df['average'] = average_cat_df.mean(axis=1)
average_cont_df['average'] = average_cont_df.mean(axis=1)

print(f'OOF AUC - All features: ', roc_auc_score(train_df['target'],average_all_df['average']))
print(f'OOF AUC - Cat features: ', roc_auc_score(train_df['target'],average_cat_df['average']))
print(f'OOF AUC - Cont features: ', roc_auc_score(train_df['target'],average_cont_df['average']))

[back to top](#table-of-contents)
<a id="5"></a>
# 5. Tuned Model

This section explores models that have been tuned using `all features`. Hyperparameters are taken from [TPS Mar 2021 - Stacked Starter](https://www.kaggle.com/craigmthomas/tps-mar-2021-stacked-starter) by [Craig Thomas](https://www.kaggle.com/craigmthomas).

**Observations:**
* With proper hyperparameter tuning, the tuned models produce a higher OOF AUC than the baseline models. The AUC of tuned `Catboost`, `XGBoost` and `LGBM` consistently increases by around `0.005`, which is a significant gain in a competitive competition, so it is very important to tune the models properly. Below is the comparison between tuned and baseline models:
    - `Catboost` AUC increased by `0.00511`, from `0.88977` to `0.89488`.
    - `XGBoost` AUC increased by `0.00481`, from `0.88986` to `0.89467`.
    - `LGBM` AUC increased by `0.00533`, from `0.89156` to `0.89689`.
* Ensembling the 3 models by averaging them beats all the individual tuned models. It produces an OOF AUC of `0.89699`, which is slightly above the tuned LGBM model's `0.89689`.
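
The gains quoted above can be re-derived from the reported baseline and tuned scores. The short check below only uses the AUC values stated in this section; the variable names are illustrative.

```python
# (baseline OOF AUC, tuned OOF AUC) as reported in this notebook
reported = {
    "Catboost": (0.88977, 0.89488),
    "XGBoost": (0.88986, 0.89467),
    "LGBM": (0.89156, 0.89689),
}

# Gain from tuning, rounded to the 5 decimals used in the text.
gains = {name: round(tuned - base, 5) for name, (base, tuned) in reported.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.5f}")

# Margin of the averaged ensemble over the best single tuned model.
ensemble_auc = 0.89699
best_single = max(tuned for _, tuned in reported.values())
print(f"Ensemble over best single model: +{ensemble_auc - best_single:.5f}")
```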

In [None]:
average_tuned_df = pd.DataFrame()

[back to top](#table-of-contents)
<a id="5.1"></a>
## 5.1 Catboost tuned model

In [None]:
train_oof = np.zeros((300000,))

for fold, (train_idx, valid_idx) in enumerate(kf.split(train_df[features], train_df['target'])):
    train, valid = train_df.iloc[train_idx], train_df.iloc[valid_idx]
    
    X_train = train[features]
    X_valid = valid[features]
    y_train = train['target']
    y_valid = valid['target']

    model = CatBoostClassifier( 
                                verbose=0,
                                eval_metric="AUC",
                                loss_function="Logloss",
                                random_state=2021,
                                num_boost_round=20000,
                                od_type="Iter",
                                od_wait=200,
                                task_type="GPU",
                                devices="0",
                                cat_features=[x for x in range(len(cat_features))],
                                bagging_temperature=1.288692494969795,
                                grow_policy="Depthwise",
                                l2_leaf_reg=9.847870133539244,
                                learning_rate=0.01877982653902465,
                                max_depth=8,
                                min_data_in_leaf=1,
                                penalties_coefficient=2.1176668909602734)

    model =  model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])
    temp_oof = model.predict_proba(X_valid)[:,1]
    train_oof[valid_idx] = temp_oof
    print(f'Fold {fold} AUC: ', roc_auc_score(y_valid, temp_oof))
    
print(f'OOF AUC - All features: ', roc_auc_score(train_df['target'], train_oof))
average_tuned_df['catboost'] = train_oof

[back to top](#table-of-contents)
<a id="5.2"></a>
## 5.2 XGBoost tuned model

In [None]:
train_oof = np.zeros((300000,))

for fold, (train_idx, valid_idx) in enumerate(kf.split(train_df[features], train_df['target'])):
    train, valid = train_df.iloc[train_idx], train_df.iloc[valid_idx]
    
    X_train = train[features]
    X_valid = valid[features]
    y_train = train['target']
    y_valid = valid['target']

    model = XGBClassifier(        
                        seed=2021,
                        n_estimators=10000,
                        verbosity=1,
                        eval_metric="auc",
                        tree_method="gpu_hist",
                        gpu_id=0,
                        alpha=7.105038963844129,
                        colsample_bytree=0.25505629740052566,
                        gamma=0.4999381950212869,
                        reg_lambda=1.7256912198205319,
                        learning_rate=0.011823142071967673,
                        max_bin=338,
                        max_depth=8,
                        min_child_weight=2.286836198630466,
                        subsample=0.618417952155855,
                        use_label_encoder=False)

    model =  model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=0)
    temp_oof = model.predict_proba(X_valid)[:,1]
    train_oof[valid_idx] = temp_oof
    print(f'Fold {fold} AUC: ', roc_auc_score(y_valid, temp_oof))
    
print(f'OOF AUC - All features: ', roc_auc_score(train_df['target'], train_oof))
average_tuned_df['xgboost'] = train_oof

[back to top](#table-of-contents)
<a id="5.3"></a>
## 5.3 LGBM tuned model

In [None]:
train_oof = np.zeros((300000,))

for fold, (train_idx, valid_idx) in enumerate(kf.split(train_df[features], train_df['target'])):
    train, valid = train_df.iloc[train_idx], train_df.iloc[valid_idx]
    
    X_train = train[features]
    X_valid = valid[features]
    y_train = train['target']
    y_valid = valid['target']

    model = LGBMClassifier(
                cat_feature=[x for x in range(len(cat_features))],
                random_state=2021,
                cat_l2=25.999876242730252,
                cat_smooth=89.2699690675538,
                colsample_bytree=0.2557260109926193,
                early_stopping_round=200,
                learning_rate=0.00918685483594994,
                max_bin=788,
                max_depth=81,
                metric="auc",
                min_child_samples=292,
                min_data_per_group=177,
                n_estimators=1600000,
                n_jobs=-1,
                num_leaves=171,
                reg_alpha=0.7115353581785044,
                reg_lambda=5.658115293998945,
                subsample=0.9262904583735796,
                subsample_freq=1,
                verbose=-1)
    model =  model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=0)
    temp_oof = model.predict_proba(X_valid)[:,1]
    train_oof[valid_idx] = temp_oof
    print(f'Fold {fold} AUC: ', roc_auc_score(y_valid, temp_oof))
    
print(f'OOF AUC - All features: ', roc_auc_score(train_df['target'], train_oof))
average_tuned_df['lgbm'] = train_oof

[back to top](#table-of-contents)
<a id="5.4"></a>
## 5.4 Average tuned model

In [None]:
average_tuned_df['average'] = average_tuned_df.mean(axis=1)
print(f'OOF AUC - All features: ', roc_auc_score(train_df['target'], average_tuned_df['average']))

[back to top](#table-of-contents)
<a id="6"></a>
# 6. Winners Solutions

Congratulations to all the winners, and thank you for sharing your solutions. Below are the winners and their solutions:
* 1st place position: [Dave E](https://www.kaggle.com/davidedwards1) - [#1 LB Ideas](https://www.kaggle.com/c/tabular-playground-series-mar-2021/discussion/229833)
* 2nd place position: [danzel](https://www.kaggle.com/springmanndaniel) - [2nd place solution - dae + embeddings](https://www.kaggle.com/c/tabular-playground-series-mar-2021/discussion/229868)
* 3rd place position: [BIZEN](https://www.kaggle.com/hiro5299834) - [3rd place solution - stacking](https://www.kaggle.com/c/tabular-playground-series-mar-2021/discussion/230101)