# Table of Contents

<a id="table-of-contents"></a>
1. [Introduction](#introduction)
2. [Preparation](#preparation)
3. [General](#general)
    * 3.1. [No of rows and columns](#rows_columns)
        * 3.1.1 [Dataset size](#Dataset_size)
    * 3.2. [No of missing values](#missing_values)
    * 3.3. [Quick view](#first_5_rows)
    * 3.4. [Basic statistics on continuous features](#basic_statistics_cont)
        * 3.4.1 [Distribution](#distribution)
        * 3.4.2 [Skewness](#Skewness)
    * 3.5. [Count of categorical features](#count_cat)
4. [Features & Target Correlation](#features_target_correlation)
    * 4.1. [Correlation between features](#features_correlation)
    * 4.2. [Correlation with target](#target_correlation)
    * 4.3. [Categorical Features](#Categorical_Features)
5. [Model Development](#Model_Development)

   

[back to top](#table-of-contents)
<a id="introduction"></a>
# 1. Introduction

Kaggle competitions are incredibly fun and rewarding, but they can also be intimidating for people who are relatively new to their data science journey. In the past, Kaggle has launched many Playground competitions that are more approachable than Featured competitions, and thus more beginner-friendly.

The goal of these competitions is to provide a fun, but less challenging, tabular dataset. These competitions will be great for people looking for something in between the Titanic Getting Started competition and a Featured competition.

The dataset used for this competition is synthetic, but it is based on a real dataset and was generated using a CTGAN. **The original dataset deals with predicting the biological response of molecules given various chemical properties.** Although the features are anonymized, they have properties relating to real-world features.

This competition asks participants to predict whether the biological response of a given molecule with given chemical properties is positive or not. The ground truth is binary valued, but the predictions may be any number from 0.0 to 1.0, representing the probability of a positive response. The features in this dataset have been anonymized and may contain missing values.

Submissions are evaluated on the **area under the ROC curve (AUC)** between the predicted probability and the observed target.
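
To make the metric concrete, here is a minimal sketch of AUC computed as the share of (positive, negative) pairs in which the positive row receives the higher score. The function name `pairwise_auc` and the toy labels and scores are illustrative assumptions, not part of the competition code. The pairwise loop is only practical for small inputs.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y]
    neg = [s for y, s in zip(y_true, y_score) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    # Compare every positive score with every negative score.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): two negatives, two positives.
labels = [False, False, True, True]
scores = [0.1, 0.4, 0.35, 0.8]
example_auc = pairwise_auc(labels, scores)
print(example_auc)
```

Because only the ordering of the scores matters, any monotonic rescaling of the predictions leaves the AUC unchanged. On the full dataset, `roc_auc_score` from scikit-learn (imported in the Preparation section) is the practical choice.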

[back to top](#table-of-contents)
<a id="preparation"></a>
# 2. Preparation

In [None]:
import os
import joblib
import warnings
import numpy as np
import pandas as pd
import datatable as dt

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc, roc_auc_score

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('float_format', '{:f}'.format)
warnings.filterwarnings('ignore')

In [None]:
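def reduce_memory_usage(df, verbose=True):
    """Downcast numeric columns of df in place to the smallest dtype that holds their range."""
    # Only float dtypes are listed here, so the integer branch below is never reached.
    numerics = ["float16", "float32", "float64"]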
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (
                    c_min > np.finfo(np.float16).min
                    and c_max < np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df

In [None]:
train_df = dt.fread('../input/tabular-playground-series-oct-2021/train.csv').to_pandas()
test_df = dt.fread('../input/tabular-playground-series-oct-2021/test.csv').to_pandas()
sample=dt.fread("../input/tabular-playground-series-oct-2021/sample_submission.csv").to_pandas()

[back to top](#table-of-contents)
<a id="general"></a>
# 3. General

**Observations:**
* The `train` set has **1,000,000** rows while the `test` set has **500,000** rows.
* There are **45 categorical features** (`f22`, `f43`, and `f242` - `f284`) and **240 continuous features** (all remaining feature columns).
* **Categorical features** are of **boolean** type with values `True` / `False`.
* **Continuous features** are `float32` values **ranging from 0.0 to 1.0** in both train and test sets. **Multimodal distributions** are observed, and they look similar in both sets.
* `target` is a **categorical** column with **boolean** values `True` / `False`. The training data is well **balanced** across target values.
* `id` is a unique key for each observation and can be dropped for further analysis.
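
The observations above can be checked with a few pandas calls. The sketch below runs them on a small made-up frame (`toy_df` is a hypothetical stand-in, not the competition data). The same calls should apply to `train_df`, assuming its columns have the dtypes listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_df: two float features, one boolean feature, a boolean target.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "id": np.arange(6),
    "f0": rng.random(6).astype("float32"),
    "f1": rng.random(6).astype("float32"),
    "f22": [True, False, True, True, False, False],
    "target": [True, False, True, False, True, False],
})

# Boolean feature columns (the target is boolean too, so it is dropped from the list).
bool_cols = toy_df.select_dtypes(include="bool").columns.drop("target").tolist()
# Floating-point feature columns of any precision.
float_cols = toy_df.select_dtypes(include=np.floating).columns.tolist()

# Overall minimum and maximum across all continuous features.
value_range = (toy_df[float_cols].min().min(), toy_df[float_cols].max().max())
# Share of True labels; a value near 0.5 means a balanced target.
target_share = toy_df["target"].mean()
# True when every id appears exactly once.
id_is_unique = toy_df["id"].is_unique

print(bool_cols, float_cols, value_range, target_share, id_is_unique)
```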

In [None]:
cat_features = [feature for feature in test_df.columns if test_df[feature].dtype=='bool']
cont_features = [feature for feature in test_df.columns if feature not in cat_features+['id']]

[back to top](#table-of-contents)
<a id="rows_columns"></a>
## 3.1. No of rows and columns

In [None]:
print('Rows and Columns in train dataset:', train_df.shape)
print('Rows and Columns in test dataset:', test_df.shape)

<a id="Dataset_size"></a>
### 3.1.1 Dataset size

In [None]:
fig, ax = plt.subplots(figsize=(10, 0.8))
bar1 =  ax.barh(0, len(train_df)+len(test_df), color="#0f7fff", height=0.1)
bar2 =  ax.barh(0, len(train_df), color="#fce726", height=0.1)
ax.set_title("Train and test datasets size comparison", fontsize=10, pad=5)
ax.bar_label(bar1, ["Test dataset"], label_type="edge", padding=-120,
             fontsize=10, color="white", weight="bold")
ax.bar_label(bar2, ["Train dataset"], label_type="center",
             fontsize=10, color="white", weight="bold")
ax.set_xticks([ len(train_df), len(train_df)+len(test_df)])
ax.set_xticklabels([ len(train_df),len(test_df)])
ax.set_yticks([])
plt.show();

[back to top](#table-of-contents)
<a id="missing_values"></a>
## 3.2. No of missing values

In [None]:
print('Missing values in train dataset:', sum(train_df.isnull().sum()))
print('Missing values in test dataset:', sum(test_df.isnull().sum()))

[back to top](#table-of-contents)
<a id="first_5_rows"></a>
## 3.3. Quick View

**First 5 rows in the train dataset**

In [None]:
train_df.head()

**First 5 rows in the test dataset**

In [None]:
test_df.head()

**First 5 rows in the sample submission**

In [None]:
sample.head()

[back to top](#table-of-contents)
<a id="basic_statistics_cont"></a>
## 3.4. Basic statistics on continuous features

**Train dataset**

In [None]:
train_df[cont_features].describe()

**Target Distribution**

The target variable takes the value `True` or `False`, indicating the biological response of a molecule given its chemical properties. Let's look at the distribution of the target variable.

**Observations:**

The target variable has a balanced distribution of `True`/`False`, with approximately 50% in each category.

In [None]:
targ='target'
claim_df = pd.DataFrame(train_df[targ].value_counts()).reset_index()
claim_df.columns = [targ, 'count']

claim_percent_df = pd.DataFrame(train_df[targ].value_counts()/train_df.shape[0]).reset_index()
claim_percent_df.columns = [targ, 'count']

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(5, 1), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 2)
gs.update(wspace=0.3, hspace=0.05)

background_color = "#f6f5f5"
sns.set_palette(['#ffd514']*120)

ax0 = fig.add_subplot(gs[0, 0])
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)
ax0.set_facecolor(background_color)
ax0_sns = sns.barplot(ax=ax0, y=claim_df[targ], x=claim_df['count'], 
                      zorder=2, linewidth=0, orient='h', saturation=1, alpha=1)
ax0_sns.set_xlabel("count",fontsize=4, weight='bold')
ax0_sns.set_ylabel("",fontsize=3, weight='bold')
ax0_sns.tick_params(labelsize=3, width=0.5, length=1.5)
ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0.text(0, -0.85, targ, fontsize=5, ha='left', va='top', weight='bold')
ax0.text(0, -0.65, 'True and False have almost the same counts', fontsize=4, ha='left', va='top')
ax0.get_xaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
# data label
for p in ax0.patches:
    value = f'{p.get_width():,.0f}'
    x = p.get_x() + p.get_width() + 10000
    y = p.get_y() + p.get_height() / 2 
    ax0.text(x, y, value, ha='left', va='center', fontsize=3.5, 
            bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.2))
    
ax1 = fig.add_subplot(gs[0, 1])
for s in ["right", "top"]:
    ax1.spines[s].set_visible(False)
ax1.set_facecolor(background_color)
ax1_sns = sns.barplot(ax=ax1, y=claim_percent_df[targ], x=claim_percent_df['count'], 
                      zorder=2, linewidth=0, orient='h', saturation=1, alpha=1)
ax1_sns.set_xlabel("percentage",fontsize=4, weight='bold')
ax1_sns.set_ylabel("",fontsize=3, weight='bold')
ax1_sns.tick_params(labelsize=3, width=0.5, length=1.5)
ax1_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax1_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
ax1.text(0, -0.85, 'target in %', fontsize=5, ha='left', va='top', weight='bold')
ax1.text(0, -0.65, 'True and False each account for almost 50%', fontsize=4, ha='left', va='top')
# data label
for p in ax1.patches:
    value = f'{p.get_width():.2f}'
    x = p.get_x() + p.get_width() + 0.01
    y = p.get_y() + p.get_height() / 2 
    ax1.text(x, y, value, ha='left', va='center', fontsize=3.5, 
            bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.2))

**Test dataset**

In [None]:
test_df[cont_features].describe()

[back to top](#table-of-contents)
<a id="distribution"></a>
### 3.4.1 Distribution

This section shows the distributions of the continuous features in the train and test datasets. As there are 240 continuous features (excluding the `id` column), they are split into groups of 25 features per figure. Yellow represents the train dataset and pink represents the test dataset.

**Observations:**

The distributions of all features are very similar in the `train` and `test` datasets.
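
The split into groups of 25 can be sketched as follows. `chunked` and the placeholder names in `feature_names` are hypothetical helpers for illustration; in the notebook the list would be `cont_features`.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical stand-in for cont_features: 240 placeholder names.
feature_names = [f"feat_{i}" for i in range(240)]
panels = list(chunked(feature_names, 25))

# (row, column) of each feature inside a 5x5 grid, in plotting order.
grid_positions = [divmod(i, 5) for i in range(len(panels[0]))]

print(len(panels), len(panels[-1]), grid_positions[:6])
```

With 240 features this gives nine full 5x5 grids and a final group of 15, which fills only three rows of the last figure.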

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)

background_color = "#f6f5f5"

run_no = 0
for row in range(5):
    for col in range(5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  

features = cont_features[:25]

ax0.text(-0.1, 35, 'Comparison of distribution of continuous features on train and test datasets', fontsize=10, fontweight='bold')
ax0.text(-0.1, 32, 'Most features have similar distribution', fontsize=9, fontweight='light')        
ax0.text(6, 35, 'Train set', style='italic', fontsize=8, bbox={'facecolor': '#ffd514', 'boxstyle':'round', 'linewidth':0.4})
ax0.text(6, 31, 'Test set', style='italic', fontsize=8, bbox={'facecolor': '#ff355d', 'boxstyle':'round', 'linewidth':0.4})


run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train_df[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel('')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].legend(title=col, title_fontsize=7)
    run_no += 1

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=test_df[col], zorder=2, alpha=1, linewidth=1, color='#ff355d')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel('')
    run_no += 1
    
plt.show()

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)

background_color = "#f6f5f5"

run_no = 0
for row in range(5):
    for col in range(5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  

features = cont_features[25:50]

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train_df[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel('')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].legend(title=col, title_fontsize=7)
    run_no += 1

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=test_df[col], zorder=2, alpha=1, linewidth=1, color='#ff355d')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel('')
    run_no += 1
    
plt.show()

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)

background_color = "#f6f5f5"

run_no = 0
for row in range(5):
    for col in range(5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  

features = cont_features[50:75]

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train_df[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel('')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].legend(title=col, title_fontsize=7)
    run_no += 1

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=test_df[col], zorder=2, alpha=1, linewidth=1, color='#ff355d')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel('')
    run_no += 1
    
plt.show()

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)

background_color = "#f6f5f5"

run_no = 0
for row in range(5):
    for col in range(5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  

features = cont_features[75:100]

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train_df[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel('')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].legend(title=col, title_fontsize=7)
    run_no += 1

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=test_df[col], zorder=2, alpha=1, linewidth=1, color='#ff355d')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel('')
    run_no += 1
    
plt.show()

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)

background_color = "#f6f5f5"

run_no = 0
for row in range(5):
    for col in range(5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  

features = cont_features[100:125]

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train_df[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel('')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].legend(title=col, title_fontsize=7)
    run_no += 1

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=test_df[col], zorder=2, alpha=1, linewidth=1, color='#ff355d')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel('')
    run_no += 1
    
plt.show()

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)

background_color = "#f6f5f5"

run_no = 0
for row in range(5):
    for col in range(5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  

features = cont_features[125:150]

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train_df[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel('')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].legend(title=col, title_fontsize=7)
    run_no += 1

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=test_df[col], zorder=2, alpha=1, linewidth=1, color='#ff355d')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel('')
    run_no += 1
    
plt.show()

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)

background_color = "#f6f5f5"

run_no = 0
for row in range(5):
    for col in range(5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  

features = cont_features[150:175]

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train_df[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel('')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].legend(title=col, title_fontsize=7)
    run_no += 1

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=test_df[col], zorder=2, alpha=1, linewidth=1, color='#ff355d')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel('')
    run_no += 1
    
plt.show()

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)

background_color = "#f6f5f5"

run_no = 0
for row in range(5):
    for col in range(5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  

features = cont_features[175:200]

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train_df[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel('')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].legend(title=col, title_fontsize=7)
    run_no += 1

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=test_df[col], zorder=2, alpha=1, linewidth=1, color='#ff355d')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel('')
    run_no += 1
    
plt.show()

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)

background_color = "#f6f5f5"

run_no = 0
for row in range(5):
    for col in range(5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  

features = cont_features[200:225]

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train_df[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel('')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].legend(title=col, title_fontsize=7)
    run_no += 1

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=test_df[col], zorder=2, alpha=1, linewidth=1, color='#ff355d')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel('')
    run_no += 1
    
plt.show()

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)

background_color = "#f6f5f5"

run_no = 0
for row in range(3):
    for col in range(5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  

features = cont_features[225:240]

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train_df[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel('')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].legend(title=col, title_fontsize=7)
    run_no += 1

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=test_df[col], zorder=2, alpha=1, linewidth=1, color='#ff355d')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel('')
    run_no += 1
    
plt.show()

[back to top](#table-of-contents)
<a id="Skewness"></a>
### 3.4.2 Skewness

Most of the features are positively skewed. Many features have a skewness below -3 or above +3.
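
For reference, here is a minimal pure-Python sketch of a bias-corrected sample skewness (the adjusted Fisher-Pearson coefficient). To my understanding this is the same kind of estimate that pandas `skew()` reports, but treat that as an assumption. The function name `sample_skewness` and the toy inputs are illustrative only.

```python
import math


def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness (G1) of a sequence of numbers."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least 3 observations")
    mean = sum(values) / n
    # Second and third central moments (population form).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant data has no asymmetry
    g1 = m3 / m2 ** 1.5
    # Small-sample correction factor.
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


symmetric = sample_skewness([1, 2, 3, 4, 5])
right_tailed = sample_skewness([1, 1, 1, 2, 10])
left_tailed = sample_skewness([-10, -2, -1, -1, -1])
print(symmetric, right_tailed, left_tailed)
```

A long right tail gives a positive value and a long left tail a negative one, which is what the bar colors in the chart below encode.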

In [None]:
skew = pd.DataFrame(train_df[cont_features].skew())
skew.columns = ['value']
skew=skew.sort_values(by='value',ascending=False)
skew = skew.reset_index(drop=False)
skew.columns = ['feature', 'value']
skew['color']=skew['value'].apply(lambda x: '#ffd514' if x>0 else '#00A4CCFF')





plt.rcParams['figure.dpi'] = 170

background_color = "#f6f5f5"

fig = plt.figure(figsize=(5, 50), facecolor=background_color)
gs = fig.add_gridspec(1, 1)
ax0 = fig.add_subplot(gs[0, 0])
colors = ["#2f5586", "#f6f5f5","#2f5586"]
colormap = matplotlib.colors.LinearSegmentedColormap.from_list("", colors)

ax0.set_facecolor(background_color)
ax0.text(-10, -2.5, 'Skewness of continuous features', fontsize=6, fontweight='bold')
ax0.text(-10, -1.7, 'Many features have skewness below -5 or above +5', fontsize=5, fontweight='light')
ax0.text(14, -2.5, ' ')

sns.barplot(x=skew['value'], y=skew['feature'], ax=ax0, orient='h',palette=skew['color'], zorder=3, edgecolor='black', linewidth=0.1)
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.7)
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.7)
ax0.set_ylabel('features', fontsize=7)
ax0.set_xlabel('skewness',fontsize=5)
ax0.tick_params(axis='x',labelsize=4)
ax0.tick_params(axis='y',labelsize=4)

ii=0
for patch in ax0.patches:
    ii+=1
    if ii<=(len(skew[skew['value']>0])):
        value = f'{(patch.get_width()):.4f}'
        x = patch.get_x() + patch.get_width() + 0.7
        y = patch.get_y() + patch.get_height() - 0.4
        ax0.text(x, y, value, ha='center', va='center', fontsize=4, 
                bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.4))
        
ii=0    
for patch in ax0.patches:
    ii+=1
    if ii>(len(skew[skew['value']>0])):
        value = f'{(patch.get_width()):.4f}'
        x = patch.get_x() + patch.get_width() - 0.7
        y = patch.get_y() + patch.get_height() - 0.4
        ax0.text(x, y, value, ha='center', va='center', fontsize=4, 
                bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.4))  

    
for s in ["top","right"]:
    ax0.spines[s].set_visible(False)
    
for s in ['left','bottom']:
    ax0.spines[s].set_linewidth(2)

plt.show()

[back to top](#table-of-contents)
<a id="count_cat"></a>
## 3.5. Count of categorical features

This section shows the category counts of the categorical features in the train and test datasets. As there are 45 categorical features, they are split into groups of 25 features per figure. Yellow represents the train dataset and blue represents the test dataset.
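
The percentage labels on the bars are each category's count divided by the number of rows. Here is a minimal sketch of that calculation on a made-up boolean column (`category_shares` and `toy_feature` are hypothetical names used only for illustration):

```python
from collections import Counter


def category_shares(values):
    """Return {category: percentage of rows} for a sequence of labels."""
    counts = Counter(values)
    total = len(values)
    return {label: 100 * count / total for label, count in counts.items()}


# Hypothetical boolean column: 3 of 4 rows are False.
toy_feature = [False, False, True, False]
shares = category_shares(toy_feature)
print(shares)
```

A feature where one category holds nearly all rows carries little information on its own, which is why the dominated features are worth noting in the charts.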

In [None]:
background_color = '#e6e6e6'
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(30, 28), facecolor=background_color)
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.5, hspace=0.5)

features = cat_features[:25]

run_no = 0
for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right", 'left']:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

ax0.text(-0.8, 850000, 'Comparison of categorical features on train and test datasets (%)', fontsize=28, fontweight='bold')
ax0.text(-0.8, 800000, 'Some features are dominated by one category', fontsize=18, fontweight='light')        
ax0.text(12, 850000, 'Train set', style='italic', fontsize=22, bbox={'facecolor': '#fce726', 'boxstyle':'round', 'linewidth':0.4})
ax0.text(12, 750000, 'Test set', style='italic', fontsize=22, bbox={'facecolor': '#0f7fff', 'boxstyle':'round', 'linewidth':0.4})

run_no = 0
for col in features:
    chart_df = pd.DataFrame(train_df[col].value_counts())
    sns.barplot(x=chart_df.index, y=chart_df[col], ax=locals()["ax"+str(run_no)], color='#fce726', zorder=3, edgecolor='black', linewidth=0.1)
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#a6a6a6', linewidth=0.7)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#a6a6a6', linewidth=0.7)
    locals()["ax"+str(run_no)].set_ylabel(col, fontsize=15, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=14, width=.3)
    for patch in locals()["ax"+str(run_no)].patches:
        current_width = patch.get_width()
        diff = current_width - 0.8 
        patch.set_width(0.8)
        patch.set_x(patch.get_x() + 0.1 * .5)
        
        value = f'{(patch.get_height()/len(train_df) * 100):.2f}'
        x = patch.get_x() + patch.get_width() - 0.4
        y = patch.get_y() + patch.get_height() + 50000
        locals()["ax"+str(run_no)].text(x, y, value, ha='center', va='center', fontsize=16, 
                bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.4)) 
        
    for s in ['left']:
        locals()["ax"+str(run_no)].spines[s].set_visible(True)
    
    for s in ['left','bottom']:
        locals()["ax"+str(run_no)].spines[s].set_linewidth(2)  
        
    run_no += 1
      
run_no = 0
for col in features:
    chart_df = pd.DataFrame(test_df[col].value_counts())
    sns.barplot(x=chart_df.index, y=chart_df[col], ax=locals()["ax"+str(run_no)], color='#0f7fff', zorder=3, edgecolor='black', linewidth=0.1)
    ii=0
    for patch in locals()["ax"+str(run_no)].patches:
        ii+=1
        if ii>2:
            current_width = patch.get_width()
            diff = current_width - 0.6 
            patch.set_width(0.6)
            patch.set_x(patch.get_x() + 0.3 * 0.5)
            
            value = f'{(patch.get_height()/len(test_df) * 100):.2f}'
            x = patch.get_x() + patch.get_width() - 0.3
            y = patch.get_y() + patch.get_height() + 40000
            locals()["ax"+str(run_no)].text(x, y, value, ha='center', va='center', fontsize=16, 
                    bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.4)) 
    run_no += 1
    
plt.show()

In [None]:
background_color = '#e6e6e6'
fig = plt.figure(figsize=(30, 28), facecolor=background_color)
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.5, hspace=0.5)

features = cat_features[25:]

run_no = 0
for row in range(0, 4):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right", 'left']:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

ax0.text(-0.8, 850000, 'Comparison of categorical features on train and test datasets (%)', fontsize=28, fontweight='bold')
ax0.text(-0.8, 800000, 'Some features are dominated by one category', fontsize=18, fontweight='light')        
ax0.text(12, 850000, 'Train set', style='italic', fontsize=22, bbox={'facecolor': '#fce726', 'boxstyle':'round', 'linewidth':0.4})
ax0.text(12, 740000, 'Test set', style='italic', fontsize=22, bbox={'facecolor': '#0f7fff', 'boxstyle':'round', 'linewidth':0.4})

run_no = 0
for col in features:
    chart_df = pd.DataFrame(train_df[col].value_counts())
    sns.barplot(x=chart_df.index, y=chart_df[col], ax=locals()["ax"+str(run_no)], color='#fce726', zorder=3, edgecolor='black', linewidth=0.1)
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#a6a6a6', linewidth=0.7)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#a6a6a6', linewidth=0.7)
    locals()["ax"+str(run_no)].set_ylabel(col, fontsize=15, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=14, width=.3)
    for patch in locals()["ax"+str(run_no)].patches:
        current_width = patch.get_width()
        diff = current_width - 0.8 
        patch.set_width(0.8)
        patch.set_x(patch.get_x() + 0.1 * .5)
        
        value = f'{(patch.get_height()/len(train_df) * 100):.2f}'
        x = patch.get_x() + patch.get_width() - 0.4
        y = patch.get_y() + patch.get_height() + 50000
        locals()["ax"+str(run_no)].text(x, y, value, ha='center', va='center', fontsize=16, 
                bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.4))
        
    for s in ['left']:
        locals()["ax"+str(run_no)].spines[s].set_visible(True)
    
    for s in ['left','bottom']:
        locals()["ax"+str(run_no)].spines[s].set_linewidth(2)
            
    run_no += 1
      
run_no = 0
for col in features:
    chart_df = pd.DataFrame(test_df[col].value_counts())
    sns.barplot(x=chart_df.index, y=chart_df[col], ax=locals()["ax"+str(run_no)], color='#0f7fff', zorder=3, edgecolor='black', linewidth=0.1)
    ii=0
    for patch in locals()["ax"+str(run_no)].patches:
        ii+=1
        if ii>2:
            current_width = patch.get_width()
            diff = current_width - 0.6 
            patch.set_width(0.6)
            patch.set_x(patch.get_x() + 0.3 * 0.5)
            
            value = f'{(patch.get_height()/len(test_df) * 100):.2f}'
            x = patch.get_x() + patch.get_width() - 0.3
            y = patch.get_y() + patch.get_height() + 40000
            locals()["ax"+str(run_no)].text(x, y, value, ha='center', va='center', fontsize=16, 
                    bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.4)) 
    run_no += 1
    
plt.show()
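
The percentages annotated on the bars can also be read from a small table. The following is a minimal sketch, not part of the original notebook: `category_share` is a hypothetical helper, and the toy frames stand in for `train_df` and `test_df`.

```python
import pandas as pd

def category_share(train, test, col):
    """Return the percentage share of each category in train and test."""
    shares = pd.DataFrame({
        "train_%": train[col].value_counts(normalize=True) * 100,
        "test_%": test[col].value_counts(normalize=True) * 100,
    })
    # A category missing from one set shows up as NaN; treat it as 0 %.
    return shares.fillna(0.0).sort_index()

# Toy data standing in for train_df / test_df (hypothetical values).
toy_train = pd.DataFrame({"f22": [0, 0, 0, 1]})
toy_test = pd.DataFrame({"f22": [0, 1, 1, 1]})

share_table = category_share(toy_train, toy_test, "f22")
print(share_table)
```

A large gap between the two columns for any category would suggest that the train and test distributions differ for that feature.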

[back to top](#table-of-contents)
<a id="features_target_correlation"></a>
# 4. Features & Target Correlation
**Observations:**
* No pair of continuous features has a correlation beyond +/- 0.0279.
* Correlations between features are quite similar in the train and test datasets.
* No continuous feature has a correlation with `target` beyond +/- 0.1545.
* `f236` has the lowest correlation with `target`, almost 0, while `f179` has the highest correlation with `target`.

[back to top](#table-of-contents)
<a id="features_correlation"></a>
## 4.1. Correlation between features

A bar chart of the 10 most positive and the 10 most negative feature-to-feature correlations is plotted. No pair has a correlation greater than 0.0279 or less than -0.0279. Note that the code below only evaluates pairs that involve at least one of the first 10 continuous features (`cont_features[:10]`).

In [None]:
a = train_df[cont_features].corr()

cor=[]
for col in cont_features[:10]:
    cors=pd.DataFrame()
    cors['feature1']=a[col].index.tolist()
    cors['feature2']=col
    cors['corelation value']=a[col].values.tolist()
    cor.append(cors)
    
core=pd.DataFrame()
for i in cor:
    core=pd.concat([core,i],axis=0,ignore_index=True)
core=core[core['feature1']!=core['feature2']]

non_dublicates=[]
for i in list(zip(core['feature1'],core['feature2'])):
    if i[::-1] not in non_dublicates:
        non_dublicates.append(i)
        
core['non_dublicate']=core[['feature1','feature2']].apply(lambda x: 1 if tuple(x) in non_dublicates else 0,axis=1)
core['corr features'] = core['feature1'] + ' - ' + core['feature2'] 
core['color']=core['corelation value'].apply(lambda x: '#ffd514' if x>0 else '#00A4CCFF')
core=core[core['non_dublicate']==1]
core_pos=core.sort_values(by='corelation value',ascending=False).reset_index(drop=True)
core_pos=core_pos.iloc[:10]

core_neg=core.sort_values(by='corelation value',ascending=True)
core_neg=core_neg.iloc[:10].copy()
core_neg=core_neg.sort_values(by='corelation value',ascending=False).reset_index(drop=True)

core=pd.concat([core_pos,core_neg],axis=0).reset_index(drop=True)




background_color = "#f6f5f5"

fig = plt.figure(figsize=(15, 6), facecolor=background_color)
gs = fig.add_gridspec(1, 1)
ax0 = fig.add_subplot(gs[0, 0])
colors = ["#2f5586", "#f6f5f5","#2f5586"]
colormap = matplotlib.colors.LinearSegmentedColormap.from_list("", colors)

ax0.set_facecolor(background_color)
ax0.text(-1.1, 0.035, 'Most +ve and most -ve correlation of Continuous Features with other features', fontsize=20, fontweight='bold')
ax0.text(-1.1, 0.033, 'No feature pair exceeds 0.03 in absolute correlation', fontsize=13, fontweight='light')

sns.barplot(x=core['corr features'], y=core['corelation value'], ax=ax0, palette=core['color'], zorder=3, edgecolor='black', linewidth=0.1)
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.7)
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.7)
ax0.set_xticklabels(labels=core['corr features'],rotation=30)
ax0.set_ylabel('corelation value')

ii=0
for patch in ax0.patches:
    ii+=1
    if ii<=(len(core_pos)):
        value = f'{(patch.get_height()):.4f}'
        x = patch.get_x() + patch.get_width() - 0.4
        y = patch.get_y() + patch.get_height() + .0014
        ax0.text(x, y, value, ha='center', va='center', fontsize=9, 
                bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.4))
        
ii=0    
for patch in ax0.patches:
    ii+=1
    if ii>(len(core_pos)):
        value = f'{(patch.get_height()):.4f}'
        x = patch.get_x() + patch.get_width() - 0.4
        y = patch.get_y() + patch.get_height() - .0014
        ax0.text(x, y, value, ha='center', va='center', fontsize=9, 
                bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.4))  

    
for s in ["top","right"]:
    ax0.spines[s].set_visible(False)
    
for s in ['left','bottom']:
    ax0.spines[s].set_linewidth(2)

plt.show()
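
The pair extraction above can also be written against the upper triangle of the correlation matrix, which covers every pair once without a separate de-duplication step. This is a minimal sketch under assumptions: `top_correlated_pairs` is a hypothetical helper, the toy frame stands in for `train_df[cont_features]`, and pairs are ranked by absolute correlation rather than split into positive and negative groups.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=10):
    """Return the n feature pairs with the largest absolute correlation."""
    corr = df.corr()
    # Indices of the strict upper triangle: each pair once, no diagonal.
    rows, cols = np.triu_indices(len(corr), k=1)
    pairs = pd.DataFrame({
        "feature1": corr.index[rows],
        "feature2": corr.columns[cols],
        "correlation": corr.to_numpy()[rows, cols],
    })
    # Sort by magnitude so strong negative correlations rank high too.
    pairs = pairs.sort_values("correlation", key=lambda s: s.abs(), ascending=False)
    return pairs.head(n).reset_index(drop=True)

# Toy data (hypothetical values): 'a' and 'b' are almost perfectly correlated.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.5],
    "c": [4.0, 1.0, 3.0, 2.0],
})
top_pairs = top_correlated_pairs(toy)
print(top_pairs)
```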

[back to top](#table-of-contents)
<a id="target_correlation"></a>
## 4.2. Correlation with target

A bar chart of the 10 continuous features most positively correlated with `target` and the 10 most negatively correlated is plotted. No feature has a correlation with `target` greater than 0.071 or less than -0.1545.

In [None]:
core = pd.DataFrame(train_df[cont_features].corrwith(train_df['target']))
core.columns=['corelation value']
core_pos=core.sort_values(by='corelation value',ascending=False).iloc[:10]
core_neg=core.sort_values(by='corelation value',ascending=True).iloc[:10]
core_neg=core_neg.sort_values(by='corelation value',ascending=False)
core=pd.concat([core_pos,core_neg],axis=0).reset_index()
core['color']=core['corelation value'].apply(lambda x: '#ffd514' if x>0 else '#00A4CCFF')





background_color = "#f6f5f5"

fig = plt.figure(figsize=(15, 6), facecolor=background_color)
gs = fig.add_gridspec(1, 1)
ax0 = fig.add_subplot(gs[0, 0])
colors = ["#2f5586", "#f6f5f5","#2f5586"]
colormap = matplotlib.colors.LinearSegmentedColormap.from_list("", colors)

ax0.set_facecolor(background_color)
ax0.text(-1.1, 0.11, 'Most +ve and most -ve correlation of Continuous Features with target', fontsize=20, fontweight='bold')
ax0.text(-1.1, 0.10, 'No feature exceeds 0.16 in absolute correlation with target', fontsize=13, fontweight='light')

sns.barplot(x=core['index'], y=core['corelation value'], ax=ax0, palette=core['color'], zorder=3, edgecolor='black', linewidth=0.1)
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.9)
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.9)
ax0.set_ylabel('corelation value')

ii=0
for patch in ax0.patches:
    ii+=1
    if ii<=(len(core_pos)):
        value = f'{(patch.get_height()):.4f}'
        x = patch.get_x() + patch.get_width() - 0.4
        y = patch.get_y() + patch.get_height() + .0061
        ax0.text(x, y, value, ha='center', va='center', fontsize=9, 
                bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.4))
        
ii=0    
for patch in ax0.patches:
    ii+=1
    if ii>(len(core_pos)):
        value = f'{(patch.get_height()):.4f}'
        x = patch.get_x() + patch.get_width() - 0.4
        y = patch.get_y() + patch.get_height() - .0061
        ax0.text(x, y, value, ha='center', va='center', fontsize=9, 
                bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.4))  

    
for s in ["top","right"]:
    ax0.spines[s].set_visible(False)
    
for s in ['left','bottom']:
    ax0.spines[s].set_linewidth(2)

plt.show()
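
To see which features are closest to zero correlation as well as which are strongest, the same `corrwith` result can be ranked by absolute value. This is a minimal sketch: `rank_by_target_correlation` is a hypothetical helper, and the toy frame stands in for `train_df`.

```python
import pandas as pd

def rank_by_target_correlation(df, features, target="target"):
    """Rank features by the magnitude of their correlation with the target."""
    corr = df[features].corrwith(df[target])
    ranked = pd.DataFrame({"correlation": corr, "abs_correlation": corr.abs()})
    return ranked.sort_values("abs_correlation", ascending=False)

# Toy data (hypothetical values): f1 rises with the target, f2 falls, f3 is unrelated.
toy = pd.DataFrame({
    "f1": [1, 2, 3, 4],
    "f2": [4, 3, 2, 1],
    "f3": [1, 0, 0, 1],
    "target": [0, 0, 1, 1],
})
ranked = rank_by_target_correlation(toy, ["f1", "f2", "f3"])
print(ranked)
```

The last rows of the ranking are the features with almost no linear relationship to the target.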

[back to top](#table-of-contents)
<a id="Categorical_Features"></a>
## 4.3 Categorical Features

The distributions of the categorical features for each target value (`True`/`False`) are plotted. Feature `f22` shows a different distribution depending on the target value.

In [None]:
background_color = '#e6e6e6'
fig = plt.figure(figsize=(30, 28), facecolor=background_color)
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.5, hspace=0.5)

features = cat_features[:25]

run_no = 0
for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right", 'left']:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

ax0.text(-0.8, 560000, 'Count of categorical features on Train set with different target values', fontsize=28, fontweight='bold')
ax0.text(-0.8, 500000, 'Feature f22 shows a different distribution for each target value', fontsize=18, fontweight='light')        
ax0.text(12, 560000, 'Target = True', style='italic', fontsize=22, bbox={'facecolor': '#0a50ff', 'boxstyle':'round', 'linewidth':0.4})
ax0.text(12, 500000, 'Target = False', style='italic', fontsize=22, bbox={'facecolor': '#ff890a', 'boxstyle':'round', 'linewidth':0.4})

run_no = 0
for col in features:
    chart_df = pd.DataFrame(train_df[train_df['target']==True][col].value_counts())
    sns.barplot(x=chart_df.index, y=chart_df[col], ax=locals()["ax"+str(run_no)], color='#0a50ff', zorder=3, edgecolor='black', linewidth=0.1)
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#a6a6a6', linewidth=0.7)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#a6a6a6', linewidth=0.7)
    locals()["ax"+str(run_no)].set_ylabel(col, fontsize=15, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=14, width=.3)
    for patch in locals()["ax"+str(run_no)].patches:
        current_width = patch.get_width()
        diff = current_width - 0.8 
        patch.set_width(0.8)
        patch.set_x(patch.get_x() + 0.1 * .5)
        
    for s in ['left']:
        locals()["ax"+str(run_no)].spines[s].set_visible(True)
    
    for s in ['left','bottom']:
        locals()["ax"+str(run_no)].spines[s].set_linewidth(2)  
        
    run_no += 1
      
run_no = 0
for col in features:
    chart_df = pd.DataFrame(train_df[train_df['target']==False][col].value_counts())
    sns.barplot(x=chart_df.index, y=chart_df[col], ax=locals()["ax"+str(run_no)], color='#ff890a', zorder=3, edgecolor='black', linewidth=0.1)
    ii=0
    for patch in locals()["ax"+str(run_no)].patches:
        ii+=1
        if ii>2:
            current_width = patch.get_width()
            diff = current_width - 0.6 
            patch.set_width(0.6)
            patch.set_x(patch.get_x() + 0.3 * 0.5)
            
    run_no += 1
    
plt.show()

In [None]:
background_color = '#e6e6e6'
fig = plt.figure(figsize=(30, 28), facecolor=background_color)
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.5, hspace=0.5)

features = cat_features[25:]

run_no = 0
for row in range(0, 4):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right", 'left']:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

ax0.text(-0.8, 540000, 'Count of categorical features on Train set with different target values', fontsize=28, fontweight='bold')
ax0.text(12, 540000, 'Target = True', style='italic', fontsize=22, bbox={'facecolor': '#0a50ff', 'boxstyle':'round', 'linewidth':0.4})
ax0.text(12, 480000, 'Target = False', style='italic', fontsize=22, bbox={'facecolor': '#ff890a', 'boxstyle':'round', 'linewidth':0.4})

run_no = 0
for col in features:
    chart_df = pd.DataFrame(train_df[train_df['target']==True][col].value_counts())
    sns.barplot(x=chart_df.index, y=chart_df[col], ax=locals()["ax"+str(run_no)], color='#0a50ff', zorder=3, edgecolor='black', linewidth=0.1)
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#a6a6a6', linewidth=0.7)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#a6a6a6', linewidth=0.7)
    locals()["ax"+str(run_no)].set_ylabel(col, fontsize=15, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=14, width=.3)
    for patch in locals()["ax"+str(run_no)].patches:
        current_width = patch.get_width()
        diff = current_width - 0.8 
        patch.set_width(0.8)
        patch.set_x(patch.get_x() + 0.1 * .5)
        
    for s in ['left']:
        locals()["ax"+str(run_no)].spines[s].set_visible(True)
    
    for s in ['left','bottom']:
        locals()["ax"+str(run_no)].spines[s].set_linewidth(2)  
        
    run_no += 1
      
run_no = 0
for col in features:
    chart_df = pd.DataFrame(train_df[train_df['target']==False][col].value_counts())
    sns.barplot(x=chart_df.index, y=chart_df[col], ax=locals()["ax"+str(run_no)], color='#ff890a', zorder=3, edgecolor='black', linewidth=0.1)
    ii=0
    for patch in locals()["ax"+str(run_no)].patches:
        ii+=1
        if ii>2:
            current_width = patch.get_width()
            diff = current_width - 0.6 
            patch.set_width(0.6)
            patch.set_x(patch.get_x() + 0.3 * 0.5)
            
    run_no += 1
    
plt.show()
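
Because the two target classes can differ in size, raw counts are harder to compare than shares. A cross-tabulation normalized per target value is one way to check a feature such as `f22` numerically. This is a minimal sketch: `category_share_by_target` is a hypothetical helper, and the toy frame stands in for `train_df`.

```python
import pandas as pd

def category_share_by_target(df, col, target="target"):
    """Percentage share of each category of `col` within each target value."""
    # normalize="columns" makes every target column sum to 1.
    return pd.crosstab(df[col], df[target], normalize="columns") * 100

# Toy data (hypothetical values): category 0 dominates when target is 0,
# category 1 dominates when target is 1.
toy = pd.DataFrame({
    "f22":    [0, 0, 0, 1, 1, 1, 1, 0],
    "target": [0, 0, 0, 0, 1, 1, 1, 1],
})
share_by_target = category_share_by_target(toy, "f22")
print(share_by_target)
```

If the columns of the resulting table look alike, the feature is distributed the same way for both target values; clearly different columns point to a feature that may help separate the classes.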

[back to top](#table-of-contents)
<a id="Model_Development"></a>
# 5. Model Development

A single **XGBoost** model is trained with **5-fold** stratified cross-validation. Tuned hyperparameters are used and the model is run on **GPU**.

**Note:** To prevent memory overflow, the original sets are deleted after splitting the data into features and target.

In [None]:
train_df = reduce_memory_usage(train_df)
test_df = reduce_memory_usage(test_df)
sample = reduce_memory_usage(sample)

In [None]:
train_df.loc[:, cat_features+['target']] = train_df.loc[:, cat_features+['target']].astype(int)
test_df.loc[:, cat_features] = test_df.loc[:, cat_features].astype(int)

In [None]:
X = train_df.drop('target', axis=1)
y = train_df['target']
X_test = test_df

In [None]:
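# Row-wise statistics are computed over the original feature columns only,
# so that 'min' and 'max' do not pick up the newly added 'std' column.
feature_cols = X.columns.tolist()

X['std'] = X[feature_cols].std(axis=1)
X['min'] = X[feature_cols].min(axis=1)
X['max'] = X[feature_cols].max(axis=1)

X_test['std'] = X_test[feature_cols].std(axis=1)
X_test['min'] = X_test[feature_cols].min(axis=1)
X_test['max'] = X_test[feature_cols].max(axis=1)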

In [None]:
params = {
    'max_depth': 6,
    'n_estimators': 9500,
    'learning_rate': 0.007279718158350149,
    'subsample': 0.7,
    'colsample_bytree': 0.2,
    'colsample_bylevel': 0.6000000000000001,
    'min_child_weight': 56.41980735551558,
    'reg_lambda': 75.56651890088857,
    'reg_alpha': 0.11766857055687065,
    'gamma': 0.6407823221122686
    }

In [None]:
%%time

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

preds = []
scores = []

for fold, (idx_train, idx_valid) in enumerate(kf.split(X, y)):
    X_train, y_train = X.iloc[idx_train], y.iloc[idx_train]
    X_valid, y_valid = X.iloc[idx_valid], y.iloc[idx_valid]
    
    model = XGBClassifier(**params,
                            booster= 'gbtree',
                            eval_metric = 'auc',
                            tree_method= 'gpu_hist',
                            predictor="gpu_predictor",
                            use_label_encoder=False)
    
    model.fit(X_train,y_train,
              eval_set=[(X_valid,y_valid)],
              early_stopping_rounds=100,
              verbose=False)
    
    pred_valid = model.predict_proba(X_valid)[:,1]
    score = roc_auc_score(y_valid, pred_valid)
    scores.append(score)
    
    print(f"Fold: {fold + 1} Score: {score}")
    print('--'*40)
    
    test_preds = model.predict_proba(X_test)[:,1]
    preds.append(test_preds)
    
print(f"Overall Validation Score: {np.mean(scores)}")

In [None]:
predictions = np.mean(np.column_stack(preds),axis=1)

sample['target'] = predictions
sample.to_csv('submission.csv', index=False)
sample.head()
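
The blending step above takes the per-fold test predictions, stacks them as columns and averages across folds. The sketch below illustrates this on hypothetical probabilities for three test rows; the values in `fold_preds` are made up for illustration and are not results from the model.

```python
import numpy as np

# Hypothetical per-fold probabilities for three test rows (not real model output).
fold_preds = [
    np.array([0.2, 0.8, 0.5]),
    np.array([0.4, 0.6, 0.5]),
    np.array([0.3, 0.7, 0.5]),
]

# One column per fold: shape is (n_rows, n_folds).
stacked = np.column_stack(fold_preds)

# Averaging along axis=1 gives one blended probability per test row.
blended = stacked.mean(axis=1)
print(stacked.shape, blended)
```

A plain mean weights every fold equally, which is a reasonable default when the fold scores are close to each other.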

#### Thank you for reading. Please share your views and suggestions in the comment section, and give it a thumbs up if you like it.