## <p style="background-color:BlueViolet; font-family:newtimeroman; margin-bottom:2px; font-size:32px; color: white; text-align:center">Table of Contents</p>  

<a id="table-of-contents"></a>
1. [Preparation](#preperation)
    * 1.1. [Loading Packages and Importing Libraries](#load_packages_import_libraries)
    * 1.2. [Data Description](#data_description)
2. [Exploratory Data Analysis (EDA)](#eda)
    * 2.1. [Numerical Variables](#numerical_variables)
    * 2.2. [Normality Check and Outlier Detection](#norm_check_outlier_detect)
3. [Feature Engineering and Modeling](#feat_eng_model)
    * 3.1. [LightGBM Model](#lgbm_model)
    * 3.2. [XGB Model](#xgb_model)
4. [Model Blending](#model_blending)

[back to top](#table-of-contents)
<a id="preperation"></a>
# <p style="background-color:BlueViolet; font-family:newtimeroman; font-size:150%; text-align:center">1. Preparation</p>


<a id="load_packages_import_libraries"></a>
## <p style="background-color:MediumPurple; font-family:newtimeroman; font-size:120%; text-align:center">1.1. Loading Packages and Importing Libraries</p>

Loading packages and importing some helpful libraries.

In [None]:
!pip install simple-colors

In [None]:
import pandas as pd
import numpy as np
import datatable as dt
import optuna

from simple_colors import *
from termcolor import colored

import seaborn as sns
import matplotlib.pyplot as plt

from scipy.stats import normaltest
from scipy import stats

from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

<a id="data_description"></a>
## <p style="background-color:MediumPurple; font-family:newtimeroman; font-size:120%; text-align:center">1.2. Data Description</p>

First, some display options are set so that all rows and columns are shown, which improves the view, especially when describing the data sets. Next, I will load the train, test and sample_solution data sets and display the train and test data sets.

In [None]:
#Setting up options

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.options.display.float_format = "{:,.3f}".format

In [None]:
# Load the data

train_eda = pd.read_csv("../input/tabular-playground-series-sep-2021/train.csv")
test_eda = pd.read_csv("../input/tabular-playground-series-sep-2021/test.csv")
# sample_solution = pd.read_csv("../input/tabular-playground-series-sep-2021/sample_solution.csv")

train = "../input/tabular-playground-series-sep-2021/train.csv"
test = "../input/tabular-playground-series-sep-2021/test.csv"
sample_solution = "../input/tabular-playground-series-sep-2021/sample_solution.csv"

In [None]:
train = dt.fread(train).to_pandas()
test = dt.fread(test).to_pandas()

In [None]:
train.claim = train.claim.astype('int16')

In [None]:
train_memory = train.memory_usage().sum() / 1024**2
print('Memory usage of original training set (in MB): {}'.format(train_memory))

def reduce_memory(df):
    for col in df.columns:
        if str(df[col].dtypes)[:5] == 'float':
            low = df[col].min()
            high = df[col].max()
            if((low > np.finfo(np.float16).min) and (high < np.finfo(np.float16).max)):
                df[col] = df[col].astype('float16')
            elif((low > np.finfo(np.float32).min) and (high < np.finfo(np.float32).max)):
                df[col] = df[col].astype('float32')
    return df

reduce_memory(train)
train_memory_reduced = train.memory_usage().sum() / 1024**2
print('Memory usage of reduced training set (in MB): {}'.format(train_memory_reduced))

In [None]:
test_memory = test.memory_usage().sum() / 1024**2
print('Memory usage of original test set(in MB): {}'.format(test_memory))

reduce_memory(test)
test_memory_reduced = test.memory_usage().sum() / 1024**2
print('Memory usage of reduced test set(in MB): {}'.format(test_memory_reduced))

The following cell replaces infinite values with NaN to avoid errors while generating the histogram and KDE plots. Without it, the KDE plots of some features are not created and the histogram plot raises an error.
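
As a minimal sketch of this replacement step, the snippet below applies the same `replace` call to a small hypothetical frame (`demo` and its values are illustrative assumptions, not part of the competition data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame containing two infinite values
demo = pd.DataFrame({"f1": [1.0, np.inf, 3.0], "f2": [-np.inf, 2.0, 4.0]})

# Replace +/- infinity with NaN so that plotting functions can skip them
demo_clean = demo.replace([np.inf, -np.inf], np.nan)

n_inf_before = int(np.isinf(demo.to_numpy()).sum())
n_inf_after = int(np.isinf(demo_clean.to_numpy()).sum())
n_nan_after = int(demo_clean.isna().sum().sum())
print(n_inf_before, n_inf_after, n_nan_after)
```

Every infinite entry becomes a missing value, so it is counted together with the other missing values in the reports that follow.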

In [None]:
train.replace([np.inf, -np.inf], np.nan, inplace=True)
test.replace([np.inf, -np.inf], np.nan, inplace=True)

In [None]:
def data_desc(df):
    
    """
    This function helps us with simple data analysis.
    We may explore the common information about the dataset, missing values, features distribution and duplicated rows
    """
    
    # applying info() method
    print('*******************')
    print(cyan('General information of this dataset', 'bold'))
    print('*******************\n')
    print(df.info())
    
    print('\n*******************')
    print(cyan('Number of rows and columns', 'bold'))
    print('*******************\n')
    print("Number of rows:", colored(df.shape[0], 'green', attrs=['bold']))
    print("Number of columns:", colored(df.shape[1], 'green', attrs=['bold']))
    
    # missing values
    print('\n*******************')
    print(cyan('Missing value checking', 'bold'))
    print('*******************\n')
    if df.isna().sum().sum() == 0:
        print(colored('There are no missing values', 'green'))
        print('*******************')
    else:
        print(colored('Missing value detected!', 'green', attrs=['bold']))
        print("\nTotal number of missing values:", colored(sum(df.isna().sum()), 'green', attrs=['bold']))
        
        print('\n*******************')
        print(cyan('Missing values of features', 'bold'))
        print('*******************\n')
        display(df.isna().sum().sort_values(ascending = False).to_frame().rename({0:'Counts'}, axis = 1).T.style.background_gradient('Purples', axis = None))
        print('\n*******************')
        print(cyan('Percentage of missing values of features', 'bold'))
        print('*******************\n')
        display(round((df.isnull().sum() / (len(df.index)) * 100) , 3).sort_values(ascending = False).to_frame().rename({0:'%'}, axis = 1).T.style.background_gradient('PuBuGn', axis = None))

    # describe() for numerical features
    cont_feats = [col for col in df.columns if df[col].dtype != object and col not in ('id', 'claim')]
    print('\n*******************')
    print(cyan('Numerical columns', 'bold'))
    print('*******************\n')
    print("Total numerical features:", colored(len(cont_feats), 'green', attrs=['bold']))
    df = df[df.columns.difference(['id', 'claim'], sort = False)]
    display(df.describe())
    
    # Checking for duplicated rows -if any-
    if df.duplicated().sum() == 0:
        print('\n*******************')
        print(colored('There are no duplicates!', 'green', attrs=['bold']))
        print('*******************')
    else:
        print('\n*******************')
        print(colored('Duplicates found!', 'green', attrs=['bold']))
        print('*******************')
        display(df[df.duplicated()])

    print('\n*******************')
    print(cyan('Preview of the data - Top 10 rows', 'bold'))
    print('*******************\n')
    display(df.head(10))
    print('*******************\n')
    
    print('\n*******************')
    print(cyan('End of the report', 'bold'))

In [None]:
data_desc(train)

In [None]:
data_desc(test)

[back to top](#table-of-contents)
<a id="eda"></a>
# <p style="background-color:BlueViolet; font-family:newtimeroman; font-size:150%; text-align:center">2. Exploratory Data Analysis (EDA)</p>

All numerical variables will be explored in this section.

In [None]:
numerical_features = [x for x in train.columns if x.startswith("f")]

<a id="numerical_variables"></a>
## <p style="background-color:MediumPurple; font-family:newtimeroman; font-size:120%; text-align:center">2.1. Numerical Variables</p>

In [None]:
plt.figure(figsize=(10, 7))
ax = sns.countplot(y=train["claim"], palette='muted', zorder=3, linewidth=5, orient='h', saturation=1, alpha=1)
ax.set_title('Distribution of Target', fontname = 'Times New Roman', fontsize = 30, color = '#8c49e7', x = 0.5, y = 1.05)
background_color = "#8c49e7"
sns.set_palette(['#ffd514']*120)

for a in ax.patches:
    value = f'Amount and percentage of values: {a.get_width():,.0f} | {(a.get_width()/train.shape[0]):,.3%}'
    x = a.get_x() + a.get_width() / 2 - 220000
    y = a.get_y() + a.get_height() / 2 
    ax.text(x, y, value, ha='left', va='center', fontsize=18, 
            bbox=dict(facecolor='none', edgecolor='black', boxstyle='round4', linewidth=0.7))


# ax.margins(-0.12, -0.12)
ax.grid(axis="x")

sns.despine(right=True)
sns.despine(offset=15, trim=True)

In [None]:
def missing_data_percentage_plot(data, titleText):
    
    data = data[data.columns.difference(['id', 'claim'], sort = False)]
    
    missing_data_percentage = (data.isnull().sum() / (len(data.index)) * 100).to_frame().reset_index().rename({0:'%'}, axis = 1)


    v0 = sns.color_palette(palette = "mako").as_hex()[3]
    fig = plt.figure(figsize=(40, 40))
    ax = sns.barplot(x=missing_data_percentage['%'], y=missing_data_percentage['index'], color=v0, saturation=.75, zorder=3, linewidth=5, orient='h', alpha=1)
    ax.set_ylabel("Numerical features", fontsize=18, labelpad=15)
    ax.set_xlabel("Percentage of missing values (%)", fontsize=18, labelpad=15)
    ax.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    ax.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    plt.xticks(fontsize = 10)
    plt.yticks(fontsize = 10)
    ax.tick_params(axis="x", rotation=90, labelsize=12)
#     ax.margins(-0.05, -0.02)
    plt.title(titleText, fontsize=32, pad=15, fontweight = 'bold');
    
    for a in ax.patches:
        value = "%.3f%%" % a.get_width()
        x = a.get_width() + 0.005
        y = a.get_y() + a.get_height() / 2
        ax.text(x, y, value, ha='left', va='center', fontsize=10, 
                bbox=dict(facecolor='none', edgecolor='black', boxstyle='round4', linewidth=1.5))

In [None]:
missing_data_percentage_plot(train, 'Percentage of Missing Values - Train Dataset')

In [None]:
missing_data_percentage_plot(test, 'Percentage of Missing Values - Test Dataset')

The percentages of missing data in both train and test data sets are very similar. 

In [None]:
def missing_data_number_plot(data, titleText):
    
    data = data[data.columns.difference(['id', 'claim'], sort = False)]
    
    missing_data_number = data.isnull().sum().to_frame().reset_index().rename({0:'Count'}, axis = 1)

    v0 = sns.color_palette(palette = "mako").as_hex()[3]
    fig = plt.figure(figsize=(40, 40))
    ax = sns.barplot(x=missing_data_number['Count'], y=missing_data_number['index'], color=v0, saturation=.75, zorder=3, linewidth=5, orient='h', alpha=1)
    ax.set_ylabel("Numerical features", fontsize=18, labelpad=15)
    ax.set_xlabel("Number of missing values", fontsize=18, labelpad=15)
    ax.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    ax.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    plt.xticks(fontsize = 10)
    plt.yticks(fontsize = 10)
    ax.tick_params(axis="x", rotation=90, labelsize=12)
#     ax.margins(-0.05, -0.02)
    plt.title(titleText, fontsize=32, pad=15, fontweight = 'bold');
    
    for a in ax.patches:
        value = a.get_width()
        x = a.get_width() + 40
        y = a.get_y() + a.get_height() / 2
        ax.text(x, y, value, ha='left', va='center', fontsize=10, 
                bbox=dict(facecolor='none', edgecolor='black', boxstyle='round4', linewidth=1.5))

In [None]:
missing_data_number_plot(train, 'Number of Missing Values per Feature - Train Dataset')

In [None]:
missing_data_number_plot(test, 'Number of Missing Values per Feature - Test Dataset')

In [None]:
def missing_value_distribution(data, titleText):
    
    # Count the missing values in each row without modifying the input frame
    number_of_null = data.isnull().sum(axis=1)

    # Number of rows for each count of missing values (works for train and test alike)
    counts = number_of_null.value_counts().sort_index().to_dict()
    null_data = {"{} Null Value(s) Per Row".format(k) : v for k, v in counts.items() if k < 8}
    null_data["8 or More Null Values Per Row"] = sum(v for k, v in counts.items() if k > 7)

    pie, ax = plt.subplots(1, 1, figsize=[20, 12])
    plt.title(titleText, fontsize=32, fontweight='bold')
    pie.set_facecolor('#e4e4e4')
    plt.pie(x=list(null_data.values()), autopct="%.3f%%", explode=[0.05]*len(null_data), labels=list(null_data.keys()), pctdistance=0.8, shadow=True, labeldistance = 1.025, wedgeprops = {'linewidth': 1})
    plt.tight_layout()

In [None]:
missing_value_distribution(train, 'Percentage of Null Values Per Row (Train Data)')

In [None]:
missing_value_distribution(test, 'Percentage of Null Values Per Row (Test Data)')

In [None]:
def box_plot(data, features, titleText, hue=None):

    L = len(features)
    ncol = 7
    nrow = int(np.ceil(L / ncol))
    remove_last = (nrow * ncol) - L

    fig, axs = plt.subplots(nrow, ncol, figsize=(30, 100))
    fig.tight_layout()
    fig.set_facecolor('#e4e4e4')

    while remove_last > 0:
      axs.flat[-remove_last].set_visible(False)
      remove_last -= 1

    fig.subplots_adjust(top = 0.97)
    plt.subplots_adjust(hspace = 0.5)
    i = 1
    for feature in features:
        plt.subplot(nrow, ncol, i)
        v0 = sns.color_palette(palette = "crest").as_hex()[3]
        ax = sns.boxplot(data[feature], color=v0, saturation=.75)  
        ax = ax.legend(loc = "best")    
        plt.xlabel(feature, fontsize=18, fontweight = 'bold')
        plt.ylabel('Values', fontsize=18, fontweight = 'bold')
        i += 1

    plt.suptitle(titleText, fontsize = 32, fontweight = 'bold', color = 'navy')
    plt.show()

In [None]:
box_plot(train, numerical_features, 'Box Plot of Numerical Columns of Train Dataset')

In [None]:
box_plot(test, numerical_features, 'Box Plot of Numerical Columns of Test Dataset')

The box plots make it clear that some features contain a significant number of outliers. This must be handled.

In [None]:
def kde_plot(data, features, titleText, hue=None):

    L = len(features)
    ncol = 7
    nrow = int(np.ceil(L / ncol))
    remove_last = (nrow * ncol) - L

    fig, axs = plt.subplots(nrow, ncol, figsize=(30, 100))
    fig.tight_layout()
    fig.set_facecolor('#e4e4e4')

    while remove_last > 0:
      axs.flat[-remove_last].set_visible(False)
      remove_last -= 1

    fig.subplots_adjust(top = 0.97)
    plt.subplots_adjust(left=0.1,
                    bottom=0.1, 
                    right=0.9,  
                    wspace=0.4, 
                    hspace=0.4)
    i = 1
    for feature in features:
        plt.subplot(nrow, ncol, i)
        ax = sns.kdeplot(data[feature], color="m", fill=True, label="%.3f"%(data[feature].skew()))  
        ax = ax.legend(loc = "best")    
        plt.xlabel(feature, fontsize=18, fontweight = 'bold')
        plt.ylabel('Density', fontsize=18, fontweight = 'bold')
        i += 1

    plt.suptitle(titleText, fontsize = 32, fontweight = 'bold', color = 'navy')
    
    plt.show()

In [None]:
train_frac = train_eda.sample(frac = 0.1).reset_index(drop = True)

kde_plot(train_frac, numerical_features, titleText = 'KDE Plot of Numerical Features of Train Dataset', hue = None)

In [None]:
test_frac = test_eda.sample(frac = 0.1).reset_index(drop = True)

kde_plot(test_frac, numerical_features, titleText = 'KDE Plot of Numerical Features of Test Dataset', hue = None)

Because KDE plots take a long time to render, they were created on a 10% sample of each data set. Consistent with the box plots, the KDE plots show that several features contain outliers.

In [None]:
def correlation_matrix(data):

    fig, ax = plt.subplots(1, 1, figsize=(25, 10))
    plt.title('Pearson Correlation Matrix', fontweight='bold', fontsize=25)
    fig.set_facecolor('#d0d0d0') 
    corr = data.drop('id', axis=1).corr()

    # Mask to hide upper-right part of plot as it is a duplicate
    mask = np.triu(np.ones_like(corr, dtype = bool))
    sns.heatmap(corr, annot = False, center = 0, cmap = 'jet', mask = mask, linewidths=.5)
    ax.set_xticklabels(ax.get_xticklabels(), fontfamily='sans', rotation=90, fontsize=12)
    ax.set_yticklabels(ax.get_yticklabels(), fontfamily='sans', rotation = 0, fontsize=12)
    plt.tight_layout()
    plt.show()

In [None]:
correlation_matrix(train)

There is no significant correlation between the variables in the train data set.

In [None]:
correlation_matrix(test)

There is also no significant correlation between the variables in the test data set.

In [None]:
def hist_plot(data, features, titleText, hue=None):

    L = len(features)
    ncol = 7
    nrow = int(np.ceil(L / ncol))
    remove_last = (nrow * ncol) - L

    fig, axs = plt.subplots(nrow, ncol, figsize=(30, 100))
    fig.tight_layout()
    fig.set_facecolor('#e4e4e4')

    while remove_last > 0:
      axs.flat[-remove_last].set_visible(False)
      remove_last -= 1

    fig.subplots_adjust(top = 0.97)
    plt.subplots_adjust(left=0.1,
                    bottom=0.1, 
                    right=0.9,  
                    wspace=0.4, 
                    hspace=0.4)
    i = 1
    for feature in features:
        plt.subplot(nrow, ncol, i)
        ax = sns.histplot(data[feature], edgecolor="black", color="darkseagreen", alpha=0.7)  
        ax = ax.legend(loc = "best")    
        plt.xlabel(feature, fontsize=18, fontweight = 'bold')
        plt.ylabel('Frequency', fontsize=18, fontweight = 'bold')
        i += 1

    plt.suptitle(titleText, fontsize = 32, fontweight = 'bold', color = 'navy')
    plt.show()

In [None]:
hist_plot(train_frac, numerical_features, titleText = 'Histogram of Numerical Features of Train Dataset', hue = None)

In [None]:
hist_plot(test_frac, numerical_features, titleText = 'Histogram of Numerical Features of Test Dataset', hue = None)

As with the KDE plots, the histograms were created on the 10% samples of the data sets.

<a id="norm_check_outlier_detect"></a>
## <p style="background-color:MediumPurple; font-family:newtimeroman; font-size:120%; text-align:center">2.2. Normality Check and Outlier Detection</p>

In [None]:
def qqplot(data, features, titleText, hue=None):

    L = len(features)
    ncol = 7
    nrow = int(np.ceil(L / ncol))
    remove_last = (nrow * ncol) - L

    fig, axs = plt.subplots(nrow, ncol, figsize=(30, 100))
    fig.tight_layout()
    fig.set_facecolor('#e4e4e4')

    while remove_last > 0:
      axs.flat[-remove_last].set_visible(False)
      remove_last -= 1

    fig.subplots_adjust(top = 0.97)
    plt.subplots_adjust(left=0.1,
                    bottom=0.1, 
                    right=0.9,  
                    wspace=0.4, 
                    hspace=0.4)
        
    i = 1
    for feature in features:
        plt.subplot(nrow, ncol, i)   
        stats.probplot(data[feature],plot=plt)
        plt.title('\nQ-Q Plot')
        plt.xlabel(feature, fontsize=18, fontweight = 'bold')
        plt.ylabel('Sample Quantile', fontsize=18, fontweight = 'bold')
        i += 1

    plt.suptitle(titleText, fontsize = 32, fontweight = 'bold', color = 'navy')
    plt.show()

In [None]:
qqplot(train_frac, numerical_features, 'Q-Q Plot of Numerical Features of Train Dataset', hue=None)

In [None]:
qqplot(test_frac, numerical_features, 'Q-Q Plot of Numerical Features of Test Dataset', hue=None)

The Q-Q plots, which compare the sample quantiles with those of a normal distribution, also show clearly that the data is not normally distributed.
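
To illustrate how a Q-Q plot separates normal from non-normal data, here is a minimal sketch on synthetic samples (the samples and variable names are assumptions for illustration only). `scipy.stats.probplot` also returns the correlation coefficient `r` of the fitted reference line; the closer it is to 1, the closer the points lie to that line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=2000)       # synthetic, normally distributed
skewed_sample = rng.exponential(size=2000)  # synthetic, strongly right-skewed

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r))
_, (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print(f"r for the normal sample: {r_normal:.4f}")
print(f"r for the skewed sample: {r_skewed:.4f}")
```

The skewed sample bends away from the reference line, which shows up as a visibly lower `r`.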

In [None]:
# D'Agostino and Pearson's Test

def normality_check(data):
    for i in numerical_features:
        # normality test on the non-missing values, cast to float64;
        # normaltest returns NaN when the input contains missing values
        stat, p = normaltest(data[i].dropna().astype('float64'))
        print('Statistics=%.3f, p=%.3f' % (stat, p))
        # interpret results
        alpha = 1e-2
        if p > alpha:
            print(f'{i} looks Gaussian (fail to reject H0)\n')
        else:
            print(f'{i} does not look Gaussian (reject H0)\n')

In [None]:
normality_check(train)

In [None]:
normality_check(test)

There is not a single feature that fits the normal distribution in either data set. 
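
When reading this output, it helps to know how D'Agostino and Pearson's test treats missing values. The sketch below uses a synthetic skewed sample (an assumption for illustration): with the default `nan_policy`, a single NaN makes the p-value NaN, and the comparison `p > alpha` is then false.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)
sample = rng.exponential(size=5000)          # synthetic, clearly non-normal
sample_with_nan = np.append(sample, np.nan)  # same sample plus one missing value

# Default behaviour: the missing value propagates into the result
stat_nan, p_nan = normaltest(sample_with_nan)

# Dropping the missing value first gives a usable p-value
clean = sample_with_nan[~np.isnan(sample_with_nan)]
stat, p = normaltest(clean)

print("p-value with NaN present:", p_nan)
print("p-value after dropping NaN:", p)
```

On very large samples the test rejects normality for even small deviations, so the Q-Q plots remain a useful visual cross-check.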

In [None]:
def detect_outliers(x, c = 1.5):
    """
    Function to detect outliers.
    """
    q1, q3 = np.percentile(x, [25,75])
    iqr = (q3 - q1)
    lob = q1 - (iqr * c)
    uob = q3 + (iqr * c)

    # Generate outliers

    indicies = np.where((x > uob) | (x < lob))

    return indicies

In [None]:
# Detect all Outliers 
outliers = detect_outliers(train['claim'])
print("Total Outliers count for claim : ", len(outliers[0]))

print("\nShape before removing outliers : ",train.shape)

# Remove outliers
#train.drop(outliers[0],inplace=True, errors = 'ignore')
print("Shape after removing outliers : ",train.shape)

However, given the large number of outliers and the risk that dropping rows from the train data set would significantly hurt the model, no rows were dropped.
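
For reference, the IQR rule behind `detect_outliers` can be illustrated on a small hypothetical array (the values are assumptions chosen so that exactly one point is flagged):

```python
import numpy as np

# Hypothetical values with a single extreme point at the end
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Quartiles, inter-quartile range and the 1.5 x IQR fences
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Positions of the values that fall outside the fences
outlier_idx = np.where((values < lower_fence) | (values > upper_fence))[0]
print(lower_fence, upper_fence, outlier_idx)
```

Using a multiplier of 3 instead of 1.5 gives the wider fences that separate extreme outliers from mild ones.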

In [None]:
train_iqr = pd.DataFrame()
train_iqr.reindex(columns=[*train_iqr.columns.tolist(), "-3 IQR", "-1.5 IQR", "1.5 IQR", "3 IQR"], fill_value = 0)

In [None]:
data = []

columns = ["-3 IQR", "-1.5 IQR", "1.5 IQR", "3 IQR"]

for i in numerical_features:

    q1 = train[i].quantile(0.25)
    q3 = train[i].quantile(0.75)
    
    # Mild (1.5 x IQR) and extreme (3 x IQR) outlier fences
    iqr = (q3 - q1)
    lob_1 = q1 - (iqr * 1.5)
    uob_1 = q3 + (iqr * 1.5)
    lob_3 = q1 - (iqr * 3)
    uob_3 = q3 + (iqr * 3)
    
    # Share of rows beyond each fence, formatted as a percentage string
    number_uob_1 = f'{(train[i] > uob_1).mean():,.3%}'
    number_lob_1 = f'{(train[i] < lob_1).mean():,.3%}'
    number_uob_3 = f'{(train[i] > uob_3).mean():,.3%}'
    number_lob_3 = f'{(train[i] < lob_3).mean():,.3%}'

    values = [number_lob_3, number_lob_1, number_uob_1, number_uob_3]
    a_dictionary = dict(zip(columns, values))
    print(a_dictionary)
    data.append(a_dictionary)

In [None]:
# Build the table directly from the collected rows, indexed by feature name
train_iqr = pd.DataFrame(data, index=numerical_features)
train_iqr

In [None]:
def colour(value):

    if float(value.strip('%')) > 10:
      color = 'red'
    elif float(value.strip('%')) > 5:
        color = 'darkorange'   
    else:
      color = 'green'

    return 'color: %s' % color

# train_iqr = train_iqr.set_axis([numerical_features], axis='index')
train_iqr = train_iqr.style.applymap(colour)

In [None]:
train_iqr

Unlike in many notebooks, the percentage of mild and extreme outliers relative to the total number of rows is computed for each feature. The reason is that a quantile-based scaler such as RobustScaler lets the quantile range (Q1 and Q3 by default) be set manually, which helps manage the outliers; StandardScaler and MinMaxScaler have no such option.
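
A minimal sketch of that idea, assuming a hypothetical single-feature array with one extreme value: widening `quantile_range` in scikit-learn's `RobustScaler` changes the spread used for scaling, while the center stays at the median.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature column with one extreme value
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0],
              [6.0], [7.0], [8.0], [9.0], [1000.0]])

# Default range: scale by the distance between the 25th and 75th percentiles
default_scaler = RobustScaler(quantile_range=(25, 75)).fit(x)

# Wider upper bound: scale by the distance between the 25th and 90th percentiles
wide_scaler = RobustScaler(quantile_range=(25, 90)).fit(x)

print("center (median):", default_scaler.center_[0])
print("scale with (25, 75):", default_scaler.scale_[0])
print("scale with (25, 90):", wide_scaler.scale_[0])
```

A wider range pulls more of the tail into the scale estimate, so the scaled values of that feature become smaller in magnitude.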

[back to top](#table-of-contents)
<a id="feat_eng_model"></a>
# <p style="background-color:BlueViolet; font-family:newtimeroman; font-size:150%; text-align:center">3. Feature Engineering and Modeling</p>

Ten folds are created using the StratifiedKFold method.
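
As a minimal sketch of what stratification does, the snippet below assigns fold ids on a hypothetical toy frame (`demo`, its columns and its 70/30 class split are assumptions for illustration) and checks that every fold keeps the same share of positive targets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy frame: 70 negative and 30 positive targets
demo = pd.DataFrame({"x": np.arange(100), "claim": [0] * 70 + [1] * 30})
demo["fold"] = -1

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (_, valid_idx) in enumerate(skf.split(demo, demo["claim"])):
    # Rows in the validation part of this split receive the fold id
    demo.loc[valid_idx, "fold"] = fold

# Share of positive targets and number of rows in each fold
positive_rate = demo.groupby("fold")["claim"].mean()
fold_sizes = demo["fold"].value_counts()
print(positive_rate)
```

Each row lands in exactly one validation fold, which is what makes out-of-fold predictions possible later on.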

In [None]:
def create_stratified_folds_for_classification(df, n_splits = 10):

    """
    @param df: training data to split into Stratified K Folds; the target value is binned before stratifying
    @param n_splits: number of splits
    @return: the training data with a column holding the fold id
    """

    df['StratifiedKFold'] = -1

    # randomize the data
    df = df.sample(frac=1).reset_index(drop=True)

    # estimate the number of bins per feature with the Freedman-Diaconis rule (bin width = 2 * IQR / n^(1/3)) and average the results
    df_test = []
    k = 0
    df_ = df.select_dtypes(include='number')
    df_ = df_.drop(['id', 'claim', 'StratifiedKFold'], axis=1)

    while k <= len(df_.columns)-1:
      q1 = df_.iloc[:,k].quantile(0.25)
      q3 = df_.iloc[:,k].quantile(0.75)
      iqr = q3 - q1
      bin_width = (2 * iqr) / (len(df_) ** (1 / 3))
      bin_count = int(np.ceil((df_.iloc[:,k].max() - df_.iloc[:,k].min()) / bin_width))
      df_test.append(bin_count)
      mean_bin = np.ceil(sum(df_test) / len(df_test))
      k = k + 1
    print(f"Num bins: {mean_bin}")

    # the bins value acts as the class label that StratifiedKFold uses to distribute the classes evenly over the folds
    df.loc[:, "bins"] = pd.cut(pd.to_numeric(df['claim'], downcast="signed"), bins=int(mean_bin), labels=False)
    kf = model_selection.StratifiedKFold(n_splits=n_splits, shuffle = True, random_state = 606)
    
    # set the fold id as a new column in the df data
    for fold, (df_indicies, valid_indicies) in enumerate(kf.split(X=df, y=df.bins.values)):
        df.loc[valid_indicies, "StratifiedKFold"] = fold
    
    # drop the bins column (no longer needed)
    df = df.drop("bins", axis=1)
    
    return df

In [None]:
train = pd.read_csv("../input/tabular-playground-series-sep-2021/train.csv")
n_splits = 10
train = create_stratified_folds_for_classification(train, n_splits)

In [None]:
train.to_csv("train_folds(10).csv", index=False)

In [None]:
train.StratifiedKFold.value_counts()

In [None]:
plt.figure(figsize=(25,12))
plt.title("Distribution of claim values (StratifiedKFolds with bins)")
for k in range(0,n_splits):
    df = train.loc[train.StratifiedKFold==k]
    sns.kdeplot(df['claim'], label=k)
plt.legend(); plt.show()

<a id="lgbm_model"></a>
## <p style="background-color:MediumPurple; font-family:newtimeroman; font-size:120%; text-align:center">3.1. LGBM Model</p>

After some preprocessing, we can tune the LGBM parameters and train the model.
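
The per-fold training and scoring pattern used in this section can be sketched on synthetic data. To keep the sketch light, `LogisticRegression` stands in for `LGBMClassifier`, and the data set, fold count and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic binary classification problem (stand-in for the competition data)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

oof_predictions = np.zeros(len(y))  # out-of-fold probability for every row
scores = []

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in skf.split(X, y):
    # Stand-in model; the notebook trains LGBMClassifier at this step
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])

    # Predict only on the rows the model has not seen
    oof_predictions[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    scores.append(roc_auc_score(y[valid_idx], oof_predictions[valid_idx]))

print(np.mean(scores), np.std(scores))
```

The mean and standard deviation of the fold scores summarize how well and how consistently the model performs across folds.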

In [None]:
## LGBM parameter tuning

train = pd.read_csv("train_folds(10).csv")
test = pd.read_csv("../input/tabular-playground-series-sep-2021/test.csv")
sample_solution = pd.read_csv("../input/tabular-playground-series-sep-2021/sample_solution.csv")

# In line with the information in the train_iqr table, some features contain significant outliers. To manage this, the Q1 and Q3 quantiles were adjusted while scaling some features.

numerical_features = ['f1','f4','f5','f6','f7','f8','f9','f11','f12','f14','f15','f17','f18','f19','f21','f22','f24','f25','f27','f28','f29','f30','f31','f34','f35','f37','f39','f40','f42','f43','f44','f45','f47','f48','f49','f50','f52','f54','f56','f57','f59','f60','f61','f63','f64','f65','f67','f68','f70','f71','f72','f75','f76','f79','f80','f81','f82','f84','f85','f86','f87','f88','f89','f90','f93','f95','f97','f98','f100','f101','f102','f104','f105','f106','f107','f108','f109','f110','f111','f113','f117','f118']
numerical_features_1 = ["f2", "f13", "f23", "f58", "f66", "f91"]
numerical_features_2 = ["f55", "f94"]
numerical_features_3 = ["f36"]
numerical_features_4 = ["f46"]
numerical_features_5 = ["f33", "f62", "f78"]
numerical_features_6 = ["f3", "f10", "f16", "f20", "f32", "f38", "f41", "f51", "f53", "f69", "f73", "f77", "f83", "f92", "f96", "f103", "f114", "f115", "f116"]
numerical_features_7 = ["f26", "f99"]
numerical_features_8 = ["f74", "f112"]

# Quantile range used by RobustScaler for each feature group
scaling_groups = [
    (numerical_features, (25, 75)),
    (numerical_features_1, (15, 75)),
    (numerical_features_2, (20, 75)),
    (numerical_features_3, (20, 90)),
    (numerical_features_4, (20, 80)),
    (numerical_features_5, (25, 80)),
    (numerical_features_6, (25, 85)),
    (numerical_features_7, (25, 90)),
    (numerical_features_8, (25, 100)),
]

for features, quantile_range in scaling_groups:
    # Fit the scaler on the train set only and apply the same statistics to the test set
    scaler = RobustScaler(quantile_range=quantile_range)
    train[features] = scaler.fit_transform(train[features])
    test[features] = scaler.transform(test[features])

numerical_features = [c for c in train.columns if c.startswith("f")]
test = test[numerical_features]

In [None]:
def opt_lgbm(trial):
    fold = 6

    params = {
        'objective': 'binary',
        'metric': 'auc',
        'verbosity': 0,
        'boosting_type': 'gbdt',
        'device' : 'gpu',
        'gpu_platform_id': 0,
        'gpu_device_id': 0,
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'learning_rate' : trial.suggest_float('learning_rate', 1e-2, 0.30, log=True),
        'reg_lambda' : trial.suggest_float('reg_lambda', 1e-8, 100.0, log=True),
        'num_leaves' : trial.suggest_int('num_leaves', 25, 250),
        'reg_alpha' : trial.suggest_float('reg_alpha', 1e-8, 100.0, log=True),
        'subsample' : trial.suggest_float('subsample', 0.1, 1.0),
        'colsample_bytree' : trial.suggest_float('colsample_bytree', 0.05, 1.0),
        'min_child_samples' : trial.suggest_int('min_child_samples', 10, 250),
        'cat_smooth' : trial.suggest_float('cat_smooth', 10, 150),
        'min_data_per_group' : trial.suggest_int('min_data_per_group', 10, 250),
        'cat_l2' : trial.suggest_float('cat_l2', 1e-2, 10),
        'bagging_freq' : trial.suggest_int('bagging_freq', 1, 10),
        'bagging_fraction' : trial.suggest_float('bagging_fraction', 1e-2, 1),
        'max_depth' : trial.suggest_int('max_depth', 1, 100)
    }

    xtrain = train[train.StratifiedKFold != fold].reset_index(drop=True)
    xvalid = train[train.StratifiedKFold == fold].reset_index(drop=True)

    ytrain = xtrain.claim
    yvalid = xvalid.claim

    xtrain = xtrain[numerical_features]
    xvalid = xvalid[numerical_features]

    model = LGBMClassifier(**params)

    model.fit(xtrain, ytrain, early_stopping_rounds=200, eval_set=[(xvalid, yvalid)], verbose = False)
    preds_valid = model.predict_proba(xvalid)[: , 1]
    score = roc_auc_score(yvalid, preds_valid)
    return score

In [None]:
study = optuna.create_study(direction="maximize")
study.optimize(opt_lgbm, n_trials=50)
print('Number of finished trials:', len(study.trials))
print('Best trial of LGBM:', study.best_trial.params)

In [None]:
study.trials_dataframe()

In [None]:
lgbm_params = study.best_params
lgbm_params['num_iteration'] = 10000
lgbm_params['n_jobs'] = -1
lgbm_params['early_stopping_round'] = 200

In [None]:
## LightGBM Classifier Model

train = pd.read_csv("train_folds(10).csv")
test = pd.read_csv("../input/tabular-playground-series-sep-2021/test.csv")
sample_solution = pd.read_csv("../input/tabular-playground-series-sep-2021/sample_solution.csv")

numerical_features = [c for c in train.columns if c.startswith("f")]
test = test[numerical_features]

final_test_predictions = []
final_valid_predictions = {}
scores = []

for fold in range(len(train['StratifiedKFold'].unique().tolist())):
    xtrain =  train[train.StratifiedKFold != fold].reset_index(drop=True)
    xvalid = train[train.StratifiedKFold == fold].reset_index(drop=True)
    xtest = test.copy()

    valid_ids = xvalid.id.values.tolist()

    ytrain = xtrain.claim
    yvalid = xvalid.claim
    
    xtrain = xtrain[numerical_features]
    xvalid = xvalid[numerical_features]

    model = LGBMClassifier(**lgbm_params)
    model.fit(xtrain, ytrain, early_stopping_rounds = 200, eval_set=[(xvalid, yvalid)], eval_metric = 'auc', verbose = False)
    preds_valid = model.predict_proba(xvalid)[: , 1]
    test_preds = model.predict_proba(xtest)[: , 1]
    final_test_predictions.append(test_preds)
    final_valid_predictions.update(dict(zip(valid_ids, preds_valid)))
    score = roc_auc_score(yvalid, preds_valid)
    print(fold, score)
    scores.append(score)

print(np.mean(scores), np.std(scores))


final_valid_predictions = pd.DataFrame.from_dict(final_valid_predictions, orient="index").reset_index()
final_valid_predictions.columns = ["id", "pred_1"]
final_valid_predictions.to_csv("train_pred_1.csv", index=False)

sample_solution.claim = np.mean(np.column_stack(final_test_predictions), axis=1)
sample_solution.columns = ["id", "pred_1"]
sample_solution.to_csv("test_pred_1.csv", index=False)
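
The fold loop above produces two artifacts: an out-of-fold prediction for every training row, and test predictions averaged over the folds. A minimal sketch of the same pattern on synthetic data, with `LogisticRegression` standing in for LightGBM and a hypothetical `kfold` column in place of `StratifiedKFold` (both are assumptions made for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the competition data (assumption, not the real files).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(1, 6)])
df["id"] = np.arange(len(df))
df["claim"] = y
df["kfold"] = -1
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for fold, (_, valid_idx) in enumerate(skf.split(df, df["claim"])):
    df.loc[valid_idx, "kfold"] = fold

features = [c for c in df.columns if c.startswith("f")]
test_df = df[features].sample(50, random_state=1).reset_index(drop=True)

oof = {}         # id -> out-of-fold probability
test_preds = []  # one array of test probabilities per fold
for fold in range(df["kfold"].nunique()):
    tr = df[df.kfold != fold]
    va = df[df.kfold == fold]
    model = LogisticRegression(max_iter=1000).fit(tr[features], tr.claim)
    oof.update(zip(va.id, model.predict_proba(va[features])[:, 1]))
    test_preds.append(model.predict_proba(test_df)[:, 1])

oof_df = pd.DataFrame({"id": list(oof), "pred": list(oof.values())})
test_mean = np.mean(np.column_stack(test_preds), axis=1)
print(oof_df.shape, test_mean.shape)
```

Each training row is predicted exactly once, by the model that did not see it, which is what makes these predictions usable as inputs for the blending step.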

<a id="xgb_model"></a>
## <p style="background-color:MediumPurple; font-family:newtimeroman; font-size:120%; text-align:center">3.2. XGB Model</p>

After some preprocessing, we tune the XGB parameters and train the model.

In [None]:
## XGB Classifier parameter tuning

train = pd.read_csv("train_folds(10).csv")
test = pd.read_csv("../input/tabular-playground-series-sep-2021/test.csv")
sample_solution = pd.read_csv("../input/tabular-playground-series-sep-2021/sample_solution.csv")

# The train_iqr data set showed that some features contain significant outliers.
# To handle this, the lower and upper quantiles used for scaling are adjusted per feature group.

numerical_features = ['f1', 'f4','f5','f6','f7','f8','f9','f11','f12','f14','f15','f17','f18','f19','f21','f22','f24','f25','f27','f28','f29','f30','f31','f34','f35','f37','f39','f40','f42','f43','f44','f45','f47','f48','f49','f50','f52','f54','f56','f57','f59','f60','f61','f63','f64','f65','f67','f68','f70','f71','f72','f75','f76','f79','f80','f81','f82','f84','f85','f86','f87','f88','f89','f90','f93','f95','f97','f98','f100','f101','f102','f104','f105','f106','f107','f108','f109','f110','f111','f113','f117','f118']
numerical_features_1 = ["f2", "f13", "f23", "f58", "f66", "f91"]
numerical_features_2 = ["f55", "f94"]
numerical_features_3 = ["f36"]
numerical_features_4 = ["f46"]
numerical_features_5 = ["f33", "f62", "f78"]
numerical_features_6 = ["f3", "f10", "f16", "f20", "f32", "f38", "f41", "f51", "f53", "f69", "f73", "f77", "f83", "f92", "f96", "f103", "f114", "f115", "f116"]
numerical_features_7 = ["f26", "f99"]
numerical_features_8 = ["f74", "f112"]

# Fit each scaler on the training data only, then apply the same fitted
# statistics to the test data so both sets share one transformation.
scaling_plan = [
    (numerical_features,   (25, 75)),
    (numerical_features_1, (15, 75)),
    (numerical_features_2, (20, 75)),
    (numerical_features_3, (20, 90)),
    (numerical_features_4, (20, 80)),
    (numerical_features_5, (25, 80)),
    (numerical_features_6, (25, 85)),
    (numerical_features_7, (25, 90)),
    (numerical_features_8, (25, 100)),
]

for cols, q_range in scaling_plan:
    scaler = RobustScaler(quantile_range=q_range)
    train[cols] = scaler.fit_transform(train[cols])
    test[cols] = scaler.transform(test[cols])

numerical_features = [c for c in train.columns if c.startswith("f")]
test = test[numerical_features]
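
`RobustScaler` subtracts the median and divides by the distance between the two percentiles given in `quantile_range`, so a wider range shrinks the scaled values of heavy-tailed features. A small sketch on made-up numbers, fitting on a training column and reusing the fitted statistics on a test column:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Illustrative values only; 100.0 plays the role of an outlier.
train_col = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
test_col = np.array([[2.5], [50.0]])

scales = {}
for q_range in [(25, 75), (25, 100)]:
    scaler = RobustScaler(quantile_range=q_range)
    train_scaled = scaler.fit_transform(train_col)  # learn median and spread on train
    test_scaled = scaler.transform(test_col)        # reuse them on test
    scales[q_range] = float(scaler.scale_[0])
    print(q_range, scaler.center_, scaler.scale_, test_scaled.ravel())
```

With the upper percentile at 100 the outlier itself sets the spread, so every other value ends up close to zero after scaling.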

In [None]:
def opt_xgb(trial):
    fold = 1

    params = {
        'objective': 'binary:logistic',
        'eval_metric' : 'auc',
        'random_state': 606,
        'tree_method' : 'gpu_hist',
        'booster' : 'gbtree',
        'gpu_id' : 0,
        'predictor' : 'gpu_predictor',
        'lambda': trial.suggest_float('lambda', 1e-4, 10, log=True),
        'alpha': trial.suggest_float('alpha', 1e-4, 10, log=True),
        'max_depth': trial.suggest_int('max_depth', 1, 15),
        'learning_rate': trial.suggest_float('learning_rate', 1e-2, 0.30, log=True),
        #'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 100.0),
        #'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 100.0),
        'gamma': trial.suggest_float('gamma', 0, 1),
        'subsample': trial.suggest_float("subsample", 1e-1, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 50, 400),
        'colsample_bytree' : trial.suggest_float("colsample_bytree", 1e-1, 1.0),
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000)
    }

    xtrain = train[train.StratifiedKFold != fold].reset_index(drop=True)
    xvalid = train[train.StratifiedKFold == fold].reset_index(drop=True)

    ytrain = xtrain.claim
    yvalid = xvalid.claim

    xtrain = xtrain[numerical_features]
    xvalid = xvalid[numerical_features]

    model = XGBClassifier(**params)
    model.fit(xtrain, ytrain, early_stopping_rounds = 200, eval_set=[(xvalid, yvalid)], eval_metric = 'auc', verbose = False)
    preds_valid = model.predict_proba(xvalid)[: , 1]
    score = roc_auc_score(yvalid, preds_valid)
    return score

In [None]:
study = optuna.create_study(direction="maximize")
study.optimize(opt_xgb, n_trials=50)
print('Number of finished trials:', len(study.trials))
print('Best trial of XGB Classifier:', study.best_trial.params)

In [None]:
study.trials_dataframe()

In [None]:
xgb_params = study.best_params
xgb_params['tree_method'] = "gpu_hist"
xgb_params['gpu_id'] = 0
xgb_params['use_label_encoder'] = False
xgb_params['predictor'] = "gpu_predictor"
xgb_params['n_estimators'] = 10000
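
`study.best_params` holds only the values proposed through `trial.suggest_*`. Constants set inside the objective, such as `objective` or `random_state`, are not part of it and have to be added back when they matter for the final fit. A sketch with a hypothetical tuned dictionary (the numbers are placeholders, not results from this notebook):

```python
# Hypothetical output of study.best_params (placeholder values).
best_params = {"max_depth": 6, "learning_rate": 0.05, "subsample": 0.8}

# Constants used inside the objective; Optuna does not record these.
fixed_params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "random_state": 606,
}

# Later entries win, so overrides for the final fit go last.
final_params = {**fixed_params, **best_params, "n_estimators": 10000}
print(final_params)
```

Raising `n_estimators` to a large value is safe here only because early stopping decides the actual number of trees.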

In [None]:
## XGB Classifier Model

train = pd.read_csv("train_folds(10).csv")
test = pd.read_csv("../input/tabular-playground-series-sep-2021/test.csv")
sample_solution = pd.read_csv("../input/tabular-playground-series-sep-2021/sample_solution.csv")

numerical_features = [c for c in train.columns if c.startswith("f")]
test = test[numerical_features]

final_test_predictions = []
final_valid_predictions = {}
scores = []

for fold in range(train['StratifiedKFold'].nunique()):
    xtrain = train[train.StratifiedKFold != fold].reset_index(drop=True)
    xvalid = train[train.StratifiedKFold == fold].reset_index(drop=True)
    xtest = test.copy()

    valid_ids = xvalid.id.values.tolist()

    ytrain = xtrain.claim
    yvalid = xvalid.claim
    
    xtrain = xtrain[numerical_features]
    xvalid = xvalid[numerical_features]

    model = XGBClassifier(**xgb_params)
    model.fit(xtrain, ytrain, early_stopping_rounds = 200, eval_set=[(xvalid, yvalid)], eval_metric = 'auc', verbose = False)
    preds_valid = model.predict_proba(xvalid)[: , 1]
    test_preds = model.predict_proba(xtest)[: , 1]
    final_test_predictions.append(test_preds)
    final_valid_predictions.update(dict(zip(valid_ids, preds_valid)))
    score = roc_auc_score(yvalid, preds_valid)
    print(fold, score)
    scores.append(score)

print(np.mean(scores), np.std(scores))


final_valid_predictions = pd.DataFrame.from_dict(final_valid_predictions, orient="index").reset_index()
final_valid_predictions.columns = ["id", "pred_2"]
final_valid_predictions.to_csv("train_pred_2.csv", index=False)

sample_solution.claim = np.mean(np.column_stack(final_test_predictions), axis=1)
sample_solution.columns = ["id", "pred_2"]
sample_solution.to_csv("test_pred_2.csv", index=False)

[back to top](#table-of-contents)
<a id="model_blending"></a>
# <p style="background-color:BlueViolet; font-family:newtimeroman; font-size:150%; text-align:center">4. Model Blending</p>

We now blend the predictions of the two trained models and save the result to a CSV file.

In [None]:
train = pd.read_csv("train_folds(10).csv")
test = pd.read_csv("../input/tabular-playground-series-sep-2021/test.csv")
sample_solution = pd.read_csv("../input/tabular-playground-series-sep-2021/sample_solution.csv")

train1 = pd.read_csv("train_pred_1.csv")
train2 = pd.read_csv("train_pred_2.csv")

test1 = pd.read_csv("test_pred_1.csv")
test2 = pd.read_csv("test_pred_2.csv")

train = train.merge(train1, on="id", how="left")
train = train.merge(train2, on="id", how="left")

test = test.merge(test1, on="id", how="left")
test = test.merge(test2, on="id", how="left")
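
Because the prediction files are joined with `how="left"`, any `id` missing from a prediction file would silently produce `NaN` in the blend inputs. A small sketch of the join with a sanity check, using made-up frames in place of the saved CSV files:

```python
import pandas as pd

# Made-up stand-ins for the fold file and the two out-of-fold prediction files.
base = pd.DataFrame({"id": [0, 1, 2, 3], "claim": [0, 1, 1, 0]})
pred_1 = pd.DataFrame({"id": [3, 1, 0, 2], "pred_1": [0.2, 0.9, 0.1, 0.7]})
pred_2 = pd.DataFrame({"id": [0, 1, 2, 3], "pred_2": [0.3, 0.8, 0.6, 0.1]})

merged = base.merge(pred_1, on="id", how="left").merge(pred_2, on="id", how="left")

# Every row should have found a match in both prediction files.
missing = merged[["pred_1", "pred_2"]].isna().sum().sum()
print(merged)
print("missing predictions:", missing)
```

The row order of the prediction files does not matter, since the join aligns on `id`.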

In [None]:
useful_features = ["pred_1", "pred_2"]
test = test[useful_features]

poly = preprocessing.PolynomialFeatures(degree = 2, interaction_only = False, include_bias = False)
train_poly = poly.fit_transform(train[useful_features])
test_poly = poly.transform(test[useful_features])

df_train_poly = pd.DataFrame(train_poly, columns=[f"POLY_{i}" for i in range(train_poly.shape[1])])
df_test_poly = pd.DataFrame(test_poly, columns=[f"POLY_{i}" for i in range(test_poly.shape[1])])

train = pd.concat([train, df_train_poly], axis=1)
test = pd.concat([test, df_test_poly], axis=1)

useful_features = [col for col in test.columns]

final_predictions = []
scores = []

for fold in range(10):
    xtrain =  train[train.StratifiedKFold != fold].reset_index(drop=True)
    xvalid = train[train.StratifiedKFold == fold].reset_index(drop=True)
    xtest = test.copy()

    ytrain = xtrain.claim
    yvalid = xvalid.claim
    
    xtrain = xtrain[useful_features]
    xvalid = xvalid[useful_features]
    
    model = LGBMClassifier(**lgbm_params)
    model.fit(xtrain, ytrain, early_stopping_rounds = 200, eval_set=[(xvalid, yvalid)], eval_metric = 'auc')
    
    preds_valid = model.predict_proba(xvalid)[: , 1]
    test_preds = model.predict_proba(xtest)[: , 1]
    final_predictions.append(test_preds)
    score = roc_auc_score(yvalid, preds_valid)
    print(fold, score)
    scores.append(score)

print(np.mean(scores), np.std(scores))

In [None]:
sample_solution.claim = np.mean(np.column_stack(final_predictions), axis=1)
sample_solution.to_csv("submission.csv", index=False)