![](http://storage.googleapis.com/kaggle-competitions/kaggle/26479/logos/header.png?t=2021-04-09-00-55-58)

#### <span style="color: orange; font-family: Segoe UI; font-size: 1.7em; font-weight: 300;">Simple Tabular Playground Series - Sep 2021</span>


**Bugra Sebati E. - September - 2021**

#### **INTRODUCTION**

The dataset used for this competition is synthetic, but it is based on a real dataset and was generated using a CTGAN. The original dataset deals with predicting the amount of an insurance claim. Although the features are anonymized, they have properties relating to real-world features.

**Eval Metric**: Submissions are evaluated on the area under the **ROC curve** between the predicted probability and the observed target.

#### **ROC Curve**

The AUC-ROC curve is a performance measure for classification problems across threshold settings. ROC is a probability curve, and AUC represents the degree of separability: it tells how well the model can distinguish between classes. The higher the AUC, the better the model is at predicting class 0 as 0 and class 1 as 1. By analogy, the higher the AUC, the better the model is at distinguishing between patients with and without a disease.
The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), with TPR on the y-axis and FPR on the x-axis.

![](http://miro.medium.com/max/451/1*pk05QGzoWhCgRiiFbz-oKQ.png)
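
As a small illustration of the description above, here is a minimal sketch that computes the ROC points and the AUC with scikit-learn. The labels and scores are made-up toy values, not competition data.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy labels and predicted probabilities (made-up values, not competition data)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve returns the FPR (x-axis) and TPR (y-axis) at each threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# AUC is the area under the curve traced by those (FPR, TPR) points
auc = roc_auc_score(y_true, y_score)

print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", auc)
```

Three of the four (negative, positive) pairs are ranked correctly here, so the AUC is 0.75.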

#### Import Data & Libraries

In [None]:
import pandas as pd
import numpy as np
#pd.set_option("max_columns" , None)
pd.set_option("display.max_rows" , None)
pd.set_option("display.float_format", lambda x: "%.4f" % x)
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from sklearn.impute import SimpleImputer
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split 
from sklearn.metrics import roc_auc_score
import optuna

In [None]:
train = pd.read_csv("../input/tabular-playground-series-sep-2021/train.csv")
test = pd.read_csv("../input/tabular-playground-series-sep-2021/test.csv")
submission = pd.read_csv("../input/tabular-playground-series-sep-2021/sample_solution.csv")
train1 = train.copy()
test1 = test.copy()

Let's meet the dataset ! 

In [None]:
train.head()

In [None]:
test.head()

In [None]:
submission.head()

In [None]:
train.info()

In [None]:
print(f" In Train : {train.shape[0]} obs and {train.shape[1]} features \n In Test : {test.shape[0]} obs and {test.shape[1]} features")

In [None]:
target = train.claim
test = test.drop(["id"] , axis = 1)
train = train.drop(["id", "claim"] , axis = 1)

In [None]:
train.describe().T

In [None]:
target.describe()

In [None]:
print(pd.isnull(train).values.any())
print(pd.isnull(test).values.any())
print(pd.isnull(target).values.any())

As can be seen, we have missing values, and we need to handle them. We will do so later with the **Simple Imputer** method, which is basic and easy to use. Other methods could also be used.
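
To make the missing-value check more concrete, here is a minimal sketch on a hypothetical toy frame (the column names `a` and `b` are made up). It counts missing values per column and shows median imputation as one possible alternative to the mean strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame; the column names are made up for illustration
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 100.0],
                    "b": [np.nan, 2.0, 2.0, 4.0]})

# Count missing values per column instead of a single True/False
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Median imputation is less sensitive to outliers than the mean
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
toy_filled = pd.DataFrame(median_imputer.fit_transform(toy), columns=toy.columns)
print(toy_filled)
```

On the real data, the same `isnull().sum()` call on `train` would show which features are affected most.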

Now, let's focus on the target variable!

In [None]:
fig = px.histogram(target, x = target, color = target)
fig.update_layout(
    title_text = "Target Distribution",
    xaxis_title_text = "Value",
    yaxis_title_text = "Count",
    bargap = 0.1)
fig.show()

In [None]:
plt.figure(1, figsize = (12,7))
plt.title("Target distribution", color = "orange", fontsize = 15)
target.value_counts().plot.pie(autopct = "%1.4f%%");

It looks good. In classification problems, the distribution of the target variable is important.
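
The same check can also be done numerically. This is a minimal sketch with a hypothetical toy target; in the notebook the series would be `target`.

```python
import pandas as pd

# Hypothetical binary target; in the notebook this would be `target`
toy_target = pd.Series([0, 1, 1, 0, 1, 0, 0, 1])

# Share of each class; values near 0.5 indicate a balanced binary target
class_share = toy_target.value_counts(normalize=True).sort_index()
print(class_share)
```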

Now, let's look at the distributions of the independent variables.

In [None]:
fig = plt.figure(figsize = (30,60))
ax = fig.gca()
hist = train.hist(bins = 50, layout = (24,5),
                       color = "r", alpha = 0.5,  ax = ax)

In [None]:
fig = plt.figure(figsize = (15, 60))
# train no longer contains "id" or "claim", so plot every feature column
for i, col in enumerate(train.columns):
    plt.subplot(24,5,i+1)
    sns.set_style("ticks")
    plt.title(col, size = 12, fontname = "monospace")
    a = sns.boxplot(x = train[col], linewidth = 2.5 , color = "lightgreen")
    plt.ylabel("")
    plt.xlabel("")
    plt.xticks(fontname = "monospace")
    plt.yticks([])
    for j in ["right", "left", "top"]:
        a.spines[j].set_visible(False)
        a.spines["bottom"].set_linewidth(1.2)
        
fig.tight_layout(h_pad = 3)
plt.show()

Now, let's focus on correlation.

In [None]:
fig, ax = plt.subplots(figsize=(20 , 15))
corr = train.corr()
mask = np.triu(np.ones_like(corr, dtype = bool))  # np.bool was removed in recent NumPy versions

sns.heatmap(corr,square = True, center = 0, 
            linewidth = 0.2, cmap = "coolwarm",
           mask = mask, ax = ax) 

ax.set_title("Feature Correlation Matrix", loc = "left")
plt.show()

In [None]:
plt.figure(figsize=(25, 6))
train1.corr()["claim"][:-1].plot(kind = "bar", grid = True)
plt.title("Feature Correlation Table" , fontdict = {"fontsize": 20});

Weak correlation! In other words, there is almost no linear correlation between the features and the target.
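
One caveat worth keeping in mind: `corr()` computes the Pearson correlation by default, which only measures linear association. The sketch below uses made-up toy data to show that a variable can depend entirely on another and still have a Pearson correlation of about zero. It does not claim that the competition features behave this way.

```python
import numpy as np
import pandas as pd

# Toy data: y depends entirely on x, but not linearly
x = pd.Series(np.linspace(-1, 1, 201))
y = x ** 2

# Pearson correlation (the pandas default) only captures linear association
pearson = x.corr(y)
print(pearson)
```

So a weak correlation suggests that a linear model on single features may struggle, while a tree-based model could still find useful structure.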

Now, we will use the **Simple Imputer** method for the missing values.

In [None]:
Imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")

# Fit the imputer on the training data, then reuse the same column means for the test data
df1 = pd.DataFrame(Imputer.fit_transform(train))
df2 = pd.DataFrame(Imputer.transform(test))
df1.columns = train.columns
df2.columns = test.columns

train_ = df1
test_ = df2

In [None]:
print(pd.isnull(train_).values.any())
print(pd.isnull(test_).values.any())

The next step is to standardize the data with the Standard Scaler method.

In [None]:
sc = StandardScaler()
df_standardize = train_.copy()
df_standardize_test = test_.copy()
# Fit the scaler on the training data, then apply the same scaling to the test data
df_standardize[df_standardize.columns.tolist()] = sc.fit_transform(df_standardize[df_standardize.columns.tolist()])
df_standardize_test[df_standardize_test.columns.tolist()] = sc.transform(df_standardize_test[df_standardize_test.columns.tolist()])
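
A common convention is to learn the scaling statistics from the training data only and reuse them for the test data, so both sets end up on the same scale. Here is a minimal sketch with made-up arrays standing in for the imputed features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up arrays standing in for the imputed train and test features
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
toy_train_scaled = scaler.fit_transform(toy_train)  # learn mean and std from train
toy_test_scaled = scaler.transform(toy_test)        # reuse the train statistics

print(scaler.mean_, scaler.scale_)
print(toy_test_scaled.ravel())
```

The scaled training column has mean 0 and standard deviation 1, while the test column is only approximately standardized because it uses the training mean and standard deviation.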

In [None]:
df_standardize.head()

In [None]:
df_standardize_test.head()

To see the difference, we can compare before-and-after graphs.

Let's look at the transformation of the first 10 variables.

In [None]:
features = train.columns[:10].tolist()
for i in features:
    fig, ax = plt.subplots(1,2,figsize=(7,3.5))    
    ax[0].hist(train[i], color = "red", bins = 30, alpha = 0.3, label = "Skew = %s" %(str(round(train[i].skew(),3))) )
    ax[0].set_title(str(i))   
    ax[0].legend(loc = 0)
    ax[1].hist(df_standardize[i], color = "green", bins = 30, alpha = 0.3, label = "Skew = %s" %(str(round(df_standardize[i].skew(),3))) )
    ax[1].set_title(str(i)+ "  After scaling")
    ax[1].legend(loc = 0)
    plt.show()


If you like it, don't forget to upvote! :) **Thanks!**