# Introduction
The Tabular Playground Series is a set of month-long competitions, with a new one released on the 1st of every month. They are designed to be beginner-friendly and to help bridge the gap between InClass competitions and Featured competitions.

The aim of TPS September 2021 is to predict whether a customer will make a claim on an insurance policy. The ground truth `claim` is binary valued, but a prediction may be any number from 0.0 to 1.0, representing the probability of a claim. The features in this dataset have been anonymized and may contain missing values.

The table of contents below gives an overview of the sections in this notebook.

1. [Load Required Libraries](#1)
2. [Import the Dataset](#2)
3. [Exploratory Data Analysis](#3)
    * [Train Dataset](#3)
    * [Test Dataset](#4)
    * [Missing Values](#5)
    * [Distributions](#6)
    * [Correlations](#7)
4. [Modeling](#8)
5. [Submission](#9)

<a id = "1" ></a>
## Loading Required Libraries

In [None]:
#importing libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

#modeling
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

In [None]:
#set color palette
sns.set_palette("Spectral_r")

<a id = "2" ></a>
## Importing the dataset
We are using three different files in this notebook and we will import all three files before starting our analysis.

* `train.csv` - the training data with the target claim column
* `test.csv` - the test set; you will be predicting the claim for each row in this file
* `sample_solution.csv` - a sample submission file in the correct format

In [None]:
#import dataset 
train = pd.read_csv("../input/tabular-playground-series-sep-2021/train.csv")
test = pd.read_csv("../input/tabular-playground-series-sep-2021/test.csv")

#output file 
submission = pd.read_csv("../input/tabular-playground-series-sep-2021/sample_solution.csv")

FEATURES = train.columns[:-1]
TARGET = train.columns[-1]

<a id = "3" ></a>
## Exploratory Data Analysis
The aim of this step is to explore the dataset a bit to get insights about the shape of the data, datatypes of the feature columns, missing values and so on.

### Train Dataset

In [None]:
#Overview of train dataset
train.head()

### Dataframe dimensions
* The `train` dataset contains 957919 rows and 120 columns, including the `claim` target

In [None]:
#dimensions of the dataset
print(f'The shape of the train dataset is {train.shape}')

### Quick summary statistics of the data
The summary statistics show the min, max, mean, standard deviation and quartile information for each feature column.

In [None]:
train.describe()

<a id = "4" ></a>
### Test Data

In [None]:
#overview of test data
test.head()

### Dataset Dimensions
* The `test` dataset contains 493474 rows and 119 columns

In [None]:
print(f'The shape of the test data is {test.shape}')

### Submission File
The format of the output submission file is shown below: 
* It contains only two columns namely, the `id` column and the `claim` column

In [None]:
submission.head()

<a id = "5" ></a>
### Missing values 
We will check if there are missing values in our dataset

In [None]:
#missing values
missing = train.isnull().sum()
missing

As shown above, our dataset contains missing values. Next, I will check the proportion of missing values in each column.

The proportion of missing values per column is roughly **1.60% to 1.65%**.

In [None]:
#proportion of missing values in each column
missing/len(train)

### Imbalance in the distribution of Target variable
From the plot below, we see that the target variable `claim` is fairly balanced

In [None]:
#checking for imbalance in the dataset
#sort by class label so the counts line up with the labels 0 and 1
count = train['claim'].value_counts().sort_index()
sns.barplot(x = count.index, y = count.values)
plt.title('Target variable count')
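
To put a number on the balance seen in the plot, the class shares can be printed as well. This is a minimal sketch on a made-up target column (the values are illustrative only); on the real data the same call would be applied to `train['claim']`.

```python
import pandas as pd

# Toy target column standing in for train['claim']; the values are made up.
claim = pd.Series([0, 1, 1, 0, 1, 0, 0, 1], name="claim")

# Share of each class, ordered by class label rather than by frequency.
class_share = claim.value_counts(normalize=True).sort_index()
print(class_share)
```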

<a id = "6" ></a>
### Feature Distributions
The histograms below show the distribution of each feature in the train and test datasets. We observe that the feature distributions in the two datasets are almost identical.

In [None]:
#distribution of features in train dataset
fig = plt.figure(figsize = (20, 140))
for idx, i in enumerate(train.columns):
    #the number of subplot rows must be an integer
    fig.add_subplot(int(np.ceil(len(train.columns)/4)), 4, idx+1)
    train.iloc[:, idx].hist(bins = 20)
    plt.title(i)
plt.show()

In [None]:
#distribution of features in test data
fig = plt.figure(figsize = (20, 140))
for idx, i in enumerate(test.columns):
    #the number of subplot rows must be an integer
    fig.add_subplot(int(np.ceil(len(test.columns)/4)), 4, idx+1)
    test.iloc[:, idx].hist(bins = 20)
    plt.title(i)
plt.show()
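
Comparing well over a hundred histograms by eye is tedious, so a numeric side-by-side check can complement the plots. The sketch below is one possible approach, shown on randomly generated toy frames (`toy_train` and `toy_test` are stand-ins, not the competition data): it places a few summary statistics for each feature next to each other for the two datasets.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-ins for `train` and `test`, drawn from the same distributions.
toy_train = pd.DataFrame({"f1": rng.normal(0, 1, 5000), "f2": rng.uniform(0, 1, 5000)})
toy_test = pd.DataFrame({"f1": rng.normal(0, 1, 5000), "f2": rng.uniform(0, 1, 5000)})

# One row per feature, with train and test statistics side by side.
stats = ["mean", "std", "min", "max"]
comparison = pd.concat(
    [toy_train.agg(stats).T.add_suffix("_train"),
     toy_test.agg(stats).T.add_suffix("_test")],
    axis=1,
)
print(comparison.round(3))
```

Large gaps between the `_train` and `_test` columns for a feature would hint at a distribution shift worth a closer look.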

<a id = "7" ></a>
### Correlations
There seems to be little or no correlation between the features, or between the features and the target.

In [None]:
#correlation between features
corr = train.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(16, 16))
    ax = sns.heatmap(corr, mask=mask, cmap = 'Spectral_r', vmax=.3, square=True)
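
The heatmap gives a visual impression; the feature-to-target correlations can also be ranked numerically. The following is a minimal sketch on toy data, where `f1` is deliberately constructed to depend on the target and `f2` is noise (both are assumptions for illustration, unrelated to the real anonymized features).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Toy data: f1 is related to the target by construction, f2 is pure noise.
target = rng.integers(0, 2, n)
toy = pd.DataFrame({
    "f1": target + rng.normal(0, 1, n),
    "f2": rng.normal(0, 1, n),
    "claim": target,
})

# Absolute Pearson correlation of each feature with the target, largest first.
target_corr = (
    toy.corr()["claim"]
    .drop("claim")
    .abs()
    .sort_values(ascending=False)
)
print(target_corr)
```

Note that Pearson correlation only captures linear relationships, so a low value does not rule out that a tree-based model can still make use of a feature.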

<a id = "8" ></a>
## Modeling
In this notebook, I will be using an XGBoost classifier. My model is based on this notebook by the [kaggle competition team](https://www.kaggle.com/hsuchialun/tps-xgboost-kfold-with-gpu#Step1:-Import-Helpful-Libraries), with a few parameters changed.

In [None]:
#modeling
X = train.loc[:, FEATURES]
y = train.loc[:, TARGET]

final_predictions = []
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for fold, (train_indices, valid_indices) in enumerate(kf.split(X, y)):
    #kf.split yields positional indices, so select rows with iloc
    X_train = X.iloc[train_indices]
    X_valid = X.iloc[valid_indices]
    X_test = test.copy()
    
    y_train = y.iloc[train_indices]
    y_valid = y.iloc[valid_indices]
    
    model = XGBClassifier(random_state=42, verbosity=0, tree_method='gpu_hist')
    
    model.fit(X_train, y_train,
             verbose = False,
             eval_set = [(X_train, y_train), (X_valid, y_valid)],
             eval_metric = "auc",
             early_stopping_rounds = 200)
    preds_valid = model.predict_proba(X_valid)[:,1]
    preds_test = model.predict_proba(X_test)[:,1]
    final_predictions.append(preds_test)
    print(fold, roc_auc_score(y_valid, preds_valid))
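
The score printed for each fold is the ROC AUC, which can be read as the probability that a randomly chosen positive row receives a higher predicted probability than a randomly chosen negative row. To make that interpretation concrete, here is a small sketch that computes it by brute force over all positive/negative pairs. The helper `pairwise_auc` is a hypothetical name used only for illustration; it is not how `roc_auc_score` is implemented and is not suitable for large datasets.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    correct = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return correct / (len(pos) * len(neg))

# Made-up labels and predicted probabilities.
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

In this toy case three of the four positive/negative pairs are ordered correctly, which gives 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.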

<a id = "9" ></a>
## Submission
This is my final submission file. 

In [None]:
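# Average the test predictions across all folds
preds = np.mean(np.column_stack(final_predictions), axis=1)

# Build the submission dataframe
y_pred = pd.DataFrame({'id': submission['id'], 'claim': preds})

# Create submission file; index=False keeps only the id and claim columns
y_pred.to_csv("submission.csv", index=False)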
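
The fold-averaging step can be illustrated on a tiny example. The sketch below uses made-up probabilities for three folds and four test rows, plus hypothetical `id` values, to show how `np.column_stack` followed by a mean over `axis=1` produces one averaged probability per test row, and what the resulting two-column CSV looks like when written with `index=False`.

```python
import numpy as np
import pandas as pd

# Made-up test-set probabilities from three folds, for four test rows.
fold_predictions = [
    np.array([0.2, 0.8, 0.5, 0.1]),
    np.array([0.4, 0.6, 0.5, 0.3]),
    np.array([0.3, 0.7, 0.5, 0.2]),
]

# Stack to shape (n_rows, n_folds), then average across folds for each row.
stacked = np.column_stack(fold_predictions)
averaged = stacked.mean(axis=1)

# Hypothetical ids; the real ones come from the sample submission file.
toy_submission = pd.DataFrame({"id": [100, 101, 102, 103], "claim": averaged})

# With no path given, to_csv returns the CSV content as a string.
csv_text = toy_submission.to_csv(index=False)
print(csv_text)
```

Averaging the probabilities from several folds is a simple form of ensembling; it tends to smooth out the variation between models trained on different subsets of the data.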

# Thanks for reading! Upvote if you find this notebook useful 