# Overview

Starting with a simple model is a sound first step in most machine learning problems. Once a baseline exists, we can apply other methods to raise the score and make the model more robust. This notebook builds a baseline model and produces my first submission for this competition.

# Table of Contents

* [Let's know our data](#Let's-know-our-data)
* [Train data Preprocessing](#Train-data-Preprocessing)
* [Handling Missing Values](#Handling-Missing-Values)
* [Feature Exploration](#Feature-Exploration)
* [Feature Scaling](#Feature-Scaling)
* [Modeling](#Modeling)
* [Test data Preprocessing](#Test-data-Preprocessing)
* [Prediction & Submission](#Predicting-and-submission-file)

# Let's know our data

Every sample in this dataset is synthetic, generated with a CTGAN from a real dataset. The original dataset deals with predicting whether a claim will be made on an insurance policy. Although the features are anonymized, they have properties relating to real-world features. We need to predict whether a customer made a claim on an insurance policy. The ground-truth `claim` is binary, but a prediction may be any number from `0.0` to `1.0`, representing the probability of a claim. The dataset also contains missing values.

**Evaluation**: Submissions are evaluated on area under the **ROC curve** between the predicted probability and the observed target.
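
To make the metric concrete: the area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below computes it that way on made-up labels and scores. The helper name `pairwise_auc` and the toy values are illustrative assumptions, not part of the competition code, and the double loop is far too slow for the real dataset.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Illustration only: the loop is
    O(n_pos * n_neg), so use a library implementation on real data.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (made up for illustration)
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Because only the ordering of the scores matters, the predictions do not need to be calibrated probabilities to score well on this metric.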

### What do we need to submit?

The submission file is expected to have two columns: `id` and `claim`.

OK! Now we are familiar with our playground. Let's practice. We will try to score goals later.

In [None]:
# import basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

**N.B.:** I try to import each library right before the cell where I first use it. This helps me, and hopefully the reader, track which library each step needs more efficiently than importing everything in one cell.

In [None]:
# reading train and test data
train = pd.read_csv('../input/tabular-playground-series-sep-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-sep-2021/test.csv')

# Train data Preprocessing

### Basic Train data Info

In [None]:
# viewing first 5 rows of our train dataset
train.head()

In [None]:
# shape of train-data
train.shape

In [None]:
# concise summary of a DataFrame.
train.info()

In [None]:
# descriptive statistics of the data
train.describe().T

In [None]:
# check for null values
train.isna().sum()

Every feature column has nearly the same number of null values. Remember that all values are synthetic, so this is plausible.

In [None]:
# let's see how many samples each class has
train.claim.value_counts()

In [None]:
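# plot a pie chart, taking the labels from the counts so they always match the slices
class_counts = train.claim.value_counts()
plt.pie(class_counts, labels = class_counts.index);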

In [None]:
train.drop('id', axis = 1, inplace = True)

# Handling Missing Values

We are building a baseline, so let's just fill them with the column mean. We can analyze them in more depth later.

In [None]:
# fill missing values in f1..f118 with each column's mean
feature_cols = ['f' + str(i + 1) for i in range(118)]
train[feature_cols] = train[feature_cols].fillna(train[feature_cols].mean())
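
This notebook fills each frame with its own column means. A common variation, sketched here on made-up toy frames, is to learn the means from the training data once and reuse them for the test data, so both sets are filled with the same values. The names `toy_train`, `toy_test`, and `train_means` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train and test (values are made up)
toy_train = pd.DataFrame({"f1": [1.0, np.nan, 3.0], "f2": [10.0, 20.0, np.nan]})
toy_test = pd.DataFrame({"f1": [np.nan, 5.0], "f2": [np.nan, np.nan]})

# Learn the fill values once, from the training data only
train_means = toy_train.mean()

# fillna with a Series fills each column with the value at the matching label
toy_train_filled = toy_train.fillna(train_means)
toy_test_filled = toy_test.fillna(train_means)

print(toy_test_filled)
```

In the toy test frame, `f2` is entirely missing, so its own mean would be `NaN`; the training mean still fills it.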

# Feature Exploration

**Skewness** measures the lack of *symmetry* in a distribution. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. **Kurtosis** measures whether the data are heavy-tailed or light-tailed relative to a normal distribution: data sets with high kurtosis tend to have heavy tails, or **outliers**. If the term outliers is new to you, check out this [great notebook](https://www.kaggle.com/nareshbhat/outlier-the-silent-killer) by [Naresh Bhat](https://www.kaggle.com/nareshbhat).

#### What is acceptable skewness and kurtosis?
Skewness and kurtosis values between -2 and +2 are considered acceptable evidence of a normal univariate distribution (George & Mallery, 2010). Note that pandas' `kurt()` reports excess kurtosis, for which a normal distribution scores 0.

Let's check `Skewness` and `Kurtosis` for our data.

In [None]:
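def skew_kurt(column, data = train):
    """Plot the distribution of `column` and report its skewness and kurtosis."""
    sns.displot(x = column, data = data, kde = True)
    skewness = data[column].skew()
    kurtosis = data[column].kurt()  # excess kurtosis: a normal distribution scores 0
    # show both statistics, rounded, in the plot title
    plt.title(f"{column}: skewness = {skewness:.3f}, kurtosis = {kurtosis:.3f}")
    plt.show()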

In [None]:
skew_kurt('f1')

In [None]:
skew_kurt('f2')

In [None]:
skew_kurt('f25')

In [None]:
skew_kurt('f50')

In [None]:
skew_kurt('f100')

I picked a few columns at random, and their skewness and kurtosis all fall within the expected range. I assume the remaining columns behave the same way. You may apply a **transformation** to get a better-behaved distribution.
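
Rather than assuming, the check can be run on every column at once. The sketch below does this on made-up toy data with one roughly normal column and one heavy-tailed column. The names `toy`, `shape_stats`, and `flagged` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy data: one roughly normal column and one heavy-tailed column (made up)
toy = pd.DataFrame({
    "normal_like": rng.normal(size=10_000),
    "heavy_tail": rng.lognormal(mean=0.0, sigma=1.0, size=10_000),
})

# One row per column; pandas' kurt() reports excess kurtosis (normal = 0)
shape_stats = pd.DataFrame({"skew": toy.skew(), "kurt": toy.kurt()})

# Flag columns where either statistic falls outside the [-2, 2] rule of thumb
outside = shape_stats[(shape_stats.abs() > 2).any(axis=1)]
flagged = outside.index.tolist()

print(shape_stats.round(2))
print("outside [-2, 2]:", flagged)
```

In this notebook, the same three lines could be applied to `train.drop(columns='claim')` in place of `toy` to list any features that break the rule of thumb.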

# Modeling

In [None]:
# dependent and independent features
x = train.drop(['claim'], axis = 1)
y = train['claim']

In [None]:
x.head()

# Feature Scaling

In [None]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
x = scaler.fit_transform(x)
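
With its default settings, `RobustScaler` subtracts each column's median and divides by its interquartile range (75th minus 25th percentile), so a few extreme values barely affect the scaling of the rest. The NumPy sketch below reproduces that arithmetic by hand on one made-up feature; it illustrates the idea and is not a replacement for the scaler.

```python
import numpy as np

# Toy feature with one extreme value (made up for illustration)
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median = np.median(values)                   # 3.0
q25, q75 = np.percentile(values, [25, 75])   # 2.0 and 4.0
iqr = q75 - q25                              # 2.0

# Center on the median, scale by the interquartile range
robust_scaled = (values - median) / iqr
print(robust_scaled)
```

The four ordinary values land between -1 and 0.5, while the extreme value stays clearly separated at 48.5 instead of compressing everything else toward zero.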

In [None]:
# hold out 5% of the training data as a validation set
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = .05, random_state = 31)

In [None]:
len(x_train), len(x_test), len(y_train), len(y_test)

In [None]:
x_train.shape, y_train.shape

In [None]:
%%time
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMClassifier
lgbm = LGBMClassifier(objective= 'binary',
                    n_estimators= 20000,
                    random_state= 2021,
                    learning_rate= 5e-3,
                    subsample= 0.6,
                    subsample_freq= 1,
                    colsample_bytree= 0.4,
                    reg_alpha= 10.0,
                    reg_lambda= 1e-1,
                    min_child_weight= 256,
                    min_child_samples= 20).fit(x_train, y_train)

y_preds = lgbm.predict_proba(x_test)[:, 1]
roc_auc_score(y_test, y_preds)
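
A single 5% hold-out gives only one AUC estimate, which can be noisy. A k-fold estimate is one way to check it, sketched below with scikit-learn. To keep the sketch fast, it uses synthetic data and a `LogisticRegression` stand-in; both are assumptions for illustration, and the `LGBMClassifier` above could be passed in instead at a much higher runtime.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the scaled features and the claim target
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

# Stand-in model so the sketch runs in seconds
model = LogisticRegression(max_iter=1000)

# Stratified folds keep the class balance similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=31)
auc_scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="roc_auc")

print(auc_scores.mean(), auc_scores.std())
```

The spread across folds gives a rough sense of how much a single hold-out score could move by chance.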

# Test data Preprocessing

In [None]:
test.head()

In [None]:
# descriptive statistics of test data
test.describe().T

In [None]:
# check for null values
test.isna().sum()

In [None]:
# fill missing values in f1..f118 with each column's mean
feature_cols = ['f' + str(i + 1) for i in range(118)]
test[feature_cols] = test[feature_cols].fillna(test[feature_cols].mean())

In [None]:
test.drop('id', axis = 1, inplace = True)

In [None]:
test.shape

In [None]:
test = scaler.transform(test)

# Predicting and submission file

In [None]:
preds = lgbm.predict_proba(test)[:, 1]
testforsub = pd.read_csv('../input/tabular-playground-series-sep-2021/test.csv')
preds = pd.DataFrame(preds, columns = ['claim'])
sub = pd.concat([testforsub.id, preds] , axis = 1)
sub.to_csv('baseline_submission.csv', index = False)