# Problem Statement
The goal is to predict whether a customer made a claim on an insurance policy. The ground-truth `claim` is binary-valued, but a prediction may be any number from 0.0 to 1.0, representing the probability of a claim. The features in this dataset have been anonymized and may contain missing values.

# 1. Preparation

# a) Load libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings 
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)

# b) Load Dataset

In [None]:
df = pd.read_csv("../input/tabular-playground-series-sep-2021/train.csv")

# 2. Summarize Data
# a) Descriptive Statistics

In [None]:
df.head(2)

In [None]:
df.shape

In [None]:
df.describe().T

In [None]:
# Percentage of missing values
(df.isnull().sum()/df.shape[0]*100)

In [None]:
# Correlation with the target
df.corr()[['claim']]

## Observations so far
* The dataset is huge: 957,919 rows and 120 columns.
* Excluding the `id` column and the target column `claim`, we have 118 feature columns.
* The feature columns have widely varying ranges of values.
* Missing values are consistently around 1.6% in each column. Should we drop them (I'm not for it) or impute them somehow? We'll ponder over it later.
* As seen above, none of the features shows much correlation with the target variable.
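
On the drop-or-impute question, here is a minimal sketch of one possible option: median imputation plus a per-row count of missing values. This is only an illustration, not a decision made in this notebook. The helper names (`missing_summary`, `impute_median`), the `n_missing` column and the toy data are all made up for the example.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first."""
    return (frame.isnull().mean() * 100).sort_values(ascending=False)

def impute_median(frame: pd.DataFrame, feature_cols) -> pd.DataFrame:
    """Return a copy where NaNs in feature_cols are replaced by the column median.

    Also adds a hypothetical 'n_missing' column holding the number of
    missing feature values in each row, counted before imputation.
    """
    out = frame.copy()
    out["n_missing"] = out[feature_cols].isnull().sum(axis=1)
    out[feature_cols] = out[feature_cols].fillna(out[feature_cols].median())
    return out

# Toy stand-in for the real data (values are invented for illustration)
toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, 4.0],
    "f2": [10.0, 20.0, np.nan, 40.0],
    "claim": [0, 1, 0, 1],
})
toy_features = ["f1", "f2"]

print(missing_summary(toy))
toy_imputed = impute_median(toy, toy_features)
print(toy_imputed)
```

On the real data the same helper would be called with the 118 feature columns. Whether the median is a good fill value for such skewed features is something to check, not assume.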

# b) Data visualizations

In [None]:
fig, ax = plt.subplots(6, 1, figsize = (30,20))
sns.boxplot(data = df.iloc[:, 1:20], ax = ax[0])
sns.boxplot(data = df.iloc[:, 20:40], ax = ax[1])
sns.boxplot(data = df.iloc[:, 40:60], ax = ax[2])
sns.boxplot(data = df.iloc[:, 60:80], ax = ax[3])
sns.boxplot(data = df.iloc[:, 80:100], ax = ax[4])
sns.boxplot(data = df.iloc[:, 100:], ax = ax[5])

### This just confirms what we observed earlier: the ranges are so varied that we cannot see the plots clearly. So let's normalize the data for the visualization.

In [None]:
features = df.columns[1:-1]
feat_df=df[features]
n_df=((feat_df - feat_df.min())/(feat_df.max() - feat_df.min()))

In [None]:
fig, ax = plt.subplots(6, 1, figsize = (30,20))
sns.boxplot(data = n_df.iloc[:, 0:20], ax = ax[0])  # n_df has no id column, so start at 0
sns.boxplot(data = n_df.iloc[:, 20:40], ax = ax[1])
sns.boxplot(data = n_df.iloc[:, 40:60], ax = ax[2])
sns.boxplot(data = n_df.iloc[:, 60:80], ax = ax[3])
sns.boxplot(data = n_df.iloc[:, 80:100], ax = ax[4])
sns.boxplot(data = n_df.iloc[:, 100:], ax = ax[5])

In [None]:
nrows = 30
ncols = 4
i = 0
fig, ax = plt.subplots(nrows, ncols, figsize = (40,120))
for row in range(nrows):
    for col in range(ncols):
        if i==118:
            break
        else:
            sns.histplot(data = n_df.iloc[:, i], bins = 30, ax = ax[row, col]).set(ylabel = '')
            i += 1

In [None]:
i = 0
fig, ax = plt.subplots(nrows, ncols, figsize = (40,120))
for row in range(nrows):
    for col in range(ncols):
        if i==118:
            break
        else:
            sns.kdeplot(x = n_df.iloc[:, i], ax = ax[row, col]).set(ylabel = '')
            i += 1

### Observations from visualizations
* The visualisations clearly show quite a few outliers, so we will need a strategy to handle them.
* The distributions are mostly non-Gaussian, with all kinds of skewness in the data. We will have to work towards standardizing this too.
* Still, some columns do follow a Gaussian pattern. To handle this mix we could try scalers such as Standard, Robust or MinMax and compare them later.
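
To make the scaler idea a bit more concrete, here is a rough sketch comparing the three options on a made-up skewed column. The functions are hand-written pandas versions, meant to mirror what I understand the defaults of scikit-learn's `StandardScaler`, `RobustScaler` and `MinMaxScaler` to be. The function names and the toy log-normal data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def standard_scale(s: pd.Series) -> pd.Series:
    """Zero mean, unit (population) standard deviation."""
    return (s - s.mean()) / s.std(ddof=0)

def robust_scale(s: pd.Series) -> pd.Series:
    """Centre on the median and divide by the interquartile range."""
    iqr = s.quantile(0.75) - s.quantile(0.25)
    return (s - s.median()) / iqr

def minmax_scale(s: pd.Series) -> pd.Series:
    """Rescale to the [0, 1] interval."""
    return (s - s.min()) / (s.max() - s.min())

# Toy right-skewed feature with a long tail (invented data)
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000), name="toy_feature")

scaled = pd.DataFrame({
    "standard": standard_scale(skewed),
    "robust": robust_scale(skewed),
    "minmax": minmax_scale(skewed),
})

print(scaled.describe().T[["mean", "std", "min", "50%", "max"]])
# All three are linear rescalings, so the skewness itself is unchanged
print("skewness before:", skewed.skew())
print("skewness after :", scaled.skew().to_dict())
```

One thing this shows: the scalers only shift and stretch a column, they do not change its shape. The skewness observed above would therefore still be there after any of them.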

### Now an introduction to some tools that provide quick EDA. I'll only be sharing the code on how to use them. Some take a lot of time and are slow to use with a dataset this huge. Take your pick.

## 1. Sweetviz

> * import sweetviz
> * my_report  = sweetviz.analyze([df,'Train'], target_feat='claim')
> * my_report.show_html('FinalReport.html')

In [None]:
# !pip install sweetviz
import sweetviz
my_report = sweetviz.analyze([df,'Train'], target_feat='claim')
my_report.show_html('FinalReport.html')

## [sweetviz report](./FinalReport.html)

## 2. Pandas Profiling

### Since the data is huge, pandas profiling took a lot of time, and the generated report was over 500 MB on my local machine. The code is below:

> * from pandas_profiling import ProfileReport
> * profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)
> * profile.to_file("PandasProfilingReport.html")
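
One possible way to cut the run time and report size is to profile a random sample of rows instead of the full dataset. I have not measured this on this data, so treat it as a sketch. The helper `sample_for_eda`, its default of 50,000 rows and the toy frame are assumptions for illustration. The profiling call is left as a comment so the snippet runs without the library installed.

```python
import numpy as np
import pandas as pd

def sample_for_eda(frame: pd.DataFrame, n_rows: int = 50_000, seed: int = 42) -> pd.DataFrame:
    """Return a reproducible random subset of rows (hypothetical helper).

    If the frame already has n_rows or fewer, it is returned unchanged.
    """
    if len(frame) <= n_rows:
        return frame
    return frame.sample(n=n_rows, random_state=seed)

# Toy stand-in for the real data (invented values)
toy = pd.DataFrame(
    np.random.default_rng(0).normal(size=(1000, 5)),
    columns=[f"f{i}" for i in range(1, 6)],
)
toy_sample = sample_for_eda(toy, n_rows=100)
print(toy_sample.shape)

# With the real data this would look roughly like (not run here):
# from pandas_profiling import ProfileReport
# profile = ProfileReport(sample_for_eda(df), title="Pandas Profiling Report", explorative=True)
# profile.to_file("PandasProfilingReport.html")
```

A sample will miss rare values and shifts the missing-value percentages slightly, so the report is a quick look and not a replacement for the full statistics above.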

## 3. D-Tale
### Quickest of the lot. It seems to host the report on the local machine and create the visualizations on request through the browser interface. You should definitely check it out!

> * !pip install dtale
> * import dtale
> * d = dtale.show(df)
> * d.open_browser()

### **** That's it for this version! Do let me know your critiques through comments and/or appreciation through upvotes. TIA ****