# Tabular Playground Series - Sep 2021

## Dataset

The dataset used for this competition is synthetic, but it is based on a real dataset and was generated using CTGAN. The original dataset deals with predicting whether a claim will be made on an insurance policy. Although the features are anonymized, they have properties relating to real-world features.

## Data Description

For this competition, you will predict whether a customer made a claim upon an insurance policy. The ground truth claim is binary valued, but a prediction may be any number from 0.0 to 1.0, representing the probability of a claim. The features in this dataset have been anonymized and may contain missing values.

## Evaluation

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
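
As a rough illustration of the metric, AUC can be read as the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row, with ties counted as half. The sketch below computes it by brute force on made-up toy values. The function name `pairwise_auc` is hypothetical, and in practice a library routine such as `sklearn.metrics.roc_auc_score` would normally be used instead.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """Brute-force AUC: fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score via broadcasting
    diff = pos[:, None] - neg[None, :]
    # A correctly ordered pair counts 1, a tie counts 0.5
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Toy labels and predicted probabilities (made up for illustration)
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

The pairwise form needs memory proportional to the number of positives times the number of negatives, so it only suits small samples.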

# Imports

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


sns.set(rc={'figure.figsize': (12, 10)})

In [None]:
from collections import Counter

In [None]:
train_filepath = '/kaggle/input/tabular-playground-series-sep-2021/train.csv'
test_filepath = '/kaggle/input/tabular-playground-series-sep-2021/test.csv'

# Loading Data

In [None]:
train_df = pd.read_csv(train_filepath)
test_df = pd.read_csv(test_filepath)

In [None]:
columns = [col for col in train_df.columns if col not in ['id', 'claim']]

# Exploratory Data Analysis

## First insights

In [None]:
full_dataset_size = train_df.shape[0] + test_df.shape[0]
print(f'Total size of both datasets: {full_dataset_size}')
print(f'Train data contains {train_df.shape[0]} rows ({round(train_df.shape[0]/full_dataset_size*100)}% of all data) and {train_df.shape[1]} columns')
print(f'Test data contains {test_df.shape[0]} rows ({round(test_df.shape[0]/full_dataset_size*100)}% of all data) and {test_df.shape[1]} columns')

In [None]:
pd.set_option("display.max_columns", None)

In [None]:
train_df.head(3)

In [None]:
test_df.head(3)

In [None]:
print(f'Total missing values in training set is {sum(train_df.isna().sum())}')
print(f'Total missing values in testing set is {sum(test_df.isna().sum())}')

In [None]:
train_df.describe()

In [None]:
test_df.describe()

In [None]:
train_df['claim'].isna().value_counts()

## Outliers

In [None]:
'''
Got inspiration from:
1. https://www.kaggle.com/harshsharma511/titanic-eda-visualization-top-ensemble-models
2. https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/
'''

def outliers_itq_method(data, n, columns):
    outlier_indices = []
    
    for col in columns:
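        # nanpercentile ignores missing values; np.percentile would return NaN for columns containing NaN
        q25, q75 = np.nanpercentile(data[col], 25), np.nanpercentile(data[col], 75)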
        iqr = q75 - q25
        
        cut_off = iqr * 1.5
        lower, upper = q25 - cut_off, q75 + cut_off
        
        outliers = data[(data[col] < lower) | (data[col] > upper)].index
        
        outlier_indices.extend(outliers)
        
    outlier_indices = Counter(outlier_indices)
    
    final_outliers = list(k for k, v in outlier_indices.items() if v > n)
    
    return final_outliers

In [None]:
train_df_outliers = outliers_itq_method(train_df, 2, columns)
print(f'outliers found: {len(train_df_outliers)}')
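
One thing worth checking with this approach: the features contain missing values, and `np.percentile` returns NaN when its input contains NaN. Comparisons against NaN bounds are always False, so no row gets flagged. A minimal sketch on made-up values, comparing it with `np.nanpercentile`:

```python
import numpy as np
import pandas as pd

# Toy column with one missing value and one obvious outlier (made-up values)
toy = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan, 100.0])

# np.percentile propagates NaN, so any bounds derived from it become NaN
q25_plain = np.percentile(toy.to_numpy(), 25)

# np.nanpercentile ignores missing values
q25, q75 = np.nanpercentile(toy.to_numpy(), [25, 75])

iqr = q75 - q25
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr
flagged = toy[(toy < lower) | (toy > upper)].index.tolist()

print(q25_plain)
print(flagged)
```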

## Observations

1. All features are continuous in both datasets;
2. Both datasets have 118 features, and both contain an id column. I will need to remove the id column from the training dataset and save the id column from the test dataset for later submissions;
3. The two datasets together have over 1.4M rows, with 957k in the training data and 493k in the test data;
4. There are 1.8M missing values in the training dataset and 936k in the test dataset. I will need to first check how many columns have missing values in each row, then set a threshold and remove rows that exceed it, and lastly prepare a pipeline for imputing values in the remaining columns;
5. The target column is named claim;
6. Some columns contain small values while others are in the thousands, so the features will need to be scaled. This applies to both datasets;
7. The target column has no missing values (which is nice);
8. The IQR outlier detection technique found no outliers. This result depends on how missing values are handled when computing the quartiles: `np.percentile` returns NaN for a column that contains NaN, and comparisons against NaN bounds flag nothing, so the check is only meaningful with NaN-aware quartiles;
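
The plan in item 4 could be sketched roughly as follows. This is a sketch under assumptions, not the notebook's actual pipeline: the toy frame, the `max_missing` threshold, and the choice of median imputation are all illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the feature columns (made-up values)
toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, np.nan],
    "f2": [10.0, 20.0, np.nan, np.nan],
    "f3": [0.1, 0.2, 0.3, np.nan],
})

# Step 1: count missing values in each row
missing_per_row = toy.isna().sum(axis=1)

# Step 2: drop rows whose missing count exceeds a threshold (hypothetical value)
max_missing = 1
kept = toy[missing_per_row <= max_missing]

# Step 3: impute what remains with the column medians (one possible strategy)
medians = kept.median()
imputed = kept.fillna(medians)

print(missing_per_row.tolist())
print(imputed)
```

In a train/test setting, the medians would usually be computed on the training data only and reused for the test data. Rows can only be dropped from the training set, since every test id still needs a prediction.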



## Dropping id columns

In [None]:
# saving testing data ids
ids = test_df['id'].copy()

# dropping id columns
train_df.drop('id', axis=1, inplace=True)
test_df.drop('id', axis=1, inplace=True)

# Visualisation

## 1. Claims count

In [None]:
# countplot tallies the rows per class directly
sns.countplot(data=train_df, y='claim')

### Observations
The claims distribution is balanced: approved and denied claims have a similar number of entries, with approved claims having a very small advantage.
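
The balance can also be checked numerically rather than read off the plot. A small sketch on a made-up target column standing in for `train_df['claim']`:

```python
import pandas as pd

# Toy target column (made-up values)
toy_claim = pd.Series([0, 1, 1, 0, 1], name="claim")

# Absolute counts and relative shares per class
counts = toy_claim.value_counts()
shares = toy_claim.value_counts(normalize=True)

print(counts.to_dict())
print(shares.round(2).to_dict())
```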

## 2. Correlations

In [None]:
train_corr = train_df.corr()

mask = np.triu(np.ones_like(train_corr, dtype=bool))

sns.heatmap(train_corr, mask=mask, center=0, square=True, cbar_kws={'shrink':.5})

### Observations

The majority of pairwise correlations lie between -0.01 and 0.01.
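
To put a number on this instead of reading it off the heatmap, the share of column pairs within a given band can be computed from the correlation matrix. The sketch below uses random toy data; the `band` value and the column names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: independent random columns (made-up, seeded for reproducibility)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(1000, 5)), columns=[f"f{i}" for i in range(1, 6)])

corr = toy.corr()

# Keep each pair once: strictly lower triangle, excluding the diagonal
lower_mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
pair_corrs = corr.to_numpy()[lower_mask]

band = 0.01  # half-width of the band around zero (illustrative)
share_in_band = float(np.mean(np.abs(pair_corrs) <= band))

print(len(pair_corrs))
print(share_in_band)
```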

## 3. Distributions of features

In [None]:
fig, axis = plt.subplots(59,2,figsize=(24,200))

for i, col in enumerate(columns):
    sns.histplot(train_df[col].values, kde=True, ax=axis[(i // 2),(i % 2)]).set(title=str(i+1))

### Observations

The majority of features are skewed, so I will probably try a log transformation before further scaling.
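
A possible way to check skewness numerically and try the transformation is sketched below. Since the anonymized features may contain zero or negative values, the sketch uses a sign-preserving `log1p` variant instead of a plain logarithm. That choice, the helper name `signed_log1p`, and the toy data are all assumptions.

```python
import numpy as np
import pandas as pd

def signed_log1p(values):
    """Sign-preserving log transform: sign(x) * log(1 + |x|); NaN stays NaN."""
    return np.sign(values) * np.log1p(np.abs(values))

# Toy right-skewed feature (made-up, seeded for reproducibility)
rng = np.random.default_rng(0)
toy = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

skew_before = toy.skew()
skew_after = signed_log1p(toy).skew()

print(round(skew_before, 2), round(skew_after, 2))
```

Comparing the skewness before and after gives a quick indication of whether the transformation helps for a given column.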