# Porto Seguro Safe Driver Prediction Competition

https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data

In this competition, we will predict the probability that an auto insurance policy holder files a claim.

Please read the competition's Overview and Evaluation pages first.

The `target` column indicates whether or not a claim was filed for that policy holder.

## This notebook focuses first on handling the imbalanced classes.

# Import Libraries

In [None]:
import numpy as np  
import pandas as pd  

import matplotlib.pyplot as plt
import seaborn as sns




In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Set the class labels

In [None]:
LABELS = ["No Claim Filed", "Claim Filed"]

# Data Load

In [None]:
data = pd.read_csv('/kaggle/input/porto-seguro-safe-driver-prediction/train.csv')

In [None]:
data.head()

In [None]:
data.shape

# Target / Class Exploration

Count the records in each `target` class.

In [None]:
data.target.value_counts()

As we already know, the proportion of records with `target` = 1 (Claim Filed) is far smaller than the proportion with `target` = 0 (No Claim Filed).

This can lead to a model that has great accuracy but does not add any value in practice, because predicting the majority class for every record is already right most of the time.
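To make the accuracy problem concrete, here is a minimal sketch with made-up labels (the 960/40 split is an assumption for illustration, not the real class ratio of this dataset). A "model" that always predicts "No Claim Filed" scores high accuracy while detecting no claims at all.

```python
# Hypothetical labels: 960 "no claim" (0) and 40 "claim" (1); not the real data.
y_true = [0] * 960 + [1] * 40

# A "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: share of actual claims that were detected.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

Accuracy comes out at 0.96 even though recall on the claim class is 0.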

In [None]:
# Pass the column as a keyword: recent seaborn versions reject positional data.
sns.countplot(x=data.target)
plt.xlabel('Is Filed Claim?')
plt.ylabel('Number of occurrences')
plt.show()

Let's visualize the same distribution again, this time with readable class labels on the axis.

In [None]:
# Series.value_counts replaces the deprecated top-level pd.value_counts.
count_classes = data['target'].value_counts(sort=True)

count_classes.plot(kind='bar', rot=0)
plt.title("Claims Distribution")
plt.xticks(range(2), LABELS)
plt.xlabel("Claims")
plt.ylabel("Frequency")
plt.show()

# Techniques to handle Class Imbalance

A widely adopted technique for dealing with highly unbalanced datasets is called resampling. 

It consists of removing samples from the majority class (under sampling) and / or adding more examples from the minority class (over sampling).

Despite the advantage of balancing classes, these techniques also have their weaknesses. 

* The simplest implementation of over-sampling is to duplicate random records from the minority class, which can cause overfitting. 
* In under-sampling, the simplest technique involves removing random records from the majority class, which can cause loss of information.
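The two naive strategies above can be sketched with plain NumPy index sampling. This is only an illustration on made-up toy labels (`y_toy` and the index names are hypothetical); it is not how `imblearn` is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy labels: 8 majority (0) and 2 minority (1) samples.
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
idx_major = np.flatnonzero(y_toy == 0)
idx_minor = np.flatnonzero(y_toy == 1)

# Random over-sampling: draw minority indices WITH replacement until the
# minority class is as large as the majority class (rows get duplicated).
idx_over = np.concatenate([
    idx_major,
    rng.choice(idx_minor, size=len(idx_major), replace=True),
])

# Random under-sampling: keep only as many majority rows as there are
# minority rows (the other majority rows are discarded).
idx_under = np.concatenate([
    rng.choice(idx_major, size=len(idx_minor), replace=False),
    idx_minor,
])

print(np.bincount(y_toy[idx_over]))   # class counts after over-sampling
print(np.bincount(y_toy[idx_under]))  # class counts after under-sampling
```

After over-sampling, the eight minority rows are copies of only two distinct records, which is where the overfitting risk comes from. After under-sampling, six of the eight majority rows are gone, which is the information loss.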

To implement these resampling techniques we are going to use the Python imbalanced-learn module (`imblearn`). It is compatible with scikit-learn and is part of the scikit-learn-contrib projects.

Other resampling techniques, such as SMOTE and SMOTETomek, are also available, and we will see them below.

Before we start handling the imbalance, let's split the data into `X` and `y`.

In [None]:
X = data.drop('target', axis = 1)
y = data.target

data.shape, X.shape, y.shape

In [None]:
# y_us.value_counts()[0] 
# X.shape[0]

Let's collect the counts in a DataFrame so the differences between techniques can be seen at a glance.

In [None]:
data_table = pd.DataFrame()

data_table['technique'] = ['Original Data']
data_table['X_Shape'] = [X.shape[0]]
data_table['y_Shape'] = [y.shape[0]]
data_table['target_0'] = [y.value_counts()[0]]
data_table['target_1'] = [y.value_counts()[1]]

data_table

# 1: Under Sampling

# 1.1 Under Sampling using NearMiss

In [None]:
from imblearn.under_sampling import NearMiss

In [None]:
nm = NearMiss()

In [None]:
X_us, y_us = nm.fit_resample(X, y)  # fit_sample was removed from imblearn

In [None]:
print('Shape for Imbalanced Class :')
display(X.shape, y.shape)
print('Count of target : {} '.format(y.value_counts()))


print('Shape for Balanced Class :')
display(X_us.shape, y_us.shape)
print('Count of target : {} '.format(y_us.value_counts()))

In [None]:
new_row = {
    'technique': 'Under Sampling - NearMiss',
    'X_Shape': X_us.shape[0],
    'y_Shape': y_us.shape[0],
    'target_0': y_us.value_counts()[0],
    'target_1': y_us.value_counts()[1]
}
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
data_table = pd.concat([data_table, pd.DataFrame([new_row])], ignore_index=True)

data_table

# 1.2 Under Sampling using RandomUnderSampler
`RandomUnderSampler` is a fast and easy way to balance the data by randomly selecting a subset of data for the targeted classes. 

Under-sample the majority class(es) by randomly picking samples with or without replacement.

In [None]:
from imblearn.under_sampling import RandomUnderSampler

In [None]:
rus = RandomUnderSampler(random_state=42, replacement=True)  
X_rus, y_rus = rus.fit_resample(X, y)

In [None]:
new_row = {
    'technique': 'Under Sampling - RandomUnderSampler', 
    'X_Shape': X_rus.shape[0], 
    'y_Shape':y_rus.shape[0], 
    'target_0': y_rus.value_counts()[0], 
    'target_1' : y_rus.value_counts()[1]
}

data_table = pd.concat([data_table, pd.DataFrame([new_row])], ignore_index=True)

data_table

# 2: Over Sampling
One way to fight imbalanced data is to generate new samples in the minority class. The most naive strategy is to generate new samples by randomly sampling, with replacement, from the currently available samples. The `RandomOverSampler` offers such a scheme.

In [None]:
from imblearn.over_sampling import RandomOverSampler

In [None]:
# Named `ros` so that it does not shadow the `os` module imported earlier.
ros = RandomOverSampler()  # Default sampling_strategy='auto'

In [None]:
X_os, y_os = ros.fit_resample(X, y)

In [None]:
new_row = {
    'technique': 'Over Sampling - Auto', 
    'X_Shape': X_os.shape[0], 
    'y_Shape':y_os.shape[0], 
    'target_0': y_os.value_counts()[0], 
    'target_1' : y_os.value_counts()[1]
}
data_table = pd.concat([data_table, pd.DataFrame([new_row])], ignore_index=True)

data_table

In [None]:
# A float sampling_strategy is the desired minority / majority ratio after resampling.
os2 = RandomOverSampler(sampling_strategy=0.5)

X_os2, y_os2 = os2.fit_resample(X, y)

In [None]:
new_row = {
    'technique': 'Over Sampling - half', 
    'X_Shape': X_os2.shape[0], 
    'y_Shape':y_os2.shape[0], 
    'target_0': y_os2.value_counts()[0], 
    'target_1' : y_os2.value_counts()[1]
}
data_table = pd.concat([data_table, pd.DataFrame([new_row])], ignore_index=True)

data_table

# 3: SMOTE
`SMOTE` (Synthetic Minority Oversampling TEchnique) consists of synthesizing elements for the minority class, based on those that already exist. It works by randomly picking a point from the minority class and computing the k-nearest neighbors for this point. The synthetic points are added between the chosen point and its neighbors.

This technique generates synthetic data for the minority class.

https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html 
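The interpolation step described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the `imblearn` implementation: the function name `smote_sketch`, its parameters, and the toy points are all assumptions made for this example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        # 1. Pick a random minority sample.
        j = rng.integers(len(X_min))
        # 2. Find its k nearest minority neighbours (position 0 is the point itself).
        dist = np.linalg.norm(X_min - X_min[j], axis=1)
        neighbours = np.argsort(dist)[1:k + 1]
        # 3. Interpolate at a random position on the segment to one neighbour.
        nb = rng.choice(neighbours)
        gap = rng.random()
        synthetic[i] = X_min[j] + gap * (X_min[nb] - X_min[j])
    return synthetic

# Hypothetical 2-D minority samples, for illustration only.
X_min = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
X_new = smote_sketch(X_min, n_new=5)
print(X_new.shape)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already covered by the minority class instead of being exact duplicates.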

In [None]:
from imblearn.over_sampling import SMOTE


In [None]:
smote = SMOTE(sampling_strategy='minority')
X_smote, y_smote = smote.fit_resample(X, y)

In [None]:
new_row = {
    'technique': 'SMOTE - minority', 
    'X_Shape': X_smote.shape[0], 
    'y_Shape':y_smote.shape[0], 
    'target_0': y_smote.value_counts()[0], 
    'target_1' : y_smote.value_counts()[1]
}
data_table = pd.concat([data_table, pd.DataFrame([new_row])], ignore_index=True)

data_table

Other parameters, such as `random_state` and `k_neighbors`, can be changed as well.

# 4: SMOTETomek
It is a combination of over-sampling and under-sampling, using the SMOTE and Tomek links techniques.
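A Tomek link is a pair of samples from different classes that are each other's nearest neighbor; such pairs sit on the class boundary and are candidates for removal during the cleaning step. The sketch below finds them by brute force on made-up data. The helper `tomek_links` and the toy arrays are hypothetical and only meant to illustrate the definition, not the `imblearn` internals.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) of opposite-class mutual nearest neighbours."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances; the diagonal is masked so a point is
    # never its own nearest neighbour.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    return [
        (i, int(j)) for i, j in enumerate(nearest)
        if i < j and nearest[j] == i and y[i] != y[j]
    ]

# Hypothetical 1-D toy data: the points at 2.0 and 2.2 belong to
# different classes and are each other's nearest neighbour.
X_toy = np.array([[0.0], [1.0], [2.0], [2.2], [4.0]])
y_toy = np.array([0, 0, 0, 1, 1])
print(tomek_links(X_toy, y_toy))
```

On this toy data the only Tomek link is the pair of indices (2, 3). The brute-force distance matrix is quadratic in the number of rows, so this form is for understanding only and would not scale to the full training set.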

In [None]:
from imblearn.combine import SMOTETomek

In [None]:
smk = SMOTETomek(random_state=9)
X_smk, y_smk = smk.fit_resample(X, y)

In [None]:
new_row = {
    'technique': 'SMOTETomek_9', 
    'X_Shape': X_smk.shape[0], 
    'y_Shape':y_smk.shape[0], 
    'target_0': y_smk.value_counts()[0], 
    'target_1' : y_smk.value_counts()[1]
}
data_table = pd.concat([data_table, pd.DataFrame([new_row])], ignore_index=True)

data_table

You can also try other values of `random_state`.

Note: Any of these techniques can be chosen depending on the problem, but make sure to use an appropriate evaluation metric.

Do not rely on the accuracy score alone.

Some of the metrics that tend to work better with such imbalanced datasets are listed below. Consider trying them.
1. Confusion Matrix
2. Precision
3. Recall
4. F1-Score
5. AUC-ROC Curve
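As a rough illustration of what these metrics measure, the sketch below computes them by hand on made-up labels and scores (both lists and the 0.5 threshold are assumptions for this example). In practice `sklearn.metrics` provides ready-made functions for all of them.

```python
# Hypothetical true labels and predicted scores, for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.2, 0.3, 0.1, 0.6, 0.2, 0.8, 0.4, 0.9, 0.3]
y_pred = [int(s >= 0.5) for s in y_score]  # threshold at 0.5

# Confusion matrix cells.
pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)

precision = tp / (tp + fp)  # how many predicted claims were real claims
recall = tp / (tp + fn)     # how many real claims were detected
f1 = 2 * precision * recall / (precision + recall)

# AUC-ROC as the probability that a random positive is scored above a
# random negative (ties count as one half).
pos = [s for t, s in zip(y_true, y_score) if t == 1]
neg = [s for t, s in zip(y_true, y_score) if t == 0]
auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

print(f"confusion matrix: [[{tn}, {fp}], [{fn}, {tp}]]")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} auc={auc:.3f}")
```

Unlike accuracy, precision, recall and F1 are computed from the minority-class cells of the confusion matrix, and AUC depends only on how the scores rank positives against negatives, so none of them is inflated simply by predicting the majority class.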