# $$Health-Insurance-Cross-Sell-Prediction$$


![how-to-get-a-low-car-insurance-quote-1.jpg](attachment:how-to-get-a-low-car-insurance-quote-1.jpg)

# Prelude

Our client is an insurance company that has provided health insurance to its customers. They now need your help in building a model to predict whether the policyholders (customers) from the past year will also be interested in the vehicle insurance provided by the company.

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

For example, you may pay a premium of Rs. 5,000 each year for a health insurance cover of Rs. 200,000 so that if, God forbid, you fall ill and need to be hospitalised in that year, the insurance company will bear the cost of hospitalisation etc. up to Rs. 200,000. If you are wondering how the company can bear such a high hospitalisation cost when it charges a premium of only Rs. 5,000, that is where the concept of probabilities comes into the picture. Like you, there may be 100 customers paying a premium of Rs. 5,000 every year, but only a few of them (say 2-3) would be hospitalised that year. This way everyone shares the risk of everyone else.
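
The risk-sharing arithmetic above can be checked in a few lines. This is a minimal sketch that uses only the illustrative figures from the example (100 customers, a Rs. 5,000 premium, a Rs. 200,000 cover). The variable names are made up for illustration, and it assumes every claim uses the full cover, which is the worst case for the insurer.

```python
# Illustrative figures taken from the example above; names are hypothetical.
premium_per_customer = 5_000      # Rs. paid by each customer per year
cover_per_customer = 200_000      # Rs. maximum payout per customer
n_customers = 100

# Total premium collected by the insurer in one year.
premium_pool = premium_per_customer * n_customers

# Worst case: every hospitalised customer claims the full cover.
for n_claims in (2, 3):
    max_payout = n_claims * cover_per_customer
    print(f"{n_claims} claims: pool = {premium_pool}, "
          f"max payout = {max_payout}, balance = {premium_pool - max_payout}")

# Claim rate at which the pool exactly covers full-cover claims.
break_even_rate = premium_per_customer / cover_per_customer
print(f"Break-even claim rate: {break_even_rate:.1%}")
```

With these numbers the pool covers full-cover claims as long as no more than 2.5% of customers claim, which is consistent with the 2-3 hospitalisations per 100 customers mentioned in the example.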

Just like medical insurance, there is vehicle insurance, where every year the customer pays a premium of a certain amount to the insurance company so that, in case of an unfortunate accident involving the vehicle, the company will provide compensation (called the 'sum assured') to the customer.

# Objective

Build a model to predict whether a policyholder would be interested in Vehicle Insurance. 

# Notebook 
   - Step-1: **Importing Libraries and dataset**
   - Step-2: **Exploratory Data Analysis**
   - Step-3: **Data Cleaning and Preprocessing**
   - Step-4: **Data Modelling and Evaluation**
   - Step-5: **Result and Conclusion**

# **Step-1: $$Importing-Libraries-and-Dataset$$**

## Libraries

In [None]:
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as ply

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

import tensorflow as tf

## Dataset
   - 1.a. Training Dataset

In [None]:
dataset= pd.read_csv('train.csv')
dataset.head()

In [None]:
# Checking Missing values
dataset.isna().sum()

In [None]:
# Identifying datatypes and shape of training dataset
print(dataset.shape)
dataset.dtypes

In [None]:
# Drop ID
dataset.drop(labels='id',axis=1,inplace=True)
dataset.head()

In [None]:
# numeric features in dataset
dataset_num_features = ['Age', 'Region_Code','Annual_Premium','Policy_Sales_Channel','Vintage']

# Categorical features in dataset
dataset_cat_features = ['Gender','Driving_License','Previously_Insured','Vehicle_Age','Vehicle_Damage']

# Step-2 $$Exploratory-Data-Analysis$$

### 2.a - Numeric features

In [None]:
print(dataset_num_features)

In [None]:
dataset[dataset_num_features].describe()

### Data Distribution
   - **numeric features**

In [None]:
for i in dataset_num_features:   
    plt.figure(figsize=(10,8),dpi=100)
    sns.violinplot(x="Response",y=i, data=dataset)
    plt.title(f"Response by {i}")
    plt.show()

**<u>Inference</u>**:
   - Age: 
       - Customers are between 20 and 85 years old.
       - Three quarters of the customers are below 50 years of age.

   - Annual_Premium:
       - The average annual premium is ~30.5k.
       
   - Vintage: 
       - On average, customers have been associated with the company for approximately 5 months.
       - Half of the customers have been associated with the company for between ~3 and ~7 months.
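
The figures above can be read off from quantiles rather than from the plots. The sketch below uses a small made-up frame in place of `train.csv`, and it assumes `Vintage` is recorded in days (the notebook does not state the unit), converting to months with a rough 30-day month.

```python
import pandas as pd

# Hypothetical stand-in for the real data; replace with `dataset` in the notebook.
toy = pd.DataFrame({
    "Age": [21, 24, 25, 28, 33, 41, 47, 52, 60, 75],
    "Vintage": [10, 45, 82, 120, 154, 160, 200, 227, 260, 299],
})

# Range and upper quartile of Age.
age_min, age_max = toy["Age"].min(), toy["Age"].max()
age_q3 = toy["Age"].quantile(0.75)

# Assumption: Vintage is in days; 30 days is used as an approximate month.
vintage_months = toy["Vintage"] / 30
vintage_mean = vintage_months.mean()
vintage_q1, vintage_q3 = vintage_months.quantile([0.25, 0.75])

print(f"Age: {age_min} to {age_max}, 75% of customers are at most {age_q3:.0f}")
print(f"Vintage: mean {vintage_mean:.1f} months, "
      f"middle half between {vintage_q1:.1f} and {vintage_q3:.1f} months")
```

The same quantities are available in one call through `dataset[dataset_num_features].describe()`, whose `25%`, `50%` and `75%` rows are the quartiles used here.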


### Converting numeric features to categorical features

In [None]:
 

# Age
Age_range = pd.Series(pd.cut(dataset.Age, bins = 6, precision = 0),name='Age_range')

# Region_Code
Region_Code_range = pd.Series(pd.cut(dataset.Region_Code, bins = 10, precision = 0),name='Region_Code_range')

# Annual_Premium
Annual_Primium_range = pd.Series(pd.cut(dataset.Annual_Premium, bins = 5, precision = 0),name='Annual_Primium_range')

#Policy_Sales_Channel
Policy_Sales_Channel_range = pd.Series(pd.cut(dataset.Policy_Sales_Channel,bins = 10, precision = 0),name='Policy_Sales_Channel_range')

#Vintage
Vintage_range =pd.Series(pd.cut(dataset.Vintage,bins = 5, precision = 0),name='Vintage_range')

In [None]:
# Modified Categorical Dataset:

mod_dataset = pd.concat([Age_range,
                Region_Code_range,
                Annual_Primium_range,
                Policy_Sales_Channel_range,
                Vintage_range], 
               axis=1)

print('Modified Dataset Shape :', mod_dataset.shape,'\n______________________________________\n')
print('New Columns:',mod_dataset.columns,'\n______________________________________\n' )
print(mod_dataset.dtypes,'\n______________________________________\n')

mod_dataset.head()

In [None]:
#Age_range
plt.figure(figsize=(10,7),dpi=300)
sns.countplot(x=Age_range, hue=dataset.Response)
plt.xticks()
plt.xlabel('Age_range',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.show()

#Region_Code_range
plt.figure(figsize=(10,7),dpi=300)
sns.countplot(x=Region_Code_range, hue=dataset.Response)
plt.xticks(fontsize=8)
plt.xlabel('Region_Code_range',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.show()

#Annual_Primium_range
plt.figure(figsize=(10,7),dpi=300)
sns.countplot(x=Annual_Primium_range, hue=dataset.Response)
plt.xticks(fontsize=8)
plt.xlabel('Annual_Primium_range',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.show()

#Policy_Sales_Channel_range
plt.figure(figsize=(10,7),dpi=300)
sns.countplot(x=Policy_Sales_Channel_range, hue=dataset.Response)
plt.xticks(fontsize=8)
plt.xlabel('Policy_Sales_Channel_range',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.show()

#Vintage_range
plt.figure(figsize=(10,7),dpi=300)
sns.countplot(x=Vintage_range, hue=dataset.Response)
plt.xticks(fontsize=8)
plt.xlabel('Vintage_range',fontsize=14)
plt.ylabel('Count',fontsize=14)

plt.show()

# plt.subplot(2,3,6)
# sns.countplot(x=Age_range)
# plt.xticks(fontsize=8)
# plt.xlabel('Age_range',fontsize=14)
# plt.ylabel('Count',fontsize=14)


### - Age distribution in the dataset and positive response with respect to Age

In [None]:
sns.displot(dataset.Age,kde=True, height=6,aspect=2,color=None,label = "Total customers", bins=65)
plt.legend()
sns.displot(dataset.Age.loc[dataset.Response == 1],kde=True,height=6,aspect=2,color='Purple',label = "Positive Response", bins=64)
plt.legend()

**<u>Inference:</u>**

**Age**
   - The majority of the customers are in the 20-30 age group.
   - Most of the negative responses came from this age group.
   - In contrast, customers aged 30-60 were more positive towards buying vehicle insurance.
   
**Region Code**
  - The region code doesn't seem to influence the response. The largest customer base is in region codes 26 to 31, and the majority of customers come from these regions in both the positive and negative response groups.
   
**Annual Premium**
   - The Annual Premium also doesn't seem to influence the response.
   
**Policy Sales Channel**
   - The Policy Sales Channel range 147-163 showed a significant spike in negative responses, but overall the Policy Sales Channel doesn't seem to influence the response.

**Response by Vintage**
   - The data distribution appears to be similar in the positive and negative response groups.
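
Count plots show absolute counts, so large bins dominate the picture. One way to compare bins directly is the share of positive responses per bin. The following is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical); in the notebook the same pattern would apply to `dataset` and any of the binned features.

```python
import pandas as pd

# Hypothetical toy data; in the notebook, use `dataset` instead.
toy = pd.DataFrame({
    "Age":      [22, 23, 25, 27, 29, 34, 38, 42, 45, 51, 55, 58, 64, 70],
    "Response": [0,  0,  0,  0,  1,  0,  1,  1,  0,  1,  0,  1,  0,  0],
})

# Equal-width bins, as pd.cut produces when `bins` is an integer.
age_range = pd.cut(toy["Age"], bins=3, precision=0)

# Mean of a 0/1 target per bin is the share of positive responses in that bin.
rate_by_age = (
    toy.groupby(age_range, observed=False)["Response"]
       .agg(customers="count", positive_rate="mean")
)
print(rate_by_age)
```

A feature with little influence on the response would be expected to show a roughly constant `positive_rate` across bins, whatever the bin sizes are.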

### 2.b - Categorical features

In [None]:
dataset_cat_features

In [None]:
# Getting dummies for binary categorical features (stored in dataset_mod1):
dataset_mod1=pd.get_dummies(dataset, columns=['Gender','Driving_License','Previously_Insured','Vehicle_Damage'],
                            drop_first=True)

In [None]:
# Getting dummies for non-binary categorical features in dataset_mod1:
dataset_mod1=pd.get_dummies(dataset_mod1, columns=['Vehicle_Age'])

In [None]:
print('\n columns :',dataset_mod1.columns)

print('\n________________________\n Shape:',dataset_mod1.shape)

dataset_mod1.head()

In [None]:
dataset_mod1_cat_features = ['Gender_Male',
                             'Driving_License_1',
                             'Previously_Insured_1',
                             'Vehicle_Age_1-2 Year',
                             'Vehicle_Age_< 1 Year',
                             'Vehicle_Age_> 2 Years',
                             'Vehicle_Damage_Yes']

#### Data Counts in Categorical features:

In [None]:
# Data Counts and basic information in Catagorical features
for category in dataset_mod1_cat_features:
    print(dataset_mod1[category].value_counts(), '\n____________________________________\n')


In [None]:
# Plotting counts

plt.figure(figsize=(30,20),dpi=300)
#Gender
plt.subplot(2,3,1)
sns.countplot(x=dataset_mod1.Gender_Male)
plt.xlabel('Gender_Male',fontsize=14)
plt.ylabel('Count',fontsize=14)


# Driving_License
plt.subplot(2,3,2)
sns.countplot(x=dataset_mod1.Driving_License_1)
plt.xlabel('Driving_License_1',fontsize=14)
plt.ylabel('Count',fontsize=14)

# Previously_Insured
plt.subplot(2,3,3)
sns.countplot(x=dataset_mod1.Previously_Insured_1)
plt.xlabel('Previously_Insured_1',fontsize=14)
plt.ylabel('Count',fontsize=14)

# Vehicle_Age
plt.subplot(2,3,4)
sns.countplot(x=dataset_mod1['Vehicle_Age_> 2 Years'])
plt.xlabel('Vehicle_Age_> 2 Years',fontsize=14)
plt.ylabel('Count',fontsize=14)

# Vehicle_Damage
plt.subplot(2,3,5)
sns.countplot(x=dataset_mod1.Vehicle_Damage_Yes)
plt.xlabel('Vehicle_Damage_Yes',fontsize=14)
plt.ylabel('Count',fontsize=14)

#Response
plt.subplot(2,3,6)
sns.countplot(x=dataset.Response)
plt.xlabel('Response',fontsize=14)
plt.ylabel('Count',fontsize=14)

plt.show()

# Response is 0/1, so its mean is the share of positive responses
print(f"Positive Response - {dataset.Response.mean() * 100:.2f}%")



**<u>Inference</u>**:
   - The number of male customers is slightly higher than the number of female customers.
   - The majority of customers have a driving license.
   - The number of previously insured customers is slightly lower than the number of previously uninsured customers.
   - The majority of customers have a vehicle that is less than 2 years old.
   - The dataset has slightly more customers with previous vehicle damage than customers with no previous vehicle damage.
   - Only about 12% of the 381109 customers responded positively to the additional vehicle insurance offer, which gives 46710 positive responses.
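
The class imbalance in the last point matters for evaluation later on. As a quick check, using only the two counts quoted above, the positive rate and the accuracy of a trivial always-negative model can be computed as follows (variable names are illustrative).

```python
# Counts quoted in the inference above.
n_customers = 381_109
n_positive = 46_710

positive_rate = n_positive / n_customers
# Accuracy of a trivial model that always predicts "not interested" (class 0).
majority_baseline_accuracy = 1 - positive_rate

print(f"Positive rate: {positive_rate:.2%}")
print(f"Always-negative baseline accuracy: {majority_baseline_accuracy:.2%}")
```

Any model scoring an accuracy close to this baseline has not necessarily learned anything about the positive class, so precision, recall and AUC are more informative for this dataset.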

#### Data Distribution in Categorical features

In [None]:
#data distribution of categorical features in both response group
for i in dataset_mod1_cat_features:   
    plt.figure(figsize=(10,8),dpi=100)
    sns.violinplot(x="Response",y=i, data=dataset_mod1)
    plt.title(f"Response by {i}")
    plt.show()


**<u>Inference</u>**:
- Based on the dataset, $Gender$, $DrivingLicense$ and $RegionCode$ don't influence the response to the vehicle insurance offer.

- A majority of the customers who gave a positive response are not $PreviouslyInsured$.

- Customers with newer cars are less willing to buy vehicle insurance.

- Customers with no previous experience of vehicle damage are more reluctant to buy vehicle insurance.

- Customers from Policy_Sales_Channel around 25 and 125 showed a more positive response.
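
For binary features, a normalised cross-tabulation gives the same information as the violin plots in numeric form. This is a sketch on made-up data (the `toy` frame is hypothetical and its values are not taken from the dataset).

```python
import pandas as pd

# Hypothetical toy data; in the notebook, use `dataset` instead.
toy = pd.DataFrame({
    "Previously_Insured": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "Vehicle_Damage":     ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No", "Yes", "No"],
    "Response":           [1, 0, 1, 0, 1, 0, 0, 0, 0, 0],
})

# normalize="index" makes each row sum to 1, so the column for Response == 1
# is the positive-response rate within each category.
for feature in ["Previously_Insured", "Vehicle_Damage"]:
    table = pd.crosstab(toy[feature], toy["Response"], normalize="index")
    print(table, "\n")

insured_table = pd.crosstab(toy["Previously_Insured"], toy["Response"], normalize="index")
```

Reading across a row shows how the response splits within one category, which makes statements such as "customers who are not previously insured respond more positively" directly checkable.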


#### Correlation between features

In [None]:
plt.figure(figsize=(10,7),dpi=100)
plt.title("Correlation plot")
sns.heatmap(dataset_mod1.corr(),linewidths=5, annot=True,annot_kws={'size': 8},cmap='coolwarm')

### - Premium Vs Age :

In [None]:
plt.figure(figsize=(10,7),dpi=100)
sns.scatterplot(x=dataset_mod1.Annual_Premium,y=dataset.Age, hue=dataset.Response)
plt.show()

- Most of the customers who responded yes are paying less than 10K as Annual Premium.

### - Premium vs Vintage 

In [None]:
plt.figure(figsize=(10,7),dpi=100)
sns.scatterplot(x=dataset_mod1.Annual_Premium,y=dataset.Vintage, hue=dataset.Response)
plt.show()

- The majority of the customers who responded yes have been with the company for more than 3 months.

### - Age Vs Premium

In [None]:
plt.figure(figsize=(10,7),dpi=100)
sns.scatterplot(x=dataset_mod1.Age,y=dataset.Annual_Premium, hue=dataset.Response)
plt.show()

- Most of the customers who responded yes are between 30 and 60 years of age and are paying less than 10K as Annual Premium.

# Step-3 $$Data-Cleaning-and-Preprocessing$$

In [None]:
# Columns in dataset_mod1
print(dataset_mod1.columns)

In [None]:
# Arrange columns in dataset_mod1:
dataset_mod1= dataset_mod1[['Age', 'Region_Code', 'Annual_Premium', 'Policy_Sales_Channel',
       'Vintage', 'Gender_Male', 'Driving_License_1',
       'Previously_Insured_1', 'Vehicle_Damage_Yes', 'Vehicle_Age_1-2 Year',
       'Vehicle_Age_< 1 Year', 'Vehicle_Age_> 2 Years','Response']]

In [None]:
# Renaming columns in dataset_mod1 to prevent future problems with XGBClassifier
dataset_mod1=dataset_mod1.rename(columns={'Vehicle_Age_1-2 Year':'Vehicle_Age_1_to_2 Year','Vehicle_Age_< 1 Year':'Vehicle_Age_lessthan_1_Year',
                             'Vehicle_Age_> 2 Years':'Vehicle_Age_morethan_2 Years'})
dataset_mod1

In [None]:
print('\n Shape:', dataset_mod1.shape,'\n__________________________\n')
print('Column:',dataset_mod1.columns)

In [None]:
# Previous column names:
print(dataset_mod1_cat_features)

In [None]:
# modified column names: 
dataset_mod1_cat_features = ['Gender_Male', 'Driving_License_1', 'Previously_Insured_1',
'Vehicle_Damage_Yes', 'Vehicle_Age_1_to_2 Year',
       'Vehicle_Age_lessthan_1_Year', 'Vehicle_Age_morethan_2 Years']
print(dataset_mod1_cat_features)

In [None]:
# Assigned independent and the target features.

X = dataset_mod1.iloc[:, 0:-1]
Y = dataset_mod1.iloc[:, -1]

print(X.shape)
print(Y.shape)

In [None]:
print(X)

In [None]:
print(Y)

### Encoding the dependent/target/response Variable

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
Y= le.fit_transform(Y)
print(Y)

## Splitting the dataset into the Training set and Test/validation set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.01, random_state = 0)

In [None]:
print(X_train.shape)
print(X_train)

In [None]:
print(X_test.shape)
print(X_test)

In [None]:
print(Y_train.shape)
print(Y_train)

In [None]:
print(Y_test.shape)
print(Y_test)
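
Because only about 12% of responses are positive and the split above holds out just 1% of the rows, the class ratio in the test set can drift from the ratio in the full data. A stratified split is one way to keep it stable. The sketch below runs on made-up arrays (`X_toy`, `y_toy` and the 20% test size are assumptions for illustration, not the notebook's settings).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 12% positives, mimicking the imbalance.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 120 + [0] * 880)

# stratify=y_toy keeps the class ratio (almost) identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate: ", y_te.mean())
```

In the notebook this would amount to passing `stratify=Y` to the existing `train_test_split` call.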

## Feature Scaling

In [None]:
# from sklearn.preprocessing import StandardScaler
# sc = StandardScaler()
# X_train = sc.fit_transform(X_train)
# X_test = sc.transform(X_test)

In [None]:
# print(X_train.shape)
# print(X_train)

In [None]:
# print(X_test.shape)
# print(X_test)

# Step-4 $$Data-Modelling-and-Evaluation$$

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, roc_auc_score, precision_recall_curve, auc, roc_curve, recall_score, classification_report 

## 4.1 - Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

## 4.1.a - RF (Default)

In [None]:
RF_classifier = RandomForestClassifier()

In [None]:
RF_classifier.fit(X_train, Y_train)

### Predicting the Test/Validation set response

In [None]:
Y_pred = RF_classifier.predict(X_test)
print(np.concatenate((Y_pred.reshape(len(Y_pred),1), Y_test.reshape(len(Y_test),1)),1))

### Classification Report


In [None]:
print (classification_report(Y_test, Y_pred))

### Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(Y_test, Y_pred)

print(cm)
print(f"Accuracy Score :{accuracy_score(Y_test, Y_pred)}")

plt.figure(figsize=(10,5),dpi=80)
sns.heatmap(cm/np.sum(cm), annot=True, fmt='.2', cmap='Blues')
plt.xlabel('Predicted label',fontsize=14)
plt.ylabel('True label',fontsize=14)
plt.show()
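
The numbers in the classification report can be derived from the four cells of the confusion matrix. The sketch below does this by hand on made-up labels (`y_true` and `y_hat` are hypothetical) and is meant only to show how the cells map to the metrics.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted positives, how many were right
recall = tp / (tp + fn)      # of actual positives, how many were found

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```

In this toy case the accuracy looks reasonable while only one of the three positives is found, which is the pattern to watch for on an imbalanced target.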

### K-fold Cross validation

In [None]:
from sklearn.model_selection import cross_val_score

CV_accuracies = cross_val_score(estimator = RF_classifier, X = X_train, y = Y_train, cv = 10)
print("Accuracy: {:.2f} %".format(CV_accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(CV_accuracies.std()*100))
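
`cross_val_score` reports accuracy by default for a classifier, which can look high on an imbalanced target even when the minority class is poorly predicted. As a hedged sketch, the same folds can be scored with ROC AUC by passing `scoring="roc_auc"`. The data here comes from `make_classification` and is purely synthetic; the fold count and forest size are arbitrary choices for a quick run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical imbalanced toy problem (about 12% positives).
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, weights=[0.88, 0.12], random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Same folds, two metrics: accuracy versus ROC AUC.
acc = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="accuracy")
auc_scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="roc_auc")

print(f"Accuracy: {acc.mean():.3f} +/- {acc.std():.3f}")
print(f"ROC AUC:  {auc_scores.mean():.3f} +/- {auc_scores.std():.3f}")
```

When `cv` is an integer and the estimator is a classifier, scikit-learn already uses stratified folds; the explicit `StratifiedKFold` only adds shuffling with a fixed seed.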

### Feature Importances

In [None]:
feat_importances = pd.Series(RF_classifier.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')

### Plotting AUC/ROC

In [None]:
Y_pred_proba = RF_classifier.predict_proba(X_test)
(fpr, tpr,_) = roc_curve(Y_test, Y_pred_proba[:,1])

plt.figure(figsize=(10,7),dpi=100)
plt.plot(fpr,tpr)
plt.title('Receiver operating characteristic Curve: HICSP')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR): Recall')
plt.plot((0,1), ls='dashed',color='black')
plt.show()
print ('Area under curve (AUC): ', auc(fpr,tpr))
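
The ROC curve uses the false positive rate on the x-axis, which is not the same quantity as precision. For an imbalanced target, the precision-recall view is a useful complement; `precision_recall_curve` is already imported above. The following sketch computes both summaries on made-up scores (`y_true` and `y_score` are hypothetical values, not model output).

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score)

# Hypothetical labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80, 0.55, 0.90])

# ROC AUC: probability that a random positive is ranked above a random negative.
roc_auc = roc_auc_score(y_true, y_score)

# Precision-recall view, which focuses on the positive (minority) class.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
avg_precision = average_precision_score(y_true, y_score)

print(f"ROC AUC: {roc_auc:.3f}")
print(f"Average precision: {avg_precision:.3f}")
```

In the notebook, `Y_test` and `Y_pred_proba[:, 1]` would take the place of the toy arrays.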

# Undersampling:

In [None]:
from sklearn.pipeline import make_pipeline
from imblearn.pipeline import make_pipeline as make_pipeline_imb
from imblearn.under_sampling import NearMiss
from imblearn.metrics import classification_report_imbalanced

### Build model with nearmiss

In [None]:
# build model with nearmiss
nearmiss_pipeline = make_pipeline_imb(NearMiss(), RF_classifier)

In [None]:
nearmiss_RF_classifier = nearmiss_pipeline.fit(X_train, Y_train)
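
To make the effect of undersampling concrete, the sketch below balances a made-up frame by random undersampling with pandas. This is not NearMiss: NearMiss chooses which majority rows to keep based on their distance to minority rows, whereas this example samples them at random. It is meant only to show how the class counts change; the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical imbalanced toy frame: 88 negatives, 12 positives.
toy = pd.DataFrame({
    "feature": range(100),
    "Response": [0] * 88 + [1] * 12,
})

n_minority = toy["Response"].value_counts().min()

# Draw the same number of rows from each class, so all minority rows are kept.
# Note: this samples at random; NearMiss instead picks majority rows by
# their distance to minority rows.
balanced = (
    toy.groupby("Response", group_keys=False)
       .sample(n=n_minority, random_state=0)
)

print("Before:", toy["Response"].value_counts().to_dict())
print("After: ", balanced["Response"].value_counts().to_dict())
```

Placing the sampler inside an imbalanced-learn pipeline, as done above, means resampling happens only when the pipeline is fitted, so the held-out rows in `X_test` keep their original class ratio.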

### Predicting the Test/Validation set response

In [None]:
Y_pred = nearmiss_RF_classifier.predict(X_test)
print(np.concatenate((Y_pred.reshape(len(Y_pred),1), Y_test.reshape(len(Y_test),1)),1))

### Classification Report


In [None]:
print (classification_report(Y_test, Y_pred))

### Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(Y_test, Y_pred)

print(cm)
print(f"Accuracy Score :{accuracy_score(Y_test, Y_pred)}")

plt.figure(figsize=(10,5),dpi=80)
sns.heatmap(cm/np.sum(cm), annot=True, fmt='.2', cmap='Blues')
plt.xlabel('Predicted label',fontsize=14)
plt.ylabel('True label',fontsize=14)
plt.show()

### K-fold Cross validation

In [None]:
from sklearn.model_selection import cross_val_score

# Cross-validate the full pipeline so undersampling is applied inside each training fold
CV_accuracies = cross_val_score(estimator = nearmiss_pipeline, X = X_train, y = Y_train, cv = 10)
print("Accuracy: {:.2f} %".format(CV_accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(CV_accuracies.std()*100))

### Feature Importances

In [None]:
feat_importances = pd.Series(RF_classifier.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')

### Plotting AUC/ROC

In [None]:
Y_pred_proba = nearmiss_RF_classifier.predict_proba(X_test)
(fpr, tpr,_) = roc_curve(Y_test, Y_pred_proba[:,1])

plt.figure(figsize=(10,7),dpi=100)
plt.plot(fpr,tpr)
plt.title('Receiver operating characteristic Curve: HICSP')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR): Recall')
plt.plot((0,1), ls='dashed',color='black')
plt.show()
print ('Area under curve (AUC): ', auc(fpr,tpr))