# Health Insurance Cross Sell Prediction
### Predict which health insurance policyholders will be interested in vehicle insurance

In [None]:
# First, import all the libraries for this project
import os
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cufflinks as cf

%matplotlib inline
sns.set_style('whitegrid')
warnings.filterwarnings("ignore")

# List the input files available in the Kaggle environment
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## 1. Business Understanding

#### Problem Background
An insurance company has provided health insurance to its customers. Now it wants to <b>find out whether last year's health insurance policyholders (customers)</b> would also be <b>interested in the vehicle insurance</b> provided by the company.

- Therefore, the company needs help building a model to predict whether customers will be interested in the vehicle insurance it offers.
- The prediction uses information about demographics (gender, age, region code), vehicles (vehicle age, damage), and policy (premium, sourcing channel).

#### Business Value
- Such a model helps the company plan its communication strategy to reach those customers and optimize its business model and revenue.

#### Dataset
The dataset was retrieved from: https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction

Brief explanation of the dataset:
- id: Unique ID for the customer
- Gender: Gender of the customer
- Age: Age of the customer
- Driving_License: 0 : Customer does not have a driving license (DL), 1 : Customer already has a DL
- Region_Code: Unique code for the region of the customer
- Previously_Insured: 1 : Customer already has vehicle insurance, 0 : Customer doesn't have vehicle insurance
- Vehicle_Age: Age of the vehicle
- Vehicle_Damage: 1 : Customer got his/her vehicle damaged in the past, 0 : Customer didn't get his/her vehicle damaged in the past
- Annual_Premium: The amount the customer needs to pay as premium in the year
- Policy_Sales_Channel: Anonymized code for the channel of outreach to the customer, i.e. different agents, over mail, over phone, in person, etc.
- Vintage: Number of days the customer has been associated with the company
- Response: 1 : Customer is interested, 0 : Customer is not interested

## 2. Exploratory Data Analysis 

In [None]:
df_train = pd.read_csv("/kaggle/input/health-insurance-cross-sell-prediction/train.csv")
df_train.head()

In [None]:
df_train.info()

In [None]:
df_train.shape

In [None]:
df_train.columns

In [None]:
#Statistical summary of dataset
df_train.describe()

In [None]:
# column names grouped into numerical and categorical features
df_train_numerical=['Age', 'Region_Code','Annual_Premium','Vintage']
df_train_categorical=['Gender','Driving_License','Previously_Insured','Vehicle_Age','Vehicle_Damage','Response']

In [None]:
#statistical summary of the numeric value in dataset
df_train[df_train_numerical].describe()

In [None]:
df_test = pd.read_csv("/kaggle/input/health-insurance-cross-sell-prediction/test.csv")
df_test.head()

In [None]:
df_test.info()

In [None]:
df_test.shape

In [None]:
df_test.columns

In [None]:
sns.heatmap(df_train.isnull(),yticklabels=False,cbar=False,cmap='viridis')


In [None]:
sns.heatmap(df_test.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [None]:
df_train.isnull().any().any()

In [None]:
df_test.isnull().any().any()

#### No missing values were found in this dataset
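
The checks above return a single True/False flag per dataset. As a minimal sketch, per-column counts make the same check easier to read. The small frame `toy` below is a made-up stand-in for `df_train`, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_train (values are made up).
toy = pd.DataFrame({
    "Age": [44, 76, np.nan, 21],
    "Gender": ["Male", "Male", "Female", None],
    "Vintage": [217, 183, 27, 203],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single flag, equivalent to the .isnull().any().any() check used above.
has_missing = bool(toy.isnull().any().any())
print(has_missing)
```

On the real data, the same two calls on `df_train` and `df_test` should report zero for every column if the conclusion above holds.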

### Data visualization

In [None]:
# Replace the 0/1 codes in the train data with the labels from the category descriptions
change_data1=df_train['Response'].replace({0: 'Not Interested', 1: 'Interested'})
change_data2=df_train['Driving_License'].replace({0: 'No', 1: 'Yes'})
change_data3=df_train['Previously_Insured'].replace({0: 'No', 1: 'Yes'})
d={'id':df_train['id'],'Gender':df_train['Gender'],'Policy_Sales_Channel':df_train['Policy_Sales_Channel'],
   'Age':df_train['Age'],'Vehicle_Age':df_train['Vehicle_Age'],'Region_Code':df_train['Region_Code'],
   'Driving_License':change_data2,'Vehicle_Damage':df_train['Vehicle_Damage'],'Previously_Insured':change_data3,
   'Response':change_data1}
change_data=pd.DataFrame(data=d)
change_data.head()

In [None]:
change_data.Response.value_counts()

Response counts:
- 334399 respondents are not interested
- 46710 respondents are interested
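
To make the class imbalance explicit, the two counts reported above can be turned into shares. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
import pandas as pd

# Counts reported above by change_data.Response.value_counts().
counts = pd.Series({"Not Interested": 334399, "Interested": 46710})

# Share of each class among all respondents.
shares = counts / counts.sum()
print(shares.round(4))

# Ratio of the majority class to the minority class.
imbalance_ratio = counts["Not Interested"] / counts["Interested"]
print(round(imbalance_ratio, 2))
```

With roughly one interested respondent for every seven who are not, accuracy alone can look good for a model that rarely predicts "Interested".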

In [None]:
# Response data viz
sns.set_style('whitegrid')
sns.countplot(x='Response',data=change_data,palette='RdBu_r')

In [None]:
# Response data viz with Gender
sns.set_style('whitegrid')
sns.countplot(x='Response',data=change_data,palette='RdBu_r',hue='Gender')

From this countplot we can see that there are more male than female respondents.

In [None]:
# Response data viz with Vehicle_Age
sns.set_style('whitegrid')
sns.countplot(x='Response',data=change_data,palette='rainbow',hue='Vehicle_Age')

From this countplot we can see that: <br />
- Most respondents have vehicles that are 1-2 years old
- The fewest respondents have vehicles that are more than 2 years old

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Response',data=change_data,palette='RdBu_r',hue='Vehicle_Damage')

Even among respondents whose vehicles were damaged in the past, most are still not interested in vehicle insurance.
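
A count plot shows absolute counts, which can hide the share of interested respondents within each group. One way to read that share directly is a row-normalized crosstab. The sketch below uses a made-up frame `toy` in place of `change_data`.

```python
import pandas as pd

# Hypothetical toy rows standing in for change_data (values are made up).
toy = pd.DataFrame({
    "Vehicle_Damage": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "Response": ["Interested", "Not Interested", "Not Interested", "Interested",
                 "Not Interested", "Not Interested", "Not Interested", "Not Interested"],
})

# Row-normalized crosstab: each row sums to 1, so the cells are
# the share of each response within a Vehicle_Damage group.
rates = pd.crosstab(toy["Vehicle_Damage"], toy["Response"], normalize="index")
print(rates)
```

The same call with `change_data["Vehicle_Damage"]` and `change_data["Response"]` would give the shares for the real data.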

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Response',data=change_data,palette='RdBu_r',hue='Previously_Insured')

In [None]:
change_data.groupby(['Response', 'Previously_Insured']).size().reset_index(name='count')

There are 46710 interested respondents in total. The breakdown above shows how many of them had no vehicle insurance before; the count plot suggests that this is nearly all of them.

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Response',data=change_data,palette='RdBu_r',hue='Driving_License')

Nearly all respondents have a driving license, yet most are still not interested in vehicle insurance.

In [None]:
from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot
init_notebook_mode(connected=True)
cf.go_offline() 
df_train['Age'].iplot(kind='hist',bins=30,color='blue', title="Age Histogram", 
                              xTitle='Age', yTitle='Frequency')

In [None]:
df_train['Annual_Premium'].iplot(kind='hist',bins=30,color='blue', title="Annual Premium Histogram", 
                              xTitle='Annual Premium', yTitle='Frequency')

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Gender',data=df_train,palette='RdBu_r')

The counts for the two genders are almost the same, so both groups can be targeted.

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize=(10,5))
sns.countplot(x='Vehicle_Damage',data=df_train,palette='rainbow',hue='Vehicle_Age')

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize=(10,5))
sns.countplot(x='Vehicle_Damage',data=change_data,palette='RdBu_r',hue='Previously_Insured')

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize=(10,5))
sns.countplot(x='Vehicle_Damage',data=change_data,palette='RdBu_r',hue='Driving_License')

In [None]:
df_train['Region_Code'].value_counts().head(30).plot(kind='barh', figsize=(20,10), title="Region_Code distribution in df_train")
plt.xlabel('Count')
plt.ylabel('Region Code')
plt.show()

In [None]:
b = sns.boxplot(y='Policy_Sales_Channel', x='Response', data=change_data);
b.set_title("Policy_Sales_Channel Distribution for each Response");

## 3. Model Creation

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
df_train = df_train.drop(['id', 'Driving_License'], axis=1)
df_train.shape

In [None]:
data_train = pd.get_dummies(df_train)
data_train.shape

In [None]:
data_train.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data_train.drop(['Response'], axis=1), data_train.Response, test_size = 0.2)

In [None]:
X_train.shape
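
Because the interested class is a small minority, a purely random split can leave the two parts with slightly different class ratios, and the result changes on every run. The following is a sketch of a stratified, reproducible split on made-up data. It is a possible variation, not the split used in this notebook, and the names `X`, `y`, `X_tr`, `X_val` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 10% positives, like a rare Response == 1.
rng = np.random.default_rng(0)
X = pd.DataFrame({"Age": rng.integers(20, 80, size=1000)})
y = pd.Series([1] * 100 + [0] * 900, name="Response")

# stratify=y keeps the class ratio the same in both parts;
# random_state makes the split reproducible.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_tr.mean(), y_val.mean())
```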

We will use the Random Forest algorithm to predict the Response.

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train,y_train)
predictions = rfc.predict(X_test)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))

In [None]:
y_test.sum()

In [None]:
predictions.sum()

In [None]:
from sklearn.metrics import confusion_matrix
result = confusion_matrix(y_test, predictions)
result
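
The numbers in the classification report can be derived from the confusion matrix by hand. The sketch below uses a made-up matrix, not this notebook's output, in scikit-learn's layout (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Hypothetical confusion matrix in scikit-learn's layout:
# rows are true classes, columns are predicted classes, labels ordered [0, 1].
cm = np.array([[900, 50],
               [80, 20]])

tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)       # of predicted "interested", how many really are
recall = tp / (tp + fn)          # of truly "interested", how many were found
accuracy = (tp + tn) / cm.sum()  # share of all predictions that are correct
print(precision, recall, accuracy)
```

In this made-up example accuracy is high while recall for the interested class is low, which is the pattern to watch for with an imbalanced target.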

## 4. Implementation

In [None]:
df_test.shape

In [None]:
df_test=df_test.drop(['id','Driving_License'],axis=1)
df_test=pd.get_dummies(df_test)
df_test.head()

In [None]:
df_test.shape

In [None]:
y = rfc.predict(df_test)
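
`rfc.predict` expects the same columns, in the same order, as the data the model was trained on. `pd.get_dummies` builds columns from the categories it sees, so a category missing from the test data would silently drop a column. A minimal sketch of one way to guard against that, using made-up frames (`train_raw` and `test_raw` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy frames (values are made up).
train_raw = pd.DataFrame({"Age": [25, 40, 60],
                          "Vehicle_Age": ["< 1 Year", "1-2 Year", "> 2 Years"]})
test_raw = pd.DataFrame({"Age": [30, 50],
                         "Vehicle_Age": ["1-2 Year", "< 1 Year"]})

train_enc = pd.get_dummies(train_raw)
test_enc = pd.get_dummies(test_raw)  # has no "Vehicle_Age_> 2 Years" column

# Give the test frame exactly the training columns, in the same order;
# categories absent from the test data become all-zero columns.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

In this notebook the equivalent step would be reindexing `df_test` to `X_train.columns` before calling `rfc.predict`.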

In [None]:
temp = pd.read_csv("/kaggle/input/health-insurance-cross-sell-prediction/test.csv")
id_test = temp.id 

pred_result = {
    'Id' :id_test,
    'Response' : y
}

pred_result = pd.DataFrame(pred_result)

In [None]:
#we change the 0 and 1 code in the pred_result based on the category description
change_test=pred_result['Response'].replace({0: 'Not Interested', 1: 'Interested'})
e={'id':id_test,'Response':change_test}
change_new=pd.DataFrame(data=e)

In [None]:
# Response data viz
sns.set_style('whitegrid')
sns.countplot(x='Response', data=change_new,palette='RdBu_r')

In [None]:
change_new.Response.value_counts()

## 5. Conclusion


- The model predicts 5286 potential new customers in the test data
- The plots below show which features are most important for predicting Response

In [None]:
result_df = pd.read_csv("/kaggle/input/health-insurance-cross-sell-prediction/test.csv")
result_df.shape

In [None]:
result_df['Response'] = pred_result.Response

In [None]:
new_customer = result_df[result_df['Response']==1]

In [None]:
new_customer.describe()

In [None]:
# Rank the input features by how useful they are for predicting the target variable (model fitted on the train data)
features = X_train
importances = rfc.feature_importances_
indices = np.argsort(importances)

plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features.columns[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
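
The same ranking can also be read as a table by pairing each importance with its column name. This is a sketch with made-up values; with the fitted model the inputs would be `rfc.feature_importances_` and `X_train.columns`.

```python
import numpy as np
import pandas as pd

# Hypothetical importances and column names; with the fitted model these
# would be rfc.feature_importances_ and X_train.columns.
importances = np.array([0.10, 0.45, 0.05, 0.40])
columns = ["Age", "Vintage", "Gender_Male", "Annual_Premium"]

# Label each importance with its feature name and sort, largest first.
ranked = pd.Series(importances, index=columns).sort_values(ascending=False)
print(ranked)
```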

In [None]:
new_customer1 = pd.get_dummies(new_customer)
new_customer1 = new_customer1.drop(['id', 'Driving_License', 'Response'], axis=1)
# new_customer1.head()

In [None]:
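# feature_importances_ comes from the fitted model, so these are the same
# importances as above; only the labels are taken from the test-data columns.
# Reindex to the training columns so every label matches its importance.
features = new_customer1.reindex(columns=X_train.columns, fill_value=0)
importances = rfc.feature_importances_
indices = np.argsort(importances)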

plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features.columns[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()