This notebook builds a model to predict whether last year's policyholders (customers) will also be interested in the vehicle insurance offered by the company.

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee[[1](https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction)].

This work uses random forest classification.

Predictor variables:

1. Gender                
1. Age                   
1. Driving_License       
1. Region_Code           
1. Previously_Insured   
1. Vehicle_Age            
1. Vehicle_Damage       
1. Annual_Premium        
1. Policy_Sales_Channel  
1. Vintage               


In [None]:
import numpy as np
import pylab as pl
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
from sklearn.utils import shuffle
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.model_selection import cross_val_score, GridSearchCV
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

# Read Data

In [None]:
train = pd.read_csv("../input/health-insurance-cross-sell-prediction/train.csv")
test = pd.read_csv("../input/health-insurance-cross-sell-prediction/test.csv")

In [None]:
# Encode Gender as an integer (Male = 1, Female = 0)
Gender = {'Male': 1, 'Female': 0}

# Replace each label in the Gender column of the training set
# with its mapped integer value
train.Gender = [Gender[item] for item in train.Gender]
print(train)

In [None]:
# Encode Vehicle_Age as an ordinal integer (older vehicle = smaller value)
Vehicle_Age = {'> 2 Years': 0, '1-2 Year': 1, '< 1 Year': 2}

# Replace each label in the Vehicle_Age column of the training set
# with its mapped integer value
train.Vehicle_Age = [Vehicle_Age[item] for item in train.Vehicle_Age]
print(train)

In [None]:
# Encode Vehicle_Damage as an integer (Yes = 1, No = 0).
# The same mapping must be used for the training and test sets;
# otherwise the model sees the feature with its meaning inverted at prediction time.
Vehicle_Damage = {'Yes': 1, 'No': 0}

# Replace each label in the Vehicle_Damage column of the training set
# with its mapped integer value
train.Vehicle_Damage = [Vehicle_Damage[item] for item in train.Vehicle_Damage]
print(train)

In [None]:
train.info()
train[0:10]

In [None]:
# Encode Gender in the test set with the same mapping as the training set
Gender = {'Male': 1, 'Female': 0}

# Replace each label in the Gender column of the test set
# with its mapped integer value
test.Gender = [Gender[item] for item in test.Gender]
print(test)

In [None]:
# Encode Vehicle_Damage in the test set (Yes = 1, No = 0)
Vehicle_Damage = {'Yes': 1, 'No': 0}

# Replace each label in the Vehicle_Damage column of the test set
# with its mapped integer value
test.Vehicle_Damage = [Vehicle_Damage[item] for item in test.Vehicle_Damage]
print(test)

In [None]:
# Encode Vehicle_Age in the test set with the same mapping as the training set
Vehicle_Age = {'> 2 Years': 0, '1-2 Year': 1, '< 1 Year': 2}

# Replace each label in the Vehicle_Age column of the test set
# with its mapped integer value
test.Vehicle_Age = [Vehicle_Age[item] for item in test.Vehicle_Age]
print(test)
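
Because the training and test sets must be encoded identically, one option is to keep the mappings in a single place and apply them to both frames. The following is a minimal sketch, not the notebook's own method: the helper name `encode_categoricals` and the toy rows are hypothetical, and `Vehicle_Damage` is assumed to follow the `Yes = 1`, `No = 0` convention used for the test set.

```python
import pandas as pd

# Shared lookup tables, so every frame is encoded with the same values.
# The category labels are the ones that appear in the cells above.
ENCODINGS = {
    "Gender": {"Male": 1, "Female": 0},
    "Vehicle_Age": {"> 2 Years": 0, "1-2 Year": 1, "< 1 Year": 2},
    "Vehicle_Damage": {"Yes": 1, "No": 0},  # assumed convention
}


def encode_categoricals(df, encodings=ENCODINGS):
    """Return a copy of df with the listed columns mapped to integers."""
    out = df.copy()
    for column, mapping in encodings.items():
        # Series.map yields NaN for labels missing from the mapping,
        # so unexpected categories can be detected afterwards.
        out[column] = out[column].map(mapping)
    return out


# Hypothetical toy rows standing in for train.csv / test.csv.
toy = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Vehicle_Age": ["< 1 Year", "> 2 Years"],
    "Vehicle_Damage": ["Yes", "No"],
})
encoded = encode_categoricals(toy)
print(encoded)
```

Under these assumptions, calling the same helper on both `train` and `test` would rule out a mismatch between the two encodings.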

In [None]:
test.info()
test[0:10]

In [None]:
print("Any missing values in training set:", train.isnull().values.any())
print("Any missing values in test set:", test.isnull().values.any(), "\n")

In [None]:
# Frequency distribution of classes
train_outcome = pd.crosstab(index=train["Response"],  # Make a crosstab
                              columns="count")      # Name the count column

train_outcome

# Plotting Heatmap


A heatmap is a graphical representation of numerical data in which each value in a matrix is shown as a color. The colors can denote the frequency of an event, the value of a metric in the data set, and so on. The color scheme can be chosen to suit the data being plotted [2].

In [None]:
train = train[['Gender','Age','Driving_License','Region_Code','Previously_Insured','Vehicle_Age','Vehicle_Damage','Annual_Premium',
'Policy_Sales_Channel','Vintage','Response']] #Subsetting the data
cor = train.corr() #Calculate the correlation of the above variables
sns.heatmap(cor, square = True) #Plot the correlation as heat map

The heatmap above shows the pairwise correlation between the variables. The color bar at the side maps each shade to a correlation value; lighter shades represent higher correlation.
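
Shades can be hard to compare by eye, and with a palette where lighter means higher, a strongly negative correlation sits at the dark end. The sketch below shows one way to read the same information numerically by ranking features on the magnitude of their correlation with the target. The data is synthetic and only the column names mirror the dataset; the relationship between `Previously_Insured` and `Response` is built in by construction and is not a result from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names mirror the dataset, values are synthetic.
rng = np.random.default_rng(0)
n = 500
previously_insured = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "Age": rng.integers(20, 80, size=n),
    "Previously_Insured": previously_insured,
    # Synthetic target: always 0 when already insured (by construction here).
    "Response": (1 - previously_insured) * rng.integers(0, 2, size=n),
})

cor = toy.corr()

# Correlation of every feature with the target, largest magnitude first.
target_cor = cor["Response"].drop("Response")
order = target_cor.abs().sort_values(ascending=False).index
ranked = target_cor.reindex(order)
print(ranked)
```

On the real data, the same three lines applied to `train.corr()` would list the features most strongly associated with `Response`, with the sign of each correlation preserved.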

In [None]:
#Select feature column names and target variable we are going to use for training
features=['Gender','Age','Driving_License','Region_Code','Previously_Insured','Vehicle_Age','Vehicle_Damage','Annual_Premium',
'Policy_Sales_Channel','Vintage']
target = 'Response'

In [None]:
# These are the inputs our classifier will use
train[features].head(10)

In [None]:
# Display the first 10 values of the target variable
train[target].head(10).values

## Random forest classification

Random forest builds on decision trees: it is a supervised learning algorithm that creates a forest of randomized decision trees, most often trained with the bagging method. The essential idea of bagging is to average many noisy but approximately unbiased models and thereby reduce the variance. Each tree is constructed using the following algorithm:

* Let $N$ be the number of training cases and $M$ the number of variables in the classifier.
* Let $m$ be the number of input variables used to determine the decision at a given node; $m<M$.
* Choose a training set for this tree by sampling $N$ cases with replacement (a bootstrap sample), and use the cases that were not drawn to estimate the error.
* For each node of the tree, randomly choose $m$ variables on which to base the decision. Calculate the best split of the training set using these $m$ variables.
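
The sampling steps in the list above can be sketched with NumPy. This is an illustration only: the values of `N` and `m` are toy choices, and the code draws the rows and candidate variables without growing a tree.

```python
import numpy as np

rng = np.random.default_rng(42)

N = 10  # number of training cases (toy value)
M = 10  # number of predictor variables, as in this notebook
m = 3   # variables tried at each node (toy value, m < M)

# Bootstrap sample: draw N row indices with replacement.
in_bag = rng.integers(0, N, size=N)

# Out-of-bag rows: cases never drawn, usable to estimate this tree's error.
out_of_bag = np.setdiff1d(np.arange(N), in_bag)

# At a node, consider only m randomly chosen variables (no repeats).
candidate_features = rng.choice(M, size=m, replace=False)

print("in-bag rows:       ", np.sort(in_bag))
print("out-of-bag rows:   ", out_of_bag)
print("features at a node:", np.sort(candidate_features))
```

Because rows are drawn with replacement, some cases appear more than once in the bag and others not at all; the latter form the out-of-bag set.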

To make a prediction, a new case is pushed down each tree and assigned the label of the terminal node where it ends. This is repeated for all the trees in the ensemble, and the label that receives the most votes is reported as the prediction. We set the number of trees in the forest to 100.
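
The voting step can be illustrated with a small sketch. The per-tree predictions below are made up, and the helper `majority_vote` is a hypothetical name used only for this example.

```python
import numpy as np

# Hypothetical class labels predicted by 5 trees (rows) for 4 cases (columns).
tree_predictions = np.array([
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
])


def majority_vote(predictions, n_classes=2):
    """Return, for each column, the label predicted by the most trees."""
    # Count the votes per class for every case, then pick the largest count.
    votes = np.stack([(predictions == c).sum(axis=0) for c in range(n_classes)])
    return votes.argmax(axis=0)


forest_prediction = majority_vote(tree_predictions)
print(forest_prediction)  # [0 1 1 0]
```

Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so its output can differ slightly from this simplified scheme.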

In [None]:
from sklearn.ensemble import RandomForestClassifier

# We define the RF model
rfcla = RandomForestClassifier(n_estimators=100,random_state=9,n_jobs=-1)

# We train the model
rfcla.fit(train[features],train[target]) 
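
The test set has no `Response` column, so the quality of the fitted model cannot be measured on it directly. One common option is cross-validation on the training data. The sketch below uses a synthetic, imbalanced data set from `make_classification` as a stand-in for `train[features]` and `train[target]`; the choice of ROC AUC as the metric and of 5 folds are assumptions, not something this notebook prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for train[features] / train[target].
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=9
)

model = RandomForestClassifier(n_estimators=100, random_state=9)

# 5-fold cross-validation: each fold is scored on rows held out from fitting.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("ROC AUC per fold:", scores.round(3))
print("Mean ROC AUC:", round(float(scores.mean()), 3))
```

With the real data, `X` and `y` would be replaced by `train[features]` and `train[target]`.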



In [None]:
#Make predictions using the features from the test data set
predictions = rfcla.predict(test[features])

#Display our predictions
predictions

In [None]:
#Create a  DataFrame
submission = pd.DataFrame({'id':test['id'],'Response':predictions})

#Visualize the first 5 rows
submission.head()

In [None]:
#Convert DataFrame to a csv file that can be uploaded
#This is saved in the same directory as your notebook
filename = 'submission.csv'

submission.to_csv(filename,index=False)

print('Saved file: ' + filename)