# Caravan Insurance Challenge using Random Forest
#### Author : Rohini Garg

-----------------------------------------------------------------------------------------------------

#### *First, a short introduction to Random Forest*
#### What is Random Forest?

*Credit: https://towardsdatascience.com*

A random forest, also called a *random decision forest*, consists of a large number of individual decision trees that operate as an ensemble. Each tree in the forest outputs a class prediction, and the class with the most votes becomes the model's prediction. Two points underpin this approach:
* **A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.**
* **There needs to be some actual signal in our features so that models built using those features do better than random guessing.**
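
The voting idea described above can be sketched in a few lines. This is only an illustration with made-up data, not scikit-learn's implementation: `tree_votes` is a hypothetical array holding the class predicted by each of five trees for four samples.

```python
import numpy as np

# Hypothetical class predictions: rows = 5 trees, columns = 4 samples
tree_votes = np.array([
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
])

# Count the votes for each class per sample and keep the most frequent class
n_classes = 2
vote_counts = np.array([np.bincount(col, minlength=n_classes) for col in tree_votes.T])
majority = vote_counts.argmax(axis=1)
print(majority)
```

Note that scikit-learn's `RandomForestClassifier` averages the class probabilities of its trees rather than counting hard votes, so its result can differ slightly from a plain majority vote.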



*********************************************************************************************

### Objective of the Caravan Insurance Challenge
#### Identify potential purchasers of caravan insurance policies

#### About Data

The data file contains the following fields:

* **ORIGIN**: train or test (marks which split the record belongs to)
* **MOSTYPE**: Customer Subtype; see L0
* **MAANTHUI**: Number of houses 1 - 10
* **MGEMOMV**: Avg size household 1 - 6
* **MGEMLEEF**: Avg age; see L1
* **MOSHOOFD**: Customer main type; see L2

************************************************************************
* **Percentages in each group, per postal code (see L3)**:

* **MGODRK**: Roman catholic
* **MGODPR**: Protestant …
* **MGODOV**: Other religion
* **MGODGE**: No religion
* **MRELGE**: Married
* **MRELSA**: Living together
* **MRELOV**: Other relation
* **MFALLEEN**: Singles
* **MFGEKIND**: Household without children
* **MFWEKIND**: Household with children
* **MOPLHOOG**: High level education
* **MOPLMIDD**: Medium level education
* **MOPLLAAG**: Lower level education
* **MBERHOOG**: High status
* **MBERZELF**: Entrepreneur
* **MBERBOER**: Farmer
* **MBERMIDD**: Middle management
* **MBERARBG**: Skilled labourers
* **MBERARBO**: Unskilled labourers
* **MSKA**: Social class A
* **MSKB1**: Social class B1
* **MSKB2**: Social class B2
* **MSKC**: Social class C
* **MSKD**: Social class D
* **MHHUUR**: Rented house
* **MHKOOP**: Home owners
* **MAUT1**: 1 car
* **MAUT2**: 2 cars
* **MAUT0**: No car
* **MZFONDS**: National Health Service
* **MZPART**: Private health insurance
* **MINKM30**: Income < 30.000
* **MINK3045**: Income 30-45.000
* **MINK4575**: Income 45-75.000
* **MINK7512**: Income 75-122.000
* **MINK123M**: Income >123.000
* **MINKGEM**: Average income
* **MKOOPKLA**: Purchasing power class
************************************************************************
* **Total number of variable in postal code (see L4)**:

* **PWAPART**: Contribution private third party insurance
* **PWABEDR**: Contribution third party insurance (firms) …
* **PWALAND**: Contribution third party insurance (agriculture)
* **PPERSAUT**: Contribution car policies
* **PBESAUT**: Contribution delivery van policies
* **PMOTSCO**: Contribution motorcycle/scooter policies
* **PVRAAUT**: Contribution lorry policies
* **PAANHANG**: Contribution trailer policies
* **PTRACTOR**: Contribution tractor policies
* **PWERKT**: Contribution agricultural machines policies
* **PBROM**: Contribution moped policies
* **PLEVEN**: Contribution life insurances
* **PPERSONG**: Contribution private accident insurance policies
* **PGEZONG**: Contribution family accidents insurance policies
* **PWAOREG**: Contribution disability insurance policies
* **PBRAND**: Contribution fire policies
* **PZEILPL**: Contribution surfboard policies
* **PPLEZIER**: Contribution boat policies
* **PFIETS**: Contribution bicycle policies
* **PINBOED**: Contribution property insurance policies
* **PBYSTAND**: Contribution social security insurance policies
* **AWAPART**: Number of private third party insurance 1 - 12
* **AWABEDR**: Number of third party insurance (firms) …
* **AWALAND**: Number of third party insurance (agriculture)
* **APERSAUT**: Number of car policies
* **ABESAUT**: Number of delivery van policies
* **AMOTSCO**: Number of motorcycle/scooter policies
* **AVRAAUT**: Number of lorry policies
* **AAANHANG**: Number of trailer policies
* **ATRACTOR**: Number of tractor policies
* **AWERKT**: Number of agricultural machines policies
* **ABROM**: Number of moped policies
* **ALEVEN**: Number of life insurances
* **APERSONG**: Number of private accident insurance policies
* **AGEZONG**: Number of family accidents insurance policies
* **AWAOREG**: Number of disability insurance policies
* **ABRAND**: Number of fire policies
* **AZEILPL**: Number of surfboard policies
* **APLEZIER**: Number of boat policies
* **AFIETS**: Number of bicycle policies
* **AINBOED**: Number of property insurance policies
* **ABYSTAND**: Number of social security insurance policies
* **CARAVAN**: Number of mobile home policies 0 - 1

### Call libraries

In [None]:
#1.0 Clear memory
%reset -f

# 1.1 Call data manipulation libraries
import pandas as pd
import numpy as np
from scipy.stats import kurtosis, skew

# 1.2 Feature creation Classes
from sklearn.preprocessing import PolynomialFeatures            # Interaction features
from sklearn.preprocessing import KBinsDiscretizer  


# 1.3 Data transformation classes
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

# Construct a transformer from an arbitrary callable.
from sklearn.preprocessing import FunctionTransformer

# 1.4 Fill missing values
from sklearn.impute import SimpleImputer


# 1.5  Pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# 1.6 RandomForest modeling
from sklearn.ensemble import RandomForestClassifier 

# 1.7 Misc
import os, gc

#Graphing
import matplotlib.pyplot as plt
import plotly.graph_objects as go 
import plotly.express as px
from matplotlib.colors import LogNorm
import seaborn as sns

# to display all outputs of one cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#hide warning
import warnings
warnings.filterwarnings('ignore')

## Set Directory

In [None]:
os.chdir('/kaggle/input')
os.listdir()

In [None]:
dfci=pd.read_csv('caravan-insurance-challenge/caravan-insurance-challenge.csv')
dfci.head()
print("No. of observations:",dfci.shape[0])
print("No. of features:",dfci.shape[1])

In [None]:
dfci.columns

In [None]:
dfci.head()

#### Check for NULL values

In [None]:
dfci.columns[dfci.isnull().any()]
# No column has null values, so there is no need to fix (impute) anything


#### Check column type

In [None]:
dfci.dtypes.value_counts()
# All columns are int64 except one; list the non-numeric (object) column below

str_features = dfci.select_dtypes(include='object').columns
str_features

In [None]:
# Let's check the unique values of ORIGIN
dfci['ORIGIN'].value_counts()

In [None]:
#check summary
dfci.describe()

### **Split data according to ORIGIN**

In [None]:
# Fetch the training rows and drop the ORIGIN column.
# drop() returns a new frame, which avoids pandas' SettingWithCopyWarning
# raised by an in-place drop on a filtered slice.
train_data = dfci[dfci['ORIGIN'] == 'train'].drop(columns=['ORIGIN'])

# Fetch the test rows and drop the ORIGIN column
test_data = dfci[dfci['ORIGIN'] == 'test'].drop(columns=['ORIGIN'])

In [None]:
train_data['CARAVAN'].value_counts().plot(kind='bar', title='CARAVAN Classification Train Data', grid=True)

In [None]:
test_data['CARAVAN'].value_counts().plot(kind='bar', title='CARAVAN Classification Test Data', grid=True)

* **Observation**: In both the train and test data, most records have CARAVAN = 0, so the classes are heavily imbalanced
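
To put a number on this kind of imbalance, one can look at the class shares and at the accuracy a model would reach by always predicting the majority class. The sketch below uses made-up labels (94 zeros and 6 ones, chosen only for illustration) rather than the real data.

```python
import pandas as pd

# Made-up labels for illustration only: 94 non-buyers and 6 buyers
labels = pd.Series([0] * 94 + [1] * 6, name="CARAVAN")

# Share of each class
class_share = labels.value_counts(normalize=True)
print(class_share)

# Accuracy of a model that always predicts the majority class
baseline_accuracy = class_share.max()
print("Majority-class baseline accuracy:", baseline_accuracy)
```

On the notebook's data the equivalent call would be `train_data['CARAVAN'].value_counts(normalize=True)`. Any accuracy reported later is worth comparing against this baseline.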

#### Store the target column in a variable and remove it from the training data

In [None]:
y = train_data.pop('CARAVAN')
train_data.head()

In [None]:
# Check the standard deviation; a column whose std() is zero is constant and can be dropped
s = [col for col in train_data.columns if train_data[col].std() == 0]
s


#### Separate the numerical and categorical columns by counting unique values: columns with fewer than 5 unique values are treated as categorical

In [None]:
# True for columns with fewer than 5 distinct values
dg = train_data.nunique() < 5
cat_columns = dg[dg].index.tolist()
num_columns = dg[~dg].index.tolist()
print("No. of cat cols:", len(cat_columns))
print("No. of num cols:", len(num_columns))

#### Decide which scaler to use on the numerical data, StandardScaler or RobustScaler, by plotting the distribution of each column

In [None]:
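import math

plt.figure(figsize=(15,18))
# Three plots per row
noofrows = math.ceil(len(num_columns)/3)
noofrows
# Disable the statsmodels KDE backend; otherwise distplot raises an error when the bandwidth is 0
sns.distributions._has_statsmodels = False

# Note: distplot is deprecated since seaborn 0.11; histplot/displot are its replacements
for i, col in enumerate(num_columns):
    plt.subplot(noofrows, 3, i+1)
    out = sns.distplot(train_data[col])

plt.tight_layout()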


* **Most columns contain outliers, so we will use RobustScaler for num_columns**
* **OneHotEncoder for cat_columns**
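
Why outliers favour RobustScaler can be seen on a tiny example. The column below is made up and contains a single extreme value; the sketch only compares the two scalers and is not part of the notebook's workflow.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Made-up column with one extreme value
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust = RobustScaler().fit_transform(values)
standard = StandardScaler().fit_transform(values)

# RobustScaler subtracts the median (3) and divides by the IQR (4 - 2 = 2),
# so the four ordinary values keep an even spacing
print(robust.ravel())

# StandardScaler uses the mean and standard deviation, both of which are
# pulled towards the outlier, so the ordinary values get squeezed together
print(standard.ravel())
```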

#### Create Column Transformer


In [None]:

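ct = ColumnTransformer([
    ('rs', RobustScaler(), num_columns),
    ('ohe', OneHotEncoder(), cat_columns),
    ],
    remainder="passthrough"
    )
# fit_transform returns the transformed array; here it is only displayed, not stored
ct.fit_transform(train_data)
# X keeps the original, untransformed features
X = train_data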
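
A small sketch of what a `ColumnTransformer` of this kind produces, on a toy frame with hypothetical column names (`num_a`, `cat_b`). It shows that `fit_transform` returns a new array and leaves the input frame untouched. `handle_unknown="ignore"` is an addition for illustration and is not used in the cell above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy data (hypothetical column names and values)
toy = pd.DataFrame({
    "num_a": [1, 2, 3, 4, 50, 6],
    "cat_b": [0, 1, 2, 0, 1, 2],
})

toy_ct = ColumnTransformer(
    [
        ("rs", RobustScaler(), ["num_a"]),
        ("ohe", OneHotEncoder(handle_unknown="ignore"), ["cat_b"]),
    ],
    remainder="passthrough",
)

# fit_transform returns a new array; the input frame is left unchanged
transformed = toy_ct.fit_transform(toy)
print(toy.shape, "->", transformed.shape)
```

The one-hot encoder expands `cat_b` into one column per category, which is why the transformed output has more columns than the input.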


#### Split the data in a 7:3 ratio, so set test_size=0.30

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.30)
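
With imbalanced classes, a plain random split can leave the two parts with different class ratios. One common option, shown here as a sketch on made-up labels and not as what the cell above does, is to pass `stratify` (and a `random_state` for repeatability) to `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 90 zeros and 10 ones
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 90 + [1] * 10)

# stratify keeps the class ratio the same in both parts;
# random_state makes the split repeatable
Xa, Xb, ya, yb = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=42
)
print(len(Xa), len(Xb), ya.mean(), yb.mean())
```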

#### Create the Pipeline and fit the model

In [None]:
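rf = RandomForestClassifier(oob_score=True, bootstrap=True)
pipe = Pipeline(
    [
     ('ct', ct),
     ('rf', rf)
    ]
    )
# Note: the forest is fitted directly on the untransformed features here, not through `pipe`,
# so its inputs line up with train_data.columns in the cells that follow
rf.fit(X_train, y_train)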
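
For comparison, this is roughly how fitting through a pipeline looks, so that scaling and encoding run before the forest is trained. The data is synthetic and the column names (`num_a`, `cat_b`) are hypothetical; `n_estimators=50` and `random_state=0` are arbitrary choices for the sketch.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Synthetic data (hypothetical columns), generated only to make the sketch runnable
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "num_a": rng.integers(0, 10, size=200),
    "cat_b": rng.integers(0, 3, size=200),
})
target = (demo["num_a"] > 6).astype(int)

demo_pipe = Pipeline([
    ("ct", ColumnTransformer([
        ("rs", RobustScaler(), ["num_a"]),
        ("ohe", OneHotEncoder(handle_unknown="ignore"), ["cat_b"]),
    ])),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Fitting the pipeline runs the transformers first, then trains the forest on their output
demo_pipe.fit(demo, target)

# The forest inside the pipeline saw 1 scaled column + 3 one-hot columns
print(demo_pipe.named_steps["rf"].n_features_in_)

# predict applies the same transformations before asking the forest
print(demo_pipe.predict(demo.head()))
```

When a model is fitted this way, its feature importances refer to the transformed columns, so they no longer map one-to-one onto the original column names.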

#### Check accuracy

In [None]:
from sklearn.metrics import accuracy_score

predicted = rf.predict(X_test)
accuracy = accuracy_score(y_test, predicted)
print("Accuracy is:", accuracy)
print("The out-of-bag score computed by sklearn is an estimate of the classification accuracy we might expect to observe on new data")
print("Out-of-bag score estimate:", rf.oob_score_)
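
The relation between the out-of-bag score and a hold-out score can be tried on synthetic data. The sketch below uses `make_classification` with arbitrary settings; none of the numbers it prints relate to the caravan data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with arbitrary settings, for illustration only
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.30, random_state=0)

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X_a, y_a)

# oob_score_ is computed from the training rows that each tree
# did not see in its bootstrap sample, so no separate test set is needed
print("OOB score:", forest.oob_score_)
print("Hold-out accuracy:", forest.score(X_b, y_b))
```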



#### The accuracy looks good, but because most records have CARAVAN = 0 it does not tell us where the model does well and where it fails. A confusion matrix makes the per-class performance visible.

In [None]:
from sklearn.metrics import confusion_matrix
cm = pd.DataFrame(confusion_matrix(y_test, predicted))
sns.heatmap(cm, annot=True)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Predicted vs Actual')
cm
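
The four cells of a binary confusion matrix translate directly into precision and recall for the positive class. The labels below are made up for illustration; `actual_demo` and `predicted_demo` are hypothetical names.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions for illustration
actual_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted_demo = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(actual_demo, predicted_demo).ravel()
print(tn, fp, fn, tp)

# Precision: share of predicted buyers that are real buyers
# Recall: share of real buyers that the model found
precision = precision_score(actual_demo, predicted_demo)
recall = recall_score(actual_demo, predicted_demo)
print(precision, recall)
```

In this toy case 6 of 10 predictions are correct, yet only one of the four actual buyers is found, which is the kind of gap accuracy alone hides.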

In [None]:
from sklearn.tree import export_graphviz
import graphviz

feature_list = train_data.columns
# Pick one tree from the forest (index 5) to visualise
tree = rf.estimators_[5]
# Export the tree to DOT format
dot_data = export_graphviz(tree, out_file=None,
                     feature_names=train_data.columns,
                     filled=True, rounded=True,
                     special_characters=True)
# Set graph and plot
graph = graphviz.Source(dot_data)
graph


#### Get Feature importance

In [None]:

importances = list(rf.feature_importances_)

dffeature_importance = pd.DataFrame({'Feature_Name': feature_list, 'Importance': importances})

# Get which feature has max importance
dffeature_importance[dffeature_importance['Importance'] == dffeature_importance['Importance'].max()]
# Get which feature has min importance
dffeature_importance[dffeature_importance['Importance'] == dffeature_importance['Importance'].min()]
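
Beyond the single maximum and minimum, a ranked view is often easier to read. The sketch below sorts a small frame of hypothetical importance values (feature names `f1` to `f4` are placeholders) and keeps the top entries.

```python
import pandas as pd

# Hypothetical importance values for a handful of placeholder features
demo_importance = pd.DataFrame({
    "Feature_Name": ["f1", "f2", "f3", "f4"],
    "Importance": [0.10, 0.45, 0.05, 0.40],
})

# Sort in descending order and keep the three most important features
top_features = demo_importance.sort_values("Importance", ascending=False).head(3)
print(top_features)
```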


#### Plot Feature importance

In [None]:
plt.figure(figsize=(20,18))

# list of x locations for plotting
x_values = list(range(len(importances)))
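# Tick flags must be booleans, not 'off' strings; keep the left and bottom ticks and labels, hide the top and right ones
plt.tick_params(axis='both', left=True, bottom=True, top=False, right=False, labelleft=True, labelbottom=True, labeltop=False, labelright=False)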
# Make a bar chart
out=plt.bar(x_values, importances, orientation = 'vertical')

# Tick labels for x axis
plt.xticks(x_values, feature_list, rotation='vertical')


* **Observation**: The most important features are PBRAND, PPERSAUT, MOSTYPE and APERSAUT