## Caravan Insurance Exercise using Random Forests
#### By Ezhilarasan Kannaiyan

Create a RandomForest model, train it on the train dataset, and then apply it to the held-out test dataset to make predictions.

 - Import Libraries & Read Dataset
 - Check the dataset for missing values (NA) and use an imputer if any are found
 - Split the dataset into train and test datasets
 - Remove the ORIGIN variable from both train and test datasets (it is used only to split the data)
 - Use dist plots to look for outliers
 - Use Column Transformer to encode the dataset
 - Create RandomForest model 
 - Train the model with our train dataset
 - Make predictions by applying the model on our test dataset


In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [None]:
pd.options.display.max_columns=100
pd.options.display.max_rows=100

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
df = pd.read_csv("../input/caravan-insurance-challenge/caravan-insurance-challenge.csv")

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.isnull().values.any()

Null values are not present
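
No imputation is needed here, but if the check above had returned True, the imported SimpleImputer could fill the gaps. The following is only a minimal sketch on a made-up toy frame; the column names, values, and the median strategy are assumptions for illustration and are not taken from the caravan data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with gaps (not the caravan dataset)
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [3.0, 5.0, np.nan, 5.0],
})

# Replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
print("Any nulls left:", filled.isnull().values.any())
```

In a real workflow the imputer would be fitted on the train rows only and then applied to the test rows, for example as a step inside the pipeline.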

In [None]:
df.nunique()

Split the dataset using the ORIGIN column.
Then remove the ORIGIN variable from both the train and test datasets (it is used only to split the data).

In [None]:
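# .copy() makes independent frames, so the later pop() calls do not act on a slice of df
x_train = df.loc[df['ORIGIN'] == 'train',:].copy()
x_test = df.loc[df['ORIGIN'] == 'test',:].copy()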

In [None]:
a = x_train.pop('ORIGIN')
b = x_test.pop('ORIGIN')


Pop the CARAVAN column (the target) out of the feature frames and into the y arrays

In [None]:
y_train = x_train.pop('CARAVAN')
y_test = x_test.pop('CARAVAN')
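
The split-and-pop steps can be tried on a small frame first. This is a sketch with made-up rows; the values are hypothetical, and the explicit `.copy()` is an assumption added here so that pandas works on independent frames rather than on slices of the original.

```python
import pandas as pd

# Hypothetical toy rows that mimic the layout of the caravan file
toy = pd.DataFrame({
    "ORIGIN": ["train", "train", "test", "train", "test"],
    "MOSTYPE": [33, 37, 9, 40, 23],
    "CARAVAN": [0, 1, 0, 0, 1],
})

# Split on ORIGIN, then drop the marker column from both parts
train = toy.loc[toy["ORIGIN"] == "train"].copy().drop(columns="ORIGIN")
test = toy.loc[toy["ORIGIN"] == "test"].copy().drop(columns="ORIGIN")

# pop() removes the column from the frame in place and returns it as a Series
y_tr = train.pop("CARAVAN")
y_te = test.pop("CARAVAN")

print(train.columns.tolist(), y_tr.tolist())
print(test.columns.tolist(), y_te.tolist())
```

Because `pop()` mutates the frame it is called on, re-running such a cell raises a KeyError unless the frames are rebuilt first.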

In [None]:
x_train.shape
x_test.shape
y_train.shape
y_test.shape

Create arrays of numerical and categorical columns (columns with more than 5 distinct values are treated as numerical, the rest as categorical)

In [None]:
num_columns = x_train.columns[x_train.nunique() > 5]
cat_columns = x_train.columns[x_train.nunique() <= 5]

In [None]:
len(num_columns)
len(cat_columns)

In [None]:
num_columns
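
The distinct-value rule is a heuristic rather than a property of the data, so it helps to see how it behaves. A minimal sketch on a made-up frame follows; the column names and the `threshold` variable are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: one column with many levels, two with few
toy = pd.DataFrame({
    "few_levels": [1, 2, 1, 2, 1, 2, 1, 2],
    "many_levels": [1, 2, 3, 4, 5, 6, 7, 8],
    "binary": [0, 1, 0, 0, 1, 0, 1, 1],
})

threshold = 5  # same cut-off as in the notebook
counts = toy.nunique()

# Columns above the cut-off are treated as numerical, the rest as categorical
num_cols = counts[counts > threshold].index.tolist()
cat_cols = counts[counts <= threshold].index.tolist()

print("numerical:", num_cols)
print("categorical:", cat_cols)
```

A column with exactly 5 levels lands on the categorical side, so moving the threshold changes how many one-hot columns the encoder later produces.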

Draw dist plots for all the numerical columns

In [None]:
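plt.figure(figsize=(15,15))
# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable plot
n_cols = 5
n_rows = int(np.ceil(len(num_columns) / n_cols))  # size the grid to the number of columns
for i, col in enumerate(num_columns):
    plt.subplot(n_rows, n_cols, i+1)
    sns.histplot(df[col], kde=True, stat="density")

plt.tight_layout()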
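
Reading outliers off dozens of small plots is subjective, so a numeric check can complement them. The sketch below uses the common 1.5 × IQR rule, which is a swapped-in technique and not part of the original notebook; the helper name `iqr_outlier_share` and the toy columns are hypothetical.

```python
import pandas as pd

def iqr_outlier_share(series, k=1.5):
    """Return the fraction of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return mask.mean()

# Hypothetical toy columns: one well behaved, one with a single extreme value
toy = pd.DataFrame({
    "steady": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
    "spiky": [10, 11, 12, 13, 14, 15, 16, 17, 18, 500],
})

# apply() calls the helper once per column
shares = toy.apply(iqr_outlier_share)
print(shares)
```

On the notebook's data the same helper could be applied as `x_train[num_columns].apply(iqr_outlier_share)` to rank columns by their share of flagged values.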

Use a Column Transformer to apply scaling and encoding:
 - RobustScaler for numerical columns when they contain outliers
 - StandardScaler for numerical columns when there are no outliers
 - OneHotEncoder for categorical columns

Here RobustScaler is applied to all numerical columns.

In [None]:
ct = ColumnTransformer(
    [
        ('num_col', RobustScaler(), num_columns),
        ('cat_col', OneHotEncoder(handle_unknown='ignore'), cat_columns),
    ],
    remainder='passthrough',
)
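
To see why the choice between the two scalers depends on outliers, here is a small sketch on made-up numbers (not caravan data). RobustScaler centres on the median and divides by the interquartile range, while StandardScaler uses the mean and standard deviation, both of which a single extreme value can distort.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical single feature with one extreme value
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust = RobustScaler().fit_transform(values).ravel()
standard = StandardScaler().fit_transform(values).ravel()

# Spread of the four ordinary points after each scaling
robust_spread = robust[:4].max() - robust[:4].min()
standard_spread = standard[:4].max() - standard[:4].min()

print("robust  :", np.round(robust, 3))
print("standard:", np.round(standard, 3))
print("spread of inliers:", robust_spread, round(standard_spread, 3))
```

With the outlier present, StandardScaler squeezes the four ordinary points into a very narrow band, whereas RobustScaler keeps them spread out.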

In [None]:
ct.fit_transform(x_train) 
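
The `handle_unknown='ignore'` setting matters when the test data contains a category level that never appeared in the train data. A minimal sketch with a made-up column (the name `level` and its values are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column; level 4 appears only in the test rows
train = pd.DataFrame({"level": [1, 2, 3, 1]})
test = pd.DataFrame({"level": [2, 4]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# The unseen level is encoded as a row of zeros instead of raising an error
encoded = enc.transform(test).toarray()
print(encoded)
```

Without that setting, the encoder's default behaviour is to raise an error when it meets an unseen level at transform time.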

Chain the column transformer and a Random Forest classifier in a pipeline, so the classifier is trained on the scaled (RobustScaler) and encoded (OneHotEncoder) x_train data

In [None]:
pipe = Pipeline([('ct',ct),
                 ('rf',RandomForestClassifier())
                 ])


In [None]:
pipe.fit(x_train,y_train)
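
A random forest is randomised, so a classifier created without a `random_state` can give slightly different results on each run. The sketch below illustrates fixing the seed on synthetic data from `make_classification`; the data, the seed value, and the helper `fit_and_score` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in data (not the caravan dataset)
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.9, 0.1], random_state=0
)

def fit_and_score(seed):
    """Fit a small forest with a fixed seed and return positive-class probabilities."""
    model = RandomForestClassifier(n_estimators=25, random_state=seed)
    model.fit(X, y)
    return model.predict_proba(X)[:, 1]

run_a = fit_and_score(42)
run_b = fit_and_score(42)

print("Identical across runs:", (run_a == run_b).all())
```

In the notebook's pipeline the equivalent change would be `RandomForestClassifier(random_state=42)`, with any fixed integer serving the same purpose.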

Apply the model on test data and make predictions

In [None]:
y_pred = pipe.predict(x_test)

In [None]:
np.sum(y_pred == y_test)/len(y_test)*100

The RandomForest model predicts the test data with an accuracy of almost 93%.
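
Accuracy alone can flatter a model when one class dominates the target, which can be checked here with `y_test.value_counts()`. The sketch below uses synthetic labels (assumed, not taken from the notebook's results) to show how a predictor that always outputs the majority class compares with one that finds some positives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 6 positives among 100 cases (an assumption for illustration)
y_true = np.array([1] * 6 + [0] * 94)

# Baseline that always predicts the majority class
y_majority = np.zeros_like(y_true)

# Hypothetical model: finds 2 of the 6 positives and raises 1 false alarm
y_model = np.zeros_like(y_true)
y_model[:2] = 1
y_model[10] = 1

for name, pred in [("majority", y_majority), ("model", y_model)]:
    acc = accuracy_score(y_true, pred)
    rec = recall_score(y_true, pred)
    print(f"{name}: accuracy={acc:.2f}, recall={rec:.2f}")
```

Comparing the pipeline's accuracy with such a baseline, and looking at recall or a confusion matrix for the positive class, gives a fuller picture than the accuracy figure on its own.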