# Caravan Insurance Prediction

Data Set Information

Information about customers consists of 86 variables and includes product usage data and socio-demographic data derived from zip area codes. The data was supplied by the Dutch data mining company Sentient Machine Research and is based on a real-world business problem. The training set contains over 5,000 customer descriptions, including whether or not each customer has a caravan insurance policy. A test set contains 4,000 customers for whom only the organisers know whether they have a caravan insurance policy.

In [75]:
# Importing packages
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score as kappa
from sklearn.metrics import confusion_matrix
from sklearn import metrics
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Reading the caravan Insurance data 
data = pd.read_csv('D:\\WorkPlace_R\\caravan_insurance.csv')
# Analysing the data 
data.describe()

Unnamed: 0,Customer Subtype,Number of houses,vg size household,Avg age,Customer main type,Roman catholic,Protestant,Other religion,No religion,Married,...,Number of private accident insurance policies,Number of family accidents insurance policies,Number of disability insurance policies,Number of fire policies,Number of surfboard policies,Number of boat policies,Number of bicycle policies,Number of property insurance policies,Number of social security insurance policies,Number of mobile home policies
count,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,...,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0
mean,24.253349,1.110615,2.678805,2.99124,5.773617,0.696496,4.626932,1.069907,3.258502,6.183442,...,0.005325,0.006527,0.004638,0.570079,0.000515,0.006012,0.031776,0.007901,0.014256,0.059773
std,12.846706,0.405842,0.789835,0.814589,2.85676,1.003234,1.715843,1.017503,1.597647,1.909482,...,0.072782,0.080532,0.077403,0.562058,0.022696,0.081632,0.210986,0.090463,0.119996,0.237087
min,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,10.0,1.0,2.0,2.0,3.0,0.0,4.0,0.0,2.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,30.0,1.0,3.0,3.0,7.0,0.0,5.0,1.0,3.0,6.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,35.0,1.0,3.0,3.0,8.0,1.0,6.0,2.0,4.0,7.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
max,41.0,10.0,5.0,6.0,10.0,9.0,9.0,5.0,9.0,9.0,...,1.0,1.0,2.0,7.0,1.0,2.0,3.0,2.0,2.0,1.0


In [3]:
# Extracting only Independent Variables
features = ['Customer Subtype', 'Number of houses', 'vg size household', 'Avg age',
       'Customer main type', 'Roman catholic', 'Protestant', 'Other religion',
       'No religion', 'Married', 'Living together', 'Other relation',
       'Singles', 'Household without children', 'Household with children',
       'High level education', 'Medium level education',
       'Lower level education', 'High status', 'Entrepreneur', 'Farmer',
       'Middle management', 'Skilled labourers', 'Unskilled labourers',
       'Social class A', 'Social class B1', 'Social class B2',
       'Social class C', 'Social class D', 'Rented house', 'Home owners',
       'car', 'cars', 'No car', 'National Health Service',
       'Private Health Service', 'Income < 30.000', 'Income 30-45.000',
       'Income 45-75.000', 'Income 75-122.000', 'Income >123.000',
       'Average income', 'Purchasing power class',
       'Contribution private third party insurance',
       'Contribution third party insurance (firms)',
       'Contribution third party insurane (agriculture)',
       'Contribution car policies', 'Contribution delivery van policies',
       'Contribution motorcycle/scooter policies',
       'Contribution lorry policies', 'Contribution trailer policies',
       'Contribution tractor policies',
       'Contribution agricultural machines policies ',
       'Contribution moped policies', 'Contribution life insurances',
       'Contribution private accident insurance policies',
       'Contribution family accidents insurance policies',
       'Contribution disability insurance policies',
       'Contribution fire policies', 'Contribution surfboard policies',
       'Contribution boat policies', 'Contribution bicycle policies',
       'Contribution property insurance policies',
       'Contribution social security insurance policies',
       'Number of private third party insurance',
       'Number of third party insurance (firms) ',
       'Number of third party insurane (agriculture)',
       'Number of car policies', 'Number of delivery van policies',
       'Number of motorcycle/scooter policies', 'Number of lorry policies',
       'Number of trailer policies', 'Number of tractor policies',
       'Number of agricultural machines policies', 'Number of moped policies',
       'Number of life insurances',
       'Number of private accident insurance policies',
       'Number of family accidents insurance policies',
       'Number of disability insurance policies', 'Number of fire policies',
       'Number of surfboard policies', 'Number of boat policies',
       'Number of bicycle policies', 'Number of property insurance policies',
       'Number of social security insurance policies']

In [4]:
# Normalization - Using MinMax Scaler
min_max_scaler = preprocessing.MinMaxScaler()
X = min_max_scaler.fit_transform(data[features])

y = np.vstack(data['Number of mobile home policies'].values)
print('X and Y Input Data:   ', X.shape, y.shape)


X and Y Input Data:    (5822, 85) (5822, 1)
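
As a small illustration of what the min-max normalization step computes, the sketch below rescales each column to the range [0, 1] by hand and compares the result with `MinMaxScaler`. The toy matrix `X_toy` is hypothetical and is not taken from the caravan data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (hypothetical values, not from the caravan data)
X_toy = np.array([[1.0, 10.0],
                  [3.0, 20.0],
                  [5.0, 40.0]])

# Manual min-max scaling per column: (x - min) / (max - min)
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
X_manual = (X_toy - col_min) / (col_max - col_min)

# The same transformation via scikit-learn
X_sklearn = MinMaxScaler().fit_transform(X_toy)

print(X_manual)
print(np.allclose(X_manual, X_sklearn))
```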


In [5]:
# Splitting the data into a training set and a hold-out set (70% / 30%)
X_train_original, X_test2, y_train_original, y_test2 = train_test_split(X, y, test_size=0.3,
                                                                        random_state=42)

print('Training Set Shape:   ', X_train_original.shape, y_train_original.shape)

# Splitting the hold-out set into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(X_test2, y_test2, test_size=0.33,random_state=42)
# A fixed seed is used in partitioning so that the test set stays the same for every run

print('Validation Set Shape: ', X_val.shape,y_val.shape)
print('Test Set Shape:       ', X_test.shape, y_test.shape)

Training Set Shape:    (4075, 85) (4075, 1)
Validation Set Shape:  (1170, 85) (1170, 1)
Test Set Shape:        (577, 85) (577, 1)
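
The set sizes above follow from the two `test_size` fractions. The sketch below repeats the two-stage split on placeholder arrays of the same shape (all zeros, purely illustrative) to show how 5822 rows become 4075 / 1170 / 577; it assumes the current `sklearn.model_selection` import path.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the scaled data (values are dummies)
X_demo = np.zeros((5822, 85))
y_demo = np.zeros((5822, 1))

# Stage 1: 70% training, 30% hold-out
X_tr, X_hold, y_tr, y_hold = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Stage 2: split the hold-out part into validation (67%) and test (33%)
X_v, X_te, y_v, y_te = train_test_split(X_hold, y_hold, test_size=0.33, random_state=42)

print(X_tr.shape, X_v.shape, X_te.shape)
```

Overall this is roughly a 70% / 20% / 10% train / validation / test partition.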


In [6]:
# Oversampling of underrepresented class
# Class to perform over-sampling using SMOTE
# conda install -c glemaitre imbalanced-learn
doOversampling = True

if doOversampling:
    # Apply regular SMOTE (fit_resample is the current imbalanced-learn method name)
    sm = SMOTE()
    X_train, y_train = sm.fit_resample(X_train_original, y_train_original.ravel())
    print('Training Set Shape after oversampling: ', X_train.shape, y_train.shape)
    print(pd.crosstab(y_train,y_train))
else:
    X_train = X_train_original
    y_train = y_train_original 

Training Set Shape after oversampling:  (7692, 85) (7692,)
col_0     0     1
row_0            
0      3846     0
1         0  3846
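
To give a feel for how SMOTE creates the extra minority-class rows, here is a minimal NumPy sketch of its core idea: a synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not the imbalanced-learn implementation; the function name `smote_like_sample` and the toy points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[0.1, 0.2],
                     [0.2, 0.1],
                     [0.3, 0.3],
                     [0.8, 0.9]])

def smote_like_sample(points, k, rng):
    """Create one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    # Position 0 of the sorted order is the point itself (distance 0), so skip it
    neighbours = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbours)
    gap = rng.random()  # uniform in [0, 1)
    return points[i] + gap * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Because each synthetic row is an interpolation between two real minority rows, it stays inside the range of the observed minority data.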


# 1. Logistic Regression
Logistic Regression (aka logit, MaxEnt) classifier.
1. solver : {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}. The solver is the optimization method used to find the optimum of the objective function.
 1. newton-cg - a Newton conjugate-gradient (truncated Newton) method. It uses second-order (Hessian) information and solves each Newton step approximately with conjugate-gradient iterations. The minimum of f(x) is reached where the gradient is 0.
 2. lbfgs - the limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm. It minimises f(x) over unconstrained values of the real vector x, where f is a differentiable scalar function.
 3. liblinear - LIBLINEAR is a library for linear classification on data with millions of instances and features. It supports

    * L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM, and logistic regression (LR)
    * L1-regularized classifiers (after version 1.4): L2-loss linear SVM and logistic regression (LR)
    * L2-regularized support vector regression (after version 1.9): L2-loss linear SVR and L1-loss linear SVR
   
      i) Main features of LIBLINEAR include

        * Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer
        * Cross validation for model evaluation
        * Automatic parameter selection
        * Probability estimates (logistic regression only)
        * Weights for unbalanced data
 4. sag - the Stochastic Average Gradient method, a stochastic gradient variant for L2-regularized logistic regression on a finite training set.
 5. saga - a variant of SAG that also supports the L1 penalty and the multinomial loss.

2. fit_intercept : bool, default: True. If fit_intercept=True, an intercept (bias) term is estimated along with the coefficients. If fit_intercept=False, the intercept is fixed at 0.
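
The two parameters described above can be tried out on a small synthetic problem. The sketch below uses data from `make_classification` as a stand-in for the real data set, so the scores it prints say nothing about the caravan data; it only shows that every solver is selected through the same `solver` argument and that `fit_intercept=False` leaves the intercept at zero.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem standing in for the real data
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Fit the same model with each solver and report training accuracy
for solver in ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga']:
    clf = LogisticRegression(solver=solver, max_iter=5000, random_state=123)
    clf.fit(X_demo, y_demo)
    print(solver, round(clf.score(X_demo, y_demo), 3))

# With fit_intercept=False the intercept is fixed at zero
clf_no_intercept = LogisticRegression(solver='lbfgs', fit_intercept=False).fit(X_demo, y_demo)
print(clf_no_intercept.intercept_)
```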

In [121]:
# Building a logistic Regression model using 'liblinear' solver
clf_Log1 = LogisticRegression(solver='liblinear', max_iter=1000, random_state=123)
# Fitting the model
clf_Log1.fit(X_train, y_train)
# Predicting the model
y_pred_Log1 = clf_Log1.predict(X_val)

cols = ['Model', 'ROC Score', 'Precision Score', 'Recall Score','Accuracy Score','Kappa Score']
models_report = pd.DataFrame(columns = cols)

tmp1 = pd.Series({'Model': " Logistic Regression with 'liblinear' Solver",
                 'ROC Score' : metrics.roc_auc_score(y_val, y_pred_Log1),
                 'Precision Score': metrics.precision_score(y_val, y_pred_Log1),
                 'Recall Score': metrics.recall_score(y_val, y_pred_Log1),
                 'Accuracy Score': metrics.accuracy_score(y_val, y_pred_Log1),
                 'Kappa Score':metrics.cohen_kappa_score(y_val, y_pred_Log1)})

model1_report = pd.DataFrame([tmp1])
model1_report

Unnamed: 0,Model,ROC Score,Precision Score,Recall Score,Accuracy Score,Kappa Score
0,Logistic Regression with 'liblinear' Solver,0.673318,0.128852,0.630137,0.711111,0.123106


In [122]:
# Building a logistic Regression model using 'newton-cg' solver
clf_Log2 = LogisticRegression(solver='newton-cg', max_iter=50, random_state=123)
# Fitting the model
clf_Log2.fit(X_train, y_train)
# Predicting the model
y_pred_Log2 = clf_Log2.predict(X_val)

tmp2 = pd.Series({'Model': " Logistic Regression with 'newton-cg' Solver ",
                 'ROC Score' : metrics.roc_auc_score(y_val, y_pred_Log2),
                 'Precision Score': metrics.precision_score(y_val, y_pred_Log2),
                 'Recall Score': metrics.recall_score(y_val, y_pred_Log2),
                 'Accuracy Score': metrics.accuracy_score(y_val, y_pred_Log2),
                 'Kappa Score':metrics.cohen_kappa_score(y_val, y_pred_Log2)})

model2_report = pd.DataFrame([tmp2])
model2_report

Unnamed: 0,Model,ROC Score,Precision Score,Recall Score,Accuracy Score,Kappa Score
0,Logistic Regression with 'newton-cg' Solver,0.670115,0.12931,0.616438,0.717094,0.123351


In [123]:
# Building a logistic Regression model using 'lbfgs' solver
clf_Log3 = LogisticRegression(solver='lbfgs', max_iter=100, random_state=123)
# Fitting the model
clf_Log3.fit(X_train, y_train)
# Predicting the model
y_pred_Log3 = clf_Log3.predict(X_val)

tmp3 = pd.Series({'Model': " Logistic Regression with 'lbfgs' Solver",
                 'ROC Score' : metrics.roc_auc_score(y_val, y_pred_Log3),
                 'Precision Score': metrics.precision_score(y_val, y_pred_Log3),
                 'Recall Score': metrics.recall_score(y_val, y_pred_Log3),
                 'Accuracy Score': metrics.accuracy_score(y_val, y_pred_Log3),
                 'Kappa Score':metrics.cohen_kappa_score(y_val, y_pred_Log3)})

model3_report = pd.DataFrame([tmp3])
model3_report

Unnamed: 0,Model,ROC Score,Precision Score,Recall Score,Accuracy Score,Kappa Score
0,Logistic Regression with 'lbfgs' Solver,0.669659,0.12894,0.616438,0.716239,0.122736


In [124]:
# Building a logistic Regression model using 'sag' solver
clf_Log4 = LogisticRegression(solver='sag', max_iter=500, random_state=123)
# Fitting the model
clf_Log4.fit(X_train, y_train)
# Predicting the model
y_pred_Log4 = clf_Log4.predict(X_val)


tmp4 = pd.Series({'Model': " Logistic Regression with 'sag' Solver",
                 'ROC Score' : metrics.roc_auc_score(y_val, y_pred_Log4),
                 'Precision Score': metrics.precision_score(y_val, y_pred_Log4),
                 'Recall Score': metrics.recall_score(y_val, y_pred_Log4),
                 'Accuracy Score': metrics.accuracy_score(y_val, y_pred_Log4),
                 'Kappa Score':metrics.cohen_kappa_score(y_val, y_pred_Log4)})

model4_report = pd.DataFrame([tmp4])
model4_report

Unnamed: 0,Model,ROC Score,Precision Score,Recall Score,Accuracy Score,Kappa Score
0,Logistic Regression with 'sag' Solver,0.670115,0.12931,0.616438,0.717094,0.123351


In [125]:
# Building a logistic Regression model using 'saga' solver
clf_Log5 = LogisticRegression(solver='saga', max_iter=1000, random_state=123,verbose=2,class_weight='balanced')
# Fitting the model
clf_Log5.fit(X_train, y_train)
# Predicting the model
y_pred_Log5 = clf_Log5.predict(X_val)

tmp5 = pd.Series({'Model': " Logistic Regression with 'saga' Solver",
                 'ROC Score' : metrics.roc_auc_score(y_val, y_pred_Log5),
                 'Precision Score': metrics.precision_score(y_val, y_pred_Log5),
                 'Recall Score': metrics.recall_score(y_val, y_pred_Log5),
                 'Accuracy Score': metrics.accuracy_score(y_val, y_pred_Log5),
                 'Kappa Score':metrics.cohen_kappa_score(y_val, y_pred_Log5)})

model5_report = pd.DataFrame([tmp5])
model5_report

convergence after 143 epochs took 4 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.0s finished


Unnamed: 0,Model,ROC Score,Precision Score,Recall Score,Accuracy Score,Kappa Score
0,Logistic Regression with 'saga' Solver,0.670115,0.12931,0.616438,0.717094,0.123351


In [126]:
# Comparison of Logistic Regression models in terms of various solvers
model_log = pd.concat([model1_report, model2_report, model3_report, model4_report, model5_report])
model_log

Unnamed: 0,Model,ROC Score,Precision Score,Recall Score,Accuracy Score,Kappa Score
0,Logistic Regression with 'liblinear' Solver,0.673318,0.128852,0.630137,0.711111,0.123106
0,Logistic Regression with 'newton-cg' Solver,0.670115,0.12931,0.616438,0.717094,0.123351
0,Logistic Regression with 'lbfgs' Solver,0.669659,0.12894,0.616438,0.716239,0.122736
0,Logistic Regression with 'sag' Solver,0.670115,0.12931,0.616438,0.717094,0.123351
0,Logistic Regression with 'saga' Solver,0.670115,0.12931,0.616438,0.717094,0.123351
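
The ROC Score column above is computed from hard class predictions (`predict`), which evaluates a single operating point. `roc_auc_score` also accepts continuous scores, so the positive-class probability from `predict_proba` can be passed instead to assess the full ranking. The sketch below shows both variants on synthetic imbalanced data, which stands in for the real set; its numbers are not comparable with the table.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 6% positives), standing in for the real set
X_demo, y_demo = make_classification(n_samples=2000, n_features=20,
                                     weights=[0.94], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=42)

clf = LogisticRegression(solver='liblinear', class_weight='balanced').fit(X_tr, y_tr)

# AUC from hard 0/1 labels: one point on the ROC curve
auc_from_labels = roc_auc_score(y_va, clf.predict(X_va))
# AUC from predicted probabilities of the positive class: the whole curve
auc_from_scores = roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1])

print(auc_from_labels, auc_from_scores)
```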


# 2. Decision Tree Classifier

1. criterion : string, optional (default=”gini”). The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.

2. splitter : string, optional (default=”best”). The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.
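
The two split criteria can be written out directly. The sketch below computes the Gini impurity and the entropy of a set of class labels; information gain is the drop in entropy from a parent node to its children. The helper names `gini_impurity` and `entropy` are illustrative and are not scikit-learn functions.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

balanced = [0, 0, 1, 1]   # maximally mixed node
pure = [1, 1, 1, 1]       # node containing a single class

print(gini_impurity(balanced), entropy(balanced))  # 0.5 1.0
print(gini_impurity(pure), entropy(pure))
```

Both measures are zero for a pure node and largest for an evenly mixed one, which is why a split that lowers them is considered better.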

In [172]:
# Building a Decision Tree model using the 'gini' criterion and 'best' splitter
# min_impurity_split is omitted: it was deprecated and later removed from scikit-learn
clf_DT1 = DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=10, 
                                min_samples_split=2, min_samples_leaf=1, 
                                min_weight_fraction_leaf=0.0)
# Fitting the model
clf_DT1.fit(X_train, y_train)
# Predicting the model
y_pred_DT1 = clf_DT1.predict(X_val)

tmp1 = pd.Series({'Model': " Decision Tree with (GINI & BEST) ",
                 'ROC Score' : metrics.roc_auc_score(y_val, y_pred_DT1),
                 'Precision Score': metrics.precision_score(y_val, y_pred_DT1),
                 'Recall Score': metrics.recall_score(y_val, y_pred_DT1),
                 'Accuracy Score': metrics.accuracy_score(y_val, y_pred_DT1),
                 'Kappa Score':metrics.cohen_kappa_score(y_val, y_pred_DT1)})

model_dt1_report = pd.DataFrame([tmp1])
model_dt1_report

Unnamed: 0,Model,ROC Score,Precision Score,Recall Score,Accuracy Score,Kappa Score
0,Decision Tree with (GINI & BEST),0.539816,0.107438,0.178082,0.85641,0.060932


In [173]:
# Building a Decision Tree model using the 'gini' criterion and 'random' splitter
# min_impurity_split is omitted: it was deprecated and later removed from scikit-learn
clf_DT2 = DecisionTreeClassifier(criterion='gini', splitter='random', max_depth=10, 
                                min_samples_split=2, min_samples_leaf=1, 
                                min_weight_fraction_leaf=0.0)
# Fitting the model
clf_DT2.fit(X_train, y_train)
# Predicting the model
y_pred_DT2 = clf_DT2.predict(X_val)

tmp2 = pd.Series({'Model': " Decision Tree with (GINI & RANDOM) ",
                 'ROC Score' : metrics.roc_auc_score(y_val, y_pred_DT2),
                 'Precision Score': metrics.precision_score(y_val, y_pred_DT2),
                 'Recall Score': metrics.recall_score(y_val, y_pred_DT2),
                 'Accuracy Score': metrics.accuracy_score(y_val, y_pred_DT2),
                 'Kappa Score':metrics.cohen_kappa_score(y_val, y_pred_DT2)})

model_dt2_report = pd.DataFrame([tmp2])
model_dt2_report

Unnamed: 0,Model,ROC Score,Precision Score,Recall Score,Accuracy Score,Kappa Score
0,Decision Tree with (GINI & RANDOM),0.575923,0.123529,0.287671,0.828205,0.093722


In [174]:
# Building a Decision Tree model using the 'entropy' criterion and 'best' splitter
# min_impurity_split is omitted: it was deprecated and later removed from scikit-learn
clf_DT3 = DecisionTreeClassifier(criterion='entropy', splitter='best', max_depth=10, 
                                min_samples_split=2, min_samples_leaf=1, 
                                min_weight_fraction_leaf=0.0)
# Fitting the model
clf_DT3.fit(X_train, y_train)
# Predicting the model
y_pred_DT3 = clf_DT3.predict(X_val)

tmp3 = pd.Series({'Model': " Decision Tree with (ENTROPY & BEST) ",
                 'ROC Score' : metrics.roc_auc_score(y_val, y_pred_DT3),
                 'Precision Score': metrics.precision_score(y_val, y_pred_DT3),
                 'Recall Score': metrics.recall_score(y_val, y_pred_DT3),
                 'Accuracy Score': metrics.accuracy_score(y_val, y_pred_DT3),
                 'Kappa Score':metrics.cohen_kappa_score(y_val, y_pred_DT3)})

model_dt3_report = pd.DataFrame([tmp3])
model_dt3_report

Unnamed: 0,Model,ROC Score,Precision Score,Recall Score,Accuracy Score,Kappa Score
0,Decision Tree with (ENTROPY & BEST),0.522459,0.09009,0.136986,0.859829,0.036137


In [175]:
# Building a Decision Tree model using the 'entropy' criterion and 'random' splitter
# min_impurity_split is omitted: it was deprecated and later removed from scikit-learn
clf_DT4 = DecisionTreeClassifier(criterion='entropy', splitter='random', max_depth=10, 
                                min_samples_split=2, min_samples_leaf=1, 
                                min_weight_fraction_leaf=0.0)
# Fitting the model
clf_DT4.fit(X_train, y_train)
# Predicting the model
y_pred_DT4 = clf_DT4.predict(X_val)

tmp4 = pd.Series({'Model': " Decision Tree with (ENTROPY & RANDOM) ",
                 'ROC Score' : metrics.roc_auc_score(y_val, y_pred_DT4),
                 'Precision Score': metrics.precision_score(y_val, y_pred_DT4),
                 'Recall Score': metrics.recall_score(y_val, y_pred_DT4),
                 'Accuracy Score': metrics.accuracy_score(y_val, y_pred_DT4),
                 'Kappa Score':metrics.cohen_kappa_score(y_val, y_pred_DT4)})

model_dt4_report = pd.DataFrame([tmp4])
model_dt4_report

Unnamed: 0,Model,ROC Score,Precision Score,Recall Score,Accuracy Score,Kappa Score
0,Decision Tree with (ENTROPY & RANDOM),0.61065,0.135922,0.383562,0.809402,0.1196


In [176]:
# Comparison of Decision Tree models based on criterion and splitter
model_DT = pd.concat([model_dt1_report, model_dt2_report, model_dt3_report, model_dt4_report])
model_DT

Unnamed: 0,Model,ROC Score,Precision Score,Recall Score,Accuracy Score,Kappa Score
0,Decision Tree with (GINI & BEST),0.539816,0.107438,0.178082,0.85641,0.060932
0,Decision Tree with (GINI & RANDOM),0.575923,0.123529,0.287671,0.828205,0.093722
0,Decision Tree with (ENTROPY & BEST),0.522459,0.09009,0.136986,0.859829,0.036137
0,Decision Tree with (ENTROPY & RANDOM),0.61065,0.135922,0.383562,0.809402,0.1196


# 3. Random Forest

1. criterion : string, optional (default=”gini”)
    The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Note: this parameter is tree-specific.

2. max_features : int, float, string or None, optional (default=”auto”)

    The number of features to consider when looking for the best split:

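    1. If “auto”, then max_features=sqrt(n_features). Note that “auto” has been deprecated and removed in newer scikit-learn releases; “sqrt” gives the same behaviour for classifiers.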
    2. If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
    3. If “log2”, then max_features=log2(n_features)
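
For the 85 predictor columns used in this notebook, the two named settings work out as follows. The sketch assumes that the result is truncated to an integer with a minimum of 1, which is how scikit-learn appears to resolve these options.

```python
import math

n_features = 85  # number of predictor columns in this notebook

# Assumed rule: truncate to an integer, but never go below 1
sqrt_features = max(1, int(math.sqrt(n_features)))
log2_features = max(1, int(math.log2(n_features)))

print(sqrt_features, log2_features)  # 9 6
```

So each split in a tree considers 9 candidate features with “sqrt” and 6 with “log2”.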

In [164]:
# Building a Random Forest model using the 'gini' criterion and 'sqrt' max_features
# ('sqrt' is equivalent to the older 'auto' setting for classifiers)
clf_RF1 = RandomForestClassifier(n_estimators=500, criterion='gini', max_depth=15,
                                min_samples_split=2, min_samples_leaf=1,
                                max_features='sqrt', 
                                n_jobs=1,random_state=42)
# Fitting the model
clf_RF1.fit(X_train, y_train)
# Predicting the model
y_pred_RF1 = clf_RF1.predict(X_val)

tmp1 = pd.Series({'Model': " Random Forest with GINI ",
                 'ROC Score' : metrics.roc_auc_score(y_val, y_pred_RF1),
                 'Precision Score': metrics.precision_score(y_val, y_pred_RF1),
                 'Recall Score': metrics.recall_score(y_val, y_pred_RF1),
                 'Accuracy Score': metrics.accuracy_score(y_val, y_pred_RF1),
                 'Kappa Score':metrics.cohen_kappa_score(y_val, y_pred_RF1)})

model_rt1_report = pd.DataFrame([tmp1])
model_rt1_report

Unnamed: 0,Model,ROC Score,Precision Score,Recall Score,Accuracy Score,Kappa Score
0,Random Forest with GINI,0.535183,0.2,0.09589,0.919658,0.092948


In [165]:
# Building a Random Forest model using the 'entropy' criterion and 'sqrt' max_features
# ('sqrt' is equivalent to the older 'auto' setting for classifiers)
clf_RF2 = RandomForestClassifier(n_estimators=500, criterion='entropy', max_depth=15,
                                min_samples_split=2, min_samples_leaf=1,
                                max_features='sqrt', random_state=42)
# Fitting the model
clf_RF2.fit(X_train, y_train)
# Predicting the model
y_pred_RF2 = clf_RF2.predict(X_val)

tmp2 = pd.Series({'Model': " Random Forest with Entropy ",
                 'ROC Score' : metrics.roc_auc_score(y_val, y_pred_RF2),
                 'Precision Score': metrics.precision_score(y_val, y_pred_RF2),
                 'Recall Score': metrics.recall_score(y_val, y_pred_RF2),
                 'Accuracy Score': metrics.accuracy_score(y_val, y_pred_RF2),
                 'Kappa Score':metrics.cohen_kappa_score(y_val, y_pred_RF2)})

model_rt2_report = pd.DataFrame([tmp2])
model_rt2_report

Unnamed: 0,Model,ROC Score,Precision Score,Recall Score,Accuracy Score,Kappa Score
0,Random Forest with Entropy,0.529701,0.193548,0.082192,0.921368,0.081209


In [166]:
# Building a Random Forest model using the 'gini' criterion and 'sqrt' max_features
clf_RF3 = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=15,
                                min_samples_split=5, min_samples_leaf=1,
                                max_features='sqrt', random_state=42)
# Fitting the model
clf_RF3.fit(X_train, y_train)
# Predicting the model
y_pred_RF3 = clf_RF3.predict(X_val)

tmp3 = pd.Series({'Model': " Random Forest with GINI (sqrt)",
                 'ROC Score' : metrics.roc_auc_score(y_val, y_pred_RF3),
                 'Precision Score': metrics.precision_score(y_val, y_pred_RF3),
                 'Recall Score': metrics.recall_score(y_val, y_pred_RF3),
                 'Accuracy Score': metrics.accuracy_score(y_val, y_pred_RF3),
                 'Kappa Score':metrics.cohen_kappa_score(y_val, y_pred_RF3)})

model_rt3_report = pd.DataFrame([tmp3])
model_rt3_report

Unnamed: 0,Model,ROC Score,Precision Score,Recall Score,Accuracy Score,Kappa Score
0,Random Forest with GINI (sqrt),0.518281,0.166667,0.054795,0.923932,0.053243


In [178]:
# Building a Random Forest model using the 'entropy' criterion and 'sqrt' max_features
clf_RF4 = RandomForestClassifier(n_estimators=100, criterion='entropy', max_depth=15,
                                min_samples_split=5, min_samples_leaf=1,
                                max_features='sqrt', random_state=42)
# Fitting the model
clf_RF4.fit(X_train, y_train)
# Predicting the model
y_pred_RF4 = clf_RF4.predict(X_val)

tmp4 = pd.Series({'Model': " Random Forest with Entropy (sqrt) ",
                 'ROC Score' : metrics.roc_auc_score(y_val, y_pred_RF4),
                 'Precision Score': metrics.precision_score(y_val, y_pred_RF4),
                 'Recall Score': metrics.recall_score(y_val, y_pred_RF4),
                 'Accuracy Score': metrics.accuracy_score(y_val, y_pred_RF4),
                 'Kappa Score':metrics.cohen_kappa_score(y_val, y_pred_RF4)})

model_rt4_report = pd.DataFrame([tmp4])
model_rt4_report

Unnamed: 0,Model,ROC Score,Precision Score,Recall Score,Accuracy Score,Kappa Score
0,Random Forest with Entropy (sqrt),0.519649,0.190476,0.054795,0.926496,0.05887


In [187]:
# Building a Random Forest model using the 'gini' criterion and 'log2' max_features
clf_RF5 = RandomForestClassifier(n_estimators=250, criterion='gini', max_depth=15,
                                min_samples_split=11, min_samples_leaf=1,
                                max_features='log2',random_state=42)
# Fitting the model
clf_RF5.fit(X_train, y_train)
# Predicting the model
y_pred_RF5 = clf_RF5.predict(X_val)

tmp5 = pd.Series({'Model': " Random Forest with GINI (log2)",
                 'ROC Score' : metrics.roc_auc_score(y_val, y_pred_RF5),
                 'Precision Score': metrics.precision_score(y_val, y_pred_RF5),
                 'Recall Score': metrics.recall_score(y_val, y_pred_RF5),
                 'Accuracy Score': metrics.accuracy_score(y_val, y_pred_RF5),
                 'Kappa Score':metrics.cohen_kappa_score(y_val, y_pred_RF5)})

model_rt5_report = pd.DataFrame([tmp5])
model_rt5_report

Unnamed: 0,Model,ROC Score,Precision Score,Recall Score,Accuracy Score,Kappa Score
0,Random Forest with GINI (log2),0.531524,0.222222,0.082192,0.924786,0.089317


In [181]:
# Building a Random Forest model using the 'entropy' criterion and 'log2' max_features
clf_RF6 = RandomForestClassifier(n_estimators=250, criterion='entropy', max_depth=15,
                                min_samples_split=11, min_samples_leaf=1,
                                max_features='log2',random_state=42)
# Fitting the model
clf_RF6.fit(X_train, y_train)
# Predicting the model
y_pred_RF6 = clf_RF6.predict(X_val)

tmp6 = pd.Series({'Model': " Random Forest with Entropy (log2) ",
                 'ROC Score' : metrics.roc_auc_score(y_val, y_pred_RF6),
                 'Precision Score': metrics.precision_score(y_val, y_pred_RF6),
                 'Recall Score': metrics.recall_score(y_val, y_pred_RF6),
                 'Accuracy Score': metrics.accuracy_score(y_val, y_pred_RF6),
                 'Kappa Score':metrics.cohen_kappa_score(y_val, y_pred_RF6)})

model_rt6_report = pd.DataFrame([tmp6])
model_rt6_report

Unnamed: 0,Model,ROC Score,Precision Score,Recall Score,Accuracy Score,Kappa Score
0,Random Forest with Entropy (log2),0.520105,0.2,0.054795,0.92735,0.060818


In [194]:
# Comparison of Random Forest models based on criterion and max_features
model_rf = pd.concat([model_rt1_report, model_rt2_report, model_rt3_report,
                      model_rt4_report, model_rt5_report, model_rt6_report])
model_rf

Unnamed: 0,Model,ROC Score,Precision Score,Recall Score,Accuracy Score,Kappa Score
0,Random Forest with GINI,0.535183,0.2,0.09589,0.919658,0.092948
0,Random Forest with Entropy,0.529701,0.193548,0.082192,0.921368,0.081209
0,Random Forest with GINI (sqrt),0.518281,0.166667,0.054795,0.923932,0.053243
0,Random Forest with Entropy (sqrt),0.519649,0.190476,0.054795,0.926496,0.05887
0,Random Forest with GINI (log2),0.531524,0.222222,0.082192,0.924786,0.089317
0,Random Forest with Entropy (log2),0.520105,0.2,0.054795,0.92735,0.060818


In [186]:
# Comparison of all models
clas_model = pd.concat([model_log, model_DT, model_rf])
clas_model

Unnamed: 0,Model,ROC Score,Precision Score,Recall Score,Accuracy Score,Kappa Score
0,Logistic Regression with 'liblinear' Solver,0.673318,0.128852,0.630137,0.711111,0.123106
0,Logistic Regression with 'newton-cg' Solver,0.670115,0.12931,0.616438,0.717094,0.123351
0,Logistic Regression with 'lbfgs' Solver,0.669659,0.12894,0.616438,0.716239,0.122736
0,Logistic Regression with 'sag' Solver,0.670115,0.12931,0.616438,0.717094,0.123351
0,Logistic Regression with 'saga' Solver,0.670115,0.12931,0.616438,0.717094,0.123351
0,Decision Tree with (GINI & BEST),0.539816,0.107438,0.178082,0.85641,0.060932
0,Decision Tree with (GINI & RANDOM),0.575923,0.123529,0.287671,0.828205,0.093722
0,Decision Tree with (ENTROPY & BEST),0.522459,0.09009,0.136986,0.859829,0.036137
0,Decision Tree with (ENTROPY & RANDOM),0.61065,0.135922,0.383562,0.809402,0.1196
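0,Random Forest with GINI,0.535183,0.2,0.09589,0.919658,0.092948
0,Random Forest with Entropy,0.529701,0.193548,0.082192,0.921368,0.081209
0,Random Forest with GINI (sqrt),0.518281,0.166667,0.054795,0.923932,0.053243
0,Random Forest with Entropy (sqrt),0.519649,0.190476,0.054795,0.926496,0.05887
0,Random Forest with GINI (log2),0.531524,0.222222,0.082192,0.924786,0.089317
0,Random Forest with Entropy (log2),0.520105,0.2,0.054795,0.92735,0.060818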


# CONCLUSION

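As the comparison of the various models shows, of all the models built here, the Random Forest models give the best accuracy, about 92% on the validation set. Accuracy should be read alongside the other columns, however: the same Random Forest models have a recall of only about 0.05 to 0.10, whereas the Logistic Regression models reach a recall of about 0.62 to 0.63 at roughly 71% accuracy.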
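
For context when reading the accuracy figure, the sketch below scores a trivial baseline that never predicts a caravan policy. It assumes the validation set holds 73 positives among 1170 rows; that count is not stated in the notebook and is inferred here from the recall values in the tables (for example 46/73 = 0.630137), so treat it as an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, cohen_kappa_score

# Assumption: 73 positives out of 1170 validation rows (inferred from the recall values)
n_total, n_positive = 1170, 73
y_true = np.array([1] * n_positive + [0] * (n_total - n_positive))

# A trivial "model" that never predicts a caravan policy
y_pred_all_negative = np.zeros(n_total, dtype=int)

baseline_accuracy = accuracy_score(y_true, y_pred_all_negative)
baseline_recall = recall_score(y_true, y_pred_all_negative)
baseline_kappa = cohen_kappa_score(y_true, y_pred_all_negative)

print(baseline_accuracy, baseline_recall, baseline_kappa)
```

Under that assumption the baseline reaches about 93.8% accuracy with zero recall and zero kappa, which is why recall and kappa are useful companions to accuracy on data this imbalanced.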

