# Business Objective
This dataset is taken from the <b>UCI Machine Learning Repository</b>.
<br>
Link: https://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+%28COIL+2000%29
<br>
<b>Data Set Information:</b>
<br>
Information about customers consists of <font color="red">86</font> variables and includes product usage data and socio-demographic data derived from zip area codes. The data was supplied by the <font color="red">Dutch data mining company Sentient Machine Research</font> and is based on a real world business problem. The training set contains over <font color="red">5000</font> descriptions of customers, including the information of whether or not they have a caravan insurance policy. A test set contains <font color="red">4000</font> customers of whom only the organisers know if they have a caravan insurance policy.
<br>
<br>
<b>Objective:</b> Our objective is to predict whether a customer is likely to hold a caravan insurance policy (the CARAVAN target column).
<br>
## How does this dataset address our idea in the Accenture Innovation contest?
In this contest we addressed the challenge faced by Indian insurance companies of choosing the right product for the right customer. This sample POC has a similar business objective: we try to predict whether a customer with a particular set of features will be interested in a particular product (in our case, "a Caravan Policy").

## Approach to the problem
We take a two-step approach to this problem:<br>
i) Exploratory data analysis to find the relevant columns to use as model inputs<br>
ii) Classification model building and performance evaluation

# Model Building

In [1]:
import pandas as pd
import os
import matplotlib.pyplot as plt
import numpy as np
import math

In [2]:
path = r'C:\Users\sandipto.sanyal\Desktop\ML\insurance-master\insurance-master'
os.chdir(path)

## Import our modified training and test datasets

In [3]:
train = pd.read_csv('train_modified.csv')
test = pd.read_csv('test_modified.csv')
test_eval = pd.read_csv('tictgts2000.txt',names=['CARAVAN'])
test = pd.concat([test,test_eval],axis=1)

## Joining the train and test sets for one-hot encoding

In [4]:
temp_df = pd.concat([train,test],axis=0)

In [5]:
print(temp_df.columns)

Index(['MOSTYPE', 'MAANTHUI', 'MGEMOMV', 'MGEMLEEF', 'MGODRK', 'MGODPR',
       'MGODOV', 'MGODGE', 'MRELGE', 'MRELSA', 'MFALLEEN', 'MFGEKIND',
       'MOPLHOOG', 'MOPLMIDD', 'MOPLLAAG', 'MBERHOOG', 'MBERZELF', 'MBERBOER',
       'MBERMIDD', 'MBERARBG', 'MBERARBO', 'MSKA', 'MSKB1', 'MSKB2', 'MSKC',
       'MSKD', 'MHHUUR', 'MAUT1', 'MAUT2', 'MAUT0', 'MZFONDS', 'MINKM30',
       'MINK3045', 'MINK4575', 'MINK7512', 'MINK123M', 'MINKGEM', 'MKOOPKLA',
       'PWAPART', 'CARAVAN'],
      dtype='object')


## One-Hot Encoding
As per the business data description, one-hot encoding is applied to the following column:<br>
i) MOSTYPE: a categorical "Customer subtype" code<br>
The other columns denote ordered ranges (age bands, percentage bands), so we leave them as numeric.
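
As a small illustration of why the two sets are joined before encoding, the sketch below uses toy frames with made-up values (the names `toy_train` and `toy_test` are hypothetical, not part of the dataset). When a category is missing from one set, encoding each set separately produces different columns, while encoding the joined frame keeps them aligned.

```python
import pandas as pd

# Toy frames with made-up values; category 3 never appears in the test frame
toy_train = pd.DataFrame({'MOSTYPE': [1, 2, 3], 'MAANTHUI': [1, 1, 2]})
toy_test = pd.DataFrame({'MOSTYPE': [1, 2], 'MAANTHUI': [1, 2]})

# Encoding each frame on its own yields mismatched columns
sep_train = pd.get_dummies(toy_train, columns=['MOSTYPE'], drop_first=True)
sep_test = pd.get_dummies(toy_test, columns=['MOSTYPE'], drop_first=True)
print(list(sep_train.columns))  # ['MAANTHUI', 'MOSTYPE_2', 'MOSTYPE_3']
print(list(sep_test.columns))   # ['MAANTHUI', 'MOSTYPE_2']

# Encoding the joined frame and splitting by row count keeps columns aligned
joint = pd.get_dummies(pd.concat([toy_train, toy_test], axis=0),
                       columns=['MOSTYPE'], drop_first=True)
n = len(toy_train)
enc_train, enc_test = joint.iloc[:n], joint.iloc[n:]
print(list(enc_train.columns) == list(enc_test.columns))  # True
```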

In [6]:
temp_df = pd.get_dummies(temp_df,drop_first=True,columns=['MOSTYPE'])
#dividing the train and test datasets again after performing one-hot encoding
n_train = len(train)  # 5822 rows in the original training set
train = temp_df.iloc[:n_train,:]
test = temp_df.iloc[n_train:,:]

## Converting the dataframes to numpy arrays

In [7]:
X_train = np.array(train.drop(columns=['CARAVAN']))
y_train = np.array(train.CARAVAN)
X_test = np.array(test.drop(columns=['CARAVAN']))
y_test = np.array(test.CARAVAN)

## Applying standard scaler to standardize the values of columns
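Standardization puts all columns on a comparable scale (zero mean, unit variance), which helps scale-sensitive models such as logistic regression and neural networks. The scaler is fit on the training set only and then applied unchanged to the test set.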
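
A minimal NumPy sketch of what this step computes, assuming the default `StandardScaler` behaviour (per-column mean and population standard deviation). The toy arrays are made-up values; the point is that the statistics come from the training data only and are reused for the test data.

```python
import numpy as np

# Toy data with made-up values: two columns on very different scales
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, 40.0]])

# Statistics are estimated on the training set only
mu = toy_train.mean(axis=0)
sigma = toy_train.std(axis=0)  # population std (ddof=0)

# The same mu and sigma are applied to both sets
train_scaled = (toy_train - mu) / sigma
test_scaled = (toy_test - mu) / sigma

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Test values can fall outside the range seen in training (the second column here), which is expected because the test set does not influence the scaling.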

In [8]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)



## Performing SMOTE to reduce imbalance in the training dataset
Since the training set is highly imbalanced, we oversample the minority class (customers who hold a caravan policy) using the SMOTE algorithm.
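
To give an intuition for the oversampling, here is a simplified sketch of the interpolation idea behind SMOTE: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. The function `smote_like_sample` and its parameters are hypothetical and written only for illustration; it is not the `imblearn` implementation used in this notebook.

```python
import numpy as np

def smote_like_sample(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    neighbors = np.argsort(d, axis=1)[:, :k]
    # Pick a random base sample and one of its neighbours for each new point
    base = rng.integers(0, len(X_min), size=n_new)
    nb = neighbors[base, rng.integers(0, k, size=n_new)]
    # Interpolate with a random factor in [0, 1)
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Toy minority class with made-up values (corners of the unit square)
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synth = smote_like_sample(X_min, n_new=6, k=2, seed=12)
print(synth.shape)  # (6, 2)
```

Because every synthetic point is a blend of two real minority samples, the new points stay inside the region already occupied by the minority class.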

In [9]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state = 12)
X_res, y_res = smote.fit_resample(X_train, y_train)  # fit_sample was removed in recent imbalanced-learn versions
unique, counts = np.unique(y_res, return_counts=True)
print('Distribution of CARAVAN column before SMOTE')
print(np.asarray((np.unique(y_train,return_counts=True))))
print('Distribution of CARAVAN column after SMOTE')
print(np.asarray((unique, counts)))

Distribution of CARAVAN column before SMOTE
[[   0    1]
 [5474  348]]
Distribution of CARAVAN column after SMOTE
[[   0    1]
 [5474 5474]]


## Model building

In [12]:
def classifier(choice):
    """Return an unfitted model: 1 = logistic regression, 2 = random forest,
    3 = Gaussian naive Bayes, 4 = Keras neural network (see ANN())."""
    if choice == 1:
        from sklearn.linear_model import LogisticRegression
        model = LogisticRegression(random_state = 12)
    elif choice == 2:
        from sklearn.ensemble import RandomForestClassifier
        model = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 0)
    elif choice == 3:
        from sklearn.naive_bayes import GaussianNB
        model = GaussianNB()
    elif choice == 4:
        model = ANN()
    else:
        raise ValueError('choice must be 1, 2, 3 or 4')
    return model

In [11]:
def ANN():
    from keras.models import Sequential
    from keras.layers import Dense, Activation
    model = Sequential()
    #input layer
    model.add(Dense(units=38,activation='relu',input_shape=(X_res.shape[1],)))
    #hidden layer
    model.add(Dense(units=38,activation='relu'))
    #output layer
    model.add(Dense(units=1,activation='sigmoid'))
    #compile
    model.compile(
     optimizer = "adam",
     loss = "binary_crossentropy",
     metrics = ["accuracy"]
    )
    return model

In [13]:
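choice = 4
model = classifier(choice)
if choice == 4:
    # Keras fit() returns a History object, so keep a reference to the model itself
    model.fit(X_res,y_res,
              epochs=2,
              batch_size = 500,
              validation_data = (X_test, y_test)
              )
else:
    model.fit(X_res,y_res)

ModuleNotFoundError: No module named 'keras'

In [None]:
y_pred = model.predict(X_test)
if choice == 4:
    # the sigmoid output is a probability; threshold at 0.5 to get class labels
    y_pred = (y_pred > 0.5).astype(int).ravel()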

## Performance metrics
Here we evaluate the performance of the different models.<br>
We check the following metrics:<br>
i) Accuracy: the fraction of correct predictions over the whole test set.<br>
ii) Precision: of all customers predicted to be positive (interested in the product), the fraction that actually are.<br>
iii) Recall: of all actual positives in the test set, the fraction our model correctly predicts.<br>
iv) F1 score: the harmonic mean of precision and recall.
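
The four metrics above can be worked through by hand on a small example. The sketch below uses made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical) and computes each metric from the confusion-matrix counts. It also hints at why accuracy alone is not enough here: with the class counts printed earlier (5474 negatives, 348 positives in training), a model that always predicts 0 would already be right about 94% of the time on that distribution while having zero recall.

```python
import numpy as np

# Made-up labels for illustration: 1 = has a caravan policy
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 0, 0, 0])
y_pred_toy = np.array([0, 1, 0, 0, 1, 0, 1, 0, 0, 1])

# Confusion-matrix counts
tp = int(np.sum((y_true_toy == 1) & (y_pred_toy == 1)))  # true positives
fp = int(np.sum((y_true_toy == 0) & (y_pred_toy == 1)))  # false positives
fn = int(np.sum((y_true_toy == 1) & (y_pred_toy == 0)))  # false negatives
tn = int(np.sum((y_true_toy == 0) & (y_pred_toy == 0)))  # true negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print('Accuracy: %.2f' % accuracy)    # 0.70
print('Precision: %.2f' % precision)  # 0.50
print('Recall: %.2f' % recall)        # 0.67
print('F1: %.2f' % f1)                # 0.57
```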

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
y_test1 = y_test
cm = confusion_matrix(y_test1, y_pred)
print('Confusion matrix:\n' + str(cm))
print('Accuracy_score: %.2f' %(accuracy_score(y_test1, y_pred)))
print('Precision_score: %.2f' %(precision_score(y_test1, y_pred))) # tp / (tp + fp)
print('Recall_score: %.2f' %(recall_score(y_test1, y_pred))) # tp / (tp + fn)
print('f1_score: %.2f' %(f1_score(y_test1, y_pred)))

In [None]:
unique, counts = np.unique(y_pred, return_counts=True)
print('Prediction distribution')
print(np.asarray((unique, counts)))

unique, counts = np.unique(y_test, return_counts=True)
print('Validation distribution')
print(np.asarray((unique, counts)))

In [None]:
X_train.shape