# Business Objective
This dataset is taken from the <b>UCI Machine Learning</b> Repository.
<br>
Link: https://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+%28COIL+2000%29
<br>
<b>Data Set Information:</b>
<br>
Information about customers consists of <font color="red">86</font> variables and includes product usage data and socio-demographic data derived from zip area codes. The data was supplied by the <font color="red">Dutch data mining company Sentient Machine Research</font> and is based on a real world business problem. The training set contains over <font color="red">5000</font> descriptions of customers, including the information of whether or not they have a caravan insurance policy. A test set contains <font color="red">4000</font> customers of whom only the organisers know if they have a caravan insurance policy.
<br>
<br>
<b>Objective:</b> Our objective is to predict whether a customer will be interested in a "Caravan policy".
<br>
## How will this dataset address our idea in the Accenture Innovation contest?
In this contest we addressed the challenge Indian insurance companies face in choosing the right product for the right customer. This sample POC has a similar business objective: we try to predict whether a customer with particular features will like a particular product (in our case, a "Caravan policy") or not.

## Approach to the problem
We take a two-step approach to this problem:<br>
i) Exploratory data analysis to find the relevant columns to use as input<br>
ii) Building classification models and evaluating their performance

# Model Building

In [1]:
import pandas as pd
import os
import matplotlib.pyplot as plt
import numpy as np
import math

In [2]:
path = r'F:\Machine learning\Accenture Innovation'
os.chdir(path)

## Import our modified training and test datasets

In [3]:
train = pd.read_csv('train_modified.csv')
test = pd.read_csv('test_modified.csv')
test_eval = pd.read_csv('tictgts2000.txt',names=['CARAVAN'])
test = pd.concat([test,test_eval],axis=1)

## Joining the train and test sets for one-hot encoding

In [4]:
temp_df = pd.concat([train,test],axis=0)

## One-Hot Encoding
Based on the business data description, one-hot encoding is applied to the following column:<br>
i) MOSTYPE: a categorical "Customer subtype" code<br>
The other coded columns denote ordered ranges (age bands, percentage bands), so we leave them as numeric.

In [5]:
n_train = len(train)  # 5822 rows in the training set
temp_df = pd.get_dummies(temp_df,drop_first=True,columns=['MOSTYPE'])
# split back into train and test by position after one-hot encoding
train = temp_df.iloc[:n_train,:]
test = temp_df.iloc[n_train:,:]
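
To illustrate why the two sets are joined before encoding, here is a minimal sketch on hypothetical toy frames (`toy_train`, `toy_test` and their values are made up for illustration and are not part of the COIL data). A category that appears only in the test rows still gets a dummy column in both halves, so the train and test matrices end up with identical columns.

```python
import pandas as pd

# Hypothetical toy data: subtype 3 occurs only in the test frame.
toy_train = pd.DataFrame({"MOSTYPE": [1, 2, 1], "CARAVAN": [0, 1, 0]})
toy_test = pd.DataFrame({"MOSTYPE": [2, 3], "CARAVAN": [0, 1]})

n_train = len(toy_train)

# Encode on the combined frame so both halves share the same dummy columns.
combined = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)
encoded = pd.get_dummies(combined, columns=["MOSTYPE"], drop_first=True)

# Split back by position.
toy_train_enc = encoded.iloc[:n_train]
toy_test_enc = encoded.iloc[n_train:]

print(list(toy_train_enc.columns))
print(list(toy_test_enc.columns))
```

With `drop_first=True` the first level (`MOSTYPE_1` here) is dropped and acts as the reference category.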

## Converting the dataframes to numpy arrays

In [6]:
X_train = np.array(train.drop(columns=['CARAVAN']))
y_train = np.array(train.CARAVAN)
X_test = np.array(test.drop(columns=['CARAVAN']))
y_test = np.array(test.CARAVAN)

## Applying standard scaler to standardize the values of columns
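Standardizing the features matters for scale-sensitive models such as logistic regression and neural networks. The scaler is fitted on the training set only and then applied unchanged to the test set.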

In [7]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
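
As a rough sketch of what the scaler does, the following NumPy snippet standardizes hypothetical toy arrays (`toy_train` and `toy_test` are made-up values). It assumes the default `StandardScaler` behaviour: subtract the per-column training mean and divide by the per-column training standard deviation (population form, `ddof=0`).

```python
import numpy as np

# Hypothetical toy data, two features on very different scales.
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, 40.0]])

# Statistics come from the training set only.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)  # ddof=0, the population standard deviation

train_scaled = (toy_train - mean) / std
test_scaled = (toy_test - mean) / std  # reuse the training mean and std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Reusing the training statistics on the test set keeps information about the test distribution out of the preprocessing step.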



## Performing SMOTE to reduce imbalance in training dataset
Since the training dataset is highly imbalanced, we oversample the minority class (customers who have a caravan policy) using the SMOTE algorithm.<br>
SMOTE = Synthetic Minority Oversampling Technique

In [16]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state = 12)
X_res, y_res = smote.fit_resample(X_train, y_train)
unique, counts = np.unique(y_res, return_counts=True)
print(np.asarray((unique, counts)))
len(X_res)

[[   0    1]
 [5474 5474]]


10948
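
The core idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the `imblearn` implementation: the function `smote_sketch`, its parameters, and the toy points are all hypothetical. Each synthetic sample is placed at a random position on the line segment between a minority sample and one of its `k` nearest minority neighbours.

```python
import numpy as np

def smote_sketch(X_minority, n_synthetic, k=2, seed=0):
    """Simplified SMOTE: interpolate between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances within the minority class.
    diff = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of every sample.
    neighbours = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_synthetic, X_minority.shape[1]))
    for i in range(n_synthetic):
        base = rng.integers(len(X_minority))       # random minority sample
        neigh = neighbours[base, rng.integers(k)]  # one of its neighbours
        gap = rng.random()                         # position along the segment
        synthetic[i] = X_minority[base] + gap * (X_minority[neigh] - X_minority[base])
    return synthetic

# Hypothetical minority points at the corners of the unit square.
toy_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_sketch(toy_minority, n_synthetic=5)
print(new_points)
```

Because every synthetic point is an interpolation of two existing minority points, the new samples stay inside the region already covered by the minority class.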

## Model building

In [34]:
def classifier(choice):
    """Return an untrained model: 1 = logistic regression, 2 = random forest, 3 = Gaussian naive Bayes, 4 = ANN."""
    if choice == 1:
        from sklearn.linear_model import LogisticRegression
        return LogisticRegression(random_state = 0)
    elif choice == 2:
        from sklearn.ensemble import RandomForestClassifier
        return RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 0)
    elif choice == 3:
        from sklearn.naive_bayes import GaussianNB
        return GaussianNB()
    elif choice == 4:
        return ANN()
    raise ValueError('choice must be 1, 2, 3 or 4')

In [38]:
def ANN():
    from keras.models import Sequential
    from keras.layers import Dense
    classifier = Sequential()
    # two hidden layers of 39 units each, followed by a sigmoid output for binary classification
    classifier.add(Dense(units = 39, kernel_initializer = 'uniform', activation = 'relu', input_dim = X_res.shape[1]))
    classifier.add(Dense(units = 39, kernel_initializer = 'uniform', activation = 'relu'))
    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
    return classifier

In [None]:
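choice = 4
model = classifier(choice)
# batch_size and epochs are Keras arguments, so this call applies to the ANN (choice 4) only
fit = model.fit(X_res,
                y_res,
                batch_size = 10,
                epochs = 100
               )

In [42]:
# predict with the trained model, then threshold the sigmoid outputs at 0.5
y_pred = model.predict(X_test)
y_pred = (y_pred > 0.5)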

  


## Performance metrics
Here we evaluate the performance of the different models on the test set.

In [43]:
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
y_test1 = y_test
cm = confusion_matrix(y_test1, y_pred)
print('Confusion matrix:\n' + str(cm))
print('Accuracy_score: %.2f' %(accuracy_score(y_test1, y_pred)))
print('Precision_score: %.2f' %(precision_score(y_test1, y_pred))) # tp / (tp + fp)
print('Recall_score: %.2f' %(recall_score(y_test1, y_pred))) # tp / (tp + fn)
print('f1_score: %.2f' %(f1_score(y_test1, y_pred)))

Confusion matrix:
[[ 764 2998]
 [  60  178]]
Accuracy_score: 0.24
Precision_score: 0.06
Recall_score: 0.75
f1_score: 0.10
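
As a worked check, the scores above can be recomputed by hand from the confusion matrix. The sketch below assumes scikit-learn's layout (rows are actual classes, columns are predicted classes, so the first row is `[tn, fp]` and the second is `[fn, tp]`); the variable names are illustrative.

```python
# Values copied from the confusion matrix printed above.
tn, fp = 764, 2998
fn, tp = 60, 178

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted buyers who are real buyers
recall = tp / (tp + fn)      # share of real buyers who were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.24 precision=0.06 recall=0.75 f1=0.10
```

The high recall combined with very low precision shows that the model labels most customers as buyers: 3176 of the 4000 test customers are predicted positive, while only 238 actually are.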


In [29]:
unique, counts = np.unique(y_pred, return_counts=True)
print('y_pred distribution')
print(np.asarray((unique, counts)))

unique, counts = np.unique(y_test, return_counts=True)
print('y_test actual distribution')
print(np.asarray((unique, counts)))

y_pred distribution
[[   0    1]
 [ 301 3699]]
y_test actual distribution
[[   0    1]
 [3762  238]]


In [33]:
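from sklearn.model_selection import cross_val_score
# `fit` must be a scikit-learn estimator here (choices 1-3, where .fit() returns the estimator itself);
# cross_val_score clones it and refits the clone on each of the 10 folds
f1 = cross_val_score(estimator = fit, X = X_train, y = y_train, cv = 10, scoring='f1')
f1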

array([0.07017544, 0.04761905, 0.        , 0.04347826, 0.04      ,
       0.03921569, 0.04545455, 0.        , 0.04545455, 0.04761905])

In [36]:
X_test.shape

(4000, 77)