# Introduction

This dataset comes from a bank in the United States. Besides its usual services, the bank also sells car insurance. It holds data on potential customers, and its employees call them to advertise the available car insurance options. We are given general information about each client (age, job, etc.) as well as more specific information about the current insurance sales campaign (communication type, last contact day) and the previous campaign (attributes such as previous attempts and outcome).

We have data on 4000 customers who were contacted during the last campaign and for whom the result of the campaign (did the customer buy insurance or not) is known.


# Objective

The task is to predict for 1000 customers who are contacted during the current campaign, whether they will buy insurance or not.

## Dataset

The dataset is downloaded from [Kaggle](https://www.kaggle.com/kondla/carinsurance).

## Table of Contents
1. [Loading of data](#Loading-of-data)
2. [Analysis of Data](#Analysis-of-Data)
3. [Preprocessing of data](#Pre-Processing-of-Dataset)
4. [Model designing for Prediction](#Model-designing-for-Prediction)

### Loading of data

In [None]:
# 1.1 Importing libraries
import pandas as pd
import numpy as np
import os
import datetime
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
import plotly.express as px
from sklearn.manifold import TSNE
import xgboost as xgb
from sklearn.model_selection import train_test_split
import optuna
from sklearn.metrics import accuracy_score

In [None]:
# 1.2 Display multiple outputs from a jupyter cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
# 1.4 Load train and test data
train = pd.read_csv("../input/carinsurance/carInsurance_train.csv")
test = pd.read_csv("../input/carinsurance/carInsurance_test.csv")

In [None]:
# 2.1 Analyse features of train dataset
train.head(5) # show first 5 rows of train dataset
train.dtypes  # Find the columns datatype
train.shape # shape of training dataset
train.CarInsurance.value_counts() # check target column count

In [None]:
# 2.2 Downcast integer columns from int64 to the smallest integer type that holds their values
int_cols = train.select_dtypes(include='int64').columns.tolist()
train[int_cols].min() # get minimum value of int columns
train[int_cols].max() # get maximum value of int columns
# pd.to_numeric picks int8/int16/int32 per column, so a column outside the int16 range is not truncated
train[int_cols] = train[int_cols].apply(pd.to_numeric, downcast='integer')
train.dtypes
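
Whether a fixed cast such as `int16` is safe depends on the column ranges printed above. The sketch below, on invented values, compares each column's range with the `int16` limits and lets `pd.to_numeric` choose the narrowest type that fits. The column names `small` and `big` are placeholders, not columns of the dataset.

```python
import numpy as np
import pandas as pd

# Invented values; "big" deliberately exceeds the int16 range
df = pd.DataFrame({"small": [1, 20, 95], "big": [-3000, 500, 98000]}, dtype="int64")

limits = np.iinfo(np.int16)  # min = -32768, max = 32767
fits_int16 = (df.min() >= limits.min) & (df.max() <= limits.max)
print(fits_int16)

# downcast='integer' picks the smallest signed integer dtype per column
downcast = df.apply(pd.to_numeric, downcast="integer")
print(downcast.dtypes)
```

Checking the range first makes it visible which columns would not survive a narrower cast.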


In [None]:
# 2.3 check null values in dataset
train.isnull().sum() 

In [None]:
# 2.4 Generate a new feature: call duration in seconds
train['CallDuration'] = (pd.to_datetime(train.CallEnd,format='%H:%M:%S')-
                         pd.to_datetime(train.CallStart,format='%H:%M:%S'))/np.timedelta64(1,'s')
test['CallDuration'] = (pd.to_datetime(test.CallEnd,format='%H:%M:%S')-
                        pd.to_datetime(test.CallStart,format='%H:%M:%S'))/np.timedelta64(1,'s')

### Analysis of Data

In [None]:
# 3.1 Is there any relation between customer Age and target column carInsurance
px.histogram(data_frame=train,x='Age',nbins=20,marginal='box',color='CarInsurance')

**The histogram above suggests that customer age does not make much difference to whether car insurance is purchased.**

In [None]:
# 3.2 Is there any relation between customer Job nature and target column carInsurance
px.histogram(train, x="Job",facet_row='CarInsurance')

**Customers in blue-collar, services, and entrepreneur jobs generally do not buy car insurance.**

In [None]:
# 3.3 Is there any relationship between marital status of customer and car insurance
px.histogram(train, x="Marital",facet_col='CarInsurance')

**Married customers were generally more inclined to take car insurance than single or divorced customers.**

In [None]:
# 3.4 Is there any relationship between Account Balance and car insurance
px.histogram(train,x='Balance',color = 'CarInsurance',range_x=[0,10000])

**Customers with a low account balance generally prefer to have car insurance.**

In [None]:
# 3.5 Is there any relationship between customers having household insurance and car insurance
train.HHInsurance.value_counts()
fig = px.histogram(train,x='HHInsurance',color = 'CarInsurance')
fig.update_layout(xaxis_type='category')

**Customers with household insurance (`HHInsurance`) also preferred car insurance.**

In [None]:
# 3.5 Is there any relationship between month when customer was contacted and car insurance
train.LastContactMonth.value_counts()
px.histogram(train,x='LastContactMonth',facet_row = 'CarInsurance')

**Most of the customers whose last contact was in May purchased insurance during the current campaign.**

In [None]:
# 3.6 Is there relationship between customer education and Car insurance
train.Education.value_counts()
px.histogram(train,x='Education',color = 'CarInsurance')

In [None]:
# 3.7 Is there a relationship between call duration and Car insurance
px.box(train, x="CarInsurance", y="CallDuration",notched=True)

**The box plot above indicates that calls with customers who purchased car insurance lasted longer than calls with customers who did not.**

In [None]:
# 3.8 Is there a relationship between last campaign outcome and current campaign outcome
px.histogram(train,x='Outcome',facet_col = 'CarInsurance')

**During the current campaign the bank succeeded in convincing the customers who had taken insurance in the last campaign; however, most customers who had not purchased insurance during the last campaign kept to their decision.**

In [None]:
# 3.9 Is there a relationship between no. of previous attempts made,Days Passed since last attempt and car insurance

px.histogram(train[train.PrevAttempts>0],x='PrevAttempts',nbins=20,facet_col='CarInsurance')
px.scatter(train[train.PrevAttempts>0],x='PrevAttempts',y='DaysPassed',facet_col='CarInsurance',range_x=[1,20],range_y=[0,900])

**1. Calling previously contacted customers during the current campaign has paid off, as these customers purchased car insurance.**


**2. Among the customers who purchased insurance, the most interest was shown by those who had received 1-5 previous attempts; as the number of days passed between the last attempt and the current one increases, customers showed less interest.**
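
The visual comparisons in this section can also be expressed as a purchase rate per category. A minimal sketch on invented rows follows; the column names `Marital` and `CarInsurance` are reused from the dataset, but the values are made up and say nothing about the real data.

```python
import pandas as pd

# Invented rows, only to show the computation
toy = pd.DataFrame({
    "Marital": ["married", "married", "single", "single", "divorced", "married"],
    "CarInsurance": [0, 1, 1, 1, 0, 0],
})

# The mean of a 0/1 target is the share of customers who bought insurance
rate = toy.groupby("Marital")["CarInsurance"].agg(["mean", "count"])
print(rate)
```

Reading rates next to group sizes helps avoid confusing a large group with a high purchase rate.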

### Pre-Processing of Dataset

In [None]:
# 4.1 Drop the Id column and the raw call timestamps (already summarised by CallDuration)
drop_cols = ['Id','CallStart','CallEnd']
train.drop(columns=drop_cols,inplace=True)
test.drop(columns=drop_cols,inplace=True)

In [None]:
# 4.2 pop target column from train and test
y_train = train.pop('CarInsurance')
y_test = test.pop('CarInsurance')

In [None]:
# 4.3 show column names from train and test
train.columns
test.columns

In [None]:
# 4.4 Separate the categorical and numerical columns
#cols_name = ['Age', 'Job', 'Marital', 'Education', 'Default', 'Balance','HHInsurance', 'CarLoan', 'Communication', 
#             'LastContactDay','LastContactMonth', 'NoOfContacts', 'DaysPassed', 'PrevAttempts','Outcome', 'CallDuration']
num_cols = ['Age','Balance','NoOfContacts','DaysPassed','PrevAttempts','CallDuration']
cat_cols = ['Job','Marital','Education','Default','HHInsurance','CarLoan','Communication','LastContactDay','LastContactMonth',
            'Outcome']
cat_cols_const = ['Job','Outcome']
cat_cols_mf =['Education','Communication']
cat_col_rem = ['Marital','Default','HHInsurance','CarLoan','LastContactDay','LastContactMonth']

In [None]:
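# 4.5 Create pipelines for transforming the categorical and numerical features
# handle_unknown='ignore' encodes categories not seen during fitting as all zeros instead of raising an error
pipeline1 = Pipeline(
                     [
                         ('si_mf',SimpleImputer(strategy='most_frequent')),
                         ('ohe1',OneHotEncoder(handle_unknown='ignore'))
                     ]
                     )

pipeline2 = Pipeline(
                     [
                         ('si_const',SimpleImputer(strategy='constant',fill_value='Unknown')),
                         ('ohe2',OneHotEncoder(handle_unknown='ignore'))
                     ]
                     )

# sparse_threshold=0 makes the ColumnTransformer always return a dense array
ct = ColumnTransformer(
                     [
                         ('pipe1',pipeline1,cat_cols_mf),
                         ('pipe2',pipeline2,cat_cols_const),
                         ('ohe',OneHotEncoder(handle_unknown='ignore'),cat_col_rem),
                         ('ss',StandardScaler(),num_cols)
                     ],
                     sparse_threshold=0
                     )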


In [None]:
# 4.6 Fit the transformer on train, then apply the same fitted transformer to test
X_train = ct.fit_transform(train)
X_test = ct.transform(test)
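
A common convention is to learn encoder categories and scaling statistics from the training data only and then reuse them on the test data, so both matrices share the same columns. The sketch below illustrates this on invented data with a single `OneHotEncoder`; the frames `train_toy` and `test_toy` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Invented data: "dec" appears only in the test frame
train_toy = pd.DataFrame({"Month": ["may", "jun", "may"]})
test_toy = pd.DataFrame({"Month": ["jun", "dec"]})

enc = OneHotEncoder(handle_unknown="ignore")
train_enc = enc.fit_transform(train_toy).toarray()  # categories are learned here
test_enc = enc.transform(test_toy).toarray()        # the same columns are reused

print(enc.categories_)
print(test_enc)
```

Fitting a second time on the test frame would instead produce columns for `dec` and `jun`, which no longer line up with the columns the model was trained on.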

In [None]:
X_train

In [None]:
# 4.7 Use t-SNE to check the structure in the train dataset
X_tsne = TSNE(n_components=2).fit_transform(X_train)

In [None]:
# 4.8 Convert the t-SNE output to a dataframe
df = pd.DataFrame(X_tsne,columns=['X','Y'])

In [None]:
# 4.9 Use a scatter plot to show the data structure found by t-SNE
px.scatter(df,x='X',y='Y',color=y_train)

**The plot above shows that the dataset has structure: most of the blue points lie on one side, while the yellow points lie on the other.**

### Model designing for Prediction

We will use XGBoost as the prediction model and Optuna for hyperparameter optimisation.

In [None]:
# 5.1 Define the objective function for Optuna
def objective(trial):
    # Hold out 25% of the training data to score each trial
    train_x, test_x, train_y, test_y = train_test_split(X_train, y_train, test_size=0.25)
    dtrain = xgb.DMatrix(train_x, label=train_y)
    dtest = xgb.DMatrix(test_x, label=test_y)

    param = {
        "verbosity": 0,
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "booster": trial.suggest_categorical("booster", ["gbtree", "gblinear", "dart"]),
        "lambda": trial.suggest_float("lambda", 1e-8, 1.0, log=True),
        "alpha": trial.suggest_float("alpha", 1e-8, 1.0, log=True),
    }

    # Tree-specific parameters
    if param["booster"] == "gbtree" or param["booster"] == "dart":
        param["max_depth"] = trial.suggest_int("max_depth", 1, 9)
        param["eta"] = trial.suggest_float("eta", 1e-8, 1.0, log=True)
        param["gamma"] = trial.suggest_float("gamma", 1e-8, 1.0, log=True)
        param["grow_policy"] = trial.suggest_categorical("grow_policy", ["depthwise", "lossguide"])
    # Dropout parameters, used by the dart booster only
    if param["booster"] == "dart":
        param["sample_type"] = trial.suggest_categorical("sample_type", ["uniform", "weighted"])
        param["normalize_type"] = trial.suggest_categorical("normalize_type", ["tree", "forest"])
        param["rate_drop"] = trial.suggest_float("rate_drop", 1e-8, 1.0, log=True)
        param["skip_drop"] = trial.suggest_float("skip_drop", 1e-8, 1.0, log=True)

    # Add a callback for pruning on the validation AUC.
    pruning_callback = optuna.integration.XGBoostPruningCallback(trial, "validation-auc")
    bst = xgb.train(param, dtrain, evals=[(dtest, "validation")], callbacks=[pruning_callback])
    # Keep the trained booster so the best one can be retrieved after the study
    trial.set_user_attr(key="best_booster", value=bst)
    preds = bst.predict(dtest)
    pred_labels = np.rint(preds)
    accuracy = accuracy_score(test_y, pred_labels)
    return accuracy


In [None]:
# 5.2 Define a callback that keeps the booster of the best trial so far
def callback(study, trial):
    if study.best_trial.number == trial.number:
        study.set_user_attr(key="best_booster", value=trial.user_attrs["best_booster"])

In [None]:
# 5.3 Instantiate a study object
study = optuna.create_study(direction='maximize')

In [None]:
# 5.4 Begin optimization process
study.optimize(
                objective,      
                n_trials=100,
                callbacks=[callback]
               )

In [None]:
# 5.5 Show the trial results in dataframe format
study.trials_dataframe().head(5)

In [None]:
# 5.6 Which parameter combination is best?
trial = study.best_trial
# 5.7 Print its validation accuracy and parameters
print('Accuracy: {}'.format(trial.value))
trial.params

In [None]:
# 5.8 Find the best model 
best_model=study.user_attrs["best_booster"]

In [None]:
# 5.9 Predict the target for the provided test set
input_test = xgb.DMatrix(X_test)
final_output=best_model.predict(input_test)
pred_labels = np.rint(final_output)
pred_labels
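
`np.rint` turns the predicted probabilities into 0/1 labels by rounding to the nearest integer. The small sketch below, on invented probabilities, compares that with an explicit threshold; the threshold variant is an alternative shown for illustration and is not used in the notebook.

```python
import numpy as np

probs = np.array([0.2, 0.5, 0.7])  # invented probabilities

# np.rint rounds halves to the nearest even integer, so 0.5 becomes 0
rounded = np.rint(probs).astype(int)
# An explicit cut-off makes the treatment of exactly 0.5 visible (here: class 1)
thresholded = (probs >= 0.5).astype(int)

print(rounded, thresholded)
```

The two variants differ only for probabilities of exactly 0.5, but the explicit form makes it easy to try a different cut-off.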