# Predicting vehicle insurance purchase using Binary Classification and Upsampling

A model that predicts whether a customer would be interested in Vehicle Insurance is valuable to the company: it can plan its communication strategy around the customers most likely to buy, and optimize its business model and revenue accordingly.

We have information about:

- Demographics (gender, age, region code type)
- Vehicles (Vehicle Age, Damage)
- Policy (Premium, sourcing channel), etc.

In [None]:
#Import libraries and functions required for data modeling and manipulation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pycaret

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, MinMaxScaler, RobustScaler
from sklearn.model_selection import train_test_split, KFold, GroupKFold, GridSearchCV, StratifiedKFold

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsRegressor, KernelDensity, KDTree
from sklearn.metrics import roc_auc_score, recall_score, confusion_matrix

from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN, RandomOverSampler, SMOTENC, SVMSMOTE
from imblearn.under_sampling import TomekLinks, NearMiss, RandomUnderSampler

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Load and view the data
The data has been split into training and test sets already.  We'll read in and examine the shape and format of the data sets.

In [None]:
train = pd.read_csv('../input/imbalanced-data-practice/aug_train.csv')
test = pd.read_csv('../input/imbalanced-data-practice/aug_test.csv')

print("Training Data")
print(train.shape)
print(train.head())
print("Testing Data")
print(test.shape)
print(test.head())


The training/testing split is about 83/17. A split closer to 70/30 is more typical, but the sample is large enough that this is probably fine.

### Check for any missing data
Missing data is typically handled through imputation (mean, median, random) or records may be removed if they are deemed immaterial.

In [None]:
print('Train - Missing Data')
print(train.isna().any())
print('Test - Missing Data')
print(test.isna().any())

All variables show a value of "False", so no imputation or record removal is required.
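Had any column contained missing values, a simple imputation could look like the sketch below. The toy frame and its values are made up for illustration, and the column names are assumed to follow the dataset's naming. The median fills the numeric column and the most frequent value fills the categorical one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the competition files
toy = pd.DataFrame({
    "Annual_Premium": [2500.0, np.nan, 4100.0, 3300.0],
    "Vehicle_Age": ["< 1 Year", "1-2 Year", None, "1-2 Year"],
})

# Numeric column: fill gaps with the median of the observed values
premium_median = toy["Annual_Premium"].median()
toy["Annual_Premium"] = toy["Annual_Premium"].fillna(premium_median)

# Categorical column: fill gaps with the most frequent category
age_mode = toy["Vehicle_Age"].mode()[0]
toy["Vehicle_Age"] = toy["Vehicle_Age"].fillna(age_mode)

print(toy)
```

In practice the fill values would be computed on the training set only and then applied to the test set.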

### Check data types

In [None]:
train.dtypes

In [None]:
#Identify categorical variables stored as type 'object' vs. variables that are numeric

categories = [c for c in train.columns if train[c].dtypes =='object']
print('Categories', categories)

numerics = [c for c in train.columns if c not in categories]
print('Numerics', numerics)

In [None]:
#Convert categorical variables to integer codes with LabelEncoder (label encoding, not one-hot "dummy" coding)

for c in categories:
    le=LabelEncoder()
    le.fit(list(train[c].astype('str')) + list(test[c].astype('str')))
    train[c] = le.transform(list(train[c].astype(str))) 
    test[c] = le.transform(list(test[c].astype(str))) 

train.head()
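LabelEncoder assigns each category an integer in sorted order, which implies an ordering that tree models tolerate but other model types may misread. The sketch below contrasts it with one-hot ("dummy") coding via `pd.get_dummies`, using made-up values and an assumed column name.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up example values for illustration
ages = pd.Series(["1-2 Year", "< 1 Year", "> 2 Years", "1-2 Year"], name="Vehicle_Age")

# Label encoding: one integer per category, assigned in sorted order
le = LabelEncoder()
label_codes = le.fit_transform(ages)
print(dict(zip(le.classes_, range(len(le.classes_)))))
print(label_codes)

# One-hot (dummy) coding: one 0/1 column per category
dummies = pd.get_dummies(ages, prefix="Vehicle_Age", dtype=int)
print(dummies)
```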

In [None]:
#Check data types again; data types should all be numeric now

train.dtypes

In [None]:
#Remove the ID field for modeling, store in a variable for later use if needed

train_id = train.pop('id')
test_id = test.pop('id')

In [None]:
#Create a copy of the data set to be used for model evaluation/selection

train2 = train.copy()
train2.head()

In [None]:
#Split the target field from the train data and store in a variable for analysis of predictors/later use

target = train.pop('Response')

#Confirm shape of data matches expectation

train.shape, test.shape

In [None]:
#Generate a correlation matrix to check for collinearity

plt.figure(figsize=(15,12))
sns.heatmap(pd.concat([train,target], axis=1).corr(), annot=True, cmap="coolwarm")

The correlation matrix shows no strong correlations between any of the predictors and the Response variable.

The strongest correlation between predictors is the negative one between Vehicle Damage and Previously Insured. This could support the theory that experiencing vehicle damage prompts a customer to seek coverage.

In [None]:
#Check distribution of Response variable

ax = sns.countplot(x = target)
sns.set(font_scale=1.5)
fig = plt.gcf()
fig.set_size_inches(10,5)
ax.set_ylim(top=500000)
for p in ax.patches:
    ax.annotate('{:.2f}%'.format(100*p.get_height()/len(target)), (p.get_x()+ 0.3, p.get_height()+10000))

plt.title('Distribution of Target')
plt.ylabel('Count')
plt.show()

We can see the Response data is heavily imbalanced, which may adversely affect our model. Upsampling or downsampling is worth exploring.
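With a skewed target, accuracy alone says little. The sketch below uses a made-up 88/12 split (an illustrative ratio, not the actual class ratio of this data) to show that a model which always predicts the majority class scores high accuracy with zero recall.

```python
import numpy as np

# Hypothetical labels: 88 negatives and 12 positives (illustrative ratio only)
y_true = np.array([0] * 88 + [1] * 12)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
true_pos = np.sum((y_pred == 1) & (y_true == 1))
recall = true_pos / np.sum(y_true == 1)

print(f"accuracy = {accuracy:.2f}")  # 0.88
print(f"recall   = {recall:.2f}")    # 0.00
```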

### Use the pycaret package for model selection

In [None]:
#Set up the pycaret classification environment

from pycaret.classification import *
exp_clf101 = setup(data = train2, target = 'Response', session_id=123)

In [None]:
#Use the compare model function to evaluate the performance of several models fit on the training data

best_model = compare_models(exclude = ['xgboost', 'catboost', 'svm', 'qda', 'nb', 'ada', 'gbc', 'lightgbm'])

Random Forest Classifier is the model with the highest Accuracy and AUC among the models compared.

In [None]:
#Use the create model function to train the Random Forest and view its cross-validated scores

rf1 = create_model('rf')

In [None]:
#Plot the ROC curve and report the AUC

plot_model(rf1, plot='auc')

In [None]:
#Plot feature importance to show the level of relevance of each of the variables in the model

plot_model(rf1, plot='feature')

We can see that Vintage is the most important feature in the model, just ahead of Annual Premium. 

### Model the data using a Random Forest Classifier with cross validation
We'll report both the AUC and recall scores for the model.
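StratifiedKFold keeps the class ratio roughly equal in every fold, which matters when positives are rare. A small sketch on made-up labels (10% positives, an illustrative figure) compares it with plain KFold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Made-up labels: 10 positives among 100 rows
y = np.array([1] * 10 + [0] * 90)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

def positive_rates(splitter):
    """Share of positives in each validation fold."""
    return [float(y[val_idx].mean()) for _, val_idx in splitter.split(X, y)]

strat = positive_rates(StratifiedKFold(n_splits=5, shuffle=True, random_state=123))
plain = positive_rates(KFold(n_splits=5, shuffle=True, random_state=123))

print("stratified:", strat)
print("plain     :", plain)
```

With stratification every fold holds the same share of positives. Plain KFold leaves that share to chance, so per-fold recall can be noisier.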

Recall is of particular interest because it represents the share of results that were actually True (e.g., Purchases) that the model also predicted to be True: TP / (TP + FN). The ratio of correct predictions to everything predicted True is precision, a different metric.
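Recall and precision are easy to mix up, so a small sketch from raw confusion-matrix counts may help. The counts are invented for illustration.

```python
def recall(tp: int, fn: int) -> float:
    """Share of actual positives that the model found: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that were correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical counts: 100 actual purchasers, 90 found, plus 210 false alarms
tp, fn, fp = 90, 10, 210

print(f"recall    = {recall(tp, fn):.2f}")
print(f"precision = {precision(tp, fp):.2f}")
```

A model can score high on one of these and low on the other, which is why both are worth checking.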

In [None]:
score_auc = []
score_recall = []
train_rf = np.zeros(len(train))
test_rf = np.zeros(len(test))

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

for fold, (train_ind, val_ind) in enumerate(folds.split(train, target)):
    print('fold:', fold)
    trn_data, val_data = train.iloc[train_ind], train.iloc[val_ind]
    y_train, y_val = target.iloc[train_ind], target.iloc[val_ind]
    
    rf = RandomForestClassifier(n_estimators=150, max_depth=5, criterion='gini', max_features=0.8, n_jobs= -1, random_state=11)
    rf.fit(trn_data, y_train)
    train_rf[val_ind] = rf.predict_proba(val_data)[:, 1]
    y = rf.predict_proba(trn_data)[:, 1]
    
    print('val auc:' , roc_auc_score(y_val, train_rf[val_ind]))
    print('val recall:' , recall_score(y_val, np.where(train_rf[val_ind] > 0.5, 1, 0)))
   
    score_auc.append(roc_auc_score(y_val, train_rf[val_ind]))
    score_recall.append(recall_score(y_val, np.where(train_rf[val_ind] > 0.5, 1, 0)))
                        
    test_rf += rf.predict_proba(test)[:, 1]/folds.n_splits
    
print(' Model auc: --> ', np.mean(score_auc))
print(' Model recall: --> ', np.mean(score_recall))

AUC values look pretty good, but Recall is 0 in every fold. At the 0.5 cutoff the model never predicts the Positive class (Purchases), which is not a good sign. This may be due to the class imbalance issue mentioned earlier.
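One way to see how AUC and recall can disagree: AUC depends only on how the scores rank positives against negatives, while recall depends on where the cutoff sits. The scores below are made up, and every one is under 0.5, as can happen with a shallow forest on imbalanced data. `pairwise_auc` and `recall_at` are hypothetical helper names.

```python
import numpy as np

# Hypothetical predicted probabilities, all below the 0.5 cutoff
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([0.05, 0.08, 0.10, 0.12, 0.15, 0.30, 0.25, 0.40])

def pairwise_auc(y, s):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos, neg = s[y == 1], s[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def recall_at(y, s, threshold):
    """Recall when every score above `threshold` is labelled positive."""
    predicted = s > threshold
    return (predicted & (y == 1)).sum() / (y == 1).sum()

print("AUC:", pairwise_auc(y_true, scores))
print("recall @ 0.5:", recall_at(y_true, scores, 0.5))
print("recall @ 0.2:", recall_at(y_true, scores, 0.2))
```

Here the ranking is nearly perfect, yet no score clears 0.5, so recall at that cutoff is zero. Lowering the cutoff recovers the positives at the cost of some false positives.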

Let's take a look at the confusion matrix...

In [None]:
train_rf_01 = np.where(train_rf > 0.5, 1, 0)

cf_matrix = confusion_matrix(target, train_rf_01)
group_names = ['True Negative','False Positive','False Negative','True Positive']
group_counts = ["{0:0.0f}".format(value) for value in cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cf_matrix.flatten()/np.sum(cf_matrix)]

labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize=(12,8))
sns.set(font_scale=1.6)

sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='coolwarm')

We can now see no predictions were made for Positive (Response = 1) values.  

Next we'll try upsampling to attempt to fix our class imbalance issue and see if our model improves.

### Upsampling w/ SMOTE

In [None]:
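#sampling_strategy='minority' resamples only the minority class; k_neighbors=5 is the imblearn default

smote= SMOTE(sampling_strategy='minority', k_neighbors=5)

#TomekLinks is instantiated here but not used in the loop below

tml = TomekLinks()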
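SMOTE creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority-class neighbours. The function below is a simplified numpy sketch of that idea on made-up points, not imblearn's implementation; `smote_like_samples` and its parameters are hypothetical names.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from minority rows X_min (simplified SMOTE idea)."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances between minority rows
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)  # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))  # position along the connecting segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Made-up 2-D minority points
X_min = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.9]])
synthetic = smote_like_samples(X_min, n_new=6, k=2)
print(synthetic)
```

Because the new rows are synthetic, resampling is applied only to the training portion of each fold, so the validation rows stay real.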

In [None]:
score_auc = []
score_recall = []
train_rf = np.zeros(len(train))
test_rf = np.zeros(len(test))

#Note: plain KFold is used here, whereas the baseline run above used StratifiedKFold
folds = KFold(n_splits=5, shuffle=True, random_state=123)

for fold, (train_ind, val_ind) in enumerate(folds.split(train, target)):
    print('fold:', fold)
    trn_data, val_data = train.iloc[train_ind], train.iloc[val_ind]
    y_train, y_val = target.iloc[train_ind], target.iloc[val_ind]
    
    #Upsample our training data and response variable
    train_upsample, y_upsample = smote.fit_resample(trn_data, y_train)
    
    rf = RandomForestClassifier(n_estimators=150, max_depth=5, criterion='gini', max_features=0.8, n_jobs= -1, random_state=11)
    rf.fit(train_upsample, y_upsample)
    train_rf[val_ind] = rf.predict_proba(val_data)[:, 1]
    y = rf.predict_proba(train_upsample)[:, 1]
    
    print('val auc:' , roc_auc_score(y_val, train_rf[val_ind]))
    print('val recall:' , recall_score(y_val, np.where(train_rf[val_ind] > 0.5, 1, 0)))
   
    score_auc.append(roc_auc_score(y_val, train_rf[val_ind]))
    score_recall.append(recall_score(y_val, np.where(train_rf[val_ind] > 0.5, 1, 0)))
                        
    test_rf += rf.predict_proba(test)[:, 1]/folds.n_splits
    
print(' Model auc: --> ', np.mean(score_auc))
print(' Model recall: --> ', np.mean(score_recall))

Our AUC values have decreased slightly, but Recall is now high (average of 0.90699 across folds).

Let's again look at the confusion matrix...

In [None]:
train_rf_01 = np.where(train_rf > 0.5, 1, 0)

cf_matrix = confusion_matrix(target, train_rf_01)
group_names = ['True Negative','False Positive','False Negative','True Positive']
group_counts = ["{0:0.0f}".format(value) for value in cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cf_matrix.flatten()/np.sum(cf_matrix)]

labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize=(12,8))
sns.set(font_scale=1.6)


sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='coolwarm')

### Conclusion

We should decide which metric matters most for our use case and train/select our model to optimize that metric. In this case, we are particularly interested in Recall, which is our True Positive rate: the share of actual Purchases the model identifies.

After upsampling, the Random Forest achieves high Recall. We can consider additional adjustments/tuning to try to improve the model, such as modifying the train/test data split, alternative upsampling/downsampling approaches, other model types, etc.
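As one example of a downsampling alternative, random undersampling drops majority rows until the classes match. This is a plain numpy sketch on made-up data; `random_undersample` is a hypothetical helper (imblearn's `RandomUnderSampler`, imported at the top, serves the same purpose).

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep all minority rows and an equally sized random subset of majority rows."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Made-up data: 90 negatives, 10 positives
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

X_bal, y_bal = random_undersample(X, y)
print(np.bincount(y_bal))
```

The trade-off is that majority rows are discarded, so this tends to suit large datasets where losing data is affordable.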