# Introduction
This notebook uses the Car Insurance Cold Calls data from Kaggle. I put this work together to practice my machine learning classification skills and to focus on incorporating pipelines into my model building.

First, we will import the relevant libraries and load in the training and testing data. In reality, the testing data has no labels, so this notebook will focus on building a model and testing it on a validation set. In the process, I will also clean up the testing data to make it ready for implementing the model.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, classification_report, recall_score, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
sns.set()

In [None]:
df_test = pd.read_csv("../input/carinsurance/carInsurance_test.csv")
df_train = pd.read_csv("../input/carinsurance/carInsurance_train.csv")

In [None]:
print("The training data has {0} samples and {1} features.".format(df_train.shape[0], df_train.shape[1]-1))
print("The testing data has {0} samples and {1} features.".format(df_test.shape[0], df_test.shape[1]-1))

We can see below that there are a number of features with missing information, namely `Job`, `Education`, `Communication`, and `Outcome`. We can also see that the same features have missing information in the testing set. As I mentioned, all of the labels for the testing set are missing, but we won't focus on those here, as I won't be testing the model on that set.

In [None]:
df_train.isnull().sum()

In [None]:
df_test.isnull().sum()
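Raw counts are easier to judge as a share of the rows, since that is what decides whether a modal fill is reasonable. Here is a minimal sketch on a made-up frame; the `toy` data and its values are assumptions for illustration, not rows from the Kaggle files.

```python
import pandas as pd

# Toy frame standing in for df_train; the values are made up for illustration.
toy = pd.DataFrame({
    "Job": ["admin.", None, "technician", "management"],
    "Outcome": [None, None, "success", None],
    "Age": [32, 45, 29, 51],
})

# isnull().mean() gives the fraction of missing rows in each column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

A column that is mostly missing, like `Outcome` in this toy frame, is probably better served by an explicit "Missing" category than by the mode.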

Now I will write a brief function to visualize the distributions of our numeric variables.

In [None]:
def hist_matrix(data):
    # Plot a histogram for every non-object (numeric) column on a 4x3 grid.
    numeric_cols = [col for col in data if data[col].dtype != "O"]
    fig, ax = plt.subplots(nrows = 4, ncols = 3, figsize = (16,10))
    fig.subplots_adjust(hspace = 0.5)
    # Flattening the axes array avoids tracking row and column counters by hand.
    for axis, col in zip(ax.flatten(), numeric_cols):
        axis.hist(data[col])
        axis.set_title(col)

In [None]:
hist_matrix(df_train)

# Preprocessing and Feature Engineering
Right away we can see that a number of our features have significant outliers. These include `Balance`, `NoOfContacts`, `PrevAttempts`, and `DaysPassed`. To address these and to deal with the missing values in the categorical features, I will write another function that will do the bulk of our preprocessing. This will make it very simple to then apply the same transformations to the testing data as well. Specifically, the function will: (1) drop the ID column, (2) fill the categorical features with the modal value when few values are missing (`Job` and `Education`) and with an explicit "Missing" category otherwise (`Communication` and `Outcome`), (3) replace outliers with a top-coded value (here, the value of the feature at the 99th percentile), and (4) create new features for the duration of a salesperson's phone call and the hour of the day the call was initiated.

In [None]:
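def preprocessing(data):
    # (1) Drop the identifier, which carries no predictive information.
    data = data.drop('Id', axis = 1)
    # (2) Fill the sparsely missing categorical features with their mode.
    data['Education'] = data['Education'].fillna(data['Education'].mode()[0])
    data['Job'] = data['Job'].fillna(data['Job'].mode()[0])
    # (4) Build call duration (in minutes) and call hour from the timestamps.
    for i in ['CallStart', 'CallEnd']:
        data[i] = pd.to_datetime(data[i])
    data['CallDur'] = ((data['CallEnd']-data['CallStart']).dt.seconds)/60
    data['CallHour'] = data['CallStart'].dt.hour
    data = data.drop(['CallStart', 'CallEnd'], axis = 1)
    # (3) Top-code outliers at the 99th percentile of each feature.
    for i in ['Balance', 'NoOfContacts', 'PrevAttempts', 'DaysPassed']:
        val = data[i].quantile(.99).astype(int)
        data.loc[data[i]>val, i] = val
    # Negative DaysPassed values are set to zero.
    data.loc[data['DaysPassed']<0, 'DaysPassed'] = 0
    # Heavily missing categorical features get their own 'Missing' category.
    data['Communication'] = data['Communication'].fillna('Missing')
    data['Outcome'] = data['Outcome'].fillna('Missing')
    return data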
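To see what the top-coding step does in isolation, here is a small sketch on made-up numbers. It uses `Series.clip` rather than the `.loc` assignment in the function above, and it caps at the 75th percentile instead of the 99th only so that the effect is visible on five values; both choices are assumptions for illustration.

```python
import pandas as pd

# Made-up balances with one extreme value; not taken from the dataset.
balance = pd.Series([100, 250, 300, 400, 50_000], name="Balance")

# Cap everything above the chosen percentile at that percentile's value.
cap = balance.quantile(0.75)
top_coded = balance.clip(upper=cap)
print(cap, top_coded.tolist())
```

Only the extreme value is pulled down to the cap; everything at or below it is left alone.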

Now, let's apply the function to our training data and then check the distributions of the variables and the number of missing values.

In [None]:
df_train = preprocessing(df_train)

In [None]:
df_train.isnull().sum()

In [None]:
hist_matrix(df_train)

We can now see that there are no missing values in the data and the distributions of our numeric variables are less extreme.

# Model Building
Now I will split the training data into the targets and features. Then, I will create a pipeline that will (1) scale our numeric features, (2) one-hot-encode our categorical features, and (3) train a model on the data.

In [None]:
y = df_train['CarInsurance'].copy()
x_cols = [col for col in df_train.columns if col != "CarInsurance"]
x = df_train[x_cols].copy()

## Creating the Pipeline
First, I will create one pipeline that will do the preprocessing on our features. This will basically scale the numeric variables using a MinMaxScaler and one-hot encode the categorical variables. Then, I will create the model pipeline that will preprocess the data, split the data, and estimate different types of models using a `for` loop. In this step I will also calculate model metrics for comparison purposes.

In [None]:
numeric_cols = [col for col in x if x[col].dtype != "O"]
numeric_transformer = Pipeline(steps = [(
                        'scaler', MinMaxScaler())])

categorical_cols = [col for col in x if col not in numeric_cols]
categorical_transformer = Pipeline(steps=[(
                            'ohe', OneHotEncoder(drop = 'first'))])

preprocessor = ColumnTransformer(transformers = [
                ('num', numeric_transformer, numeric_cols),
                ('cat', categorical_transformer, categorical_cols)])
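As a quick sanity check on what a `ColumnTransformer` like this produces, here is a sketch on a three-row toy frame. The frame and the name `toy_preprocessor` are hypothetical and only meant to show the output shape.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frame with one numeric and one categorical column (illustrative values).
toy = pd.DataFrame({
    "Age": [20, 40, 60],
    "Marital": ["single", "married", "divorced"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", MinMaxScaler(), ["Age"]),
    ("cat", OneHotEncoder(drop="first"), ["Marital"]),
])

transformed = toy_preprocessor.fit_transform(toy)
# One scaled numeric column plus (3 categories - 1 dropped) dummy columns.
print(transformed.shape)
```

With `drop="first"`, the alphabetically first category becomes the reference level, which matters later when reading the feature importances.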

In [None]:
models = ['Random Forest: ', 'Logistic Regression: ', 'XGBoost: ']
suffixes = ['_rf', '_logit', '_xgb']
classifiers = [RandomForestClassifier(n_estimators = 100, random_state=0), LogisticRegression(solver = 'lbfgs'), XGBClassifier(random_state=0)]
# Split once so that every model is trained and evaluated on the same rows.
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.2, random_state = 0)
roc_results = {}
for name, suffix, model in zip(models, suffixes, classifiers):
    clf = Pipeline(steps = [
            ('preprocessor', preprocessor),
            ('classifier', model)])
    clf.fit(x_train, y_train)
    preds = clf.predict(x_test)
    probs = clf.predict_proba(x_test)[:,1]
    print(name)
    print("Model Accuracy: {}".format(round(accuracy_score(y_test, preds)*100, 2)))
    print(classification_report(y_test, preds))
    print("*****************************************************")
    fpr, tpr, thresholds = roc_curve(y_test, probs)
    roc_results[suffix] = (fpr, tpr, thresholds, roc_auc_score(y_test, probs))
# Unpack the stored results into the names used by the plotting cell.
fpr_rf, tpr_rf, thresholds_rf, auc_rf = roc_results['_rf']
fpr_logit, tpr_logit, thresholds_logit, auc_logit = roc_results['_logit']
fpr_xgb, tpr_xgb, thresholds_xgb, auc_xgb = roc_results['_xgb']

In [None]:
fig, ax = plt.subplots(nrows = 1, ncols = 3, figsize = (18,6))
ax[0].plot(fpr_rf,tpr_rf, label = "Random Forest")
ax[0].plot([0,1], [0,1], label = 'Base Rate')
ax[0].set_ylabel("True Positive Rate")
ax[0].set_title("Random Forest")
ax[0].text(0.3, 0.7, "AUC = {}".format(round(auc_rf, 2)))
ax[1].plot(fpr_logit,tpr_logit, label = "Logistic Regression")
ax[1].plot([0,1], [0,1], label = 'Base Rate')
ax[1].set_xlabel("False Positive Rate")
ax[1].set_title("Logistic Regression")
ax[1].text(0.3, 0.7, "AUC = {}".format(round(auc_logit, 2)))
ax[2].plot(fpr_xgb,tpr_xgb, label = "XGBoost")
ax[2].plot([0,1], [0,1], label = 'Base Rate')
ax[2].set_title("XGBoost")
ax[2].text(0.3, 0.7, "AUC = {}".format(round(auc_xgb, 2)))
fig.suptitle("ROC Graph")
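For readers less familiar with the AUC, here is a four-observation sketch of how it is computed from predicted scores. The labels and scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Two negatives and two positives with made-up predicted probabilities.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
print(auc)
```

The AUC is the share of positive-negative pairs in which the positive case gets the higher score. Here three of the four pairs are ranked correctly, so the value is 0.75.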

It seems here that the Random Forest is the best performing model across all metrics. Now, I will tune some hyperparameters of the model using `GridSearchCV` to see if I can improve it.

In [None]:
clf = Pipeline(steps = [
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(random_state=0))])

param_grid = {'classifier__n_estimators' : [10, 50, 75, 100, 150, 200, 250, 300],
             'classifier__criterion' : ['gini', 'entropy']}

search = GridSearchCV(clf, param_grid, n_jobs = -1, cv = 7)
search.fit(x_train, y_train)
print(search.best_params_)
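Rather than copying the winning values by hand into a new pipeline, one option is to reuse the refit pipeline that `GridSearchCV` keeps in `best_estimator_`. The sketch below uses synthetic data from `make_classification` and a smaller grid so that it runs quickly; the `toy_*` names are hypothetical and the data is not the insurance data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the insurance data.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

toy_clf = Pipeline(steps=[
    ("scaler", MinMaxScaler()),
    ("classifier", RandomForestClassifier(random_state=0)),
])

# The "<step name>__<parameter>" keys route each value to the right pipeline step.
toy_grid = {"classifier__n_estimators": [10, 50],
            "classifier__criterion": ["gini", "entropy"]}

toy_search = GridSearchCV(toy_clf, toy_grid, cv=3)
toy_search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is the full pipeline refit on X_tr.
best_clf = toy_search.best_estimator_
toy_score = best_clf.score(X_te, y_te)
print(toy_search.best_params_, toy_score)
```

This keeps the tuned model in sync with the search if the grid changes later.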

In [None]:
clf = Pipeline(steps = [
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(n_estimators = 200, criterion = 'gini', random_state=0))])
clf.fit(x_train, y_train)
preds = clf.predict(x_test)
print("Tuned Random Forest: ")
print("Model Accuracy: {}".format(round(accuracy_score(y_test, preds),4)*100))
print(classification_report(y_test, preds))

By tuning the number of estimators, we slightly improved the accuracy and F1 scores of the model. There is more tuning that can be done, but I will save that for another time. Now, let's plot the confusion matrix of our model.

In [None]:
cm = confusion_matrix(y_test, preds)
plt.figure(figsize=(8,6))
sns.heatmap(cm, annot=True, fmt = ".0f")
# Row 0 (tick position 0.5) holds actual class 0, i.e. customers who did not buy.
plt.yticks([0.5,1.5], ['Did not Buy', 'Did Buy'])
plt.xticks([1.5,0.5], ['Did Buy', 'Did not Buy'])
plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.title("Confusion Matrix")
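Since it is easy to mix up the axes, here is a small sketch of how `confusion_matrix` lays out its cells. The labels are made up, with 1 standing for a purchase.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = bought, 0 = did not buy.
actual    = [0, 0, 1, 1, 1]
predicted = [0, 1, 1, 1, 0]

# Rows are actual classes and columns are predicted classes, both sorted 0 then 1.
toy_cm = confusion_matrix(actual, predicted)
tn, fp, fn, tp = toy_cm.ravel()
print(toy_cm)
print(tn, fp, fn, tp)
```

The first row therefore holds the customers who did not buy, and the first column holds the "did not buy" predictions.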

Now let's take a look at the feature importances of our model to get an idea of what is predicting sales.

In [None]:
# get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2.
names = {"feature" : numeric_cols + list(clf['preprocessor'].transformers_[1][1]['ohe'].get_feature_names_out(categorical_cols))}
imp = {'importances' : list(clf.steps[1][1].feature_importances_)}
feature_importances = {**names, **imp}
feature_importances_df = pd.DataFrame(feature_importances) 

In [None]:
feature_importances_df = feature_importances_df.sort_values(by = 'importances', ascending = True)
feature_importances_df

In [None]:
plt.figure(figsize = (16,10))
plt.title("Feature Importances from Random Forest Classifier Model")
plt.barh(feature_importances_df['feature'], feature_importances_df['importances'])

We can see that a few features are playing a prominent role in our model. `Outcome_success` indicates whether a previous marketing campaign was successful. The documentation doesn't really explain what this means. Next, an individual's having household insurance is predictive of purchasing car insurance, which is intuitive. We would expect that risk-averse individuals would be more likely to buy both, and riskier individuals to have neither.

Interestingly, our constructed feature `CallDur`, which was the length of the sales call in minutes, strongly influenced the predictive capacity of the model. There may be two reasons for this. First, longer call durations could be correlated with a salesperson's ability, thereby leading to higher sales. Second, individuals who were already likely to buy insurance for other reasons may need to have longer calls in order to fill out documentation, etc. So, this is not necessarily a causal relationship, but the two are correlated nonetheless.

The telephone and cellular communication indicators were also important determinants, but one has to keep in mind that the reference category was 'Missing', so it is not clear how to interpret these importances without figuring out what the missing communication types were.

# Going Forward
In the future, I would like to experiment more with the pipeline here. It could be interesting to do a bit more feature engineering, especially with regard to the functional forms of the continuous features in the model.  One could easily imagine, for instance, that the relationship between purchasing car insurance and age is nonlinear. After all, young people may not buy their own insurance and older people may no longer drive, so we would expect that middle-aged individuals would have a higher propensity to purchase car insurance.
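One simple way to let a model such as logistic regression pick up a nonlinear age effect is to bin age into bands and one-hot encode them. Tree-based models can already split on age at several points, so this would matter mostly for the linear model. The sketch below is only an illustration: the band edges and labels are hypothetical choices, not values derived from the data.

```python
import pandas as pd

# Made-up ages for illustration.
ages = pd.Series([19, 27, 38, 52, 71], name="Age")

# Hypothetical band edges; right=False makes each band include its lower edge.
age_band = pd.cut(
    ages,
    bins=[0, 25, 45, 65, 120],
    labels=["under_25", "25_to_44", "45_to_64", "65_plus"],
    right=False,
).astype(object)  # object dtype so a dtype-based column split treats it as categorical

print(age_band.tolist())
```

Each band would then get its own coefficient after one-hot encoding, so the middle bands are free to have a higher propensity than either end.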
