# <center>Health Insurance Cross Sell Prediction</center>

In [None]:
from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.utils import class_weight
from tensorflow.keras import models, layers, activations, callbacks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.figure_factory as ff
sns.set()

%matplotlib inline

## Section 1: Business Understanding

An insurance company that has provided Health Insurance Products to its customers wants to expand its business to Vehicle Insurance. To plan its marketing strategy, it needs a model that predicts whether customers from the past year are interested in its new Vehicle Insurance Products. First, we will analyze how some variables relate to customers' interest. Then, we will build a machine learning model to classify whether a customer is interested. Customers predicted to be interested in this product will become the target market and receive ads promoting it.

### Question 1: How do 'Previously Insured' and 'Vehicle Damage' indicators correlate with customers' interest in this new vehicle insurance?

### Question 2: Are older customers more interested in vehicle insurance than newer customers?

### Question 3: How well can we predict customers' interests based on customer data?

- 'Previously Insured' -> whether the customer already has Vehicle Insurance from another company
- 'Vehicle Damage' -> whether the customer's vehicle has been damaged in the past

## Section 2: Data Understanding

### Gather

In [None]:
df = pd.read_csv('../input/health-insurance-cross-sell-prediction/train.csv')
df.head()

### Explore

In [None]:
# Set id as index
df.set_index('id', inplace=True)
df.head()

In [None]:
# Check for missing values
df.isna().sum()

No missing values in the dataset.

In [None]:
df.describe()

Several categorical variables, such as `Region_Code` and `Policy_Sales_Channel`, appear in this numeric summary because they are encoded as numbers. They are stored with a float dtype, which we will fix later.
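
As a minimal sketch of that fix, the float codes can be cast to integers and then to the `category` dtype. The toy frame and its values below are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({'Region_Code': [28.0, 3.0, 28.0, 11.0],
                    'Policy_Sales_Channel': [26.0, 152.0, 26.0, 124.0]})

# Cast the float codes to int first, then to 'category', so pandas
# treats them as labels rather than as magnitudes
toy_cat = toy.astype('int').astype('category')

print(toy_cat.dtypes)
print(toy_cat['Region_Code'].cat.categories.tolist())
```

Once a column is categorical, `describe()` reports counts and the most frequent label instead of a mean, which better matches what the variable means.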

In [None]:
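# Correlation is only defined for numeric columns, so select them explicitly
df.select_dtypes('number').corr()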

In [None]:
# Check gender distribution
df['Gender'].value_counts()

The number of males is slightly higher than females

In [None]:
# Check distribution of customer's age
# by using histogram
df['Age'].hist(bins=np.arange(20, 90, 5))  # adjust bin width to 5
plt.title("Customers' Age Distribution", size=16)
plt.xlabel('Age (years)')
plt.ylabel('Total');

Dominated by young (20-30 years) and middle-aged (40-50 years) customers (bimodal distribution)

In [None]:
# Check the proportion of customers interested based on driving license ownership
df.groupby('Driving_License')['Response'].mean()

Only 0.3% of customers don't have a driving license. However, customers without a driving license tend not to be interested in vehicle insurance, so we will keep this column for prediction.
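
Both quantities mentioned here (how large each group is, and how interested it is) can be read from one table. The sketch below uses a small made-up frame, not the real data, so its numbers are illustrative only.

```python
import pandas as pd

# Toy data: 1 = has a driving license, Response 1 = interested
toy = pd.DataFrame({'Driving_License': [1, 1, 1, 1, 0, 1, 1, 0],
                    'Response':        [0, 1, 0, 0, 0, 1, 0, 0]})

# Group size and mean response per license status
summary = toy.groupby('Driving_License')['Response'].agg(['size', 'mean'])
# Share of all customers that falls into each group
summary['share'] = summary['size'] / len(toy)

print(summary)
```

A very small `share` combined with a clearly different `mean` is the situation described above: the group is rare, but the column still carries signal.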

In [None]:
df['Region_Code'].nunique()  # There are 53 unique values of region_code

In [None]:
# Group Response by Region_Code, aggregated by mean
df.groupby('Region_Code')['Response'].mean().sort_values(ascending=False)

The proportion of interested customers differs noticeably across `Region_Code` values. Because this variable is categorical, we will encode it with dummies in the next section.
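
A minimal sketch of dummy encoding with scikit-learn's `OneHotEncoder`, assuming the encoder is fitted on training data and later applied to data that may contain unseen codes. The toy frames and the code `99` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Region_Code': [28, 3, 28, 11]})
test = pd.DataFrame({'Region_Code': [3, 99]})  # 99 never appears in train

# handle_unknown='ignore' encodes unseen categories as an all-zero row
# instead of raising an error at transform time
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

encoded = enc.transform(test).toarray()
print(enc.categories_)
print(encoded)
```

Fitting once and reusing the same encoder keeps the dummy columns in the same order for the training and test data.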

In [None]:
df['Previously_Insured'].value_counts()

Number of customers not having vehicle insurance is higher, presenting an opportunity for the company.

In [None]:
# Most customers' vehicle age is less than 2 years
df['Vehicle_Age'].value_counts()

In [None]:
df['Vehicle_Damage'].value_counts()  

There is no significant difference between the number of customers who have experienced vehicle damage and the number who have not.

In [None]:
vehicle_age_damage = df[['Vehicle_Age',
                         'Vehicle_Damage']].value_counts().unstack()  # Count values
vehicle_age_damage

In [None]:
vehicle_age_damage = vehicle_age_damage.iloc[[1, 0, 2],
                                             [1, 0]]  # Reorder rows and columns
vehicle_age_damage

In [None]:
vehicle_age_damage.eval('yes_prop = Yes / (Yes+No)',
                        inplace=True)  # Calculate vehicle damage proportion
vehicle_age_damage

In [None]:
vehicle_age_damage['yes_prop'].plot(kind='bar')
plt.xticks(rotation=30)
plt.xlabel('Vehicle Age')
plt.title('Proportion of Customers That Have Experienced Vehicle Damage', size=16);

There is a positive association between vehicle age and vehicle damage experience: the older the vehicle, the more likely it is to have been damaged.
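
The same proportion can be computed in one step with `pd.crosstab` and row normalization. This is a sketch on made-up rows, not a replacement for the cells above; the labels mirror those in the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    'Vehicle_Age': ['< 1 Year', '< 1 Year', '1-2 Year',
                    '1-2 Year', '> 2 Years', '> 2 Years'],
    'Vehicle_Damage': ['No', 'No', 'Yes', 'No', 'Yes', 'Yes']})

# normalize='index' turns counts into row-wise proportions
order = ['< 1 Year', '1-2 Year', '> 2 Years']
damage_prop = (pd.crosstab(toy['Vehicle_Age'], toy['Vehicle_Damage'],
                           normalize='index')
               .loc[order, 'Yes'])

print(damage_prop)
```

Reordering with `.loc[order]` puts the age groups in their natural order rather than the alphabetical order pandas uses by default.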

In [None]:
# Calculate the proportion of customer interested grouped by Vehicle_Damage
df.groupby('Vehicle_Damage')['Response'].mean()

In [None]:
# Calculate the proportion of customer interested grouped by Previously_Insured
df.groupby('Previously_Insured')['Response'].mean()

In [None]:
# Create Pivot table with Previously_Insured and Vehicle_Damage as grouping variables
insured__vehicle_damage = df.pivot_table('Response',
                                         index='Previously_Insured',
                                         columns='Vehicle_Damage')
insured__vehicle_damage.rename(index={0: 'No', 1: 'Yes'}, inplace=True)
sns.heatmap(insured__vehicle_damage, annot=True, fmt='.4f')
plt.title('Proportion of Customers Interested', size=16);

From the 3 cells above, customers who do not have vehicle insurance but have experienced vehicle damage are the most likely to be interested in the vehicle insurance product.

In [None]:
df['Policy_Sales_Channel'].nunique()  # Count unique values in the column

In [None]:
sales_channel_count = df['Policy_Sales_Channel'].value_counts()
sales_channel_count

In [None]:
(sales_channel_count <= 100).sum()

There are 155 unique policy sales channels listed, 93 of which appear 100 times or fewer. In the next section we will group these channels into a single group called `Others`.
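
One way to sketch this grouping is to learn the set of rare channels from the training data and reuse that set on any other data, so both are grouped identically. The helper names `fit_rare_labels` and `group_rare` are hypothetical, and the threshold of 1 only suits the toy series.

```python
import pandas as pd

def fit_rare_labels(series, max_count):
    """Return the set of labels that appear at most `max_count` times."""
    counts = series.value_counts()
    return set(counts[counts <= max_count].index)

def group_rare(series, rare_labels, other='Others'):
    """Replace rare labels with `other`; keep the rest as strings."""
    return series.map(lambda x: other if x in rare_labels else str(x))

train_channel = pd.Series([26.0, 26.0, 26.0, 152.0, 152.0, 7.0])
test_channel = pd.Series([26.0, 7.0, 152.0])

rare = fit_rare_labels(train_channel, max_count=1)  # learned on train only
train_grouped = group_rare(train_channel, rare)
test_grouped = group_rare(test_channel, rare)

print(train_grouped.tolist())
print(test_grouped.tolist())
```

Learning the rare set once avoids a channel being labelled `Others` in one dataset and kept under its own code in another.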

In [None]:
# Check the distribution of vintage (number of days customer has been associated with the company)
df['Vintage'].hist()
plt.title('Distribution of Vintage', size=16)
plt.xlabel('Vintage (days)');

Distribution of vintage is approximately uniform

In [None]:
def plot_vintage_mean_response(df=df, bins=10):
    '''
    Plot the proportion of interested customers in each Vintage bin

    INPUT:
    df - pandas dataframe with 'Vintage' and 'Response' columns
    bins - number of equal-width bins

    OUTPUT:
    None (draws the histogram on the current axes)
    '''

    xmin, xmax = df['Vintage'].min(), df['Vintage'].max()

    # Count all customers and interested customers per bin with shared edges
    total_counts, bin_edges = np.histogram(df['Vintage'],
                                           bins=bins, range=(xmin, xmax))
    yes_counts, _ = np.histogram(df.loc[df['Response'] == 1, 'Vintage'],
                                 bins=bin_edges)

    # One sample per bin (its left edge), weighted by the bin's proportion
    plt.hist(bin_edges[:-1], bins=bin_edges, weights=yes_counts/total_counts)
    plt.title('Proportion of Customers Interested', size=16)
    plt.xlabel('Vintage (days)')

    return None


plot_vintage_mean_response()

In [None]:
plot_vintage_mean_response(bins=30)

The proportion of interested customers is approximately equal in every bin. Furthermore, the absolute value of the correlation between response and vintage is the lowest among all variables. Hence, vintage is not a good predictor of response and we should consider dropping it.
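
Both checks can also be done without plotting. The sketch below uses synthetic data in which the response is generated independently of vintage (an assumption made for illustration), so the per-bin rates come out flat and the correlation is near zero.

```python
import numpy as np
import pandas as pd

# Synthetic data: Response is drawn independently of Vintage
rng = np.random.default_rng(0)
toy = pd.DataFrame({'Vintage': rng.integers(10, 300, size=5000),
                    'Response': rng.binomial(1, 0.12, size=5000)})

# Mean response in 10 equal-width Vintage bins
vintage_bin = pd.cut(toy['Vintage'], bins=10)
rate_per_bin = toy.groupby(vintage_bin, observed=False)['Response'].mean()

# Pearson correlation between Vintage and Response
corr = toy['Vintage'].corr(toy['Response'])

print(rate_per_bin)
print(corr)
```

On the real data, a similarly flat table and a correlation close to zero would support dropping the column.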

In [None]:
# The distribution of Annual Premium is right-skewed
df['Annual_Premium'].hist(bins = 60)
plt.title('Distribution of Annual Premium', size=16)
plt.xlabel('Annual Premium');

In [None]:
df['Annual_Premium'].plot(kind = 'hist', range = (0, 150000), bins = 40)
plt.title('Distribution of Annual Premium', size=16)
plt.xlabel('Annual Premium');

There seem to be outliers. Let's check them.

In [None]:
df.loc[df['Annual_Premium'] <= 10000, 'Annual_Premium'].value_counts()

In [None]:
# Check the data where annual premium = 2630
outliers_df = df[df['Annual_Premium'] == 2630]
outliers_df.head()

In [None]:
outliers_df.describe()

There is no identifiable pattern that tells us whether 2630 is an encoding for a missing value, an outlier, or a true value. However, we will treat it as a missing value because 2630 rupees (about $36) is quite unlikely to be a true annual premium. Missing values will be imputed with the column mean.
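
A minimal sketch of that treatment on a made-up series: the sentinel value is replaced with `NaN` first, so it does not pull the mean down, and the gaps are then filled with the mean of the remaining values.

```python
import numpy as np
import pandas as pd

# Toy premiums; 2630 plays the role of the suspicious value
premium = pd.Series([2630.0, 30000.0, 40000.0, 2630.0, 50000.0])

# Step 1: mark the sentinel as missing
cleaned = premium.replace(2630.0, np.nan)

# Step 2: impute with the mean of the non-missing values
imputed = cleaned.fillna(cleaned.mean())

print(imputed.tolist())
```

The order matters: computing the mean before replacing the sentinel would include the 2630 entries in the average.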

## Section 3: Data Preparation

In [None]:
# Split the dataset into predictor and response variable
X = df.drop('Response', axis=1)  # Predictor
y = df['Response']  # Response

In [None]:
# Initialize OneHot Encoder to handle categorical variables
onehot_enc = OneHotEncoder(handle_unknown='ignore')

In [None]:
def preprocess_data(df=X, onehot_enc=onehot_enc, fit=True):
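    '''
    Preprocess data according to the analysis in the previous section and
    encode the categorical variables as dummies

    INPUT:
    df - dataframe to be processed
    onehot_enc - OneHotEncoder used to encode the categorical variables
    fit - if True, fit onehot_enc on df before transforming; pass False to
          reuse an encoder that was already fitted on the training data

    Note: the rare-channel grouping and the imputation mean are computed from
    the dataframe passed in, not carried over from the training data

    OUTPUT:
    Numpy array containing processed data
    '''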

    df = df.drop('Vintage', axis=1)  # Drop Vintage column
    
    # Change Annual Premium with value 2630 to NaN
    df.loc[df['Annual_Premium'] == 2630, 'Annual_Premium'] = np.nan
    
    # Group Policy Sales Channels that appear less than equal 100 times
    psc_count = df['Policy_Sales_Channel'].value_counts()
    low_psc_index = psc_count[psc_count <= 100].index
    df['Policy_Sales_Channel'] = df['Policy_Sales_Channel'].apply(
        lambda x: 'Others' if x in low_psc_index else str(x))
    
    df['Region_Code'] = df['Region_Code'].astype('int')  # Change dtype to int
    
    # Isolate categorical features except Driving_License and Previously_Insured
    # (already encoded as dummies)
    df_need_dummies = df[['Gender', 'Region_Code', 'Vehicle_Age',
                          'Vehicle_Damage', 'Policy_Sales_Channel']].astype('category')
    if fit:
        onehot_enc.fit(df_need_dummies)
    
    array_dummies = onehot_enc.transform(df_need_dummies).toarray()
        
    df_no_dummies = df[['Age', 'Driving_License',
                        'Previously_Insured', 'Annual_Premium']]
    # fill missing values (NaN) with column mean
    array_no_dummies = df_no_dummies.apply(lambda col: col.fillna(col.mean())).values

    # Concat df_no_dummies and df_dummies along the columns
    processed = np.concatenate([array_no_dummies, array_dummies], axis=1)
    
    return processed


X_processed = preprocess_data(X)
X_processed

In [None]:
# Create processed data features
# (get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out)
features = ['Age', 'Driving_License',
            'Previously_Insured', 'Annual_Premium',
            *onehot_enc.get_feature_names_out(['Gender', 'Region_Code', 'Vehicle_Age',
                                               'Vehicle_Damage', 'Policy_Sales_Channel'])]
features

In [None]:
# Split the data into train data and validation data
# Use stratify argument to split the data proportionately based on response variable
X_train, X_val, y_train, y_val = train_test_split(X_processed, y, test_size=0.3,
                                                  random_state=40, stratify=y)

## Section 4: Data Modelling

We will try 3 different machine learning models (logistic regression, random forest, and a neural network) to classify whether a customer is interested in the new vehicle insurance. The metrics used to score the models are AUC-ROC (area under the ROC curve), precision, and recall.

### Logistic Regression

In [None]:
# Initiate and fit the model to the train data
log_reg = Pipeline([('scaler', StandardScaler()),  # Standardize the features
                    ('logistic_reg', LogisticRegression(n_jobs=-1,
                                                        class_weight='balanced'))])
# Use class weight 'balanced' because there are significantly more customers that are not interested
log_reg.fit(X_train, y_train)

In [None]:
# Predict the probability customer in validation data is interested
y_pred_proba = log_reg.predict_proba(X_val)[:, 1]
y_pred_proba

In [None]:
# Calculate the predicted response and show the classification report
y_pred = log_reg.predict(X_val)
print(classification_report(y_val, y_pred))

In [None]:
# Create the confusion matrix
conf_matrix = confusion_matrix(y_val, y_pred)
(tn, fp, fn, tp) = conf_matrix.ravel()
log_reg_precision = tp / (tp+fp)
log_reg_recall = tp / (tp+fn)
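# Note: y_pred holds hard 0/1 labels, so this score equals the balanced accuracy;
# pass y_pred_proba instead for the threshold-free AUC-ROC
log_reg_auc_roc = roc_auc_score(y_val, y_pred)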
print('Precision: ', log_reg_precision)
print('Recall: ', log_reg_recall)
print('AUC-ROC score: ', log_reg_auc_roc)

In [None]:
#Plot the confusion matrix
labels = ('Not Interested', 'Interested')

plt.figure(figsize = (8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d',
            xticklabels=labels, yticklabels=labels, 
            annot_kws={'size': 14})
plt.yticks(rotation=0)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix', size=16);

With an AUC-ROC score of about 0.8, the logistic regression model separates the two classes reasonably well. Note that the score above is computed from hard class predictions rather than predicted probabilities, so it equals the balanced accuracy (the average of the recall on each class). The model is particularly strong at finding interested customers, as shown by the 94% recall: the target market will be wide enough to cover most of them. The precision is only 28% (28% of the customers who receive ads are interested), but this is acceptable because advertising to random customers would reach only about 12% precision, the proportion of interested customers in the dataset.
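
The cells above pass the 0/1 predictions (`y_pred`) to `roc_auc_score`. The sketch below, on a small made-up set of labels and scores, shows how that differs from scoring the predicted probabilities: with hard labels the ROC curve has a single interior point, and the area reduces to the balanced accuracy.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Toy labels and predicted probabilities (illustrative values only)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.45, 0.7, 0.2])
y_hard = (y_score >= 0.5).astype(int)  # thresholded class predictions

auc_from_scores = roc_auc_score(y_true, y_score)  # uses the full ranking
auc_from_labels = roc_auc_score(y_true, y_hard)   # one threshold only
balanced_acc = balanced_accuracy_score(y_true, y_hard)

print(auc_from_scores, auc_from_labels, balanced_acc)
```

In this notebook the threshold-free version would correspond to `roc_auc_score(y_val, y_pred_proba)`; its value is not reported here because the cells above do not compute it.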

In [None]:
# Create Regression Coefficient Data Frame
reg_coef = log_reg['logistic_reg'].coef_[0]
reg_coef_df = pd.DataFrame({'abs_coef': np.abs(reg_coef),
                            'sign_coef': np.sign(reg_coef)},
                           index=features)

# Sort Data Frame based on absolute value of coefficient
# We can compare the coefficients between variables because the features have been standardized
feature_importance = reg_coef_df.sort_values('abs_coef',
                                             ascending=False).head(20)
feature_importance

As the analysis in the previous section suggested, the `Previously_Insured` and `Vehicle_Damage` indicators are indeed the best predictors of customers' interests.
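
Comparing coefficient magnitudes only makes sense when the features share a scale. The sketch below illustrates this on synthetic data with two hypothetical features, `signal` and `noise`: after standardization, the feature that actually drives the response ranks first even though the other one has much larger raw values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(0, 1, n)     # drives the response, small raw scale
noise = rng.normal(0, 1000, n)   # unrelated to the response, large raw scale
y = rng.binomial(1, 1 / (1 + np.exp(-2 * signal)))
X = np.column_stack([signal, noise])

pipe = Pipeline([('scaler', StandardScaler()),
                 ('logistic_reg', LogisticRegression())])
pipe.fit(X, y)

# Rank features by the absolute value of their standardized coefficient
coef = pipe['logistic_reg'].coef_[0]
ranking = (pd.DataFrame({'abs_coef': np.abs(coef),
                         'sign_coef': np.sign(coef)},
                        index=['signal', 'noise'])
           .sort_values('abs_coef', ascending=False))
print(ranking)
```

The sign column keeps the direction of the effect, which the absolute value alone would hide.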

### Random Forest Classifier

In [None]:
# Initiate and fit the model to the train data
rf_classifier = Pipeline([('scaler', StandardScaler()),
                          ('random_forest',
                           RandomForestClassifier(n_estimators=300,
                                                  n_jobs=-1,
                                                  class_weight='balanced'))])
rf_classifier.fit(X_train, y_train)

In [None]:
# Predict the probability customer in validation data is interested
y_pred_proba = rf_classifier.predict_proba(X_val)[:, 1]
y_pred_proba

In [None]:
# Calculate the predicted response and show the classification report
y_pred = rf_classifier.predict(X_val)
print(classification_report(y_val, y_pred))

In [None]:
# Create the confusion matrix
conf_matrix = confusion_matrix(y_val, y_pred)
(tn, fp, fn, tp) = conf_matrix.ravel()
rf_precision = tp / (tp+fp)
rf_recall = tp / (tp+fn)
rf_auc_roc = roc_auc_score(y_val, y_pred)
print('Precision: ', rf_precision)
print('Recall: ', rf_recall)
print('AUC-ROC score: ', rf_auc_roc)

In [None]:
#Plot the confusion matrix
sns.heatmap(conf_matrix, annot=True, fmt='d',
            xticklabels=labels, yticklabels=labels,
            annot_kws={'size': 14})
plt.yticks(rotation=0)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix', size=16);

Even though the precision of this model is higher than that of the logistic regression model, its recall and AUC-ROC score are considerably lower, meaning that the target market is not large enough, as shown by the confusion matrix. Only about 9,500 customers would receive ads, compared to around 38,000 with logistic regression. Therefore, this model is not good enough to predict whether a customer is interested.

### Neural Network

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
num_features = X_processed.shape[1]

model_nn = models.Sequential([layers.InputLayer((num_features,)),
                              layers.Dense(128, activation='relu'),
                              layers.Dropout(0.5),
                              layers.Dense(64, activation='relu'),
                              layers.Dropout(0.5),
                              layers.Dense(32, activation='relu'),
                              layers.Dropout(0.5),
                              layers.Dense(16, activation='relu'),
                              layers.Dropout(0.5),
                              layers.Dense(4, activation='relu'),
                              layers.Dropout(0.5),
                              layers.Dense(1, activation='sigmoid')])

model_nn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model_nn.summary()

In [None]:
class_weights = class_weight.compute_class_weight('balanced',
                                                  classes=np.unique(y_train),
                                                  y=y_train)
class_weight_dict = dict(enumerate(class_weights))

model_nn.fit(X_train_scaled, y_train, batch_size=2048, epochs=30,
             validation_data = (X_val_scaled, y_val), class_weight=class_weight_dict)

In [None]:
y_pred_proba = model_nn.predict(X_val_scaled).ravel()  # flatten the (n, 1) output to 1-D
y_pred_proba

In [None]:
y_pred = (y_pred_proba >= .5).astype('int')
print(classification_report(y_val, y_pred))

In [None]:
# Create the confusion matrix
conf_matrix = confusion_matrix(y_val, y_pred)
(tn, fp, fn, tp) = conf_matrix.ravel()
nn_precision = tp / (tp+fp)
nn_recall = tp / (tp+fn)
nn_auc_roc = roc_auc_score(y_val, y_pred)
print('Precision: ', nn_precision)
print('Recall: ', nn_recall)
print('AUC-ROC score: ', nn_auc_roc)

In [None]:
#Plot the confusion matrix
sns.heatmap(conf_matrix, annot=True, fmt='d',
            xticklabels=labels, yticklabels=labels,
            annot_kws={'size': 14})
plt.yticks(rotation=0)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix', size=16);

The neural network's performance is similar to that of logistic regression. However, because the logistic regression model is more interpretable, we will use it to evaluate the results and make predictions on the test data.

## Section 5: Evaluate the Results

### Question 1: How do 'Previously Insured' and 'Vehicle Damage' indicators correlate with customers' interest in this new vehicle insurance?

In [None]:
# Plot heatmap of pivot table with Previously_Insured and Vehicle_Damage as grouping variables
x_labels = ['No', 'Yes']
y_labels = ['No', 'Yes']
z = insured__vehicle_damage.values.round(4)  # Round to 4 decimal places

fig = ff.create_annotated_heatmap(z=z,
                                  x=x_labels, y=y_labels, showscale=True,
                                  hovertemplate='Vehicle_Damage: %{x}' +
                                  '<br>Previously_Insured: %{y}<br>' +
                                  'Proportion: %{z:.4f}<extra></extra>')
fig.update_xaxes(title='Vehicle_Damage', side='bottom')
fig.update_yaxes(title='Previously_Insured')
fig.update_layout(title='Proportion of Customers Interested')
fig.show('notebook')

In [None]:
feature_importance.head(5)  # Previously_Insured and Vehicle_Damage are the top features for prediction

In [None]:
# Unpivot table to plot as bar chart
insured_damage_df = pd.melt(insured__vehicle_damage.reset_index(),
                            id_vars='Previously_Insured',
                            value_vars=['Yes', 'No'],
                            value_name='Prop_Response')
insured_damage_df

In [None]:
# Plot proportion of customers interested using bar chart
fig = px.bar(insured_damage_df, x='Vehicle_Damage', y='Prop_Response',
             color='Previously_Insured', barmode='group',
             title=("Proportion of Customers Interested Based On "
                    "'Previously_Insured' and 'Vehicle_Damage' Indicators"))
fig.update_yaxes(title='')
fig.show('notebook')

Based on the results from visualization and modeling, the `Previously_Insured` and `Vehicle_Damage` indicators are correlated with customers' interest in the new vehicle insurance. The `Previously_Insured` indicator has a negative correlation with interest, since people who already have vehicle insurance most likely don't need another policy. On the other hand, the `Vehicle_Damage` indicator has a positive correlation. A plausible reason is that people who have experienced vehicle damage have realized the importance of vehicle insurance for covering the loss. Therefore, they might be more interested, especially if they don't already have vehicle insurance.

### Question 2: Are older customers more interested in vehicle insurance than newer customers?

In [None]:
# Plot the mean response for every bin determined by Vintage data
fig = px.histogram(df, x='Vintage', y='Response', histfunc='avg', nbins=50,
                   title='Proportion of Customers Interested Based On Vintage')
fig.update_traces(marker_line_width=.5, marker_line_color='white')
fig.update_yaxes(title='')
fig.update_xaxes(title='Vintage (days)')
fig.show('notebook')

As shown by the chart above, there are no distinguishable differences between older and newer customers in terms of interest in the new vehicle insurance. This suggests that how long customers have been associated with the company does not affect whether they are interested in this product. Hence, the `Vintage` variable is not a good predictor of customers' interests. The company's data engineers might consider not collecting this information for this use case, because it shows no predictive value here.

### Question 3: How well can we predict customers' interests based on customer data?

In [None]:
# Show metrics of the best model
print('Precision: ', log_reg_precision)
print('Recall: ', log_reg_recall)
print('AUC-ROC score: ', log_reg_auc_roc)

In [None]:
# Plot metrics of models tested in the previous section

metrics_df = pd.DataFrame({'Metrics': ['Precision', 'Recall', 'AUC-ROC Score'],
                           'Logistic Regression': [log_reg_precision, log_reg_recall,
                                                   log_reg_auc_roc],
                           'Random Forest': [rf_precision, rf_recall, rf_auc_roc],
                           'Neural Network': [nn_precision, nn_recall, nn_auc_roc]})
metrics_df = pd.melt(metrics_df, id_vars='Metrics', value_vars=['Logistic Regression',
                                                                'Random Forest',
                                                                'Neural Network'],
                     var_name='Models', value_name='Score')


fig = px.bar(metrics_df, x='Models', y='Score',
             color='Metrics', barmode='group',
             title='Model Evaluation Metrics')
fig.update_yaxes(title='')
fig.show('notebook')

The Logistic Regression and Neural Network models are quite good at predicting customers' interests with approximately 0.8 AUC-ROC score while the Random Forest model performs poorly in this dataset. The results from the former models are similar with a slight difference in the precision-recall tradeoff. However, because the Logistic Regression model is more interpretable than the neural network, which acts as a black box, it is considered better than the Neural Network model. Consequently, we will use the Logistic Regression model to predict the test data in the next section.
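
The precision-recall tradeoff mentioned here depends on the probability threshold used to turn scores into class labels. The sketch below uses made-up labels and scores (not model output) to show how raising the threshold narrows the target market: recall falls while precision rises.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy labels and predicted probabilities (illustrative values only)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.6, 0.9, 0.45, 0.7, 0.2])

results = {}
for threshold in (0.3, 0.5, 0.8):
    y_hard = (y_score >= threshold).astype(int)
    results[threshold] = (precision_score(y_true, y_hard),
                          recall_score(y_true, y_hard))
    print(threshold, results[threshold])
```

For a marketing use case like this one, a lower threshold favors reaching most interested customers at the cost of more wasted ads, while a higher threshold does the opposite.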

## Section 6: Make Prediction on Test Data

In [None]:
test_df = pd.read_csv('../input/health-insurance-cross-sell-prediction/test.csv')
test_df.set_index('id', inplace=True)
test_df.head()

In [None]:
X_test = preprocess_data(test_df, fit=False)
X_test

In [None]:
# Fit the model once again to the full train data
log_reg.fit(X_processed, y)

In [None]:
y_test_pred_proba = log_reg.predict_proba(X_test)[:, 1]
y_test_pred_proba

In [None]:
# Count number of predicted interested customers
y_test_pred = (y_test_pred_proba >= .5).astype(int)
pd.Series(y_test_pred).value_counts()

In [None]:
submission = pd.DataFrame({'Response': y_test_pred_proba},
                          index=test_df.index)
submission.head()

In [None]:
submission.to_csv('vehicle_insurance.csv')