<h1 style="text-align:center">Who will buy your insurance?</h1>

<div style="text-align:center;"><img src="https://images.unsplash.com/photo-1570042707390-2e011141ab78?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1789&q=80" /></div>

**Context:** 
> Our client is an Insurance company that has provided Health Insurance to its customers. Now they need your help in building a model to predict whether last year's policyholders (customers) will also be interested in the Vehicle Insurance provided by the company.

**About the Data:**
* id:	Unique ID for the customer
* Gender:	Gender of the customer
* Age:	Age of the customer
* Driving_License:
   * 0 : Customer does not have DL 
   * 1 : Customer already has DL
* Region_Code:	Unique code for the region of the customer
* Previously_Insured:
   * 1 : Customer already has Vehicle Insurance 
   * 0 : Customer doesn't have Vehicle Insurance
* Vehicle_Age:	Age of the Vehicle
* Vehicle_Damage:
   * 1 : Customer got his/her vehicle damaged in the past. 
   * 0 : Customer didn't get his/her vehicle damaged in the past.
* Annual_Premium:	The amount customer needs to pay as premium in the year
* Policy_Sales_Channel:	Anonymized code for the channel used to reach the customer, e.g. different agents, over mail, over phone, in person, etc.
* Vintage:	Number of Days, Customer has been associated with the company
* Response:
   * 1 : Customer is interested 
   * 0 : Customer is not interested

## Imports

In [None]:
# Data Processing
import numpy as np 
import pandas as pd 

# Data Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style='whitegrid')

# Modeling
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold

from sklearn.linear_model import SGDClassifier

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import roc_curve, auc, accuracy_score, roc_auc_score, f1_score, recall_score, precision_score


from sklearn.model_selection import RandomizedSearchCV

# Exploratory Data Analysis

In [None]:
df_train = pd.read_csv('/kaggle/input/health-insurance-cross-sell-prediction/train.csv')
df_test = pd.read_csv('/kaggle/input/health-insurance-cross-sell-prediction/test.csv')

In [None]:
df_train.head()

In [None]:
df_test.head()

The `id` column is unnecessary. Let's drop it from `df_train` and `df_test`.

In [None]:
df_train = df_train.drop(['id'], axis=1)
df_test = df_test.drop(['id'], axis=1)

In [None]:
df_train[['Age', 'Annual_Premium', 'Policy_Sales_Channel', 'Vintage']].describe()

## Check for NaN values

In [None]:
df_train.isna().sum()

In [None]:
df_test.isna().sum()

Neither `df_train` nor `df_test` has any missing values. That's great!

## Target Value - Response

In [None]:
df_train['Response'].value_counts()

In [None]:
b = sns.countplot(x='Response', data=df_train)
b.set_title("Response Distribution")

We can see that far more customers are not interested than interested, so the target classes are imbalanced. We should take this into account when splitting the data into train and test sets for modeling, for example by using stratified splits.
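
To make the imbalance concrete, here is a minimal sketch on a made-up toy frame (the `toy` data and its 90/10 split are assumptions, not the real dataset). It shows how `value_counts(normalize=True)` gives the class shares and how the `stratify` argument of `train_test_split` keeps that ratio in both parts:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy target: 90 "not interested" (0) and 10 "interested" (1)
toy = pd.DataFrame({
    "feature": np.arange(100),
    "Response": [0] * 90 + [1] * 10,
})

# Share of each class in the full data
class_share = toy["Response"].value_counts(normalize=True)

# stratify= keeps the class ratio (roughly) the same in both parts
train_part, test_part = train_test_split(
    toy, test_size=0.2, stratify=toy["Response"], random_state=42
)

train_rate = train_part["Response"].mean()
test_rate = test_part["Response"].mean()

print(class_share)
print(f"positive rate train: {train_rate:.2f}, test: {test_rate:.2f}")
```

The `StratifiedKFold` used in the modeling section applies the same idea to every fold.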

## Gender

In [None]:
b = sns.countplot(x='Gender', data=df_train)
b.set_title("Gender Distribution");

In [None]:
pd.crosstab(df_train['Response'], df_train['Gender']).plot(kind="bar", figsize=(10,6))

plt.title("Response distribution for Gender")
plt.xlabel("0 = Customer is Not interested, 1 = Customer is interested")
plt.ylabel("Amount")
plt.legend(["Female", "Male"])
plt.xticks(rotation=0);

Both genders seem to be pretty similar in their response.

## Age

In [None]:
# histplot with kde=True replaces the deprecated sns.distplot
b = sns.histplot(df_train['Age'], kde=True)
b.set_title("Age Distribution");

In [None]:
b = sns.boxplot(y = 'Age', data = df_train)
b.set_title("Age Distribution");

In [None]:
b = sns.boxplot(y='Age', x='Response', data=df_train);
b.set_title("Age Distribution for each Response");

### Driving_License

In [None]:
df_train['Driving_License'].value_counts()

Only a small number of customers have no driving license, so the column is almost constant. This might mess up our models. Therefore, we drop `Driving_License` for now.

In [None]:
df_train = df_train.drop("Driving_License", axis=1)
df_test = df_test.drop("Driving_License", axis=1)
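
A quick way to spot such almost-constant columns is to look at the share of the most frequent value per column. This is only a sketch on made-up data (the `toy` frame and its 98/2 split are assumptions, not the real counts):

```python
import pandas as pd

# Hypothetical toy data: Driving_License is almost always 1
toy = pd.DataFrame({
    "Driving_License": [1] * 98 + [0] * 2,
    "Previously_Insured": [0, 1] * 50,
})

# value_counts sorts by frequency, so .iloc[0] is the share of the most
# common value; a share close to 1.0 means the column barely varies
top_share = toy.apply(lambda col: col.value_counts(normalize=True).iloc[0])

print(top_share)
```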

### Region_Code

In [None]:
df_train['Region_Code'].value_counts().head(30).plot(kind='barh', figsize=(20,10), title="Region_Code distribution in df_train");

### Previously_Insured

In [None]:
df_train['Previously_Insured'].value_counts()

In [None]:
pd.crosstab(df_train['Response'], df_train['Previously_Insured'])

In [None]:
pd.crosstab(df_train['Response'], df_train['Previously_Insured']).plot(kind="bar", figsize=(10,6))

plt.title("Response distribution for Previously_Insured")
plt.xlabel("0 = Customer is Not interested, 1 = Customer is interested")
plt.ylabel("Amount")
plt.legend(["Customer doesn't have Vehicle Insurance", "Customer already has Vehicle Insurance"])
plt.xticks(rotation=0);

We can see that customers who already have vehicle insurance are, with very few exceptions, not interested.
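
Raw counts can hide how large the difference is, so it may help to look at shares instead. A minimal sketch with `pd.crosstab(..., normalize="columns")` on made-up data (the `toy` frame is hypothetical and much smaller than the real dataset):

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "Previously_Insured": [0, 0, 0, 0, 1, 1, 1, 1],
    "Response":           [1, 1, 0, 0, 0, 0, 0, 0],
})

# normalize="columns" makes each Previously_Insured column sum to 1
rates = pd.crosstab(toy["Response"], toy["Previously_Insured"], normalize="columns")

# Row 1 holds the share of interested customers in each group
interest_rate = rates.loc[1]

print(interest_rate)
```

The same call works for `Gender`, `Vehicle_Age` or `Vehicle_Damage` in place of `Previously_Insured`.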

### Vehicle_Age

In [None]:
df_train['Vehicle_Age'].value_counts()

In [None]:
pd.crosstab(df_train['Response'], df_train['Vehicle_Age']).plot(kind="bar", figsize=(10,6))

plt.title("Response distribution for Vehicle_Age")
plt.xlabel("0 = Customer is Not interested, 1 = Customer is interested")
plt.ylabel("Amount")
plt.legend(["1-2 Year", "< 1 Year", "> 2 Years"])
plt.xticks(rotation=0);

### Vehicle_Damage

In [None]:
df_train['Vehicle_Damage'].value_counts()

In [None]:
pd.crosstab(df_train['Response'], df_train['Vehicle_Damage'])

In [None]:
pd.crosstab(df_train['Response'], df_train['Vehicle_Damage']).plot(kind="bar", figsize=(10,6))

plt.title("Response distribution for Vehicle_Damage")
plt.xlabel("0 = Customer is Not interested, 1 = Customer is interested")
plt.ylabel("Amount")
# crosstab sorts its columns, so the "no damage" group comes first
plt.legend(["No Vehicle Damage", "Vehicle Damage"])
plt.xticks(rotation=0);

### Annual_Premium

In [None]:
df_train['Annual_Premium'].describe()

In [None]:
b = sns.boxplot(y='Annual_Premium', x='Response', data=df_train);
b.set_title("Annual_Premium Distribution for each Response");

### Policy_Sales_Channel

In [None]:
df_train['Policy_Sales_Channel'].describe()

In [None]:
b = sns.boxplot(y='Policy_Sales_Channel', x='Response', data=df_train);
b.set_title("Policy_Sales_Channel Distribution for each Response");

### Vintage

In [None]:
df_train['Vintage'].describe()

In [None]:
b = sns.boxplot(y='Vintage', x='Response', data=df_train);
b.set_title("Vintage Distribution for each Response");

## Feature Engineering

Let's take a look at our data again.

In [None]:
df_train.head()

As a first step, we should get all our data in numeric form.

In [None]:
# Mark the categorical features so that pd.get_dummies one-hot encodes them
df_train['Gender'] = pd.Categorical(df_train['Gender'])
df_train['Previously_Insured'] = pd.Categorical(df_train['Previously_Insured'])
df_train['Vehicle_Age'] = pd.Categorical(df_train['Vehicle_Age'])
df_train['Vehicle_Damage'] = pd.Categorical(df_train['Vehicle_Damage'])
# Response is left as a numeric 0/1 column so that df_train.corr() includes it.
# Region_Code is not carried over into the modeling frame below.

# Keep the numeric columns and append the one-hot encoded categorical ones
df_train = pd.concat([df_train[['Age', 'Annual_Premium', 'Policy_Sales_Channel', 'Vintage', 'Response']],
           pd.get_dummies(df_train[['Gender', 'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage']])], axis=1)

In [None]:
df_train.head()

Great! Now our data is in numeric form!
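
To see what `pd.get_dummies` does to a categorical column, here is a small sketch on a made-up frame (the `toy` data is hypothetical; `dtype=int` is passed explicitly because the default output dtype differs between pandas versions):

```python
import pandas as pd

# Hypothetical toy data with two categorical columns
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Vehicle_Age": ["< 1 Year", "1-2 Year", "> 2 Years"],
})

# Each category becomes its own 0/1 column named <column>_<category>
encoded = pd.get_dummies(toy, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Note that `Gender_Female` and `Gender_Male` carry the same information. Passing `drop_first=True` would drop one column per feature, which can matter for linear models.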

Now let's take a look at the correlation:

In [None]:
#Using Pearson Correlation
plt.figure(figsize=(12,10))
cor = df_train.corr()
sns.heatmap(cor, annot=True)
plt.show()
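
A full heatmap can be hard to read, so it may be easier to look only at the correlation of each feature with the target. A minimal sketch on made-up numbers (the `toy` frame and its values are assumptions, chosen only to illustrate the call):

```python
import pandas as pd

# Hypothetical toy data: Previously_Insured_1 is the exact opposite of Response
toy = pd.DataFrame({
    "Response": [0, 0, 1, 1, 0, 1],
    "Previously_Insured_1": [1, 1, 0, 0, 1, 0],
    "Vintage": [10, 200, 50, 120, 80, 90],
})

# Pearson correlation of every feature with the target,
# sorted by absolute value so the strongest relationship comes first
target_corr = (
    toy.corr()["Response"]
    .drop("Response")
    .sort_values(key=abs, ascending=False)
)

print(target_corr)
```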

# Modeling

Let's check the scores with all remaining features, using 5-fold stratified cross-validation and ROC AUC:

In [None]:
X = df_train.drop(["Response"], axis=1).to_numpy()
y = df_train['Response'].values

In [None]:
np.random.seed(42)

# Defining a dictionary of models
models = {"Logistic Regression": LogisticRegression(max_iter=10000), 
          "Random Forest": RandomForestClassifier(),
          "GradientBoostingClassifier" : GradientBoostingClassifier()}


# Stratified folds preserve the class ratio of Response in every split
skf = StratifiedKFold(n_splits=5, shuffle=True)

for name, model in models.items():

    # ROC AUC score of each fold for the current model
    roc_auc_score_list = []

    for train_index, test_index in skf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        model.fit(X_train, y_train)

        # Predicted probability of the positive class (Response == 1)
        y_proba = model.predict_proba(X_test)[:, 1]

        roc_auc_score_list.append(roc_auc_score(y_test, y_proba))

        # Add this fold's ROC curve to the shared plot
        fpr, tpr, _ = roc_curve(y_test, y_proba)
        plt.plot(fpr, tpr)

    print(f'Mean roc_auc_score {name} : {np.mean(roc_auc_score_list)}')

The `GradientBoostingClassifier()` gives us the best score, with a mean ROC AUC of about `0.8547` across the five folds.
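
The manual fold loop could probably be condensed with scikit-learn's `cross_val_score`. The sketch below is not what this notebook ran: it uses synthetic data from `make_classification` instead of the insurance data, and `X_toy`, `y_toy` and `skf_toy` are placeholder names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the real features and target
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=42
)

# Same fold setup as in the loop, with a fixed seed for reproducibility
skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# scoring="roc_auc" evaluates each fold on the model's continuous scores
# rather than on hard 0/1 predictions
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=skf_toy, scoring="roc_auc"
)

print(f"ROC AUC per fold: {np.round(scores, 3)}")
print(f"Mean ROC AUC: {scores.mean():.3f}")
```

This returns one score per fold but no fitted models or ROC curves, so the explicit loop is still the better fit when the curves should be plotted.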

**If you liked this notebook or found it helpful in any way, feel free to leave an upvote - That will keep me motivated :)**

**If you have any suggestions for improvement, leave a comment :)**