# Problem Statement

Our client is an insurance company that has provided health insurance to its customers. They now need your help in building a model to predict whether last year's policyholders (customers) will also be interested in the vehicle insurance provided by the company.

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

For example, you may pay a premium of Rs. 5,000 each year for a health insurance cover of Rs. 200,000 so that if, God forbid, you fall ill and need to be hospitalised in that year, the insurance company will bear the cost of hospitalisation etc. up to Rs. 200,000. If you are wondering how the company can bear such a high hospitalisation cost when it charges a premium of only Rs. 5,000, that is where the concept of probabilities comes into the picture. Like you, there may be 100 customers paying a premium of Rs. 5,000 every year, but only a few of them (say 2-3) would be hospitalised that year. This way everyone shares the risk of everyone else.
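
The arithmetic behind this example can be checked in a few lines. This is a minimal sketch that assumes every claim is paid at the full cover, which is the worst case for the insurer; the variable names are illustrative and not part of the dataset.

```python
# Figures taken from the example above; paying every claim at full cover is an assumption.
customers = 100
premium = 5_000   # Rs. per customer per year
cover = 200_000   # Rs. maximum payout per customer

premium_pool = customers * premium         # total premium collected in a year
break_even_claims = premium_pool / cover   # full-cover claims the pool can absorb

print(f"Premium pool: Rs. {premium_pool:,}")
print(f"Full-cover claims the pool can absorb: {break_even_claims}")

for hospitalised in (2, 3):
    worst_case_payout = hospitalised * cover
    print(f"{hospitalised} claims -> worst-case payout Rs. {worst_case_payout:,}")
```

With these numbers the pool absorbs 2.5 full-cover claims, so the insurer depends on the claim rate staying low or on claims costing less than the maximum cover.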

Just like medical insurance, there is vehicle insurance, where every year the customer pays a premium of a certain amount to the insurance company so that, in case of an unfortunate accident involving the vehicle, the insurance company will provide compensation (called 'sum assured') to the customer.

# Task

Building a model to predict whether a customer would be interested in vehicle insurance is extremely helpful for the company, because it can then plan its communication strategy to reach those customers and optimise its business model and revenue.

To predict whether a customer would be interested in vehicle insurance, you have information about demographics (gender, age, region code type), vehicles (vehicle age, damage), policy (premium, sourcing channel), etc.

# Data Description

* id: Unique ID for the customer
* Gender: Gender of the customer
* Age: Age of the customer
* Driving_License:
    * 0 : Customer does not have DL
    * 1 : Customer already has DL
* Region_Code: Unique code for the region of the customer
* Previously_Insured:
    * 1 : Customer already has Vehicle Insurance
    * 0 : Customer doesn't have Vehicle Insurance
* Vehicle_Age: Age of the Vehicle
* Vehicle_Damage:
    * 1 : Customer got his/her vehicle damaged in the past.
    * 0 : Customer didn't get his/her vehicle damaged in the past.
* Annual_Premium: The amount the customer needs to pay as premium in the year
* Policy_Sales_Channel: Anonymized code for the channel used to reach the customer, i.e. different agents, over mail, over phone, in person, etc.
* Vintage: Number of days the customer has been associated with the company
* Response:
    * 1 : Customer is interested
    * 0 : Customer is not interested
    
# Leaderboard
* Public LB: 85.80% (rank=135)
* Private LB: 86.28% (rank=105)

[Link to the Leaderboard.](https://datahack.analyticsvidhya.com/contest/janatahack-cross-sell-prediction/#LeaderBoard)

# Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from lightgbm import LGBMClassifier
from sklearn.model_selection import RandomizedSearchCV
from imblearn.over_sampling import RandomOverSampler
from collections import Counter
from sklearn.model_selection import ShuffleSplit, cross_val_score

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Loading the data

In [None]:
train_orig= pd.read_csv('/kaggle/input/health-insurance-cross-sell-prediction/train.csv')
test_orig= pd.read_csv('/kaggle/input/health-insurance-cross-sell-prediction/test.csv')
subm= pd.read_csv('/kaggle/input/health-insurance-cross-sell-prediction/sample_submission.csv')

In [None]:
train_orig.head()

In [None]:
test_orig.head()

# Exploratory Data Analysis

In [None]:
train_orig.isna().sum()

In [None]:
test_orig.isna().sum()

In [None]:
for col in test_orig.columns:
    print(f"{col}")
    print(f"Train:{train_orig[col].nunique()}\nTest:{test_orig[col].nunique()}")
    print("===============================")

In [None]:
train_orig.info()

In [None]:
sns.countplot(x='Response',data=train_orig);

**Imbalanced Binary Classification Problem**
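
The count plot shows the imbalance visually; the class shares can also be put into numbers. The sketch below uses a made-up 88/12 split as a stand-in for `train_orig['Response']`, so the printed figures are illustrative and not the real class shares.

```python
import pandas as pd

# Toy stand-in for train_orig['Response']; the 88/12 split is invented for illustration.
response = pd.Series([0] * 88 + [1] * 12, name="Response")

counts = response.value_counts()                 # absolute count per class
shares = response.value_counts(normalize=True)   # share of each class
imbalance_ratio = counts.max() / counts.min()    # majority rows per minority row

print(shares)
print(f"Majority/minority ratio: {imbalance_ratio:.1f}")
```

On the real data, the same three lines can be run on `train_orig['Response']` directly.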

In [None]:
plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
plt.hist(train_orig['Region_Code'],bins=30,label='Train');
plt.legend()
plt.subplot(1,2,2)
plt.hist(test_orig['Region_Code'],bins=30,label='Test');
plt.legend();

In [None]:
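# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(train_orig['Age'], kde=True, stat='density');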

In [None]:
sns.countplot(x='Gender',data=train_orig,hue='Response');

In [None]:
sns.countplot(x='Driving_License',data=train_orig,hue='Gender');

In [None]:
sns.countplot(x='Driving_License',data=train_orig,hue='Response');

In [None]:
sns.countplot(x='Previously_Insured',data=train_orig,hue='Response');

In [None]:
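# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(train_orig['Policy_Sales_Channel'], kde=True, stat='density');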

In [None]:
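# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(train_orig['Vintage'], kde=True, stat='density');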

## Concatenating train and test set for further inspection

In [None]:
data= pd.concat([train_orig,test_orig],axis=0,sort=False)

In [None]:
data.info()

In [None]:
data.nunique()

In [None]:
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
plt.hist(train_orig['Annual_Premium'],bins=30,label='Train');
plt.legend()
plt.subplot(1,3,2)
plt.hist(test_orig['Annual_Premium'],bins=30,label='Test');
plt.legend();
plt.subplot(1,3,3)
plt.hist(data['Annual_Premium'],bins=30,label='Data');
plt.legend();

## Treating Outliers

In [None]:
def outliers(df, variable, distance):
    """Return the (upper, lower) IQR boundaries outside which values count as outliers.

    Suited to skewed distributions. `distance` is the IQR multiplier,
    typically 1.5 or 3.
    """
    q1 = df[variable].quantile(0.25)
    q3 = df[variable].quantile(0.75)
    IQR = q3 - q1

    lower_boundary = q1 - (IQR * distance)
    upper_boundary = q3 + (IQR * distance)

    return upper_boundary, lower_boundary

upper_limit, lower_limit = outliers(data, 'Annual_Premium', 1.5)
upper_limit, lower_limit
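
To make the IQR rule concrete, here is a small worked example on toy values. It is a sketch under stated assumptions: `iqr_bounds` is a hypothetical helper that mirrors the function above on a single Series, and the seven premiums are invented.

```python
import pandas as pd

def iqr_bounds(series, distance=1.5):
    """Return (upper, lower) fences at `distance` IQRs beyond the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q3 + distance * iqr, q1 - distance * iqr

# Toy premiums (hypothetical values); 100 is an obvious high outlier.
toy = pd.Series([10, 12, 14, 16, 18, 20, 100], name="Annual_Premium")

upper, lower = iqr_bounds(toy)
capped = toy.clip(lower=lower, upper=upper)  # pull outliers back to the fences

print(upper, lower)
print(capped.tolist())
```

Here the quartiles are 13 and 19, so the IQR is 6 and the fences sit at 4 and 28; only the value 100 is capped.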

In [None]:
# Cap values outside the IQR boundaries at the boundaries
data['Annual_Premium'] = data['Annual_Premium'].clip(lower=lower_limit, upper=upper_limit)

In [None]:
plt.hist(data['Annual_Premium'],bins=30);

## Feature Generation

In [None]:
data['Vehicle_Age'].value_counts()

In [None]:
data.groupby(['Gender','Driving_License','Response']).size()

In [None]:
# Combine licence status and gender into one categorical feature, e.g. '1_Male'
data['DL_Gender'] = data['Driving_License'].astype(str) + '_' + data['Gender']

In [None]:
data.groupby(['Vehicle_Age','Vehicle_Damage']).size()

In [None]:
data['Vehicle_Age_and_Damage']= data['Vehicle_Age']+'_'+data['Vehicle_Damage']

In [None]:
data.groupby(['Vehicle_Age','Previously_Insured']).size()

In [None]:
# Combine vehicle age and prior-insurance status into one categorical feature
data['Previously_Insured_Vehicle_Age'] = data['Vehicle_Age'] + '_' + data['Previously_Insured'].astype(str)

In [None]:
data.head()

In [None]:
# Convert Vintage from days to years
data['Vintage'] = data['Vintage'] / 365

## Encoding the data

In [None]:
gender_map= {'Male':0,'Female':1}
vehicle_age_map= {'< 1 Year':0,'1-2 Year':1,'> 2 Years':2}
vehicle_damage_map= {'Yes':1,'No':0}

data['Gender']= data['Gender'].map(gender_map)
data['Vehicle_Age']= data['Vehicle_Age'].map(vehicle_age_map)
data['Vehicle_Damage']= data['Vehicle_Damage'].map(vehicle_damage_map)
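
One thing to keep in mind with `Series.map` and a dictionary: any label missing from the dictionary silently becomes `NaN`. A minimal sketch of a sanity check, using an invented toy column with one deliberately misspelled label:

```python
import pandas as pd

vehicle_age_map = {'< 1 Year': 0, '1-2 Year': 1, '> 2 Years': 2}

# Toy column (hypothetical); the last label is deliberately misspelled.
raw = pd.Series(['< 1 Year', '1-2 Year', '> 2 Years', '> 2 Year'])
encoded = raw.map(vehicle_age_map)

# Labels that are not keys of the dictionary come back as NaN.
unmapped = raw[encoded.isna()].unique().tolist()
print(unmapped)
```

Running the same check on the real columns after mapping confirms that every label was covered.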

In [None]:
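# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(data['Region_Code'], kde=True, stat='density');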

In [None]:
plt.figure(figsize=(14,6))
sns.countplot(x='Region_Code',data=data);
plt.xticks(rotation=90);

In [None]:
data.dtypes

In [None]:
cat_col= [col for col in data.columns if data[col].dtypes=='object']
cat_col

In [None]:
for col in cat_col:
    dummies= pd.get_dummies(data[col])
    data=pd.concat([data,dummies],axis=1)
    data.drop(columns=[col],inplace=True)

**Dropping unnecessary columns.**

In [None]:
data.drop(columns=['id','Response','Driving_License','0_Female'],inplace=True)

In [None]:
data.head().T

## Splitting back the data to train and test set.

In [None]:
train_new= data[:len(train_orig)]
test_new= data[len(train_orig):]

## Imbalanced binary classification: using an oversampler

In [None]:
train_os=RandomOverSampler(random_state=101)
y=train_orig['Response']

# fit_sample was removed from imbalanced-learn; fit_resample is the current method name
X_os,y_os=train_os.fit_resample(train_new,y)

In [None]:
print('Original dataset shape {}'.format(Counter(y)))
print('Resampled dataset shape {}'.format(Counter(y_os))) 

In [None]:
#y= train_orig['Response']

X_train, X_val, y_train, y_val = train_test_split(X_os, y_os, test_size=0.3, random_state=101)
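
Because the rows are oversampled before the split, copies of the same minority row can end up in both `X_train` and `X_val`, which can make the validation AUC look better than it is. A possible alternative is to split first and oversample only the training part. The sketch below shows that order on synthetic data; the data, the variable names, and the plain NumPy resampling used as a stand-in for `RandomOverSampler` are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for train_new / y (hypothetical data, roughly 10% positives).
rng = np.random.default_rng(101)
X = pd.DataFrame({"f1": rng.normal(size=1000), "f2": rng.normal(size=1000)})
y = pd.Series((rng.random(1000) < 0.1).astype(int), name="Response")

# 1) Split first, so the validation rows keep the original class balance.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y
)

# 2) Oversample the minority class in the training part only.
#    Sampling index labels with replacement stands in for RandomOverSampler.
minority = y_tr[y_tr == 1].index.to_numpy()
n_extra = int((y_tr == 0).sum()) - len(minority)
extra = rng.choice(minority, size=n_extra, replace=True)
X_tr_os = pd.concat([X_tr, X_tr.loc[extra]])
y_tr_os = pd.concat([y_tr, y_tr.loc[extra]])

print(y_tr_os.value_counts().to_dict())  # balanced training classes
print(y_va.value_counts().to_dict())     # untouched validation classes
```

With `RandomOverSampler`, step 2 would be a single `fit_resample(X_tr, y_tr)` call on the training part.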

In [None]:
# Re-attach the target so its correlation with each feature can be inspected
check= pd.concat([train_new, train_orig['Response']],axis=1,sort=False)

**Checking the correlation of the data**

In [None]:
plt.figure(figsize=(16,10))
sns.heatmap(check.corr(),cbar=False,cmap='inferno',annot=True);

In [None]:
check.corr()["Response"].sort_values()

# Model building: LGBMClassifier

In [None]:
model= LGBMClassifier(boosting_type='gbdt',objective='binary',random_state=101)

# Hyperparameter tuning

In [None]:
#model_tuning.best_estimator_
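
The tuning cell itself is commented out, so the search behind `model_tuning.best_estimator_` is not shown. Below is a minimal sketch of how such a search could look with `RandomizedSearchCV`, which is already imported above. The search space, the synthetic data, and the use of scikit-learn's `GradientBoostingClassifier` as a dependency-light stand-in for `LGBMClassifier` are all assumptions, not the settings behind the reported scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic imbalanced data (hypothetical stand-in for X_train, y_train).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=101
)

# Illustrative search space; parameter names belong to the stand-in estimator.
param_distributions = {
    "learning_rate": [0.01, 0.03, 0.1],
    "n_estimators": [50, 100],
    "subsample": [0.5, 0.8, 1.0],
}

model_tuning = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=101),
    param_distributions=param_distributions,
    n_iter=4,            # number of random parameter combinations to try
    scoring="roc_auc",   # same metric as the leaderboard
    cv=3,
    random_state=101,
)
model_tuning.fit(X_demo, y_demo)

print(model_tuning.best_params_)
print(round(model_tuning.best_score_, 3))
```

Swapping in `LGBMClassifier` would only require replacing the estimator and using its own parameter names in the search space.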

In [None]:
model=LGBMClassifier(colsample_bytree=0.5, learning_rate=0.03,
                     n_estimators=600, objective='binary', reg_alpha=0.1,
                     random_state=101,reg_lambda=0.8)

model.fit(X_train,y_train)

In [None]:
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
scores=cross_val_score(model, X_val, y_val, cv=cv,scoring='roc_auc')
scores.mean()

In [None]:
val_pred= model.predict_proba(X_val)[:,1]

In [None]:
val_pred

In [None]:
print(roc_auc_score(y_val,val_pred))
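
For intuition about the number printed above: ROC AUC equals the probability that a randomly chosen positive row gets a higher score than a randomly chosen negative row. The sketch below computes it that way in pure Python. `pairwise_auc` is a hypothetical helper, not part of scikit-learn, and it is quadratic in the number of rows, so it is meant for toy inputs only.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (hypothetical values).
y_toy = [0, 0, 1, 1]
p_toy = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_toy, p_toy)
print(auc)
```

Three of the four positive/negative pairs are ordered correctly here, so the result is 0.75.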

In [None]:
pred= model.predict_proba(test_new)[:,1]

# File submission

In [None]:
subm.head()

In [None]:
subm['Response']= pred

In [None]:
subm.to_csv('submission.csv',index=False)
