## Health Insurance Lead Prediction

Problem Statement:
Your client FinMan is a financial services company that provides various financial services such as loans, investment funds, and insurance to its customers. FinMan wishes to cross-sell health insurance to existing customers who may or may not hold insurance policies with the company. The company recommends health insurance to its customers based on their profile once these customers land on the website. Customers might browse the recommended health insurance policy and consequently fill out a form to apply. When these customers fill out the form, their Response towards the policy is considered positive and they are classified as a lead.

Once these leads are acquired, the sales advisors approach them to convert and thus the company can sell proposed health insurance to these leads in a more efficient manner.

Now the company needs your help in building a model to predict whether the person will be interested in their proposed Health plan/policy.

In [None]:
# Printing the filepath of data

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Scientific and Data Manipulation Libraries :

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Data Viz & Regular Expression Libraries :

import matplotlib.pyplot as plt
import seaborn as sns
get_ipython().run_line_magic('matplotlib', 'inline')

# Scikit-Learn Pre-Processing Libraries :

from sklearn.preprocessing import StandardScaler  # only the scaler is referenced in this notebook

# Garbage Collection Libraries :

import gc

# Boosting Algorithm Libraries :

from xgboost                          import XGBClassifier
# from catboost                         import CatBoostClassifier, Pool
# from lightgbm                         import LGBMClassifier
# from sklearn.ensemble                 import RandomForestClassifier, VotingClassifier

# Model Evaluation Metric & Cross Validation Libraries :
from sklearn.metrics                  import roc_auc_score
from sklearn.model_selection          import StratifiedKFold,KFold, RepeatedStratifiedKFold, train_test_split

# Setting SEED to Reproduce Same Results even with "GPU" :
seed_value = 1994

import os
import random
import warnings

os.environ['PYTHONHASHSEED'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
SEED = seed_value
warnings.filterwarnings('ignore')

In [None]:
# Reading files

train = pd.read_csv('../input/jobathon-analytics-vidhya/train.csv')
test = pd.read_csv('../input/jobathon-analytics-vidhya/test.csv')
sub = pd.read_csv('../input/jobathon-analytics-vidhya/sample_submission.csv')

## EDA

In [None]:
print(f'Train set has {train.shape[0]} rows and {train.shape[1]} columns.')
print(f'Test set has {test.shape[0]} rows and {test.shape[1]} columns.')

In [None]:
train.head()

In [None]:
print('Percentage of missing values in each column')
train.isnull().sum()/train.shape[0]*100

In [None]:
print('There are no duplicate values.') if train.duplicated().sum()==0 else print('Duplicates found!!')

In [None]:
print('Number of unique values in categorical column.')
cat_col = ['City_Code', 'Region_Code', 'Accomodation_Type',
       'Reco_Insurance_Type', 'Is_Spouse','Health Indicator',
        'Holding_Policy_Type','Reco_Policy_Cat']
train[cat_col].nunique()

* Only 3 columns have missing values: Health Indicator, Holding_Policy_Duration and Holding_Policy_Type. Since these are categorical columns and no domain knowledge is available about their classes, the missing values can be imputed with the mode. Alternatively, filling them with a constant such as 0 gives the same result.
* There are no duplicate rows in the train data.
* The Region_Code feature has a very large number of distinct classes. I tried frequency encoding for it, but it brought no major improvement in score.
* The features Accomodation_Type, Reco_Insurance_Type and Is_Spouse have only two distinct classes, so I will binary encode them.
* For the other features, I tried to reduce the number of distinct classes using their frequency and response rate, but the score reached only 0.62. I also tried the FeatureHasher class from sklearn.feature_extraction, but the score reached only 0.59.
* So, finally, I simply one-hot encoded these features using the get_dummies function from pandas.
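
The frequency encoding mentioned above can be sketched roughly as follows. This is a minimal illustration on made-up values, not the exact code that was tried; the `toy` frame and the `Region_Code_freq` column name are assumptions.

```python
import pandas as pd

# Toy stand-in for the high-cardinality Region_Code column (values are made up).
toy = pd.DataFrame({"Region_Code": [101, 101, 205, 309, 101, 205]})

# Relative frequency of each class in the data at hand.
freq = toy["Region_Code"].value_counts(normalize=True)

# Replace each class by its frequency; classes not seen in `freq` would become NaN.
toy["Region_Code_freq"] = toy["Region_Code"].map(freq)
print(toy)
```

On the real data the frequencies would typically be learned from the training rows and then mapped onto the test rows.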

In [None]:
plt.figure(figsize=(16,5))
sns.barplot(data=train,x='City_Code',y='Region_Code');

In [None]:
sns.violinplot(data=train,x='Response',y='Region_Code',palette='summer');

There is no information about Region_Code and no pattern with respect to the target variable. So I cannot transform it much.

In [None]:
sns.set_style('darkgrid')
sns.countplot(data=train,x='Accomodation_Type',hue='Response',palette='summer')
plt.xlabel('Customer Owns or Rents the house',fontdict={'fontsize': 15,'color':'Brown'},labelpad=3);

Customers who own a house are more likely to give a positive response.

In [None]:
sns.set_style('darkgrid')
sns.countplot(data=train,x='Reco_Insurance_Type',hue='Response',palette='summer')
plt.xlabel('Type for the recommended insurance',fontdict={'fontsize': 15,'color':'Green'},labelpad=3);

Customers have mostly shown interest in Individual insurance, though the response rate seems similar for both types.
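
Count plots show volumes, so a response rate is easier to compare as a number. A minimal sketch, using a made-up toy frame rather than the competition data, might look like this:

```python
import pandas as pd

# Toy data: the column names follow the notebook, the values are invented.
toy = pd.DataFrame({
    "Reco_Insurance_Type": ["Individual"] * 4 + ["Joint"] * 2,
    "Response": [1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 target is the share of positive responses in each class;
# the count shows how many rows each rate is based on.
rate = toy.groupby("Reco_Insurance_Type")["Response"].agg(["mean", "count"])
print(rate)
```

The same `groupby` call applied to `train` would give the actual rates for each insurance type.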

In [None]:
sns.countplot(data=train[train['Reco_Insurance_Type']=='Joint'],x='Is_Spouse',hue='Response',palette='summer')
plt.xlabel('If the customers are married to each other')
plt.title('Distribution of those who were Recommended Joint type of Insurance',fontsize=15);

Customers showing interest in joint insurance are mostly couples.

In [None]:
print('Distribution of Age of primary customer.')
fig,axes=plt.subplots(2,1,figsize=(8,4))
sns.distplot(train[train['Response']==0]['Upper_Age'],bins=30,color='red',ax=axes[0])
axes[0].set_title('Response=0',fontsize=18)
sns.distplot(train[train['Response']==1]['Upper_Age'],bins=30,color='blue',ax=axes[1])
axes[1].set_title('Response=1',fontsize=18)
plt.tight_layout();

Age does not seem to affect the target variable.

In [None]:
plt.figure(figsize=(8,4))
sns.countplot(data=train,x='Health Indicator',hue='Response',palette='summer')
plt.xlabel('Encoded values for health of the customer',fontdict={'fontsize': 15,'color':'Green'},labelpad=3);

As Health Indicator goes from X1 to X9 customer is less likely to be a lead.

In [None]:
plt.figure(figsize=(8,4))
sns.countplot(data=train,x='Holding_Policy_Duration',hue='Response',palette='summer',
              order=['1.0', '2.0', '3.0', '4.0', '5.0', '6.0', '7.0',
       '8.0', '9.0', '10.0', '11.0', '12.0', '13.0', '14.0','14+'])
plt.xlabel('Duration (in years) of holding policy',fontdict={'fontsize': 15,'color':'Green'},labelpad=3);

The majority of customers hold policies of less than 5 years. As the number of years increases, the probability of Response 1 decreases. The 14+ class looks heavier, possibly because it pools several durations. To treat this column as numerical, I will replace 14+ with 15.

In [None]:
plt.figure(figsize=(8,4))
sns.countplot(data=train,x='Holding_Policy_Type',hue='Response',palette='summer')
plt.xlabel('Type of holding policy',fontdict={'fontsize': 15,'color':'Green'},labelpad=3);

Policy corresponding to 3.0 seems most popular.

In [None]:
plt.figure(figsize=(8,4))
sns.countplot(data=train,x='Reco_Policy_Cat',hue='Response',palette='summer')
plt.xlabel('Encoded value for recommended health insurance',fontdict={'fontsize': 15,'color':'Green'},labelpad=3);

Insurance policies corresponding to 16-22 are most likely.

In [None]:
print('Annual Premium (INR) for the recommended health insurance')
fig,axes=plt.subplots(2,1,figsize=(10,5))
sns.distplot(train[train['Response']==0]['Reco_Policy_Premium'],color='red',ax=axes[0])
axes[0].set_title('Response=0',fontsize=18)
sns.distplot(train[train['Response']==1]['Reco_Policy_Premium'],color='blue',ax=axes[1])
axes[1].set_title('Response=1',fontsize=18)
plt.tight_layout();

The Annual Premium distribution is similar for both classes.

## Preprocessing

In [None]:
# Joining training and test data for preprocessing
full_df=pd.concat([train,test])

# Creating a new feature: the gap between 'Upper_Age' and 'Lower_Age'
full_df['Age']=full_df.Upper_Age-full_df.Lower_Age

# Label encoding categorical features with two classes
full_df.Is_Spouse = full_df.Is_Spouse.map({'No':0,'Yes':1})
full_df.Reco_Insurance_Type = full_df.Reco_Insurance_Type.map({'Individual':0,'Joint':1})
full_df.Accomodation_Type = full_df.Accomodation_Type.map({'Owned':0,'Rented':1})

# Filling missing values with a constant (0), which gave the same result as mode imputation
for i in ['Health Indicator','Holding_Policy_Duration','Holding_Policy_Type']:
    full_df[i] = full_df[i].fillna(0)
for i in ['Reco_Policy_Cat','Holding_Policy_Type']:
    full_df[i] = full_df[i].astype(object)

# Holding_Policy_Duration: map the open-ended '14+' class to 15 so the column can be numeric
full_df['Holding_Policy_Duration'] = full_df['Holding_Policy_Duration'].replace('14+',15.0)
full_df['Holding_Policy_Duration'] = full_df['Holding_Policy_Duration'].astype(float)

    
# One hot encoding categorical features with multiple classes
dummies = pd.get_dummies(full_df[['City_Code','Health Indicator','Holding_Policy_Type','Reco_Policy_Cat']],drop_first=True)
final_data = pd.concat([full_df,dummies],axis=1)
final_data.drop(['ID','City_Code','Health Indicator','Lower_Age','Holding_Policy_Type','Reco_Policy_Cat'],axis=1,inplace=True)
final_data.head()

In [None]:
# Splitting combined data into preprocessed train and test data
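# Test rows have no Response (NaN after concatenation), so dropna() keeps only the training rows
train_data=final_data.dropna()
# Test rows follow the training rows in the combined frame
test_data = final_data.iloc[len(train):].drop('Response',axis=1)
train_data.shape,test_data.shape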

In [None]:
# Scaling the data - this didn't make much difference in score so I dropped it

# X=train_data.drop('Response',axis=1)
# y=train_data.Response
# X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30)

# to_be_scaled_feat = [ 'Age','Reco_Policy_Premium','Upper_Age', 'Region_Code']
# scaler=StandardScaler()
# scaler.fit(train_data[to_be_scaled_feat])
# X_train[to_be_scaled_feat] = scaler.transform(X_train[to_be_scaled_feat])
# X_test[to_be_scaled_feat] = scaler.transform(X_test[to_be_scaled_feat])
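
If scaling is revisited, note that the commented-out snippet above fits the scaler on the whole of train_data before transforming the split. A minimal sketch of the usual alternative, fitting on the training split only, is shown here on synthetic data; the toy columns and values are assumptions for illustration, not the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric features and a binary target.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Age": rng.integers(0, 40, 20).astype(float),
    "Reco_Policy_Premium": rng.uniform(2000, 40000, 20),
})
y = pd.Series(rng.integers(0, 2, 20))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

cols = ["Age", "Reco_Policy_Premium"]
# Learn mean and standard deviation from the training split only.
scaler = StandardScaler().fit(X_train[cols])

# Apply the same transformation to both splits, on copies of the frames.
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[cols] = scaler.transform(X_train[cols])
X_test_scaled[cols] = scaler.transform(X_test[cols])
```

Tree-based models such as XGBoost are largely insensitive to monotonic rescaling of features, which is consistent with scaling making little difference to the score here.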

### XGBoost model cross-validated with 5 folds

In [None]:
predictor_train_scale = train_data.drop('Response',axis=1)
predictor_test_scale = test_data
target_train = train_data.Response

In [None]:
# making folds
kf=KFold(n_splits=5,shuffle=True,random_state=SEED)  # seeded so the folds are reproducible

preds_3   = list()
y_pred_3  = []
rocauc_score = []

# Applying model on each fold and calculating mean of score
for i,(train_idx,val_idx) in enumerate(kf.split(predictor_train_scale)):    
    
    X_train, y_train = predictor_train_scale.iloc[train_idx,:], target_train.iloc[train_idx]    
    X_val, y_val = predictor_train_scale.iloc[val_idx, :], target_train.iloc[val_idx]
   
    print('\nFold: {}\n'.format(i+1))

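    # early_stopping_rounds is set on the estimator (xgboost >= 1.6);
    # xgboost 2.x no longer accepts it as an argument to fit()
    xg=XGBClassifier(eval_metric='auc',
                     random_state=294,
                     learning_rate=0.15, 
                     max_depth=4,
                     n_estimators=494, 
                     objective='binary:logistic',
                     early_stopping_rounds=100
                    )

    # The last entry of eval_set (the validation fold) drives early stopping
    xg.fit(X_train, y_train
           ,eval_set=[(X_train, y_train),(X_val, y_val)]
           ,verbose=100
           )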

    roc_auc = roc_auc_score(y_val,xg.predict_proba(X_val)[:, 1])
    rocauc_score.append(roc_auc)
#     preds_3.append(xg.predict_proba(predictor_test_scale[predictor_test_scale.columns])[:, 1])
    
# y_pred_final_3         = np.mean(preds_3,axis=0)    
# sub['Response']=y_pred_final_3

print('ROC_AUC - CV Score: {}'.format(np.mean(rocauc_score)),'\n')
print("Score : ",rocauc_score)

# Download and Show Submission File :

# display("sample_submmission",sub)
# sub_file_name_3 = "S3. XGB_GPU_1994SEED_LGBM_NoScaler_MyStyle.csv"
# sub.to_csv(sub_file_name_3,index=False)
# Blend_model_3 = sub.copy()
# sub.head(5)

In [None]:
# An ensemble model of LightGBM, XGBoost and CatBoost also gave me 0.68 score in AV.
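
The notebook does not show how that ensemble was built. One common approach, given here only as an assumption, is to average the predicted probabilities of the individual models. The arrays below are hypothetical placeholders for each model's `predict_proba(...)[:, 1]` output.

```python
import numpy as np

# Hypothetical predicted probabilities of Response=1 from three models
# for the same three test rows (values are invented for illustration).
p_xgb = np.array([0.10, 0.80, 0.40])
p_lgbm = np.array([0.20, 0.70, 0.50])
p_cat = np.array([0.30, 0.90, 0.30])

# Simple blend: unweighted mean of the three probability vectors, row by row.
blend = np.mean([p_xgb, p_lgbm, p_cat], axis=0)
print(blend)
```

Because ROC AUC depends only on the ranking of the scores, an unweighted mean is a reasonable starting point before trying weights.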
