# IN THIS NOTEBOOK:
* Exploratory Data Analysis
* Feature Visualization and Engineering
* Feature Selection
* Modeling
* Model Evaluation
* Hyperparameter Tuning

**Task Details**



Your client is an insurance company that has provided health insurance to its customers. They now need your help building a model to predict whether last year's policyholders (customers) will also be interested in the vehicle insurance the company offers.

For example, you may pay a premium of Rs. 5,000 each year for health insurance cover of Rs. 200,000 so that if, God forbid, you fall ill and need to be hospitalised in that year, the insurance provider will bear the cost of hospitalisation etc. up to Rs. 200,000. If you are wondering how the company can bear such a high hospitalisation cost when it charges a premium of only Rs. 5,000, that is where probability comes into the picture. Like you, there may be 100 customers paying a premium of Rs. 5,000 every year, but only a few of them (say 2-3) would be hospitalised that year. This way everyone shares the risk of everyone else.
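The risk-pooling arithmetic can be checked with a few lines of Python. This is only a sketch using the figures quoted in the paragraph; it assumes every claim costs the full cover, which is the worst case since the policy pays "up to" that amount.

```python
# Figures from the example in the text; full-cover claims are a worst-case assumption.
premium = 5_000      # Rs. paid per customer per year
cover = 200_000      # Rs. maximum payout per hospitalised customer
customers = 100

pool = premium * customers         # total premiums collected: Rs. 500,000
break_even_claims = pool / cover   # full-cover claims the pool can absorb: 2.5

for claims in (2, 3):
    surplus = pool - claims * cover
    print(f"{claims} full-cover claims -> surplus of Rs. {surplus}")
# 2 full-cover claims -> surplus of Rs. 100000
# 3 full-cover claims -> surplus of Rs. -100000
```

Because most hospital bills fall below the maximum cover, the real margin would typically be wider than this worst case suggests.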

Just like medical insurance, there is vehicle insurance, where every year the customer pays a premium of a certain amount to the insurance provider so that, in case of an unfortunate accident involving the vehicle, the insurance provider will pay compensation (called the ‘sum assured’) to the customer.

Building a model to predict whether a customer would be interested in vehicle insurance is extremely helpful for the company because it can then plan its communication strategy to reach those customers and optimise its business model and revenue.

Now, in order to predict whether the customer would be interested in vehicle insurance, you have information about demographics (gender, age, region code type), vehicles (vehicle age, damage), policy (premium, sourcing channel), etc.

***Evaluation Metric***
The evaluation metric for this hackathon is ROC_AUC score.
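To make the metric concrete, here is a small sketch on made-up labels and scores (not taken from the competition data). ROC AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, so it depends only on how the model ranks customers, not on any probability threshold.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores, invented for illustration.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)

# Same quantity computed by hand: share of positive/negative pairs ranked correctly.
pos = [s for s, y in zip(y_score, y_true) if y == 1]
neg = [s for s, y in zip(y_score, y_true) if y == 0]
pairwise = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))

print(auc, pairwise)  # 0.75 0.75
```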

In [None]:


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns 
import matplotlib.pyplot as plt 
sns.set_style(style="darkgrid")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


Data loading

In [None]:
df_train=pd.read_csv('../input/health-insurance-cross-sell-prediction/train.csv')
df_train.head()

Data dimensions, information, and null-value check.

In [None]:
#Data info
data_dims=df_train.shape
print(data_dims)
df_train.info()  # info() prints its summary directly and returns None

In [None]:
#Null Check

null_data=df_train.isnull().sum()
null_data

> Exploratory Data Analysis

In [None]:
#Separate Numerical and Categorical features

num_feats=df_train.select_dtypes(['int64','float64']).columns
cat_feats=df_train.select_dtypes(['object']).columns

feats=[num_feats,cat_feats]
feats


Target balance: check whether the Response classes are imbalanced.

In [None]:
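#Check class imbalance in the target
#Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='Response',data=df_train,palette="twilight")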
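Alongside the plot, the class shares can be read off numerically. The sketch below uses a made-up Series as a stand-in for `df_train['Response']`; the values are illustrative, not the real class distribution.

```python
import pandas as pd

# Toy stand-in for df_train['Response']; values are invented.
response = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 0, 1], name="Response")

class_share = response.value_counts(normalize=True)  # fraction of rows per class
positive_rate = response.mean()                      # share of 1s for a 0/1 target

print(class_share)
print(positive_rate)
```

On the real data, the same two calls on `df_train['Response']` would show how far the classes are from a 50/50 split.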

**Feature Engineering**

*Handle the categorical features with dummy variables (one-hot encoding), dropping the first encoded column.*

In [None]:
#Feature Exploration -> Categorical Types

#1.Gender
sns.countplot(x='Gender',data=df_train,palette='Set3')


#One Hot Encoding on Gender (drop_first leaves a single column; store it as 0/1 integers)
df_train['Gender']=pd.get_dummies(df_train['Gender'],drop_first=True).iloc[:,0].astype(int)
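The effect of `drop_first=True` is easiest to see on a toy column. This is a sketch with invented rows; the column name `Gender_enc` is hypothetical and used only here.

```python
import pandas as pd

toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

full = pd.get_dummies(toy["Gender"])                      # one column per category
reduced = pd.get_dummies(toy["Gender"], drop_first=True)  # first category dropped

# With two categories, the single remaining column carries all the information.
toy["Gender_enc"] = reduced.iloc[:, 0].astype(int)

print(list(full.columns))     # ['Female', 'Male']
print(list(reduced.columns))  # ['Male']
print(toy["Gender_enc"].tolist())  # [1, 0, 0, 1]
```

Dropping one column avoids a redundant, perfectly collinear feature, which matters most for linear models such as logistic regression.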


In [None]:
#2.Vehicle_Damage


sns.countplot(x='Vehicle_Damage',data=df_train,palette='brg')


#One Hot Encoding on Vehicle_Damage (single 0/1 column after drop_first)
df_train['Vehicle_Damage']=pd.get_dummies(df_train['Vehicle_Damage'],drop_first=True).iloc[:,0].astype(int)


*The Vehicle Age feature is divided into 3 categories (label encoding):*

* 0-1 -> 0
* 1-2 -> 1
* above 2 -> 2

In [None]:
#3.Vehicle_Age


va_counts=df_train['Vehicle_Age'].value_counts()

#3.1 String Handling
#3.2 Convert this feature into 3 distinct categories  0-1,1-2,>2 -> 0,1,2 (Label Encoding)




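def string_extract(x):
    """Map a Vehicle_Age string to an ordinal code: 0, 1 or 2."""
    X=x.split()

    if(len(X)==2):    # two tokens: the 1-2 band
        return 1
    if(X[0]=='<'):    # starts with '<': the 0-1 band
        return 0
    if(X[0]=='>'):    # starts with '>': above 2
        return 2
    return None       # unexpected format


df_train['Vehicle_Age_n']=df_train['Vehicle_Age'].apply(string_extract)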
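An explicit dictionary is another way to express the same ordinal encoding. The sketch below assumes the raw labels are `'< 1 Year'`, `'1-2 Year'` and `'> 2 Years'`; that spelling is an assumption consistent with the parsing above, so check it against `va_counts` before relying on it.

```python
import pandas as pd

# Assumed raw labels; verify against df_train['Vehicle_Age'].value_counts().
age_order = {"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2}

toy_age = pd.Series(["1-2 Year", "< 1 Year", "> 2 Years", "1-2 Year"])
encoded = toy_age.map(age_order)

# map() yields NaN for unseen labels, so this catches spelling mismatches early.
assert not encoded.isna().any()
print(encoded.tolist())  # [1, 0, 2, 1]
```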


**Visualizations using:**
* Count plots
* Distribution plots
* Scatter plots

In [None]:
sns.countplot(x='Vehicle_Age',data=df_train,palette='brg')

In [None]:
sns.countplot(x='Vehicle_Damage',data=df_train,palette='twilight')

In [None]:
sns.countplot(x='Previously_Insured',data=df_train)

In [None]:
sns.countplot(x='Previously_Insured',hue='Vehicle_Damage',data=df_train,palette='twilight')

In [None]:
sns.countplot(x='Previously_Insured',hue='Vehicle_Age',data=df_train,palette='twilight')

In [None]:
sns.countplot(x='Vehicle_Damage',hue='Vehicle_Age',data=df_train,palette='Set3')

In [None]:
#Feature Exploration -> Numerical Types

#1.Age
#histplot with kde=True replaces the deprecated distplot
sns.histplot(df_train['Age'],kde=True,color='r')
age_desc=df_train['Age'].describe()
age_desc

In [None]:
#2.Annual_Premium
sns.histplot(df_train['Annual_Premium'],kde=True)
prem_desc=df_train['Annual_Premium'].describe()
prem_desc

In [None]:
plt.scatter(df_train['Vintage'],df_train['Annual_Premium'])

In [None]:
df_train.drop(['id','Vehicle_Age'],inplace=True,axis=1)

**Feature Selection**

*1. Correlation Matrix* 

In [None]:
imp_features=pd.DataFrame(df_train.corr()['Response'].sort_values(ascending=False))
imp_features.columns=['IMP']
indx=imp_features.index



plt.figure(figsize=(25,10))
b=sns.barplot(x=indx,y=imp_features['IMP'])
b.set_xlabel("Features",fontsize=20)
b.set_ylabel("Correlation with Response",fontsize=20)
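One thing to keep in mind when reading the ranking above: sorting by the signed correlation puts strongly negative features at the bottom, even though they are as informative as strongly positive ones. A sketch on invented data (all column names hypothetical) that ranks by absolute value instead:

```python
import pandas as pd

# Toy frame: feature_b is feature_a reversed, noise is unrelated to the target.
toy = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5, 6],
    "feature_b": [6, 5, 4, 3, 2, 1],
    "noise":     [3, 1, 2, 3, 1, 2],
    "Response":  [0, 0, 0, 1, 1, 1],
})

corr_with_target = toy.corr()["Response"].drop("Response")
ranked = corr_with_target.abs().sort_values(ascending=False)

print(corr_with_target)
print(ranked)
```

Here `feature_a` and `feature_b` get the same absolute score, while `noise` lands last.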

*2. Feature Selection - SelectKBest, f_classif (ANOVA F-test)*

In [None]:
from sklearn.feature_selection import SelectKBest,f_classif

X=df_train.drop('Response',axis=1)
Y=df_train['Response']



selector_model=SelectKBest(score_func=f_classif,k='all')
selector=selector_model.fit(X,Y)

cols=X.columns
df_features = pd.DataFrame(cols)
df_scores = pd.DataFrame(selector.scores_)

df_new = pd.concat([df_features, df_scores], axis=1)
df_new.columns = ['Features', 'Score']

df_new = df_new.sort_values(by='Score', ascending=False)
df_new
imp_feature=df_new['Features']


indx=df_new['Features']
plt.figure(figsize=(25,10))
b=sns.barplot(x=indx,y=df_new['Score'])
b.set_xlabel("Features",fontsize=20)
b.set_ylabel("F-score",fontsize=20)


imp_feature
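For reference, here is a minimal sketch of what `SelectKBest` with `f_classif` does, run on synthetic data rather than the insurance table. The column names and the random data are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)

# 'informative' shifts with the class label; 'noise' does not.
toy_X = pd.DataFrame({
    "informative": y + rng.normal(0, 0.5, size=n),
    "noise": rng.normal(0, 1, size=n),
})

selector = SelectKBest(score_func=f_classif, k=1).fit(toy_X, y)
scores = pd.Series(selector.scores_, index=toy_X.columns)
selected = toy_X.columns[selector.get_support()].tolist()

print(scores)
print(selected)
```

The F-score measures how well each feature separates the class means on its own, so it can miss features that only matter in combination with others.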


In [None]:
imp_f=imp_feature[:6]
df_train[imp_f].head()

**Modeling and Evaluation**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import xgboost

*TEST-TRAIN SPLITS*

In [None]:
X=df_train[imp_f]
Y=df_train['Response']

x_train,x_test,y_train,y_test = train_test_split(X,Y, random_state = 0)
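If the target turns out to be imbalanced, one option is to pass `stratify` so that both splits keep the same class ratio. This is a sketch on toy arrays, not a change to the split above; the 12% positive rate is invented.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 12 positives.
toy_X = np.arange(200).reshape(100, 2)
toy_y = np.array([1] * 12 + [0] * 88)

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=0
)

# Both splits should show roughly the original positive rate.
print(toy_y.mean(), y_tr.mean(), y_te.mean())
```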

**1.Logistic Regression**

In [None]:
logreg=LogisticRegression()
logreg.fit(x_train,y_train)
y_pred = logreg.predict_proba(x_test)[:,1]
roc_auc_score(y_test,y_pred)
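`MinMaxScaler` is imported above but not applied. A possible way to use it, sketched on synthetic data, is to chain it with the classifier in a pipeline so the scaler is fitted on the training split only. The data, the inflated first column and the `max_iter` value are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the six selected features.
toy_X, toy_y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
toy_X[:, 0] *= 10_000  # mimic one feature on a much larger scale

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, stratify=toy_y, random_state=0
)

# The scaler learns min/max from x_tr only, then is reused on x_te.
scaled_logreg = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(x_tr, y_tr)

toy_auc = roc_auc_score(y_te, scaled_logreg.predict_proba(x_te)[:, 1])
print(toy_auc)
```

Features on very different scales can slow or stall the solver behind `LogisticRegression`, which is the usual reason to scale first.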

**2.XGBoost without Hyperparameter Tuning**

In [None]:
xgb1=xgboost.XGBClassifier()
xgb1.fit(x_train,y_train)
y_pred = xgb1.predict_proba(x_test)[:,1]
roc_auc_score(y_test,y_pred)

> *Hyperparameter Tuning -> XGBOOST*

'''
from sklearn.model_selection import RandomizedSearchCV,GridSearchCV

# Candidate values sampled by the randomized search
paramss= {
        "learning_rate": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
        "max_depth": [3, 4, 5, 6, 8, 10, 12, 15],
        "min_child_weight": [1, 3, 5, 7],
        "gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
        "colsample_bytree": [0.3, 0.4, 0.5, 0.7]

    }


gbc=xgboost.XGBClassifier()
model2=RandomizedSearchCV(estimator=gbc,param_distributions=paramss,
                cv=5,scoring="roc_auc",
                verbose=10,n_jobs=-1)

# Fit on the training split only so the held-out test rows stay unseen during tuning
model2.fit(x_train,y_train)

print(model2.best_params_)
print(model2.best_index_)

'''
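A quick-running sketch of the same search pattern is shown below. To keep it dependency-light it swaps in scikit-learn's `GradientBoostingClassifier` for XGBoost and uses synthetic data and a much smaller grid, so the parameter values it prints are not comparable to the ones used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data; GradientBoostingClassifier is a stand-in for XGBClassifier.
toy_X, toy_y = make_classification(n_samples=300, n_features=6, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, stratify=toy_y, random_state=0
)

param_space = {
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
}

search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_distributions=param_space,
    n_iter=4,            # number of sampled parameter combinations
    cv=3,
    scoring="roc_auc",
    random_state=0,      # makes the sampled combinations reproducible
)
search.fit(x_tr, y_tr)   # tune on the training split only

print(search.best_params_, search.best_score_)
```

`best_score_` is the mean cross-validated ROC AUC on the training folds; the held-out test split remains available for a final, untouched evaluation.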




In [None]:
xgb1=xgboost.XGBClassifier(min_child_weight= 5, max_depth= 4, learning_rate = 0.25, 
                           gamma= 0.2, colsample_bytree= 0.7)
xgb1.fit(x_train,y_train)
y_pred = xgb1.predict_proba(x_test)[:,1]
roc_auc_score(y_test,y_pred)

**End of Notebook**
* Please upvote if you like it!
* Leave a comment about the notebook.

Thank you!