## Problem Statement

Your client, FinMan, is a financial services company that provides services such as loans, investment funds, and insurance to its customers. FinMan wishes to cross-sell health insurance to its existing customers, who may or may not hold insurance policies with the company. The company recommends health insurance to its customers based on their profile once they land on the website. Customers might browse the recommended health insurance policy and then fill in a form to apply. When a customer fills in the form, their Response to the policy is considered positive and they are classified as a lead.

Once these leads are acquired, sales advisors approach them to convert them, so the company can sell the proposed health insurance more efficiently.

Now the company needs your help in building a model to predict whether the person will be interested in their proposed Health plan/policy given the information about:

- Demographics (city, age, region etc.)
- Information regarding holding policies of the customer
- Recommended Policy Information

## Step 1: Importing the Relevant Libraries

In [1]:
# Basic eda library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Pre-Processing Library 
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

# Imputer 
from sklearn.impute import KNNImputer

# Train Test Split
from sklearn.model_selection import train_test_split

# Model Tuning
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# ML Model
from xgboost import XGBClassifier

# Metrics
from sklearn.metrics import accuracy_score

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

#### Customising Visualization

In [2]:
sns.set(rc = {'figure.figsize':(16,5)})
sns.set_style('whitegrid')

## Step 2: Reading Data

In [3]:
train_df = pd.read_csv('train_Df64byy.csv')
test_df = pd.read_csv('test_YCcRUnU.csv')

In [4]:
train_df.head()

Unnamed: 0,ID,City_Code,Region_Code,Accomodation_Type,Reco_Insurance_Type,Upper_Age,Lower_Age,Is_Spouse,Health Indicator,Holding_Policy_Duration,Holding_Policy_Type,Reco_Policy_Cat,Reco_Policy_Premium,Response
0,1,C3,3213,Rented,Individual,36,36,No,X1,14+,3.0,22,11628.0,0
1,2,C5,1117,Owned,Joint,75,22,No,X2,,,22,30510.0,0
2,3,C5,3732,Owned,Individual,32,32,No,,1.0,1.0,19,7450.0,1
3,4,C24,4378,Owned,Joint,52,48,No,X1,14+,3.0,19,17780.0,0
4,5,C8,2190,Rented,Individual,44,44,No,X2,3.0,1.0,16,10404.0,0


In [5]:
test_df.head()

Unnamed: 0,ID,City_Code,Region_Code,Accomodation_Type,Reco_Insurance_Type,Upper_Age,Lower_Age,Is_Spouse,Health Indicator,Holding_Policy_Duration,Holding_Policy_Type,Reco_Policy_Cat,Reco_Policy_Premium
0,50883,C1,156,Owned,Individual,30,30,No,,6.0,3.0,5,11934.0
1,50884,C4,7,Owned,Joint,69,68,Yes,X1,3.0,3.0,18,32204.8
2,50885,C1,564,Rented,Individual,28,28,No,X3,2.0,4.0,17,9240.0
3,50886,C3,1177,Rented,Individual,23,23,No,X3,3.0,3.0,18,9086.0
4,50887,C1,951,Owned,Individual,75,75,No,X3,,,5,22534.0


## Step 3: Data Inspection

In [6]:
train_df.shape, test_df.shape

((50882, 14), (21805, 13))

* __We have 50882 rows and 14 columns in Train set whereas Test set has 21805 rows and 13 columns.__

In [7]:
# Sum of null values in train data
train_df.isnull().sum()

ID                             0
City_Code                      0
Region_Code                    0
Accomodation_Type              0
Reco_Insurance_Type            0
Upper_Age                      0
Lower_Age                      0
Is_Spouse                      0
Health Indicator           11691
Holding_Policy_Duration    20251
Holding_Policy_Type        20251
Reco_Policy_Cat                0
Reco_Policy_Premium            0
Response                       0
dtype: int64

In [8]:
# Sum of null values in test data
test_df.isnull().sum()

ID                            0
City_Code                     0
Region_Code                   0
Accomodation_Type             0
Reco_Insurance_Type           0
Upper_Age                     0
Lower_Age                     0
Is_Spouse                     0
Health Indicator           5027
Holding_Policy_Duration    8603
Holding_Policy_Type        8603
Reco_Policy_Cat               0
Reco_Policy_Premium           0
dtype: int64
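
Raw counts are easier to compare as shares of the rows. In the train set the counts above work out to roughly 23% missing for Health Indicator and roughly 40% for the two holding-policy columns. A minimal sketch of that calculation on a toy frame (the values below are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; values are made up for illustration
toy = pd.DataFrame({
    "Health Indicator": ["X1", None, "X3", None],
    "Holding_Policy_Type": [3.0, np.nan, 1.0, 2.0],
    "Upper_Age": [36, 75, 32, 52],
})

# isnull().mean() is the fraction of missing values per column
missing_pct = (toy.isnull().mean() * 100).round(1)

# Show only the columns that actually have missing values
print(missing_pct[missing_pct > 0].sort_values(ascending=False))
```

The same expression applied to `train_df` or `test_df` gives the per-column percentages directly.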

In [9]:
# Categorical Data
train_categorical = train_df.select_dtypes(include=['object'])
obj_col = [col for col in train_df.keys() if train_df[col].dtype=='O']
print('Categorical Data of train data -->', train_categorical.shape[1],'\n',  obj_col)
print('*****************************************************************************')

# Numerical Data (integer columns only; float columns such as Reco_Policy_Premium are not counted)
train_numerical = train_df.select_dtypes(include=[np.int64])
print('Numerical Data of train data -->', train_numerical.shape[1])

Categorical Data of train data --> 6 
 ['City_Code', 'Accomodation_Type', 'Reco_Insurance_Type', 'Is_Spouse', 'Health Indicator', 'Holding_Policy_Duration']
*****************************************************************************
Numerical Data of train data --> 6


In [10]:
# Categorical Data
test_categorical = test_df.select_dtypes(include=['object'])
obj_col_test = [col for col in test_df.keys() if test_df[col].dtype=='O']
print('Categorical Data of test data -->', test_categorical.shape[1],'\n',  obj_col_test)
print('*****************************************************************************')

# Numerical Data (integer columns only; float columns such as Reco_Policy_Premium are not counted)
test_numerical = test_df.select_dtypes(include=[np.int64])
print('Numerical Data of test data -->', test_numerical.shape[1])

Categorical Data of test data --> 6 
 ['City_Code', 'Accomodation_Type', 'Reco_Insurance_Type', 'Is_Spouse', 'Health Indicator', 'Holding_Policy_Duration']
*****************************************************************************
Numerical Data of test data --> 5


## Step 5: Data Cleaning

**Data cleaning is required in these columns (missing values in the train set):**
- __Health Indicator__: 11691
- __Holding_Policy_Duration__: 20251
- __Holding_Policy_Type__: 20251

* __5.1 Let's start with the Health Indicator column.__

In [11]:
train_df['Health Indicator'].unique()

array(['X1', 'X2', nan, 'X4', 'X3', 'X6', 'X5', 'X8', 'X7', 'X9'],
      dtype=object)

In [12]:
def imp(col):
    # Map 'X1'..'X9' to 1..9; anything else (including NaN) becomes NaN
    mapping = {f'X{i}': i for i in range(1, 10)}
    return mapping.get(col, np.nan)

In [13]:
# Train data
train_df['Health Indicator'] = train_df['Health Indicator'].apply(imp).astype(float)
# Test data
test_df['Health Indicator'] = test_df['Health Indicator'].apply(imp).astype(float)
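
The same conversion can also be written without a Python-level function by stripping the `X` prefix with pandas string methods. This is a sketch on a made-up series, not the notebook's data; it assumes every non-missing value has the form `X<digit>`:

```python
import numpy as np
import pandas as pd

# Made-up sample of Health Indicator values, including a missing one
s = pd.Series(["X1", "X2", np.nan, "X9"])

# Remove the literal "X" and cast to float; NaN passes through unchanged
codes = s.str.replace("X", "", regex=False).astype(float)
print(codes.tolist())
```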

* __5.2 Holding_Policy_Duration column.__

In [14]:
train_df['Holding_Policy_Duration'].unique()

array(['14+', nan, '1.0', '3.0', '5.0', '9.0', '14.0', '7.0', '2.0',
       '11.0', '10.0', '8.0', '6.0', '4.0', '13.0', '12.0'], dtype=object)

In [15]:
test_df['Holding_Policy_Duration'].unique()

array(['6.0', '3.0', '2.0', nan, '14+', '5.0', '1.0', '4.0', '12.0',
       '11.0', '7.0', '9.0', '13.0', '8.0', '14.0', '10.0'], dtype=object)

In [16]:
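# '14+' is capped at 14; cast to float so the column is numeric before imputation
train_df['Holding_Policy_Duration'] = train_df['Holding_Policy_Duration'].replace('14+','14').astype(float)

test_df['Holding_Policy_Duration'] = test_df['Holding_Policy_Duration'].replace('14+','14').astype(float)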
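
Note that after this replacement the open-ended category `14+` and the exact value `14.0` become the same number. A small sketch of the conversion on a made-up series, using `pd.to_numeric` so that the result is a float column with missing values preserved:

```python
import numpy as np
import pandas as pd

# Made-up sample mirroring the kinds of values seen in the column
duration = pd.Series(["14+", np.nan, "1.0", "14.0"])

# Cap the open-ended bucket, then parse the strings as numbers
numeric = pd.to_numeric(duration.replace("14+", "14"), errors="coerce")
print(numeric.tolist())
```

If keeping `14+` distinct matters for the model, one option (an assumption, not something the notebook does) would be to map it to a larger sentinel such as 15 instead.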

* __5.3 Holding_Policy_Type column.__

In [17]:
train_df['Holding_Policy_Type'].unique()

array([ 3., nan,  1.,  4.,  2.])

In [18]:
test_df['Holding_Policy_Type'].unique()

array([ 3.,  4., nan,  1.,  2.])

## __5.5 Label Encoder.__

In [19]:
# Initialising Encoder
encoder = LabelEncoder()

In [20]:
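# Fit once on all city codes so train and test share the same integer mapping
encoder.fit(pd.concat([train_df['City_Code'], test_df['City_Code']]))
train_df['City_Code'] = encoder.transform(train_df['City_Code'])
test_df['City_Code'] = encoder.transform(test_df['City_Code'])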
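
`LabelEncoder` assigns integers by the sorted order of the classes it sees, so an encoder fitted separately on each split can map the same code to different integers. A minimal sketch with made-up city codes:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up city codes; the two splits contain different sets of codes
train_city = pd.Series(["C3", "C5", "C24", "C8"])
test_city = pd.Series(["C1", "C2", "C3"])

# Separate encoders: "C3" gets a different integer in each split
code_train = LabelEncoder().fit(train_city).transform(["C3"])[0]
code_test = LabelEncoder().fit(test_city).transform(["C3"])[0]
print(code_train, code_test)

# One shared encoder: "C3" maps to the same integer everywhere
shared = LabelEncoder().fit(pd.concat([train_city, test_city]))
shared_train = shared.transform(train_city)
shared_test = shared.transform(test_city)
print(shared_train[0], shared_test[2])
```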

In [21]:
# Single binary indicator per column: 1 = Owned, 1 = Individual, 1 = 'No' spouse
train_df['Accomodation_Type'] = (train_df['Accomodation_Type'] == 'Owned').astype(int)
train_df['Reco_Insurance_Type'] = (train_df['Reco_Insurance_Type'] == 'Individual').astype(int)
train_df['Is_Spouse'] = (train_df['Is_Spouse'] == 'No').astype(int)

In [22]:
# Same binary indicators for the test set
test_df['Accomodation_Type'] = (test_df['Accomodation_Type'] == 'Owned').astype(int)
test_df['Reco_Insurance_Type'] = (test_df['Reco_Insurance_Type'] == 'Individual').astype(int)
test_df['Is_Spouse'] = (test_df['Is_Spouse'] == 'No').astype(int)
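
For a two-valued column, `pd.get_dummies` returns one indicator column per category, whereas a single 0/1 column is all the model needs. A short sketch on a made-up series that shows both forms:

```python
import pandas as pd

# Made-up sample of the Accomodation_Type column
acc = pd.Series(["Rented", "Owned", "Owned"], name="Accomodation_Type")

# get_dummies yields two columns, ordered alphabetically: Owned, Rented
dummies = pd.get_dummies(acc)
print(list(dummies.columns))

# An explicit comparison yields one clearly defined indicator (1 = Owned)
is_owned = (acc == "Owned").astype(int)
print(is_owned.tolist())
```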

## __5.6 Missing Value Treatment.__
__KNN Imputer .__

In [23]:
knnimputer = KNNImputer(n_neighbors=5)

In [24]:
key_tr = train_df.keys()
key_ts = test_df.keys()

In [25]:
train_df = knnimputer.fit_transform(train_df)
test_df = knnimputer.fit_transform(test_df)

In [26]:
train_df=pd.DataFrame(train_df,columns=key_tr)
test_df=pd.DataFrame(test_df,columns=key_ts)

In [27]:
train_df['Health Indicator'].unique()

array([1. , 2. , 3.2, 2.8, 4. , 3. , 2.6, 6. , 5. , 1.4, 1.8, 3.4, 2.2,
       4.8, 1.6, 2.4, 3.6, 1.2, 3.8, 8. , 7. , 4.2, 4.6, 4.4, 9. , 5.2,
       5.8])
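
With its default settings, `KNNImputer` fills each missing entry with the mean of that feature over the nearest neighbours, which is why fractional values such as 3.2 appear in a column that originally held whole numbers. A common pattern is to fit the imputer on training features only and reuse it on new data. The sketch below illustrates both points on a tiny made-up array; it is not the notebook's pipeline:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Tiny made-up feature matrices; the second column has missing values
train = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0],
                  [4.0, 40.0]])
test = np.array([[3.4, np.nan]])

imputer = KNNImputer(n_neighbors=2)

# Learn neighbours from the training rows and fill the training gaps
train_imp = imputer.fit_transform(train)

# Reuse the fitted imputer: test gaps are filled from training neighbours
test_imp = imputer.transform(test)

print(train_imp[1, 1], test_imp[0, 1])
```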

In [28]:
train_df['Health Indicator'] = np.ceil(train_df['Health Indicator'])

In [29]:
train_df.isnull().sum().sum()

0

In [30]:
test_df.isnull().sum().sum()

0

__The missing values are now handled.__

## Split Dependent and Independent Variables

In [31]:
X=train_df.drop(['Response', 'ID', 'City_Code', 'Region_Code'],axis=1)
y=train_df['Response']

In [32]:
X.shape

(50882, 10)

In [33]:
len(y)

50882

In [34]:
X.head(1)

Unnamed: 0,Accomodation_Type,Reco_Insurance_Type,Upper_Age,Lower_Age,Is_Spouse,Health Indicator,Holding_Policy_Duration,Holding_Policy_Type,Reco_Policy_Cat,Reco_Policy_Premium
0,0.0,1.0,36.0,36.0,1.0,1.0,14.0,3.0,22.0,11628.0


## Standard Scaling

In [35]:
scaler = StandardScaler()

In [36]:
X = scaler.fit_transform(X)
test_df = scaler.fit_transform(test_df)

In [37]:
X.shape

(50882, 10)

In [38]:
y.shape

(50882,)

In [39]:
X_key = key_tr.drop(['ID', 'City_Code', 'Region_Code', 'Response'])

In [40]:
X=pd.DataFrame(X,columns=X_key)
test_df=pd.DataFrame(test_df,columns=key_ts)
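
Standardisation subtracts a mean and divides by a standard deviation, and those two statistics depend on the data the scaler was fitted on. The sketch below, on made-up numbers, contrasts reusing the training statistics with refitting on the new data; the usual recommendation is the former, so that training and scoring data share one scale:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
train = np.array([[0.0], [10.0]])   # mean 5, standard deviation 5
test = np.array([[5.0], [20.0]])

# Reuse the statistics learned from the training data
scaler = StandardScaler().fit(train)
test_scaled = scaler.transform(test)

# Refitting on the test data produces a different scale for the same values
test_refit = StandardScaler().fit_transform(test)

print(test_scaled.ravel(), test_refit.ravel())
```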

## Train Test Split

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state =101)
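
If the Response classes are imbalanced, which the SMOTE step below suggests, passing `stratify=y` keeps the class ratio the same in both parts of the split. This is an optional variation, not what the notebook does; the sketch uses made-up labels with an 80/20 ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 80 negatives and 20 positives
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo preserves the 80/20 ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```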

## SMOTE

In [42]:
%pip install imbalanced-learn




In [43]:
from imblearn.over_sampling import SMOTE
smote = SMOTE()

In [44]:
X_train_smote, y_train_smote = smote.fit_resample(X_train,y_train)

#### Hyperparameter Optimization

In [45]:
params={
 "learning_rate"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ] ,
 "max_depth"        : [ 3, 4, 5, 6, 8, 10, 12, 15],
 "min_child_weight" : [ 1, 3, 5, 7 ],
 "gamma"            : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
 "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ]
}

In [46]:
xgboost_model = XGBClassifier(max_depth=10)

In [47]:
random_search=RandomizedSearchCV(xgboost_model,param_distributions=params,n_iter=5,scoring='roc_auc',n_jobs=-1,cv=5)


In [48]:
random_search.fit(X,y)



RandomizedSearchCV(cv=5,
                   estimator=XGBClassifier(base_score=None, booster=None,
                                           colsample_bylevel=None,
                                           colsample_bynode=None,
                                           colsample_bytree=None, gamma=None,
                                           gpu_id=None, importance_type='gain',
                                           interaction_constraints=None,
                                           learning_rate=None,
                                           max_delta_step=None, max_depth=10,
                                           min_child_weight=None, missing=nan,
                                           monotone_constraints=None,
                                           n_estimators=100, n_...
                                           random_state=None, reg_alpha=None,
                                           reg_lambda=None,
                                          

In [49]:
random_search.best_params_

{'min_child_weight': 1,
 'max_depth': 8,
 'learning_rate': 0.05,
 'gamma': 0.4,
 'colsample_bytree': 0.4}
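
The dictionary returned by `best_params_` can be unpacked straight into a new estimator so that the tuned values are actually used. The sketch below shows the pattern with scikit-learn's `DecisionTreeClassifier` as a stand-in for `XGBClassifier` and synthetic data, so the parameter names and grid here are illustrative assumptions; with the notebook's objects the analogous call would be `XGBClassifier(**random_search.best_params_)`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the notebook's X and y
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Illustrative search space for the stand-in model
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={"max_depth": [2, 3, 4, 5], "min_samples_leaf": [1, 5, 10]},
    n_iter=5,
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

# Unpack the winning combination into a fresh estimator and fit it
tuned = DecisionTreeClassifier(random_state=0, **search.best_params_)
tuned.fit(X_demo, y_demo)

print(search.best_params_)
```

With the default `refit=True`, `search.best_estimator_` is also available as an already-fitted model using the same parameters.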

In [50]:
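# Note: this model uses max_depth=10 with otherwise default settings; the values in random_search.best_params_ are not applied here
xgbclassifier = XGBClassifier(base_score=None, booster=None,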
                                           colsample_bylevel=None,
                                           colsample_bynode=None,
                                           colsample_bytree=None, gamma=None,
                                           gpu_id=None, importance_type='gain',
                                           interaction_constraints=None,
                                           learning_rate=None,
                                           max_delta_step=None, max_depth=10,
                                           min_child_weight=None,
                                           monotone_constraints=None,
                                           n_estimators=100,
                                           random_state=None, reg_alpha=None,
                                           reg_lambda=None,
                                           scale_pos_weight=None,
                                           subsample=None, tree_method=None,
                                           validate_parameters=None,
                                           verbosity=None)

In [51]:
xgbclassifier.fit(X, y)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=10,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [54]:
xgbclassifier.fit(X_train_smote, y_train_smote)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=10,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [52]:
pred = xgbclassifier.predict(X_test)

In [55]:
pred_smote = xgbclassifier.predict(X_test)

In [56]:
pred_smote

array([0., 0., 0., ..., 0., 0., 0.])

In [53]:
pred

array([0., 0., 0., ..., 0., 0., 0.])

In [58]:
from sklearn.metrics import roc_auc_score

In [59]:
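# Note: roc_auc_score expects (y_true, y_score); the arguments are swapped here and hard labels are used,
# so the value below is not the conventional ROC AUC
roc_score = roc_auc_score(pred_smote, y_test)
roc_score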

0.5541105048002314
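
`roc_auc_score` takes the true labels first and the scores second, and it is most informative when the scores are predicted probabilities instead of hard 0/1 labels. The sketch below shows that usage on synthetic data with a `LogisticRegression` stand-in model (both are assumptions for illustration); with the notebook's objects the analogous call would be `roc_auc_score(y_test, xgbclassifier.predict_proba(X_test)[:, 1])`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the notebook's features and labels
X_demo, y_demo = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive class, used as the ranking score
proba = clf.predict_proba(X_te)[:, 1]

# True labels first, scores second
auc_from_proba = roc_auc_score(y_te, proba)

# Hard labels discard the ranking information and give a different value in general
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

print(auc_from_proba, auc_from_labels)
```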

In [60]:
score = accuracy_score(y_test, pred_smote)
score

0.7348837209302326

In [61]:
final_predictionht = xgbclassifier.predict(test_df.drop(['ID', 'City_Code', 'Region_Code'],axis=1))

In [62]:
final_predictionht.shape

(21805,)

In [63]:
a = pd.DataFrame(final_predictionht)

In [64]:
a[0].unique()

array([1., 0.])

In [65]:
a[0].value_counts()

1.0    21017
0.0      788
Name: 0, dtype: int64

In [66]:
submission = pd.read_csv('sample_submission_QrCyCoT.csv')
# Reuse the test-set predictions computed above
submission['Response'] = final_predictionht

submission.to_csv('XGBOOST3smote_ht.csv', index=False)