Kindly upvote if you like this notebook.<br>
If you spot any issues or mistakes, kindly let me know in the comments; I am happy to correct them.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix,classification_report
from imblearn.over_sampling import SMOTE
from sklearn.metrics import roc_auc_score,accuracy_score

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

In [None]:
insurance_df = pd.read_csv('/kaggle/input/health-insurance-cross-sell-prediction/train.csv')

In [None]:
insurance_df.columns

**Data Analysis**

In [None]:
insurance_df.dtypes

In [None]:
insurance_df.isnull().sum()

In [None]:
insurance_df.shape

In [None]:
categorical_columns=[]
continuous_columns=[]
for col in insurance_df.columns:
    if insurance_df[col].dtype!='object':
        continuous_columns.append(col)
    else:
        categorical_columns.append(col)
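
The same split can also be sketched with `select_dtypes`. This is a minimal example on a made-up toy frame (the values are hypothetical, not rows from train.csv), and it splits on numeric versus non-numeric dtypes, which is an assumption that matches the loop above only when every non-object column is numeric.

```python
import pandas as pd

# Hypothetical toy frame with the same kind of mixed dtypes as the dataset
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [44, 76, 47],
    "Annual_Premium": [40454.0, 33536.0, 38294.0],
    "Vehicle_Damage": ["Yes", "No", "Yes"],
})

# Numeric columns play the role of continuous_columns
continuous_columns = toy.select_dtypes(include="number").columns.tolist()
# Everything else plays the role of categorical_columns
categorical_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(continuous_columns)
print(categorical_columns)
```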

In [None]:
continuous_columns

In [None]:
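plt.figure(figsize=(16,8))
for i, col in enumerate(['id','Age','Region_Code','Annual_Premium','Policy_Sales_Channel','Vintage']):
    # Six plots fit a 2x3 grid
    plt.subplot(2,3,i+1)
    # Pass the column by keyword; recent seaborn versions expect x= or data=
    sns.boxplot(x=insurance_df[col])
# Adjust the layout once, after all subplots are drawn
plt.tight_layout()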

In [None]:
insurance_df.loc[insurance_df.Annual_Premium> 400000,'Annual_Premium']=400000

I don't see many outliers except in Annual_Premium, so premium values greater than 400000 are capped at 400000.
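
As a small sketch of the capping step, `Series.clip` gives the same result as the `.loc` assignment. The premiums below are made-up toy values; only the 400000 threshold comes from this notebook.

```python
import pandas as pd

# Hypothetical toy premiums; the real values come from train.csv
toy = pd.DataFrame({"Annual_Premium": [2630.0, 31669.0, 400000.0, 540165.0]})

CAP = 400000  # same threshold the notebook uses

# clip(upper=...) replaces every value above the cap with the cap
toy["Annual_Premium_capped"] = toy["Annual_Premium"].clip(upper=CAP)

# Boolean-mask assignment, equivalent to the .loc approach in the notebook
toy_loc = toy["Annual_Premium"].copy()
toy_loc[toy_loc > CAP] = CAP

print(toy)
```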

In [None]:
insurance_df['Gender'].value_counts()

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
sns.countplot(data=insurance_df,x='Gender',hue='Vehicle_Damage',ax=ax[0])
sns.countplot(data=insurance_df,x='Gender',hue='Previously_Insured',ax=ax[1])
fig.show()

We can clearly see that males have more vehicle damage than females, yet many males still don't have insurance.

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
sns.countplot(data=insurance_df,x='Gender',hue='Vehicle_Age',ax=ax[0])
sns.countplot(data=insurance_df,x='Previously_Insured',hue='Vehicle_Damage',ax=ax[1])
fig.show()

Most of the vehicles are new (less than two years old). People haven't got insurance for new vehicles.<br>
It is surprising that so many vehicles less than 2 years old have already been damaged.

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
# fig, ax = plt.subplots() 
sns.countplot(data=insurance_df,x='Gender',hue='Previously_Insured',ax=ax[0])
sns.countplot(data=insurance_df,x='Gender',hue='Vehicle_Damage',ax=ax[1])
fig.show()

In [None]:
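# FacetGrid creates its own figure, so no separate plt.figure call is needed
# fill and bw_method replace the deprecated shade and bw arguments (seaborn 0.11+)
sns.FacetGrid(insurance_df, hue = 'Response',
             height = 6,xlim = (0,150)).map(sns.kdeplot, 'Age', fill = True,bw_method=2).add_legend()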

Age is almost normally distributed for people who are interested in buying insurance. People aged around 30 are more interested in buying insurance.<br>
I think young people don't like to get insurance.

In [None]:
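# FacetGrid creates its own figure, so no separate plt.figure call is needed
# fill and bw_method replace the deprecated shade and bw arguments (seaborn 0.11+)
sns.FacetGrid(insurance_df, hue = 'Gender',
             height = 6,xlim = (0,150)).map(sns.kdeplot, 'Age', fill = True,bw_method=2).add_legend()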

I don't see a significant difference in the age distribution between genders.

In [None]:
plt.figure(figsize=(15,5))
sns.boxplot(y='Age', x ='Gender', hue="Previously_Insured", data=insurance_df)

Females tend to get insurance at a younger age.

In [None]:
plt.figure(figsize=(15,5))
sns.violinplot(y='Age', x ='Gender', hue="Response", data=insurance_df)

As I said previously, young people don't like to get insurance. When we drill down further, we see that
* For people who would like to get insurance, age is normally distributed. The mean age for both genders is nearly 45 years.
* Both young males and young females don't like to buy insurance; the distribution is right-skewed.
* But the mean ages of males and females who are not interested in buying insurance differ greatly.

**With this we move to modelling**

In [None]:
le = LabelEncoder()
# Each fit_transform call refits the encoder on that column
insurance_df['Gender'] = le.fit_transform(insurance_df['Gender'])
insurance_df['Driving_License'] = le.fit_transform(insurance_df['Driving_License'])
insurance_df['Previously_Insured'] = le.fit_transform(insurance_df['Previously_Insured'])
insurance_df['Vehicle_Damage'] = le.fit_transform(insurance_df['Vehicle_Damage'])
insurance_df['Vehicle_Age'] = le.fit_transform(insurance_df['Vehicle_Age'])
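
One thing worth checking with `LabelEncoder` is the order of the codes. It sorts the classes lexicographically, so for an ordered column such as Vehicle_Age the codes may not follow the age order. The sketch below assumes the category strings shown (they should be checked against the actual column), and `age_order` is a hypothetical explicit mapping, not something this notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Category strings assumed to match the dataset's Vehicle_Age values
vehicle_age = pd.Series(["< 1 Year", "1-2 Year", "> 2 Years", "1-2 Year"])

# LabelEncoder assigns codes in sorted string order, not in age order
le_codes = LabelEncoder().fit_transform(vehicle_age).tolist()

# Hypothetical explicit mapping that preserves the natural order
age_order = {"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2}
ordinal_codes = vehicle_age.map(age_order).tolist()

print(le_codes)       # [1, 0, 2, 0]
print(ordinal_codes)  # [0, 1, 2, 1]
```

For tree-based models the difference is often small, but for a linear model such as logistic regression the ordering of the codes can matter.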

In [None]:
insurance_df=insurance_df[['Gender', 'Age', 'Driving_License', 'Region_Code',
       'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage', 'Annual_Premium',
       'Policy_Sales_Channel', 'Vintage', 'Response']]

In [None]:
plt.figure(figsize=(12,12))
sns.heatmap(insurance_df.corr())

I see

In [None]:
def evaluation_stats(model,X_train, X_test, y_train, y_test,algo,is_feature=False):
    print('Train Accuracy')
    y_pred_train = model.predict(X_train)
    print(accuracy_score(y_train, y_pred_train))
    print('Validation Accuracy')
    y_pred_test = model.predict(X_test)
    print(accuracy_score(y_test, y_pred_test))
    print("\n")
    # Note: AUC is computed here from hard class predictions, not predicted probabilities
    print("Train AUC Score")
    print(roc_auc_score(y_train, y_pred_train))
    print("Validation AUC Score")
    print(roc_auc_score(y_test, y_pred_test))

    if is_feature:
        # Use the model and columns passed in rather than the globals rf_model and X
        plot_feature_importance(model.feature_importances_,X_train.columns,algo)

def training(model,X_train, y_train):
    return model.fit(X_train, y_train)

def plot_feature_importance(importance,names,model_type):
    feature_importance = np.array(importance)
    feature_names = np.array(names)

    #Create a DataFrame using a Dictionary
    data={'feature_names':feature_names,'feature_importance':feature_importance}
    fi_df = pd.DataFrame(data)

    #Sort the DataFrame in order decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True)
    #Define size of bar plot
    plt.figure(figsize=(10,8))
    #Plot Seaborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    #Add chart labels
    plt.title(model_type + ' FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')
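
A note on the AUC computed in `evaluation_stats`: it is based on the output of `predict`, which is a hard 0/1 label. With hard labels, ROC AUC reduces to balanced accuracy, while AUC from `predict_proba` uses the full ranking of the scores. The sketch below illustrates the difference on synthetic data from `make_classification` (an assumption standing in for the insurance features), so the numbers it prints say nothing about the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the insurance features
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=101
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=101
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard_labels = model.predict(X_te)                 # 0/1 predictions
positive_proba = model.predict_proba(X_te)[:, 1]  # probability of class 1

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_proba = roc_auc_score(y_te, positive_proba)

print("AUC from hard labels:  ", auc_from_labels)
print("Balanced accuracy:     ", balanced_accuracy_score(y_te, hard_labels))
print("AUC from probabilities:", auc_from_proba)
```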

In [None]:
insurance_df.columns

In [None]:
insurance_df['Response'].value_counts()

The data is highly imbalanced, but we will still try to train a few models without oversampling.
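
To see why imbalance matters for the metrics, here is a minimal sketch with made-up labels (10% positives, chosen only for illustration and not the real Response distribution). A baseline that always predicts the majority class gets high accuracy but an AUC of only 0.5.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels with 10% positives; not the real Response distribution
y_toy = np.array([0] * 900 + [1] * 100)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for this baseline

# Always predicts the most frequent class (0 here)
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)

baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_toy))
baseline_auc = roc_auc_score(y_toy, baseline.predict_proba(X_toy)[:, 1])

print(baseline_accuracy)  # 0.9
print(baseline_auc)       # 0.5
```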

In [None]:
X = insurance_df.drop(["Response"], axis=1)
y = insurance_df["Response"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 101)
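
With imbalanced data, one option (not used in this notebook) is to pass `stratify=y` to `train_test_split` so that the train and validation sets keep the same class ratio. A minimal sketch on made-up labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels with 10% positives, for illustration only
y_toy = np.array([0] * 900 + [1] * 100)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the positive rate the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, stratify=y_toy, random_state=101
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)
```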

In [None]:
rf_model = training(RandomForestClassifier(),X_train,y_train)
evaluation_stats(rf_model,X_train, X_test, y_train, y_test,'RANDOM FOREST')

The RF without oversampling clearly overfits: train accuracy and AUC are very high, and the model is not able to generalize.

In [None]:
xbg_model = training(XGBClassifier(),X_train,y_train)
evaluation_stats(xbg_model,X_train, X_test, y_train, y_test,'XGB')

XGB is not completely overfitting, but its AUC is low.

**Now we will try with oversampling**

In [None]:
sm = SMOTE(random_state=101)
X_res, y_res = sm.fit_resample(X_train, y_train)
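
The core idea behind SMOTE is to create synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The code below is a simplified NumPy sketch of that interpolation step only, not imblearn's implementation; `smote_like_sample` and the toy rows are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(101)

# Hypothetical minority-class rows: columns are [Age, Vehicle_Damage (0/1)]
minority = np.array([[25.0, 0.0], [40.0, 1.0], [43.0, 1.0], [60.0, 0.0]])

def smote_like_sample(points, k, rng):
    """Create one synthetic row between a random point and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    distances = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(distances)[1:k + 1]  # position 0 is the point itself
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    return points[i] + lam * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Because the rows are interpolated, a label-encoded column such as Vehicle_Damage can end up with fractional values in the synthetic rows. imblearn also provides `SMOTENC` for data with categorical features, which may be worth trying here.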

In [None]:
rf_model = training(RandomForestClassifier(),X_res, y_res)
evaluation_stats(rf_model,X_res, X_test, y_res, y_test,'RANDOM FOREST')

The model is overfitting to the train dataset, but it performs well on the validation dataset. <br> This is a little tricky.

In [None]:
# Train on the oversampled data, matching the evaluation call below
xbg_model = training(XGBClassifier(),X_res, y_res)
evaluation_stats(xbg_model,X_res, X_test, y_res, y_test,'XGB')

Let's try adding hyperparameters to the models.
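
In this notebook the hyperparameters are set by hand. As an alternative, a small search with `GridSearchCV` (imported above) could pick them by cross-validation. The sketch below runs on synthetic data from `make_classification`; the grid values are hypothetical and kept small only so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the insurance features
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=101
)

# Hypothetical small grid; values chosen only to keep the example fast
param_grid = {"max_depth": [3, 6], "criterion": ["gini", "entropy"]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=101),
    param_grid,
    scoring="roc_auc",  # AUC computed from predicted probabilities
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```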

In [None]:
rf_model = training(RandomForestClassifier(criterion='entropy',n_estimators=200,max_depth=3),X_res, y_res)
evaluation_stats(rf_model,X_res, X_test, y_res, y_test,'RANDOM FOREST')

This model is not overfitting on the train dataset, and its AUC is quite decent.

In [None]:
xbg_model = training(XGBClassifier(n_estimators=1000,max_depth=10),X_res, y_res)
evaluation_stats(xbg_model,X_res, X_test, y_res, y_test,'XGB',is_feature=False)

This model is overfitting.

In [None]:
from sklearn.pipeline import make_pipeline
# StandardScaler and LogisticRegression are already imported at the top of the notebook
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_res, y_res)
evaluation_stats(clf,X_train, X_test, y_train, y_test,'LR',is_feature=False)

I think this is also a good model. It is not completely overfitting; accuracy is 0.68 but AUC is 0.74.

**RF with parameters criterion = entropy, n_estimators = 200 and max_depth = 3 gave the best results, with an AUC of 79%**