<h1 align='center'>Welcome to my Notebook</h1>

In [None]:
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import StratifiedKFold,KFold
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import classification_report,roc_auc_score,confusion_matrix,accuracy_score,f1_score

import optuna
from optuna.samplers import TPESampler

In [None]:
train = pd.read_csv('/kaggle/input/health-insurance-cross-sell-prediction/train.csv')
test = pd.read_csv('/kaggle/input/health-insurance-cross-sell-prediction/test.csv')
sample = pd.read_csv('/kaggle/input/health-insurance-cross-sell-prediction/sample_submission.csv')

sns.set(style='white', context='notebook', palette='deep')

<h1 align='center'>Exploratory Data Analysis</h1>

In [None]:
train.head()

In [None]:
train.info()

In [None]:
train.describe()

In [None]:
train.isnull().sum()

In [None]:
train.skew(numeric_only=True)

In [None]:
train.dtypes

In [None]:
all_features = pd.concat([train.drop(['id','Response'],axis=1),test.drop('id',axis=1)],axis=0)
y = train['Response']

Let's combine the train and test sets so that our transformations are easier, as we don't have to apply them separately to each set
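
The combine-then-split pattern can be sketched on toy data as follows. The frames `toy_train` and `toy_test` are made up for illustration; only the `pd.concat` and `iloc` calls mirror what the notebook does.

```python
import pandas as pd

# Hypothetical miniature train/test frames (not the competition data)
toy_train = pd.DataFrame({'id': [1, 2, 3], 'Gender': ['Male', 'Female', 'Male'], 'Response': [0, 1, 0]})
toy_test = pd.DataFrame({'id': [4, 5], 'Gender': ['Female', 'Male']})

# Stack the feature columns of both sets so one transformation covers both
combined = pd.concat([toy_train.drop(['id', 'Response'], axis=1), toy_test.drop('id', axis=1)], axis=0)
combined['Gender'] = combined['Gender'].map({'Male': 1, 'Female': 0})

# Split back by position: the first len(toy_train) rows are the training rows
X_toy = combined.iloc[:len(toy_train), :]
X_toy_test = combined.iloc[len(toy_train):, :]
print(X_toy.shape, X_toy_test.shape)
```

Splitting by position only works as long as no rows are dropped or reordered between the concat and the split.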

<h1 align='center'>Categorical Features Data Analysis</h1>

In [None]:
response_counts = train['Response'].value_counts().sort_index()
fig = px.pie(values=response_counts.values,names=['Class 0','Class 1'],hole=0.6,color_discrete_sequence=px.colors.sequential.Sunset)
fig.update_layout(showlegend=True)
fig.show()

In [None]:
sns.countplot(x=train['Response'])
plt.show()

We can see from the above visualisations that:

* The data is highly imbalanced, with 87.7% of rows belonging to Class 0 and 12.3% belonging to Class 1. We will use both undersampling and oversampling to balance the classes
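
The class shares can also be read off numerically rather than from the plot. A minimal sketch, assuming a made-up target column `toy_response` in place of `train['Response']`:

```python
import pandas as pd

# Hypothetical target column: 7 negatives and 1 positive
toy_response = pd.Series([0, 0, 0, 0, 0, 0, 0, 1], name='Response')

# normalize=True returns fractions instead of raw counts
class_share = toy_response.value_counts(normalize=True).sort_index()
print((class_share * 100).round(1))
```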

In [None]:
sns.barplot(x=train['Driving_License'],y=train['Response'])
plt.show()

Here, we can see that people who have a driving license are more likely to be interested in getting insurance

In [None]:
sns.barplot(x=train['Previously_Insured'],y=train['Response'])
plt.show()

Now, we see that people who are not previously insured are more likely to be interested in getting insurance, as one does not want to pay for multiple insurance policies

In [None]:
sns.barplot(x=train['Vehicle_Damage'],y=train['Response'])
plt.show()

People with vehicle damage are also more likely to be interested in getting insurance, which is self-explanatory

In [None]:
sns.barplot(x=train['Vehicle_Age'],y=train['Response'])
plt.show()

Here we see that people with older vehicles are more likely to be interested. This is plausible, as the longer a vehicle is on the road, the more likely it is to develop problems, as opposed to new cars

In [None]:
sns.barplot(x=train['Gender'],y=train['Response'])
plt.show()

Males are more likely to be interested in getting insured. This could be due to many possible reasons, but one could simply be that there are more males in the dataset.

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(x=train['Age'],y=train['Response'])
plt.show()

In this visualisation, we see which ages are more likely to be interested in insurance. We see the following things:
1. Ages between 20-27 are not very interested in getting insured. This may be because insurance is not their main priority, and they probably cannot afford it as they are most likely students
2. Ages 30-55 are more interested in getting insured, as they use the car on a day-to-day basis to travel for work, so they are more in need of repairs
3. Ages 65+ are less likely to be interested in insurance, as this is the retirement age and they may no longer need a car, let alone insurance

In [None]:
sns.boxplot(x=train['Age'])
plt.show()

The reason for this boxplot was to see if there were any outliers (in this case, any extreme cases or accidental ages, e.g. a 5-year-old interested in insurance!)

In [None]:
bins = [20, 30, 40, 50, 60, 70, 90]
# pd.cut uses right-closed intervals, so the labels are written to match the bin edges
labels = ['20-30', '31-40', '41-50', '51-60', '61-70', '71-90']
age_categories = pd.cut(train['Age'], bins, labels=labels, include_lowest=True)

In [None]:
sns.barplot(x=age_categories,y=train['Response'])
plt.show()

After splitting the ages into 6 groups, we see that this visualisation confirms our observations from before; the age groups most interested in getting insurance lie roughly between 30 and 50.
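
To check which ages actually land in which group, it helps to let `pd.cut` report its own intervals. The sketch below uses a handful of hand-picked ages (an assumption, not the real column) with the same bin edges as above:

```python
import pandas as pd

ages = pd.Series([20, 27, 30, 31, 40, 41, 85])
bins = [20, 30, 40, 50, 60, 70, 90]

# Without labels, pd.cut reports the actual interval each age falls into
intervals = pd.cut(ages, bins, include_lowest=True)
# labels=False returns the integer position of the bin instead
codes = pd.cut(ages, bins, labels=False, include_lowest=True)
print(pd.DataFrame({'age': ages, 'bin': intervals, 'code': codes}))
```

Because the intervals are closed on the right, an age of 30 belongs to the first group and 31 to the second.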

<h1 align='center'>Numerical Features Data Analysis</h1>

In [None]:
sns.boxplot(x=train['Annual_Premium'])
plt.show()

We see that Annual Premium has a wide variety of values

In [None]:
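# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
g = sns.histplot(train['Annual_Premium'],kde=True,label='Skewness: '+str(round(train['Annual_Premium'].skew(),4)))
g.legend(loc='best')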
plt.show()

Annual Premium is also skewed. We might try a log or square root transformation to mitigate the skewness
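
The effect of the two candidate transformations can be sketched on synthetic data. The lognormal sample below is only a stand-in for `Annual_Premium` (an assumption); the real column may behave differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed values standing in for Annual_Premium (an assumption)
premium = pd.Series(rng.lognormal(mean=10, sigma=0.8, size=5000))

raw_skew = premium.skew()
sqrt_skew = np.sqrt(premium).skew()
log_skew = np.log1p(premium).skew()
print(round(raw_skew, 4), round(sqrt_skew, 4), round(log_skew, 4))
```

On data like this the log transform removes more skew than the square root, but it is worth comparing both on the actual column before choosing.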

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(train.corr(numeric_only=True),annot=True,cmap='rainbow')
plt.show()

Here we plot a correlation heatmap and see that there is no strong correlation between any of the features and the target

<h1 align='center'>Feature Preprocessing</h1>

In [None]:
all_features['Vehicle_Age'] = all_features['Vehicle_Age'].map({'> 2 Years':2,'1-2 Year':1,'< 1 Year':0})
all_features['Vehicle_Damage'] = all_features['Vehicle_Damage'].map({'Yes':1,'No':0})
all_features['Gender'] = all_features['Gender'].map({'Male':1,'Female':0}) 
all_features

Here we just map categorical values to their appropriate numerical counterparts
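
One thing to watch with `Series.map` and a dictionary is that any value missing from the dictionary silently becomes `NaN`. A minimal check, using a made-up column with one deliberately misspelled category:

```python
import pandas as pd

# The last value is a deliberate misspelling that the mapping does not cover
vehicle_age = pd.Series(['> 2 Years', '1-2 Year', '< 1 Year', '1-2 Years'])
mapping = {'> 2 Years': 2, '1-2 Year': 1, '< 1 Year': 0}

encoded = vehicle_age.map(mapping)
# Collect the original values that ended up as NaN after mapping
unmapped = vehicle_age[encoded.isna()].unique().tolist()
print(unmapped)
```

Running the same check on `all_features` after the mapping step would confirm that every category was covered.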

<h1 align='center'>Modelling</h1>

In [None]:
X = all_features.iloc[:len(train),:]
X_test = all_features.iloc[len(train):,:]

kf = StratifiedKFold(n_splits=12,shuffle=True,random_state=42)

We re-split our train and test sets and set up 12-fold stratified cross-validation. Stratifying is important because it ensures that every training and validation split preserves the target distribution
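
What stratification buys us can be seen on a toy target. The arrays below are hypothetical (1200 rows, 12% positives); only the `StratifiedKFold` settings match the notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced target: 144 positives out of 1200 rows (12%)
y_toy = np.array([1] * 144 + [0] * 1056)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect how the folds are drawn

skf = StratifiedKFold(n_splits=12, shuffle=True, random_state=42)
# Positive rate inside each validation fold
fold_rates = [y_toy[val_idx].mean() for _, val_idx in skf.split(X_toy, y_toy)]
print([round(r, 3) for r in fold_rates])
```

Every validation fold keeps a positive rate close to the overall 12%, which a plain `KFold` does not guarantee on imbalanced data.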

In [None]:
# Note: each iteration overwrites the previous split, so only the last of the 12 folds is kept
for train_index,val_index in kf.split(X,y):
    X_train,X_val = X.iloc[train_index],X.iloc[val_index]
    y_train,y_val = y.iloc[train_index],y.iloc[val_index]

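Here we define our training and validation sets. Since the loop overwrites them on every iteration, the split from the last fold is the one used below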

<h1 align='center'>Modelling Using Undersampling</h1>

We will use `imblearn`'s RandomUnderSampler to undersample the majority class so that the class counts match

In [None]:
rus = RandomUnderSampler(random_state=42)
X_rus,y_rus = rus.fit_resample(X_train,y_train)
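
Conceptually, random undersampling keeps all minority rows and draws an equally sized random subset of the majority rows. The following is a hand-rolled pandas sketch of that idea on made-up data, not `imblearn`'s implementation:

```python
import pandas as pd

# Hypothetical training target with 6 majority rows and 2 minority rows
y_small = pd.Series([0, 0, 0, 0, 0, 0, 1, 1])
X_small = pd.DataFrame({'feature': range(len(y_small))})

minority_size = int(y_small.value_counts().min())
# Sample the same number of rows from every class, without replacement
kept_index = (
    y_small.groupby(y_small)
    .sample(n=minority_size, random_state=42)
    .index
)
X_bal, y_bal = X_small.loc[kept_index], y_small.loc[kept_index]
print(y_bal.value_counts().to_dict())
```

The price of the balance is that most majority-class rows are thrown away, which is worth keeping in mind when comparing against oversampling.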

<h1 align='center'>Basic LightGBM</h1>

Let's fit a vanilla LGBMClassifier on the undersampled data and evaluate it

In [None]:
lgb_rus = LGBMClassifier(random_state=42)
lgb_rus.fit(X_rus,y_rus)
print(classification_report(y_val,lgb_rus.predict(X_val)))
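# ROC AUC is a ranking metric, so score it on predicted probabilities rather than hard labels
print('ROC AUC Score: ' + str(roc_auc_score(y_val,lgb_rus.predict_proba(X_val)[:,1])))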

In [None]:
sns.heatmap(confusion_matrix(y_val,lgb_rus.predict(X_val)),cmap='magma',annot=True,fmt='g')
plt.show()

Our model did surprisingly well with default parameters, but not fantastically. Let's optimize
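
A note on the metric: ROC AUC measures how well the model ranks positives above negatives, so it is normally computed from predicted probabilities. Computed from hard 0/1 labels it collapses to balanced accuracy at a single threshold. A small sketch with made-up scores illustrates the difference:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.7, 0.6, 0.8, 0.9])  # made-up predicted probabilities
hard_labels = (scores >= 0.5).astype(int)          # hard labels at a 0.5 threshold

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard_labels)
bal_acc = balanced_accuracy_score(y_true, hard_labels)
print(auc_from_scores, auc_from_labels, bal_acc)
```

With an `LGBMClassifier`, the probabilities for the positive class come from `predict_proba(X)[:, 1]`.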

<h1 align='center'>Hyperparameter Tuning</h1>

In [None]:
def create_model(trial):
    n_estimators = trial.suggest_int('n_estimators',100,500)
    num_leaves = trial.suggest_int('num_leaves',10,500)
    max_depth = trial.suggest_int('max_depth',4,20)
    learning_rate = trial.suggest_float('learning_rate',0.0001,1)
    min_child_samples = trial.suggest_int('min_child_samples',10,50)
    # Fix the seed so that trials differ only in their hyperparameters
    model = LGBMClassifier(n_estimators=n_estimators,num_leaves=num_leaves,
                           max_depth=max_depth,learning_rate=learning_rate,
                           min_child_samples=min_child_samples,random_state=42)
    return model

def objective(trial):
    model = create_model(trial)
    model.fit(X_rus,y_rus)
    score = roc_auc_score(y_val,model.predict_proba(X_val)[:,1])
    return score

sampler = TPESampler(seed=42)
study = optuna.create_study(sampler=sampler,direction='maximize')
study.optimize(objective,n_trials=60)
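
One design choice worth a second look is the learning rate range. Sampling uniformly over [0.0001, 1] almost never proposes small values, whereas sampling on a log scale spreads trials evenly across orders of magnitude (Optuna's `suggest_float` accepts `log=True` for this). The NumPy sketch below only illustrates the two distributions; it does not run Optuna, and the 0.01 cut-off is an arbitrary choice for the comparison.

```python
import numpy as np

rng = np.random.default_rng(42)
low, high, n = 1e-4, 1.0, 10_000

uniform_lr = rng.uniform(low, high, size=n)
# Log-uniform: sample the exponent uniformly, then exponentiate
log_uniform_lr = np.exp(rng.uniform(np.log(low), np.log(high), size=n))

# Share of samples below 0.01 under each scheme
share_small_uniform = (uniform_lr < 0.01).mean()
share_small_log = (log_uniform_lr < 0.01).mean()
print(share_small_uniform, share_small_log)
```

Under the uniform scheme only about 1% of draws fall below 0.01, while about half do under the log-uniform scheme.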

In [None]:
lgb_params = study.best_params
lgb_params['random_state'] = 42
lgb = LGBMClassifier(**lgb_params)
lgb.fit(X_rus, y_rus)
preds = lgb.predict(X_val)
print(classification_report(y_val,preds))
# Score ROC AUC on the predicted probability of the positive class
print('ROC AUC Score: ' + str(roc_auc_score(y_val,lgb.predict_proba(X_val)[:,1])))

In [None]:
sns.heatmap(confusion_matrix(y_val,lgb.predict(X_val)),cmap='magma',annot=True,fmt='g')
plt.show()

<h1 align='center'>Thanks and make sure to learn!</h1>