# Travel Insurance Prediction

This notebook takes the following approach:

1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Model Evaluation
7. Experimentation / Improvements

# 1. Problem Definition

How can we use Python-based machine learning models and the given parameters to predict whether a person will purchase travel insurance?

# 2. Data

Data from: https://www.kaggle.com/tejashvi14/travel-insurance-prediction-data

## Context

A tour and travels company is offering a travel insurance package to its customers.
The new insurance package also includes Covid cover.
The company needs to know which customers would be interested in buying it, based on its database history.
The insurance was offered to some of the customers in 2019, and the given data has been extracted from the performance/sales of the package during that period.
The data is provided for almost 2000 of its previous customers, and the task is to build a model that can predict whether a customer will be interested in buying the travel insurance package based on the parameters given below.

# 3. Evaluation

As this is a classification problem, we will use classification metrics (accuracy, precision, recall and F1) to evaluate the model.
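
To make these metrics concrete, here is a small worked example. The confusion-matrix counts below are hypothetical and are not taken from this dataset; they only illustrate how each metric is derived for the positive class (customers who bought insurance).

```python
# Hypothetical confusion-matrix counts for the positive class (bought insurance).
tp, fp, fn, tn = 60, 10, 40, 190

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # of predicted buyers, how many really bought
recall = tp / (tp + fn)                     # of real buyers, how many the model found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Accuracy alone can look good on imbalanced data even when recall is low, which is why the later sections compare models on recall and F1 as well.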

# 4. Features

## Inputs / Features

    1. Age- Age Of The Customer
    2. Employment Type- The Sector In Which The Customer Is Employed
    3. GraduateOrNot- Whether The Customer Is A College Graduate Or Not
    4. AnnualIncome- The Yearly Income Of The Customer In Indian Rupees [Rounded To The Nearest 50 Thousand Rupees]
    5. FamilyMembers- Number Of Members In The Customer's Family
    6. ChronicDiseases- Whether The Customer Suffers From Any Major Disease Or Condition Like Diabetes, High BP Or Asthma, etc.
    7. FrequentFlyer- Derived Data Based On The Customer's History Of Booking Air Tickets On At Least 4 Different Instances In The Last 2 Years [2017-2019]
    8. EverTravelledAbroad- Whether The Customer Has Ever Travelled To A Foreign Country [Not Necessarily Using The Company's Services]
    
## Output / Label
    9. TravelInsurance- Did The Customer Buy Travel Insurance Package During Introductory Offering Held In The Year 2019.

## Standard Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Reading the Dataset

In [None]:
# Local
# df = pd.read_csv('Data/TravelInsurancePrediction.csv')

# Kaggle
df = pd.read_csv('/kaggle/input/travel-insurance-prediction-data/TravelInsurancePrediction.csv')
df.head()

## Data Exploration

In [None]:
df.info()

In [None]:
df.isnull().sum()

We will drop the `Unnamed: 0` column, as it is just a copy of the row index.

In [None]:
df = df.drop('Unnamed: 0', axis=1)
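
As an alternative to dropping the column after loading, the index column can be consumed at read time with `index_col=0`. The sketch below uses a tiny in-memory CSV (hypothetical values) rather than the Kaggle file, so it only demonstrates the behaviour of `pd.read_csv`.

```python
import io

import pandas as pd

# A tiny CSV with an unnamed first column, mimicking a saved DataFrame index.
csv_text = ",Age,TravelInsurance\n0,31,0\n1,34,1\n"

# Default read: the unnamed column shows up as a regular column.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# With index_col=0 the first column becomes the index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())
```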

In [None]:
df

In [None]:
plt.figure(figsize=(20,10))
plt.title('Label count of the dataset')
sns.countplot(data=df, x='TravelInsurance');

As the count plot shows, the dataset is imbalanced between the two classes.
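
The imbalance can also be quantified, and a stratified split can keep the class ratio the same in the train and test sets. This is a minimal sketch on toy labels; the 65/35 ratio and the names `X_toy` and `y_toy` are assumptions for illustration, not figures from the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels (hypothetical ratio): 65 non-buyers and 35 buyers.
y_toy = pd.Series([0] * 65 + [1] * 35, name="TravelInsurance")
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

# Class proportions instead of raw counts.
proportions = y_toy.value_counts(normalize=True)
print(proportions)

# stratify=y_toy preserves the class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)
print(y_tr.mean(), y_te.mean())
```

On the real data the same call would be `df['TravelInsurance'].value_counts(normalize=True)`.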

In [None]:
plt.figure(figsize=(20,10))
plt.title('Histogram of age')
sns.histplot(data=df, x='Age',bins=10, kde=True);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Age colored by Travel Insurance purchased')
sns.countplot(data=df, x='Age', hue='TravelInsurance');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Employment Type colored by Travel Insurance purchased')
sns.countplot(data=df, x='Employment Type', hue='TravelInsurance');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Graduate colored by Travel Insurance purchased')
sns.countplot(data=df, x='GraduateOrNot', hue='TravelInsurance');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Annual Income vs age colored by Travel Insurance purchased')
sns.scatterplot(data=df, x='AnnualIncome',y='Age', hue='TravelInsurance', s=150);

In [None]:
df['FamilyMembers'].value_counts()

In [None]:
plt.figure(figsize=(20,10))
plt.title('Family Members by Travel Insurance purchased')
sns.countplot(data=df, x='FamilyMembers', hue='TravelInsurance');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Chronic Diseases by Travel Insurance purchased')
sns.countplot(data=df, x='ChronicDiseases', hue='TravelInsurance');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Frequent Flyer by Travel Insurance purchased')
sns.countplot(data=df, x='FrequentFlyer', hue='TravelInsurance');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Ever Travelled Abroad by Travel Insurance purchased')
sns.countplot(data=df, x='EverTravelledAbroad', hue='TravelInsurance');

In [None]:
plt.figure(figsize=(20,10))
sns.boxplot(data=df, x='AnnualIncome');

In [None]:
plt.figure(figsize=(20,10))
sns.boxplot(data=df, x='Age');

In [None]:
plt.figure(figsize=(20,20))
sns.heatmap(data=pd.get_dummies(df, drop_first=True).corr(), annot=True);

In [None]:
pd.get_dummies(df, drop_first=True).corr()[['TravelInsurance']].sort_values('TravelInsurance', ascending=True)[:-1]

From the sorted correlations, travel insurance purchase is most strongly correlated with EverTravelledAbroad, AnnualIncome, FrequentFlyer and Employment Type.
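
One way to make such a ranking explicit is to sort features by the absolute value of their correlation with the label, so that strong negative correlations are not pushed to the bottom. The sketch below uses a toy frame with made-up values; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values) with one numeric and one categorical feature.
toy = pd.DataFrame({
    "AnnualIncome": [300000, 500000, 1400000, 1500000, 800000, 1300000],
    "FrequentFlyer": ["No", "No", "Yes", "Yes", "No", "Yes"],
    "TravelInsurance": [0, 0, 1, 1, 0, 1],
})

# One-hot encode, then correlate every feature with the label.
encoded = pd.get_dummies(toy, drop_first=True).astype(float)
corr = encoded.corr()["TravelInsurance"].drop("TravelInsurance")

# Rank by absolute correlation, keeping the sign in the displayed values.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```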

# 5. Modelling

In [None]:
X = df.drop('TravelInsurance', axis=1)
X = pd.get_dummies(X, drop_first=True)
y = df['TravelInsurance']
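
To see what `pd.get_dummies(X, drop_first=True)` does to the categorical columns, here is a sketch on a three-row toy frame. The category strings are assumed for illustration; with `drop_first=True` the first category of each column (in sorted order) is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Toy frame (hypothetical rows) with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "Employment Type": ["Government Sector", "Private Sector/Self Employed",
                        "Private Sector/Self Employed"],
    "GraduateOrNot": ["Yes", "No", "Yes"],
    "Age": [31, 34, 28],
})

encoded = pd.get_dummies(toy, drop_first=True)

# Numeric columns pass through; each two-level categorical becomes one indicator column.
print(encoded.columns.tolist())
```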

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
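
The scaler is fitted on the training split only, which keeps test-set statistics out of training. During cross-validation the same idea can be applied per fold by wrapping the scaler and the model in a pipeline. This is a minimal sketch on synthetic data; `make_classification` and the choice of `LogisticRegression` are assumptions for illustration, not the notebook's dataset or final model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the travel insurance dataset).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)

# The pipeline refits the scaler on the training part of every fold,
# so the validation part never influences the scaling statistics.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_syn, y_syn, cv=5, scoring="f1")
print(scores.mean())
```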

## Model Imports

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier, XGBRFClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

In [None]:
from warnings import filterwarnings
filterwarnings('ignore')

## Baseline Model Scores

In [None]:
from sklearn.metrics import classification_report,precision_score, recall_score,f1_score


In [None]:
def fit_and_score(models, X_train, X_test, y_train, y_test):
    np.random.seed(42)
    
    model_scores = {}
    model_recall = {}
    model_f1 = {}
    model_precision = {}
    
    for name, model in models.items():
        model.fit(X_train,y_train)
        y_preds = model.predict(X_test)
        print(name)
        print(classification_report(y_test, y_preds))
        print('\n')
        model_scores[name] = model.score(X_test,y_test)
        model_recall[name] = recall_score(y_test, y_preds)
        model_f1[name] = f1_score(y_test, y_preds)
        model_precision[name] = precision_score(y_test, y_preds)

    model_scores = pd.DataFrame(model_scores, index=['Score']).transpose()
    model_scores = model_scores.sort_values('Score')
    model_recall = pd.DataFrame(model_recall, index=['Recall']).transpose()
    model_recall = model_recall.sort_values('Recall')
    model_f1 = pd.DataFrame(model_f1, index=['F1']).transpose()
    model_f1 = model_f1.sort_values('F1')
    model_precision = pd.DataFrame(model_precision, index=['Precision']).transpose()
    model_precision = model_precision.sort_values('Precision')
        
    return model_scores, model_recall, model_f1, model_precision

In [None]:
models = {'LogisticRegression': LogisticRegression(max_iter=10000),
          'KNeighborsClassifier': KNeighborsClassifier(),
          'SVC': SVC(),
          'DecisionTreeClassifier': DecisionTreeClassifier(),
          'RandomForestClassifier': RandomForestClassifier(),
          'AdaBoostClassifier': AdaBoostClassifier(),
          'GradientBoostingClassifier': GradientBoostingClassifier(),
          'XGBClassifier': XGBClassifier(objective='binary:logistic',eval_metric=['logloss']),
          'XGBRFClassifier': XGBRFClassifier(objective='binary:logistic',eval_metric=['logloss']),
          'LGBMClassifier':LGBMClassifier(),
         'CatBoostClassifier': CatBoostClassifier(verbose=0)}

In [None]:
model_scores, model_recall, model_f1, model_precision = fit_and_score(models, X_train, X_test, y_train, y_test)

In [None]:
model_scores

In [None]:
model_recall

In [None]:
model_f1

In [None]:
model_precision

We will take the LGBMClassifier and RandomForestClassifier forward, as they provide the best overall recall and F1 scores.

## Random Search CV

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
def randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test):
    np.random.seed(42)
    
    model_rs_scores = {}
    model_rs_best_param = {}
    
    for name, model in models.items():
        rs_model = RandomizedSearchCV(model,
                                     param_distributions=params[name],
                                      scoring='f1',
                                      cv=5,
                                     n_iter=30,
                                     verbose=0)        
        rs_model.fit(X_train,y_train)
        model_rs_scores[name] = rs_model.score(X_test,y_test)
        model_rs_best_param[name] = rs_model.best_params_
        y_preds = rs_model.predict(X_test)
        print('\n')
        print(name)
        print(classification_report(y_test, y_preds))
        print('\n')
        
    return model_rs_scores, model_rs_best_param
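
The function above reports the F1 score on the test set. `RandomizedSearchCV` also exposes the mean cross-validated score of the best candidate as `best_score_`, which could be used to compare search rounds without consulting the test set. A minimal sketch on synthetic data follows; the data, the small search space and `random_state=42` are assumptions for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the travel insurance dataset).
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=42)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": randint(10, 50),  # sampled from a distribution
        "max_depth": [2, 4, None],        # sampled uniformly from a list
    },
    n_iter=5,
    scoring="f1",
    cv=3,
    random_state=42,  # makes the sampled candidates reproducible
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated F1 of the best candidate
```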

## Baseline CV

In [None]:
models = {'LGBMClassifier': LGBMClassifier(),
         'RandomForestClassifier': RandomForestClassifier()}

params = {'LGBMClassifier':{},      
          'RandomForestClassifier': {}
         }

In [None]:
model_rs_scores_base, model_rs_best_param_base = randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

## RS model 1

In [None]:
params = {'LGBMClassifier':{'num_leaves': np.arange(21,42,2),
                           'learning_rate': np.linspace(0.1,0.9,9),
                            'n_estimators':[50,100,200,300,500],
                            'min_split_gain':np.linspace(0.0,0.9,10),
                            'min_child_weight':np.linspace(0.0,0.9,10),
                            'min_child_samples': [10,20,40,80,100],
                            'reg_alpha': np.linspace(0.0,0.9,10),
                            'reg_lambda': np.linspace(0.0,0.9,10)
                           },
          'RandomForestClassifier': {'n_estimators':[50,100,200,300],
                                    'criterion':['gini','entropy'],
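                                    'max_features': ['sqrt','log2'],  # 'auto' (an alias of 'sqrt' for classifiers) was removed in scikit-learn 1.3
                                     'oob_score': [True,False],  # oob_score=True is only valid together with bootstrap=True
                                     'bootstrap': [True,False],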
                                     'ccp_alpha': np.linspace(0.0,0.9,10)
                                    }
         }

In [None]:
model_rs_scores1, model_rs_best_param1 = randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_rs_scores1

In [None]:
model_rs_best_param1

## RS model 2

In [None]:
params = {'LGBMClassifier':{'num_leaves': np.arange(22,24),
                           'learning_rate': np.linspace(0.5,0.7,10),
                            'n_estimators':[250,300,350,400],
                            'min_split_gain':np.linspace(0.1,0.3,5),
                            'min_child_weight':np.linspace(0.0,0.1,5),
                            'min_child_samples': [90,100,110,120,140,180],
                            'reg_alpha': np.linspace(0.5,0.7,5),
                            'reg_lambda': np.linspace(0.2,0.4,5)
                           },
          'RandomForestClassifier': {'n_estimators':[40,50,60],
                                    'criterion':['entropy'],
                                    'max_features': ['log2'],
                                     'oob_score': [False],
                                     'bootstrap': [False],
                                     'ccp_alpha': np.linspace(0.0,0.1,10)
                                    }
         }

In [None]:
model_rs_scores2, model_rs_best_param2 = randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_rs_scores2

In [None]:
model_rs_best_param2

## RS model 3

In [None]:
params = {'LGBMClassifier':{'num_leaves': [23],
                           'learning_rate': [0.6333333333333333],
                            'n_estimators':[220,230,240,250],
                            'min_split_gain':np.linspace(0.1,0.2,5),
                            'min_child_weight':[0.1],
                            'min_child_samples': [105,110,115],
                            'reg_alpha': [0.5],
                            'reg_lambda': [0.5]
                           },
          'RandomForestClassifier': {'n_estimators':[55,60,65,70,80],
                                    'criterion':['entropy'],
                                    'max_features': ['log2'],
                                     'oob_score': [False],
                                     'bootstrap': [False],
                                     'ccp_alpha': [0.011111111111111112]
                                    }
         }

In [None]:
model_rs_scores3, model_rs_best_param3 = randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_rs_scores3

In [None]:
model_rs_best_param3

The random search CV rounds are not showing any further improvement, so we will use the current best hyperparameters for the final model and move on to evaluation.

# 6. Model Evaluation

In [None]:
from sklearn.metrics import classification_report, ConfusionMatrixDisplay, RocCurveDisplay
from sklearn.model_selection import cross_val_score

In [None]:
model = LGBMClassifier(reg_lambda=0.5,
                      reg_alpha=0.5,
                      num_leaves=23,
                      n_estimators=220,
                      min_split_gain=0.1,
                      min_child_weight = 0.1,
                      min_child_samples=105,
                      learning_rate = 0.6333333333333333)

In [None]:
model.fit(X_train,y_train)
y_preds = model.predict(X_test)

## Classification Report

In [None]:
print(classification_report(y_test, y_preds))

## Confusion Matrix

In [None]:
# ConfusionMatrixDisplay replaces plot_confusion_matrix, which was removed in scikit-learn 1.2
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)

## ROC curve

In [None]:
# RocCurveDisplay replaces plot_roc_curve, which was removed in scikit-learn 1.2
RocCurveDisplay.from_estimator(model, X_test, y_test)
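
Alongside the plot, the ROC curve can be summarised as a single number, the area under the curve (AUC), computed from predicted probabilities. A minimal sketch on synthetic data follows; the data and the `LogisticRegression` model are assumptions for illustration, not the notebook's fitted LGBM model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the travel insurance dataset).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC analysis needs scores or probabilities, not hard class predictions.
proba = clf.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, proba)
auc = roc_auc_score(y_te, proba)
print(auc)
```

With the notebook's objects, the equivalent call would be `roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])`.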

In [None]:
feat_importances = pd.DataFrame(model.feature_importances_, index=X.columns)

In [None]:
feat_importances

In [None]:
plt.figure(figsize=(20,10))
plt.xticks(rotation=90)
plt.title('Feature Importances')
sns.barplot(data= feat_importances.sort_values(0).T);

## Evaluation using cross-validation

In [None]:
def get_cv_score(model, X, y, cv=5):
    # Cross-validate the model once per metric, print the per-fold and mean
    # scores, and return the means as a one-row DataFrame.
    
    cv_accuracy = cross_val_score(model,X,y,cv=cv,
                         scoring='accuracy')
    print(f'Cross Validation accuracy Scores: {cv_accuracy}')
    print(f'Cross Validation accuracy Mean Score: {cv_accuracy.mean()}')
    
    cv_precision = cross_val_score(model,X,y,cv=cv,
                         scoring='precision')
    print(f'Cross Validation precision Scores: {cv_precision}')
    print(f'Cross Validation precision Mean Score: {cv_precision.mean()}')
    
    cv_recall = cross_val_score(model,X,y,cv=cv,
                         scoring='recall')
    print(f'Cross Validation recall Scores: {cv_recall}')
    print(f'Cross Validation recall Mean Score: {cv_recall.mean()}')
    
    cv_f1 = cross_val_score(model,X,y,cv=cv,
                         scoring='f1')
    print(f'Cross Validation f1 Scores: {cv_f1}')
    print(f'Cross Validation f1 Mean Score: {cv_f1.mean()}')   
    
    cv_metrics = pd.DataFrame({'Accuracy': cv_accuracy.mean(),
                         'Precision': cv_precision.mean(),
                         'Recall': cv_recall.mean(),
                         'f1': cv_f1.mean()},index=[0])  # mean of the per-fold F1 scores
    
    return cv_metrics
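
The four metrics can also be collected in a single cross-validation pass with `cross_validate`, which accepts a list of scorers and fits the model once per fold rather than once per fold per metric. This is a sketch on synthetic data; the data and the `LogisticRegression` model are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (not the travel insurance dataset).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)

scoring = ["accuracy", "precision", "recall", "f1"]
results = cross_validate(
    LogisticRegression(max_iter=1000), X_syn, y_syn, cv=5, scoring=scoring
)

# cross_validate returns one array per metric under the key 'test_<metric>'.
summary = pd.DataFrame({m: [results[f"test_{m}"].mean()] for m in scoring})
print(summary)
```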

In [None]:
cv_metrics = get_cv_score(model, X_train, y_train, cv=10)

In [None]:
cv_metrics

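With this model, cross-validated evaluation gives the following:

    Accuracy 	0.827338
    Precision 	0.872269
    Recall 	    0.596556
    f1          0.596556

The f1 value is identical to the recall value to six decimal places, which suggests it was recorded from the recall mean rather than from the F1 scores; treat it as unverified until the evaluation is rerun.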
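
As a rough sanity check on the table, F1 is the harmonic mean of precision and recall, so an F1 equal to recall would require precision to equal recall as well. The sketch below applies the formula to the mean precision and mean recall reported above. This is only an approximation: the mean of per-fold F1 scores is not exactly the harmonic mean of the mean precision and mean recall.

```python
# Mean cross-validated precision and recall, copied from the table above.
precision = 0.872269
recall = 0.596556

# Harmonic mean of the two means (an approximation of the mean per-fold F1).
approx_f1 = 2 * precision * recall / (precision + recall)
print(f"{approx_f1:.4f}")
```

This gives roughly 0.71, noticeably higher than the 0.596556 listed for f1.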

# 7. Experimentation / Improvements

With recall of about 60% in cross-validation (and f1 reported at the same value), the model still misses many customers who would buy the insurance, so we hope to build a better-scoring model.

We could look into the following for improvements:

    1. Check for other outliers
    2. Explore the data again to build a better model
    3. Get more data, as the current dataset is small
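
For the first item, one common way to flag outliers in a numeric column is the 1.5 x IQR rule, the same rule that box plots use for their whiskers. A minimal sketch on made-up income values (not taken from the dataset):

```python
import numpy as np

# Hypothetical annual incomes with one extreme value.
income = np.array([300000, 500000, 800000, 1200000, 1400000, 1500000, 9000000])

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = np.percentile(income, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers = income[(income < lower) | (income > upper)]
print(outliers)
```

On the real data the same steps could be applied to `df['AnnualIncome']` or `df['Age']`.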