
# Travel insurance prediction   
Motivation: predict whether a given customer would like to buy the insurance package once the corona lockdown ends and travelling resumes.  
Your work could probably help a family save thousands of rupees.  

Kaggle link - https://www.kaggle.com/tejashvi14/travel-insurance-prediction-data

In [None]:
import os
import warnings

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib.ticker import FuncFormatter
from matplotlib import cycler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler, FunctionTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score
from lightgbm import LGBMClassifier
import xgboost as xgb

# Save the default settings if you want to roll back the style you made
IPython_default = plt.rcParams.copy()
%matplotlib inline

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        file = os.path.join(dirname, filename)
data = pd.read_csv(file)

## Overview

In [None]:
print(data.shape)
data.head()

In [None]:
# Check each column for missing values
data.isnull().any()

In [None]:
data.dtypes

There are 4 categorical features; we will encode them later.

In [None]:
def show_objects(df):
    """Print the unique values of every object column in df."""
    cols = df.select_dtypes(include=['object']).columns
    for col in cols:
        print("Unique values for {}".format(col))
        print(df.loc[:, col].unique())
    
show_objects(data)

Great, each object column has only 2 unique values.
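
Because every object column is two-valued, one simple way to encode it is a 0/1 mapping. The sketch below shows the idea on a toy frame; the helper name `binarize_objects` and the cell values are assumptions made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame with two-valued object columns (illustrative values only)
toy = pd.DataFrame({
    "Employment Type": ["Government Sector", "Private Sector", "Private Sector"],
    "GraduateOrNot": ["Yes", "No", "Yes"],
}, dtype=object)

def binarize_objects(df):
    """Map every two-valued object column to 0/1, in sorted order of its values."""
    out = df.copy()
    for col in out.select_dtypes(include=["object"]).columns:
        values = sorted(out[col].unique())
        if len(values) == 2:
            out[col] = out[col].map({values[0]: 0, values[1]: 1})
    return out

encoded = binarize_objects(toy)
print(encoded)
```

Columns with more than two values are left untouched here; they would need one-hot encoding instead.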

## Visualization

Let's go from simple to complex. First, let's look at the linear dependencies.

But before that, let's adjust the plot style.

In [None]:
# Tune style
monokai_black = '#232323'

colors = cycler('color',
                ['#CCCC00','#3399FF','#59f4ff','#884dff', '#4dff62', 
                 '#6f992f', '#ffa230','#ff3030','#545454','#000000']
               )
plt.rc('axes', axisbelow=True, grid=True, titlesize=15,
       prop_cycle=colors, titlecolor='white', labelcolor='white')
plt.rc('figure', figsize=(16,6), facecolor='#232323')
plt.rc('xtick', direction='out', color='white')
plt.rc('ytick', direction='out', color='white')
plt.rc('legend', facecolor='#232323', edgecolor='white')
plt.rc('lines', linewidth=2.5)
plt.rc('patch', force_edgecolor=True, facecolor='#232323',
       edgecolor='black', lw=1.5)
plt.rc('text', color='white')

Next, we create a couple of helper functions to make the plots easier to read. They use the forex-python package, which can be installed from the console:

    pip install forex-python
    
or from a notebook cell:

In [None]:
! pip install forex-python
from forex_python.converter import CurrencyRates
def convert_to_dollars(amount):
    """Convert an amount in INR to USD (needs network access to fetch the rate)."""
    c = CurrencyRates()
    return round(c.convert('INR', 'USD', amount), 2)

In [None]:
def currency(x, pos):
    # FuncFormatter passes two arguments: the tick value and its position
    return '$ {:1.0f}K'.format(convert_to_dollars(x)*1e-3)

In [None]:
def depend_for_income(name_col):
    fig, ax = plt.subplots()
    ax.bar(data[name_col], data["AnnualIncome"])
    formatter = FuncFormatter(currency)
    ax.yaxis.set_major_formatter(formatter)
    ax.set(xlabel=name_col, ylabel='Annual Income')
    plt.show()

In [None]:
depend_for_income("Age")

People around 27 years old have the lowest earnings. At ages 30 and 35, even the richest people have the lowest maximum earnings of any age. People aged 28 and 32 mostly reach the highest earnings in the entire sample.

In [None]:
depend_for_income("Employment Type")

As expected, self-employed people exceed the maximum earnings of officially employed people. In addition, about one-fourth of the self-employed earn as much as the richest officially employed people.

In [None]:
depend_for_income("GraduateOrNot")

We see that the vast majority of people without a college education do not cross the average earnings threshold of the sample.  
Note also that graduates reach the maximum salary in the sample, while non-graduates do not. Combined with the previous plot, this suggests that educated people in this sample tend to succeed as self-employed.
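
Reading averages and maxima off overlapping bars is imprecise, so claims like these can be checked numerically with a grouped summary. The sketch below uses a toy frame with made-up incomes (not the Kaggle data) to show the calculation; on the real data the same calls would run against `data`.

```python
import pandas as pd

# Toy incomes, invented for illustration only
toy = pd.DataFrame({
    "GraduateOrNot": ["Yes", "Yes", "No", "No", "Yes", "No"],
    "AnnualIncome": [1200000, 800000, 500000, 700000, 1500000, 600000],
})

# Mean, median and maximum income per group
summary = toy.groupby("GraduateOrNot")["AnnualIncome"].agg(["mean", "median", "max"])

# Share of each group earning below the overall mean of the sample
overall_mean = toy["AnnualIncome"].mean()
below_mean = (toy["AnnualIncome"] < overall_mean).groupby(toy["GraduateOrNot"]).mean()

print(summary)
print(below_mean)
```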

In [None]:
fig, (ax1,ax2) = plt.subplots(1,2)
gover_sector = data[data["Employment Type"] == "Government Sector"]
private_sector = data[data["Employment Type"] != "Government Sector"]
ax1.hist(private_sector["GraduateOrNot"])
ax1.set(title='Education: private sector (PS)')
ax2.hist(gover_sector["GraduateOrNot"])
ax2.set(title='Education: government sector (GS)')
plt.show()

In [None]:
def encoder(col):
    # DataFrame.apply passes one column at a time; label-encode only object columns
    if col.dtype == "object":
        return LabelEncoder().fit_transform(col)
    return col

In [None]:
corr = data.apply(encoder).corr()
g = sns.heatmap(corr,annot=True)
g.figure.set_size_inches(10,10)

There are no highly correlated features.
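
The same check can be done without reading the heatmap by eye: list every feature pair whose absolute correlation exceeds a cutoff. This is a minimal sketch; the helper name `correlated_pairs`, the 0.8 threshold, and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(corr, threshold=0.8):
    """Return (feature_a, feature_b, r) for each pair with |r| >= threshold."""
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    strong = pairs[pairs.abs() >= threshold]
    return [(a, b, round(float(r), 3)) for (a, b), r in strong.items()]

# Toy data: 'b' is a near copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

strong_pairs = correlated_pairs(toy.corr())
print(strong_pairs)
```

On the notebook's data, passing the `corr` matrix computed above would return an empty list if no pair reaches the chosen threshold.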

## Pre-processing

In [None]:
# Drop the redundant first column
data = data.drop(data.columns[0], axis=1)

In [None]:
def pipeline_models(models: list = None):
    """Yield the model classes to evaluate, falling back to a default set."""
    if not models:
        models = [RandomForestClassifier, LogisticRegression, GaussianProcessClassifier,
                  AdaBoostClassifier, GradientBoostingClassifier, KNeighborsClassifier,
                  GaussianNB, xgb.XGBClassifier, LGBMClassifier]
    yield from models


def preprocess(data):
    """Build a transformer that scales numeric columns and one-hot encodes object columns."""
    cat_cols = data.select_dtypes(include=['object']).columns
    # Exclude the target, which is assumed to be the last numeric column
    num_cols = data.select_dtypes(include=['number']).columns[:-1]
    preprocessor = make_column_transformer((StandardScaler(), num_cols),
                                           (OneHotEncoder(), cat_cols))
    return preprocessor


def fit_pipe(model, X_train, X_test, y_train, y_test, preprocessor, **params):
    """Fit preprocessor + model; return (ROC AUC on the test split, fitted pipeline)."""
    pipe = Pipeline([
        ("preproc", preprocessor),
        ("model", model(**params)),
    ])
    with warnings.catch_warnings():  # Silence warnings during fitting
        warnings.simplefilter("ignore")
        pipe.fit(X_train, y_train)
        pred = pipe.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, pred), pipe

In [None]:
X = data.drop("TravelInsurance", axis=1)
y = data["TravelInsurance"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=46)

## Fitting models

In [None]:
results = {}
for model in pipeline_models():
    # fit_pipe returns the ROC AUC on the test split, not accuracy
    score, _ = fit_pipe(model, X_train, X_test, y_train, y_test, preprocess(data))
    results[model.__name__] = score

In [None]:
# Rank the models by ROC AUC, best first
ranked = sorted(results.items(), key=lambda kv: kv[1], reverse=True)
for i, (name, score) in enumerate(ranked, start=1):
    print("{}. {} ROC AUC - {}\n".format(i, name, round(score, 5)))

In [None]:
help(GradientBoostingClassifier)

In [None]:
n_rate_est = []
for i in range(50):
    # Sweep n_estimators from 100 to 149; keep the score, drop the fitted pipeline
    score, _ = fit_pipe(GradientBoostingClassifier, X_train, X_test, y_train, y_test,
                        preprocess(data), n_estimators=i + 100)
    n_rate_est.append(score)

In [None]:
fig, ax = plt.subplots()
ax.plot(n_rate_est)
formatter = FuncFormatter(lambda x,p: int(x+100))
ax.xaxis.set_major_formatter(formatter)
ax.set(xlabel="N Estimators", ylabel='ROC AUC')
plt.show()

In [None]:
dict(sorted([(pos+100, v) for pos, v in enumerate(n_rate_est)], key=lambda x: x[1], reverse=True))

But these numbers tell us little. Hyperparameters should be evaluated jointly with one another, not one at a time. On the other hand, we should not pick the hyperparameters that score best on this particular test sample, because the model would then be tuned to predict only nearly identical data well.
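
One common way to address both concerns is to search over several hyperparameters jointly while scoring each combination by cross-validation on the training split only, so the test split plays no part in tuning. Below is a minimal sketch with `GridSearchCV`; the synthetic `make_classification` data and the grid values are assumptions for illustration, not the search that was actually run in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the preprocessed features (not the Kaggle data)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=46)

# Hypothetical small grid; the values are placeholders, not tuned recommendations
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "subsample": [0.6, 1.0],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # same metric as fit_pipe uses
    cv=3,               # 3-fold cross-validation inside the training split
)
search.fit(X_train, y_train)  # the test split is never seen during tuning

print(search.best_params_)
test_auc = search.score(X_test, y_test)  # ROC AUC of the refit best model
print(test_auc)
```

With the notebook's pipeline, the estimator would be the whole `Pipeline` and the grid keys would take the step prefix, for example `model__n_estimators`.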

In [None]:
# After a quick grid search has been performed
best_result, model = fit_pipe(GradientBoostingClassifier, X_train, X_test, y_train, y_test,
                              preprocess(data),
                              max_features='sqrt',
                              subsample=.6,
                              min_weight_fraction_leaf=.08,
                              min_impurity_decrease=.06)

In [None]:
best_result

In [None]:
model.named_steps['model'].feature_importances_