# Deep learning models' performance on tabular data compared with GBDT and TabNet models

Deep learning has made impressive progress in recent years and is, without doubt, the state of the art in computer vision (CV) and natural language processing (NLP). On structured data, however, gradient boosted decision tree (GBDT) models remain strong competitors to deep learning models.

This notebook compares the performance of a GBDT (XGBoost), a deep learning model (MLP) and TabNet (a deep learning model designed for tabular data) on the home insurance dataset (structured data). Each model predicts whether a home insurance policy will lapse.

## Summary
### Model performance
This section compares the performance of XGBoost, MLP and TabNet with and without pretraining. The models are evaluated by ROC AUC and F1 score. F1 scores were calculated at a threshold of 0.27, on the assumption that the share of lapsed policies in the test data is similar to that in the training data. The results are summarised below.

|                         	| ROC AUC 	| F1 score 	| Time (sec) 	|
|:-----------------------:	|:-------:	|:--------:	|:----------:	|
|         XGBoost         	|  0.7706 	|  0.5591  	|     500    	|
|           MLP           	|  0.7514 	|  0.5458  	|     184    	|
| TabNet without pretrain 	|  0.7579 	|  0.5529  	|    1464    	|
|   TabNet with pretrain  	|  0.7524 	|  0.5484  	|    2370    	|
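
The 0.27 threshold mirrors the share of lapsed policies in the training data. A minimal, dependency-free sketch of that idea follows. The labels and scores are made-up toy values, and `f1_at_threshold` is a hypothetical helper rather than the notebook's own code (which uses `sklearn.metrics.f1_score`).

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score after turning probabilities into labels with `> threshold`."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy training labels (hypothetical): 3 of 10 are positive.
y_train_toy = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
base_rate = sum(y_train_toy) / len(y_train_toy)

# Toy test labels and predicted probabilities (hypothetical).
y_test_toy = [1, 0, 1, 0, 0]
y_prob_toy = [0.80, 0.35, 0.25, 0.10, 0.05]

f1_base = f1_at_threshold(y_test_toy, y_prob_toy, base_rate)
f1_half = f1_at_threshold(y_test_toy, y_prob_toy, 0.5)
print(f"F1 at base rate {base_rate:.2f}: {f1_base:.4f}, F1 at 0.50: {f1_half:.4f}")
```

The score depends on the threshold, so F1 values are only comparable across models when every model is evaluated at the same cut-off, as in the table.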

In terms of predictive performance, XGBoost is the best model, although the others are not far behind. I trained TabNet both with and without pretraining (the notebook for the pretrained TabNet can be found [here](https://www.kaggle.com/kyosukemorita/home-insurance-pretrained-tabnet)). Pretraining is supposed to give a better result, but on this dataset it gave a slightly worse one. I am not sure exactly why, but I suspect it could be improved with more appropriate hyperparameters.

Looking at the distribution of each model's predictions, the XGBoost and TabNet distributions are somewhat similar. My guess is that this is because TabNet also uses a tree-like decision mechanism. The MLP's distribution has quite a different shape from the others.

In terms of training time, the MLP was the fastest, mainly because I trained on a GPU. Both TabNet models took much longer than the others, which makes a big difference when it comes to hyperparameter tuning. In this experiment I did no hyperparameter tuning and used arbitrary parameters. Although the MLP trains in roughly a third of the time XGBoost needs, the number of parameters it needs to optimise is easily more than 10 times that of XGBoost, so tuning the MLP could well take longer than tuning XGBoost.
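
As a rough back-of-the-envelope sketch of this point, tuning cost can be approximated as the time per training run multiplied by the number of configurations tried. The run times below come from the summary table. The configuration counts are purely hypothetical and only illustrate a search that is 10 times larger for the MLP.

```python
def tuning_time_sec(seconds_per_run, n_configurations):
    """Total sequential time to train every configuration once."""
    return seconds_per_run * n_configurations

# Single-run training times (seconds) from the summary table.
XGB_SEC, MLP_SEC = 500, 184

# Hypothetical search sizes: the MLP search is assumed to be 10x larger.
xgb_total = tuning_time_sec(XGB_SEC, n_configurations=20)
mlp_total = tuning_time_sec(MLP_SEC, n_configurations=200)

print(f"XGBoost: {xgb_total / 3600:.1f} h, MLP: {mlp_total / 3600:.1f} h")
```

Under these assumed counts the faster single run does not translate into a cheaper search.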

### Explainability
Explainability matters in many business uses of machine learning. In finance and banking, for example, it is critical to be able to explain why a model made a particular decision. Imagine a model used for loan approval and a customer who wants to know why their application was rejected. The bank cannot answer "we don't know", because the industry is strictly regulated.
Limited explainability is one of the drawbacks of MLP models. We can still estimate which features contributed to a prediction with tools such as SHAP, but it is more convenient to be able to check a feature importance list directly. In this notebook I compare the feature importance of the XGBoost and TabNet models only.
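
SHAP is one model-agnostic option. Another, simpler one is permutation importance, which is swapped in here because it can be written in a few lines without extra libraries. The sketch shuffles one feature at a time and measures how much the accuracy drops. `toy_model`, the data and both helper functions are hypothetical and are not part of this notebook.

```python
import random

def accuracy(model, X, y):
    """Share of rows where the model's 0/1 prediction matches the label."""
    return sum(1 for row, label in zip(X, y) if model(row) == label) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with column j replaced by its shuffled values.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical black-box model that only looks at feature 0.
def toy_model(row):
    return 1 if row[0] > 0.5 else 0

data_rng = random.Random(42)
X_toy = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y_toy = [toy_model(row) for row in X_toy]

importances = permutation_importance(toy_model, X_toy, y_toy)
print(importances)
```

Feature 1 is never used by `toy_model`, so shuffling it leaves the accuracy unchanged, while shuffling feature 0 hurts it noticeably.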

The top 5 important features of the XGBoost model are:

- Marital status - Partner
- Payment method - Non-Direct debit
- Option "Emergencies" included after 1st renewal
- Building coverage - Self-damage
- Option "Replacement of keys" included before 1st renewal

The top 5 important features of the TabNet model without pretraining are:

- Property type 21 (Detail not given)
- "HP1" included before 1st renewal
- Payment method - Pure Direct debit
- Type of membership 6 (Detail not given)
- Insurance cover length in years

Surprisingly, the two models' important features are quite different. XGBoost's are more "understandable and expected" to me: for example, a customer with a partner is likely to be more financially responsible, so their home insurance is less likely to lapse. TabNet's important features are, I would say, less intuitive. The most important one is "property type 21", whose meaning is not given, so we don't know what is special about this property type. The second most important is "HP1" included before 1st renewal, and again we don't know what "HP1" is. Perhaps this is an advantage of TabNet: as a deep learning model, it can explore non-obvious relationships between features and settle on a useful feature set, especially in a case like this one where not all feature details are given.

### Model selection for deployment in the real-life business
When we want to use a machine learning model in a real business, we need to choose how to deploy it, and there are often trade-offs. For example, when we have built several models with similar accuracy, as here, ensembling them can often increase accuracy. If ensembling improved the F1 score by 10%, adopting it would be an easy decision, but would we still adopt it if the improvement were only 1%? Probably not: running one more model makes the computation more expensive, so the ensemble is worthwhile only if the benefit of deploying the extra model outweighs its computation cost. Otherwise it is not optimal from a business point of view.
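
The trade-off can be sketched with the simplest ensemble, an unweighted average of each model's predicted probabilities, followed by a check on whether the gain clears a minimum bar. All probabilities, scores and the 5% bar below are hypothetical, and both helpers are illustrative names rather than code from this notebook.

```python
def average_ensemble(*prob_vectors):
    """Element-wise mean of several models' predicted probabilities."""
    n_models = len(prob_vectors)
    return [sum(values) / n_models for values in zip(*prob_vectors)]

def worth_deploying(score_single, score_ensemble, min_relative_gain):
    """True if the ensemble's relative gain reaches the required minimum."""
    gain = (score_ensemble - score_single) / score_single
    return gain >= min_relative_gain

# Hypothetical probabilities from three models for four policies.
p_xgb = [0.9, 0.2, 0.4, 0.1]
p_mlp = [0.6, 0.5, 0.1, 0.1]
p_tab = [0.6, 0.2, 0.4, 0.4]

p_ens = average_ensemble(p_xgb, p_mlp, p_tab)

# Hypothetical F1 scores: a 1% relative gain versus a 10% relative gain,
# judged against an assumed 5% minimum.
small_gain_ok = worth_deploying(0.50, 0.505, min_relative_gain=0.05)
large_gain_ok = worth_deploying(0.50, 0.55, min_relative_gain=0.05)
print(p_ens, small_gain_ok, large_gain_ok)
```

In practice the minimum gain would come from comparing the business value of better predictions with the cost of serving an extra model.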

Also, regarding model explainability, the XGBoost model used all 115 features, whereas the TabNet model used only 16 (the pretrained model used only 4). This is a large difference, and it is important to understand it. As mentioned above, in some real business use cases it is critical to know how much each feature contributes. Even when accuracy is high, if the model cannot explain why it made a decision, it is difficult to convince people to use it in real life, especially in a very sensitive business.
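
One way to arrive at such counts, assuming that "used" means a non-zero importance score, is to filter the importance list. The scores below are made up and `used_features` is a hypothetical helper.

```python
def used_features(importances, tol=0.0):
    """Names of features whose importance is strictly above `tol`."""
    return [name for name, value in importances.items() if value > tol]

# Hypothetical importance scores; zero means the model ignored the feature.
toy_importances = {
    "age": 0.40,
    "cover_length": 0.35,
    "BEDROOMS": 0.0,
    "PAYMENT_METHOD_PureDD": 0.25,
    "FLOODING_Y": 0.0,
}

kept = used_features(toy_importances)
print(f"{len(kept)} of {len(toy_importances)} features used: {kept}")
```

On the `importance_tabNet` frame built at the end of this notebook, the equivalent count would be something like `(importance_tabNet["importance"] > 0).sum()`.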

Considering these two points, I would say that the XGBoost model is superior to the deep learning models in this case. In terms of accuracy, XGBoost was slightly better than the others (I haven't tried ensembling the predictions from all the models, but let's assume it would not improve the accuracy much; I might be wrong). In terms of explainability, as discussed above, the XGBoost model's feature importance list is something we can understand (we can see some logic behind it) and is roughly what we would expect.

### Conclusion
This notebook experimentally compared the performance of XGBoost, MLP and TabNet on tabular data, using the home insurance dataset to predict policy lapse. XGBoost was slightly better than the deep learning models in terms of F1 score and ROC AUC, while the MLP, trained on a GPU, was the fastest to train. We also compared explainability through the feature importance lists of the XGBoost and TabNet models. The XGBoost list was more understandable and closer to expectations, whereas the TabNet list was less intuitive. I think this comes from the structure of the algorithm: deep learning models, by nature, explore non-obvious relationships between features, which are often difficult for a human to understand. From this simple experiment we can confirm that, although the recent progress of deep learning is impressive and definitely state of the art, on tabular data GBDT models are still as good as deep learning models and sometimes better, especially when we want to deploy a machine learning model in a real business.

In [None]:
# TabNet
!pip install pytorch-tabnet

In [None]:
import os
import numpy as np
import pandas as pd
import warnings
import time
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import torch
from pytorch_tabnet.pretraining import TabNetPretrainer
from pytorch_tabnet.tab_model import TabNetClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, roc_auc_score
import xgboost as xgb
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt

warnings.filterwarnings("ignore")

In [None]:
pd.options.display.max_rows = None
pd.options.display.max_columns = None

In [None]:
from functools import wraps

def timer(myFunction):
    """Decorator that prints how long the wrapped function took to run."""
    @wraps(myFunction)  # keep the wrapped function's name and docstring
    def functionTimer(*args, **kwargs):
        start_time = time.time()
        result = myFunction(*args, **kwargs)
        end_time = time.time()
        computation_time = round(end_time - start_time, 2)
        print("{} is executed".format(myFunction.__name__))
        print('Computation took: {:.2f} seconds'.format(computation_time))
        return result
    return functionTimer

In [None]:
@timer
def prepareInputs(df: "pd.DataFrame") -> "pd.DataFrame":
    """Prepare the input for training

    Args:
        df (pd.DataFrame): raw data
        
    Process:
        1. Exclude missing values
        2. Clean the target variable
        3. Create dummy variables for categorical variables
        4. Create age features
        5. Impute missing value
    
    Return: pd.DataFrame
    """
    
    # 1. Exclude missing values
    df = df[df["POL_STATUS"].notnull()]
    
    # 2. Clean the target variable
    df = df[df["POL_STATUS"] != "Unknown"].copy()  # copy so new columns are not written to a view
    df["lapse"] = np.where(df["POL_STATUS"] == "Lapsed", 1, 0)
    
    # 3. Create dummy variables for categorical variables
    categorical_cols = ["CLAIM3YEARS", "BUS_USE", "AD_BUILDINGS",
                        "APPR_ALARM", "CONTENTS_COVER", "P1_SEX",
                        "BUILDINGS_COVER", "P1_POLICY_REFUSED", 
                        "APPR_LOCKS", "FLOODING",
                        "NEIGH_WATCH", "SAFE_INSTALLED", "SEC_DISC_REQ",
                        "SUBSIDENCE", "LEGAL_ADDON_POST_REN", 
                        "HOME_EM_ADDON_PRE_REN","HOME_EM_ADDON_POST_REN", 
                        "GARDEN_ADDON_PRE_REN", "GARDEN_ADDON_POST_REN", 
                        "KEYCARE_ADDON_PRE_REN", "KEYCARE_ADDON_POST_REN", 
                        "HP1_ADDON_PRE_REN", "HP1_ADDON_POST_REN",
                        "HP2_ADDON_PRE_REN", "HP2_ADDON_POST_REN", 
                        "HP3_ADDON_PRE_REN", "HP3_ADDON_POST_REN", 
                        "MTA_FLAG", "OCC_STATUS", "OWNERSHIP_TYPE",
                        "PROP_TYPE", "PAYMENT_METHOD", "P1_EMP_STATUS",
                        "P1_MAR_STATUS"
                        ]
    
    for col in categorical_cols:
        dummies = pd.get_dummies(df[col], 
                                 drop_first = True,
                                 prefix = col
                                )
        df = pd.concat([df, dummies], axis=1)
    
    # 4. Create age features
    df["age"] = (datetime.strptime("2013-01-01", "%Y-%m-%d") - pd.to_datetime(df["P1_DOB"])).dt.days // 365
    df["property_age"] = 2013 - df["YEARBUILT"]
    df["cover_length"] = 2013 - pd.to_datetime(df["COVER_START"]).dt.year
    
    # 5. Impute missing value
    df["RISK_RATED_AREA_B_imputed"] = df["RISK_RATED_AREA_B"].fillna(df["RISK_RATED_AREA_B"].mean())
    df["RISK_RATED_AREA_C_imputed"] = df["RISK_RATED_AREA_C"].fillna(df["RISK_RATED_AREA_C"].mean())
    df["MTA_FAP_imputed"] = df["MTA_FAP"].fillna(0)
    df["MTA_APRP_imputed"] = df["MTA_APRP"].fillna(0)

    return df

In [None]:
# Split train and test
@timer
def splitData(df: "pd.DataFrame", FEATS: "list"):
    """Split the dataframe into train and test
    
    Args:
        df: preprocessed dataframe
        FEATS: feature list
        
    Returns:
        X_train, y_train, X_test, y_test
    """
    
    train, test = train_test_split(df, test_size = .3, random_state = 42)
    train, test = prepareInputs(train), prepareInputs(test)
    
    return train[FEATS], train["lapse"], test[FEATS], test["lapse"]
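
`splitData` calls `prepareInputs` on the train and test frames separately, so a category that appears in only one split would produce a dummy column in only that split. One way to guard against this, sketched here in plain Python with made-up values, is to learn the category list from the training split and reuse it for the test split. `fit_categories` and `one_hot` are hypothetical helpers, and the `"DD-Other"` and `"Cheque"` levels are invented for illustration.

```python
def fit_categories(values):
    """Sorted unique categories seen in the training split."""
    return sorted(set(values))

def one_hot(values, categories, prefix):
    """One-hot encode with a fixed category list, dropping the first level.

    Values outside `categories` (unseen in training) become all-zero rows.
    """
    kept = categories[1:]  # mirrors drop_first=True
    columns = [f"{prefix}_{c}" for c in kept]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return columns, rows

# Toy payment-method values; "Cheque" never appears in the training split.
train_vals = ["NonDD", "PureDD", "DD-Other", "PureDD"]
test_vals = ["PureDD", "Cheque"]

cats = fit_categories(train_vals)
train_cols, train_rows = one_hot(train_vals, cats, "PAYMENT_METHOD")
test_cols, test_rows = one_hot(test_vals, cats, "PAYMENT_METHOD")
print(train_cols, test_rows)
```

With pandas, a similar effect could be had by encoding both frames and then calling `reindex(columns=..., fill_value=0)` on the test frame with the training columns.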

In [None]:
# Standardise the data sets
@timer
def standardiseNumericalFeats(X_train, X_test):
    """Standardise the numerical features
    
    Returns:
        Standardised X_train and X_test
    """

    numerical_cols = [
        "age", "property_age", "cover_length", "RISK_RATED_AREA_B_imputed", 
        "RISK_RATED_AREA_C_imputed", "MTA_FAP_imputed", "MTA_APRP_imputed",
        "SUM_INSURED_BUILDINGS", "NCD_GRANTED_YEARS_B", "SUM_INSURED_CONTENTS", 
        "NCD_GRANTED_YEARS_C", "SPEC_SUM_INSURED", "SPEC_ITEM_PREM", 
        "UNSPEC_HRP_PREM", "BEDROOMS", "MAX_DAYS_UNOCC", "LAST_ANN_PREM_GROSS"
    ]

    for col in numerical_cols:
        scaler = StandardScaler()

        X_train[col] = scaler.fit_transform(X_train[[col]])
        X_test[col] = scaler.transform(X_test[[col]])
        
    return X_train, X_test
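
The key point of `standardiseNumericalFeats` is that the mean and standard deviation are learned from the training split only and then reused on the test split. A dependency-free sketch of the same z-score logic, with made-up ages and hypothetical helper names, looks like this (`pstdev` is the population standard deviation, which is what `StandardScaler` uses):

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn mean and population standard deviation from training data."""
    return mean(train_values), pstdev(train_values)

def apply_scaler(values, mu, sigma):
    """Z-score values using previously learned statistics."""
    return [(v - mu) / sigma for v in values]

# Toy ages (hypothetical values).
age_train = [30, 40, 50, 60]
age_test = [45, 70]

mu, sigma = fit_scaler(age_train)
age_train_std = apply_scaler(age_train, mu, sigma)
age_test_std = apply_scaler(age_test, mu, sigma)  # reuses the training statistics
print(age_train_std, age_test_std)
```

The standardised training values have mean 0 and standard deviation 1, while the test values generally do not, which is expected.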

## XGBoost

In [None]:
@timer
def trainXgbModel(X_train, y_train, X_test, y_test, FEATS, ROUNDS) -> "xgb.Booster":
    """Train XGBoost model
    
    Arg:
        ROUNDS: Number of training rounds
    
    Return:
        Model object
    """
    
    params = {
                'eta': 0.02,
                'max_depth': 10,
                'min_child_weight': 7,
                'subsample': 0.6,
                'objective': 'binary:logistic',
                'eval_metric': 'error',
                'grow_policy': 'lossguide'
            }
    
    dtrain, dtest = xgb.DMatrix(X_train, y_train, feature_names=FEATS), xgb.DMatrix(X_test, y_test, feature_names=FEATS)

    EVAL_LIST = [(dtrain, "train"),(dtest, "test")]

    xgb_model = xgb.train(params,dtrain,ROUNDS,EVAL_LIST)
    
    return xgb_model

## 1D-CNN

In [None]:
@timer
def trainD1CnnModel(X_train, y_train):
    """Train 1D-CNN model
    
    Note:
        Validation uses the notebook-level X_test and y_test variables.
    
    Return:
        keras model obj
    """

    d1_cnn_model = keras.Sequential([
        layers.Dense(4096, activation='relu'),
        layers.Reshape((256, 16)),
        layers.BatchNormalization(),
        layers.Dropout(0.2),
        layers.Conv1D(filters=16, kernel_size=5, strides=1, activation='relu'),
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.Dense(16, activation='relu'),
        layers.Dense(1, activation='sigmoid'),
    ])

    d1_cnn_model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=3e-3),
        loss='binary_crossentropy',
        metrics=[keras.metrics.BinaryCrossentropy()]
    )

    early_stopping = keras.callbacks.EarlyStopping(
        patience=25,
        min_delta=0.001,
        restore_best_weights=True,
    )

    d1_cnn_model.fit(
        X_train, y_train,
        batch_size=10000,
        epochs=5000,
        callbacks=[early_stopping],
        validation_data=(X_test, y_test),
    )
    
    return d1_cnn_model

## TabNet

In [None]:
@timer
def tabNetPretrain(X_train):
    """Pretrain TabNet model
    
    Return:
        TabNet pretrainer obj
    """
    tabnet_params = dict(n_d=8, n_a=8, n_steps=3, gamma=1.3,
                             n_independent=2, n_shared=2,
                             seed=42, lambda_sparse=1e-3,
                             optimizer_fn=torch.optim.Adam,
                             optimizer_params=dict(lr=2e-2,
                                                   weight_decay=1e-5
                                                  ),
                             mask_type="entmax",
                             scheduler_params=dict(max_lr=0.05,
                                                   steps_per_epoch=int(X_train.shape[0] / 256),
                                                   epochs=200,
                                                   is_batch_level=True
                                                  ),
                             scheduler_fn=torch.optim.lr_scheduler.OneCycleLR,
                             verbose=10
                        )

    pretrainer = TabNetPretrainer(**tabnet_params)

    pretrainer.fit(
        X_train=X_train.to_numpy(),
        eval_set=[X_train.to_numpy()],
        max_epochs = 100,
        patience = 10, 
        batch_size = 256, 
        virtual_batch_size = 128,
        num_workers = 1, 
        drop_last = True)
    
    return pretrainer

In [None]:
@timer
def trainTabNetModel(X_train, y_train, pretrainer):
    """Train TabNet model
    
    Args:
        pretrainer: pretrained model. If not using this, use None
    
    Note:
        The eval set also uses the notebook-level X_test and y_test variables.
        
    Return:
        TabNet model obj
    """
    
    tabNet_model = TabNetClassifier(
                                   n_d=16,
                                   n_a=16,
                                   n_steps=4,
                                   gamma=1.9,
                                   n_independent=4,
                                   n_shared=5,
                                   seed=42,
                                   optimizer_fn = torch.optim.Adam,
                                   scheduler_params = {"milestones": [150,250,300,350,400,450],'gamma':0.2},
                                   scheduler_fn=torch.optim.lr_scheduler.MultiStepLR
                                  )

    tabNet_model.fit(
        X_train = X_train.to_numpy(),
        y_train = y_train.to_numpy(),
        eval_set=[(X_train.to_numpy(), y_train.to_numpy()),
                  (X_test.to_numpy(), y_test.to_numpy())],
        max_epochs = 100,
        batch_size = 256,
        patience = 10,
        from_unsupervised = pretrainer
        )
    
    return tabNet_model

## Evaluation

In [None]:
# Make predictions
def makePredictions(X_test, xgb_model, d1_cnn_model, tabNet_model):
    """Make predictions
    
    Return:
        Predictions from each model
    """
    
    y_xgb_pred = xgb_model.predict(xgb.DMatrix(X_test, feature_names=FEATS))
    y_d1_cnn_pred = d1_cnn_model.predict(X_test).reshape(1, -1)[0]
    y_tabNet_pred = tabNet_model.predict_proba(X_test.to_numpy())[:,1]
    
    return [y_xgb_pred, y_d1_cnn_pred, y_tabNet_pred]

In [None]:
# Evaluation
def evaluate(y_xgb_pred, y_d1_cnn_pred, y_tabNet_pred) -> None:
    """Evaluate the predictions
    
    Process:
        Print ROC AUC and F1 score of each model
    """
    
    preds = {"XGBoost":y_xgb_pred, "D1 CNN":y_d1_cnn_pred, "TabNet":y_tabNet_pred}

    for key in preds:
        print("The ROC AUC score of "+ str(key) +" model is " +
              str(round(roc_auc_score(y_test, preds[key]), 4))
             )

    for key in preds:
        print("The F1 score of "+ str(key) +" model at threshold = 0.27 is " +
              str(round(f1_score(y_test, np.where(preds[key] > 0.27, 1, 0)), 4))
             )
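
Unlike F1, ROC AUC needs no threshold: it can be read as the probability that a randomly chosen lapsed policy receives a higher score than a randomly chosen non-lapsed one. The sketch below computes it by brute force over all pairs to make that reading concrete. It is a toy illustration with made-up values and a hypothetical function name, and the notebook itself relies on `sklearn.metrics.roc_auc_score`.

```python
def roc_auc_pairwise(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. The cost grows with
    n_pos * n_neg, so this is only meant for toy-sized inputs.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("Need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy labels and scores (hypothetical).
y_true_toy = [1, 0, 1, 0, 0]
y_score_toy = [0.80, 0.35, 0.25, 0.10, 0.05]

auc = roc_auc_pairwise(y_true_toy, y_score_toy)
print(f"ROC AUC: {auc:.4f}")
```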

In [None]:
# Plot prediction distribution
def plotPredictionDistribution(y_xgb_pred, y_d1_cnn_pred, y_tabNet_pred) -> None:
    """Plot histogram of predicted probability distributions of each model
    """
    
    preds = {"XGBoost":y_xgb_pred, "D1 CNN":y_d1_cnn_pred, "TabNet":y_tabNet_pred}

    for key in preds:
        plt.hist(preds[key], bins = 100)
        plt.title(f"Predicted probability distribution of {key}")
        plt.show()

In [None]:
ROUNDS = 500

FEATS = [
         "CLAIM3YEARS_Y", "BUS_USE_Y", "AD_BUILDINGS_Y",
         "CONTENTS_COVER_Y", "P1_SEX_M", "P1_SEX_N", "BUILDINGS_COVER_Y", 
         "P1_POLICY_REFUSED_Y", "APPR_ALARM_Y", "APPR_LOCKS_Y", "FLOODING_Y", 
         "NEIGH_WATCH_Y", "SAFE_INSTALLED_Y", "SEC_DISC_REQ_Y", "SUBSIDENCE_Y", 
         "LEGAL_ADDON_POST_REN_Y", "HOME_EM_ADDON_PRE_REN_Y", 
         "HOME_EM_ADDON_POST_REN_Y", "GARDEN_ADDON_PRE_REN_Y",
         "GARDEN_ADDON_POST_REN_Y", "KEYCARE_ADDON_PRE_REN_Y", 
         "KEYCARE_ADDON_POST_REN_Y", "HP1_ADDON_PRE_REN_Y", "HP1_ADDON_POST_REN_Y", 
         "HP2_ADDON_PRE_REN_Y", "HP2_ADDON_POST_REN_Y", "HP3_ADDON_PRE_REN_Y", 
         "HP3_ADDON_POST_REN_Y", "MTA_FLAG_Y", "OCC_STATUS_LP",
         "OCC_STATUS_PH", "OCC_STATUS_UN", "OCC_STATUS_WD",
         "OWNERSHIP_TYPE_2.0", "OWNERSHIP_TYPE_3.0", "OWNERSHIP_TYPE_6.0", 
         "OWNERSHIP_TYPE_7.0", "OWNERSHIP_TYPE_8.0", "OWNERSHIP_TYPE_11.0", 
         "OWNERSHIP_TYPE_12.0", "OWNERSHIP_TYPE_13.0", "OWNERSHIP_TYPE_14.0", 
         "OWNERSHIP_TYPE_16.0", "OWNERSHIP_TYPE_17.0", 
         "OWNERSHIP_TYPE_18.0", "PROP_TYPE_2.0", "PROP_TYPE_3.0", "PROP_TYPE_4.0", 
         "PROP_TYPE_7.0", "PROP_TYPE_9.0", "PROP_TYPE_10.0", 
         "PROP_TYPE_16.0", "PROP_TYPE_17.0", "PROP_TYPE_18.0", "PROP_TYPE_19.0", 
         "PROP_TYPE_20.0", "PROP_TYPE_21.0", "PROP_TYPE_22.0", "PROP_TYPE_23.0", 
         "PROP_TYPE_24.0", "PROP_TYPE_25.0", "PROP_TYPE_26.0", "PROP_TYPE_27.0", 
         "PROP_TYPE_29.0", "PROP_TYPE_30.0", "PROP_TYPE_31.0", 
         "PROP_TYPE_32.0", "PROP_TYPE_37.0", "PROP_TYPE_39.0", 
         "PROP_TYPE_40.0", "PROP_TYPE_44.0", "PROP_TYPE_45.0", "PROP_TYPE_47.0", 
         "PROP_TYPE_48.0", "PROP_TYPE_51.0", "PROP_TYPE_52.0", "PROP_TYPE_53.0", 
         "PAYMENT_METHOD_NonDD", "PAYMENT_METHOD_PureDD", "P1_EMP_STATUS_C", 
         "P1_EMP_STATUS_E", "P1_EMP_STATUS_F", "P1_EMP_STATUS_H", "P1_EMP_STATUS_I", 
         "P1_EMP_STATUS_N", "P1_EMP_STATUS_R", "P1_EMP_STATUS_S", "P1_EMP_STATUS_U", 
         "P1_EMP_STATUS_V", "P1_MAR_STATUS_B", "P1_MAR_STATUS_C", "P1_MAR_STATUS_D", 
         "P1_MAR_STATUS_M", "P1_MAR_STATUS_N", "P1_MAR_STATUS_O", "P1_MAR_STATUS_P", 
         "P1_MAR_STATUS_S", "P1_MAR_STATUS_W", 
         "age", "property_age", "cover_length", "RISK_RATED_AREA_B_imputed", 
         "RISK_RATED_AREA_C_imputed", "MTA_FAP_imputed", "MTA_APRP_imputed",
         "SUM_INSURED_BUILDINGS", "NCD_GRANTED_YEARS_B", "SUM_INSURED_CONTENTS", 
         "NCD_GRANTED_YEARS_C", "SPEC_SUM_INSURED", "SPEC_ITEM_PREM", 
         "UNSPEC_HRP_PREM", "BEDROOMS", "MAX_DAYS_UNOCC", "LAST_ANN_PREM_GROSS"
        ]


print("Reading the data")
df = pd.read_csv("../input/home-insurance/home_insurance.csv")

print("Preprocessing the data")
X_train, y_train, X_test, y_test = splitData(df, FEATS)
X_train, X_test = standardiseNumericalFeats(X_train, X_test)

print("The ratio of lapse class in training set is " +
      str(round(y_train.sum()/len(y_train) * 100, 2)) +
      "%"
     )

print("The ratio of lapse class in test set is " +
      str(round(y_test.sum()/len(y_test) * 100, 2)) +
      "%"
     )

print("Training XGBoost model")
xgb_model = trainXgbModel(X_train, y_train, X_test, y_test, FEATS, ROUNDS)

print("Training MLP model")
d1_cnn_model = trainD1CnnModel(X_train, y_train)

print("Training TabNet model")
tabNet_model = trainTabNetModel(X_train, y_train, None)

print("Making predictions")
y_xgb_pred, y_d1_cnn_pred, y_tabNet_pred = makePredictions(X_test, xgb_model, d1_cnn_model, tabNet_model)

print("Evaluation of the model")
evaluate(y_xgb_pred, y_d1_cnn_pred, y_tabNet_pred)

print("Prediction distribution")
plotPredictionDistribution(y_xgb_pred, y_d1_cnn_pred, y_tabNet_pred)

## Feature importance

In [None]:
# XGBoost model
importance_xgb = pd.DataFrame.from_dict(xgb_model.get_score(importance_type="gain"),orient="index").sort_values(0, ascending = False)
importance_xgb.columns = ["importance"]
importance_xgb

In [None]:
# TabNet model
importance_tabNet = pd.DataFrame(tabNet_model.feature_importances_,index=X_train.columns).sort_values(0, ascending = False)
importance_tabNet.columns = ["importance"]
importance_tabNet