Neural networks have exploded in popularity in almost every data format, from images and videos to text. One data format where they have yet to make inroads is tabular data. For tabular data, the most popular algorithms continue to be tree-based boosting algorithms such as [XGBoost](https://en.wikipedia.org/wiki/XGBoost), [LightGBM](https://en.wikipedia.org/wiki/LightGBM) and [CatBoost](https://en.wikipedia.org/wiki/Catboost), or even simple linear algorithms like [Linear](https://en.wikipedia.org/wiki/Linear_regression)/[Logistic](https://en.wikipedia.org/wiki/Logistic_regression) Regression.

TabNet<sup>[\[1\]](#ref1)</sup> is an attempt to remedy this. It is an attention-based neural network introduced in 2019 by a Google Cloud AI team. In the paper, it is reported to outperform XGBoost, LightGBM and CatBoost on several datasets, such as [Forest Cover Type](https://archive.ics.uci.edu/ml/datasets/covertype) and [Poker Hand](https://archive.ics.uci.edu/ml/datasets/Poker+Hand). It is also designed to be more explainable than these other algorithms.

In this notebook, we will use TabNet to solve the [Spaceship Titanic](https://www.kaggle.com/competitions/spaceship-titanic) competition. We will use the [TabNet implementation for PyTorch by Dreamquark](https://github.com/dreamquark-ai/tabnet) in order to do this. We will also use [Optuna](https://github.com/optuna/optuna) to tune the hyperparameters of the model.


> Note: For more on TabNet, refer to this video by a co-author of the above-mentioned library: [Talks # 4: Sebastien Fischman - Pytorch-TabNet: Beating XGBoost on Tabular Data Using Deep Learning](https://www.youtube.com/watch?v=ysBaZO8YmX8).

In [None]:
# Install TabNet
!pip install pytorch-tabnet

# TabNet Architecture

Architecturally, TabNet consists of multiple encoder steps, as shown below:

![](https://drive.google.com/uc?export=view&id=1fM0jdeUB7pgj_Zg1z7jwmeM7ATZzWMF6)

Intuitively, each step selects a subset of the features available in the training data to use for its predictions. Unlike most other neural networks, TabNet makes this selection separately for each sample instead of once for the entire training data. Thus, the predictions for each sample can be generated from a different subset of features. According to the paper, this lets the model spend its capacity on the features that matter most for each sample.

There are three main layers that make up a step.

### Feature Transformer

![TabNet Feature Transformer](https://drive.google.com/uc?export=view&id=1TSyJwEYjAN5CT5cdTLLaVbcmmXXaNpL6)

The feature transformer layer generates an internal representation of the features.

Each layer consists of stacks of a fully-connected layer, a batch normalisation and a [GLU](https://paperswithcode.com/method/glu) activation function, with skip connections in between stacks.
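As a rough illustration of one such stack, here is a minimal NumPy sketch (not the library's implementation). The helper name `glu_block`, the layer sizes and the random weights are assumptions made for the example. The batch normalisation omits the learned scale and shift, and the skip connection is scaled by $\sqrt{0.5}$ as described in the paper.

```python
import numpy as np


def glu_block(x, weight, bias, eps=1e-5):
    """FC -> batch norm -> GLU. `weight` has shape (d_in, 2 * d_out)."""
    # Fully-connected layer producing twice the output width
    h = x @ weight + bias
    # Batch normalisation over the batch axis (no learned scale/shift here)
    h = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)
    # GLU: the first half is gated by the sigmoid of the second half
    values, gates = np.split(h, 2, axis=1)
    return values / (1.0 + np.exp(-gates))


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))  # 4 samples, 6 features (hypothetical sizes)
w1, b1 = rng.normal(size=(6, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(8, 16)), np.zeros(16)

h1 = glu_block(x, w1, b1)  # first stack, output shape (4, 8)
h2 = (h1 + glu_block(h1, w2, b2)) * np.sqrt(0.5)  # second stack with skip connection
print(h2.shape)
```

The first stack has no skip connection in this sketch because its input and output widths differ.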

Some of the stacks are shared across all the steps and some stacks are local to a step. That is, the feature transformer in each step uses some weights that are common across every step and some weights that are learnt specifically for that step. Because the shared weights are updated by every step, each step's internal representation carries some information learnt in the other steps.

Take note of the split block in the overall architecture. This block is used to split the internal representation between the next step (red arrow) and the overall output (blue arrow). Clearly, no such split is required in step $0$ and hence, there is only a red arrow.

### Attentive Transformer

![TabNet Attentive Transformer](https://drive.google.com/uc?export=view&id=17SAFPueqARehjkxD0UmRmLAXAT8c-DXx)

The attentive transformer layer takes the learned representation of features as input and outputs a mask which is then used to select the features that should be used for this step and the current sample. The mask can be thought of as consisting of probabilities that sum up to $1$.

Each layer consists of a fully-connected layer, a batch normalisation and a "sparse" softmax activation. The softmax activation is sparse since the generated mask has a lot of zeros in it, denoting that the features associated with those zeros are not used for generating predictions.
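To make the sparsity concrete, here is a minimal NumPy sketch of sparsemax for a single vector of scores. It is an illustration of the activation only, not the library's code, and the function name `sparsemax` and the example scores are assumptions for this sketch.

```python
import numpy as np


def sparsemax(z):
    """Project a 1-D score vector onto the probability simplex."""
    z = np.asarray(z, dtype=float)
    # Sort scores in descending order and take running sums
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    # The support is the largest k for which 1 + k * z_(k) > sum of the top-k scores
    support = 1 + k * z_sorted > cumsum
    k_max = k[support][-1]
    # Threshold that makes the surviving entries sum to 1
    tau = (cumsum[k_max - 1] - 1) / k_max
    return np.maximum(z - tau, 0.0)


mask = sparsemax([2.0, 1.0, 0.1])
print(mask, mask.sum())
```

Unlike softmax, which would give every feature a small positive weight, the lower scores here are clipped to exactly zero while the mask still sums to $1$.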

The prior scales shown in the diagram consist of information which denotes how much each feature has been used in the previous steps. This is taken into account while generating the mask. Mathematically, if the current step is $i$, $P[i]=\prod_{j=1}^{i}(\gamma-M[j])$, where $\gamma$ is a relaxation parameter and $M[j]$ is the mask in step $j$. When $\gamma=1$, a feature is effectively restricted to a single step: once a step $j$ selects it with $M[j]\approx1$, the factor $\gamma-M[j]\approx0$ suppresses it in all later steps. As $\gamma$ increases, the constraint relaxes so that the feature can be used for multiple steps.
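The effect of $\gamma$ can be checked with a small NumPy sketch of the product above. The helper name `prior_scales` and the two example masks are assumptions made for illustration.

```python
import numpy as np


def prior_scales(masks, gamma):
    """Compute the product of (gamma - M[j]) over the masks seen so far."""
    prior = np.ones_like(masks[0], dtype=float)
    for mask in masks:
        prior = prior * (gamma - mask)
    return prior


# Hypothetical masks from two steps over three features
masks = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.5, 0.5])]

strict = prior_scales(masks, gamma=1.0)
relaxed = prior_scales(masks, gamma=1.5)
print(strict)
print(relaxed)
```

With $\gamma=1$, the first feature (fully used in the first step) ends up with a prior of $0$, while with $\gamma=1.5$ it keeps a positive prior and can be selected again.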

### Feature Masking

The feature masking layer uses the generated mask to select a subset of features. It is an element-wise product between the original features and the generated mask.
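A tiny NumPy sketch of this product, with made-up feature values and masks for two samples:

```python
import numpy as np

# Hypothetical batch of 2 samples with 3 features each
features = np.array([[0.2, 1.5, -0.7],
                     [1.0, 0.3, 2.0]])

# One mask row per sample; each row sums to 1
mask = np.array([[0.0, 1.0, 0.0],
                 [0.6, 0.0, 0.4]])

# Element-wise product: features with a zero mask entry are dropped for that sample
masked = mask * features
print(masked)
```

Note that the two samples keep different features, which is the per-sample selection described earlier.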

# Explainability

An important goal behind TabNet is to be explainable. This is achieved by using the generated masks in each step. Using the masks, it is possible to visualize which features are being used the most for each sample at each step.
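As a simplified sketch of how per-step masks could be combined into one importance score per feature and sample, the snippet below sums hypothetical masks over steps and normalises each row. This unweighted sum is an assumption made for illustration; the paper weights each step by its contribution to the decision, which is not reproduced here.

```python
import numpy as np

# Hypothetical masks with shape (steps, samples, features)
masks = np.array([
    [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]],  # step 1
    [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],  # step 2
])

# Sum over steps, then normalise each sample's row so it sums to 1
importance = masks.sum(axis=0)
importance = importance / importance.sum(axis=1, keepdims=True)
print(importance)
```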

# Imports

In [None]:
import functools
import os
import random
import warnings

import matplotlib.pyplot as plt
import numpy as np
import optuna
import pandas as pd
import torch
from pytorch_tabnet.tab_model import TabNetClassifier
from sklearn import metrics


warnings.filterwarnings("ignore")

%matplotlib inline

In [None]:
def seed_everything(seed=42):
    torch.manual_seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    
seed_everything()

# Configuration

We will define a configuration class that will store some basic configuration that is used throughout the notebook.

In [None]:
class Config:
    DATA_DIR = "../input/spaceship-titanic-prepared-datasets"
    MAX_EPOCHS = 30
    N_TRIALS = 30
    PATIENCE = 20
    BATCH_SIZE = 1024
    NUM_WORKERS = 2
    
    DEFAULTS = {
        "n_d": 8,
        "n_a": 8,
        "n_steps": 3,
        "n_shared":  2,
        "cat_emb_dim": 1,
        "lr": 2e-2,
        "mask_type": "entmax",
        "lambda_sparse": 1e-3,
        "max_epochs": MAX_EPOCHS,
        "patience": PATIENCE,
    }
    
    @classmethod
    def filepath(cls, filename):
        return os.path.join(cls.DATA_DIR, filename)

# Load Datasets

> Note: This notebook uses the datasets as prepared in [Spaceship Titanic - Logistic Regression Baselines](https://www.kaggle.com/code/defcodeking/spaceship-titanic-logistic-regression-baselines). Link to dataset: [Spaceship Titanic Prepared Datasets](https://www.kaggle.com/datasets/defcodeking/spaceship-titanic-prepared-datasets).

We will load the datasets.

In [None]:
train_df = pd.read_csv(Config.filepath("train_prepared_both_le.csv"))
test_df = pd.read_csv(Config.filepath("test_prepared_both_le.csv"))

In [None]:
train_df.head()

In [None]:
test_df.head()

# Preprocessing

TabNet comes with support for categorical features out of the box. All we need to do is make sure they are label encoded and have an integer datatype. We will define a function which takes the training and test dataframes and label encodes all the one-hot encoded columns.

In [None]:
def preprocess_datasets(train_df, test_df):
    # Make copies so that original datasets remain unchanged
    train_df = train_df.copy()
    test_df = test_df.copy()
    
    # Drop Transported and kfold
    drop = ["Transported", "kfold"]
    dropped = train_df[drop].values
    train_df = train_df.drop(drop, axis=1)
    
    # Drop PassengerId
    passenger_id = test_df["PassengerId"].values
    test_df = test_df.drop("PassengerId", axis=1)
    
    # Add suffix to index and store indices
    # So that the dataframes can be merged and split
    train_df = train_df.rename("train_{}".format)
    test_df = test_df.rename("test_{}".format)
    
    tr_idx = train_df.index
    te_idx = test_df.index
    
    # Merge
    df = pd.concat([train_df, test_df])
    
    oh_cols = ["CabinDeck", "HomePlanet", "Destination", "GroupSize"]
    
    for oh_col in oh_cols:
        # Get all columns associated with the one-hot column
        columns = [column for column in df.columns if column.startswith(f"{oh_col}_")]
        
        # .idxmax() returns that column name which has the maximum value in the row
        values = df[columns].idxmax(axis=1)
        
        # Get all levels and make a mapping from level to index
        levels = values.value_counts().index
        mapping = {level: idx for idx, level in enumerate(levels)}
        
        # Add column with the mapping and specify type as int
        df[oh_col] = values.map(mapping).astype(int)
        
        # Drop one-hot columns
        df = df.drop(columns, axis=1)
        
    # Make sure other categorical features have the right type
    missing = (col for col in df.columns if col.endswith("_missing"))
    others = ["CryoSleep", "VIP", "Alone", "CabinNum", "GroupId", *missing]
    df[others] = df[others].astype(int)
        
    # Split and add dropped columns
    # .copy() avoids assigning into a slice of the merged dataframe
    train_df = df.loc[tr_idx, :].copy()
    train_df[drop] = dropped
    
    test_df = df.loc[te_idx, :].copy()
    test_df["PassengerId"] = passenger_id
    
    return train_df, test_df
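The one-hot-to-label step inside the loop above can be seen in isolation on a toy dataframe. The column names and values below are made up for illustration.

```python
import pandas as pd

# Hypothetical one-hot columns for a single categorical feature
toy = pd.DataFrame({
    "HomePlanet_Earth": [1, 0, 1],
    "HomePlanet_Mars": [0, 1, 0],
})

# For each row, pick the name of the column holding the 1
values = toy.idxmax(axis=1)

# Map each level to an integer, most frequent level first
levels = values.value_counts().index
mapping = {level: idx for idx, level in enumerate(levels)}

encoded = values.map(mapping).astype(int)
print(encoded.tolist())  # [0, 1, 0]
```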

## Prepare Datasets

We will use the function above to prepare the datasets.

In [None]:
# Prepare datasets
train_df, test_df = preprocess_datasets(train_df, test_df)

In [None]:
train_df.head()

In [None]:
test_df.head()

## Define Some Parameters Required By TabNet

In order to correctly handle categorical features, TabNet requires two arguments:

- `cat_idxs`: This should be a list of positions of categorical features in terms of indices in the training data.
- `cat_dims`: This should be a list of the number of classes in each categorical feature in the training data. It should have the same order as `cat_idxs`.

Thus, we will create these two lists.

In [None]:
cols = train_df.columns

# Create a list of all categorical columns in the dataframe
missing = (col for col in cols if col.endswith("_missing"))
categorical_features = [
    "CryoSleep",
    "VIP",
    "Alone",
    "CabinNum",
    "GroupId",
    "HomePlanet",
    "Destination",
    "CabinDeck",
    "GroupSize",
    "CabinSide",
    *missing
]

In [None]:
target = "Transported"
unused = ["kfold"]

# Create a list of all features
features = [col for col in cols if col not in unused + [target]]

# Find the indices of all categorical features
categorical_idx = [idx for idx, feature in enumerate(features) if feature in categorical_features]

# Get the number of classes of each categorical feature
# The concatenation is required since there are some categorical features
# With some levels present only in either train or test data
categorical_dims = [
    pd.concat([train_df[col], test_df[col]]).nunique() for col in cols if col in categorical_features
]


features, categorical_idx, categorical_dims
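As a toy illustration of the two lists, the sketch below builds them for a made-up set of four features and then checks that every encoded value is smaller than its dimension. The check reflects an assumption worth verifying on real data: each categorical column is embedded, so its label-encoded values should lie in the range `0` to `cat_dim - 1`. All names and values here are hypothetical.

```python
# Hypothetical feature order and label-encoded columns
features = ["Age", "VIP", "HomePlanet", "Spend"]
categorical_features = {"VIP", "HomePlanet"}
columns = {
    "Age": [23, 41, 35],
    "VIP": [0, 1, 0],
    "HomePlanet": [0, 2, 1],
    "Spend": [10.0, 0.0, 250.5],
}

# Positions of the categorical features within the feature list
cat_idxs = [idx for idx, name in enumerate(features) if name in categorical_features]

# Number of distinct levels per categorical feature, in the same order
cat_dims = [len(set(columns[features[idx]])) for idx in cat_idxs]

# Sanity check: every encoded value must be a valid index for its dimension
for idx, dim in zip(cat_idxs, cat_dims):
    assert 0 <= min(columns[features[idx]]) and max(columns[features[idx]]) < dim

print(cat_idxs, cat_dims)  # [1, 2] [2, 3]
```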

# Parameters For TabNet

The function below will be used to get the initialization parameters for TabNet. Using a function this way allows us to easily integrate Optuna into the pipeline. Take note of the number of hyperparameters TabNet has. Out of these, only a few need to be tuned. This is another benefit TabNet has over XGBoost and LightGBM, which have a relatively large number of hyperparameters.

Also take note of `device_name`. When set to `auto`, TabNet automatically detects a GPU and uses it if available.

In [None]:
TabNetClassifier()

In [None]:
def get_init_params(params, verbose=1, cat_idxs=None, cat_dims=None):
    return {
        "n_d": params.get("n_d", 8),
        # Fall back to n_d when n_a is absent (e.g. study.best_params only has n_d)
        "n_a": params.get("n_a", params.get("n_d", 8)),
        "n_steps": params.get("n_steps", 3),
        "n_shared": params.get("n_shared", 2),
        "cat_emb_dim": params.get("cat_emb_dim", 1),
        "optimizer_params": {"lr": params.get("lr", 2e-2)},
        "mask_type": params.get("mask_type", "sparsemax"),
        "lambda_sparse": params.get("lambda_sparse", 1e-3),
        "optimizer_fn": torch.optim.Adam,
        "cat_idxs": cat_idxs or [],
        "cat_dims": cat_dims or [],
        "verbose": verbose,
    }

# Training Loop

We will define the training loop as a function.

It takes the training and test dataframes, parameters for training, a list of features to be used, name of the target in the dataframe, a verbosity argument, and optional categorical indices and dimensions as required by TabNet. At the end, it returns the test predictions, training history, the average best score and data that can be used for explainability.

> Note: TabNet automatically selects the epoch with the best score after `.fit()` for making predictions.

In [None]:
def train(
    train_df,
    test_df,
    params,
    features,
    target,
    verbose=1,
    cat_idxs=None,
    cat_dims=None
):
    # Variables to store test predictions, history and total best score
    test_preds, history, total_best_score = [], {}, 0.0
    
    # Variable to store the data required for explainability
    explains = {}
    
    for fold in range(5):
        print(f"Fold {fold + 1}:")
        
        # Get the training and validation sets
        train = train_df[train_df["kfold"] != fold]
        val = train_df[train_df["kfold"] == fold]
        
        # Get the training features and labels
        X_train = train[features].values
        y_train = train[target].values
        
        # Get the validation features and labels
        X_val = val[features].values
        y_val = val[target].values
        
        # Get the init parameters
        init_params = get_init_params(params, verbose=verbose, cat_idxs=cat_idxs, cat_dims=cat_dims)
        
        # Create model
        clf = TabNetClassifier(**init_params)
        
        # Train model
        clf.fit(
            X_train=X_train,
            y_train=y_train,
            eval_set=[(X_train, y_train), (X_val, y_val)],
            eval_name=["train", "valid"],
            eval_metric=["accuracy"],
            max_epochs=params.get("max_epochs", Config.MAX_EPOCHS),
            patience=params.get("patience", Config.PATIENCE),
            batch_size=Config.BATCH_SIZE,
            virtual_batch_size=128,
            num_workers=Config.NUM_WORKERS,
            weights=1,
            drop_last=False
        )
        
        test = test_df[features].values
        
        # Get test predictions
        test_pred = clf.predict_proba(test)[:, 1]
        test_preds.append(test_pred)
        
        # Store data for explainability
        explains[f"fold_{fold}"] = clf.explain(test)
        
        # Store fold history
        history[f"fold_{fold}"] = clf.history
        
        # Get the best score and add it to total
        # Note: The best cost is actually the accuracy here
        # Therefore, Optuna should be set to maximize this
        total_best_score += clf.best_cost
        
        print("\n\n")
    
    # Get final test predictions and add to dataframe
    test_preds = np.vstack(test_preds)
    test_preds = np.mean(test_preds, axis=0)

    result_df = test_df.copy()
    result_df["preds"] = test_preds
    
    # Calculate average best score
    best_score = total_best_score / 5
    
    return result_df, history, best_score, explains
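The averaging of fold predictions at the end of the function can be seen on a toy example. The probabilities below are made up, and the `0.5` threshold matches the one used later for the submission.

```python
import numpy as np

# Hypothetical positive-class probabilities from two folds for three test samples
fold_preds = [
    np.array([0.9, 0.2, 0.6]),
    np.array([0.7, 0.4, 0.3]),
]

# Stack to shape (folds, samples) and average over the fold axis
mean_preds = np.mean(np.vstack(fold_preds), axis=0)

# Threshold the averaged probabilities to get class labels
labels = mean_preds >= 0.5
print(mean_preds, labels)
```

Averaging probabilities before thresholding lets a confident fold outweigh an uncertain one, which a majority vote over hard labels would not.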

# Optuna Objective

We will define the objective function that Optuna will optimize to find hyperparameters.

In [None]:
def objective(
    trial,
    train_df,
    test_df,
    features,
    target,
    cat_idxs=None,
    cat_dims=None
):
    n_d = trial.suggest_int("n_d", 8, 16, step=4)
    
    params = {
        "n_d": n_d,
        "n_a": n_d,
        "n_steps": trial.suggest_int("n_steps", 3, 5),
        "n_shared": trial.suggest_int("n_shared", 2, 5),
        "cat_emb_dim": trial.suggest_int("cat_emb_dim", 1, 5),
        "lr": trial.suggest_float("lr", 2e-4, 2e-2),
        "mask_type": trial.suggest_categorical("mask_type", ["entmax", "sparsemax"]),
        "lambda_sparse": trial.suggest_float("lambda_sparse", 1e-3, 3e-3, log=True),
        "patience": trial.suggest_int("patience", 5, 20, step=5),
        "max_epochs": trial.suggest_int("max_epochs", 5, 30, step=5),
    }
    
    _, _, score, _ = train(
        train_df=train_df,
        test_df=test_df,
        params=params,
        features=features,
        target=target,
        verbose=0,
        cat_idxs=cat_idxs,
        cat_dims=cat_dims,
    )
    
    return score

# Hyperparameter Search

We will define a function that will use Optuna to find the best hyperparameters and return the found hyperparameters.

In [None]:
def hyperparameter_search(
    train_df,
    test_df,
    features,
    target,
    cat_idxs=None,
    cat_dims=None,
    n_trials=Config.N_TRIALS
):  
    # Get the objective
    objective_ = functools.partial(
        objective,
        train_df=train_df,
        test_df=test_df,
        features=features,
        target=target,
        cat_idxs=cat_idxs,
        cat_dims=cat_dims,
    )
    
    # Create study
    sampler = optuna.samplers.TPESampler(seed=42)
    study = optuna.create_study(direction="maximize", sampler=sampler)
    
    # Enqueue a trial which uses the default values
    study.enqueue_trial(Config.DEFAULTS)
    
    # Optimize
    study.optimize(objective_, n_trials=n_trials)

    return study.best_params

# Training

We will first find the best hyperparameters.

In [None]:
print("Finding best hyperparameters...")
params = hyperparameter_search(
    train_df=train_df,
    test_df=test_df,
    features=features,
    target=target,
    cat_idxs=categorical_idx,
    cat_dims=categorical_dims,
)

In [None]:
print("Best hyperparameters:")
params

Now we will train the final model with the best parameters.

In [None]:
print("Training with best hyperparameters...")
results, history, _, explains = train(
    train_df=train_df,
    test_df=test_df,
    params=params,
    features=features,
    target=target,
    cat_idxs=categorical_idx,
    cat_dims=categorical_dims,
)

# Plots

## Loss Curve

We will visualize the loss curve.

In [None]:
f, axs = plt.subplots(3, 2, figsize=(10, 8))

for fold, ax in zip(range(5), axs.flatten()):
    fold_history = history[f"fold_{fold}"]
    
    ax.set_title(f"Fold {fold + 1}")
    ax.plot(fold_history["loss"])
    ax.set_xlabel("Epochs")
    ax.set_ylabel("Training Loss")
    
f.delaxes(axs.flatten()[-1])
plt.tight_layout()

## Accuracy Curve

We will visualize the accuracy curve.

In [None]:
f, axs = plt.subplots(3, 2, figsize=(10, 8))

for fold, ax in zip(range(5), axs.flatten()):
    fold_history = history[f"fold_{fold}"]
    
    ax.set_title(f"Fold {fold + 1}")
    ax.plot(fold_history["train_accuracy"], label="Train")
    ax.plot(fold_history["valid_accuracy"], label="Valid")
    ax.set_xlabel("Epochs")
    ax.set_ylabel("Accuracy")
    ax.legend()
    
f.delaxes(axs.flatten()[-1])
plt.tight_layout()

## Explainability

Below, we will visualize the explainability data. The y-axis is the sample number and the x-axis is the position of the feature in the dataset. A lighter color implies that that feature contributed more to the results for that sample.

In [None]:
n_steps = params["n_steps"]
# Change this number to see other folds
fold = 0

explain_matrix, masks = explains[f"fold_{fold}"]

fig, axs = plt.subplots(1, n_steps, figsize=(20, 20))

for i in range(n_steps):
    axs[i].imshow(masks[i][:50])
    axs[i].set_xlabel("Features")
    axs[i].set_ylabel("Samples")
    axs[i].set_title(f"Mask {i} for fold {fold + 1}")

# Submission

In [None]:
submission = results[["PassengerId", "preds"]]
submission = submission.rename(columns={"preds": "Transported"})
submission["Transported"] = submission["Transported"] >= 0.5
submission.head()

In [None]:
submission["Transported"].value_counts()

In [None]:
submission.to_csv("submission.csv", index=False)

# Conclusion

In this notebook, we explored how TabNet coupled with Optuna can be used to achieve good performance on the Spaceship Titanic dataset.

# References


<a id="#ref1">[1]</a> Sercan O. Arık and Tomas Pfister. 2019. [TabNet: Attentive Interpretable Tabular Learning](https://arxiv.org/pdf/1908.07442v5.pdf). Google Cloud AI.