# Introduction
The world is filled with tabular data. Every day, databases capture credit card transactions, login events, alarms, and customer loyalty card or basket information. This data is unique: it is mixed, and it often takes a combination of luck and domain expertise to find features useful in regression, classification, survival or ranking problems. Analysts may be working with hundreds or thousands of count, continuous and categorical data sources and derived features and, in many applications, are required for reasons of fairness and algorithmic bias to provide global and local explanations, at scale, of the features used in a given prediction. This isn't the world where a pixel is a pixel is a pixel, with edges and local structure, or just a series of word embeddings; there is a lot of garbage data for which there is no off-the-shelf pretrained model to help you.

For a long time, both in production in many AutoML services and in competition, Histogram Gradient Boosting has been popular for its robustness and explainability, though it is not without drawbacks. These drawbacks lie in handling incredibly sparse information efficiently, in multi-output problems and, more recently, in continuing training from pretrained image and audio backbones, which would allow the incorporation of other rich data sources. Recently, efforts by [Yandex on NODE](https://research.yandex.com/publications/241) and [Google on TabNet](https://arxiv.org/abs/1908.07442) have looked to push the state of the art, so that deep learning is not just the tool for images, text and audio but also for the wealth of tabular data that dominates many industries. That is not to say that deep learning has not shown impressive results in narrow use cases on tabular data: it has. Rather, these papers have looked to find general-purpose approaches to deep learning on tabular problems which meet the demands of practitioners in this field.

In a previous post I looked at existing TabNet implementations and compared them against popular off-the-shelf Boosting and MLP approaches. In this post I take a deeper look at the tricks used in TabNet and offer a detailed and flexible implementation for TensorFlow users looking to give this model a go.

# Tricks
Like many Deep Learning papers, this paper reaches into a bag of tricks. Many of these tricks take inspiration from breakthroughs in Natural Language Processing (NLP) and Image Recognition, but here they are used to improve the stability of our models when training on tabular data, to allow for some kind of feature selection, as in Tree Ensembles, and to fight overfitting.
  
__0. Learned Embeddings__  
So I nearly didn't include this one. Embeddings are so pervasive these days across deep learning domains that it is easy to forget their novelty or significance. Embeddings are a nice way of representing categorical data by learning a continuous vector for each word, user or product. This means that rather than feeding a one-hot encoding of our categories into the first layer of the model, we in effect add an extra 'prelayer' for categorical variables, so the first layer in the model sees dense continuous features rather than sparse discrete features. This has proven very effective in a host of applications, in particular in unsupervised modelling and transfer learning. In Natural Language these embeddings in many cases capture the semantic meaning of words, which [researchers can later visualize](https://projector.tensorflow.org/) to explore problems such as algorithmic bias.
![embedding-diagram](https://developers.google.com/machine-learning/crash-course/images/EmbeddingExample2-1.svg)
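
To make the 'prelayer' idea concrete, here is a minimal NumPy sketch (not part of the TabNet implementation in this post). It assumes a made-up table of 5 categories with 3-dimensional vectors, filled with random numbers where a real model would have learned weights, and shows that an embedding lookup gives the same result as multiplying a one-hot encoding by that table.

```python
import numpy as np

rng = np.random.default_rng(0)

n_categories, embedding_dim = 5, 3  # illustrative sizes, not taken from the dataset
# Embedding table: one row per category (random here, learned in practice).
embedding_table = rng.normal(size=(n_categories, embedding_dim))

category_ids = np.array([2, 0, 4])  # an integer-encoded categorical feature

# Route 1: one-hot encode, then multiply by the table.
one_hot = np.eye(n_categories)[category_ids]
via_matmul = one_hot @ embedding_table

# Route 2: look the rows up directly, which is what an embedding layer does.
via_lookup = embedding_table[category_ids]

assert np.allclose(via_matmul, via_lookup)
print(via_lookup.shape)  # (3, 3)
```

The lookup avoids ever materializing the sparse one-hot matrix, which matters once a column has thousands of levels.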
  
__1. Ghost Batch Normalization__  
To understand [Ghost Batch Normalization](https://arxiv.org/abs/1705.08741), we first need to introduce [Batch Normalization](https://www.youtube.com/watch?v=em6dfRxYkYU). I have linked a video by Andrew Ng which explains very well the advantages of this approach for model stability and overfitting. Using Ghost Batch Normalization we feed large batches into our model, which can help with speed and stability, but at each Batch Normalization layer we split the batch into many smaller virtual batches, normalize each separately, and then recombine them before feeding the next layer. This allows us to train faster and more stably on larger datasets, while still taking advantage of the regularizing effect of Batch Normalization on small batch sizes.
![types-of-bn](https://i.stack.imgur.com/DLwRc.png)
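
The split-normalize-recombine step can be sketched in a few lines of NumPy. This is only an illustration of the training-time behaviour under simplifying assumptions: the function name `ghost_batch_norm` is my own, the batch is assumed to divide evenly into virtual batches, and the learnable scale and shift and the moving averages of a real Batch Normalization layer are left out.

```python
import numpy as np

def ghost_batch_norm(x, virtual_batch_size, eps=1e-5):
    """Normalize each virtual batch of a 2-D array with its own mean and variance."""
    n = x.shape[0]
    assert n % virtual_batch_size == 0, "batch must split evenly into virtual batches"
    # (n, features) -> (n_virtual_batches, virtual_batch_size, features)
    chunks = x.reshape(n // virtual_batch_size, virtual_batch_size, -1)
    mean = chunks.mean(axis=1, keepdims=True)
    var = chunks.var(axis=1, keepdims=True)
    normalized = (chunks - mean) / np.sqrt(var + eps)
    # Recombine the virtual batches into one large batch.
    return normalized.reshape(x.shape)

rng = np.random.default_rng(0)
batch = rng.normal(loc=3.0, scale=2.0, size=(512, 4))
out = ghost_batch_norm(batch, virtual_batch_size=128)

# Each 128-row slice is normalized on its own statistics.
print(out[:128].mean(axis=0), out[:128].std(axis=0))
```

Because every virtual batch uses its own noisy statistics, the layer keeps the regularizing effect of small batches even though the optimizer sees one large batch.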
  
__2. Gated Linear Units (GLUs)__  
[Gated Linear Units](https://arxiv.org/pdf/1612.08083.pdf) can be thought of as a simple kind of attention mechanism. Similar to the gates in LSTMs, they are formed by taking two dense layer outputs, applying a sigmoid activation to one of them, and then multiplying the two together. This, the authors claim, serves to control the information passed up the hierarchy depending on what is relevant in a given context.
![GLU-diagram](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQN4fHkyIwoetD75ZvHYXNH-pmotF1DGIHikh3AzRJvRfAxI5o&s)
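
As a rough illustration, the gating operation looks like this in NumPy. The weight matrices `w_value` and `w_gate` are hypothetical random stand-ins for the two dense layers, and batch normalization is left out to keep the sketch short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(x, w_value, w_gate):
    """Gated linear unit: a linear 'value' path scaled element-wise by a sigmoid gate."""
    value = x @ w_value
    gate = sigmoid(x @ w_gate)  # each entry lies between 0 and 1
    return value * gate

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))        # 4 rows, 6 input features
w_value = rng.normal(size=(6, 3))  # hypothetical weights, 3 output units
w_gate = rng.normal(size=(6, 3))

out = glu(x, w_value, w_gate)
print(out.shape)  # (4, 3)
```

Since the gate never exceeds one, each output is a damped copy of the value path, and the network learns where to damp.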

__3. Attention Mechanisms__  
[Attention mechanisms can get very complicated](https://www.youtube.com/watch?v=iDulhoQ2pro) but in simple applications are very similar to the GLUs mentioned previously. In this paper the attention mechanism is used to perform a kind of feature selection: a model with a Sparsemax (or Entmax) activation function predicts a mask which selects only a subset of features for use in later layers of the network.
![attention-diagram](https://i.imgur.com/1152PYf.png)
  
__4. Sparsemax and Entmax__  
[Sparsemax](https://arxiv.org/abs/1602.02068) and [Entmax](https://arxiv.org/pdf/1905.05702.pdf) are a more extreme kind of softmax activation function. Their outputs still sum to one, but many entries are exactly zero, giving a sparse mask rather than one which merely spreads small weights everywhere. This has proven useful in conjunction with the growing interest in attention mechanisms in Neural Networks, and is also used in NODE to imitate decision trees.  
![entmax-diagram](https://github.com/deep-spin/entmax/raw/master/entmax.png)
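
To see the difference from softmax, here is a small NumPy sketch of sparsemax for a single 1-D score vector, following the sort-and-threshold procedure from the Sparsemax paper. It is for illustration only; the model code in this post relies on `tfa.activations.sparsemax` instead, and the example scores are made up.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of a 1-D score vector onto the probability simplex."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]           # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum   # which sorted entries stay non-zero
    k_max = k[support][-1]                # size of the support
    tau = (cumsum[k_max - 1] - 1) / k_max  # threshold subtracted from every score
    return np.maximum(z - tau, 0.0)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([1.0, 0.8, 0.1, -1.0])
print(softmax(scores))    # every entry strictly positive
print(sparsemax(scores))  # two non-zero entries (about 0.6 and 0.4), the rest exactly zero
```

Both outputs sum to one, but only sparsemax switches features off completely, which is what makes it usable as a feature selection mask.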
  
__5. Skip connections and Residual Networks__  
There are a lot of resources on skip connections. These approaches have been taken very far in ['ResNet'](https://www.youtube.com/watch?v=GWt6Fu05voI) and [Neural Differential Equations](https://arxiv.org/abs/1806.07366) but are rather simple. In a skip connection we add the output of a layer to its input, $Y = f(X) + X$. This has been shown to improve the stability of deep models as, in theory, layers only learn the changes that need to be made to the input, giving deep models the ability to represent shallow models where optimal.
![skip connection](https://miro.medium.com/max/570/1*D0F3UitQ2l5Q0Ak-tjEdJg.png)
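
The claim that a residual layer can fall back to a shallow model is easy to check numerically. In this NumPy sketch, `dense_relu` is a hypothetical stand-in for $f$, and setting its weights to zero turns the whole block into the identity mapping.

```python
import numpy as np

def dense_relu(x, w, b):
    """A stand-in for f: one dense layer followed by a ReLU."""
    return np.maximum(x @ w + b, 0.0)

def residual_block(x, w, b):
    """Y = f(X) + X: the layer only has to learn a correction to its input."""
    return dense_relu(x, w, b) + x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))

# With all-zero weights f(X) = 0, so the block reduces to the identity mapping.
w_zero, b_zero = np.zeros((5, 5)), np.zeros(5)
assert np.allclose(residual_block(x, w_zero, b_zero), x)

# With non-zero weights the output is the input plus a learned correction.
w = rng.normal(size=(5, 5))
y = residual_block(x, w, b_zero)
print(y.shape)  # (4, 5)
```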

  

  


# TabNet
![tabnet](https://github.com/titu1994/tf-TabNet/raw/master/images/TabNet.png?raw=true)
  
The TabNet architecture comprises a number of layers and blocks, which together describe the model. To understand how these pieces fit together we are going to implement them one at a time, slowly building up and joining the components together.

In [None]:
from typing import Optional, Union, Tuple

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
import tensorflow_addons as tfa
import pandas as pd
from sklearn.metrics import accuracy_score


@tf.function
def identity(x):
    return x

__GLU Block__  
The first component we need to build is our GLUBlock, which comprises two fully connected layers, two ghost batch normalization layers, our identity and sigmoid activation functions and a multiplication operation. Here we use TensorFlow 2.0 custom layer subclassing to make this layer easy to work with and reusable across the rest of our model. I have added a number of type hints to make this custom layer easy to follow and apply.

In [None]:
class GLUBlock(tf.keras.layers.Layer):
    def __init__(self, units: Optional[int] = None,
                 virtual_batch_size: Optional[int] = 128, 
                 momentum: Optional[float] = 0.02):
        super(GLUBlock, self).__init__()
        self.units = units
        self.virtual_batch_size = virtual_batch_size
        self.momentum = momentum
        
    def build(self, input_shape: tf.TensorShape):
        if self.units is None:
            self.units = input_shape[-1]
            
        self.fc_output = tf.keras.layers.Dense(self.units, 
                                               use_bias=False)
        self.bn_output = tf.keras.layers.BatchNormalization(virtual_batch_size=self.virtual_batch_size, 
                                                            momentum=self.momentum)
        
        self.fc_gate = tf.keras.layers.Dense(self.units, 
                                             use_bias=False)
        self.bn_gate = tf.keras.layers.BatchNormalization(virtual_batch_size=self.virtual_batch_size, 
                                                          momentum=self.momentum)
        
    def call(self, inputs: Union[tf.Tensor, np.ndarray], training: Optional[bool] = None):
        output = self.bn_output(self.fc_output(inputs), 
                                training=training)
        gate = self.bn_gate(self.fc_gate(inputs), 
                            training=training)
    
        return output * tf.keras.activations.sigmoid(gate) # GLU

__Feature Transformer Block__  
Here we again use subclassing to define a layer representing either the shared or the step-dependent part of the 'Feature Transformer' in the diagram above. This block comprises two GLU Blocks with a skip connection from the output of the first block to the output of the second. I have added a flag to add a skip connection over the first GLU Block, as this is only present in the decision-step-dependent block.

In [None]:
class FeatureTransformerBlock(tf.keras.layers.Layer):
    def __init__(self, units: Optional[int] = None, virtual_batch_size: Optional[int]=128, 
                 momentum: Optional[float] = 0.02, skip=False):
        super(FeatureTransformerBlock, self).__init__()
        self.units = units
        self.virtual_batch_size = virtual_batch_size
        self.momentum = momentum
        self.skip = skip
        
    def build(self, input_shape: tf.TensorShape):
        if self.units is None:
            self.units = input_shape[-1]
        
        self.initial = GLUBlock(units = self.units, 
                                virtual_batch_size=self.virtual_batch_size, 
                                momentum=self.momentum)
        self.residual =  GLUBlock(units = self.units, 
                                  virtual_batch_size=self.virtual_batch_size, 
                                  momentum=self.momentum)
        
    def call(self, inputs: Union[tf.Tensor, np.ndarray], training: Optional[bool] = None):
        initial = self.initial(inputs, training=training)
        
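        # optional skip connection over the first GLU block
        if self.skip:
            initial += inputs

        residual = self.residual(initial, training=training)
        
        # skip connection over the second GLU block, scaled by sqrt(0.5)
        return (initial + residual) * np.sqrt(0.5)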
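
One detail in the block above that is easy to miss is the `np.sqrt(0.5)` factor on the skip connection. A quick NumPy check suggests why it is there: adding two signals roughly doubles the variance, and the scaling brings it back. This sketch assumes the two branches behave like independent, unit-variance signals, which is only an approximation of what happens inside the network.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=100_000)  # stand-in for the first GLU block's output
b = rng.normal(size=100_000)  # stand-in for the second GLU block's output

plain_sum = a + b
scaled_sum = (a + b) * np.sqrt(0.5)

print(plain_sum.var())   # close to 2: the variance doubles
print(scaled_sum.var())  # close to 1: the variance is preserved
```

Keeping the variance roughly constant from block to block helps stop activations growing as more blocks are stacked.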

__Attention Block__  
This block is simple to implement: prior to the actual mask operation it involves just a dense layer fed into a batch normalization layer, followed by a sparsemax activation function. The major complication in this block is how to handle the TabNet prior, used to encourage orthogonal feature selection across decision steps. Here we just take it as an input to our layer and leave the updates to our priors to be handled later, in the encoder.

In [None]:
class AttentiveTransformer(tf.keras.layers.Layer):
    def __init__(self, units: Optional[int] = None, virtual_batch_size: Optional[int] = 128, 
                 momentum: Optional[float] = 0.02):
        super(AttentiveTransformer, self).__init__()
        self.units = units
        self.virtual_batch_size = virtual_batch_size
        self.momentum = momentum
        
    def build(self, input_shape: tf.TensorShape):
        if self.units is None:
            self.units = input_shape[-1]
            
        self.fc = tf.keras.layers.Dense(self.units, 
                                        use_bias=False)
        self.bn = tf.keras.layers.BatchNormalization(virtual_batch_size=self.virtual_batch_size, 
                                                     momentum=self.momentum)
        
    def call(self, inputs: Union[tf.Tensor, np.ndarray], priors: Optional[Union[tf.Tensor, np.ndarray]] = None, training: Optional[bool] = None) -> tf.Tensor:
        feature = self.bn(self.fc(inputs), 
                          training=training)
        if priors is None:
            output = feature
        else:
            output = feature * priors
        
        return tfa.activations.sparsemax(output)

__TabNetStep__  
In this TabNetStep block I take a number of design decisions to make implementation and reusability simpler. This layer takes as inputs our batch-normalized features, the output of our shared feature transformer, and the priors of the current step. It outputs the feature embedding at our split point, the masked features to be used in the shared feature transformer block of the next step, and the mask used in our attention operation. This mask will be important as we move through the layers, both in ensuring new features are selected across steps and in providing local and global feature attributions for each output. This block comprises our FeatureTransformerBlock and AttentiveTransformer block and starts to piece all our components together.

In [None]:
class TabNetStep(tf.keras.layers.Layer):
    def __init__(self, units: Optional[int] = None, virtual_batch_size: Optional[int]=128, 
                 momentum: Optional[float] =0.02):
        super(TabNetStep, self).__init__()
        self.units = units
        self.virtual_batch_size = virtual_batch_size
        self.momentum = momentum
        
    def build(self, input_shape: tf.TensorShape):
        if self.units is None:
            self.units = input_shape[-1]
        
        self.unique = FeatureTransformerBlock(units = self.units, 
                                              virtual_batch_size=self.virtual_batch_size, 
                                              momentum=self.momentum,
                                              skip=True)
        self.attention = AttentiveTransformer(units = input_shape[-1], 
                                              virtual_batch_size=self.virtual_batch_size, 
                                              momentum=self.momentum)
        
    def call(self, inputs, shared, priors, training=None) -> Tuple[tf.Tensor, tf.Tensor, tf.Tensor]:
        split = self.unique(shared, training=training)
        keys = self.attention(split, priors, training=training)
        masked = keys * inputs
        
        return split, masked, keys

__TabNetEncoder__  
I opted to present the entire model architecture as a layer. This makes it easier to reuse between use cases, as we can apply TabNet in unsupervised, self-supervised and multiple supervised domains without having to rewrite large tracts of code each time. You will see here that we accumulate our feature embeddings at each decision step, update our priors, which limit how often features are reused across steps, and compute the entropy loss used to encourage sparse masks. This makes for a complicated layer, but in many ways adds modularity which is very useful going forward.

In [None]:
class TabNetEncoder(tf.keras.layers.Layer):
    def __init__(self, units: int =1, 
                 n_steps: int = 3, 
                 n_features: int = 8,
                 outputs: int = 1, 
                 gamma: float = 1.3,
                 epsilon: float = 1e-8, 
                 sparsity: float = 1e-5, 
                 virtual_batch_size: Optional[int]=128, 
                 momentum: Optional[float] =0.02):
        super(TabNetEncoder, self).__init__()
        
        self.units = units
        self.n_steps = n_steps
        self.n_features = n_features
        self.virtual_batch_size = virtual_batch_size
        self.gamma = gamma
        self.epsilon = epsilon
        self.momentum = momentum
        self.sparsity = sparsity
        
    def build(self, input_shape: tf.TensorShape):            
        self.bn = tf.keras.layers.BatchNormalization(virtual_batch_size=self.virtual_batch_size, 
                                                     momentum=self.momentum)
        self.shared_block = FeatureTransformerBlock(units = self.n_features, 
                                                    virtual_batch_size=self.virtual_batch_size, 
                                                    momentum=self.momentum)        
        self.initial_step = TabNetStep(units = self.n_features, 
                                       virtual_batch_size=self.virtual_batch_size, 
                                       momentum=self.momentum)
        self.steps = [TabNetStep(units = self.n_features, 
                                 virtual_batch_size=self.virtual_batch_size, 
                                 momentum=self.momentum) for _ in range(self.n_steps)]
        self.final = tf.keras.layers.Dense(units = self.units, 
                                           use_bias=False)
    

    def call(self, X: Union[tf.Tensor, np.ndarray], training: Optional[bool] = None) -> Tuple[tf.Tensor, tf.Tensor, tf.Tensor]:
        entropy_loss = 0.
        encoded = 0.
        output = 0.
        importance = 0.
        prior = tf.reduce_mean(tf.ones_like(X), axis=0)
        
        B = prior * self.bn(X, training=training)
        shared = self.shared_block(B, training=training)
        _, masked, keys = self.initial_step(B, shared, prior, training=training)

        for step in self.steps:
            entropy_loss += tf.reduce_mean(tf.reduce_sum(-keys * tf.math.log(keys + self.epsilon), axis=-1)) / tf.cast(self.n_steps, tf.float32)
            prior *= (self.gamma - tf.reduce_mean(keys, axis=0))
            importance += keys
            
            shared = self.shared_block(masked, training=training)
            split, masked, keys = step(B, shared, prior, training=training)
            features = tf.keras.activations.relu(split)
            
            output += features
            encoded += split
            
        self.add_loss(self.sparsity * entropy_loss)
          
        prediction = self.final(output)
        return prediction, encoded, importance
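
The two least obvious lines in the encoder's loop are the prior update and the entropy term, so here is a small NumPy sketch of both in isolation. The helper names `update_prior` and `mask_entropy` and the example masks are made up for illustration, and the sketch works on a single mask vector, whereas the encoder above averages the masks over the batch before updating the prior.

```python
import numpy as np

def update_prior(prior, mask, gamma=1.3):
    """Shrink the prior of features that the current step relied on."""
    return prior * (gamma - mask)

def mask_entropy(mask, eps=1e-8):
    """Entropy of a mask: low when the mass sits on few features."""
    return float(np.sum(-mask * np.log(mask + eps)))

prior = np.ones(4)                     # every feature fully available at the start
mask = np.array([0.7, 0.3, 0.0, 0.0])  # hypothetical sparsemax mask from one step

prior = update_prior(prior, mask)
print(prior)  # heavily used features now have the smallest prior

sparse_mask = np.array([1.0, 0.0, 0.0, 0.0])
dense_mask = np.array([0.25, 0.25, 0.25, 0.25])
print(mask_entropy(sparse_mask), mask_entropy(dense_mask))
```

With `gamma` close to 1 a feature that was fully used is all but excluded from the next step, while larger values let steps share features. Penalizing the entropy pushes each individual mask towards the sparse case.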

# Data
We will be looking at a customer churn classification problem for broadband internet customers. The aim of this notebook is not to explore many complicated approaches to feature engineering but to explore the inner workings of TabNet. The main aim in choosing a dataset was for it to be reasonably large, at around 510125 observations, and to have mixed categorical, count and continuous data, as is common in tabular datasets. The only operation we performed to clean and resample the data was to ensure there was class balance.

In [None]:
CATEGORICAL_COLUMNS = ['line_stat', 'serv_type', 'serv_code',
                       'bandwidth', 'term_reas_code', 'term_reas_desc',
                       'with_phone_service', 'current_mth_churn']
NUMERIC_COLUMNS = ['contract_month', 'ce_expiry', 'secured_revenue', 'complaint_cnt']

df = (pd.read_csv('/kaggle/input/broadband-customers-base-churn-analysis/bbs_cust_base_scfy_20200210.csv')
      .assign(complaint_cnt=lambda df: pd.to_numeric(df.complaint_cnt, errors='coerce')))
# impute missing numeric values with the column mean, then standardize
df.loc[:, NUMERIC_COLUMNS] = (df.loc[:, NUMERIC_COLUMNS]
                              .astype(np.float32)
                              .pipe(lambda df: df.fillna(df.mean()))
                              .pipe(lambda df: (df - df.mean()) / df.std()))
df.loc[:, CATEGORICAL_COLUMNS] = df.loc[:, CATEGORICAL_COLUMNS].astype(str).applymap(str).fillna('')
df = df.groupby('churn').apply(lambda df: df.sample(df.churn.value_counts().min()))
df.head()

We will be taking a simple randomized train-test split approach to validation, though k-fold, stratified k-fold or backtesting may be more appropriate in a competition or production application.

In [None]:
from sklearn.model_selection import train_test_split

def get_labels(x: pd.Series) -> pd.Series:
    """
    Converts strings to unique ints for use in an embedding layer
    """
    labels, _ = pd.factorize(x)
    return pd.Series(labels, name=x.name, index=x.index)

X, E, y = (df
           .loc[:, NUMERIC_COLUMNS]
           .astype('float32')
           .join(pd.get_dummies(df.loc[:, CATEGORICAL_COLUMNS])),
           df
           .loc[:, NUMERIC_COLUMNS]
           .astype('float32')
           .join(df.loc[:, CATEGORICAL_COLUMNS].apply(get_labels).add(1).astype('int32')),
           df.churn == 'Y')

X_train, X_valid, E_train, E_valid, y_train, y_valid = train_test_split(X.to_numpy(), E, y.to_numpy(), train_size=250000, test_size=250000)

Here I wrote some simple helpers to convert our Pandas DataFrame to a TF Dataset for easy and flexible use with the DenseFeatures layer for embeddings in TF2.

In [None]:
def get_feature(x: pd.DataFrame, dimension=1) -> Union[tf.python.feature_column.NumericColumn, tf.python.feature_column.EmbeddingColumn]:
    if x.dtype == np.float32:
        return tf.feature_column.numeric_column(x.name)
    else:
        return tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_identity(x.name, num_buckets=x.max() + 1, default_value=0),
        dimension=dimension)
    
def df_to_dataset(X: pd.DataFrame, y: pd.Series, shuffle=False, batch_size=50000) -> tf.python.data.ops.dataset_ops.TensorSliceDataset:
    ds = tf.data.Dataset.from_tensor_slices((dict(X.copy()), y.copy()))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(X))
    ds = ds.batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)
    return ds

columns = [get_feature(f) for _, f in E_train.items()]
feature_column = tf.keras.layers.DenseFeatures(columns, trainable=True)

train, valid = df_to_dataset(E_train, y_train), df_to_dataset(E_valid, y_valid)

# Supervised Learning
The first application we will look at is supervised learning, the primary aim of TabNet. Here I tried to stick to a number of hyperparameter defaults found in other implementations, exploring only a smaller feature vector size for the purpose of visualization later on. In my experiments this greatly hampers the performance of the model but, in my implementation, greatly reduces the overall footprint of the model given the use of weight sharing across the steps.  
Here I use TensorFlow 2's model subclassing approach to make explanations and feature visualization easier later on. For production use, the subclassing API does have some limitations in how models can be serialized and deserialized, some of which have been addressed in the TensorFlow 2.2 and 2.3 releases.

In [None]:
class TabNetClassifier(tf.keras.Model):
    def __init__(self, outputs: int = 1, 
                 n_steps: int = 3, 
                 n_features: int = 8,
                 gamma: float = 1.3, 
                 epsilon: float = 1e-8, 
                 sparsity: float = 1e-5, 
                 feature_column: Optional[tf.keras.layers.DenseFeatures] = None, 
                 pretrained_encoder: Optional[tf.keras.layers.Layer] = None,
                 virtual_batch_size: Optional[int] = 128, 
                 momentum: Optional[float] = 0.02):
        super(TabNetClassifier, self).__init__()
        
        self.outputs = outputs
        self.n_steps = n_steps
        self.n_features = n_features
        self.feature_column = feature_column
        self.pretrained_encoder = pretrained_encoder
        self.virtual_batch_size = virtual_batch_size
        self.gamma = gamma
        self.epsilon = epsilon
        self.momentum = momentum
        self.sparsity = sparsity
        
        if feature_column is None:
            self.feature = tf.keras.layers.Lambda(identity)
        else:
            self.feature = feature_column
            
        if pretrained_encoder is None:
            self.encoder = TabNetEncoder(units=outputs, 
                                        n_steps=n_steps, 
                                        n_features = n_features,
                                        outputs=outputs, 
                                        gamma=gamma, 
                                        epsilon=epsilon, 
                                        sparsity=sparsity,
                                        virtual_batch_size=self.virtual_batch_size, 
                                        momentum=momentum)
        else:
            self.encoder = pretrained_encoder

    def forward(self, X: Union[tf.Tensor, np.ndarray], training: Optional[bool] = None) -> Tuple[tf.Tensor, tf.Tensor, tf.Tensor]:
        X = self.feature(X)
        # pass the training flag on explicitly so batch normalization behaves as intended
        output, encoded, importance = self.encoder(X, training=training)
          
        prediction = tf.keras.activations.sigmoid(output)
        return prediction, encoded, importance
    
    def call(self, X: Union[tf.Tensor, np.ndarray], training: Optional[bool] = None) -> tf.Tensor:
        prediction, _, _ = self.forward(X, training=training)
        return prediction
    
    def transform(self, X: Union[tf.Tensor, np.ndarray], training: Optional[bool] = None) -> tf.Tensor:
        _, encoded, _ = self.forward(X, training=training)
        return encoded
    
    def explain(self, X: Union[tf.Tensor, np.ndarray], training: Optional[bool] = None) -> tf.Tensor:
        _, _, importance = self.forward(X, training=training)
        return importance    

In [None]:
m = TabNetClassifier(outputs=1, n_steps=3, n_features = 2, feature_column=feature_column, virtual_batch_size=250)
m.compile(tf.keras.optimizers.Adam(learning_rate=0.025), tf.keras.losses.binary_crossentropy)
m.fit(train, epochs=100)

In [None]:
m.summary()

We report accuracy on both our training and validation samples to gauge the degree of overfitting.

In [None]:
tf_tabnet_y_pred = m.predict(train)

accuracy_score(y_train, tf_tabnet_y_pred > 0.5)

In [None]:
tf_tabnet_y_pred = m.predict(valid)

accuracy_score(y_valid, tf_tabnet_y_pred > 0.5)

We can visualize our model's feature space to analyze class separation.

In [None]:
import holoviews as hv
hv.extension('bokeh')

Z_train = m.transform(dict(E_train)).numpy()

hv.Scatter(pd.DataFrame(Z_train, columns=['Component 1', 'Component 2'])
 .assign(label=y_train.astype(str))
 .sample(1000),
  kdims='Component 1', vdims=['Component 2', 'label']).opts(color='label', cmap="Category10", title='Latent feature space')

We can use the learned masks to determine local and global feature importances, or attributions, for the model. I don't think the local attributions share the same sensitivity and interpretation as other model-agnostic or tree-based methods, but in previous experiments they were similar to those of common Boosting approaches.

In [None]:
A_train = m.explain(dict(E_train)).numpy()

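# DenseFeatures orders its outputs by feature column name, so label the bars in that order
feature_names = [c.name for c in sorted(columns, key=lambda c: c.name)]
pd.Series(A_train.mean(0), index=feature_names).plot.bar(title='Global Importances')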
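
To show what is being averaged here, this NumPy sketch aggregates per-step masks into local and global importances in the same unweighted way as the encoder above, which simply sums the masks over steps. The mask values are hypothetical.

```python
import numpy as np

# Hypothetical masks from two decision steps: shape (n_steps, n_rows, n_features).
masks = np.array([
    [[1.0, 0.0, 0.0], [0.6, 0.4, 0.0]],
    [[0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],
])

local_importance = masks.sum(axis=0)               # one attribution vector per row
global_importance = local_importance.mean(axis=0)  # averaged over rows
global_share = global_importance / global_importance.sum()

print(local_importance)
print(global_share)  # fractions summing to one
```

Normalizing to shares, as in the last step, is optional but makes the global plot easier to compare between models with different numbers of steps.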

## Unsupervised Pretraining
The TabNet authors see their approach as particularly valuable in unsupervised or self-supervised learning applications, where models can be pre-trained across large amounts of unlabelled data and then fine-tuned on labelled examples. To allow for such an approach, they define a decoder architecture which takes in the feature space of the encoder model and passes it through a number of steps of Feature Transformer Blocks and Dense Layers. This decoder then returns the original feature input of the encoder model as output for use in training.

In [None]:
class TabNetDecoder(tf.keras.layers.Layer):
    def __init__(self, units=1, 
                 n_steps = 3, 
                 n_features = 8,
                 outputs = 1, 
                 gamma = 1.3,
                 epsilon = 1e-8, 
                 sparsity = 1e-5, 
                 virtual_batch_size=128, 
                 momentum=0.02):
        super(TabNetDecoder, self).__init__()
        
        self.units = units
        self.n_steps = n_steps
        self.n_features = n_features
        self.virtual_batch_size = virtual_batch_size
        self.momentum = momentum
        
    def build(self, input_shape: tf.TensorShape):
        self.shared_block = FeatureTransformerBlock(units = self.n_features, 
                                                    virtual_batch_size=self.virtual_batch_size, 
                                                    momentum=self.momentum)
        self.steps = [FeatureTransformerBlock(units = self.n_features,
                                              virtual_batch_size=self.virtual_batch_size, 
                                              momentum=self.momentum) for _ in range(self.n_steps)]
        self.fc = [tf.keras.layers.Dense(units = self.units) for _ in range(self.n_steps)]
    

    def call(self, X: Union[tf.Tensor, np.ndarray], training: Optional[bool] = None) -> tf.Tensor:
        decoded = 0.
        
        for ftb, fc in zip(self.steps, self.fc):
            shared = self.shared_block(X, training=training)
            feature = ftb(shared, training=training)
            output = fc(feature)
            
            decoded += output
        return decoded

Where this encoder-decoder approach differs from many autoencoders is in its use of masks. The encoder is fed inputs in which certain features are masked with zero or their mean, and the model is then required to predict these masked features. This is complicated to implement, so I naively decided to define an internal loss function to perform this operation, with a dummy loss function used at model compile time, rather than try to handle this in TF Data. This leaves room for further work, but does make the model easier for novices unfamiliar with custom loss functions and TF Data.
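
The masking objective itself is small enough to sketch in NumPy before it gets wrapped in a Keras model. The data, the mask and the helper `masked_mse` are made up for illustration, and the helper averages over the hidden entries only, which is one reasonable choice of normalization rather than the only one.

```python
import numpy as np

x = np.array([[0.5, -1.0, 2.0, 0.0],
              [1.5, 0.2, -0.7, 1.0]])  # standardized features

# Binary keep-mask: 1 = visible to the encoder, 0 = hidden.
keep = np.array([[1.0, 0.0, 1.0, 1.0],
                 [1.0, 1.0, 0.0, 1.0]])
corrupted = x * keep                   # hidden entries are replaced by zero

def masked_mse(target, reconstruction, keep):
    """Mean squared error over the hidden entries only."""
    hidden = 1.0 - keep
    return float(np.sum(hidden * (target - reconstruction) ** 2) / max(hidden.sum(), 1.0))

# A perfect reconstruction has zero loss; echoing the corrupted input does not,
# because the hidden entries were zeroed out.
print(masked_mse(x, x, keep))
print(masked_mse(x, corrupted, keep))
```

Scoring only the hidden entries stops the model from earning credit for copying the visible inputs straight through.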

In [None]:
class TabNetAutoencoder(tf.keras.Model):
    def __init__(self, outputs: int = 1, 
                 inputs: int = 12,
                 n_steps: int  = 3, 
                 n_features: int  = 8,
                 gamma: float = 1.3, 
                 epsilon: float = 1e-8, 
                 sparsity: float = 1e-5, 
                 feature_column: Optional[tf.keras.layers.DenseFeatures] = None, 
                 virtual_batch_size: Optional[int] = 128, 
                 momentum: Optional[float] = 0.02):
        super(TabNetAutoencoder, self).__init__()
        
        self.outputs = outputs
        self.inputs = inputs
        self.n_steps = n_steps
        self.n_features = n_features
        self.feature_column = feature_column
        self.virtual_batch_size = virtual_batch_size
        self.gamma = gamma
        self.epsilon = epsilon
        self.momentum = momentum
        self.sparsity = sparsity
        
        if feature_column is None:
            self.feature = tf.keras.layers.Lambda(identity)
        else:
            self.feature = feature_column
            
        self.encoder = TabNetEncoder(units=outputs, 
                                    n_steps=n_steps, 
                                    n_features = n_features,
                                    outputs=outputs, 
                                    gamma=gamma, 
                                    epsilon=epsilon, 
                                    sparsity=sparsity,
                                    virtual_batch_size=self.virtual_batch_size, 
                                    momentum=momentum)
        
        self.decoder = TabNetDecoder(units=inputs, 
                                     n_steps=n_steps, 
                                     n_features = n_features,
                                     virtual_batch_size=self.virtual_batch_size, 
                                     momentum=momentum)
        
        self.bn = tf.keras.layers.BatchNormalization(virtual_batch_size=self.virtual_batch_size, 
                                                     momentum=momentum)
        
        self.do = tf.keras.layers.Dropout(0.25)

    def forward(self, X: Union[tf.Tensor, np.ndarray], training: Optional[bool] = None) -> Tuple[tf.Tensor, ...]:
        X = self.feature(X)
        X = self.bn(X)

        # training mask: dropout applied to a tensor of ones gives 0 for hidden
        # features and 1 / (1 - rate) for kept ones; outside training it is all ones
        M = self.do(tf.ones_like(X), training=training)
        D = X*M

        # encoder
        output, encoded, importance = self.encoder(D)
        prediction = tf.keras.activations.sigmoid(output)

        return prediction, encoded, importance, X, M
    
    def call(self, X: Union[tf.Tensor, np.ndarray], training: Optional[bool] = None) -> tf.Tensor:
        # encode
        prediction, encoded, _, X, M = self.forward(X, training=training)
        # reconstruction target: equals X wherever the feature was hidden (M == 0)
        T = X * (1 - M)

        # decode
        reconstruction = self.decoder(encoded)

        # loss: squared error on the hidden features only; kept features contribute zero
        loss = tf.reduce_mean(tf.where(tf.equal(M, 0.), tf.square(T-reconstruction), tf.zeros_like(reconstruction)))

        self.add_loss(loss)

        return prediction
    
    def transform(self, X: Union[tf.Tensor, np.ndarray], training: Optional[bool] = None) -> tf.Tensor:
        # latent representation produced by the encoder
        _, encoded, _, _, _ = self.forward(X, training=training)
        return encoded

    def explain(self, X: Union[tf.Tensor, np.ndarray], training: Optional[bool] = None) -> tf.Tensor:
        # per-row feature importances produced by the encoder
        _, _, importance, _, _ = self.forward(X, training=training)
        return importance
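
The masking idea behind this autoencoder can be illustrated outside TensorFlow. The following is a minimal NumPy sketch of a masked reconstruction objective, where the error is measured only on the features hidden from the encoder. It is an illustration of the general technique under my own assumptions, not a line-for-line copy of the class above: the helper names `dropout_mask` and `masked_reconstruction_loss` are hypothetical and the data is random.

```python
import numpy as np

def dropout_mask(shape, rate, rng):
    """Inverted-dropout mask: 0 for hidden entries, 1 / (1 - rate) for kept ones."""
    keep = rng.random(shape) >= rate
    return keep.astype(np.float64) / (1.0 - rate)

def masked_reconstruction_loss(x, reconstruction, mask):
    """Squared error on hidden entries (mask == 0), averaged over all entries."""
    hidden = mask == 0.0
    squared_error = np.where(hidden, (x - reconstruction) ** 2, 0.0)
    return squared_error.mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 12))            # 4 rows, 12 features (made-up data)
mask = dropout_mask(x.shape, rate=0.25, rng=rng)
encoder_input = x * mask                # hidden features are zeroed out

# A perfect reconstruction has zero loss; a constant-zero one is penalised
# only at the positions that were hidden.
perfect = masked_reconstruction_loss(x, x, mask)
zeros = masked_reconstruction_loss(x, np.zeros_like(x), mask)
```

Because kept entries never enter the loss, the model gets no credit for simply copying its input and has to infer hidden features from the visible ones.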

In [None]:
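@tf.function
def dummy_loss(y, t):
    # Placeholder for compile(): the reconstruction loss is registered inside
    # the model via add_loss(), so this term contributes nothing.
    return 0.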

In [None]:
ae = TabNetAutoencoder(outputs=1, inputs=12, n_steps=3, n_features = 2, feature_column=feature_column, virtual_batch_size=250)
ae.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.005), loss=dummy_loss)
ae.fit(train, epochs=100)

In [None]:
ae.summary()

We can visualize the latent space learned by the model, which is useful in unsupervised applications.

In [None]:
import holoviews as hv
hv.extension('bokeh')

Z_train = ae.transform(dict(E_train)).numpy()

hv.Scatter(pd.DataFrame(Z_train, columns=['Component 1', 'Component 2'])
 .assign(label=y_train.astype(str))
 .sample(1000),
  kdims='Component 1', vdims=['Component 2', 'label']).opts(color='label', cmap="Category10", title='Latent feature space')

Unlike many unsupervised autoencoder models, we get feature importances directly from the model, without having to rely on model-agnostic or gradient-based explanations.

In [None]:
AE_train = ae.explain(dict(E_train)).numpy()

pd.Series(AE_train.mean(0), index=E.columns).plot.bar(title='Global Importances')
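
The bar chart averages the per-row importances returned by `explain` into one score per feature. Here is a small sketch of that aggregation on made-up numbers. The rescaling so that scores sum to one and the sorting are my additions for readability, and `global_importances` and the feature names are hypothetical.

```python
import numpy as np

def global_importances(local_importances, feature_names):
    """Average per-row importances, rescale them to sum to one, and rank them."""
    mean_importance = np.asarray(local_importances, dtype=float).mean(axis=0)
    total = mean_importance.sum()
    if total > 0:
        mean_importance = mean_importance / total
    # sort features from most to least important
    order = np.argsort(mean_importance)[::-1]
    return [(feature_names[i], float(mean_importance[i])) for i in order]

# made-up importances for 3 rows and 3 hypothetical features
local = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.4, 0.5, 0.1]])
ranking = global_importances(local, ["age", "income", "tenure"])
```

The unaveraged rows are the local explanations: each row says which features the model attended to for that single prediction.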

We will now use this pretrained encoder layer in our next fine-tuning experiment.

In [None]:
ae.layers[1]

# Self-supervised Fine-tuning
Given the major motivation of this paper, the performance and flexibility of this approach in fine-tuning appear critical. Here we initialize our classifier with our pretrained TabNet encoder layer and continue training on our labelled data.

In [None]:
pm = TabNetClassifier(outputs=1, n_steps=3, n_features = 2, feature_column=feature_column, pretrained_encoder=ae.layers[1], virtual_batch_size=250)
pm.compile(tf.keras.optimizers.Adam(learning_rate=0.05), tf.keras.losses.binary_crossentropy)
pm.fit(train, epochs=150) 

In my experiments pretraining had limited impact on the performance of the model and exhibited very different training characteristics, requiring a much higher learning rate. This can be typical in transfer learning applications, and it is the reason many researchers have experimented with partial weight reinitialization, adding some noise back to the model at this step.
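
One way to read partial weight reinitialization is to add Gaussian noise to each pretrained weight array before fine-tuning. The sketch below is an assumption about how one might do this, not something taken from the experiments in this post. `perturb_weights` and its `noise_scale` parameter are hypothetical. It works on a plain list of NumPy arrays, which is the format a Keras layer's `get_weights()` returns and `set_weights()` accepts.

```python
import numpy as np

def perturb_weights(weights, noise_scale=0.1, seed=0):
    """Return copies of the weight arrays with Gaussian noise added.

    The noise for each array is scaled by that array's own standard
    deviation, so every array is disturbed in proportion to its spread.
    """
    rng = np.random.default_rng(seed)
    perturbed = []
    for w in weights:
        w = np.asarray(w)
        noise = rng.normal(0.0, noise_scale * w.std(), size=w.shape)
        perturbed.append((w + noise).astype(w.dtype))
    return perturbed

# stand-in for a layer's get_weights(): a kernel and an all-zero bias
pretrained = [np.linspace(-1.0, 1.0, 12).reshape(4, 3), np.zeros(3)]
noisy = perturb_weights(pretrained, noise_scale=0.1)
```

With a Keras layer this would look like `layer.set_weights(perturb_weights(layer.get_weights()))`, applied to the pretrained encoder before calling `fit`. Note that an array with zero spread, such as the all-zero bias here, is left unchanged by this particular scaling choice.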

In [None]:
tf_tabnet_y_pred = pm.predict(train)

accuracy_score(y_train, tf_tabnet_y_pred > 0.5)

In [None]:
tf_tabnet_y_pred = pm.predict(valid)

accuracy_score(y_valid, tf_tabnet_y_pred > 0.5)
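
The comparison `> 0.5` turns the predicted probabilities into hard class labels before they are scored. A NumPy-only sketch of what this computes, assuming the classifier outputs probabilities of shape `(n, 1)`; the numbers are made up and `thresholded_accuracy` is a hypothetical helper:

```python
import numpy as np

def thresholded_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of rows where (probability > threshold) matches the label."""
    y_true = np.asarray(y_true).reshape(-1).astype(bool)
    y_pred = np.asarray(y_prob).reshape(-1) > threshold
    return float(np.mean(y_pred == y_true))

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([[0.9], [0.2], [0.4], [0.7], [0.6]])  # shape (n, 1)
acc = thresholded_accuracy(y_true, y_prob)  # 3 of 5 rows match
```

Comparing this number on the training and validation sets, as the two cells above do, gives a quick read on how much the fine-tuned model overfits.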

Again, we visualize the latent feature representation of our model and the learned feature importances. I think this is an interesting application of TabNet which may, in time, make it an important tool in solving particular tabular data problems.

In [None]:
Z_train = pm.transform(dict(E_train)).numpy()

hv.Scatter(pd.DataFrame(Z_train, columns=['Component 1', 'Component 2'])
 .assign(label=y_train.astype(str))
 .sample(1000),
  kdims='Component 1', vdims=['Component 2', 'label']).opts(color='label', cmap="Category10", title='Latent feature space')

In [None]:
AE_train = pm.explain(dict(E_train)).numpy()

pd.Series(AE_train.mean(0), index=E.columns).plot.bar(title='Global Importances')

# Conclusion
TabNet is exciting. I think it is too early to know its impact. For now we will have to stay tuned to winners' posts and industry blog posts to see its usability and real-world performance. I think, in theory, this approach may unlock new domains or approaches to modelling on large mixed datasets. I think, as with many deep learning approaches, there are some challenges in automation which require resilient infrastructure.