# TabNet: A neural network designed for tabular data

[TabNet](https://arxiv.org/pdf/1908.07442.pdf) brings deep learning to tabular data. It was developed by researchers at Google Cloud AI and is reported to achieve state-of-the-art performance on a number of benchmark datasets.
This notebook is a simple example of binary classification using the [PyTorch implementation](https://pypi.org/project/pytorch-tabnet/) for the Kaggle [Jane Street Market Prediction](https://www.kaggle.com/c/jane-street-market-prediction) competition.

**Note:** Version 3.0.0 of pytorch-tabnet has just been released, and it can be installed on Kaggle with the internet turned off (essential for this competition) via the following dataset:
[Official version of pytorch-tabnet release](https://www.kaggle.com/optimo/officialpytorchtabnet).

In [None]:
import numpy  as np
import pandas as pd

# install datatable (without using the Internet)
!pip install ../input/python-datatable/datatable-0.11.0-cp37-cp37m-manylinux2010_x86_64.whl > /dev/null
import datatable as dt

# install TabNet 3.0.0
!pip install ../input/officialpytorchtabnet/pytorch_tabnet-3.0.0-py3-none-any.whl > /dev/null
from pytorch_tabnet.tab_model import TabNetClassifier

### Read in the training data
The `train.csv` file is large (5.77 GB), so we use [datatable](https://datatable.readthedocs.io/en/latest/) to speed up loading:

In [None]:
train_data = dt.fread('../input/jane-street-market-prediction/train.csv').to_pandas()

### Data cleaning
Drop the rows that have zero `weight`

In [None]:
train_data = train_data.query('weight > 0').reset_index(drop = True)

### Create our `action`

In [None]:
# action is 1 when resp is strictly positive, otherwise 0
train_data['action'] = (train_data['resp'] > 0).astype(int)
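
As a quick check of the two steps above (dropping zero-`weight` rows, then deriving `action` from `resp`), here is a minimal sketch on a toy frame. The values are invented for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy frame with the two columns used above; the values are invented.
toy = pd.DataFrame({
    "weight": [0.0, 1.5, 2.0, 0.3],
    "resp":   [0.01, -0.02, 0.03, 0.0],
})

# Keep only rows with positive weight, as in the cleaning step.
toy = toy.query("weight > 0").reset_index(drop=True)

# A row is labelled 1 only when its resp is strictly positive.
toy["action"] = (toy["resp"] > 0).astype(int)

print(toy)
```

Note that a `resp` of exactly zero is labelled 0, because the comparison is strict.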

### Use all 130 features

In [None]:
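# feature numbers 0..129
all_features    = list(range(130))
# column positions of the features: offset by 7 in the training frame
# and by 1 in the test frames
train_features  = [x+7 for x in all_features]
test_features   = [x+1 for x in all_features]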
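
The positional offsets above depend on the exact column order of each frame. As a hedged alternative, the features could be selected by name instead. The sketch below assumes that the feature columns share a `feature_` prefix; the toy frame and its column list are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical miniature of a training frame: a few metadata columns
# followed by feature_* columns.
toy_columns = ["date", "weight", "resp", "feature_0", "feature_1", "feature_2"]
toy_train = pd.DataFrame([[0, 1.0, 0.01, 0.1, 0.2, 0.3]], columns=toy_columns)

# Select by name so the result does not depend on column order.
feature_names = [c for c in toy_train.columns if c.startswith("feature_")]

# Positional indices equivalent to the name-based selection.
feature_positions = [toy_train.columns.get_loc(c) for c in feature_names]

print(feature_names)
print(feature_positions)
```

With name-based selection the same list can be applied to both the training and the test frames via `df[feature_names]`, provided both contain those columns.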

### Do something about the missing values
Here we shall simply fill them with zeros

In [None]:
X_train = train_data.iloc[ : , train_features].fillna(0)
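
Zero filling is the simplest option. For comparison, the sketch below contrasts it with mean imputation on a toy frame. Mean imputation is a different technique from the one used in this notebook, and the toy values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values; the numbers are invented.
toy = pd.DataFrame({
    "feature_0": [1.0, np.nan, 3.0],
    "feature_1": [np.nan, 4.0, 8.0],
})

# Zero filling, as used in this notebook.
zero_filled = toy.fillna(0)

# Alternative: fill each column with its mean (NaNs are skipped when averaging).
# The means should come from the training data only and be reused at test time.
train_means = toy.mean()
mean_filled = toy.fillna(train_means)

print(zero_filled)
print(mean_filled)
```

Whichever rule is chosen, the same rule has to be applied to the test rows at prediction time.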

### Make our `X_train` and `y_train`

In [None]:
X_train = X_train.to_numpy()
y_train = train_data['action'].to_numpy()

### Train TabNet
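We train for only 2 epochs as a demonstration; serious training takes considerably longer. Note that no `eval_set` is passed to `fit`, so `patience` and `eval_metric` have no effect in this run.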

In [None]:
%%time

classifier = TabNetClassifier(verbose=1,seed=42)

classifier.fit(X_train=X_train, y_train=y_train,
               patience=1,
               max_epochs=2,
               eval_metric=['auc'])
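
In pytorch-tabnet, `eval_metric` and `patience` act on the sets passed through the `eval_set` argument of `fit`. A minimal sketch of a time-based holdout follows. It assumes the training frame has a `date` column, and `split_by_date` is a hypothetical helper that is not part of any library.

```python
import numpy as np

def split_by_date(dates, holdout_fraction=0.2):
    """Return boolean masks (train, valid) that put the latest dates in valid."""
    dates = np.asarray(dates)
    unique_dates = np.unique(dates)  # sorted ascending
    n_holdout = max(1, int(round(len(unique_dates) * holdout_fraction)))
    cutoff = unique_dates[-n_holdout]  # first held-out date
    valid_mask = dates >= cutoff
    return ~valid_mask, valid_mask

# Toy example: 10 trading days with 3 rows each.
toy_dates = np.repeat(np.arange(10), 3)
train_mask, valid_mask = split_by_date(toy_dates, holdout_fraction=0.2)
print(train_mask.sum(), valid_mask.sum())

# With the real data (assuming `train_data` keeps its `date` column), the masks
# could be passed to TabNet like this:
# train_mask, valid_mask = split_by_date(train_data["date"].to_numpy())
# classifier.fit(X_train[train_mask], y_train[train_mask],
#                eval_set=[(X_train[valid_mask], y_train[valid_mask])],
#                eval_metric=["auc"], patience=1, max_epochs=2)
```

Holding out the most recent dates, rather than a random sample of rows, keeps the validation rows later in time than the training rows.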

### Save our trained TabNet model
As mentioned, training may take some time, so this would be the natural end of the training notebook.

In [None]:
saved_filename = classifier.save_model('JaneStreet_TabNet_model')

The following would be the main content of the scoring notebook, where we reload the trained model and create our `submission.csv`.

### Read in the saved TabNet model
(Here commented out as this is an *all-in-one* notebook)

In [None]:
# classifier = TabNetClassifier()
# classifier.load_model('../input/your_training_notebook/JaneStreet_TabNet_model.zip')

### Run the model on the test data

In [None]:
import janestreet
env = janestreet.make_env()
iter_test = env.iter_test()

# the loop assumes that each test_df holds a single row
for (test_df, sample_prediction_df) in iter_test:
    test_weight = test_df['weight'].iloc[0]
    if test_weight > 0:
        X_test = test_df.iloc[:, test_features].fillna(0).to_numpy()
        # probability of the positive class for the single test row
        proba = classifier.predict_proba(X_test)[0, 1]
        sample_prediction_df.action = int(proba > 0.5)
    else:
        # zero weight: skip the model and submit 0
        sample_prediction_df.action = 0
    env.predict(sample_prediction_df)
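
The thresholding step in the loop above can also be written element-wise, so that it works for a single row or for a batch. The helper name `proba_to_action` and the example probabilities are hypothetical.

```python
import numpy as np

def proba_to_action(proba, threshold=0.5):
    """Map positive-class probabilities to 0/1 actions, element-wise."""
    proba = np.asarray(proba, dtype=float)
    return (proba > threshold).astype(int)

# Works for a single row as well as for a batch.
single = proba_to_action([0.73])
batch = proba_to_action([0.2, 0.5, 0.9])
print(single, batch)
```

A probability exactly equal to the threshold maps to 0, matching the strict `> 0.5` comparison used in the loop.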

# Related links
* [Sercan O. Arik and Tomas Pfister "TabNet: Attentive Interpretable Tabular Learning", arXiv:1908.07442 (2019)](https://arxiv.org/pdf/1908.07442.pdf)
* [TabNet on AI Platform: High-performance, Explainable Tabular Learning](https://cloud.google.com/blog/products/ai-machine-learning/ml-model-tabnet-is-easy-to-use-on-cloud-ai-platform) (Google Cloud)
* [pytorch-tabnet](https://github.com/dreamquark-ai/tabnet) (GitHub)