tabnet (pytorch-tabnet) #52

Open · szilard opened this issue Mar 1, 2021 · 41 comments

szilard commented Mar 1, 2021

https://github.com/dreamquark-ai/tabnet

szilard commented Mar 2, 2021

Setup:

## CPU

sudo docker run --rm  -ti continuumio/anaconda3 /bin/bash

pip3 install -U pytorch-tabnet


## GPU

sudo docker build -t tabnet_gpu .

sudo nvidia-docker run -ti --rm tabnet_gpu /bin/bash

Dockerfile:

FROM nvidia/cuda:11.0-devel-ubuntu20.04

ENV DEBIAN_FRONTEND=noninteractive

## https://github.com/ContinuumIO/docker-images/blob/master/anaconda3/debian/Dockerfile
## Latest commit edc2451 on Nov 30, 2020

ENV LANG=C.UTF-8 LC_ALL=C.UTF-8
ENV PATH /opt/conda/bin:$PATH

RUN apt-get update --fix-missing && \
    apt-get install -y wget bzip2 ca-certificates libglib2.0-0 libxext6 libsm6 libxrender1 git mercurial subversion && \
    apt-get clean

RUN wget --quiet https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh -O ~/anaconda.sh && \
    /bin/bash ~/anaconda.sh -b -p /opt/conda && \
    rm ~/anaconda.sh && \
    ln -s /opt/conda/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
    echo ". /opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
    echo "conda activate base" >> ~/.bashrc && \
    find /opt/conda/ -follow -type f -name '*.a' -delete && \
    find /opt/conda/ -follow -type f -name '*.js.map' -delete && \
    /opt/conda/bin/conda clean -afy

RUN pip3 install -U pytorch-tabnet xgboost

CMD [ "/bin/bash" ]

szilard commented Mar 2, 2021

Starter code (adapted from https://github.com/dreamquark-ai/tabnet/blob/develop/census_example.ipynb):

from pytorch_tabnet.tab_model import TabNetClassifier
import torch

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn import metrics


d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")


d_all = pd.concat([d_train,d_test])

vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
  d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])

X_all = d_all[vars_num+vars_cat]
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)

cat_idxs = [ i for i, col in enumerate(X_all.columns) if col in vars_cat]
cat_dims = [ len(np.unique(X_all.iloc[:,i].values)) for i in cat_idxs]

idx_trvl = np.random.choice(["train", "valid"], p =[.8, .2], size=(d_train.shape[0],))

X_train = X_all[0:d_train.shape[0]][idx_trvl=="train"].to_numpy()
y_train = y_all[0:d_train.shape[0]][idx_trvl=="train"]
X_valid = X_all[0:d_train.shape[0]][idx_trvl=="valid"].to_numpy()
y_valid = y_all[0:d_train.shape[0]][idx_trvl=="valid"]

X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])].to_numpy()
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]


md = TabNetClassifier(cat_idxs=cat_idxs,
                       cat_dims=cat_dims,
                       cat_emb_dim=1,
                       optimizer_fn=torch.optim.Adam,
                       optimizer_params=dict(lr=2e-2),
                       scheduler_params={"step_size":50, # how to use learning rate scheduler
                                         "gamma":0.9},
                       scheduler_fn=torch.optim.lr_scheduler.StepLR,
                       mask_type='entmax' # "sparsemax"
)

%%time
md.fit( X_train=X_train, y_train=y_train,
    eval_set=[(X_train, y_train), (X_valid, y_valid)],
    eval_name=['train', 'valid'],
    eval_metric=['auc'],
    max_epochs=1000, patience=20,
    batch_size=1024, virtual_batch_size=128,
    num_workers=0,
    weights=1,
    drop_last=False
)


y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))

To change training data size:

# d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-1m.csv")

szilard commented Mar 2, 2021

Compare to XGBoost sample code:

import xgboost as xgb

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn import metrics


d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")


d_all = pd.concat([d_train,d_test])

vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
  d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])

X_all = d_all[vars_num+vars_cat]
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)

X_train = X_all[0:d_train.shape[0]]
y_train = y_all[0:d_train.shape[0]]
X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]


dxgb_train = xgb.DMatrix(X_train, label = y_train)
dxgb_test = xgb.DMatrix(X_test)


param = {'max_depth':10, 'eta':0.1, 'objective':'binary:logistic', 'tree_method':'hist'}             
%time md = xgb.train(param, dxgb_train, num_boost_round = 100)


y_pred = md.predict(dxgb_test)   
print(metrics.roc_auc_score(y_test, y_pred))

szilard commented Mar 2, 2021

Results:

p3.2xlarge (8 CPU cores, 60GB RAM, Tesla V100-SXM2 GPU, 16GB GPU memory)

data 0.1m rows:

Device used : cuda

epoch 0  | loss: 0.65115 | train_auc: 0.66573 | valid_auc: 0.66859 |  0:00:09s
epoch 1  | loss: 0.63131 | train_auc: 0.69202 | valid_auc: 0.69403 |  0:00:19s
epoch 2  | loss: 0.62756 | train_auc: 0.69904 | valid_auc: 0.69698 |  0:00:29s
...

nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   40C    P0    39W / 300W |   1269MiB / 16160MiB |     11%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      9095      C   /opt/conda/bin/python            1267MiB |
+-----------------------------------------------------------------------------+

mpstat -P ALL
Linux 5.4.0-1021-aws (ip-172-31-1-150)  03/02/21        _x86_64_        (8 CPU)

08:30:11     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
08:30:11     all    5.52    0.04    0.65    2.42    0.00    0.01    0.55    0.00    0.00   90.81
08:30:11       0   12.59    0.02    0.69    5.88    0.00    0.01    0.44    0.00    0.00   80.38
08:30:11       1    5.49    0.00    0.54    4.14    0.00    0.01    0.60    0.00    0.00   89.22
08:30:11       2    1.55    0.05    0.56    2.28    0.00    0.00    0.60    0.00    0.00   94.96
08:30:11       3    1.10    0.04    0.62    1.63    0.00    0.00    0.55    0.00    0.00   96.06
08:30:11       4   13.70    0.16    0.82    0.97    0.00    0.00    0.53    0.00    0.00   83.82
08:30:11       5    3.35    0.00    0.54    0.78    0.00    0.00    0.56    0.00    0.00   94.77
08:30:11       6    4.27    0.01    0.58    1.06    0.00    0.00    0.58    0.00    0.00   93.49
08:30:11       7    2.14    0.00    0.88    2.66    0.00    0.01    0.54    0.00    0.00   93.77

...
epoch 44 | loss: 0.59977 | train_auc: 0.73965 | valid_auc: 0.71134 |  0:07:33s
epoch 45 | loss: 0.5972  | train_auc: 0.74166 | valid_auc: 0.71608 |  0:07:43s
epoch 46 | loss: 0.59842 | train_auc: 0.74134 | valid_auc: 0.715   |  0:07:53s

Early stopping occurred at epoch 46 with best_epoch = 26 and best_valid_auc = 0.72051
Best weights from best epoch are automatically used!
CPU times: user 7min 58s, sys: 2.09 s, total: 8min
Wall time: 8min

print(metrics.roc_auc_score(y_test, y_pred))
0.7119781941594729

szilard commented Mar 2, 2021

XGBoost:

CPU:

CPU times: user 10.1 s, sys: 524 ms, total: 10.6 s
Wall time: 1.61 s

In [23]: print(metrics.roc_auc_score(y_test, y_pred))
0.7234366181705152

GPU:

param = {'max_depth':10, 'eta':0.1, 'objective':'binary:logistic', 'tree_method':'gpu_hist'}  

CPU times: user 2.46 s, sys: 248 ms, total: 2.71 s
Wall time: 2.71 s

In [27]: print(metrics.roc_auc_score(y_test, y_pred))
0.7240759297879703

szilard commented Mar 2, 2021

More info on the data/tabnet:

d_train
Out[26]:
      Month DayofMonth DayOfWeek  DepTime UniqueCarrier Origin Dest  Distance dep_delayed_15min
0       c-8       c-21       c-7     1934            AA    ATL  DFW       732                 N
1       c-4       c-20       c-3     1548            US    PIT  MCO       834                 N
2       c-9        c-2       c-5     1422            XE    RDU  CLE       416                 N
3      c-11       c-25       c-6     1015            OO    DEN  MEM       872                 N
4      c-10        c-7       c-6     1828            WN    MDW  OMA       423                 Y
...     ...        ...       ...      ...           ...    ...  ...       ...               ...
99995   c-5        c-4       c-3     1618            OO    SFO  RDD       199                 N
99996   c-1       c-18       c-3      804            CO    EWR  DAB       884                 N
99997   c-1       c-24       c-2     1901            NW    DTW  IAH      1076                 N
99998   c-4       c-27       c-4     1515            MQ    DFW  GGG       140                 N
99999  c-11       c-17       c-4     1800            WN    SEA  SMF       605                 N

[100000 rows x 9 columns]


X_train
Out[27]:
array([[1934,  732,   10, ...,    1,   19,   82],
       [1548,  834,    6, ...,   19,  226,  180],
       [1828,  423,    1, ...,   20,  182,  210],
       ...,
       [ 804,  884,    0, ...,    5,   98,   76],
       [1901, 1076,    0, ...,   14,   88,  139],
       [1800,  605,    2, ...,   20,  258,  269]])


In [29]: cat_idxs
Out[29]: [2, 3, 4, 5, 6, 7]

In [30]: cat_dims
Out[30]: [12, 31, 7, 23, 307, 307]


md
Out[32]: TabNetClassifier(n_d=8, n_a=8, n_steps=3, gamma=1.3, cat_idxs=[2, 3, 4, 5, 6, 7], 
cat_dims=[12, 31, 7, 23, 307, 307], cat_emb_dim=1, n_independent=2, n_shared=2, 
epsilon=1e-15, momentum=0.02, lambda_sparse=0.001, seed=0, clip_value=1, 
verbose=1, optimizer_fn=<class 'torch.optim.adam.Adam'>, optimizer_params={'lr': 0.02}, 
scheduler_fn=<class 'torch.optim.lr_scheduler.StepLR'>, 
scheduler_params={'step_size': 50, 'gamma': 0.9}, mask_type='entmax', 
input_dim=None, output_dim=None, device_name='auto')

szilard commented Mar 2, 2021

Results on 1M rows:

Very slow, stopped after 1 hour:

epoch 0  | loss: 0.62403 | train_auc: 0.7244  | valid_auc: 0.72432 |  0:01:35s
epoch 1  | loss: 0.60722 | train_auc: 0.73152 | valid_auc: 0.73089 |  0:03:12s
epoch 2  | loss: 0.60457 | train_auc: 0.73307 | valid_auc: 0.7321  |  0:04:48s
epoch 3  | loss: 0.60434 | train_auc: 0.73074 | valid_auc: 0.73007 |  0:06:25s
epoch 4  | loss: 0.60333 | train_auc: 0.73529 | valid_auc: 0.73426 |  0:08:02s
epoch 5  | loss: 0.60159 | train_auc: 0.73255 | valid_auc: 0.73174 |  0:09:40s
epoch 6  | loss: 0.60216 | train_auc: 0.73659 | valid_auc: 0.73485 |  0:11:20s
epoch 7  | loss: 0.60207 | train_auc: 0.73629 | valid_auc: 0.73505 |  0:13:00s
epoch 8  | loss: 0.60109 | train_auc: 0.73675 | valid_auc: 0.73551 |  0:14:40s
epoch 9  | loss: 0.60219 | train_auc: 0.73621 | valid_auc: 0.73493 |  0:16:22s
epoch 10 | loss: 0.60286 | train_auc: 0.73797 | valid_auc: 0.73608 |  0:18:03s
epoch 11 | loss: 0.60006 | train_auc: 0.73767 | valid_auc: 0.73618 |  0:19:44s
epoch 12 | loss: 0.60065 | train_auc: 0.73055 | valid_auc: 0.73015 |  0:21:26s
epoch 13 | loss: 0.60221 | train_auc: 0.73672 | valid_auc: 0.73494 |  0:23:08s
epoch 14 | loss: 0.60069 | train_auc: 0.73623 | valid_auc: 0.73457 |  0:24:48s
epoch 15 | loss: 0.60025 | train_auc: 0.73899 | valid_auc: 0.73672 |  0:26:28s
epoch 16 | loss: 0.60112 | train_auc: 0.73696 | valid_auc: 0.73502 |  0:28:05s
epoch 17 | loss: 0.60138 | train_auc: 0.72279 | valid_auc: 0.72241 |  0:29:44s
epoch 18 | loss: 0.60548 | train_auc: 0.7315  | valid_auc: 0.72944 |  0:31:21s
epoch 19 | loss: 0.60114 | train_auc: 0.73756 | valid_auc: 0.73539 |  0:32:57s
epoch 20 | loss: 0.60005 | train_auc: 0.73884 | valid_auc: 0.73716 |  0:34:33s
epoch 21 | loss: 0.59891 | train_auc: 0.73819 | valid_auc: 0.7363  |  0:36:09s
epoch 22 | loss: 0.60166 | train_auc: 0.73707 | valid_auc: 0.73574 |  0:37:48s
epoch 23 | loss: 0.6021  | train_auc: 0.73613 | valid_auc: 0.73465 |  0:39:29s
epoch 24 | loss: 0.59973 | train_auc: 0.73953 | valid_auc: 0.73827 |  0:41:10s
epoch 25 | loss: 0.59873 | train_auc: 0.73943 | valid_auc: 0.7377  |  0:42:48s
epoch 26 | loss: 0.60035 | train_auc: 0.74019 | valid_auc: 0.73831 |  0:44:21s
epoch 27 | loss: 0.60074 | train_auc: 0.71763 | valid_auc: 0.71827 |  0:45:57s
epoch 28 | loss: 0.60204 | train_auc: 0.73863 | valid_auc: 0.73705 |  0:47:35s
epoch 29 | loss: 0.60013 | train_auc: 0.73872 | valid_auc: 0.73735 |  0:49:13s
epoch 30 | loss: 0.59869 | train_auc: 0.73575 | valid_auc: 0.73365 |  0:50:51s
epoch 31 | loss: 0.59862 | train_auc: 0.74101 | valid_auc: 0.7397  |  0:52:28s
epoch 32 | loss: 0.60233 | train_auc: 0.73744 | valid_auc: 0.73598 |  0:54:08s

szilard commented Mar 2, 2021

XGBoost 1M rows:

CPU:

(8 cores only)

Wall time: 6.44 s

In [23]: print(metrics.roc_auc_score(y_test, y_pred))
0.7523232418031532

GPU:

Wall time: 4.15 s

In [23]: print(metrics.roc_auc_score(y_test, y_pred))
0.7533844881816474

szilard commented Mar 2, 2021

Trying out other hyperparameter values on the 0.1M-row data:

Default parameter values:

md = TabNetClassifier(cat_idxs=cat_idxs,
                       cat_dims=cat_dims,
                       cat_emb_dim=1
)

%%time
md.fit( X_train=X_train, y_train=y_train,
    eval_set=[(X_train, y_train), (X_valid, y_valid)],
    eval_name=['train', 'valid'],
    eval_metric=['auc']
)
epoch 49 | loss: 0.43199 | train_auc: 0.73726 | valid_auc: 0.71074 |  0:07:42s
epoch 50 | loss: 0.43204 | train_auc: 0.73595 | valid_auc: 0.71217 |  0:07:51s
epoch 51 | loss: 0.43165 | train_auc: 0.73697 | valid_auc: 0.71263 |  0:08:00s
epoch 52 | loss: 0.43191 | train_auc: 0.73649 | valid_auc: 0.71253 |  0:08:10s
epoch 53 | loss: 0.4315  | train_auc: 0.7359  | valid_auc: 0.7128  |  0:08:19s
epoch 54 | loss: 0.4316  | train_auc: 0.73627 | valid_auc: 0.70992 |  0:08:29s

Early stopping occurred at epoch 54 with best_epoch = 44 and best_valid_auc = 0.71362
Best weights from best epoch are automatically used!
CPU times: user 8min 33s, sys: 2.4 s, total: 8min 35s
Wall time: 8min 35s

In [27]: print(metrics.roc_auc_score(y_test, y_pred))
0.7143332514023264

very similar to results above:

epoch 44 | loss: 0.59977 | train_auc: 0.73965 | valid_auc: 0.71134 |  0:07:33s
epoch 45 | loss: 0.5972  | train_auc: 0.74166 | valid_auc: 0.71608 |  0:07:43s
epoch 46 | loss: 0.59842 | train_auc: 0.74134 | valid_auc: 0.715   |  0:07:53s

Early stopping occurred at epoch 46 with best_epoch = 26 and best_valid_auc = 0.72051
Best weights from best epoch are automatically used!
CPU times: user 7min 58s, sys: 2.09 s, total: 8min
Wall time: 8min

print(metrics.roc_auc_score(y_test, y_pred))
0.7119781941594729

szilard commented Mar 2, 2021

On CPU (rather than GPU) (8 cores only though):

sudo docker run -ti --rm tabnet_gpu /bin/bash

md = TabNetClassifier(cat_idxs=cat_idxs,
                       cat_dims=cat_dims,
                       cat_emb_dim=1,
                       optimizer_fn=torch.optim.Adam,
                       optimizer_params=dict(lr=2e-2),
                       scheduler_params={"step_size":50, # how to use learning rate scheduler
                                         "gamma":0.9},
                       scheduler_fn=torch.optim.lr_scheduler.StepLR,
                       mask_type='entmax' # "sparsemax"
)

%%time
md.fit( X_train=X_train, y_train=y_train,
    eval_set=[(X_train, y_train), (X_valid, y_valid)],
    eval_name=['train', 'valid'],
    eval_metric=['auc'],
    max_epochs=1000, patience=20,
    batch_size=1024, virtual_batch_size=128,
    num_workers=0,
    weights=1,
    drop_last=False
)



Device used : cpu


mpstat -P ALL
Linux 5.4.0-1021-aws (ip-172-31-1-150)  03/02/21        _x86_64_        (8 CPU)

10:28:06     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:28:06     all    8.05    0.01    0.22    0.54    0.00    0.00    0.13    0.00    0.00   91.05
10:28:06       0    4.59    0.00    0.21    1.32    0.00    0.00    0.10    0.00    0.00   93.77
10:28:06       1   16.50    0.00    0.22    0.93    0.00    0.00    0.14    0.00    0.00   82.22
10:28:06       2   15.52    0.01    0.19    0.51    0.00    0.00    0.14    0.00    0.00   83.63
10:28:06       3   10.64    0.01    0.25    0.36    0.00    0.00    0.13    0.00    0.00   88.62
10:28:06       4    5.24    0.04    0.28    0.22    0.00    0.00    0.12    0.00    0.00   94.10
10:28:06       5    5.30    0.00    0.18    0.17    0.00    0.00    0.13    0.00    0.00   94.22
10:28:06       6    3.76    0.00    0.17    0.24    0.00    0.00    0.13    0.00    0.00   95.69
10:28:06       7    2.83    0.00    0.26    0.60    0.00    0.00    0.12    0.00    0.00   96.18


nvidia-smi
Tue Mar  2 10:28:24 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   36C    P0    24W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


epoch 0  | loss: 0.65027 | train_auc: 0.63531 | valid_auc: 0.64192 |  0:00:09s
epoch 1  | loss: 0.62979 | train_auc: 0.68813 | valid_auc: 0.69382 |  0:00:19s
epoch 2  | loss: 0.62891 | train_auc: 0.69958 | valid_auc: 0.70239 |  0:00:29s
epoch 3  | loss: 0.62316 | train_auc: 0.70729 | valid_auc: 0.70938 |  0:00:39s
epoch 4  | loss: 0.61907 | train_auc: 0.70851 | valid_auc: 0.71051 |  0:00:48s
...
epoch 53 | loss: 0.5986  | train_auc: 0.74566 | valid_auc: 0.7193  |  0:08:44s
epoch 54 | loss: 0.59522 | train_auc: 0.74581 | valid_auc: 0.71679 |  0:08:54s
epoch 55 | loss: 0.59541 | train_auc: 0.74725 | valid_auc: 0.71946 |  0:09:04s
epoch 56 | loss: 0.59553 | train_auc: 0.74515 | valid_auc: 0.71192 |  0:09:14s

Early stopping occurred at epoch 56 with best_epoch = 36 and best_valid_auc = 0.72201
Best weights from best epoch are automatically used!
CPU times: user 31min 6s, sys: 4.51 s, total: 31min 11s
Wall time: 9min 16s

In [27]: print(metrics.roc_auc_score(y_test, y_pred))
0.7129940175217027

Screen Shot 2021-03-02 at 2 32 15 AM

szilard commented Mar 2, 2021

m5.4xlarge CPU only (no GPU) (16 CPU cores):

from pytorch_tabnet.tab_model import TabNetClassifier
import torch

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn import metrics


d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")


d_all = pd.concat([d_train,d_test])

vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
  d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])

X_all = d_all[vars_num+vars_cat]
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)

cat_idxs = [ i for i, col in enumerate(X_all.columns) if col in vars_cat]
cat_dims = [ len(np.unique(X_all.iloc[:,i].values)) for i in cat_idxs]

idx_trvl = np.random.choice(["train", "valid"], p =[.8, .2], size=(d_train.shape[0],))

X_train = X_all[0:d_train.shape[0]][idx_trvl=="train"].to_numpy()
y_train = y_all[0:d_train.shape[0]][idx_trvl=="train"]
X_valid = X_all[0:d_train.shape[0]][idx_trvl=="valid"].to_numpy()
y_valid = y_all[0:d_train.shape[0]][idx_trvl=="valid"]

X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])].to_numpy()
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]


md = TabNetClassifier(cat_idxs=cat_idxs,
                       cat_dims=cat_dims,
                       cat_emb_dim=1,
                       optimizer_fn=torch.optim.Adam,
                       optimizer_params=dict(lr=2e-2),
                       scheduler_params={"step_size":50, # how to use learning rate scheduler
                                         "gamma":0.9},
                       scheduler_fn=torch.optim.lr_scheduler.StepLR,
                       mask_type='entmax' # "sparsemax"
)

%%time
md.fit( X_train=X_train, y_train=y_train,
    eval_set=[(X_train, y_train), (X_valid, y_valid)],
    eval_name=['train', 'valid'],
    eval_metric=['auc'],
    max_epochs=1000, patience=20,
    batch_size=1024, virtual_batch_size=128,
    num_workers=0,
    weights=1,
    drop_last=False
)


y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
epoch 46 | loss: 0.59857 | train_auc: 0.74215 | valid_auc: 0.71772 |  0:05:37s
epoch 47 | loss: 0.5976  | train_auc: 0.74279 | valid_auc: 0.71512 |  0:05:44s
epoch 48 | loss: 0.59852 | train_auc: 0.74411 | valid_auc: 0.71825 |  0:05:51s
epoch 49 | loss: 0.59193 | train_auc: 0.74576 | valid_auc: 0.71745 |  0:05:58s
epoch 50 | loss: 0.59302 | train_auc: 0.74683 | valid_auc: 0.71677 |  0:06:05s

Early stopping occurred at epoch 50 with best_epoch = 30 and best_valid_auc = 0.71973
Best weights from best epoch are automatically used!
CPU times: user 43min 22s, sys: 2.74 s, total: 43min 25s
Wall time: 6min 7s

In [27]: print(metrics.roc_auc_score(y_test, y_pred))
0.7108215092817305

Screen Shot 2021-03-02 at 8 22 20 AM

szilard commented Mar 2, 2021

With the numeric variables normalized:

d_all["DepTime"]=d_all["DepTime"]/2400
d_all["Distance"]=np.log10(d_all["Distance"]/100)

Screen Shot 2021-03-02 at 8 31 48 AM

Screen Shot 2021-03-02 at 8 31 57 AM


epoch 29 | loss: 0.60077 | train_auc: 0.73926 | valid_auc: 0.70858 | 0:03:35s
epoch 30 | loss: 0.60221 | train_auc: 0.73805 | valid_auc: 0.70598 | 0:03:42s
epoch 31 | loss: 0.59789 | train_auc: 0.73964 | valid_auc: 0.70676 | 0:03:49s
epoch 32 | loss: 0.59851 | train_auc: 0.74012 | valid_auc: 0.70553 | 0:03:56s
epoch 33 | loss: 0.59718 | train_auc: 0.73896 | valid_auc: 0.70458 | 0:04:03s

Early stopping occurred at epoch 33 with best_epoch = 13 and best_valid_auc = 0.70874
Best weights from best epoch are automatically used!
CPU times: user 29min, sys: 2 s, total: 29min 2s
Wall time: 4min 5s

In [32]: print(metrics.roc_auc_score(y_test, y_pred))
0.7073240833934096

Optimox commented Mar 3, 2021

Hello @szilard,

Nice work on the benchmark!

Here is a list of thoughts/questions I have after reading all your comments:

  • first of all, TabNet is definitely slower than XGBoost; none of the tips below will change that. There are two cases where TabNet can actually be faster than XGBoost: multi-task prediction (regression or classification), which XGBoost does not handle, and multi-class problems with a very large number of classes, since XGBoost does not scale well with the number of classes while TabNet is almost O(1) in it
  • are there "only" 9 columns in the dataset? With only a few columns, the amount of data per batch sent to the GPU is quite small (batch_size x number of features), so there is no need for a huge GPU (I saw 60GB of RAM); playing with num_workers might also speed up your training. The paper says you can set the batch_size as large as 10% of the total dataset; I personally would not go that far, but you could try 4096, 8192, or 16384. This will definitely speed up the training
  • also, you started from the starter notebook, which is good but not necessarily optimal in terms of speed. Sometimes you don't need that many epochs, especially with a large dataset where the model learns enough in a few epochs. The best way to reduce the number of epochs without losing too much in terms of score is to use a one-cycle learning rate (OneCycleLR). There is an example here (let's skip the pretraining part for the moment): https://www.kaggle.com/optimo/selfsupervisedtabnet
    You only need to change the number of epochs and switch the scheduler like this:
MAX_EPOCH = 10 # 20
BS = 4096 # 8192

# In tabnet params
scheduler_fn=torch.optim.lr_scheduler.OneCycleLR,
scheduler_params=dict(max_lr=0.05,
                      steps_per_epoch=int(X_train.shape[0] / BS),
                      epochs=MAX_EPOCH,
                      is_batch_level=True),

# inside the fit call
max_epochs=MAX_EPOCH

EDIT: make sure to set drop_last to True when using OneCycleLR

  • in order to improve the results you might want to tweak some parameters. The first thing I'd try is setting n_steps to 1: you lose the sequential attention mechanism but keep attention. Setting lambda_sparse to 0 may also help, since a strong sparsity regularisation can force the model to pick too few features. Trying n_d=n_a=16 or 32 might help as well.
  • about embedding dimensions: I think it's OK for low-cardinality features to set the dimension to 1, but when a feature has lots of levels (like Destination, I guess) you might want to bring this up; something like log(feat_cardinality) might improve the score
  • also, you could try setting weights to 0, since balancing the classes does not necessarily help on every dataset (I don't know anything about this one)
  • to be honest, I never had great success with pretraining (only marginal improvements), but it could definitely help in a semi-supervised setting where you have extra data without labels (a combined sketch of several of these tweaks follows this list)
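
A combined sketch of what several of these tweaks could look like together, reusing cat_idxs, cat_dims, X_train, y_train, X_valid and y_valid from the starter code above (the concrete values here are illustrative, not taken from this comment):

import numpy as np
import torch
from pytorch_tabnet.tab_model import TabNetClassifier

MAX_EPOCH = 10
BS = 4096

## per-feature embedding sizes, roughly log2 of each cardinality (illustrative choice)
cat_emb_dim = [max(1, int(np.log2(d))) for d in cat_dims]

md = TabNetClassifier(cat_idxs=cat_idxs,
                      cat_dims=cat_dims,
                      cat_emb_dim=cat_emb_dim,
                      n_d=16, n_a=16,        # wider decision/attention layers
                      n_steps=1,             # single step: keeps attention, drops the sequential part
                      lambda_sparse=0.,      # no sparsity regularisation
                      scheduler_fn=torch.optim.lr_scheduler.OneCycleLR,
                      scheduler_params=dict(max_lr=0.05,
                                            steps_per_epoch=int(X_train.shape[0] / BS),
                                            epochs=MAX_EPOCH,
                                            is_batch_level=True))

md.fit(X_train=X_train, y_train=y_train,
       eval_set=[(X_valid, y_valid)], eval_name=['valid'], eval_metric=['auc'],
       max_epochs=MAX_EPOCH, patience=MAX_EPOCH,   # run all epochs, keep the best one on the validation set
       batch_size=BS, virtual_batch_size=128,
       num_workers=2,                              # may help data loading
       drop_last=True)                             # required with OneCycleLR (see EDIT above)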

I guess I'm out of ideas for now. There is no guarantee of beating XGBoost at all, but I definitely agree with you that it would be nice to end up with a closer score and a closer training time :)

Cheers!

szilard commented Mar 3, 2021

Wow, thanks @Optimox for the detailed comments and suggestions, I really appreciate it.

I'll try tweaking the params as per your many suggestions above and post the results here as I get them.

Yeah, sure, which algo is best will depend on the dataset, and on this one tabnet might not be able to beat xgboost, but it's already not far off. I'll especially try to see whether any of your suggestions can make it faster.

Thanks again for the detailed feedback.

szilard commented Mar 3, 2021

m5.2xlarge CPU only (no GPU) (8 CPU cores):

from pytorch_tabnet.tab_model import TabNetClassifier
import torch

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn import metrics


d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")


d_all = pd.concat([d_train,d_test])

vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
  d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])

X_all = d_all[vars_num+vars_cat]
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)

cat_idxs = [ i for i, col in enumerate(X_all.columns) if col in vars_cat]
cat_dims = [ len(np.unique(X_all.iloc[:,i].values)) for i in cat_idxs]

idx_trvl = np.random.choice(["train", "valid"], p =[.8, .2], size=(d_train.shape[0],))

X_train = X_all[0:d_train.shape[0]][idx_trvl=="train"].to_numpy()
y_train = y_all[0:d_train.shape[0]][idx_trvl=="train"]
X_valid = X_all[0:d_train.shape[0]][idx_trvl=="valid"].to_numpy()
y_valid = y_all[0:d_train.shape[0]][idx_trvl=="valid"]

X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])].to_numpy()
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]


md = TabNetClassifier(cat_idxs=cat_idxs,
                       cat_dims=cat_dims,
                       cat_emb_dim=1,
                       optimizer_fn=torch.optim.Adam,
                       optimizer_params=dict(lr=2e-2),
                       scheduler_params={"step_size":50, # how to use learning rate scheduler
                                         "gamma":0.9},
                       scheduler_fn=torch.optim.lr_scheduler.StepLR,
                       mask_type='entmax' # "sparsemax"
)

%%time
md.fit( X_train=X_train, y_train=y_train,
    eval_set=[(X_train, y_train), (X_valid, y_valid)],
    eval_name=['train', 'valid'],
    eval_metric=['auc'],
    max_epochs=1000, patience=20,
    batch_size=1024, virtual_batch_size=128,
    num_workers=0,
    weights=1,
    drop_last=False
)


y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
epoch 0  | loss: 0.65813 | train_auc: 0.66324 | valid_auc: 0.663   |  0:00:09s
epoch 1  | loss: 0.63399 | train_auc: 0.69213 | valid_auc: 0.69193 |  0:00:19s
epoch 2  | loss: 0.62437 | train_auc: 0.70212 | valid_auc: 0.70075 |  0:00:28s
epoch 3  | loss: 0.6233  | train_auc: 0.70729 | valid_auc: 0.70566 |  0:00:38s
epoch 4  | loss: 0.62286 | train_auc: 0.71031 | valid_auc: 0.70625 |  0:00:47s
epoch 5  | loss: 0.62021 | train_auc: 0.71293 | valid_auc: 0.7098  |  0:00:57s
epoch 6  | loss: 0.61444 | train_auc: 0.71756 | valid_auc: 0.71235 |  0:01:06s
epoch 7  | loss: 0.61398 | train_auc: 0.7209  | valid_auc: 0.71132 |  0:01:16s
epoch 8  | loss: 0.61277 | train_auc: 0.72358 | valid_auc: 0.70951 |  0:01:25s
epoch 9  | loss: 0.61233 | train_auc: 0.725   | valid_auc: 0.71468 |  0:01:35s
epoch 10 | loss: 0.61174 | train_auc: 0.72741 | valid_auc: 0.71295 |  0:01:44s
epoch 11 | loss: 0.60961 | train_auc: 0.72672 | valid_auc: 0.71123 |  0:01:53s
epoch 12 | loss: 0.60802 | train_auc: 0.72806 | valid_auc: 0.71363 |  0:02:03s
epoch 13 | loss: 0.6083  | train_auc: 0.72885 | valid_auc: 0.71485 |  0:02:12s
epoch 14 | loss: 0.61068 | train_auc: 0.72652 | valid_auc: 0.71343 |  0:02:21s
epoch 15 | loss: 0.60608 | train_auc: 0.7309  | valid_auc: 0.71617 |  0:02:30s
epoch 16 | loss: 0.60619 | train_auc: 0.73212 | valid_auc: 0.7143  |  0:02:40s
epoch 17 | loss: 0.60593 | train_auc: 0.73339 | valid_auc: 0.71603 |  0:02:49s
epoch 18 | loss: 0.6071  | train_auc: 0.73443 | valid_auc: 0.71644 |  0:02:59s
epoch 19 | loss: 0.60327 | train_auc: 0.73519 | valid_auc: 0.71675 |  0:03:08s
epoch 20 | loss: 0.60279 | train_auc: 0.73593 | valid_auc: 0.71602 |  0:03:17s
epoch 21 | loss: 0.60136 | train_auc: 0.73667 | valid_auc: 0.71974 |  0:03:27s
epoch 22 | loss: 0.60338 | train_auc: 0.73501 | valid_auc: 0.71766 |  0:03:36s
epoch 23 | loss: 0.60196 | train_auc: 0.73745 | valid_auc: 0.71607 |  0:03:46s
epoch 24 | loss: 0.60137 | train_auc: 0.73937 | valid_auc: 0.71556 |  0:03:55s
epoch 25 | loss: 0.59901 | train_auc: 0.73532 | valid_auc: 0.70913 |  0:04:04s
epoch 26 | loss: 0.60093 | train_auc: 0.73845 | valid_auc: 0.71182 |  0:04:13s
epoch 27 | loss: 0.59979 | train_auc: 0.74028 | valid_auc: 0.71339 |  0:04:23s
epoch 28 | loss: 0.59773 | train_auc: 0.74007 | valid_auc: 0.71401 |  0:04:31s
epoch 29 | loss: 0.59892 | train_auc: 0.74227 | valid_auc: 0.71467 |  0:04:40s
epoch 30 | loss: 0.59303 | train_auc: 0.74177 | valid_auc: 0.71591 |  0:04:50s
epoch 31 | loss: 0.59585 | train_auc: 0.74102 | valid_auc: 0.7117  |  0:04:59s
epoch 32 | loss: 0.59496 | train_auc: 0.74203 | valid_auc: 0.71381 |  0:05:08s
epoch 33 | loss: 0.59748 | train_auc: 0.74371 | valid_auc: 0.71321 |  0:05:17s
epoch 34 | loss: 0.59633 | train_auc: 0.74422 | valid_auc: 0.71709 |  0:05:27s
epoch 35 | loss: 0.59441 | train_auc: 0.74421 | valid_auc: 0.7132  |  0:05:36s
epoch 36 | loss: 0.59431 | train_auc: 0.74588 | valid_auc: 0.71288 |  0:05:45s
epoch 37 | loss: 0.5925  | train_auc: 0.74702 | valid_auc: 0.71409 |  0:05:54s
epoch 38 | loss: 0.59388 | train_auc: 0.74751 | valid_auc: 0.71434 |  0:06:03s
epoch 39 | loss: 0.59145 | train_auc: 0.74706 | valid_auc: 0.71226 |  0:06:13s
epoch 40 | loss: 0.59286 | train_auc: 0.7489  | valid_auc: 0.71205 |  0:06:22s
epoch 41 | loss: 0.59577 | train_auc: 0.74959 | valid_auc: 0.71335 |  0:06:31s

Early stopping occurred at epoch 41 with best_epoch = 21 and best_valid_auc = 0.71974
Best weights from best epoch are automatically used!
CPU times: user 22min 45s, sys: 2.18 s, total: 22min 47s
Wall time: 6min 33s

In [29]:

In [29]: y_pred = md.predict_proba(X_test)[:,1]

In [30]: print(metrics.roc_auc_score(y_test, y_pred))
0.7091169221435473

Reduce the number of epochs as per @Optimox's suggestions:

    max_epochs=1000, patience=3,

(patience=3 instead of patience=20)

epoch 0  | loss: 0.65513 | train_auc: 0.67451 | valid_auc: 0.67201 |  0:00:08s
epoch 1  | loss: 0.63041 | train_auc: 0.69572 | valid_auc: 0.69199 |  0:00:16s
epoch 2  | loss: 0.62603 | train_auc: 0.7005  | valid_auc: 0.69371 |  0:00:24s
epoch 3  | loss: 0.62409 | train_auc: 0.7097  | valid_auc: 0.69941 |  0:00:32s
epoch 4  | loss: 0.61996 | train_auc: 0.71049 | valid_auc: 0.70114 |  0:00:40s
epoch 5  | loss: 0.61891 | train_auc: 0.71802 | valid_auc: 0.70375 |  0:00:49s
epoch 6  | loss: 0.61704 | train_auc: 0.72077 | valid_auc: 0.70777 |  0:00:57s
epoch 7  | loss: 0.61394 | train_auc: 0.72274 | valid_auc: 0.70531 |  0:01:05s
epoch 8  | loss: 0.61016 | train_auc: 0.7254  | valid_auc: 0.712   |  0:01:13s
epoch 9  | loss: 0.60876 | train_auc: 0.72808 | valid_auc: 0.70849 |  0:01:21s
epoch 10 | loss: 0.60786 | train_auc: 0.72984 | valid_auc: 0.70787 |  0:01:29s
epoch 11 | loss: 0.60615 | train_auc: 0.73097 | valid_auc: 0.71187 |  0:01:37s

Early stopping occurred at epoch 11 with best_epoch = 8 and best_valid_auc = 0.712
Best weights from best epoch are automatically used!
CPU times: user 6min 4s, sys: 619 ms, total: 6min 4s
Wall time: 1min 39s

In [26]:

In [26]: y_pred = md.predict_proba(X_test)[:,1]

In [27]: print(metrics.roc_auc_score(y_test, y_pred))
0.711116213858468

AUC is still ~0.71, but training is 4x faster.

Optimox commented Mar 3, 2021

@szilard did you change the scheduler when reducing the number of epochs?
Also, I guess this will be more beneficial with the bigger dataset.

With a low number of epochs you can also set patience == MAX_EPOCH, which ensures that all epochs are completed and the best one on the validation set is used for inference, or you can set patience = 0 so that there is no early stopping at all and the latest weights are kept (I would recommend the first option).
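
In terms of the fit call used earlier in this thread, the two options would look roughly like this (a sketch only; MAX_EPOCH and the other names are as in the earlier snippets):

## option 1: run all MAX_EPOCH epochs, keep the best epoch on the validation set
md.fit(X_train=X_train, y_train=y_train,
       eval_set=[(X_valid, y_valid)], eval_name=['valid'], eval_metric=['auc'],
       max_epochs=MAX_EPOCH, patience=MAX_EPOCH)

## option 2: no early stopping at all, keep the weights from the last epoch
md.fit(X_train=X_train, y_train=y_train,
       max_epochs=MAX_EPOCH, patience=0)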

szilard commented Mar 3, 2021

Not yet, I haven't changed anything else yet, just patience. I could also just run a fixed 10 epochs without any early stopping.

I actually want to try the R package before doing all this tweaking.

I like posting a lot of info in GitHub issues (and often), sorry for too many notifications.

szilard commented Mar 3, 2021

New baseline: no early stopping

m5.2xlarge CPU only (no GPU) (8 CPU cores)

from pytorch_tabnet.tab_model import TabNetClassifier
import torch

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn import metrics


d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")


d_all = pd.concat([d_train,d_test])

vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
  d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])

X_all = d_all[vars_num+vars_cat]
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)

cat_idxs = [ i for i, col in enumerate(X_all.columns) if col in vars_cat]
cat_dims = [ len(np.unique(X_all.iloc[:,i].values)) for i in cat_idxs]

X_train = X_all[0:d_train.shape[0]].to_numpy()
y_train = y_all[0:d_train.shape[0]]
X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])].to_numpy()
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]


md = TabNetClassifier(cat_idxs=cat_idxs,
                       cat_dims=cat_dims,
                       cat_emb_dim=1,
                       optimizer_fn=torch.optim.Adam,
                       optimizer_params=dict(lr=2e-2),
                       scheduler_params={"step_size":50, # how to use learning rate scheduler
                                         "gamma":0.9},
                       scheduler_fn=torch.optim.lr_scheduler.StepLR,
                       mask_type='entmax' # "sparsemax"
)

%%time
md.fit( X_train=X_train, y_train=y_train,
    eval_set=[(X_train, y_train)],
    eval_name=['train'],
    eval_metric=['auc'],
    max_epochs=10, patience=0,
    batch_size=1024, virtual_batch_size=128,
    num_workers=0,
    weights=1,
    drop_last=False
)


y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
Device used : cpu

No early stopping will be performed, last training weights will be used.
epoch 0  | loss: 0.6447  | train_auc: 0.66958 |  0:00:10s
epoch 1  | loss: 0.6257  | train_auc: 0.70141 |  0:00:20s
epoch 2  | loss: 0.62222 | train_auc: 0.70962 |  0:00:30s
epoch 3  | loss: 0.61941 | train_auc: 0.71391 |  0:00:40s
epoch 4  | loss: 0.61799 | train_auc: 0.71813 |  0:00:50s
epoch 5  | loss: 0.61705 | train_auc: 0.71874 |  0:01:00s
epoch 6  | loss: 0.61507 | train_auc: 0.72355 |  0:01:10s
epoch 7  | loss: 0.61117 | train_auc: 0.72365 |  0:01:20s
epoch 8  | loss: 0.60973 | train_auc: 0.72783 |  0:01:30s
epoch 9  | loss: 0.61041 | train_auc: 0.72952 |  0:01:40s
CPU times: user 6min 12s, sys: 588 ms, total: 6min 13s
Wall time: 1min 42s

In [24]: print(metrics.roc_auc_score(y_test, y_pred))
0.7152092805851693

szilard commented Mar 3, 2021

tabnet in R:

Setup:

install.packages("tabnet")
library(data.table)
library(ROCR)
library(tabnet)
library(Matrix)


d_train <- fread("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv", stringsAsFactors=TRUE)
d_test <- fread("https://s3.amazonaws.com/benchm-ml--main/test.csv")

## align cat. values (factors)
d_train_test <- rbind(d_train, d_test)
n1 <- nrow(d_train)
n2 <- nrow(d_test)
d_train <- d_train_test[1:n1,]
d_test <- d_train_test[(n1+1):(n1+n2),]


system.time({
  md <- tabnet_fit(dep_delayed_15min ~ . ,d_train, epochs = 10, verbose = TRUE)
})


phat <- predict(md, d_test, type = "prob")$.pred_Y
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
performance(rocr_pred, "auc")@y.values[[1]]
[Epoch 001] Loss: 0.495622
[Epoch 002] Loss: 0.455483
[Epoch 003] Loss: 0.450127
[Epoch 004] Loss: 0.449376
[Epoch 005] Loss: 0.448024
[Epoch 006] Loss: 0.447154
[Epoch 007] Loss: 0.446089
[Epoch 008] Loss: 0.444280
[Epoch 009] Loss: 0.443956
[Epoch 010] Loss: 0.443126
    user   system  elapsed
2927.067    6.196 1502.377
>
>
> phat <- predict(md, d_test, type = "prob")$.pred_Y
> rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
> performance(rocr_pred, "auc")@y.values[[1]]
[1] 0.70621

For some reason, the R package seems to be about 15x slower than the Python implementation (even though, in theory, both call into the same C++ (libtorch) code). Some of the default parameters might be different (TBD), but still.
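
One thing that might be worth checking (just a guess, not verified here) is how many CPU threads libtorch ends up using in each environment. On the Python side that can be inspected and pinned like this:

import torch
print(torch.get_num_threads())   # intra-op CPU threads PyTorch/libtorch will use
## torch.set_num_threads(8)      # could be set explicitly for a fairer CPU comparison

If the R torch package exposes equivalent thread settings, they could be compared against this.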

szilard commented Mar 3, 2021

Removing the evaluation set speeds things up:

%%time
md.fit( X_train=X_train, y_train=y_train,
    max_epochs=10, patience=0,
    batch_size=1024, virtual_batch_size=128,
    num_workers=0,
    weights=1,
    drop_last=False
)
No early stopping will be performed, last training weights will be used.
epoch 0  | loss: 0.6447  |  0:00:08s
epoch 1  | loss: 0.62579 |  0:00:16s
epoch 2  | loss: 0.62128 |  0:00:25s
epoch 3  | loss: 0.61964 |  0:00:33s
epoch 4  | loss: 0.61777 |  0:00:41s
epoch 5  | loss: 0.61686 |  0:00:50s
epoch 6  | loss: 0.61476 |  0:00:58s
epoch 7  | loss: 0.61002 |  0:01:06s
epoch 8  | loss: 0.60881 |  0:01:15s
epoch 9  | loss: 0.60938 |  0:01:23s
CPU times: user 5min 12s, sys: 614 ms, total: 5min 12s
Wall time: 1min 25s

In [23]:

In [23]: y_pred = md.predict_proba(X_test)[:,1]

In [24]: print(metrics.roc_auc_score(y_test, y_pred))
0.71644866173345

Do not try to balance the data (weights=0):

%%time
md.fit( X_train=X_train, y_train=y_train,
    max_epochs=10, patience=0,
    batch_size=1024, virtual_batch_size=128,
    weights=0
)
No early stopping will be performed, last training weights will be used.
epoch 0  | loss: 0.47899 |  0:00:08s
epoch 1  | loss: 0.45288 |  0:00:16s
epoch 2  | loss: 0.44979 |  0:00:25s
epoch 3  | loss: 0.44707 |  0:00:33s
epoch 4  | loss: 0.44457 |  0:00:42s
epoch 5  | loss: 0.44438 |  0:00:50s
epoch 6  | loss: 0.44358 |  0:00:59s
epoch 7  | loss: 0.44306 |  0:01:07s
epoch 8  | loss: 0.44196 |  0:01:15s
epoch 9  | loss: 0.44088 |  0:01:24s
CPU times: user 5min 16s, sys: 646 ms, total: 5min 17s
Wall time: 1min 26s

In [23]:

In [23]: y_pred = md.predict_proba(X_test)[:,1]

In [24]: print(metrics.roc_auc_score(y_test, y_pred))
0.7064836519507345

szilard commented Mar 3, 2021

New baseline (params that take their default values are commented out):

from pytorch_tabnet.tab_model import TabNetClassifier
import torch

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn import metrics


d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")


d_all = pd.concat([d_train,d_test])

vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
  d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])

X_all = d_all[vars_num+vars_cat]
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)

cat_idxs = [ i for i, col in enumerate(X_all.columns) if col in vars_cat]
cat_dims = [ len(np.unique(X_all.iloc[:,i].values)) for i in cat_idxs]

X_train = X_all[0:d_train.shape[0]].to_numpy()
y_train = y_all[0:d_train.shape[0]]
X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])].to_numpy()
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]


md = TabNetClassifier(cat_idxs=cat_idxs,
                       cat_dims=cat_dims,
                       cat_emb_dim=1,
                       ## optimizer_fn=torch.optim.Adam,
                       ## optimizer_params=dict(lr=2e-2),
                       scheduler_params={"step_size":50, # how to use learning rate scheduler
                                         "gamma":0.9},
                       scheduler_fn=torch.optim.lr_scheduler.StepLR,
                       mask_type='entmax' # "sparsemax"
)

%%time
md.fit( X_train=X_train, y_train=y_train,
    max_epochs=10, patience=0,
    ## batch_size=1024, virtual_batch_size=128,
    ## weights=0,
)


y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))

No early stopping will be performed, last training weights will be used.
epoch 0  | loss: 0.47899 |  0:00:08s
epoch 1  | loss: 0.45288 |  0:00:17s
epoch 2  | loss: 0.44979 |  0:00:26s
epoch 3  | loss: 0.44707 |  0:00:35s
epoch 4  | loss: 0.44457 |  0:00:43s
epoch 5  | loss: 0.44438 |  0:00:52s
epoch 6  | loss: 0.44358 |  0:01:01s
epoch 7  | loss: 0.44306 |  0:01:10s
epoch 8  | loss: 0.44196 |  0:01:18s
epoch 9  | loss: 0.44088 |  0:01:27s
CPU times: user 5min 16s, sys: 583 ms, total: 5min 17s
Wall time: 1min 30s

In [23]:

In [23]: y_pred = md.predict_proba(X_test)[:,1]

In [24]: print(metrics.roc_auc_score(y_test, y_pred))
0.7064836519507345

szilard commented Mar 4, 2021

New EC2 instance (still m5.2xlarge):


md = TabNetClassifier(cat_idxs=cat_idxs,
                       cat_dims=cat_dims,
                       cat_emb_dim=1,
                       ## optimizer_fn=torch.optim.Adam,
                       ## optimizer_params=dict(lr=2e-2),
                       scheduler_params={"step_size":50, # how to use learning rate scheduler
                                         "gamma":0.9},
                       scheduler_fn=torch.optim.lr_scheduler.StepLR,
                       mask_type='entmax' # "sparsemax"
)

%%time
md.fit( X_train=X_train, y_train=y_train,
    max_epochs=10, patience=0,
    ## batch_size=1024, virtual_batch_size=128,
    ## weights=0,
)
epoch 0  | loss: 0.47899 |  0:00:07s
epoch 1  | loss: 0.45288 |  0:00:14s
epoch 2  | loss: 0.44979 |  0:00:21s
epoch 3  | loss: 0.44707 |  0:00:29s
epoch 4  | loss: 0.44457 |  0:00:36s
epoch 5  | loss: 0.44438 |  0:00:43s
epoch 6  | loss: 0.44358 |  0:00:51s
epoch 7  | loss: 0.44306 |  0:00:58s
epoch 8  | loss: 0.44196 |  0:01:05s
epoch 9  | loss: 0.44088 |  0:01:13s
CPU times: user 4min 34s, sys: 548 ms, total: 4min 35s
Wall time: 1min 15s

In [35]:

In [35]: y_pred = md.predict_proba(X_test)[:,1]

In [36]: print(metrics.roc_auc_score(y_test, y_pred))
0.7064836519507345

Batch size:

%%time
md.fit( X_train=X_train, y_train=y_train,
    max_epochs=10, patience=0,
    batch_size=4*1024, virtual_batch_size=128,
    ## weights=0,
)
epoch 0  | loss: 0.5178  |  0:00:05s
epoch 1  | loss: 0.45922 |  0:00:11s
epoch 2  | loss: 0.45388 |  0:00:17s
epoch 3  | loss: 0.45072 |  0:00:23s
epoch 4  | loss: 0.44899 |  0:00:29s
epoch 5  | loss: 0.44839 |  0:00:35s
epoch 6  | loss: 0.44636 |  0:00:41s
epoch 7  | loss: 0.44527 |  0:00:47s
epoch 8  | loss: 0.44309 |  0:00:53s
epoch 9  | loss: 0.44292 |  0:00:59s
CPU times: user 3min 40s, sys: 508 ms, total: 3min 41s
Wall time: 1min 1s

In [39]:

In [39]: y_pred = md.predict_proba(X_test)[:,1]

In [40]: print(metrics.roc_auc_score(y_test, y_pred))
0.7001853191783345
%%time
md.fit( X_train=X_train, y_train=y_train,
    max_epochs=10, patience=0,
    batch_size=16*1024, virtual_batch_size=128,
    ## weights=0,
)
epoch 0  | loss: 0.61956 |  0:00:05s
epoch 1  | loss: 0.49296 |  0:00:10s
epoch 2  | loss: 0.47155 |  0:00:16s
epoch 3  | loss: 0.46408 |  0:00:21s
epoch 4  | loss: 0.46031 |  0:00:27s
epoch 5  | loss: 0.45869 |  0:00:32s
epoch 6  | loss: 0.45733 |  0:00:38s
epoch 7  | loss: 0.4559  |  0:00:43s
epoch 8  | loss: 0.45489 |  0:00:48s
epoch 9  | loss: 0.45377 |  0:00:54s
CPU times: user 3min 14s, sys: 420 ms, total: 3min 15s
Wall time: 55.9 s

In [43]:

In [43]: y_pred = md.predict_proba(X_test)[:,1]

In [44]: print(metrics.roc_auc_score(y_test, y_pred))
0.6302025008820336

16K is too large and degrades AUC. 4K speeds things up a bit and degrades AUC only a little. We'll keep 1024.

szilard commented Mar 4, 2021

One-cycle learning rate (OneCycleLR) as suggested above by @Optimox:


MAX_EPOCH = 10 
BS = 1024 

md = TabNetClassifier(cat_idxs=cat_idxs,
                       cat_dims=cat_dims,
                       cat_emb_dim=1,
                       ## optimizer_fn=torch.optim.Adam,
                       ## optimizer_params=dict(lr=2e-2),
                       scheduler_fn=torch.optim.lr_scheduler.OneCycleLR,
                       scheduler_params=dict(max_lr=0.05,
                                             steps_per_epoch=int(X_train.shape[0] / BS),
                                             epochs=MAX_EPOCH,
                                             is_batch_level=True),
                       mask_type='entmax' # "sparsemax"
)

%%time
md.fit( X_train=X_train, y_train=y_train,
    max_epochs=MAX_EPOCH, patience=0,
    ## batch_size=1024, virtual_batch_size=128,
    ## weights=0,
    drop_last = True
)
No early stopping will be performed, last training weights will be used.
epoch 0  | loss: 0.54787 |  0:00:07s
epoch 1  | loss: 0.45927 |  0:00:14s
epoch 2  | loss: 0.4539  |  0:00:21s
epoch 3  | loss: 0.44935 |  0:00:29s
epoch 4  | loss: 0.44724 |  0:00:36s
epoch 5  | loss: 0.44558 |  0:00:43s
epoch 6  | loss: 0.44426 |  0:00:51s
epoch 7  | loss: 0.44219 |  0:00:58s
epoch 8  | loss: 0.44077 |  0:01:05s
epoch 9  | loss: 0.44026 |  0:01:12s
CPU times: user 4min 33s, sys: 492 ms, total: 4min 33s
Wall time: 1min 14s

In [65]:

In [65]: y_pred = md.predict_proba(X_test)[:,1]

In [66]: print(metrics.roc_auc_score(y_test, y_pred))
0.7056652438691513

Very similar runtime and AUC to the previous run.

szilard commented Mar 4, 2021

Without LR scheduler:

Remove:

                       scheduler_params={"step_size":50, # how to use learning rate scheduler
                                         "gamma":0.9},
                       scheduler_fn=torch.optim.lr_scheduler.StepLR,

Code:


md = TabNetClassifier(cat_idxs=cat_idxs,
                       cat_dims=cat_dims,
                       cat_emb_dim=1,
                       ## optimizer_fn=torch.optim.Adam,
                       ## optimizer_params=dict(lr=2e-2),
                       mask_type='entmax' # "sparsemax"
)

%%time
md.fit( X_train=X_train, y_train=y_train,
    max_epochs=10, patience=0,
    ## batch_size=1024, virtual_batch_size=128,
    ## weights=0,
)
epoch 0  | loss: 0.47899 |  0:00:07s
epoch 1  | loss: 0.45288 |  0:00:14s
epoch 2  | loss: 0.44979 |  0:00:22s
epoch 3  | loss: 0.44707 |  0:00:29s
epoch 4  | loss: 0.44457 |  0:00:36s
epoch 5  | loss: 0.44438 |  0:00:43s
epoch 6  | loss: 0.44358 |  0:00:51s
epoch 7  | loss: 0.44306 |  0:00:58s
epoch 8  | loss: 0.44196 |  0:01:05s
epoch 9  | loss: 0.44088 |  0:01:13s
CPU times: user 4min 34s, sys: 532 ms, total: 4min 35s
Wall time: 1min 15s

In [75]:

In [75]: y_pred = md.predict_proba(X_test)[:,1]

In [76]: print(metrics.roc_auc_score(y_test, y_pred))
0.7064836519507345

szilard commented Mar 4, 2021

mask_type='sparsemax':


md = TabNetClassifier(cat_idxs=cat_idxs,
                       cat_dims=cat_dims,
                       cat_emb_dim=1,
                       ## optimizer_fn=torch.optim.Adam,
                       ## optimizer_params=dict(lr=2e-2),
                       mask_type='sparsemax' # "sparsemax"
)

%%time
md.fit( X_train=X_train, y_train=y_train,
    max_epochs=10, patience=0,
    ## batch_size=1024, virtual_batch_size=128,
    ## weights=0,
)
No early stopping will be performed, last training weights will be used.
epoch 0  | loss: 0.48224 |  0:00:07s
epoch 1  | loss: 0.45447 |  0:00:14s
epoch 2  | loss: 0.45087 |  0:00:21s
epoch 3  | loss: 0.44885 |  0:00:29s
epoch 4  | loss: 0.44667 |  0:00:36s
epoch 5  | loss: 0.44576 |  0:00:43s
epoch 6  | loss: 0.44538 |  0:00:51s
epoch 7  | loss: 0.44727 |  0:00:58s
epoch 8  | loss: 0.4467  |  0:01:05s
epoch 9  | loss: 0.44514 |  0:01:13s
CPU times: user 4min 33s, sys: 644 ms, total: 4min 34s
Wall time: 1min 15s

In [83]:

In [83]: y_pred = md.predict_proba(X_test)[:,1]

In [84]: print(metrics.roc_auc_score(y_test, y_pred))
0.7031382841315941

szilard commented Mar 4, 2021

New simplified baseline:

from pytorch_tabnet.tab_model import TabNetClassifier
import torch

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn import metrics


d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")


d_all = pd.concat([d_train,d_test])

vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
  d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])

X_all = d_all[vars_num+vars_cat]
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)

cat_idxs = [ i for i, col in enumerate(X_all.columns) if col in vars_cat]
cat_dims = [ len(np.unique(X_all.iloc[:,i].values)) for i in cat_idxs]

X_train = X_all[0:d_train.shape[0]].to_numpy()
y_train = y_all[0:d_train.shape[0]]
X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])].to_numpy()
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]


md = TabNetClassifier(cat_idxs=cat_idxs,
                       cat_dims=cat_dims,
                       cat_emb_dim=1,
                       ## optimizer_fn=torch.optim.Adam,
                       ## optimizer_params=dict(lr=2e-2),
                       ## mask_type='sparsemax',
)

%%time
md.fit( X_train=X_train, y_train=y_train,
    max_epochs=10, patience=0,
    ## batch_size=1024, virtual_batch_size=128,
    ## weights=0,
)


y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
epoch 0  | loss: 0.48224 |  0:00:07s
epoch 1  | loss: 0.45447 |  0:00:14s
epoch 2  | loss: 0.45087 |  0:00:21s
epoch 3  | loss: 0.44885 |  0:00:29s
epoch 4  | loss: 0.44667 |  0:00:36s
epoch 5  | loss: 0.44576 |  0:00:43s
epoch 6  | loss: 0.44538 |  0:00:51s
epoch 7  | loss: 0.44727 |  0:00:58s
epoch 8  | loss: 0.4467  |  0:01:05s
epoch 9  | loss: 0.44514 |  0:01:12s
CPU times: user 4min 33s, sys: 486 ms, total: 4min 34s
Wall time: 1min 14s

In [23]:

In [23]: y_pred = md.predict_proba(X_test)[:,1]

In [24]: print(metrics.roc_auc_score(y_test, y_pred))
0.7031382841315941

szilard commented Mar 4, 2021

n_steps=1:
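
The only change assumed for this run, relative to the simplified baseline above, is in the constructor (a sketch; only the log was posted):

md = TabNetClassifier(cat_idxs=cat_idxs,
                      cat_dims=cat_dims,
                      cat_emb_dim=1,
                      n_steps=1)   # single decision step instead of the default 3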

No early stopping will be performed, last training weights will be used.
epoch 0  | loss: 0.47508 |  0:00:04s
epoch 1  | loss: 0.45215 |  0:00:08s
epoch 2  | loss: 0.44853 |  0:00:12s
epoch 3  | loss: 0.44623 |  0:00:16s
epoch 4  | loss: 0.44394 |  0:00:21s
epoch 5  | loss: 0.44179 |  0:00:25s
epoch 6  | loss: 0.44046 |  0:00:29s
epoch 7  | loss: 0.43847 |  0:00:33s
epoch 8  | loss: 0.43788 |  0:00:38s
epoch 9  | loss: 0.43678 |  0:00:42s
CPU times: user 2min 37s, sys: 320 ms, total: 2min 37s
Wall time: 43.7 s

In [27]:

In [27]: y_pred = md.predict_proba(X_test)[:,1]

In [28]: print(metrics.roc_auc_score(y_test, y_pred))
0.7107538398172549

Faster and better AUC.

Also lambda_sparse=0:

No early stopping will be performed, last training weights will be used.
epoch 0  | loss: 0.47405 |  0:00:04s
epoch 1  | loss: 0.45126 |  0:00:08s
epoch 2  | loss: 0.44834 |  0:00:12s
epoch 3  | loss: 0.44528 |  0:00:16s
epoch 4  | loss: 0.44233 |  0:00:21s
epoch 5  | loss: 0.44126 |  0:00:25s
epoch 6  | loss: 0.44    |  0:00:29s
epoch 7  | loss: 0.43786 |  0:00:33s
epoch 8  | loss: 0.43744 |  0:00:38s
epoch 9  | loss: 0.43609 |  0:00:42s
CPU times: user 2min 37s, sys: 264 ms, total: 2min 37s
Wall time: 43.6 s

In [31]:

In [31]: y_pred = md.predict_proba(X_test)[:,1]

In [32]: print(metrics.roc_auc_score(y_test, y_pred))
0.7108606253654501

szilard commented Mar 4, 2021

n_d=16, n_a=16

No early stopping will be performed, last training weights will be used.
epoch 0  | loss: 0.47382 |  0:00:05s
epoch 1  | loss: 0.45346 |  0:00:10s
epoch 2  | loss: 0.44816 |  0:00:16s
epoch 3  | loss: 0.44531 |  0:00:21s
epoch 4  | loss: 0.44305 |  0:00:26s
epoch 5  | loss: 0.44061 |  0:00:32s
epoch 6  | loss: 0.43808 |  0:00:37s
epoch 7  | loss: 0.43717 |  0:00:43s
epoch 8  | loss: 0.43614 |  0:00:48s
epoch 9  | loss: 0.43621 |  0:00:53s
CPU times: user 3min 22s, sys: 344 ms, total: 3min 23s
Wall time: 55.2 s

In [35]:

In [35]: y_pred = md.predict_proba(X_test)[:,1]

In [36]: print(metrics.roc_auc_score(y_test, y_pred))
0.7098815942627792

n_d=32, n_a=32

No early stopping will be performed, last training weights will be used.
epoch 0  | loss: 0.46477 |  0:00:07s
epoch 1  | loss: 0.45027 |  0:00:14s
epoch 2  | loss: 0.44732 |  0:00:22s
epoch 3  | loss: 0.44461 |  0:00:29s
epoch 4  | loss: 0.44182 |  0:00:37s
epoch 5  | loss: 0.44    |  0:00:44s
epoch 6  | loss: 0.4394  |  0:00:52s
epoch 7  | loss: 0.43901 |  0:00:59s
epoch 8  | loss: 0.43769 |  0:01:07s
epoch 9  | loss: 0.43658 |  0:01:14s
CPU times: user 4min 45s, sys: 427 ms, total: 4min 45s
Wall time: 1min 16s

In [39]:

In [39]: y_pred = md.predict_proba(X_test)[:,1]

In [40]: print(metrics.roc_auc_score(y_test, y_pred))
0.7092875041159501
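
For reference, a minimal sketch of how these capacity settings are passed (n_d and n_a are TabNetClassifier constructor arguments with default 8; the other settings are assumed kept from the runs above):

md = TabNetClassifier(cat_idxs=cat_idxs,
                      cat_dims=cat_dims,
                      cat_emb_dim=1,
                      n_steps=1,
                      n_d=16, n_a=16,   ## width of the decision and attention layers
)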

@szilard

szilard commented Mar 4, 2021

cat_emb_dim = np.floor(np.log2(cat_dims))

No early stopping will be performed, last training weights will be used.
epoch 0  | loss: 0.48529 |  0:00:04s
epoch 1  | loss: 0.45828 |  0:00:09s
epoch 2  | loss: 0.45028 |  0:00:13s
epoch 3  | loss: 0.44324 |  0:00:18s
epoch 4  | loss: 0.44048 |  0:00:22s
epoch 5  | loss: 0.4383  |  0:00:27s
epoch 6  | loss: 0.43575 |  0:00:32s
epoch 7  | loss: 0.43383 |  0:00:36s
epoch 8  | loss: 0.43228 |  0:00:41s
epoch 9  | loss: 0.43109 |  0:00:45s
CPU times: user 2min 50s, sys: 340 ms, total: 2min 51s
Wall time: 47.2 s

In [55]:

In [55]: y_pred = md.predict_proba(X_test)[:,1]

In [56]: print(metrics.roc_auc_score(y_test, y_pred))
0.7162047844289364
In [30]: d_train
Out[30]:
      Month DayofMonth DayOfWeek  DepTime UniqueCarrier Origin Dest  Distance dep_delayed_15min
0       c-8       c-21       c-7     1934            AA    ATL  DFW       732                 N
1       c-4       c-20       c-3     1548            US    PIT  MCO       834                 N
2       c-9        c-2       c-5     1422            XE    RDU  CLE       416                 N
3      c-11       c-25       c-6     1015            OO    DEN  MEM       872                 N
4      c-10        c-7       c-6     1828            WN    MDW  OMA       423                 Y
...     ...        ...       ...      ...           ...    ...  ...       ...               ...
99995   c-5        c-4       c-3     1618            OO    SFO  RDD       199                 N
99996   c-1       c-18       c-3      804            CO    EWR  DAB       884                 N
99997   c-1       c-24       c-2     1901            NW    DTW  IAH      1076                 N
99998   c-4       c-27       c-4     1515            MQ    DFW  GGG       140                 N
99999  c-11       c-17       c-4     1800            WN    SEA  SMF       605                 N

[100000 rows x 9 columns]

In [31]: cat_idxs
Out[31]: [2, 3, 4, 5, 6, 7]

In [32]: cat_dims
Out[32]: [12, 31, 7, 23, 307, 307]

In [33]: cat_emb_dim
Out[33]: array([3, 4, 2, 4, 8, 8])

Now with natural log instead of log2:

cat_emb_dim = np.floor(np.log(cat_dims))

In [37]: cat_emb_dim
Out[37]: array([2, 3, 1, 3, 5, 5])
No early stopping will be performed, last training weights will be used.
epoch 0  | loss: 0.4844  |  0:00:04s
epoch 1  | loss: 0.45993 |  0:00:08s
epoch 2  | loss: 0.45147 |  0:00:13s
epoch 3  | loss: 0.44668 |  0:00:17s
epoch 4  | loss: 0.44281 |  0:00:22s
epoch 5  | loss: 0.44047 |  0:00:26s
epoch 6  | loss: 0.43904 |  0:00:31s
epoch 7  | loss: 0.43612 |  0:00:35s
epoch 8  | loss: 0.43392 |  0:00:40s
epoch 9  | loss: 0.43232 |  0:00:44s
CPU times: user 2min 45s, sys: 284 ms, total: 2min 45s
Wall time: 45.8 s

In [47]:

In [47]: y_pred = md.predict_proba(X_test)[:,1]

In [48]: print(metrics.roc_auc_score(y_test, y_pred))
0.716194439796583

New baseline (with cat_emb_dim cast to int, since embedding dimensions must be integers):

from pytorch_tabnet.tab_model import TabNetClassifier
import torch

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn import metrics


d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")


d_all = pd.concat([d_train,d_test])

vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
  d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])

X_all = d_all[vars_num+vars_cat]
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)

cat_idxs = [ i for i, col in enumerate(X_all.columns) if col in vars_cat]
cat_dims = [ len(np.unique(X_all.iloc[:,i].values)) for i in cat_idxs]
cat_emb_dim = np.floor(np.log(cat_dims)).astype(int)

X_train = X_all[0:d_train.shape[0]].to_numpy()
y_train = y_all[0:d_train.shape[0]]
X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])].to_numpy()
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]


md = TabNetClassifier(cat_idxs=cat_idxs,
                       cat_dims=cat_dims,
                       cat_emb_dim=cat_emb_dim,
                       ## optimizer_fn=torch.optim.Adam,
                       ## optimizer_params=dict(lr=2e-2),
                       ## mask_type='sparsemax',
                       n_steps=1,
)

%%time
md.fit( X_train=X_train, y_train=y_train,
    max_epochs=10, patience=0,
    ## batch_size=1024, virtual_batch_size=128,
    ## weights=0,
)


y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
No early stopping will be performed, last training weights will be used.
epoch 0  | loss: 0.4844  |  0:00:04s
epoch 1  | loss: 0.45993 |  0:00:08s
epoch 2  | loss: 0.45147 |  0:00:13s
epoch 3  | loss: 0.44668 |  0:00:17s
epoch 4  | loss: 0.44281 |  0:00:22s
epoch 5  | loss: 0.44047 |  0:00:26s
epoch 6  | loss: 0.43904 |  0:00:31s
epoch 7  | loss: 0.43612 |  0:00:35s
epoch 8  | loss: 0.43392 |  0:00:39s
epoch 9  | loss: 0.43232 |  0:00:44s
CPU times: user 2min 45s, sys: 311 ms, total: 2min 45s
Wall time: 45.7 s

In [24]:

In [24]: y_pred = md.predict_proba(X_test)[:,1]

In [25]: print(metrics.roc_auc_score(y_test, y_pred))
0.716194439796583

@szilard

szilard commented Mar 4, 2021

max_epochs=5

No early stopping will be performed, last training weights will be used.
epoch 0  | loss: 0.4844  |  0:00:04s
epoch 1  | loss: 0.45993 |  0:00:08s
epoch 2  | loss: 0.45147 |  0:00:13s
epoch 3  | loss: 0.44668 |  0:00:17s
epoch 4  | loss: 0.44281 |  0:00:22s
CPU times: user 1min 24s, sys: 132 ms, total: 1min 24s
Wall time: 23.5 s

In [28]:

In [28]: y_pred = md.predict_proba(X_test)[:,1]

In [29]: print(metrics.roc_auc_score(y_test, y_pred))
0.7075546787518303

The AUC is not good enough, so keep max_epochs=10.

@szilard

szilard commented Mar 4, 2021

So after tweaking based on @Optimox's suggestions, here we go:

from pytorch_tabnet.tab_model import TabNetClassifier
import torch

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn import metrics


d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")


d_all = pd.concat([d_train,d_test])

vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
  d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])

X_all = d_all[vars_num+vars_cat]
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)

X_train = X_all[0:d_train.shape[0]].to_numpy()
y_train = y_all[0:d_train.shape[0]]
X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])].to_numpy()
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]


cat_idxs = [ i for i, col in enumerate(X_all.columns) if col in vars_cat]
cat_dims = [ len(np.unique(X_all.iloc[:,i].values)) for i in cat_idxs]
cat_emb_dim = np.floor(np.log(cat_dims)).astype(int)


md = TabNetClassifier(cat_idxs=cat_idxs,
                       cat_dims=cat_dims,
                       cat_emb_dim=cat_emb_dim,
                       ## optimizer_fn=torch.optim.Adam,
                       ## optimizer_params=dict(lr=2e-2),
                       ## mask_type='sparsemax',
                       n_steps=1,
)

%%time
md.fit( X_train=X_train, y_train=y_train,
    max_epochs=10, patience=0,
    ## batch_size=1024, virtual_batch_size=128,
    ## weights=0,
)


y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
Device used : cpu


No early stopping will be performed, last training weights will be used.
epoch 0  | loss: 0.4844  |  0:00:04s
epoch 1  | loss: 0.45993 |  0:00:08s
epoch 2  | loss: 0.45147 |  0:00:13s
epoch 3  | loss: 0.44668 |  0:00:17s
epoch 4  | loss: 0.44281 |  0:00:22s
epoch 5  | loss: 0.44047 |  0:00:26s
epoch 6  | loss: 0.43904 |  0:00:31s
epoch 7  | loss: 0.43612 |  0:00:35s
epoch 8  | loss: 0.43392 |  0:00:39s
epoch 9  | loss: 0.43232 |  0:00:44s
CPU times: user 2min 45s, sys: 260 ms, total: 2min 45s
Wall time: 45.7 s


In [25]: print(metrics.roc_auc_score(y_test, y_pred))
0.716194439796583

@Optimox

Optimox commented Mar 4, 2021

Nice work @szilard!

May I request one last experiment?

  • switch to GPU
  • max_epochs=20 and patience=20
  • if you can put back eval_set with train and valid AUC, this would help understand where the training process stands after 10 epochs

@szilard

szilard commented Mar 4, 2021

Sure, will do.

@szilard

szilard commented Mar 4, 2021

from pytorch_tabnet.tab_model import TabNetClassifier
import torch

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn import metrics


d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")


d_all = pd.concat([d_train,d_test])

vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
  d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])

X_all = d_all[vars_num+vars_cat]
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)

idx_trvl = np.random.choice(["train", "valid"], p =[.8, .2], size=(d_train.shape[0],))

X_train = X_all[0:d_train.shape[0]][idx_trvl=="train"].to_numpy()
y_train = y_all[0:d_train.shape[0]][idx_trvl=="train"]
X_valid = X_all[0:d_train.shape[0]][idx_trvl=="valid"].to_numpy()
y_valid = y_all[0:d_train.shape[0]][idx_trvl=="valid"]

X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])].to_numpy()
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]

cat_idxs = [ i for i, col in enumerate(X_all.columns) if col in vars_cat]
cat_dims = [ len(np.unique(X_all.iloc[:,i].values)) for i in cat_idxs]
cat_emb_dim = np.floor(np.log(cat_dims)).astype(int)


md = TabNetClassifier(cat_idxs=cat_idxs,
                       cat_dims=cat_dims,
                       cat_emb_dim=cat_emb_dim,
                       ## optimizer_fn=torch.optim.Adam,
                       ## optimizer_params=dict(lr=2e-2),
                       ## mask_type='sparsemax',
                       n_steps=1,
)

%%time
md.fit( X_train=X_train, y_train=y_train,
    eval_set=[(X_train, y_train), (X_valid, y_valid)],
    eval_name=['train', 'valid'],
    eval_metric=['auc'],
    max_epochs=20, patience=20,
    ## batch_size=1024, virtual_batch_size=128,
    ## weights=0,
)

y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))

On CPU (same m5.2xlarge):

epoch 0  | loss: 0.48833 | train_auc: 0.6126  | valid_auc: 0.60897 |  0:00:05s
epoch 1  | loss: 0.46281 | train_auc: 0.67961 | valid_auc: 0.67715 |  0:00:10s
epoch 2  | loss: 0.45379 | train_auc: 0.69824 | valid_auc: 0.69417 |  0:00:15s
epoch 3  | loss: 0.44967 | train_auc: 0.70836 | valid_auc: 0.69995 |  0:00:20s
epoch 4  | loss: 0.44647 | train_auc: 0.71537 | valid_auc: 0.70304 |  0:00:25s
epoch 5  | loss: 0.44396 | train_auc: 0.71967 | valid_auc: 0.70513 |  0:00:30s
epoch 6  | loss: 0.44187 | train_auc: 0.72416 | valid_auc: 0.70679 |  0:00:35s
epoch 7  | loss: 0.44032 | train_auc: 0.72708 | valid_auc: 0.71118 |  0:00:40s
epoch 8  | loss: 0.43909 | train_auc: 0.72913 | valid_auc: 0.71072 |  0:00:45s
epoch 9  | loss: 0.43681 | train_auc: 0.73327 | valid_auc: 0.7148  |  0:00:50s
epoch 10 | loss: 0.43589 | train_auc: 0.7361  | valid_auc: 0.71585 |  0:00:55s
epoch 11 | loss: 0.43444 | train_auc: 0.73857 | valid_auc: 0.71348 |  0:01:00s
epoch 12 | loss: 0.43305 | train_auc: 0.74264 | valid_auc: 0.71583 |  0:01:05s
epoch 13 | loss: 0.43142 | train_auc: 0.74608 | valid_auc: 0.71513 |  0:01:10s
epoch 14 | loss: 0.43111 | train_auc: 0.749   | valid_auc: 0.71659 |  0:01:15s
epoch 15 | loss: 0.42882 | train_auc: 0.75014 | valid_auc: 0.71627 |  0:01:20s
epoch 16 | loss: 0.42719 | train_auc: 0.75229 | valid_auc: 0.71775 |  0:01:25s
epoch 17 | loss: 0.42576 | train_auc: 0.75604 | valid_auc: 0.71519 |  0:01:30s
epoch 18 | loss: 0.42513 | train_auc: 0.75762 | valid_auc: 0.71999 |  0:01:35s
epoch 19 | loss: 0.42372 | train_auc: 0.75752 | valid_auc: 0.7161  |  0:01:40s
Stop training because you reached max_epochs = 20 with best_epoch = 18 and best_valid_auc = 0.71999
Best weights from best epoch are automatically used!
CPU times: user 5min 58s, sys: 587 ms, total: 5min 58s
Wall time: 1min 41s

In [27]: y_pred = md.predict_proba(X_test)[:,1]

In [28]: print(metrics.roc_auc_score(y_test, y_pred))
0.7153019142655415

So it's about the same as after 10 epochs (but much better than after 5 epochs, which is why I kept 10).

Will run it on GPU as well.
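
(pytorch-tabnet picks the device automatically, so the GPU run needs no code change; if you want to force it, the constructor's device_name argument, which defaults to 'auto', can be set explicitly. A minimal sketch, with the other settings as above:)

md = TabNetClassifier(cat_idxs=cat_idxs,
                      cat_dims=cat_dims,
                      cat_emb_dim=cat_emb_dim,
                      n_steps=1,
                      device_name='cuda')   ## 'auto' (default) already selects CUDA when a GPU is visible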

@szilard

szilard commented Mar 4, 2021

I shouldn't do this and it's evil, but here is what it looks like if you put the test set in the eval set (to see the test AUC after each epoch):

from pytorch_tabnet.tab_model import TabNetClassifier
import torch

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn import metrics


d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")


d_all = pd.concat([d_train,d_test])

vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
  d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])

X_all = d_all[vars_num+vars_cat]
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)

X_train = X_all[0:d_train.shape[0]].to_numpy()
y_train = y_all[0:d_train.shape[0]]
X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])].to_numpy()
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]


cat_idxs = [ i for i, col in enumerate(X_all.columns) if col in vars_cat]
cat_dims = [ len(np.unique(X_all.iloc[:,i].values)) for i in cat_idxs]
cat_emb_dim = np.floor(np.log(cat_dims)).astype(int)


md = TabNetClassifier(cat_idxs=cat_idxs,
                       cat_dims=cat_dims,
                       cat_emb_dim=cat_emb_dim,
                       ## optimizer_fn=torch.optim.Adam,
                       ## optimizer_params=dict(lr=2e-2),
                       ## mask_type='sparsemax',
                       n_steps=1,
)

%%time
md.fit( X_train=X_train, y_train=y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    eval_name=['train', 'test_EVIL'],
    eval_metric=['auc'],
    max_epochs=100, patience=100,
    ## batch_size=1024, virtual_batch_size=128,
    ## weights=0,
)

y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
epoch 0  | loss: 0.4844  | train_auc: 0.62071 | test_EVIL_auc: 0.62318 |  0:00:07s
epoch 1  | loss: 0.45968 | train_auc: 0.69228 | test_EVIL_auc: 0.68665 |  0:00:14s
epoch 2  | loss: 0.451   | train_auc: 0.70315 | test_EVIL_auc: 0.69525 |  0:00:21s
epoch 3  | loss: 0.44581 | train_auc: 0.71579 | test_EVIL_auc: 0.70578 |  0:00:28s
epoch 4  | loss: 0.44281 | train_auc: 0.71931 | test_EVIL_auc: 0.70784 |  0:00:36s
epoch 5  | loss: 0.44006 | train_auc: 0.72304 | test_EVIL_auc: 0.71234 |  0:00:43s
epoch 6  | loss: 0.43677 | train_auc: 0.73199 | test_EVIL_auc: 0.71953 |  0:00:50s
epoch 7  | loss: 0.4347  | train_auc: 0.7338  | test_EVIL_auc: 0.71797 |  0:00:57s
epoch 8  | loss: 0.43309 | train_auc: 0.73757 | test_EVIL_auc: 0.72012 |  0:01:05s
epoch 9  | loss: 0.43194 | train_auc: 0.7394  | test_EVIL_auc: 0.71784 |  0:01:12s
epoch 10 | loss: 0.43134 | train_auc: 0.74163 | test_EVIL_auc: 0.71915 |  0:01:19s
epoch 11 | loss: 0.43039 | train_auc: 0.74295 | test_EVIL_auc: 0.71937 |  0:01:26s
epoch 12 | loss: 0.42924 | train_auc: 0.74432 | test_EVIL_auc: 0.71974 |  0:01:33s
epoch 13 | loss: 0.4283  | train_auc: 0.74597 | test_EVIL_auc: 0.71767 |  0:01:41s
epoch 14 | loss: 0.42811 | train_auc: 0.74761 | test_EVIL_auc: 0.71971 |  0:01:48s
epoch 15 | loss: 0.42746 | train_auc: 0.74941 | test_EVIL_auc: 0.71978 |  0:01:55s
epoch 16 | loss: 0.42627 | train_auc: 0.75114 | test_EVIL_auc: 0.71785 |  0:02:02s
epoch 17 | loss: 0.42571 | train_auc: 0.75149 | test_EVIL_auc: 0.71926 |  0:02:10s
epoch 18 | loss: 0.42498 | train_auc: 0.75328 | test_EVIL_auc: 0.72245 |  0:02:17s
epoch 19 | loss: 0.42484 | train_auc: 0.75531 | test_EVIL_auc: 0.71889 |  0:02:24s
epoch 20 | loss: 0.42389 | train_auc: 0.75647 | test_EVIL_auc: 0.71809 |  0:02:31s
epoch 21 | loss: 0.42356 | train_auc: 0.75796 | test_EVIL_auc: 0.71801 |  0:02:38s
epoch 22 | loss: 0.42199 | train_auc: 0.75997 | test_EVIL_auc: 0.71915 |  0:02:46s
epoch 23 | loss: 0.42242 | train_auc: 0.75839 | test_EVIL_auc: 0.71692 |  0:02:53s
epoch 24 | loss: 0.42192 | train_auc: 0.75992 | test_EVIL_auc: 0.71544 |  0:03:00s
epoch 25 | loss: 0.42138 | train_auc: 0.76261 | test_EVIL_auc: 0.71524 |  0:03:07s
epoch 26 | loss: 0.42117 | train_auc: 0.76102 | test_EVIL_auc: 0.71697 |  0:03:15s
epoch 27 | loss: 0.42002 | train_auc: 0.76407 | test_EVIL_auc: 0.71803 |  0:03:22s
epoch 28 | loss: 0.41991 | train_auc: 0.76443 | test_EVIL_auc: 0.71431 |  0:03:29s
epoch 29 | loss: 0.41906 | train_auc: 0.76418 | test_EVIL_auc: 0.71285 |  0:03:36s
epoch 30 | loss: 0.41824 | train_auc: 0.76658 | test_EVIL_auc: 0.71568 |  0:03:44s
epoch 31 | loss: 0.41815 | train_auc: 0.76786 | test_EVIL_auc: 0.71358 |  0:03:51s
epoch 32 | loss: 0.41768 | train_auc: 0.768   | test_EVIL_auc: 0.71048 |  0:03:58s
epoch 33 | loss: 0.41817 | train_auc: 0.76699 | test_EVIL_auc: 0.71211 |  0:04:05s
epoch 34 | loss: 0.41728 | train_auc: 0.76909 | test_EVIL_auc: 0.71042 |  0:04:12s
epoch 35 | loss: 0.41631 | train_auc: 0.76927 | test_EVIL_auc: 0.71057 |  0:04:20s
epoch 36 | loss: 0.41613 | train_auc: 0.76786 | test_EVIL_auc: 0.71216 |  0:04:27s
epoch 37 | loss: 0.4165  | train_auc: 0.77291 | test_EVIL_auc: 0.71141 |  0:04:34s
epoch 38 | loss: 0.41622 | train_auc: 0.76998 | test_EVIL_auc: 0.71147 |  0:04:41s
epoch 39 | loss: 0.41609 | train_auc: 0.77182 | test_EVIL_auc: 0.71093 |  0:04:49s
epoch 40 | loss: 0.41534 | train_auc: 0.77209 | test_EVIL_auc: 0.70715 |  0:04:56s
epoch 41 | loss: 0.4141  | train_auc: 0.77301 | test_EVIL_auc: 0.70977 |  0:05:03s
epoch 42 | loss: 0.41527 | train_auc: 0.77323 | test_EVIL_auc: 0.70867 |  0:05:10s
epoch 43 | loss: 0.41407 | train_auc: 0.77576 | test_EVIL_auc: 0.70862 |  0:05:18s
epoch 44 | loss: 0.41422 | train_auc: 0.77332 | test_EVIL_auc: 0.70503 |  0:05:25s
epoch 45 | loss: 0.41393 | train_auc: 0.77583 | test_EVIL_auc: 0.70816 |  0:05:32s
epoch 46 | loss: 0.41425 | train_auc: 0.77569 | test_EVIL_auc: 0.70876 |  0:05:39s
epoch 47 | loss: 0.41295 | train_auc: 0.77856 | test_EVIL_auc: 0.70445 |  0:05:47s
epoch 48 | loss: 0.41323 | train_auc: 0.77851 | test_EVIL_auc: 0.70548 |  0:05:54s
epoch 49 | loss: 0.41313 | train_auc: 0.77951 | test_EVIL_auc: 0.70675 |  0:06:01s
epoch 50 | loss: 0.41309 | train_auc: 0.77907 | test_EVIL_auc: 0.71031 |  0:06:09
...

It seems to get close to the best AUC (~0.72) after 8 epochs, so looking at 10 epochs is sensible.

Note: the test set has a slightly different distribution than the train set (it is a time-gapped split from the original data).

@szilard

szilard commented Mar 9, 2021

On the 1M-row dataset:

epoch 0  | loss: 0.44987 | train_auc: 0.73058 | test_EVIL_auc: 0.7269  |  0:00:55s
epoch 1  | loss: 0.4354  | train_auc: 0.73689 | test_EVIL_auc: 0.73319 |  0:01:47s
epoch 2  | loss: 0.43182 | train_auc: 0.74066 | test_EVIL_auc: 0.73413 |  0:02:36s
epoch 3  | loss: 0.42976 | train_auc: 0.74279 | test_EVIL_auc: 0.73624 |  0:03:26s
epoch 4  | loss: 0.42863 | train_auc: 0.74371 | test_EVIL_auc: 0.73222 |  0:04:15s
epoch 5  | loss: 0.42728 | train_auc: 0.74605 | test_EVIL_auc: 0.7357  |  0:05:05s
epoch 6  | loss: 0.42637 | train_auc: 0.74763 | test_EVIL_auc: 0.73559 |  0:05:54s
epoch 7  | loss: 0.42519 | train_auc: 0.7487  | test_EVIL_auc: 0.73525 |  0:06:43s
epoch 8  | loss: 0.42452 | train_auc: 0.75031 | test_EVIL_auc: 0.73182 |  0:07:31s
epoch 9  | loss: 0.42361 | train_auc: 0.75099 | test_EVIL_auc: 0.73328 |  0:08:20s
epoch 10 | loss: 0.423   | train_auc: 0.7519  | test_EVIL_auc: 0.73477 |  0:09:10s
epoch 11 | loss: 0.42323 | train_auc: 0.75327 | test_EVIL_auc: 0.73348 |  0:09:59s
epoch 12 | loss: 0.42243 | train_auc: 0.75328 | test_EVIL_auc: 0.73355 |  0:10:48s
epoch 13 | loss: 0.42198 | train_auc: 0.75345 | test_EVIL_auc: 0.73278 |  0:11:37s
epoch 14 | loss: 0.42143 | train_auc: 0.75404 | test_EVIL_auc: 0.73191 |  0:12:27s
epoch 15 | loss: 0.42132 | train_auc: 0.75608 | test_EVIL_auc: 0.73395 |  0:13:17s
epoch 16 | loss: 0.42071 | train_auc: 0.75554 | test_EVIL_auc: 0.73303 |  0:14:07s
epoch 17 | loss: 0.42062 | train_auc: 0.75548 | test_EVIL_auc: 0.73347 |  0:14:56s
epoch 18 | loss: 0.42035 | train_auc: 0.75669 | test_EVIL_auc: 0.73312 |  0:15:45s
epoch 19 | loss: 0.42009 | train_auc: 0.75705 | test_EVIL_auc: 0.7335  |  0:16:34s
epoch 20 | loss: 0.41964 | train_auc: 0.75719 | test_EVIL_auc: 0.73169 |  0:17:22s
epoch 21 | loss: 0.41969 | train_auc: 0.7581  | test_EVIL_auc: 0.7332  |  0:18:11s
epoch 22 | loss: 0.41933 | train_auc: 0.7582  | test_EVIL_auc: 0.73569 |  0:19:00s
epoch 23 | loss: 0.41931 | train_auc: 0.75786 | test_EVIL_auc: 0.73479 |  0:19:49s
epoch 24 | loss: 0.41907 | train_auc: 0.75807 | test_EVIL_auc: 0.7332  |  0:20:38s
epoch 25 | loss: 0.41894 | train_auc: 0.7583  | test_EVIL_auc: 0.73309 |  0:21:26s
epoch 26 | loss: 0.41854 | train_auc: 0.75877 | test_EVIL_auc: 0.73324 |  0:22:15s
epoch 27 | loss: 0.41846 | train_auc: 0.75904 | test_EVIL_auc: 0.73264 |  0:23:03s
epoch 28 | loss: 0.41868 | train_auc: 0.75813 | test_EVIL_auc: 0.73351 |  0:23:53s
epoch 29 | loss: 0.41837 | train_auc: 0.75918 | test_EVIL_auc: 0.73228 |  0:24:41s
epoch 30 | loss: 0.41824 | train_auc: 0.75868 | test_EVIL_auc: 0.73299 |  0:25:30s

It reaches AUC~0.735 after 3-4 epochs (2-3 mins).

@szilard

szilard commented Mar 9, 2021

Compare that with a standard feed-forward neural net (h2o deep learning):

library(h2o)
h2o.init(max_mem_size = "10g", nthreads = -1)

dx_train <- h2o.importFile("train-1m.csv")
dx_valid <- h2o.importFile("valid.csv")
dx_test <- h2o.importFile("test.csv")

## use the same normalization as the other DL libs that don't auto-normalize
dx_train$DepTime <- dx_train$DepTime/2500
dx_valid$DepTime <- dx_valid$DepTime/2500
dx_test$DepTime <- dx_test$DepTime/2500

dx_train$Distance <- log10(dx_train$Distance)/4
dx_valid$Distance <- log10(dx_valid$Distance)/4
dx_test$Distance <- log10(dx_test$Distance)/4

Xnames <- names(dx_train)[which(names(dx_train)!="dep_delayed_15min")]


system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            ## DEFAULT: activation = "Rectifier", hidden = c(200,200), 
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC
 user  system elapsed
 2.097   0.032 232.004
> h2o.performance(md, dx_test)@metrics$AUC
[1] 0.7305076

TODO: Try out various setups such as here:

https://github.com/szilard/benchm-ml#deep-neural-networks
szilard/benchm-ml#28

[Screenshot: Screen Shot 2021-03-09 at 1 44 02 PM]

@szilard

szilard commented Mar 10, 2021


library(h2o)
h2o.init(max_mem_size = "10g", nthreads = -1)


dx_train <- h2o.importFile("train-1m.csv")
dx_valid <- h2o.importFile("valid.csv")
dx_test <- h2o.importFile("test.csv")


## use the same normalization as the other DL libs that don't auto-normalize
dx_train$DepTime <- dx_train$DepTime/2500
dx_valid$DepTime <- dx_valid$DepTime/2500
dx_test$DepTime <- dx_test$DepTime/2500

dx_train$Distance <- log10(dx_train$Distance)/4
dx_valid$Distance <- log10(dx_valid$Distance)/4
dx_test$Distance <- log10(dx_test$Distance)/4


Xnames <- names(dx_train)[which(names(dx_train)!="dep_delayed_15min")]



system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            ## DEFAULT: activation = "Rectifier", hidden = c(200,200), 
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

## print epochs:  1: best AUC (on validation)  2: early stopping
d_scoring <- md@model$scoring_history
d_scoring[d_scoring$validation_auc==max(d_scoring$validation_auc, na.rm=TRUE),]

#   user  system elapsed
#  2.097   0.032 232.004
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7305076



system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            activation = "Rectifier", hidden = c(50,50,50,50), input_dropout_ratio = 0.2,
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

#   user  system elapsed
#  0.895   0.009  81.844
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7288232



system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            activation = "Rectifier", hidden = c(50,50,50,50), 
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

#   user  system elapsed
#  0.767   0.012  69.534
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7328179



system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            activation = "Rectifier", hidden = c(20,20),
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

#   user  system elapsed
#  0.625   0.004  49.193
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7319457



system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            activation = "Rectifier", hidden = c(20),
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

#   user  system elapsed
#  0.576   0.007  40.566
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7266034



system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            activation = "Rectifier", hidden = c(10),
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

#   user  system elapsed
#  0.602   0.012  43.195
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7307281



system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            activation = "Rectifier", hidden = c(5),
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

#   user  system elapsed
#  0.519   0.005  37.024
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7282662



system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            activation = "Rectifier", hidden = c(1),
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

#   user  system elapsed
#  0.497   0.005  31.882
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7105887



system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            activation = "Rectifier", hidden = c(200,200), l1 = 1e-5, l2 = 1e-5, 
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

#   user  system elapsed
#  2.145   0.039 231.927
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7311061



system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            activation = "RectifierWithDropout", hidden = c(200,200,200,200), hidden_dropout_ratios=c(0.2,0.1,0.1,0),
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

#   user  system elapsed
#  3.632   0.084 437.021
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7287979




system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            activation = "Rectifier", hidden = c(200,200), 
            rho = 0.95, epsilon = 1e-06,  ## default:  rho = 0.99, epsilon = 1e-08
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

#   user  system elapsed
#  1.845   0.018 209.470
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7188023    



system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            activation = "Rectifier", hidden = c(200,200), 
            rho = 0.999, epsilon = 1e-08,  ## default:  rho = 0.99, epsilon = 1e-08
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

#   user  system elapsed
#  2.309   0.020 266.621
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7277615    



system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            activation = "Rectifier", hidden = c(200,200), 
            rho = 0.9999, epsilon = 1e-08,  ## default:  rho = 0.99, epsilon = 1e-08
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

#   user  system elapsed
#  2.188   0.011 252.242
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7253196



system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            activation = "Rectifier", hidden = c(200,200), 
            rho = 0.999, epsilon = 1e-06,  ## default:  rho = 0.99, epsilon = 1e-08
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

#   user  system elapsed
#  2.766   0.024 330.894
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7170915



system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            activation = "Rectifier", hidden = c(200,200), 
            rho = 0.999, epsilon = 1e-09,  ## default:  rho = 0.99, epsilon = 1e-08
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

#   user  system elapsed
#  2.250   0.020 259.506
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7279227



system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            activation = "Rectifier", hidden = c(200,200), 
            adaptive_rate = FALSE, ## default: rate = 0.005, rate_decay = 1, momentum_stable = 0,
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

#   user  system elapsed
#  3.504   0.040 419.666
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7276413



system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            activation = "Rectifier", hidden = c(200,200), 
            adaptive_rate = FALSE, rate = 0.001, momentum_start = 0.5, momentum_ramp = 1e5, momentum_stable = 0.99,
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

#   user  system elapsed
#  5.501   0.076 662.537
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7328587



system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            activation = "Rectifier", hidden = c(200,200), 
            adaptive_rate = FALSE, rate = 0.01, momentum_start = 0.5, momentum_ramp = 1e5, momentum_stable = 0.99,
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

#   user  system elapsed
#  3.377   0.032 403.220
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7234964



system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            activation = "Rectifier", hidden = c(200,200), 
            adaptive_rate = FALSE, rate = 0.01, rate_annealing = 1e-05, 
            momentum_start = 0.5, momentum_ramp = 1e5, momentum_stable = 0.99,
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

#   user  system elapsed
#  4.503   0.024 534.935
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7340129



system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            activation = "Rectifier", hidden = c(200,200), 
            adaptive_rate = FALSE, rate = 0.01, rate_annealing = 1e-04, 
            momentum_start = 0.5, momentum_ramp = 1e5, momentum_stable = 0.99,
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

#    user   system  elapsed
#  18.461    0.222 2247.658
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7302286



system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            activation = "Rectifier", hidden = c(200,200), 
            adaptive_rate = FALSE, rate = 0.01, rate_annealing = 1e-05, 
            momentum_start = 0.5, momentum_ramp = 1e5, momentum_stable = 0.9,
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

#   user  system elapsed
#  4.596   0.081 535.119
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7339773


system.time({
  md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, validation_frame = dx_valid,
            activation = "Rectifier", hidden = c(200,200), 
            adaptive_rate = FALSE, rate = 0.01, rate_annealing = 1e-05, 
            momentum_start = 0.5, momentum_ramp = 1e4, momentum_stable = 0.9,
            epochs = 100, stopping_rounds = 3, stopping_metric = "AUC", stopping_tolerance = 0) 
})
h2o.performance(md, dx_test)@metrics$AUC

#   user  system elapsed
#  4.319   0.048 504.486
#> h2o.performance(md, dx_test)@metrics$AUC
#[1] 0.7336099

So the max AUC is ~0.73, or more precisely 0.734 with some complex rate-annealing and momentum trickery, but almost as good (0.732) with just 2 hidden layers (hidden = c(20,20)).

So slightly lower AUC than with tabnet (0.735).

@szilard

szilard commented Mar 10, 2021

With XGBoost:

library(data.table)
library(ROCR)
library(xgboost)
library(Matrix)


d_train <- fread("https://s3.amazonaws.com/benchm-ml--main/train-1m.csv")
d_test <- fread("https://s3.amazonaws.com/benchm-ml--main/test.csv")


X_train_test <- sparse.model.matrix(dep_delayed_15min ~ .-1, data = rbind(d_train, d_test))
n1 <- nrow(d_train)
n2 <- nrow(d_test)
X_train <- X_train_test[1:n1,]
X_test <- X_train_test[(n1+1):(n1+n2),]

dxgb_train <- xgb.DMatrix(data = X_train, label = ifelse(d_train$dep_delayed_15min=='Y',1,0))
dxgb_test  <- xgb.DMatrix(data = X_test, label = ifelse(d_test$dep_delayed_15min=='Y',1,0))


system.time({
  md <- xgb.train(data = dxgb_train, 
            objective = "binary:logistic", 
            nround = 1000, max_depth = 10, eta = 0.1, 
            tree_method = "hist",
            early_stopping_rounds = 10, watchlist = list(train=dxgb_train, test_EVIL=dxgb_test), eval_metric = "auc",  
            verbose = 1)
})


phat <- predict(md, newdata = X_test)
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
[1]     train-auc:0.720297      test_EVIL-auc:0.712929
[2]     train-auc:0.723998      test_EVIL-auc:0.714408
[3]     train-auc:0.728140      test_EVIL-auc:0.718114
[4]     train-auc:0.729386      test_EVIL-auc:0.718708
[5]     train-auc:0.731805      test_EVIL-auc:0.719673
[6]     train-auc:0.733261      test_EVIL-auc:0.720840
[7]     train-auc:0.736055      test_EVIL-auc:0.722550
[8]     train-auc:0.737608      test_EVIL-auc:0.723126
[9]     train-auc:0.739149      test_EVIL-auc:0.723742
[10]    train-auc:0.740089      test_EVIL-auc:0.724102
...
[226]   train-auc:0.826693      test_EVIL-auc:0.752589
[227]   train-auc:0.826756      test_EVIL-auc:0.752574
[228]   train-auc:0.826838      test_EVIL-auc:0.752552
[229]   train-auc:0.826894      test_EVIL-auc:0.752574
[230]   train-auc:0.826961      test_EVIL-auc:0.752560
Stopping. Best iteration:
[220]   train-auc:0.826053      test_EVIL-auc:0.752600

   user  system elapsed
267.232   0.757  38.662

or with nround = 30:
[25]    train-auc:0.759808      test_EVIL-auc:0.733537
[26]    train-auc:0.760835      test_EVIL-auc:0.734024
[27]    train-auc:0.761475      test_EVIL-auc:0.734511
[28]    train-auc:0.762286      test_EVIL-auc:0.735012
[29]    train-auc:0.763378      test_EVIL-auc:0.735775
[30]    train-auc:0.764713      test_EVIL-auc:0.736041
   user  system elapsed
 41.371   0.143   5.916
>
>
> phat <- predict(md, newdata = X_test)
> rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
> cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
0.7360415

@szilard

szilard commented Mar 10, 2021

m5.2xlarge (8 cores)

1M rows

| algo/lib | run time (sec) | AUC |
| --- | --- | --- |
| tabnet, 4 epochs | 200 | 0.736 |
| h2o FF, 2 epochs | 50 | 0.732 |
| XGBoost, 220 trees | 40 | 0.753 |
| XGBoost, 40 trees | 6 | 0.736 |

@cpbotha

cpbotha commented Jul 22, 2022

EDIT: make sure to set drop_last to True when using OneCycleLR

If you don't do this, you'll run into the following error riiiight at the end of your training run:

ValueError: Tried to step 37202 times. The specified number of total steps is 37200

P.S. Don't ask me how I know this.
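
A minimal sketch of the fix, assuming a pytorch-tabnet version where fit() exposes drop_last (the OneCycleLR scheduler wiring itself is omitted here):

md.fit(X_train=X_train, y_train=y_train,
    max_epochs=10, patience=0,
    drop_last=True,   ## drop the incomplete last batch so the scheduler step count matches total_steps
)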
