# Self Supervision with pytorch-tabnet

## What is self supervision?

In machine learning, self supervision means training a supervised algorithm on labels that you created artificially from the data itself.
It is often a reconstruction game: you have no labels, so you hide part of the information in the data from the model and ask it to predict what you hid.

Some examples:
- for computer vision, the labels can be "what rotation angle was applied to this image?" or "can you reconstruct the pixels missing from this patch?"
- for video clips: "which of these two frames comes first in the video?"
- for tabular data: "can you reconstruct the missing features?"
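
As a minimal sketch of the first example, the snippet below builds a rotation task from unlabeled images. The helper name `make_rotation_task` and the toy random "images" are assumptions made for illustration; the point is only that the labels come from a transformation you applied yourself.

```python
import numpy as np

def make_rotation_task(images, rng):
    """Rotate each image by a random multiple of 90 degrees.

    Returns the rotated images and the rotation index (0..3),
    which plays the role of the artificial label.
    """
    labels = rng.integers(0, 4, size=len(images))
    rotated = np.stack(
        [np.rot90(img, k=int(k)) for img, k in zip(images, labels)]
    )
    return rotated, labels

rng = np.random.default_rng(0)
images = rng.random((8, 16, 16))  # 8 square, unlabeled toy "images"
rotated, labels = make_rotation_task(images, rng)
print(rotated.shape, labels)
```

A classifier trained to predict `labels` from `rotated` never needs a human annotation.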

## Why self supervision?

Very often, labelling data is either expensive or very time consuming. Only a small percentage of the available data has labels, and supervised algorithms can only use this small portion. With self supervision you can use the information in your unlabelled data to pretrain your model. The hope is that, in order to solve the puzzle you created on unlabelled data, your model will need to learn some fundamental concepts from your data.

## How does TabNetPretrainer work?

The current implementation is very close to the one proposed in the TabNet research paper. The only difference is that I decided to take the mean of the loss per batch (instead of summing everything) in order to get consistent results with different batch sizes. See the original paper: https://arxiv.org/abs/1908.07442
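
To illustrate why the mean is more convenient than the sum, here is a small NumPy sketch. It assumes a plain squared error on the masked cells and leaves out the per-feature normalisation used in the paper, so it is not the library's loss; `masked_squared_error` is a hypothetical helper.

```python
import numpy as np

def masked_squared_error(x_true, x_pred, mask):
    """Squared reconstruction error on masked cells, one value per row."""
    return (((x_pred - x_true) * mask) ** 2).sum(axis=1)

rng = np.random.default_rng(0)
sums, means = {}, {}
for batch_size in (256, 1024):
    x = rng.normal(size=(batch_size, 20))            # toy batch, 20 features
    x_hat = x + rng.normal(scale=0.1, size=x.shape)  # imperfect reconstruction
    mask = rng.random(x.shape) < 0.8                 # True = cell was hidden
    per_row = masked_squared_error(x, x_hat, mask)
    sums[batch_size] = per_row.sum()    # grows with the batch size
    means[batch_size] = per_row.mean()  # stays on the same scale
    print(batch_size, round(sums[batch_size], 2), round(means[batch_size], 4))
```

With the sum, the loss (and therefore the gradient scale) roughly quadruples when going from 256 to 1024 rows; with the mean, the two values are directly comparable.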

There is a new class named `TabNetPretrainer` which has largely the same parameters as `TabNetRegressor`, `TabNetClassifier` or `TabNetMultiTaskClassifier`.
The `fit` method takes only one additional parameter, `pretraining_ratio`, a float between 0 and 1 that decides how hard the puzzle is. With `pretraining_ratio=0.8` you are asking the model to reconstruct 80% of the features from the 20% you did not mask. Masks are random and created during training at the batch level.
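
The masking step can be pictured with the following NumPy sketch. It assumes each cell is hidden independently with probability `pretraining_ratio` and that hidden cells are set to zero; the library's actual masking may differ in detail, and `obfuscate` is a hypothetical name.

```python
import numpy as np

def obfuscate(batch, pretraining_ratio, rng):
    """Hide a random subset of cells in a batch.

    Returns the masked batch and the binary mask (1 = hidden cell).
    """
    mask = rng.binomial(1, pretraining_ratio, size=batch.shape)
    masked = batch * (1 - mask)  # hidden cells become 0
    return masked, mask

rng = np.random.default_rng(0)
batch = rng.normal(size=(1024, 50))

# A fresh mask is drawn for every batch, so two calls hide different cells.
masked, mask = obfuscate(batch, pretraining_ratio=0.8, rng=rng)
masked2, mask2 = obfuscate(batch, pretraining_ratio=0.8, rng=rng)
print(mask.mean())  # fraction of hidden cells, close to 0.8 on average
```

Under this assumption the 80% figure holds on average rather than exactly for every row.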

You need to call `fit` on some dataset to perform self supervised pretraining. You can also give another dataset in order to perform early stopping to avoid overfitting.

Then you can start from the pretrained weights by passing the pretrainer to the `from_unsupervised` parameter of `fit` (this works for `TabNetRegressor`, `TabNetClassifier` and `TabNetMultiTaskClassifier`, as only the final layer differs between those networks).

Note: this version of the code fixes the retraining bug mentioned here: https://www.kaggle.com/c/lish-moa/discussion/196830
Note 2: you won't be able to reuse previously saved pytorch-tabnet models with this version, because the body of the network has been refactored to keep the code clean with this addition.

## Where does the code come from?
The pytorchtabnetpretraining dataset used here is just a copy paste of the code from this Pull Request : https://github.com/dreamquark-ai/tabnet/pull/220

This has not been merged yet because I consider that it has not been tested or reviewed enough. You are using this at your own risk (but I'm confident that things are working as expected). I'll merge the PR in a few days after careful review. Please feel free to share your thoughts about this either in the comments here or directly in the PR: https://github.com/dreamquark-ai/tabnet/pull/220

## What is the boost on performance?

I haven't spent much time on this competition, so I don't have a strong pipeline to share.

However I tried to follow the CV scheme from Chris Deotte (https://www.kaggle.com/c/lish-moa/discussion/195195) so that people can compare this more easily.

From the very small experiment I ran with the same params and 5-fold CV (without using the control group):
- No pretraining : CV 0.01841, LB 0.01951
- With pretraining (pretraining_ratio 0.8): CV 0.01754, LB 0.01887
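
To put the two rows above in relative terms, the arithmetic is simply the difference divided by the baseline. The dictionary layout below is only for illustration; the numbers are the ones quoted above.

```python
results = {
    "CV": {"baseline": 0.01841, "pretrained": 0.01754},
    "LB": {"baseline": 0.01951, "pretrained": 0.01887},
}

gains = {}
for name, r in results.items():
    # relative reduction of the log loss compared with no pretraining
    gains[name] = (r["baseline"] - r["pretrained"]) / r["baseline"]
    print(f"{name}: {gains[name]:.1%} lower log loss")
# CV: 4.7% lower log loss
# LB: 3.3% lower log loss
```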

So it might be useful for some of you who have a stronger pipeline!
Please let me know if it helped!


## Package installation

In [None]:
!pip uninstall -y typing
!pip install ../input/pytorchtabnetpretraining/pytorch_tabnet-2.0.1-py3-none-any.whl
!pip install /kaggle/input/iterative-stratification/iterative-stratification-master/

In [None]:
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset
import torch.optim as optim
import torch.nn.functional as F
from torch.optim.lr_scheduler import ReduceLROnPlateau
from sklearn.model_selection import StratifiedKFold

from pytorch_tabnet.tab_model import TabNetRegressor
import numpy as np
import pandas as pd 

import os
import random
import sys
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
from tqdm import tqdm
from sklearn.metrics import log_loss

from iterstrat.ml_stratifiers import MultilabelStratifiedKFold, MultilabelStratifiedShuffleSplit
from sklearn.metrics import roc_auc_score


In [None]:
def seed_everything(seed_value):
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    os.environ['PYTHONHASHSEED'] = str(seed_value)
    
    if torch.cuda.is_available(): 
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        
seed_everything(42)

## Data and minimal preprocessing

In [None]:
data_path = "../input/lish-moa/"
train = pd.read_csv(data_path+'train_features.csv')
train.drop(columns=["sig_id"], inplace=True)

drug = pd.read_csv(data_path +'train_drug.csv')

train_targets_scored = pd.read_csv(data_path+'train_targets_scored.csv')
#train_targets_scored.drop(columns=["sig_id"], inplace=True)
targets = train_targets_scored.columns[1:]
test = pd.read_csv(data_path+'test_features.csv')
test.drop(columns=["sig_id"], inplace=True)

submission = pd.read_csv(data_path+'sample_submission.csv')

remove_vehicle = True

if remove_vehicle:
    kept_index = train['cp_type']=='trt_cp'
    train = train.loc[kept_index].reset_index(drop=True)
    train_targets_scored = train_targets_scored.loc[kept_index].reset_index(drop=True)

train["cp_type"] = (train["cp_type"]=="trt_cp") + 0
train["cp_dose"] = (train["cp_dose"]=="D1") + 0

test["cp_type"] = (test["cp_type"]=="trt_cp") + 0
test["cp_dose"] = (test["cp_dose"]=="D1") + 0

X_test = test.values


# Taken from Chris Deotte folds

SEED = 42

NB_SPLITS = 5

# Taken from Chris : https://www.kaggle.com/c/lish-moa/discussion/195195
scored = train_targets_scored.merge(drug, on='sig_id', how='left') 
# LOCATE DRUGS
vc = scored.drug_id.value_counts()
vc1 = vc.loc[vc<=18].index.sort_values()
vc2 = vc.loc[vc>18].index.sort_values()

# STRATIFY DRUGS 18X OR LESS
dct1 = {}; dct2 = {}
skf = MultilabelStratifiedKFold(n_splits=NB_SPLITS, shuffle=True, 
          random_state=SEED)
tmp = scored.groupby('drug_id')[targets].mean().loc[vc1]
for fold,(idxT,idxV) in enumerate( skf.split(tmp,tmp[targets])):
    dd = {k:fold for k in tmp.index[idxV].values}
    dct1.update(dd)

# STRATIFY DRUGS MORE THAN 18X
skf = MultilabelStratifiedKFold(n_splits=NB_SPLITS, shuffle=True, 
          random_state=SEED)
tmp = scored.loc[scored.drug_id.isin(vc2)].reset_index(drop=True)
for fold,(idxT,idxV) in enumerate( skf.split(tmp,tmp[targets])):
    dd = {k:fold for k in tmp.sig_id[idxV].values}
    dct2.update(dd)

# ASSIGN FOLDS
scored['fold'] = scored.drug_id.map(dct1)
scored.loc[scored.fold.isna(),'fold'] =\
    scored.loc[scored.fold.isna(),'sig_id'].map(dct2)
scored.fold = scored.fold.astype('int8')
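
One way to sanity-check a drug-aware split like the one above is to count how many folds each drug ends up in. The helper below is a hypothetical, pure-Python sketch on toy data, not part of the original pipeline. In the scheme above, only drugs seen 18 times or fewer are expected to sit in a single fold; the frequent ones are deliberately spread across folds.

```python
from collections import defaultdict

def folds_per_group(group_ids, fold_ids):
    """Return {group: number of distinct folds the group appears in}."""
    seen = defaultdict(set)
    for group, fold in zip(group_ids, fold_ids):
        seen[group].add(fold)
    return {group: len(folds) for group, folds in seen.items()}

# Toy data: drug "c" is split across two folds, "a" and "b" are not.
drug_ids = ["a", "a", "b", "c", "c", "c"]
fold_ids = [0, 0, 1, 2, 2, 0]

counts = folds_per_group(drug_ids, fold_ids)
split_drugs = sorted(g for g, n in counts.items() if n > 1)
print(split_drugs)  # ['c']
```

On the real data, the same check could be run with `scored.drug_id` and `scored.fold` as inputs.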


In [None]:
from pytorch_tabnet.metrics import Metric
from sklearn.metrics import roc_auc_score, log_loss

class LogitsLogLoss(Metric):
    """
    LogLoss with sigmoid applied
    """

    def __init__(self):
        self._name = "logits_ll"
        self._maximize = False

    def __call__(self, y_true, y_pred):
        """
        Compute LogLoss of predictions.

        Parameters
        ----------
        y_true: np.ndarray
            Target matrix or vector
        y_pred: np.ndarray
            Raw model outputs (logits), matrix or vector

        Returns
        -------
            float
            LogLoss of predictions vs targets.
        """
        # y_pred holds raw logits; the sigmoid turns them into probabilities
        probs = 1 / (1 + np.exp(-y_pred))
        aux = (1-y_true)*np.log(1-probs+1e-15) + y_true*np.log(probs+1e-15)
        return np.mean(-aux)
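
The metric above applies a sigmoid and then takes logs, which can trigger overflow warnings in `np.exp` for very negative logits. As a hedged alternative, the same quantity can be written directly on the logits with `np.logaddexp`. This is a sketch, not the metric used in this notebook: `stable_logits_log_loss` is a hypothetical name, and unlike the version above it does not clip with `1e-15`, so extreme errors are not capped.

```python
import numpy as np

def naive_logits_log_loss(y_true, y_logit):
    """Same formula as the metric above: sigmoid first, then log."""
    p = 1 / (1 + np.exp(-y_logit))
    return np.mean(-((1 - y_true) * np.log(1 - p + 1e-15)
                     + y_true * np.log(p + 1e-15)))

def stable_logits_log_loss(y_true, y_logit):
    """Binary cross entropy from logits: log(1 + exp(x)) - y * x."""
    return np.mean(np.logaddexp(0.0, y_logit) - y_true * y_logit)

rng = np.random.default_rng(0)
y_logit = rng.uniform(-5, 5, size=(100, 4))
y_true = (rng.random((100, 4)) < 0.1).astype(float)

naive = naive_logits_log_loss(y_true, y_logit)
stable = stable_logits_log_loss(y_true, y_logit)
print(naive, stable)  # the two values agree for moderate logits
```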

## Pretraining

This is where I pretrain a model.

Note that here I decided to pretrain on the test data and validate on the train data.
This means that the pretraining differs between the local version and the committed version.

I did this in the hope that the model will perform better on the private set if it has been pretrained on it.

In [None]:
from pytorch_tabnet.pretraining import TabNetPretrainer

BS=1024
MAX_EPOCH=101

N_D = 128
N_A = 32
N_INDEP = 1
N_SHARED = 1
N_STEPS = 3

tabnet_params = dict(n_d=N_D, n_a=N_A, n_steps=N_STEPS,  #0.2,
                         n_independent=N_INDEP, n_shared=N_SHARED,
                         lambda_sparse=0., optimizer_fn=torch.optim.Adam,
                         optimizer_params=dict(lr=2e-2),
                         mask_type="entmax",
                         scheduler_params=dict(mode="min",
                                               patience=5,
                                               min_lr=1e-5,
                                               factor=0.9,),
                         scheduler_fn=torch.optim.lr_scheduler.ReduceLROnPlateau,                         
                         verbose=10,
                         )

pretrainer = TabNetPretrainer(**tabnet_params)

pretrainer.fit(X_train=test.values[:,1:], #  np.vstack([train.values[:,1:], test.values[:,1:]])
          eval_set=[train.values[:,1:]],
          max_epochs=MAX_EPOCH,
          patience=20, batch_size=BS, virtual_batch_size=128, #128,
          num_workers=0, drop_last=True,
          pretraining_ratio=0.8)

## Cross validation starting from pretrained weights

In [None]:
scores_auc_all= []
test_cv_preds = []

oof_preds = []
oof_targets = []
scores = []
scores_auc = []
NB_FOLD = 5
for fold_nb in range(NB_FOLD):
    train_idx = scored[scored.fold!=fold_nb].index
    val_idx = scored[scored.fold==fold_nb].index

    print("FOLDS : ", fold_nb)
    if fold_nb >= NB_FOLD:
        break
    ## model
    X_train, y_train = train.values[train_idx, 1:], train_targets_scored.values[train_idx, 1:].astype(float) #[:,simple_tasks].astype(float)
    X_val, y_val = train.values[val_idx, 1:], train_targets_scored.values[val_idx, 1:].astype(float) # [:,simple_tasks].astype(float)
    MAX_EPOCH=51
    BS=1024

    tabnet_params = dict(n_d=N_D, n_a=N_A, n_steps=N_STEPS,
                         n_independent=N_INDEP, n_shared=N_SHARED,
                         gamma=1.0,
                         lambda_sparse=0., optimizer_fn=torch.optim.Adam, # 
                         optimizer_params=dict(lr=2e-2, # 2e-2
                                               weight_decay=1e-5
                                              ),
                         mask_type="entmax",
#                          scheduler_params=dict(mode="min",
#                                                patience=5,
#                                                min_lr=1e-5,
#                                                factor=0.9,),
#                          scheduler_fn=torch.optim.lr_scheduler.ReduceLROnPlateau,
                         scheduler_params=dict(max_lr=0.05,
                                              steps_per_epoch=int(X_train.shape[0] / BS),
                                              epochs=MAX_EPOCH,
                                              is_batch_level=True),
                         scheduler_fn=torch.optim.lr_scheduler.OneCycleLR,
                         verbose=10,
                         )

    model = TabNetRegressor(**tabnet_params)

    model.fit(X_train=X_train,
              y_train=y_train,
              eval_set=[(X_val, y_val)],
              eval_name = ["val"],
              eval_metric = ["logits_ll"],
              max_epochs=MAX_EPOCH,
              patience=20, batch_size=BS, virtual_batch_size=128, #128,
              num_workers=1, drop_last=True,
              from_unsupervised=pretrainer,
              # use binary cross entropy as this is not a regression problem              
              loss_fn=torch.nn.functional.binary_cross_entropy_with_logits)
    # save OOF predictions to compute the CV later
    preds_val = model.predict(X_val)
    # Apply sigmoid to the predictions
    preds =  1 / (1 + np.exp(-preds_val))
    score = np.min(model.history["val_logits_ll"])
    scores.append(score)
    oof_preds.append(preds)
    oof_targets.append(y_val)

#     name = cfg.save_name + f"_fold{fold_nb}"
#     model.save_model(name)    

    # preds on test
    preds_test = model.predict(X_test[:,1:])
    test_cv_preds.append(1 / (1 + np.exp(-preds_test)))

oof_preds_all = np.concatenate(oof_preds)
oof_targets_all = np.concatenate(oof_targets)
test_preds_all = np.stack(test_cv_preds)
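
The shapes involved in the fold blending can be checked on toy data. The sketch below assumes, as in the loop above, that each fold produces logits of shape `(n_rows, n_targets)`, that the sigmoid is applied per fold, and that the folds are then averaged; all sizes and names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_folds, n_rows, n_targets = 5, 4, 3
rng = np.random.default_rng(0)

# One logit matrix per fold, standing in for model.predict(...) outputs.
fold_logits = [rng.normal(size=(n_rows, n_targets)) for _ in range(n_folds)]

fold_probs = np.stack([sigmoid(l) for l in fold_logits])  # (folds, rows, targets)
blended = fold_probs.mean(axis=0)                         # (rows, targets)

# Averaging probabilities is not the same as averaging logits first.
blended_from_logits = sigmoid(np.mean(fold_logits, axis=0))
print(fold_probs.shape, blended.shape)
```

Both blending choices are common; the notebook averages probabilities.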

In [None]:
aucs = []
for task_id in range(oof_preds_all.shape[1]):
    aucs.append(roc_auc_score(y_true=oof_targets_all[:, task_id],
                              y_score=oof_preds_all[:, task_id]))
print(f"Overall AUC : {np.mean(aucs)}")
print(f"Average CV : {np.mean(scores)}")

In [None]:
all_feat = [col for col in submission.columns if col not in ["sig_id"]]
submission[all_feat] = test_preds_all.mean(axis=0)
# set control to 0
submission.loc[test['cp_type']==0, submission.columns[1:]] = 0
submission.to_csv('submission.csv', index=False)