Please see Optimo's original notebook here: https://www.kaggle.com/optimo/tabnetmultitaskclassifier

The only changes I've made are turning off internet access and loading Optimo's branch from a dataset.

# About this notebook

TabNetMultiTaskClassifier is still under development; I'm sharing it now in a co-construction approach.

Please share your feedback as comments here or directly in the corresponding PR on the GitHub repo: https://github.com/dreamquark-ai/tabnet/pull/184

Once it has been carefully reviewed, I'll release it in the official repo.

If you have questions about the model, please first have a look at the README here: https://github.com/dreamquark-ai/tabnet/blob/feat/MultiTaskClassification/README.md

For more in-depth explanations, you may also want to watch this video: https://youtu.be/ysBaZO8YmX8

# About TabNetMultiTaskClassifier

TabNetMultiTaskClassifier is a new class in pytorch-tabnet that lets you easily handle multi-task classification problems. (Note: for multi-task regression problems you can use TabNetRegressor.)

Some of the available features are:

- any number of tasks is allowed
- each task can have any number of labels
- you can pass a different loss function for each task by giving a corresponding list of loss functions during fit
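
To make the target layout concrete, here is a minimal sketch, assuming (as in the rest of this notebook) that targets are passed as one 2D array with one column per task. The toy array `y_toy` and the helper `classes_per_task` are made up for illustration and are not part of pytorch-tabnet.

```python
import numpy as np

# Toy targets: 5 samples, 3 tasks (one column per task).
# Task 0 is binary, task 1 has 3 labels, task 2 is binary.
y_toy = np.array([
    [0, 2, 1],
    [1, 0, 0],
    [0, 1, 1],
    [1, 2, 0],
    [0, 0, 0],
])

def classes_per_task(y):
    """Return the number of distinct labels observed in each task column."""
    return [len(np.unique(y[:, task_idx])) for task_idx in range(y.shape[1])]

n_classes = classes_per_task(y_toy)
print(n_classes)
```

Each task can therefore have its own number of labels, which is what the list above describes.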

# About the competition

I am not going to share a full running submission, as I have not started the competition yet and I'd like people to try a solution of their own.

This should be enough for people to start playing around; feel free to fork this in order to get a score.
If you end up with a good scoring kernel after a few tweaks, please share it publicly: we are still very early in the competition, and TabNet is easy to use and open source.

Disclaimer: I don't actually know how competitive this simple notebook will be, but this is just a baseline to improve upon!


# About pytorch-tabnet

Pytorch-tabnet is an open source project, feel free to contribute!


**Good luck to all, have fun!**

# Installing pytorch-tabnet from the MultiTask branch (not official branch)

In [None]:
!pip uninstall -y typing # this should avoid  AttributeError: type object 'Callable' has no attribute '_abc_registry'

In [None]:
import sys
sys.path.insert(0, "../input/tabnetfeatmultitaskclassification/tabnet-feat-MultiTaskClassification")

# Import libraries

In [None]:
from pytorch_tabnet.multitask import TabNetMultiTaskClassifier

import torch
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score, log_loss

import pandas as pd
import numpy as np
np.random.seed(0)

from tqdm.notebook import tqdm

import os

from matplotlib import pyplot as plt
%matplotlib inline

# Load data

In [None]:
dataset_name = "lish-moa"
train = pd.read_csv("../input/lish-moa/train_features.csv")
train_targets = pd.read_csv('../input/lish-moa/train_targets_scored.csv')
train_targets.drop(columns=["sig_id"], inplace=True)

test = pd.read_csv('../input/lish-moa/test_features.csv')

# Random split (do something smarter if you want)

In [None]:
np.random.seed(42)
if "Set" not in train.columns:
    train["Set"] = np.random.choice(["train", "valid"], p =[.8, .2], size=(train.shape[0],))

train_indices = train[train.Set=="train"].index
valid_indices = train[train.Set=="valid"].index
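
One possible "smarter" alternative to a single random split is k-fold cross-validation. The sketch below uses only NumPy and does not stratify on the targets; `kfold_indices` is a hypothetical helper written for this illustration, not something the original notebook uses.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, valid_idx) pairs; every row is used for validation exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_splits)
    for k in range(n_splits):
        valid_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        yield train_idx, valid_idx

# Toy usage on 10 rows; in this notebook n_samples would be train.shape[0]
for fold, (train_idx, valid_idx) in enumerate(kfold_indices(n_samples=10, n_splits=5)):
    print(fold, len(train_idx), len(valid_idx))
```

Training one model per fold and averaging the predictions is a common way to use such splits, at the cost of a longer run time.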

# Simple preprocessing


Copied from my example notebooks; I don't know whether this dataset has missing values.

Label encode categorical features and fill empty cells.


Do any smarter preprocessing if you want.
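
As one example of more careful preprocessing, numeric gaps can be filled column by column, using a mean computed on the training rows only so that validation rows do not leak into the statistic. This is a minimal sketch on a toy frame; `toy`, `train_rows` and `fill_numeric_with_train_mean` are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd

def fill_numeric_with_train_mean(df, train_rows, columns):
    """Fill NaNs in each listed column with that column's mean over train_rows."""
    df = df.copy()  # leave the caller's frame untouched
    for col in columns:
        col_mean = df.loc[train_rows, col].mean()  # NaNs are skipped by default
        df[col] = df[col].fillna(col_mean)
    return df

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 10.0], "b": [np.nan, 2.0, 4.0, np.nan]})
train_rows = [0, 1, 2]  # the last row plays the role of a validation row
filled = fill_numeric_with_train_mean(toy, train_rows, ["a", "b"])
print(filled)
```

Filling per column matters: calling `fillna` on the whole frame with a single column's mean would write that value into every column that has gaps.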

In [None]:
# Encoding train set and test set

nunique = train.nunique()
types = train.dtypes

categorical_columns = []
categorical_dims =  {}
for col in tqdm(train.columns):
    if types[col] == 'object' or nunique[col] < 200:
        print(col, train[col].nunique())
        l_enc = LabelEncoder()
        train[col] = train[col].fillna("VV_likely")
        train[col] = l_enc.fit_transform(train[col].values)
        try:
            test[col] = test[col].fillna("VV_likely")
            test[col] = l_enc.transform(test[col].values)
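        except (KeyError, ValueError):
            # KeyError: column missing from test; ValueError: labels not seen during fit
            print(f"Column {col} is missing from the test set or has unseen labels")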
        categorical_columns.append(col)
        categorical_dims[col] = len(l_enc.classes_)
    else:
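        # Fill only this column, using the mean computed on the training rows
        training_mean = train.loc[train_indices, col].mean()
        train[col] = train[col].fillna(training_mean)
        if col in test.columns:
            test[col] = test[col].fillna(training_mean)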

# Define categorical features for categorical embeddings

In [None]:
unused_feat = ['Set', 'sig_id'] # Let's not use splitting sets and sig_id

features = [ col for col in train.columns if col not in unused_feat] 

cat_idxs = [ i for i, f in enumerate(features) if f in categorical_columns]

cat_dims = [ categorical_dims[f] for i, f in enumerate(features) if f in categorical_columns]
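
To make `cat_idxs` and `cat_dims` concrete, here is the same construction on a made-up feature list: `cat_idxs` holds the positions of the categorical columns inside `features`, and `cat_dims` holds the number of distinct encoded values of each, in the same order. All names and values below are toy stand-ins.

```python
# Toy stand-ins for the notebook's variables (names and values are made up)
features = ["cat_a", "num_0", "cat_b", "num_1"]
categorical_columns = ["cat_a", "cat_b"]
categorical_dims = {"cat_a": 3, "cat_b": 5}

# Positions of categorical features within the feature list
cat_idxs = [i for i, f in enumerate(features) if f in categorical_columns]
# Cardinality of each categorical feature, aligned with cat_idxs
cat_dims = [categorical_dims[f] for f in features if f in categorical_columns]

print(cat_idxs, cat_dims)
```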


# Creating train/valid/test matrices

In [None]:

X_train = train[features].values[train_indices]
y_train = train_targets.values[train_indices]

X_valid = train[features].values[valid_indices]
y_valid = train_targets.values[valid_indices]

X_test = test[features].values


# Define network and parameters

### This is a set of basic parameters, happy tuning!
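
For intuition about the scheduler settings used in this notebook: a step schedule multiplies the learning rate by `gamma` once every `step_size` epochs. Below is a small sketch of that arithmetic in plain Python, using this notebook's values (`lr=2e-2`, `step_size=50`, `gamma=0.9`); `step_lr` is a hypothetical helper, not a PyTorch function.

```python
def step_lr(initial_lr, epoch, step_size=50, gamma=0.9):
    """Learning rate in effect at `epoch` under a step decay schedule."""
    return initial_lr * gamma ** (epoch // step_size)

for epoch in (0, 49, 50, 100, 500):
    print(epoch, step_lr(2e-2, epoch))
```

With these values the decay is gentle: the rate only drops by 10% every 50 epochs, so early stopping may well trigger before the schedule has much effect.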

In [None]:

clf = TabNetMultiTaskClassifier(cat_idxs=cat_idxs,
                                cat_dims=cat_dims,
                                cat_emb_dim=1,
                                optimizer_fn=torch.optim.Adam,
                                optimizer_params=dict(lr=2e-2),
                                scheduler_params={"step_size":50, # how to use learning rate scheduler
                                                  "gamma":0.9},
                                scheduler_fn=torch.optim.lr_scheduler.StepLR,
                                mask_type='entmax', # "sparsemax",
                                lambda_sparse=0, # don't penalize for sparser attention
                       
                      )

# Training

In [None]:
max_epochs = 1000
clf.fit(
    X_train=X_train, y_train=y_train,
    X_valid=X_valid, y_valid=y_valid,
    max_epochs=max_epochs ,
    patience=50, # please be patient ^^
    batch_size=1024,
    virtual_batch_size=128,
    num_workers=1,
    drop_last=False,
)

# The scores displayed here are the negative of the average log loss

# TabNet is not as fast as XGBoost (at least for binary classification and regression problems)
# If you wish to speed things up you could play with batch_size, virtual_batch_size and num_workers (or create a smaller network with fewer steps)
# Another way to speed things up is to improve the source code: please contribute here https://github.com/dreamquark-ai/tabnet/issues/183
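
On `batch_size` versus `virtual_batch_size`: TabNet applies batch normalization on smaller chunks of each batch (often called ghost batch normalization). The NumPy sketch below only illustrates the chunking idea, with no learned scale or shift and no running statistics; `ghost_batch_norm` is a hypothetical helper, not the library's implementation.

```python
import numpy as np

def ghost_batch_norm(batch, virtual_batch_size, eps=1e-5):
    """Normalize each chunk of about `virtual_batch_size` rows with its own mean and variance."""
    n_chunks = int(np.ceil(batch.shape[0] / virtual_batch_size))
    chunks = np.array_split(batch, n_chunks, axis=0)
    normalized = [(c - c.mean(axis=0)) / np.sqrt(c.var(axis=0) + eps) for c in chunks]
    return np.concatenate(normalized, axis=0)

rng = np.random.default_rng(0)
batch = rng.normal(loc=3.0, scale=2.0, size=(1024, 4))  # same batch_size as above
out = ghost_batch_norm(batch, virtual_batch_size=128)   # 8 chunks of 128 rows
print(out.shape, out[:128].mean(axis=0).round(6))
```

With `batch_size=1024` and `virtual_batch_size=128`, each batch is normalized as 8 independent chunks, which is why the two parameters are usually tuned together.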

In [None]:
# plot losses (drop first epochs to have a nice plot)
plt.plot(clf.history['train']['loss'][5:])
plt.plot(clf.history['valid']['loss'][5:])

In [None]:
# plot learning rates
plt.plot(clf.history['train']['lr'][5:])

# Validation

I don't know if people have been looking at these AUC plots, but some tasks are harder than others!

In [None]:
preds_valid = clf.predict_proba(X_valid) # This is a list of results for each task

# Drop tasks that have no positive sample in the validation set (AUC is undefined for them)
valid_aucs = [roc_auc_score(y_score=task_pred[:,1], y_true=y_valid[:, task_idx])
             for task_idx, (task_pred, n_pos) in enumerate(zip(preds_valid, y_valid.sum(axis=0))) if n_pos > 0]

valid_logloss = [log_loss(y_pred=task_pred[:,1], y_true=y_valid[:, task_idx])
             for task_idx, (task_pred, n_pos) in enumerate(zip(preds_valid, y_valid.sum(axis=0))) if n_pos > 0]

plt.scatter(y_valid.sum(axis=0)[y_valid.sum(axis=0)>0], valid_aucs)

In [None]:
# The best validation score should be close to the mean log loss; they don't match exactly because some tasks were removed above
print(f"BEST VALID SCORE FOR {dataset_name} : {clf.best_cost}")
print(f"VALIDATION MEAN LOGLOSS SCORES FOR {dataset_name} : {np.mean(valid_logloss)}")
print(f"VALIDATION MEAN AUC SCORES FOR {dataset_name} : {np.mean(valid_aucs)}")
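
The gap between the best validation score and the recomputed mean log loss can be reproduced on toy data. The sketch below assumes the training-time score averages the binary log loss over all tasks, while the recomputed value skips tasks without positives; `binary_log_loss` and `mean_log_loss` are hypothetical helpers, and the arrays are made up.

```python
import numpy as np

def binary_log_loss(y_true, p_pos, eps=1e-15):
    """Mean binary cross-entropy for one task; probabilities are clipped to avoid log(0)."""
    p = np.clip(p_pos, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def mean_log_loss(y_true, p_pos, skip_empty_tasks=False):
    """Average the per-task log loss over the columns of (n_samples, n_tasks) arrays."""
    losses = []
    for task_idx in range(y_true.shape[1]):
        if skip_empty_tasks and y_true[:, task_idx].sum() == 0:
            continue  # no positive sample for this task
        losses.append(binary_log_loss(y_true[:, task_idx], p_pos[:, task_idx]))
    return float(np.mean(losses))

# Task 0 has positives, task 1 has none
y_demo = np.array([[1, 0], [0, 0], [1, 0], [0, 0]])
p_demo = np.array([[0.9, 0.1], [0.2, 0.1], [0.8, 0.1], [0.1, 0.1]])

all_tasks = mean_log_loss(y_demo, p_demo)
kept_tasks = mean_log_loss(y_demo, p_demo, skip_empty_tasks=True)
print(all_tasks, kept_tasks)
```

Tasks with no positives tend to be easy to predict, so dropping them usually makes the recomputed mean look slightly worse than the score averaged over every task.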

# Predictions

In [None]:
preds = clf.predict_proba(X_test)

# Save and load Model

Just an example of how to save and load models in order to use them later

In [None]:
# save tabnet model
saving_path_name = "./TabNetMultiTaskClassifier_baseline"
saved_filepath = clf.save_model(saving_path_name)

In [None]:
# define new model with basic parameters and load state dict weights (all parameters will be updated)
loaded_clf = TabNetMultiTaskClassifier()
loaded_clf.load_model(saved_filepath)

In [None]:
loaded_preds = loaded_clf.predict_proba(X_test)

# Make sure that this is working as expected
np.testing.assert_array_equal(preds, loaded_preds)

# Global explainability : feat importance summing to 1

In [None]:
clf.feature_importances_

# Local explainability and masks for test set

The explain matrix is not normalized, so its rows don't sum to 1; feel free to normalize them yourself.

You can see that attention is quite sparse. With this many columns, this visualization is not ideal.
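
If you do want rows that sum to 1, one simple option is to divide each row by its sum. This is a minimal sketch, assuming the matrix holds non-negative importances; `normalize_rows` and `toy_explain` are hypothetical names, and all-zero rows are left untouched to avoid dividing by zero.

```python
import numpy as np

def normalize_rows(matrix):
    """Scale each row to sum to 1; all-zero rows are left as zeros."""
    row_sums = matrix.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums == 0, 1.0, row_sums)
    return matrix / safe_sums

toy_explain = np.array([[0.0, 2.0, 6.0], [0.0, 0.0, 0.0], [1.0, 1.0, 2.0]])
print(normalize_rows(toy_explain))
```

The same helper could be applied to the explain matrix returned by `clf.explain` before plotting or comparing samples.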

In [None]:
explain_matrix, masks = clf.explain(X_test)

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(20,20))

for i in range(3):
    axs[i].imshow(masks[i][:500])
    axs[i].set_title(f"mask {i}")


# End of notebook

Hope this will be useful!