# In this Notebook, you will find the following

- Self-supervised TabNet implementation examples
- Examples of TabNet feature importance, masks, and outputs
- Accuracy comparison of TabNet with Pretrain, TabNet without Pretrain, LightGBM, and NN
- Feature importance comparison of TabNet with Pretrain, TabNet without Pretrain, and LightGBM
- Output distribution comparison of TabNet with Pretrain, TabNet without Pretrain, LightGBM, and NN

# Summary of the Results

### Comparison of OOF ROC-AUC, OOF Accuracy, and LB Accuracy

| Metric | TabNet with Pretrain | TabNet without Pretrain | LightGBM | NN |
| ------------ | ------------- | ------------- | ------------- | ------------- |
| OOF ROC-AUC | 0.8620 | 0.8354 | 0.8750 | 0.8643 |
| OOF Accuracy | 0.8114 | 0.7564 | 0.8260 | 0.8249 |
| LB Accuracy | 0.7775 | 0.7560 | 0.7608 | 0.7656 |
| Time (s) | 34.6 | 37.3 | 0.24 | 6.86 |

- Both TabNet variants have lower OOF accuracy than LightGBM and NN.
- TabNet with Pretrain has the highest LB accuracy (0.7775); TabNet without Pretrain (0.7560) is slightly below LightGBM (0.7608) and NN (0.7656).
- TabNet has a smaller gap between OOF and LB accuracy than LightGBM and NN.
- According to the table, TabNet with Pretrain has higher OOF accuracy and higher LB accuracy than TabNet without Pretrain.
- On CPU, TabNet is far slower: about 35 s, versus 0.24 s for LightGBM and 6.86 s for NN.
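
The OOF-to-LB gap can be read directly off the table. The following is a minimal sketch that uses only the accuracy figures reported above; the dictionary and variable names are illustrative and are not part of the notebook.

```python
# OOF and LB accuracy copied from the results table.
scores = {
    'TabNet with Pretrain':    {'oof': 0.8114, 'lb': 0.7775},
    'TabNet without Pretrain': {'oof': 0.7564, 'lb': 0.7560},
    'LightGBM':                {'oof': 0.8260, 'lb': 0.7608},
    'NN':                      {'oof': 0.8249, 'lb': 0.7656},
}

# Gap = OOF accuracy minus LB accuracy, rounded to 4 decimals.
gaps = {name: round(s['oof'] - s['lb'], 4) for name, s in scores.items()}

# Print the models from smallest to largest gap.
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f'{name:<24} gap = {gap:.4f}')
```

Both TabNet variants show a smaller gap (0.0339 and 0.0004) than NN (0.0593) and LightGBM (0.0652).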

### Comparison of Feature Importance

![Feature Importance for each Model](https://storage.googleapis.com/zenn-user-upload/ezsqiabborqn8jmil0odvrevipbc)

- TabNet (with or without Pretrain) spreads importance more evenly across features than LightGBM.

- Fare ranks first in both TabNet with Pretrain and LightGBM. Apart from that, the feature rankings differ considerably between TabNet with Pretrain and LightGBM.
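
Importance values from different libraries are not necessarily on the same scale, so it can help to normalize each model's importances to sum to 1 before comparing how evenly they are spread. The following is a minimal sketch under that assumption; the numbers are made-up placeholders, not the notebook's results, and `normalize` and `spread` are hypothetical helper names.

```python
def normalize(importances):
    """Scale a {feature: importance} mapping so the values sum to 1."""
    total = sum(importances.values())
    return {name: value / total for name, value in importances.items()}


def spread(importances):
    """Largest minus smallest normalized importance (smaller = more even)."""
    values = list(normalize(importances).values())
    return max(values) - min(values)


# Placeholder values for illustration only.
model_a = {'Fare': 0.30, 'Sex': 0.25, 'Age': 0.25, 'Pclass': 0.20}  # fairly even
model_b = {'Fare': 600, 'Age': 300, 'Sex': 60, 'Pclass': 40}         # raw counts, uneven

print(f'model_a spread: {spread(model_a):.2f}')
print(f'model_b spread: {spread(model_b):.2f}')
```

A smaller spread means the model distributes importance more evenly, which is the pattern described for TabNet above.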

### TabNet's Mask


![Feature Importance for each Model](https://storage.googleapis.com/zenn-user-upload/k7wzkbishovexicgah2hex4mwoig)

- A mask is created at each decision step, where the model decides which features to use. The number of decision steps is set by `n_steps`; here `n_steps=3`, so there are three masks.

- The horizontal axis shows the features (0='Pclass', 1='Sex', 2='Age', 3='SibSp', 4='Parch', 5='Fare', 6='Embarked'), and the vertical axis shows the rows of the test data (the first 25 rows).

- At a glance, 5='Fare' is strongly colored in mask 2. This seems reasonable, given that Fare ranked first in feature importance.
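
Reading the masks by eye can be complemented by averaging each mask over rows and reporting the feature with the highest mean value. The following is a minimal sketch, assuming `masks` is a mapping from step index to a 2-D array of shape (rows, features), which is how `masks[i]` is indexed in the plotting code. The toy masks and the helpers `top_feature_per_mask` and `constant_mask` are hypothetical and only stand in for real model output.

```python
import numpy as np

FEATURES = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']


def top_feature_per_mask(masks, feature_names):
    """Map each step to (most-attended feature, its mean mask value over rows)."""
    summary = {}
    for step, mask in masks.items():
        mean_attention = np.asarray(mask, dtype=float).mean(axis=0)
        best = int(mean_attention.argmax())
        summary[step] = (feature_names[best], float(mean_attention[best]))
    return summary


def constant_mask(weights, n_rows=25):
    """Toy mask in which every row uses the same per-feature weights."""
    return np.tile(np.asarray(weights, dtype=float), (n_rows, 1))


# Synthetic stand-in for real masks (illustration only).
toy_masks = {
    0: constant_mask([0.5, 0.3, 0.2, 0.0, 0.0, 0.0, 0.0]),
    1: constant_mask([0.0, 0.6, 0.0, 0.0, 0.1, 0.3, 0.0]),
    2: constant_mask([0.0, 0.2, 0.0, 0.0, 0.0, 0.8, 0.0]),
}

summary = top_feature_per_mask(toy_masks, FEATURES)
for step, (name, value) in summary.items():
    print(f'mask {step}: {name} ({value:.2f})')
```

With the notebook's own objects, the call would presumably be `top_feature_per_mask(masks, feature_col)`, using the `masks` returned by `model.explain` in the Implementation section.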

### Comparison of Prediction Distributions

<img src="https://storage.googleapis.com/zenn-user-upload/mts8q51v792anu3e0cwkbsq0dob6" width=75%>

- The output distribution of TabNet has a shape similar to that of NN.

### Summary

- TabNet is highly interpretable thanks to its masks and feature importance.

- On the Titanic data, TabNet's accuracy was comparable to that of LightGBM and NN. Furthermore, because the gap between OOF and LB scores is small, it may be less prone to overfitting.

- The top features in TabNet's feature importance differed from those of LightGBM, so TabNet can be expected to be useful in an ensemble.
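
One simple way to build such an ensemble is to average the predicted probabilities of two models and then apply the same 0.5 threshold used for the submission. The following is only a sketch of that idea. The notebook does not run an ensemble, the probabilities are placeholders, and `blend` is a hypothetical helper.

```python
def blend(preds_a, preds_b, weight_a=0.5):
    """Weighted average of two equal-length lists of predicted probabilities."""
    if len(preds_a) != len(preds_b):
        raise ValueError('prediction lists must have the same length')
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(preds_a, preds_b)]


# Placeholder probabilities for three passengers (illustration only).
tabnet_preds = [0.20, 0.80, 0.55]
lgbm_preds = [0.60, 0.40, 0.35]

blended = blend(tabnet_preds, lgbm_preds)
labels = [int(p > 0.5) for p in blended]
print(labels)
```

Whether blending actually helps on this data would need to be checked against OOF and LB scores.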

### Original Article (Japanese)
https://zenn.dev/sinchir0/articles/9228eccebfbf579bfdf4

# Implementation

# Ref

https://www.kaggle.com/optimo/selfsupervisedtabnet/data?select=pytorch_tabnet-2.0.1-py3-none-any.whl

https://github.com/dreamquark-ai/tabnet/blob/develop/census_example.ipynb

# Install

In [None]:
!pip install ../input/tabnet/pytorch_tabnet-2.0.1-py3-none-any.whl

# Setting

In [None]:
N_FOLDS = 5
SEED = 33

# Import

In [None]:
import os, random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset
import torch.optim as optim
import torch.nn.functional as F
from torch.optim.lr_scheduler import ReduceLROnPlateau

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, accuracy_score

from pytorch_tabnet.pretraining import TabNetPretrainer
from pytorch_tabnet.tab_model import TabNetRegressor
from pytorch_tabnet.tab_model import TabNetClassifier

import lightgbm as lgb

In [None]:
train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')
sub = pd.read_csv('../input/titanic/gender_submission.csv')

# Fix seed

In [None]:
def seed_everything(seed_value):
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    os.environ['PYTHONHASHSEED'] = str(seed_value)
    
    if torch.cuda.is_available(): 
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        
seed_everything(SEED)

# Preprocess

In [None]:
data = pd.concat([train, test], sort=False)

data['Sex'] = data['Sex'].replace(['male','female'], [0, 1])
data['Embarked'] = data['Embarked'].fillna(('S'))
data['Embarked'] = data['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
data['Fare'] = data['Fare'].fillna(np.mean(data['Fare']))
data['Age'] = data['Age'].fillna(data['Age'].median())

In [None]:
del_col = ['Name', 'PassengerId','Ticket', 'Cabin']
data.drop(del_col, axis=1, inplace=True)

In [None]:
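# Copy the slices so later column assignments (fold, category dtype) do not
# write into a view of `data` and trigger SettingWithCopyWarning
n_train = len(train)
train = data[:n_train].copy()
test = data[n_train:].copy()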

# Feature cols

In [None]:
feature_col = [col for col in train.columns.tolist() if col != 'Survived']

# Define categorical features for categorical embeddings

In [None]:
categorical_columns = ['Pclass','Sex','SibSp','Parch','Embarked']

# class num
categorical_dims = {}

for col in categorical_columns:
    categorical_dims[col] = train[col].nunique()

In [None]:
cat_idxs = [ i for i, f in enumerate(feature_col) if f in categorical_columns]

cat_dims = [ categorical_dims[f] for i, f in enumerate(feature_col) if f in categorical_columns]
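
To make the two lists concrete, here is a small sketch on a toy frame showing what `cat_idxs` (column positions of the categorical features) and `cat_dims` (number of distinct values per categorical feature) look like. Embedding layers generally expect integer codes in the range `0 .. dim - 1`, so the sketch label-encodes each categorical column with `pd.factorize` first. That encoding step is an assumption added for illustration and is not part of the notebook's preprocessing; the `toy_*` names are hypothetical.

```python
import pandas as pd

# Toy data standing in for the Titanic features (illustration only).
toy = pd.DataFrame({
    'Pclass': [3, 1, 3, 2],
    'Sex': [0, 1, 1, 0],
    'Age': [22.0, 38.0, 26.0, 35.0],
    'Embarked': [0, 1, 0, 2],
})
toy_features = list(toy.columns)
toy_categorical = ['Pclass', 'Sex', 'Embarked']

# Label-encode so every categorical column holds codes 0 .. n_unique - 1.
for col in toy_categorical:
    toy[col] = pd.factorize(toy[col], sort=True)[0]

# Positions of the categorical columns within the feature list.
toy_cat_idxs = [i for i, f in enumerate(toy_features) if f in toy_categorical]
# Number of distinct categories, in the same order as toy_cat_idxs.
toy_cat_dims = [toy[f].nunique() for f in toy_features if f in toy_categorical]

print(toy_cat_idxs)
print(toy_cat_dims)
```

Here `Age` is skipped because it is not categorical, so the index list jumps from 1 to 3.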

# TabNet Pretraining

In [None]:
# comment out when Unsupervised

tabnet_params = dict(n_d=8, n_a=8, n_steps=3, gamma=1.3,
                     n_independent=2, n_shared=2,
                     seed=SEED, lambda_sparse=1e-3, 
                     optimizer_fn=torch.optim.Adam, 
                     optimizer_params=dict(lr=2e-2),
                     mask_type="entmax",
                     scheduler_params=dict(mode="min",
                                           patience=5,
                                           min_lr=1e-5,
                                           factor=0.9,),
                     scheduler_fn=torch.optim.lr_scheduler.ReduceLROnPlateau,
                     verbose=10
                    )

pretrainer = TabNetPretrainer(**tabnet_params)

pretrainer.fit(
    X_train=train.drop('Survived',axis=1).values,
    eval_set=[train.drop('Survived',axis=1).values],
    max_epochs=200,
    patience=20, batch_size=256, virtual_batch_size=128,
    num_workers=1, drop_last=True)

# Make Folds

In [None]:
skf = StratifiedKFold(n_splits=N_FOLDS)
for f, (t_idx, v_idx) in enumerate(skf.split(train.drop('Survived',axis=1), train['Survived'])):
    train.loc[v_idx, 'fold'] = int(f)

# TabNet Training

In [None]:
%%time
oof = np.zeros((len(train),))
test_preds_all = np.zeros((len(test),))
models = []

for fold_num in range(N_FOLDS):
    train_idx = train[train.fold != fold_num].index
    valid_idx = train[train.fold == fold_num].index

    print("FOLDS : ", fold_num)

    ## model
    X_train, y_train = train[feature_col].values[train_idx,], train['Survived'].values[train_idx,].astype(float)
    X_valid, y_valid = train[feature_col].values[valid_idx,], train['Survived'].values[valid_idx,].astype(float)
    
    tabnet_params = dict(n_d=8, n_a=8, n_steps=3, gamma=1.3,
                         n_independent=2, n_shared=2,
                         seed=SEED, lambda_sparse=1e-3,
                         optimizer_fn=torch.optim.Adam,
                         optimizer_params=dict(lr=2e-2,
                                               weight_decay=1e-5
                                              ),
                         mask_type="entmax",
                         scheduler_params=dict(max_lr=0.05,
                                               # round up: with drop_last=False the last partial batch also steps the scheduler
                                               steps_per_epoch=int(np.ceil(X_train.shape[0] / 256)),
                                               epochs=200,
                                               is_batch_level=True
                                              ),
                         scheduler_fn=torch.optim.lr_scheduler.OneCycleLR,
                         verbose=10,
                         cat_idxs=cat_idxs, # comment out when Unsupervised
                         cat_dims=cat_dims, # comment out when Unsupervised
                         cat_emb_dim=1 # comment out when Unsupervised
                        )

    model = TabNetClassifier(**tabnet_params)

    model.fit(X_train=X_train,
              y_train=y_train,
              eval_set=[(X_valid, y_valid)],
              eval_name = ["valid"],
              eval_metric = ["auc"],
              max_epochs=200,
              patience=20, batch_size=256, virtual_batch_size=128,
              num_workers=0, drop_last=False,
              from_unsupervised=pretrainer # comment out when Unsupervised
             )
    
    # Make Oof
    oof[valid_idx] = model.predict_proba(X_valid)[:,1]
    # Model
    models.append(model)

    # for save weight
    # name = f"fold{fold_num}"
    # model.save_model(name)    

    # preds on test
    preds_test = model.predict_proba(test[feature_col].values)[:,1]
    test_preds_all += preds_test / N_FOLDS

In [None]:
print('ROC-AUC')
print(roc_auc_score(train['Survived'].ravel(), oof.ravel()))
print('Accuracy')
print(accuracy_score(train['Survived'].ravel(), (oof > 0.5).astype('int').ravel()))

# Global Explainability : Feature Importance

In [None]:
for fold_num, model in enumerate(models):
    # Feature Importance
    feat_imp_fold = pd.DataFrame(model.feature_importances_,index=feature_col, columns= [f'imp_{fold_num}'])
    if fold_num == 0:
        feature_importance = feat_imp_fold.copy()
    else:
        feature_importance = pd.concat([feature_importance, feat_imp_fold], axis=1)
        
feature_importance['imp_mean'] = feature_importance.mean(axis=1)
feature_importance = feature_importance.sort_values('imp_mean')

plt.tick_params(labelsize=18)
plt.barh(feature_importance.index.values,feature_importance['imp_mean']);
plt.title('feature_importance',fontsize=18);

# Local explainability and masks

In [None]:
explain_matrix, masks = model.explain(test[feature_col].values)

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(10,7))

for i in range(3):
    axs[i].imshow(masks[i][:25])
    axs[i].set_title(f"mask {i}")

In [None]:
# 0 = 'Pclass', 1 = 'Sex', 2 = 'Age',3 =  'SibSp', 4 = 'Parch', 5 = 'Fare', 6 = 'Embarked'

In [None]:
plt.hist(test_preds_all);
plt.xlim(0.0,1.0)
plt.title('Output proba hist');

In [None]:
sub['Survived'] = (test_preds_all > 0.5).astype('int')
sub.to_csv("submission.csv", index=False)

# LightGBM (For comparison)

In [None]:
# change to category
for col in categorical_columns:
    train[col] = train[col].astype('category')
    test[col] = test[col].astype('category')

In [None]:
%%time
oof = np.zeros((len(train),))
test_preds_all = np.zeros((len(test),))
models = []

for fold_num in range(N_FOLDS):
    train_idx = train[train.fold != fold_num].index
    valid_idx = train[train.fold == fold_num].index

    print("FOLDS : ", fold_num)

    ## model
    X_train, y_train = train[feature_col].values[train_idx,], train['Survived'].values[train_idx,].astype(float)
    X_valid, y_valid = train[feature_col].values[valid_idx,], train['Survived'].values[valid_idx,].astype(float)
    
    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_eval = lgb.Dataset(X_valid, y_valid, reference=lgb_train)

    params = {
        'objective': 'binary'
    }

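    model = lgb.train(
        params, lgb_train,
        valid_sets=[lgb_train, lgb_eval],
        num_boost_round=1000,
        # Callback form of early stopping and logging (needs LightGBM >= 3.3);
        # the `early_stopping_rounds` and `verbose_eval` keyword arguments
        # are no longer accepted by `lgb.train` in LightGBM 4.x
        callbacks=[lgb.early_stopping(stopping_rounds=10), lgb.log_evaluation(period=10)]
    )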
    
    # Calc Oof
    oof[valid_idx] = model.predict(X_valid, num_iteration=model.best_iteration)
    # Model
    models.append(model)

    # for save weight
    # name = f"fold{fold_num}"
    # model.save_model(name)    

    # preds on test
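    # Predict on the same NumPy representation used for training, so the
    # category-dtype columns are not re-encoded differently at predict time
    X_test = test[feature_col].values.astype(float)
    test_preds_all += model.predict(X_test, num_iteration=model.best_iteration) / N_FOLDS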

In [None]:
print('ROC-AUC')
print(roc_auc_score(train['Survived'].ravel(), oof.ravel()))
print('Accuracy')
print(accuracy_score(train['Survived'].ravel(), (oof > 0.5).astype('int').ravel()))

In [None]:
# Feature Importance

for fold_num, model in enumerate(models):
    # Feature Importance
    feat_imp_fold = pd.DataFrame(model.feature_importance(), index=feature_col, columns= [f'imp_{fold_num}'])
    if fold_num == 0:
        feature_importance = feat_imp_fold.copy()
    else:
        feature_importance = pd.concat([feature_importance, feat_imp_fold], axis=1)
        
feature_importance['imp_mean'] = feature_importance.mean(axis=1)
feature_importance = feature_importance.sort_values('imp_mean')

plt.tick_params(labelsize=18)
plt.barh(feature_importance.index.values,feature_importance['imp_mean']);
plt.title('feature_importance',fontsize=18);

In [None]:
plt.hist(test_preds_all);
plt.xlim(0.0,1.0)
plt.title('Output proba hist');

In [None]:
# sub['Survived'] = (test_preds_all > 0.5).astype('int')
# sub.to_csv("submission.csv", index=False)

# NN (For comparison)

In [None]:
target_cols = ['Survived']

In [None]:
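class TitanicDataset(Dataset):
    """Training/validation dataset: returns features and targets as float tensors."""
    def __init__(self, features, targets):
        # Cast to float32 up front: once the categorical columns have been
        # converted to `category` dtype, `.values` yields an object array,
        # which torch.tensor cannot convert
        self.features = np.asarray(features, dtype=np.float32)
        self.targets = targets
        
    def __len__(self):
        return self.features.shape[0]
    
    def __getitem__(self, idx):
        dct = {
            'x' : torch.tensor(self.features[idx, :], dtype=torch.float),
            'y' : torch.tensor(self.targets[idx], dtype=torch.float)            
        }
        return dct
    
class TestDataset(Dataset):
    """Test dataset: returns features only."""
    def __init__(self, features):
        # Same float32 cast as in TitanicDataset
        self.features = np.asarray(features, dtype=np.float32)
        
    def __len__(self):
        return self.features.shape[0]
    
    def __getitem__(self, idx):
        dct = {
            'x' : torch.tensor(self.features[idx, :], dtype=torch.float)
        }
        return dct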

In [None]:
def train_fn(model, optimizer, scheduler, loss_fn, dataloader, device):
    model.train()
    final_loss = 0
    
    for data in dataloader:
        optimizer.zero_grad()
        inputs, targets = data['x'].to(device), data['y'].to(device)
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
        loss.backward()
        optimizer.step()
        scheduler.step()
        
        final_loss += loss.item()
        
    final_loss /= len(dataloader)
    
    return final_loss


def valid_fn(model, loss_fn, dataloader, device):
    model.eval()
    final_loss = 0
    valid_preds = []
    
    for data in dataloader:
        inputs, targets = data['x'].to(device), data['y'].to(device)
        # Gradients are not needed during validation
        with torch.no_grad():
            outputs = model(inputs)
            loss = loss_fn(outputs, targets)
        
        final_loss += loss.item()
        valid_preds.append(outputs.sigmoid().detach().cpu().numpy())
        
    final_loss /= len(dataloader)
    valid_preds = np.concatenate(valid_preds)
    
    return final_loss, valid_preds

def inference_fn(model, dataloader, device):
    model.eval()
    preds = []
    
    for data in dataloader:
        inputs = data['x'].to(device)

        with torch.no_grad():
            outputs = model(inputs)
        
        preds.append(outputs.sigmoid().detach().cpu().numpy())
        
    preds = np.concatenate(preds)
    
    return preds

In [None]:
DEVICE = ('cuda' if torch.cuda.is_available() else 'cpu')
EPOCHS = 25
NFOLDS = 5

BATCH_SIZE = 128
LEARNING_RATE = 1e-3
WEIGHT_DECAY = 1e-5
EARLY_STOPPING_STEPS = 10
EARLY_STOP = True

num_features=len(feature_col)
num_targets=1
hidden_size =150

In [None]:
class Model(nn.Module):
    def __init__(self, num_features, num_targets, hidden_size):
        super(Model, self).__init__()
        self.batch_norm1 = nn.BatchNorm1d(num_features)
        self.dropout1 = nn.Dropout(0.25)
        self.dense1 = nn.utils.weight_norm(nn.Linear(num_features, hidden_size))
        
        self.batch_norm2 = nn.BatchNorm1d(hidden_size)
        self.dropout2 = nn.Dropout(0.25)
        self.dense2 = nn.utils.weight_norm(nn.Linear(hidden_size, hidden_size))
        
        self.batch_norm3 = nn.BatchNorm1d(hidden_size)
        self.dropout3 = nn.Dropout(0.25)
        self.dense3 = nn.utils.weight_norm(nn.Linear(hidden_size, num_targets))
    
    def forward(self, x):
        x = self.batch_norm1(x)
        x = self.dropout1(x)
        x = F.relu(self.dense1(x))
        
        x = self.batch_norm2(x)
        x = self.dropout2(x)
        x = F.relu(self.dense2(x))
        
        x = self.batch_norm3(x)
        x = self.dropout3(x)
        x = self.dense3(x)
        
        return x

In [None]:
def run_training(fold_num, seed, train, test, target_cols):
    
    seed_everything(seed)
    
    print("FOLDS : ", fold_num)
    
    train_idx = train[train.fold != fold_num].index
    valid_idx = train[train.fold == fold_num].index

    ## model
    X_train, y_train = train[feature_col].values[train_idx,], train[['Survived']].values[train_idx,].astype(float)
    X_valid, y_valid = train[feature_col].values[valid_idx,], train[['Survived']].values[valid_idx,].astype(float)
    
    train_dataset = TitanicDataset(X_train, y_train)
    valid_dataset = TitanicDataset(X_valid, y_valid)
    trainloader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    validloader = torch.utils.data.DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=False)
    
    model = Model(
        num_features=num_features,
        num_targets=num_targets,
        hidden_size=hidden_size
    )
    
    model.to(DEVICE)
    
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)        
    scheduler = optim.lr_scheduler.OneCycleLR(optimizer=optimizer, pct_start=0.1, div_factor=1e3, 
                                          max_lr=1e-2, epochs=EPOCHS, steps_per_epoch=len(trainloader))
    
    loss_fn = nn.BCEWithLogitsLoss()
    loss_tr = nn.BCEWithLogitsLoss()
    
    early_stopping_steps = EARLY_STOPPING_STEPS
    early_step = 0
   
    oof = np.zeros((len(train), num_targets))
    predictions = np.zeros((len(test), num_targets))
    
    best_loss = np.inf
    
    for epoch in range(EPOCHS):
        
        train_loss = train_fn(model, optimizer,scheduler, loss_tr, trainloader, DEVICE)
        valid_loss, valid_preds = valid_fn(model, loss_fn, validloader, DEVICE)
        print(f"seed{seed}, FOLD: {fold_num}, EPOCH: {epoch}, train_loss: {train_loss}, valid_loss: {valid_loss}")
        
        if valid_loss < best_loss:
            print('Update best loss')
            best_loss = valid_loss
            oof[valid_idx] = valid_preds
            torch.save(model.state_dict(), f"FOLD{fold_num}_seed{seed}_.pth")
        
        elif(EARLY_STOP == True):
            early_step += 1
            if (early_step >= early_stopping_steps):
                print(f'Early Stop epoch{epoch} early_step{early_step}')
                break
            
    
    #--------------------- PREDICTION---------------------
    x_test = test[feature_col].values
    testdataset = TestDataset(x_test)
    testloader = torch.utils.data.DataLoader(testdataset, batch_size=BATCH_SIZE, shuffle=False)

    model = Model(
        num_features=num_features,
        num_targets=num_targets,
        hidden_size=hidden_size,

    )

    model.load_state_dict(torch.load(f"FOLD{fold_num}_seed{seed}_.pth"))
    model.to(DEVICE)

    predictions = inference_fn(model, testloader, DEVICE)
    
    return oof, predictions

In [None]:
def run_k_fold(NFOLDS, seed):
    oof = np.zeros((len(train), len(target_cols)))
    predictions = np.zeros((len(test), len(target_cols)))

    for fold_num in range(NFOLDS):
        oof_, pred_ = run_training(fold_num, seed, train, test, target_cols)

        predictions += pred_ / NFOLDS
        oof += oof_
        
    return oof, predictions

In [None]:
%%time
SEED = [33]
oof = np.zeros((len(train), len(target_cols)))
predictions = np.zeros((len(test), len(target_cols)))

print("######################## Training ############################")
for seed in SEED:
    oof_, predictions_ = run_k_fold(NFOLDS, seed)
    oof += oof_ / len(SEED)
    predictions += predictions_ / len(SEED)

In [None]:
print('ROC-AUC')
print(roc_auc_score(train['Survived'].ravel(), oof.ravel()))
print('Accuracy')
print(accuracy_score(train['Survived'].ravel(), (oof > 0.5).astype('int').ravel()))

In [None]:
plt.hist(predictions);
plt.xlim(0.0,1.0)
plt.title('Output proba hist');

In [None]:
# sub['Survived'] = (predictions > 0.5).astype('int')
# sub.to_csv("submission.csv", index=False)