# Simple and efficient TabNet approach

## Why TabNet for time series?

In my opinion, TabNet, like XGBoost or LGBM, is NOT the best algorithm for time series.

However, pytorch-tabnet natively implements multi-output regression, which is handy for predicting a whole time series at once.


This notebook shows one elegant and simple way to work with time series with TabNet:
- pick a "memory size" (here the entire series length of 80 points) and give all of those values as input features to the model
- pick a "prediction horizon" (here again the entire series length) and use those values as multi-regression targets

It would be really painful to use such an approach with XGBoost, since you would need one model per time step.

This is a very simple baseline: no feature engineering, no preprocessing, just reformatting.
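
To make the reformatting concrete, here is a minimal sketch on made-up data with 3 time steps instead of 80. It uses `DataFrame.pivot` rather than the `groupby`/`apply` helper defined later in this notebook, and the `step` column and the `to_wide` helper are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Toy long-format data: 2 series ("breath_id") of 3 time steps each.
# Column names mirror the competition data; the values are made up.
long_df = pd.DataFrame({
    "breath_id": [1, 1, 1, 2, 2, 2],
    "step":      [0, 1, 2, 0, 1, 2],
    "u_in":      [0.1, 0.5, 0.9, 0.2, 0.4, 0.6],
    "pressure":  [5.0, 6.0, 7.0, 5.5, 6.5, 7.5],
})


def to_wide(df, column, prefix):
    """Pivot one long column into one row per series, one column per step."""
    wide = df.pivot(index="breath_id", columns="step", values=column)
    wide.columns = [f"{prefix}_{n}" for n in wide.columns]
    return wide


X_wide = to_wide(long_df, "u_in", "u_in")            # inputs, one row per series
y_wide = to_wide(long_df, "pressure", "u_pressure")  # multi-regression targets
print(X_wide)
print(y_wide)
```

Each series becomes a single row, so the model sees the whole input sequence at once and predicts the whole target sequence in one pass.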

## Results

The results are quite good, competitive with simple yet strong LSTM baselines like this one: https://www.kaggle.com/theoviel/deep-learning-starter-simple-lstm

I find the simplicity and flexibility of TabNet really beautiful; I hope you'll like it too.

## Pushing further

If you want to push this baseline further, here is a list of things you could try:
- play with the parameters
- try pretraining on the test set (might be quite heavy)
- create as many features as you like

I am sure the score can be improved; however, I still doubt that it can compete with deep learning models for time series such as TCN or LSTM. It might still be useful for ensembling.

In [None]:
! pip install --user pytorch-tabnet

In [None]:
import pandas as pd
import numpy as np
import torch

from pytorch_tabnet.tab_model import TabNetRegressor
from matplotlib import pyplot as plt

In [None]:
# Load data

train = pd.read_csv("../input/ventilator-pressure-prediction/train.csv")
train["date"] = train.groupby("breath_id").time_step.rank().astype(int)  # np.int was removed from NumPy

test = pd.read_csv('../input/ventilator-pressure-prediction/test.csv')
test["date"] = test.groupby("breath_id").time_step.rank().astype(int)  # np.int was removed from NumPy

In [None]:
def flatten_df(df, has_target=True):
    """
    Switch from one row per time step to one row per full series.

    Expects the 80 rows of a single breath_id, in time order.
    """
    features = ["R", "C"] + [f"u_in_{n}" for n in range(80)] + [f"u_out_{n}" for n in range(80)]
    targets = [f"u_pressure_{n}" for n in range(80)]
    
    if has_target:
        values = np.hstack([df.R.unique(),
                            df.C.unique(),
                            df.u_in.values,
                            df.u_out.values,
                            df.pressure.values])
        columns = features + targets
    else:
        values = np.hstack([df.R.unique(),
                            df.C.unique(),
                            df.u_in.values,
                            df.u_out.values])
        columns = features
    
    result_df = pd.Series({col: val for col, val in zip(columns, values)})
    return result_df

In [None]:
X = train.groupby("breath_id").apply(lambda df : flatten_df(df)).reset_index(drop=False)
X_test = test.groupby("breath_id").apply(lambda df : flatten_df(df, has_target=False)).reset_index(drop=False)

In [None]:
# Define features and targets
features = ["R", "C"] + [f"u_in_{n}" for n in range(80)] + [f"u_out_{n}" for n in range(80)]
targets =  [f"u_pressure_{n}" for n in range(80)]

# Define TabNet params

In [None]:
cat_idxs = [] # R and C could be categorical
cat_dims = []


BS = 2**12
virtual_BS = 256

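# TabNet hyperparameters
tabnet_params = dict(
    cat_idxs=cat_idxs,
    cat_dims=cat_dims,
    cat_emb_dim=1,
    n_d=256,  # width of the decision layer
    n_a=256,  # width of the attention embedding
    n_steps=5,  # number of sequential decision steps
    gamma=1.5,  # relaxation factor for reusing features across steps
    n_independent=2,  # independent GLU layers per step
    n_shared=2,  # shared GLU layers per step
    lambda_sparse=1e-5,  # weight of the sparsity regularization
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    mask_type="entmax",  # "sparsemax" or "entmax"
    seed=42,
    verbose=5,  # print metrics every 5 epochs
)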

In [None]:
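# Make sure to comply with the competition metric: only steps with u_out == 0 are scored.
# Targets to ignore are set to -1 in the training loop; here we take advantage
# of the fact that pressure is positive, so -1 can serve as a mask value.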

from pytorch_tabnet.metrics import Metric

class filtered_MAE(Metric):
    """MAE computed only on targets that were not masked with -1."""

    def __init__(self):
        self._name = "filtered_mae"
        self._maximize = False

    def __call__(self, y_true, y_score):
        weights = y_true > -1  # True where the target must be scored
        mae = weights * np.abs(y_true - y_score)
        mae = mae.sum() / weights.sum()
        return mae


def filtered_loss(y_pred, y_true):
    # Same masked MAE as above, written with torch ops so it can be backpropagated
    weights = (y_true > -1).float()
    mae = weights * torch.abs(y_true - y_pred)
    mae = torch.sum(mae) / torch.sum(weights)
    return mae
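
As a sanity check on the masking idea, here is a small NumPy-only sketch with made-up numbers: targets where `u_out == 1` are overwritten with -1, and the masked MAE then ignores them. The arrays and the `masked_mae` helper are hypothetical and only mirror the logic of the metric above.

```python
import numpy as np

# Made-up targets and predictions for 2 series of 4 time steps.
y_true = np.array([[5.0, 6.0, 7.0, 6.0],
                   [4.0, 5.0, 6.0, 5.0]])
y_pred = np.array([[5.5, 6.0, 9.0, 0.0],
                   [4.0, 4.0, 0.0, 0.0]])
# u_out == 1 marks the steps that should not be scored.
u_out = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1]])

# Same trick as in the training loop: overwrite ignored targets with -1.
y_masked = y_true.copy()
y_masked[u_out == 1] = -1


def masked_mae(y_true, y_pred):
    """MAE over the entries whose target is greater than -1."""
    weights = y_true > -1
    return (weights * np.abs(y_true - y_pred)).sum() / weights.sum()


score = masked_mae(y_masked, y_pred)     # only the 4 unmasked entries count
plain = np.abs(y_true - y_pred).mean()   # all 8 entries count
print(score)  # 0.375
print(plain)
```

The large errors on the masked steps do not affect `score`, which is the point of the mask.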

# Train

In [None]:
from sklearn.model_selection import GroupKFold
import copy

N_SPLITS = 5
kfold = GroupKFold(n_splits=N_SPLITS)

# How many folds do you want to train?
N_MODELS = 5
assert N_MODELS <= N_SPLITS
max_epochs =  500

# Create out-of-fold and test prediction arrays
oof_predictions = np.zeros((X.shape[0], len(targets)))
test_preds = np.zeros((X_test.shape[0], len(targets)))

for fold, (trn_ind, val_ind) in enumerate(kfold.split(X, groups=X.breath_id)):
    print(f'Training fold {fold + 1}')
    X_train, X_val = X.loc[trn_ind][features].values, X.loc[val_ind][features].values
    y_train, y_val = X.loc[trn_ind][targets].values, X.loc[val_ind][targets].values
    
    # mask unnecessary outputs with -1
    y_train[X.loc[trn_ind, [c for c in X.columns if c.startswith("u_out")]].values==1] = -1
    y_val[X.loc[val_ind, [c for c in X.columns if c.startswith("u_out")]].values==1]=-1
    
    
    params = copy.deepcopy(tabnet_params)
    params["scheduler_fn"]=torch.optim.lr_scheduler.OneCycleLR
    params["scheduler_params"]={"is_batch_level":True,
                                "max_lr":5e-2,
                                "steps_per_epoch":int(X_train.shape[0] / BS)+1,
                                "epochs":max_epochs}
    

    clf =  TabNetRegressor(**params)
    clf.fit(
      X_train, y_train,
      eval_set=[(X_train, y_train), (X_val, y_val)],
      eval_name=['train', 'val'],
      max_epochs = max_epochs,
      patience = 200,
      batch_size = BS,  
      virtual_batch_size = virtual_BS, 
      num_workers = 0,
      drop_last = False, 
      eval_metric=["filtered_mae"],
      loss_fn=filtered_loss,
      )
    
    del X_train
      
    oof_predictions[val_ind] = clf.predict(X_val)
    
    del X_val
    
    test_preds += clf.predict(X_test[features].values) / N_MODELS
    if fold+1 >=N_MODELS:
        break
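
A note on `steps_per_epoch`: with a batch-level scheduler, `OneCycleLR` is stepped once per batch and raises an error if it is stepped more often than `steps_per_epoch * epochs`, so the value must not be smaller than the real number of batches. The sketch below, with hypothetical row counts and a hypothetical `batches_per_epoch` helper, compares the `int(n / BS) + 1` formula used above with the exact count when `drop_last` is False.

```python
import math


def batches_per_epoch(n_rows, batch_size):
    """Exact number of batches when the last partial batch is kept."""
    return math.ceil(n_rows / batch_size)


def notebook_formula(n_rows, batch_size):
    """The formula used in the training loop above."""
    return int(n_rows / batch_size) + 1


for n_rows in (10_000, 8_192):  # hypothetical training set sizes
    exact = batches_per_epoch(n_rows, 4096)
    used = notebook_formula(n_rows, 4096)
    print(n_rows, exact, used)
```

The two agree unless the number of rows is an exact multiple of the batch size, in which case the notebook's formula overestimates by one step per epoch, which is the safe direction.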

# Have a look at out-of-fold predictions

In [None]:
for serie_nb in range(50):
    plt.plot(oof_predictions[serie_nb, :])
    plt.plot(X.loc[serie_nb][targets].values, color="green")
    plt.show()

# Make submission

In [None]:
sample_submission = pd.read_csv("../input/ventilator-pressure-prediction/sample_submission.csv")
# Flatten (n_breaths, 80) row by row into one 1-D column of predictions
sample_submission["pressure"] = test_preds.reshape(-1)
sample_submission.to_csv("submission.csv", index=False)
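
The submission step relies on a row-major flatten of the wide prediction matrix. A minimal sketch with made-up values (3 steps instead of 80) shows the resulting order. Note the assumption that the submission file lists rows series by series and, within each series, in time order, matching the row order of the wide matrix.

```python
import numpy as np

# Hypothetical predictions: 2 series (rows) of 3 time steps (columns).
wide_preds = np.array([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0]])

# Row-major flatten: all steps of series 0, then all steps of series 1.
flat = wide_preds.reshape(-1)
print(flat)  # [1. 2. 3. 4. 5. 6.]
```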