Pretraining makes a huge difference in many fields involving deep learning. TabNet uses a very clever unsupervised pretraining scheme, which manages to improve its score.
It is still not as good as GBMs, but the two work well together.
I thought that if there is a way to pretrain with a GBM, there may also be a way to leverage other models.

Let me introduce a way to pretrain the data with LightGBM. Technically, it is a transformation. What I actually do is:
* Train a LightGBM model. I suggest engineering your features if possible and optimizing the parameters.
* Extract SHAP values for the unseen (validation) fold.
* Repeat for all folds and combine the results.
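
The fold mechanics of the steps above can be sketched without any ML library. This is only an illustration under stated assumptions: `kfold_indices`, `out_of_fold_transform`, `fit_mean` and `deviation` are hypothetical names, the "model" is just the mean of the training values, and the "explanation" is the deviation from that mean. The point it shows is that every row is transformed by a model that never saw that row.

```python
def kfold_indices(n_rows, n_splits):
    """Yield (train_idx, val_idx) lists for contiguous, unshuffled folds."""
    fold_sizes = [n_rows // n_splits + (1 if i < n_rows % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


def out_of_fold_transform(rows, n_splits, fit, transform):
    """Transform every row with a model that was fitted without that row."""
    result = [None] * len(rows)
    for train_idx, val_idx in kfold_indices(len(rows), n_splits):
        model = fit([rows[i] for i in train_idx])
        for i in val_idx:
            result[i] = transform(model, rows[i])
    return result


def fit_mean(train_values):
    # Hypothetical stand-in for "train a LightGBM model".
    return sum(train_values) / len(train_values)


def deviation(mean, value):
    # Hypothetical stand-in for "extract SHAP values".
    return value - mean


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
oof_values = out_of_fold_transform(values, n_splits=3, fit=fit_mean, transform=deviation)
print(oof_values)  # [-3.5, -2.5, -0.5, 0.5, 2.5, 3.5]
```

In the notebook itself, the role of `fit` is played by LightGBM and the role of `transform` by the SHAP explainer.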

As a result, you end up with a new dataset, which is:
* Normalized: all features share the same units, since each SHAP value is a contribution to the model output.
* Linearized, kind of. Features are transformed into their importances.
* Categorical features are encoded smarter! The encoding is not linear and depends on the other features of the sample.
* Missing values are handled smarter!

I suggest reading about SHAP values before you try this.
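
As a quick primer, here is a minimal sketch of exact Shapley values on a toy function. It is not how `shap.TreeExplainer` computes them (that uses a tree-specific algorithm and a background distribution); `shapley_values` and `toy_model` are hypothetical names, and a single all-zero background row is assumed. It illustrates two properties the transformation relies on: the values of a sample add up to its prediction minus the baseline, and the same raw feature value can receive different contributions depending on the other features.

```python
from itertools import permutations


def shapley_values(predict, sample, background):
    """Exact Shapley values of `predict` at `sample` against one background row.

    Features that are not yet "switched on" keep their background value.
    """
    n = len(sample)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(background)
        previous = predict(current)
        for j in order:
            current[j] = sample[j]
            updated = predict(current)
            # Marginal contribution of feature j, averaged over all orderings.
            phi[j] += (updated - previous) / len(orders)
            previous = updated
    return phi


def toy_model(x):
    # Hypothetical model with an interaction between features 0 and 1.
    return 2.0 * x[0] + 3.0 * x[0] * x[1] + x[2]


background = [0.0, 0.0, 0.0]
sample_a = [1.0, 1.0, 1.0]
sample_b = [1.0, 0.0, 1.0]  # same value for feature 0, different feature 1

phi_a = shapley_values(toy_model, sample_a, background)
phi_b = shapley_values(toy_model, sample_b, background)

print([round(v, 6) for v in phi_a])  # [3.5, 1.5, 1.0]
print([round(v, 6) for v in phi_b])  # [2.0, 0.0, 1.0]
```

Feature 0 equals 1.0 in both samples, yet it is encoded as 3.5 in one and 2.0 in the other, because its contribution depends on feature 1.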

I chose LightGBM because it is fast, good, and super lazy: no need to worry about categories, missing values, etc.
You may use other tree-based models.

In [None]:
pip install pytorch_tabnet

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
import lightgbm as lgb
from tqdm.autonotebook import tqdm
from sklearn import metrics
import shap
from pytorch_tabnet.tab_model import TabNetClassifier
from pytorch_tabnet.pretraining import TabNetPretrainer
import torch
import random
import os
shap.initjs()

In [None]:
def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
seed_everything()

In [None]:
train_df = pd.read_csv('../input/tabular-playground-series-may-2021/train.csv').set_index('id')
test_df = pd.read_csv('../input/tabular-playground-series-may-2021/test.csv').set_index('id')
sample_submission = pd.read_csv('../input/tabular-playground-series-may-2021/sample_submission.csv')

My params were tuned with Optuna. You do not have to tune yours, but I believe it improves the result.

In [None]:
params = {}

N_SPLITS = 3

I used the feature extraction from other notebooks and just removed the missing-value imputation. In my experience, LightGBM produces better scores with NaNs than with imputed values.
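
A small sketch of what skipping imputation looks like on the pandas side, assuming a hypothetical column named `cat_feature`: converting an object column to `category` keeps the missing entry as missing (code `-1`) instead of turning it into a category of its own, so the gap is passed through to the model as-is.

```python
import pandas as pd

# Hypothetical column with one missing entry.
raw = pd.Series(["a", None, "b"], name="cat_feature")
as_category = raw.astype("category")

print(as_category.cat.categories.tolist())  # ['a', 'b']
print(as_category.cat.codes.tolist())       # [0, -1, 1]
print(int(as_category.isna().sum()))        # 1
```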

In [None]:
all_df = pd.concat([train_df, test_df])

for col in all_df.select_dtypes(['object']).columns:
    all_df[col] = all_df[col].astype('category')

X = all_df[all_df.index.isin(train_df.index)]
y = X.pop('target')

x_tst = all_df[~all_df.index.isin(train_df.index)].drop(columns='target')

In [None]:
folds = KFold(n_splits = N_SPLITS)
oof = np.zeros(X.shape[0])
predictions = np.zeros(x_tst.shape[0])
shap_list = []
shap_tst_list = []
for fold_, (trn_idx, val_idx) in tqdm(enumerate(folds.split(X, y)), total=folds.n_splits):
    print("Fold {}".format(fold_))
    x_trn = X.iloc[trn_idx]
    y_trn = y.iloc[trn_idx]  # positional indexing; y[trn_idx] would look up id labels
    x_val = X.iloc[val_idx]
    y_val = y.iloc[val_idx]
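    model = lgb.LGBMClassifier(**params, random_state=42, n_estimators=9999999)
    # Early stopping and logging are passed as callbacks: LightGBM 4.0+
    # no longer accepts early_stopping_rounds/verbose in fit().
    model.fit(x_trn, y_trn, 
            eval_set=[(x_trn, y_trn),(x_val, y_val)],
#             eval_metric='auc', 
            callbacks=[lgb.early_stopping(500), lgb.log_evaluation(500)]
           )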
    oof[val_idx] = model.predict_proba(x_val, num_iteration=model.best_iteration_)[:,1]
    predictions += model.predict_proba(x_tst, num_iteration=model.best_iteration_)[:,1] / folds.n_splits
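    shap_explainer = shap.TreeExplainer(model)
    # [1] selects the positive class; this assumes a shap version that returns
    # one array per class for a binary LightGBM classifier.
    shap_val = pd.DataFrame(shap_explainer.shap_values(x_val)[1], index=x_val.index, columns=x_val.columns)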
    shap_list.append(shap_val)
    shap_tst = pd.DataFrame(shap_explainer.shap_values(x_tst)[1], index=x_tst.index, columns=x_tst.columns)
    shap_tst_list.append(shap_tst)
submission1 = pd.Series(predictions, index=x_tst.index).to_frame('target').reset_index()
lgb.plot_importance(model)
model1_score = metrics.log_loss(y, oof)

Then I create the transformed dataset and apply three models to it:
* LightGBM. I use the same params; optimized ones should perform better.
* TabNet without unsupervised pretraining
* TabNet with unsupervised pretraining
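
How the per-fold outputs are combined can be sketched on tiny, made-up frames (the `shap_val_fold*` and `shap_tst_fold*` names and their numbers are hypothetical). Validation rows are disjoint across folds, so they are simply stacked; every fold's model explains the same test rows, so those are averaged per row id.

```python
import pandas as pd

# Hypothetical per-fold SHAP frames, indexed by row id.
shap_val_fold0 = pd.DataFrame({"f0": [0.1, 0.2]}, index=[0, 1])
shap_val_fold1 = pd.DataFrame({"f0": [0.3, 0.4]}, index=[2, 3])
shap_tst_fold0 = pd.DataFrame({"f0": [1.0, 2.0]}, index=[10, 11])
shap_tst_fold1 = pd.DataFrame({"f0": [3.0, 4.0]}, index=[10, 11])

# Training part: stack the out-of-fold rows.
train_part = pd.concat([shap_val_fold0, shap_val_fold1])
# Test part: average the folds' explanations for each test id.
test_part = pd.concat([shap_tst_fold0, shap_tst_fold1]).groupby(level=0).mean()

print(train_part.index.tolist())  # [0, 1, 2, 3]
print(test_part["f0"].tolist())   # [2.0, 3.0]
```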


In [None]:
X = pd.concat(shap_list).join(y)
y = X.pop('target')
x_tst = pd.concat(shap_tst_list).groupby(level=0).mean()

In [None]:
oof = np.zeros(X.shape[0])
predictions = np.zeros(x_tst.shape[0])

for fold_, (trn_idx, val_idx) in tqdm(enumerate(folds.split(X, y)), total=folds.n_splits):
    print("Fold {}".format(fold_))
    x_trn = X.iloc[trn_idx]
    y_trn = y.iloc[trn_idx]  # positional indexing; y[trn_idx] would look up id labels
    x_val = X.iloc[val_idx]
    y_val = y.iloc[val_idx]
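    model = lgb.LGBMClassifier(**params, random_state=42, n_estimators=9999999)
    # Early stopping and logging are passed as callbacks: LightGBM 4.0+
    # no longer accepts early_stopping_rounds/verbose in fit().
    model.fit(x_trn, y_trn, 
            eval_set=[(x_trn, y_trn),(x_val, y_val)],
            eval_metric='auc', 
            callbacks=[lgb.early_stopping(500), lgb.log_evaluation(500)]
           )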
    oof[val_idx] = model.predict_proba(x_val, num_iteration=model.best_iteration_)[:,1]
    predictions += model.predict_proba(x_tst, num_iteration=model.best_iteration_)[:,1] / folds.n_splits
submission2 = pd.Series(predictions, index=x_tst.index).to_frame('target').reset_index()
lgb.plot_importance(model)
model2_score = metrics.log_loss(y, oof)

In [None]:
oof = np.zeros(X.shape[0])
predictions = np.zeros(x_tst.shape[0])

for fold_, (trn_idx, val_idx) in tqdm(enumerate(folds.split(X, y)), total=folds.n_splits):
    print("Fold {}".format(fold_))
    x_trn = X.iloc[trn_idx]
    # Positional indexing, converted to plain NumPy arrays for TabNet.
    y_trn = y.iloc[trn_idx].values
    x_val = X.iloc[val_idx]
    y_val = y.iloc[val_idx].values
    model = TabNetClassifier()
    model.fit(
        x_trn.values, y_trn, 
#         eval_metric=['accuracy'],
        eval_set=[(x_val.values, y_val)]
    )
    oof[val_idx] = model.predict_proba(x_val.values)[:,1]
    predictions += model.predict_proba(x_tst.values)[:,1] / folds.n_splits
submission3 = pd.Series(predictions, index=x_tst.index).to_frame('target').reset_index()
model3_score = metrics.log_loss(y, oof)

In [None]:
oof = np.zeros(X.shape[0])
predictions = np.zeros(x_tst.shape[0])

for fold_, (trn_idx, val_idx) in tqdm(enumerate(folds.split(X, y)), total=folds.n_splits):
    print("Fold {}".format(fold_))
    x_trn = X.iloc[trn_idx]
    # Positional indexing, converted to plain NumPy arrays for TabNet.
    y_trn = y.iloc[trn_idx].values
    x_val = X.iloc[val_idx]
    y_val = y.iloc[val_idx].values
    unsupervised_model = TabNetPretrainer(optimizer_fn=torch.optim.Adam,
                                          optimizer_params=dict(lr=2e-2),
                                          mask_type='entmax' # "sparsemax"
                                         )
    
    unsupervised_model.fit(X_train=x_trn.values,
                           eval_set=[x_val.values],
                           pretraining_ratio=0.8,
                          )
    
    model = TabNetClassifier()
    model.fit(x_trn.values, y_trn, 
#               eval_metric=['accuracy'],
              eval_set=[(x_val.values, y_val)],
              from_unsupervised=unsupervised_model
             )
    oof[val_idx] = model.predict_proba(x_val.values)[:,1]
    predictions += model.predict_proba(x_tst.values)[:,1] / folds.n_splits
submission4 = pd.Series(predictions, index=x_tst.index).to_frame('target').reset_index()
model4_score = metrics.log_loss(y, oof)

In [None]:
print('lgbm model:', model1_score)
print('lgbm model(shap-pretrained):', model2_score)
print('tabnet model(shap-pretrained):', model3_score)
print('tabnet model(shap and tabnet pretrained):', model4_score)

In [None]:
submission1.to_csv('submission1.csv', index=False)
submission2.to_csv('submission2.csv', index=False)
submission3.to_csv('submission3.csv', index=False)
submission4.to_csv('submission4.csv', index=False)

Thanks for reading

Now working on unsupervised GBM-pretraining...