# What is PyTorch Tabular?

![PyTorch Tabular](https://deepandshallowml.files.wordpress.com/2021/01/pytorch_tabular_logo.png)

PyTorch Tabular is a framework and wrapper library that aims to make deep learning with tabular data easy and accessible for both real-world use cases and research. The core principles behind the design of the library are:

- Low Resistance Usability
- Easy Customization
- Scalable and Easier to Deploy

Instead of starting from scratch, the framework is built on the shoulders of giants: **PyTorch** (obviously) and **PyTorch Lightning**.

It also comes with state-of-the-art deep learning models that can be easily trained using pandas dataframes.

The high-level, config-driven API makes it quick to use and iterate on. You pass in a **pandas dataframe**, and the library handles the heavy lifting: normalizing, standardizing, encoding categorical features, and preparing the dataloader.

The `BaseModel` class provides an easy-to-extend abstract class for implementing custom models while still leveraging the rest of the machinery packaged with the library.
State-of-the-art networks such as **Neural Oblivious Decision Ensembles (NODE)** for Deep Learning on Tabular Data and **TabNet**: Attentive Interpretable Tabular Learning are implemented. See the examples in the [documentation](https://pytorch-tabular.readthedocs.io/en/latest/) for how to use them.

By using PyTorch Lightning for training, PyTorch Tabular inherits the flexibility and scalability that PyTorch Lightning provides.

- GitHub: [https://github.com/manujosephv/pytorch_tabular](https://github.com/manujosephv/pytorch_tabular)
- Documentation: [https://pytorch-tabular.readthedocs.io/en/latest/](https://pytorch-tabular.readthedocs.io/en/latest/)
- Accompanying Blog: [PyTorch Tabular – A Framework for Deep Learning for Tabular Data](https://deep-and-shallow.com/2021/01/27/pytorch-tabular-a-framework-for-deep-learning-for-tabular-data/)


# How to use PyTorch Tabular?

In [None]:
# install PyTorch Tabular first
!pip install pytorch_tabular
# This is for a custom optimizer. PyTorch Tabular is flexible enough to use custom optimizers
!pip install torch_optimizer

In [None]:
# packages

# standard
import numpy as np
import pandas as pd
import time

# plots
import matplotlib.pyplot as plt
import seaborn as sns

# NODE and ML tools
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig, NodeConfig, TabNetModelConfig
from pytorch_tabular.config import DataConfig, OptimizerConfig, TrainerConfig, ExperimentConfig
from pytorch_tabular.categorical_encoders import CategoricalEmbeddingTransformer
from torch_optimizer import QHAdam
import category_encoders as ce
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

# Reading and PreProcessing the Data

In [None]:
# load training data
df_train = pd.read_csv('../input/tabular-playground-series-jan-2021/train.csv')
display(df_train.head())
# load test data
df_test = pd.read_csv('../input/tabular-playground-series-jan-2021/test.csv')
display(df_test.head())

features = ['cont1', 'cont2', 'cont3', 'cont4', 'cont5',
            'cont6', 'cont7', 'cont8', 'cont9', 'cont10',
            'cont11', 'cont12', 'cont13', 'cont14']

## Binning the Continuous Features

In [None]:
from sklearn.preprocessing import KBinsDiscretizer
enc = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy="quantile")
enc.fit(df_train[features])
binned_df_train = enc.transform(df_train[features])
binned_df_test = enc.transform(df_test[features])
for i, feature in enumerate(features):
    df_train[f"{feature}_binned"] = binned_df_train[:,i]
    df_test[f"{feature}_binned"] = binned_df_test[:,i]
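
To make the binning step concrete, here is a minimal sketch on made-up data. The `toy_*` names and the uniform random values are assumptions for illustration only; the real notebook applies the same idea to all 14 `cont` columns at once. The point it shows is that the bin edges are learned from the training data and then reused unchanged on the test data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Toy data standing in for a single continuous column (values are made up).
rng = np.random.default_rng(0)
toy_train = pd.DataFrame({"cont1": rng.uniform(0, 1, size=1000)})
toy_test = pd.DataFrame({"cont1": rng.uniform(0, 1, size=200)})

# Learn the quantile bin edges on the training data only.
toy_enc = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
toy_enc.fit(toy_train[["cont1"]])

# Reuse the same edges for train and test; each value becomes a bin index 0..9.
toy_train["cont1_binned"] = toy_enc.transform(toy_train[["cont1"]])[:, 0]
toy_test["cont1_binned"] = toy_enc.transform(toy_test[["cont1"]])[:, 0]

# With the quantile strategy, every bin holds roughly the same number of training rows.
print(toy_train["cont1_binned"].value_counts().sort_index())
```

Because the bins are ordinal codes, they can later be declared as categorical columns so that a model learns an embedding per bin.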

## Defining the configs for the data, training, model, and optimizer

In [None]:
def get_configs(train):
    epochs = 25
    batch_size = 512
    steps_per_epoch = int((len(train)//batch_size)*0.9)
    data_config = DataConfig(
        target=['target'], #target should always be a list. Multi-targets are only supported for regression. Multi-Task Classification is not implemented
        continuous_cols=['cont1', 'cont2', 'cont3', 'cont4', 'cont5', 'cont6', 'cont7',
           'cont8', 'cont9', 'cont10', 'cont11', 'cont12', 'cont13', 'cont14'],
        categorical_cols=['cont1_binned', 'cont2_binned', 'cont3_binned',
       'cont4_binned', 'cont5_binned', 'cont6_binned', 'cont7_binned',
       'cont8_binned', 'cont9_binned', 'cont10_binned', 'cont11_binned',
       'cont12_binned', 'cont13_binned', 'cont14_binned'],
        continuous_feature_transform="quantile_normal"
    )
    trainer_config = TrainerConfig(
        auto_lr_find=False, # Runs the LRFinder to automatically derive a learning rate
        batch_size=batch_size,
        max_epochs=epochs,
        gpus=1, # number of GPUs to use; 0 means CPU
    )
    optimizer_config = OptimizerConfig(lr_scheduler="OneCycleLR", lr_scheduler_params={"max_lr":0.005, "epochs": epochs, "steps_per_epoch":steps_per_epoch})
    # model_config = CategoryEmbeddingModelConfig(
    #     task="regression",
    #     layers="200-100",  # Number of nodes in each layer
    #     activation="ReLU", # Activation between each layers
    #     learning_rate = 1e-3,
    #     batch_norm_continuous_input=True,
    #     use_batch_norm =True,
    #     dropout=0.0,
    #     embedding_dropout=0.0,
    #     initialization="kaiming"
    # )

    model_config = NodeConfig(
        task="regression",
        num_layers=2, # Number of Dense Layers
        num_trees=1024, #Number of Trees in each layer
        depth=6, #Depth of each Tree
        embed_categorical=True, #If True, will use a learned embedding, else it will use LeaveOneOutEncoding for categorical columns
        learning_rate = 1e-3,
        target_range=[(df_train[col].min(),df_train[col].max()) for col in ['target']]
    )
    return data_config, trainer_config, optimizer_config, model_config
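
The `OneCycleLR` parameters above are tied to the training length: when no explicit total is given, PyTorch's `OneCycleLR` plans its schedule over `epochs * steps_per_epoch` optimizer steps and raises an error if it is stepped more often than that. The sketch below only reproduces the arithmetic from `get_configs`; the helper name `one_cycle_total_steps` and the row count of 240,000 are hypothetical, and the 0.9 factor is taken from the code above as is.

```python
def one_cycle_total_steps(n_rows, batch_size, epochs, train_fraction=0.9):
    """Mirror the steps_per_epoch arithmetic used in get_configs (hypothetical helper)."""
    # Full batches in the data, scaled by the same 0.9 factor as in get_configs.
    steps_per_epoch = int((n_rows // batch_size) * train_fraction)
    # OneCycleLR spreads one learning-rate cycle over this many optimizer steps.
    total_steps = epochs * steps_per_epoch
    return steps_per_epoch, total_steps


# Hypothetical fold size; the real value is len(train) for each fold.
steps_per_epoch, total_steps = one_cycle_total_steps(
    n_rows=240_000, batch_size=512, epochs=25
)
print(steps_per_epoch, total_steps)
```

If `batch_size` or `epochs` changes, the scheduler parameters need to change with them, which is why `get_configs` derives them from the same variables.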

## Cross Validated Bagging Model Run

Here, I run 5-fold cross-validation: each model is trained on the training part of every fold, scored on the held-out part, and used to predict on the test set.

The models are:
1. NODE
2. LGBM
3. CatBoost

For LGBM and CatBoost, **we use the categorical embeddings that were trained as part of NODE for the binned categorical columns**. This is easily done using **PyTorch Tabular**.

In [None]:
# random seeds
rnd_seed_cv = 1234
rnd_seed_reg = 1234
# cross validation
kf = KFold(n_splits=5, random_state=rnd_seed_cv, shuffle=True)
df_train.drop(columns='id', inplace=True)
df_test.drop(columns='id', inplace=True)
df_test['target'] = 0
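
As a quick illustration of what `KFold` produces, the sketch below splits 20 made-up row indices with the same settings as above. The `toy_*` names are assumptions for illustration. It shows that each row lands in the validation part of exactly one fold, which is what lets the out-of-fold predictions later be concatenated into one prediction per training row.

```python
import numpy as np
from sklearn.model_selection import KFold

# Twenty made-up row positions standing in for the training dataframe.
toy_rows = np.arange(20)
toy_kf = KFold(n_splits=5, random_state=1234, shuffle=True)

seen_in_validation = []
for fold, (train_idx, valid_idx) in enumerate(toy_kf.split(toy_rows)):
    # Training and validation indices never overlap within a fold.
    assert len(np.intersect1d(train_idx, valid_idx)) == 0
    seen_in_validation.extend(valid_idx.tolist())
    print(f"fold {fold}: {len(train_idx)} train rows, {len(valid_idx)} validation rows")

# Across all folds, every row is held out exactly once.
print(sorted(seen_in_validation) == list(range(20)))
```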

In [None]:
def node(train, valid, df_test):
    data_config, trainer_config, optimizer_config, model_config = get_configs(train)
    tabular_model = TabularModel(
        data_config=data_config,
        model_config=model_config,
        optimizer_config=optimizer_config,
        trainer_config=trainer_config
    )
    # fit model
    tabular_model.fit(train=train, validation=valid, optimizer=QHAdam, 
                  optimizer_params={"nus": (0.7, 1.0), "betas": (0.95, 0.998)})
    result = tabular_model.evaluate(valid)
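    # RMSE on the validation fold, plus predictions for the validation fold and the test set
    rmse = np.sqrt(result[0]["test_mean_squared_error"])
    valid_pred = tabular_model.predict(valid)["target_prediction"].values
    test_pred = tabular_model.predict(df_test)["target_prediction"].values
    return rmse, valid_pred, test_pred, tabular_model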

def lgbm(train, valid, df_test):
    lgb_model = LGBMRegressor(n_estimators=10000,
                              learning_rate=0.005,
                              early_stopping_rounds=50,
                              feature_pre_filter=False,
                              num_leaves=102,
                              min_child_samples=20,
                              colsample_bytree=0.4,
                              subsample=1,
                              subsample_freq=0,
                              lambda_l1=4.6,
                              lambda_l2=1.9,
                              random_state=42)
    lgb_model.fit(train.drop(columns='target'), train['target'], eval_set=(valid.drop(columns='target'),valid.loc[:,'target']))
    lgb_preds = lgb_model.predict(valid.drop(columns='target'))
    # RMSE as the square root of MSE; `squared=False` is deprecated in recent scikit-learn releases
    score = np.sqrt(mean_squared_error(valid['target'].values, lgb_preds))
    # reuse the validation predictions instead of predicting a second time
    return score, lgb_preds, lgb_model.predict(df_test.drop(columns='target'))

def cb(train, valid, df_test):
    best_params = {
        'grow_policy': 'Lossguide',
        'boosting_type': 'Plain',
        'depth': 20,
        'l2_leaf_reg': 3.699746597668451,
        'min_data_in_leaf': 4,
        'random_strength': 4.9263987954247455,
        'rsm': 1.0,
        # "eval_metric": "RMSE:hints=skip_train~false",
        "n_estimators": 10000,
        "learning_rate": 0.5,
        "od_type": "Iter",
        "od_wait": 50,
    }
    catboost_model = CatBoostRegressor(**best_params)
    catboost_model.fit(train.drop(columns='target'), train['target'], eval_set=(valid.drop(columns='target'),valid.loc[:,'target']))
    cb_preds = catboost_model.predict(valid.drop(columns='target'))
    # RMSE as the square root of MSE; `squared=False` is deprecated in recent scikit-learn releases
    score = np.sqrt(mean_squared_error(valid['target'].values, cb_preds))
    # reuse the validation predictions instead of predicting a second time
    return score, cb_preds, catboost_model.predict(df_test.drop(columns='target'))

In [None]:
# train

CV_node = []
CV_lgb = []
CV_cb = []
preds_train_node = []
preds_train_lgb = []
preds_train_cb = []
preds_test_node = []
preds_test_lgb = []
preds_test_cb = []
cross_validated_preds = []

t1 = time.time()
for train_index, test_index in kf.split(df_train):
    train = df_train.iloc[train_index]
    valid = df_train.iloc[test_index]
    cv_val = valid.copy()
    #NODE
    node_score, node_train_pred, node_test_pred, tabular_model = node(train, valid, df_test)
    CV_node.append(node_score)
    cv_val['pred_node'] = node_train_pred
    preds_train_node.append(node_train_pred)
    preds_test_node.append(node_test_pred)
    # Using the trained Embeddings to replace categorical features
    transformer = CategoricalEmbeddingTransformer(tabular_model)
    train_transform = transformer.fit_transform(train)
    val_transform = transformer.transform(valid)
    df_test_transform = transformer.transform(df_test)
    #LGBM
    lgbm_score, lgbm_train_pred, lgbm_test_pred = lgbm(train_transform, val_transform, df_test_transform)
    CV_lgb.append(lgbm_score)
    cv_val['pred_lgb'] = lgbm_train_pred
    preds_train_lgb.append(lgbm_train_pred)
    preds_test_lgb.append(lgbm_test_pred)
    #Catboost
    cb_score, cb_train_pred, cb_test_pred = cb(train_transform, val_transform, df_test_transform)
    CV_cb.append(cb_score)
    cv_val['pred_cb'] = cb_train_pred
    preds_train_cb.append(cb_train_pred)
    preds_test_cb.append(cb_test_pred)
    cross_validated_preds.append(cv_val)
t2 = time.time()
print('Elapsed time [s]: ', t2-t1)

In [None]:
# Cross Validation performance
print('NODE CV performance [RMSE]: ', np.mean(CV_node, axis=0))
print('LGBM CV performance [RMSE]: ', np.mean(CV_lgb, axis=0))
print('CatBoost CV performance [RMSE]: ', np.mean(CV_cb, axis=0))

In [None]:
cross_val_pred_df = pd.concat(cross_validated_preds, sort=False)
cross_val_pred_df.to_csv("cross_val_preds.csv")
import joblib
joblib.dump(preds_test_node, "preds_test_node.sav")
joblib.dump(preds_test_lgb, "preds_test_lgb.sav")
joblib.dump(preds_test_cb, "preds_test_cb.sav")

## Weighted Average of the Predictions

The weights are derived by running Linear Regression on the Cross Validated Predictions.

In [None]:
avg_cb_pred = np.mean(preds_test_cb, axis=0)
avg_lgb_pred = np.mean(preds_test_lgb, axis=0)
avg_node_pred = np.mean(preds_test_node, axis=0)
pred_test = np.average([avg_node_pred, avg_lgb_pred, avg_cb_pred], axis=0, weights=[-0.15447081,  1.1021915 ,  0.06145868])
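
The notebook does not show how the weights were fitted, so the following is only a sketch of one way to derive blending weights with linear regression. It uses synthetic stand-ins for the out-of-fold predictions; in the notebook, the inputs would be the `pred_node`, `pred_lgb`, and `pred_cb` columns of `cross_val_pred_df` regressed against `target`. The `toy_*` names, the noise levels, and the choice of `fit_intercept=False` are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
toy_target = rng.normal(size=500)

# Three synthetic out-of-fold prediction columns with different noise levels.
toy_oof = np.column_stack([
    toy_target + rng.normal(scale=0.5, size=500),  # noisiest model
    toy_target + rng.normal(scale=0.1, size=500),  # most accurate model
    toy_target + rng.normal(scale=0.3, size=500),
])

# Assumption: no intercept, so the blend is a pure weighted sum of the models.
toy_blender = LinearRegression(fit_intercept=False)
toy_blender.fit(toy_oof, toy_target)
toy_weights = toy_blender.coef_
print("weights:", toy_weights)

# Apply the weights as a plain weighted sum of the model predictions.
toy_blend = toy_oof @ toy_weights


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


print("blend RMSE:", rmse(toy_target, toy_blend))
print("best single-model RMSE:", min(rmse(toy_target, toy_oof[:, j]) for j in range(3)))
```

Note that `np.average` divides by the sum of the weights, so when the fitted weights do not sum to exactly 1 its result differs slightly from the plain weighted sum used in this sketch.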

## Submission

In [None]:
# prepare submission
df_sub = pd.read_csv('../input/tabular-playground-series-jan-2021/sample_submission.csv')
df_sub.target = pred_test
df_sub.head()

In [None]:
# save to file for submission
df_sub.to_csv('submission.csv', index=False)