# What is PyTorch Tabular?

![PyTorch Tabular](https://deepandshallowml.files.wordpress.com/2021/01/pytorch_tabular_logo.png)

PyTorch Tabular is a framework and wrapper library that aims to make deep learning on tabular data easy and accessible for both real-world use and research. The core design principles of the library are:

- Low Resistance Usability
- Easy Customization
- Scalable and Easier to Deploy

Instead of starting from scratch, the framework is built on the shoulders of giants: **PyTorch** (obviously) and **PyTorch Lightning**.

It also comes with state-of-the-art deep learning models that can be easily trained using pandas dataframes.

The high-level, config-driven API makes it quick to use and iterate on. You pass in a **pandas dataframe**, and the library handles the heavy lifting: normalizing, standardizing, encoding categorical features, and preparing the dataloader.
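
To make the preprocessing concrete, here is a minimal pure-Python sketch of two of the steps named above: standardizing a continuous column and integer-encoding a categorical one. It is an illustration only, not PyTorch Tabular's implementation; the helper names `standardize` and `ordinal_encode` and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a numeric column to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]


def ordinal_encode(values):
    """Map each distinct category to an integer, in order of first appearance.

    Indices start at 1 so that 0 stays free for categories that were
    not seen when the mapping was built.
    """
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1
    return [mapping[v] for v in values], mapping


# Hypothetical toy columns, shaped like `cont*` / `cat*` features.
cont0 = [0.2, 0.4, 0.6, 0.8]
cat0 = ["A", "B", "A", "C"]

scaled = standardize(cont0)
codes, mapping = ordinal_encode(cat0)
print(scaled)
print(codes, mapping)
```

The integer codes are the kind of input an embedding layer expects, which is why categorical columns are typically encoded this way before being fed to a neural network.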

The `BaseModel` class is an easy-to-extend abstract class for implementing custom models while still leveraging the rest of the machinery packaged with the library.
State-of-the-art networks such as **Neural Oblivious Decision Ensembles (NODE)** and **TabNet** (Attentive Interpretable Tabular Learning) are implemented. See the examples in the [documentation](https://pytorch-tabular.readthedocs.io/en/latest/) for how to use them.

Because training runs on PyTorch Lightning, PyTorch Tabular inherits the flexibility and scalability that PyTorch Lightning provides.

- GitHub: [https://github.com/manujosephv/pytorch_tabular](https://github.com/manujosephv/pytorch_tabular)
- Documentation: [https://pytorch-tabular.readthedocs.io/en/latest/](https://pytorch-tabular.readthedocs.io/en/latest/)
- Accompanying Blog: [PyTorch Tabular – A Framework for Deep Learning for Tabular Data](https://deep-and-shallow.com/2021/01/27/pytorch-tabular-a-framework-for-deep-learning-for-tabular-data/)


In [None]:
# install PyTorch Tabular first
!pip install pytorch_tabular
# This is for a custom optimizer. PyTorch Tabular is flexible enough to use custom optimizers
!pip install torch_optimizer
!pip install pandas==1.1.5

In [None]:
# packages

# standard
import numpy as np
import pandas as pd
import time

# plots
import matplotlib.pyplot as plt
import seaborn as sns

# NODE and ML tools
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig, NodeConfig, TabNetModelConfig
from pytorch_tabular.config import DataConfig, OptimizerConfig, TrainerConfig, ExperimentConfig
from pytorch_tabular.categorical_encoders import CategoricalEmbeddingTransformer
from torch_optimizer import QHAdam
import category_encoders as ce
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

In [None]:
# load training data
df_train = pd.read_csv('../input/tabular-playground-series-mar-2021/train.csv')
display(df_train.head())
# load test data
df_test = pd.read_csv('../input/tabular-playground-series-mar-2021/test.csv')
display(df_test.head())

In [None]:
df_train.columns

## Defining the configs for the data, training, model, and optimizer

In [None]:
def get_configs(train):
    epochs = 25
    batch_size = 1024
    steps_per_epoch = int((len(train)//batch_size)*0.9)
    data_config = DataConfig(
        # target must always be a list. Multi-target is supported only for regression;
        # multi-task classification is not implemented.
        target=['target'],
        continuous_cols=[
            'cont0', 'cont1', 'cont2', 'cont3', 'cont4',
            'cont5', 'cont6', 'cont7', 'cont8', 'cont9', 'cont10',
        ],
        categorical_cols=[
            'cat0', 'cat1', 'cat2', 'cat3', 'cat4', 'cat5', 'cat6', 'cat7',
            'cat8', 'cat9', 'cat10', 'cat11', 'cat12', 'cat13', 'cat14', 'cat15',
            'cat16', 'cat17', 'cat18',
        ],
        continuous_feature_transform="quantile_normal",
    )
    trainer_config = TrainerConfig(
        auto_lr_find=True, # Runs the LRFinder to automatically derive a learning rate
        batch_size=batch_size,
        max_epochs=epochs,
        # gpus=1,  # number of GPUs to use; 0 means CPU
    )
    optimizer_config = OptimizerConfig(
        lr_scheduler="OneCycleLR",
        lr_scheduler_params={"max_lr": 0.005, "epochs": epochs, "steps_per_epoch": steps_per_epoch},
    )
    model_config = CategoryEmbeddingModelConfig(
        task="classification",
        layers="500-200",  # number of nodes in each hidden layer
        activation="ReLU",  # activation applied between layers
        learning_rate=1e-3,
        batch_norm_continuous_input=True,
        use_batch_norm=True,
        dropout=0.1,
        embedding_dropout=0.05,
        initialization="kaiming",
        metrics=["auroc"],
        metrics_params=[{}],  # one parameter dict per metric
    )
    return data_config, trainer_config, optimizer_config, model_config
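
The `steps_per_epoch` arithmetic matters because PyTorch's `OneCycleLR` plans its schedule over `epochs * steps_per_epoch` optimizer steps and raises an error if it is stepped more often than that. The sketch below compares the estimate used in `get_configs` with a version that takes the training fraction explicitly. The helper name `train_steps_per_epoch`, the 100,000-row count, and the assumption that a fixed fraction of rows is held out for validation (with the last partial batch kept) are illustrative, not values taken from the library or the dataset.

```python
import math


def notebook_steps_per_epoch(n_rows, batch_size):
    """The estimate used in `get_configs`: full batches in the data, scaled by 0.9."""
    return int((n_rows // batch_size) * 0.9)


def train_steps_per_epoch(n_rows, batch_size, train_fraction):
    """Batches per epoch when only `train_fraction` of the rows is used for training.

    Assumes the last partial batch is kept, so the count is rounded up.
    """
    n_train = int(n_rows * train_fraction)
    return math.ceil(n_train / batch_size)


n_rows, batch_size, epochs = 100_000, 1024, 25  # n_rows is a made-up example

estimate = notebook_steps_per_epoch(n_rows, batch_size)
print("notebook estimate:", estimate, "steps/epoch,", estimate * epochs, "total")

for fraction in (0.9, 0.8):  # hypothetical train/validation splits
    steps = train_steps_per_epoch(n_rows, batch_size, fraction)
    print(f"train fraction {fraction}:", steps, "steps/epoch,", steps * epochs, "total")
```

If the real number of steps per epoch turns out larger than the value passed to the scheduler, it is safer to round the estimate up: an overestimate only leaves the learning-rate cycle unfinished, while an underestimate can stop training with an error.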

In [None]:
data_config, trainer_config, optimizer_config, model_config = get_configs(df_train)
tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config
)
# fit the model with a custom optimizer (QHAdam from torch_optimizer)
tabular_model.fit(
    train=df_train,
    optimizer=QHAdam,
    optimizer_params={"nus": (0.7, 1.0), "betas": (0.95, 0.998)},
)

In [None]:
pred_df = tabular_model.predict(df_test)

In [None]:
pred_df

In [None]:
# prepare submission
df_sub = pd.read_csv('../input/tabular-playground-series-mar-2021/sample_submission.csv')
# use the predicted probability of the positive class (label 1)
df_sub['target'] = pred_df['1_probability'].values
df_sub.head()
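
The `'1_probability'` column name depends on how the installed PyTorch Tabular version labels its prediction columns, so a small lookup can make the cell above less brittle. This is a hedged sketch: the helper `find_probability_column` is hypothetical, and the alternative name `target_1_probability` is an assumption about other releases rather than something this notebook confirms.

```python
def find_probability_column(columns, positive_label="1"):
    """Return the first column named '<label>_probability' or ending in '_<label>_probability'."""
    suffix = f"{positive_label}_probability"
    matches = [
        c for c in columns
        if str(c) == suffix or str(c).endswith("_" + suffix)
    ]
    if not matches:
        raise KeyError(f"no column matching {suffix!r} in {list(columns)}")
    return matches[0]


# Column lists below are illustrative, not captured output.
old_style = ["0_probability", "1_probability", "prediction"]
new_style = ["target_0_probability", "target_1_probability", "target_prediction"]  # assumed naming

print(find_probability_column(old_style))
print(find_probability_column(new_style))
```

With a real prediction frame, this would be called as `find_probability_column(pred_df.columns)` before filling in the submission.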

In [None]:
df_sub.to_csv("submission.csv", index=False)