# Training of a DNN Tabular model with categorical embeddings using FastAI

FastAI provides a large set of convenience functions on top of PyTorch for deep learning tasks. In this notebook, I'll demonstrate how easy it is to get a deep learning tabular model up and running.

In [None]:
import fastai

import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns

from fastai.tabular.all import *

## 1. Load our data

In [None]:
data_dir = "/kaggle/input/tabular-playground-series-feb-2021/"
train_df = pd.read_csv(os.path.join(data_dir, "train.csv"))
test_df = pd.read_csv(os.path.join(data_dir, "test.csv"))
train_df.head()

## 2. Data preprocessing and creation of dataloaders

Let's preprocess our data into a suitable form for training. We'll encode categorical variables, standardise numerical features, and fill missing values (if there are any) within the dataset. We can do this very easily using the TabularPandas class, like so:

In [None]:
processing_funcs = [Categorify, FillMissing, Normalize]
cat_cols = [x for x in train_df.columns.values if x.startswith('cat')]
num_cols = [x for x in train_df.columns.values if x.startswith('cont')]

In [None]:
nn_df = TabularPandas(train_df, cat_names=cat_cols, cont_names=num_cols, procs=processing_funcs, y_names='target')

In [None]:
# turn our tabular data into a dataloader, batch size 1024
train_dl = nn_df.dataloaders(1024)

In [None]:
# preview some of our data from the dataloader
train_dl.show_batch()
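
To make the three processing functions less of a black box, here is a rough pandas-only sketch of what each step does conceptually. The toy frame and its column names (`cat0`, `cont0`, and the derived columns) are invented for illustration, and fastai's own implementations differ in detail, so treat this as an approximation rather than a copy of the library's behaviour.

```python
import pandas as pd

# Toy data; the column names and values are invented for this sketch.
toy = pd.DataFrame({
    "cat0": ["A", "B", "A", "C"],
    "cont0": [1.0, None, 3.0, 5.0],
})

# Categorify: map each category to an integer code.
# Shifting by 1 leaves 0 free for missing or unseen categories.
toy["cat0_code"] = toy["cat0"].astype("category").cat.codes + 1

# FillMissing: record which rows were missing, then fill with the median.
toy["cont0_na"] = toy["cont0"].isna()
toy["cont0"] = toy["cont0"].fillna(toy["cont0"].median())

# Normalize: rescale to zero mean and unit standard deviation.
mean, std = toy["cont0"].mean(), toy["cont0"].std(ddof=0)
toy["cont0_norm"] = (toy["cont0"] - mean) / std

print(toy)
```

The integer codes are what the embedding layers later use as row indices, which is why the categorical step has to happen before the model sees the data.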

It's so easy you almost feel like you've cheated somehow!

I must admit, this is something that put me off using FastAI initially. However, after the pain and effort of doing all of this manually many times with Keras, TensorFlow and PyTorch implementations, the ease of this method is highly appreciated.

We could also have done much the same directly from TabularDataLoaders, which additionally holds out a random validation set by default (20% of the rows, unless `valid_idx` is given), like so:

In [None]:
dls = TabularDataLoaders.from_df(train_df, path='.', y_names="target",  
                                 cat_names = cat_cols, 
                                 cont_names = num_cols, 
                                 procs=processing_funcs)

## 3. Production of our DNN model

Since we're performing regression, it helps to give our model the bounds of its output via `y_range`. When a range is supplied, fastai applies a sigmoid scaled to that range on the final layer, rather than leaving the final dense layer with no activation. Across many regression problems, we find that this bounded output tends to outperform a raw dense layer, provided we know the maximum and minimum values the target can take.

We can do this and find the maximum / minimum output values based on our training data like so:

In [None]:
y = dls.train.y
y.min(), y.max()
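
As a rough illustration of why the bounds matter, the sketch below squashes an unbounded raw output into a `(low, high)` interval with a scaled sigmoid. `scaled_sigmoid` is a plain-Python helper written for this note (not a fastai function), and the bounds `(0, 11)` are simply the ones passed to the learner in this notebook.

```python
import math

def scaled_sigmoid(x, low, high):
    """Squash a raw model output x into the open interval (low, high)."""
    return low + (high - low) / (1.0 + math.exp(-x))

# A raw output of 0 lands in the middle of the range.
print(scaled_sigmoid(0.0, 0, 11))
# Large raw outputs approach, but never reach, the upper bound.
print(scaled_sigmoid(5.0, 0, 11))
```

Because the sigmoid only approaches its limits asymptotically, it is common to set the range slightly wider than the observed minimum and maximum of the target.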

Now that we've got the basics working, with our dataloaders and preprocessors in place, we can get on to model training:

In [None]:
tab_learn = tabular_learner(dls, y_range=(0, 11), layers=[500, 250], n_out=1, metrics=rmse)
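
The `rmse` metric reported during training is the root mean squared error. For reference, a minimal by-hand version looks like the following; `rmse_by_hand` is a hypothetical name chosen so it doesn't shadow fastai's own metric, and the numbers are made up.

```python
import math

def rmse_by_hand(preds, targets):
    """Root mean squared error between two equal-length sequences."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(preds, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Errors of 0, 1 and -2 give a mean squared error of 5/3.
print(rmse_by_hand([7.0, 8.0, 9.0], [7.0, 7.0, 11.0]))
```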

We can view the architecture of our model like so:

In [None]:
tab_learn.model
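
The printed model should include one embedding layer per categorical column. When no sizes are passed explicitly, fastai picks each embedding width from the number of categories using a simple heuristic. As far as I'm aware it is the formula below, re-implemented here in plain Python under the hypothetical name `embedding_width` so it can be run without fastai.

```python
def embedding_width(n_categories):
    """Heuristic embedding width: grows slowly with cardinality, capped at 600."""
    return min(600, round(1.6 * n_categories ** 0.56))

# Print the suggested width for a few example cardinalities.
for n in (2, 10, 100, 10_000):
    print(n, embedding_width(n))
```

The sub-linear growth means low-cardinality columns get only a handful of dimensions, which keeps the parameter count modest.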

Before starting training, it's helpful to find an appropriate learning rate for our model. In FastAI this is as simple as calling the lr_find() method, like so:

In [None]:
tab_learn.lr_find()

From the plot, a learning rate of around 1e-3 should work well in this case.
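
For context, a learning rate finder of this kind trains for a short run while increasing the learning rate exponentially, recording the loss at each step. Below is a minimal sketch of just the sweep itself, assuming a range from 1e-7 to 10 over 100 steps (which I believe matches fastai's defaults); `lr_sweep` is a made-up helper name.

```python
def lr_sweep(start_lr=1e-7, end_lr=10.0, num_it=100):
    """Exponentially spaced learning rates from start_lr to end_lr."""
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_it - 1)) for i in range(num_it)]

lrs = lr_sweep()
# First, middle and last learning rates tried during the sweep.
print(lrs[0], lrs[len(lrs) // 2], lrs[-1])
```

A good choice is usually somewhere on the steep downward part of the resulting loss curve, before the loss starts to blow up.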

Let's train our model for 5 epochs, and see how well it performs.

In [None]:
# use the learning rate suggested by the lr_find() plot above
tab_learn.fit_one_cycle(5, lr_max=1e-3)

In [None]:
tab_learn.recorder.plot_loss()
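
For intuition about what `fit_one_cycle` does with `lr_max`, the sketch below reproduces the general shape of a one-cycle schedule: a cosine warm-up to the peak, followed by a cosine decay to a very small value. The warm-up fraction and the two divisors are what I understand fastai's defaults to be (`pct_start=0.25`, `div=25`, `div_final=1e5`), and `one_cycle_lr` is a made-up helper, not a fastai function.

```python
import math

def cosine_interp(start, end, pos):
    """Cosine interpolation from start (pos=0) to end (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def one_cycle_lr(pos, lr_max, pct_start=0.25, div=25.0, div_final=1e5):
    """Learning rate at training progress pos in [0, 1]."""
    if pos < pct_start:
        # Warm-up phase: rise from lr_max / div to lr_max.
        return cosine_interp(lr_max / div, lr_max, pos / pct_start)
    # Annealing phase: decay from lr_max to lr_max / div_final.
    return cosine_interp(lr_max, lr_max / div_final,
                         (pos - pct_start) / (1 - pct_start))

for pos in (0.0, 0.25, 0.5, 1.0):
    print(pos, one_cycle_lr(pos, lr_max=1e-3))
```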

## 4. Test set predictions

Preprocess our test set and make predictions using our trained model:

In [None]:
test_dl = tab_learn.dls.test_dl(test_df)
test_dl.show_batch()

In [None]:
# the test set has no labels, so the second return value is not needed
preds, _ = tab_learn.get_preds(dl=test_dl)

In [None]:
preds

Great! Let's submit these to the competition and see how well the predictions perform:

In [None]:
# predictions come back with shape (n_rows, 1); flatten to 1-D for the submission column
final_preds = preds.numpy().flatten()

In [None]:
submission_df = pd.read_csv(os.path.join(data_dir, "sample_submission.csv"))
submission_df['target'] = final_preds
submission_df.to_csv('submission.csv', index=False)

Overall, it's remarkable how easy this process is, especially when compared to implementing all of the low-level details yourself. I think going through the process of writing these low-level implementations is extremely important for learning, and is essential when you need to do something a bit more specific for a data science problem.

However, once you've been through this process and can appreciate what is going on under the hood, FastAI becomes hugely convenient and an asset for quickly experimenting on different data-based problems. Tabular data, as covered briefly in this notebook, is just one small part of what it offers.

To build on this work, we could extract the learned embeddings from our model above for each categorical feature and feed these into a gradient boosting model, such as CatBoost, which may give even better performance on this competition.
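
As a rough sketch of that idea, assume the learned weights for one categorical column have already been pulled out of the model as a NumPy matrix with one row per category code. Each code is then replaced by its embedding row, giving dense features that a gradient boosting model can consume. The matrix here is random and all names are hypothetical; in practice the weights would come from the trained model's embedding layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a learned embedding matrix: 4 category codes, width 3.
# In practice this would be copied from the trained model, not sampled.
embedding_matrix = rng.normal(size=(4, 3))

# Integer category codes for five rows of one categorical column.
codes = np.array([1, 3, 1, 0, 2])

# Look up the embedding row for each code: result has shape (5, 3).
embedded = embedding_matrix[codes]

print(embedded.shape)
```

Repeating this for every categorical column and concatenating the results with the continuous features would give the input table for the boosting model.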

I hope you enjoyed this short piece of work anyway - thanks for reading!