# Training of a DNN Tabular model with categorical embeddings using FastAI

FastAI provides a large collection of convenience functions on top of PyTorch for Deep Learning tasks.

Within this notebook, I'll quickly demonstrate a simple process that can be used to perform binary classification with a Deep Learning Tabular model that uses categorical embeddings and standardised numerical features as inputs.

In [None]:
import fastai

import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns

from fastai.tabular.all import *

---

## 1. Load our data

In [None]:
data_dir = "/kaggle/input/tabular-playground-series-mar-2021/"
train_df = pd.read_csv(os.path.join(data_dir, "train.csv"))
test_df = pd.read_csv(os.path.join(data_dir, "test.csv"))
train_df.head()

We've got a varied mix of categorical and numerical features.

Before preprocessing, let's check how the target classes are distributed:

In [None]:
train_df['target'].value_counts().plot.bar()
plt.show()

In [None]:
train_df['target'].value_counts()

Our problem is a slightly imbalanced classification problem.
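
To put a number on "slightly imbalanced", it can help to look at class proportions rather than raw counts. The snippet below is a minimal sketch on made-up labels (the `target` series is a hypothetical stand-in for `train_df['target']`, not the competition data). It also computes the accuracy we would get by always predicting the majority class, which is a useful baseline for the metrics later on.

```python
import pandas as pd

# Hypothetical toy labels standing in for train_df['target'].
target = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])

# Share of each class instead of raw counts.
class_share = target.value_counts(normalize=True).sort_index()

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = class_share.max()

print(class_share)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

Any model worth keeping should beat this baseline accuracy on held-out data.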

---

## 2. Data preprocessing and creation of dataloader

Let's preprocess our data into a suitable form for training. We'll encode categorical variables, standardise numerical features, and fill missing values (if there are any) within the dataset. We can do this extremely easily using the TabularPandas class, like so:

In [None]:
processing_funcs = [Categorify, FillMissing, Normalize]
cat_cols = [x for x in train_df.columns.values if x.startswith('cat')]
num_cols = [x for x in train_df.columns.values if x.startswith('cont')]
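
For intuition, the three processing functions listed above correspond roughly to the pandas operations below. This is only a sketch on a toy frame with hypothetical values, and it assumes median filling and a population standard deviation. FastAI's own implementations handle more cases (for example, they fit their statistics on the training split only).

```python
import pandas as pd

# Toy frame with made-up values; not the competition data.
df = pd.DataFrame({
    "cat0": ["A", "B", "A", None],
    "cont0": [1.0, 2.0, None, 5.0],
})

# Categorify (roughly): map each category to an integer code.
# Missing values get pandas code -1, so adding 1 reserves 0 for "missing".
codes = df["cat0"].astype("category").cat.codes + 1

# FillMissing (roughly): remember which rows were missing, then fill with the median.
is_missing = df["cont0"].isna()
filled = df["cont0"].fillna(df["cont0"].median())

# Normalize (roughly): subtract the mean and divide by the standard deviation.
standardised = (filled - filled.mean()) / filled.std(ddof=0)

print(codes.tolist(), is_missing.tolist(), standardised.round(3).tolist())
```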

In [None]:
nn_df = TabularPandas(train_df, cat_names=cat_cols, cont_names=num_cols, procs=processing_funcs, y_names='target', y_block = CategoryBlock())

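A key thing with tabular classification problems is to pass in y_block = CategoryBlock() above. Our target is stored as the integers 0 and 1, so without this hint the target could be treated as a continuous value and the model would perform regression rather than classification.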

In [None]:
train_dl = nn_df.dataloaders(1024)

In [None]:
# preview some of our data from the dataloader
train_dl.show_batch()

It's so easy you almost feel like you've cheated somehow!

I must admit, this is something that put me off using FastAI initially. However, after the pain and effort of doing all of this manually many times with Keras, Tensorflow and PyTorch implementations, the ease of this method is highly appreciated for quick experimentation and research.

We could also have done the same in a single step, straight from TabularDataLoaders. This is the dls object we pass to the learner in the next section:

In [None]:
dls = TabularDataLoaders.from_df(train_df, path='.', y_names="target",  
                                 cat_names = cat_cols, 
                                 cont_names = num_cols, 
                                 procs=processing_funcs, 
                                 y_block = CategoryBlock())

---

## 3. Production of our DNN model

We're performing basic binary classification for this challenge. Because we passed y_block = CategoryBlock() when building the dataloaders, tabular_learner infers the two target classes and builds a two-unit output layer trained with cross-entropy loss. At prediction time these outputs are converted into per-class probabilities, from which we can classify our targets as either 0 or 1.

In [None]:
tab_learn = tabular_learner(dls, layers=[500, 250], metrics=[accuracy, error_rate, Recall(), Precision()])

We can get a quick preview of our model before training:

In [None]:
tab_learn.model
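
The embedding layers listed in the model preview get their widths from a heuristic based on the number of distinct categories in each column. The sketch below reproduces that rule of thumb as I understand it from the FastAI source (`emb_sz_rule`); treat the exact constants as an assumption and check your installed version if they matter. The category counts used here are arbitrary examples.

```python
def emb_sz_rule(n_cat: int) -> int:
    """Rule-of-thumb embedding width for a column with n_cat categories."""
    # Grows sub-linearly with the number of categories and is capped at 600.
    return min(600, round(1.6 * n_cat ** 0.56))


# Arbitrary example cardinalities, not taken from the competition data.
sizes = {n: emb_sz_rule(n) for n in (2, 10, 100, 100_000)}
print(sizes)
```

Low-cardinality columns get only a handful of embedding dimensions, while very high-cardinality columns hit the cap.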

It's also helpful to find an appropriate learning rate for our model prior to training:

In [None]:
tab_learn.lr_find()

In [None]:
tab_learn.fit_one_cycle(2, lr_max=5e-3)
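
The fit_one_cycle call above does not hold the learning rate constant. It warms up to lr_max and then anneals back down. Below is a rough, framework-free sketch of that schedule. The `div`, `div_final` and `pct_start` values are what I believe the FastAI defaults to be, so treat them as assumptions, and the helper names are my own.

```python
import math


def cos_anneal(start: float, end: float, pos: float) -> float:
    """Cosine interpolation from start (pos=0) to end (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2


def one_cycle_lr(pct: float, lr_max: float = 5e-3, div: float = 25.0,
                 div_final: float = 1e5, pct_start: float = 0.25) -> float:
    """Learning rate at a fraction pct (0 to 1) of the way through training."""
    if pct < pct_start:
        # Warm-up phase: rise from lr_max / div to lr_max.
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # Annealing phase: fall from lr_max to lr_max / div_final.
    return cos_anneal(lr_max, lr_max / div_final,
                      (pct - pct_start) / (1 - pct_start))


schedule = [one_cycle_lr(p / 10) for p in range(11)]
print([f"{lr:.2e}" for lr in schedule])
```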

In [None]:
tab_learn.recorder.plot_loss()

In [None]:
interpret = ClassificationInterpretation.from_learner(tab_learn)
interpret.plot_confusion_matrix()

In [None]:
interpret.print_classification_report()
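
As a reminder of how the numbers in the classification report relate to the confusion matrix, here is a small worked example. The four counts are invented for illustration and are not results from this model.

```python
# Hypothetical confusion-matrix counts (not real results from this notebook).
tn, fp = 700, 50    # actual class 0: predicted 0, predicted 1
fn, tp = 100, 150   # actual class 1: predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # of the predicted positives, how many are right
recall = tp / (tp + fn)        # of the actual positives, how many we found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

With imbalanced classes, accuracy alone can look healthy while recall on the minority class stays low, which is why precision and recall were added to the learner's metrics.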

---

## 4. Test set predictions

Preprocess our test set and make predictions using our trained model:

In [None]:
test_dl = tab_learn.dls.test_dl(test_df)
test_dl.show_batch()

In [None]:
# the test set has no labels, so the second returned value is not useful here
preds, _ = tab_learn.get_preds(dl=test_dl)

In [None]:
preds 

We need to take the argmax of these resultant predictions in order to obtain the final hard class output labels. We'll do this, and then make a submission to the competition:

In [None]:
final_preds = preds.numpy()
final_preds = np.argmax(final_preds, axis=1)
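
For two classes, taking the argmax as above is the same as thresholding the positive-class probability at 0.5. The sketch below shows this on a made-up probability array (the values are hypothetical), along with how a different threshold would change the labels.

```python
import numpy as np

# Hypothetical per-class probabilities: column 0 is class 0, column 1 is class 1.
probs = np.array([
    [0.90, 0.10],
    [0.40, 0.60],
    [0.55, 0.45],
])

# Hard labels via argmax, as in the cell above.
labels_argmax = np.argmax(probs, axis=1)

# Equivalent for two classes: threshold the positive-class probability at 0.5.
labels_threshold = (probs[:, 1] > 0.5).astype(int)

# A lower threshold labels more rows as positive.
labels_low_threshold = (probs[:, 1] > 0.3).astype(int)

print(labels_argmax, labels_threshold, labels_low_threshold)
```

Keeping the probabilities around makes it easy to experiment with the threshold later.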

In [None]:
submission_df = pd.read_csv(os.path.join(data_dir, "sample_submission.csv"))
submission_df['target'] = final_preds
submission_df.to_csv('submission.csv', index=False)

Overall, it's amazing how easy this process is, especially when compared to implementing all of the low-level details yourself.