# DNN with categorical embeddings

This notebook demonstrates a method of producing categorical embeddings for a neural network model when working with tabular data.

The categorical and numerical features of the data are preprocessed and formed into separate input layers for the model, which are then concatenated into a composite model that performs regression on the target variable.

This is implemented using Keras and its functional API, but a similar process could equally be followed in PyTorch.

This process of creating categorical embeddings can be very useful in practice: as a form of pre-training, the learned embeddings can also serve as additional inputs to simpler models, such as gradient boosting machines.

Hopefully you can find this notebook useful!

**Table of Contents:**

1. [Load Data and Analyse Overall Dataset Features](#load)
2. [Data Preparation and Preprocessing](#data-preprocessing) 
3. [Deep ANN Models](#ann-models)
4. [Improving our DNN model using Monte Carlo Dropout](#ann-model-2)
5. [Test Set Predictions](#test-predictions)
6. [Examining our categorical embeddings](#categorical-embeddings)

In [None]:
import keras
import keras.backend as K
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
import tensorflow as tf

from collections import defaultdict

from keras.layers import Dense, Embedding, Flatten, LSTM, GRU, \
        SpatialDropout1D, Bidirectional, Conv1D, MaxPooling1D, BatchNormalization
from keras.models import Sequential, load_model
from keras import models
from keras import layers

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, BaggingRegressor
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score, cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.pipeline import Pipeline

from tqdm import tqdm

---

<a id="load"></a>
## 1. Load Data

In [None]:
data_dir = "/kaggle/input/tabular-playground-series-feb-2021/"
train_df = pd.read_csv(os.path.join(data_dir, "train.csv"))
test_df = pd.read_csv(os.path.join(data_dir, "test.csv"))
train_df.head()

---

<a id="data-preprocessing"></a>
## 2. Data Preprocessing: Creation of a data loader and preprocessor

It's important that we handle our numerical and categorical features appropriately prior to producing our models.

We'll put together some preprocessing functions to encode our categorical features and standardise our numerical features. Whilst doing this, we'll also add two optional steps: combining the minority categories within each categorical feature (since some are very imbalanced), and appending dimensionality-reduced features (using PCA) to our dataset.

These extra features will allow us to experiment and tune to find the best combinations of feature engineering to perform for this problem.
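To make the minority-category idea concrete, here is a minimal sketch on a toy column. The 5% threshold and the composite label `'z'` mirror the defaults used in the class below, while the helper name `combine_rare` and the toy data are purely illustrative assumptions.

```python
import pandas as pd

def combine_rare(series, threshold=0.05, composite="z"):
    """Replace categories rarer than `threshold` with one composite label.
    (Hypothetical helper, for illustration only.)"""
    freqs = series.value_counts(normalize=True)
    rare = freqs[freqs < threshold].index
    # keep values that are not rare, swap the rest for the composite label
    return series.where(~series.isin(rare), composite)

# toy column: 'A' and 'B' dominate, 'C' and 'D' each make up 2% of rows
toy = pd.Series(["A"] * 60 + ["B"] * 36 + ["C"] * 2 + ["D"] * 2)
combined = combine_rare(toy)
print(combined.value_counts().to_dict())
```

Merging rare levels like this reduces the number of embedding rows that would otherwise be trained on only a handful of examples.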

In [None]:
class DataProcessor(object):
    def __init__(self):
        self.enc_dict = None
        self.standard_scaler = None
        self.num_cols = None
        self.cat_cols = None
        
    def preprocess(self, data_df, train=True, 
                   combine_min_cats=False, add_pca_feats=False):
        """ Preprocess train / test as required """
        
        # if training, fit our transformers
        if train:
            self.train_ids = data_df.loc[:, 'id']
            train_cats = data_df.loc[:, data_df.dtypes == object]
            self.cat_cols = train_cats.columns
            
            # if selected, combine minority categorical feats
            if combine_min_cats:
                self._find_minority_cats(train_cats)
                train_cats = self._combine_minority_feats(train_cats)
            
            # encode all of our categorical variables
            self.enc_dict = defaultdict(LabelEncoder)
            train_cats_enc = train_cats.apply(lambda x: self.enc_dict[x.name].fit_transform(x))
            
            # standardise all numerical columns
            train_num = data_df.loc[:, data_df.dtypes != object].drop(columns=['target', 'id'])
            self.num_cols = train_num.columns
            self.standard_scaler = StandardScaler()
            train_num_std = self.standard_scaler.fit_transform(train_num)
            
            # add pca reduced num feats if selected, else just combine num + cat feats
            if add_pca_feats:
                pca_feats = self._return_num_pca(train_num_std)
                self.final_num_feats = list(self.num_cols)+list(self.pca_cols)
                
                X = pd.DataFrame(np.hstack((train_cats_enc, train_num_std, pca_feats)), 
                        columns=list(self.cat_cols)+self.final_num_feats)
            else:
                # set final list of all num cols and form final combined df
                self.final_num_feats = list(self.num_cols)
                X = pd.DataFrame(np.hstack((train_cats_enc, train_num_std)), 
                        columns=list(self.cat_cols)+list(self.num_cols))
        
        # otherwise, treat as test data
        else:
            # transform categorical and numerical data
            self.test_ids = data_df.loc[:, 'id']
            cat_data = data_df.loc[:, self.cat_cols]
            if combine_min_cats:
                cat_data = self._combine_minority_feats(cat_data)
            cats_enc = cat_data.apply(lambda x: self.enc_dict[x.name].transform(x))
            num_data = data_df.loc[:, self.num_cols]
            num_std = self.standard_scaler.transform(num_data)
            
            if add_pca_feats:
                pca_feats = self._return_num_pca(num_std, train=False)
                
                X = pd.DataFrame(np.hstack((cats_enc, num_std, pca_feats)), 
                        columns=list(self.cat_cols) + self.final_num_feats)
            
            else:
                X = pd.DataFrame(np.hstack((cats_enc, num_std)), 
                        columns=list(self.cat_cols)+list(self.num_cols)) 
        return X
    
    
    def _find_minority_cats(self, data_df, composite_category='z', threshold=0.05):
        """ Find minority categories for each feature column, and create a 
            dictionary that maps those to selected composite category """
        self.min_col_dict = {}
        self.min_cat_mappings = {}
    
        # find all feature categories with less than 5% proportion
        for feature in self.cat_cols:
            self.min_col_dict[feature] = []
            self.min_cat_mappings[feature] = {}
        
            # Series.items() replaces iteritems(), which was removed in pandas 2.0
            for category, proportion in data_df[feature].value_counts(normalize=True).items():
                if proportion < threshold:
                    self.min_col_dict[feature].append(category)
                
                    # map those minority cats to chosen composite feature
                    self.min_cat_mappings[feature] = {x : composite_category for x 
                                                    in self.min_col_dict[feature]}
    
    
    def _combine_minority_feats(self, data_df, replace=False):
        """ Combine minority categories into composite for each cat feature """
        new_df = data_df.copy()
        for feat in self.cat_cols:
            col_label = f"{feat}" if replace else f"{feat}_new"
            new_df[feat] = new_df[feat].replace(self.min_cat_mappings[feat])
        return new_df
    
    
    def _return_num_pca(self, num_df, n_components=0.85, train=True):
        """ return dim reduced numerical features using PCA """
        if train:
            self.pca = PCA(n_components=n_components)
            num_rd = self.pca.fit_transform(num_df)
            
            # create new col names for our reduced features
            self.pca_cols = [f"pca_{x}" for x in range(num_rd.shape[1])]
            
        else:
            num_rd = self.pca.transform(num_df)
        
        return pd.DataFrame(num_rd, columns=self.pca_cols)

Although we've added support for combining minority categories, we'll leave that option switched off for this example. We encode our categorical features, standardise our numerical features, and append the PCA-reduced features (`add_pca_feats=True`) for both the training and test data.

In [None]:
data_proc = DataProcessor()
X = data_proc.preprocess(train_df, add_pca_feats=True)
y = train_df.loc[:, 'target']
X_test = data_proc.preprocess(test_df, train=False, add_pca_feats=True)

print(f"X: {X.shape} \ny: {y.shape} \nX_test: {X_test.shape}")

Our preprocessor provides us with a list of the final numerical columns (the original standardised features, plus the PCA-reduced features if selected):

In [None]:
data_proc.final_num_feats
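The `pca_*` entries in this list come from `PCA(n_components=0.85)`. When `n_components` is a float between 0 and 1, scikit-learn keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below illustrates this on synthetic data; the toy arrays are assumptions and are not the competition data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# toy data: 500 rows, 6 features driven by 2 latent factors plus a little noise
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X_toy = latent @ mixing + 0.1 * rng.normal(size=(500, 6))

# a float n_components asks for a fraction of variance, not a fixed count
pca = PCA(n_components=0.85)
X_reduced = pca.fit_transform(X_toy)

# cumulative variance explained by the retained components
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_, cum_var)
```

The number of `pca_*` columns is therefore determined by the data rather than fixed in advance.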

Note that, as shown above, we fit our label encoders, standard scaler and PCA on the training set only, and then use them to transform (not re-fit) the test data.
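As a minimal sketch of this fit-on-train, transform-on-test pattern, consider the toy arrays below (assumed values, not the competition data). It also shows one consequence worth knowing: a `LabelEncoder` raises a `ValueError` for a category it did not see when it was fitted.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

train_num = np.array([[1.0], [2.0], [3.0]])
test_num = np.array([[2.0], [4.0]])

# the mean and standard deviation are estimated from the training data only
scaler = StandardScaler().fit(train_num)
test_scaled = scaler.transform(test_num)   # reuses the training statistics

# the category-to-integer mapping is likewise learned from training data only
encoder = LabelEncoder().fit(["A", "B", "C"])
test_codes = encoder.transform(["C", "A"])

# a category never seen during fitting cannot be encoded
try:
    encoder.transform(["D"])
    unseen_ok = True
except ValueError:
    unseen_ok = False

print(test_scaled.ravel(), test_codes, unseen_ok)
```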

We next need to break this down into training and validation splits. Our training split allows us to train each of our models by optimising an objective function, which is specific to the model used. Our validation split allows us to estimate the performance of our trained models and lets us make refinements to improve them.

Throughout this entire process, we should not touch our test set until the very end, at which point we make predictions using our final model and submit these to the competition.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=13)
print(f"X_train: {X_train.shape} \ny_train: {y_train.shape} \nX_val: {X_val.shape}, \ny_val: {y_val.shape}")

These splits of data are still very large, and so in practice we might reduce these significantly into much smaller sub-sets. These can then be used to quickly train and evaluate a range of models, which can be iteratively improved and then tested on the full splits we produced above.

For this example we'll just use the large sub-sets obtained above.
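Should a smaller subset be wanted for quicker experiments, one possible approach is to sample rows from the features and select the matching targets by index. The frames below are hypothetical stand-ins for `X_train` and `y_train`, and the 10% fraction is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# hypothetical stand-ins for the X_train / y_train objects built above
rng = np.random.default_rng(13)
X_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f0", "f1", "f2"])
y_demo = pd.Series(rng.normal(size=1000), name="target")

# draw 10% of the rows, then pick the targets with the same index labels
X_small = X_demo.sample(frac=0.1, random_state=13)
y_small = y_demo.loc[X_small.index]

print(X_small.shape, y_small.shape)
```

Selecting the targets with `.loc` keeps features and targets aligned even though the sampled rows are shuffled.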

---

<a id="ann-models"></a>
## 3. Production of a DNN model with categorical embeddings and numerical inputs

There are lots of ways of creating embedding and numerical input layers for a neural network model. For this, we'll keep it relatively simple and create arrays to store each of our categorical and numerical feature inputs respectively.

With each of these arrays containing the categorical and numerical input layers, we can concatenate these together and produce a composite model that exploits both categorical embeddings and numerical features. We'll use the Keras functional API for this.
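Before building this in Keras, it may help to see what an embedding lookup and the final concatenation amount to numerically. The NumPy sketch below uses randomly initialised lookup tables and made-up column values; in the real model the tables are trainable weights, and Keras adds a length-1 axis to each embedding output that is later removed by the `Flatten` layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, emb_dim = 4, 3

# two label-encoded categorical columns with 5 and 2 categories respectively
cat_a = np.array([0, 4, 2, 2])
cat_b = np.array([1, 0, 0, 1])
# two standardised numerical columns
nums = rng.normal(size=(n_rows, 2))

# one lookup table per categorical feature (random here, learned in practice)
table_a = rng.normal(size=(5, emb_dim))
table_b = rng.normal(size=(2, emb_dim))

# an embedding is an index lookup: category code i selects row i of the table
emb_a = table_a[cat_a]   # shape (4, 3)
emb_b = table_b[cat_b]   # shape (4, 3)

# concatenate the embeddings with the numerical features into one dense input
composite = np.concatenate([emb_a, emb_b, nums], axis=1)
print(composite.shape)
```

Rows that share a category code receive identical embedding vectors, which is what lets the network learn one representation per category.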

In [None]:
# arrays to store individual input layers for each cat & num feature
categorical_inputs = []
numerical_inputs = []

# arrays to store all our categorical embeddings & layer names
cat_embeddings = []
emb_layer_names = []

# embedding dimension - this is a hyper-parameter that can be tuned
emb_n = 3

In [None]:
# create embeddings for each of our categorical features
for cat_col in data_proc.cat_cols:
    _in = layers.Input(shape=[1], name=cat_col)
    _emb = layers.Embedding(int(X_train[cat_col].max()) + 1, 
                            emb_n, name=cat_col + '_emb')(_in)
    categorical_inputs.append(_in)
    cat_embeddings.append(_emb)
    emb_layer_names.append(cat_col + '_emb')
    
# input layers for the numeric features
for num_col in data_proc.final_num_feats: 
    numeric_input = layers.Input(shape=(1,), name=num_col)
    numerical_inputs.append(numeric_input)
    
# merge all our numeric inputs into one layer
combined_num_inputs = layers.concatenate(numerical_inputs)

# Merge embedding layers, apply dropout for regularisation, and flatten
merged_inputs = layers.concatenate(cat_embeddings)
spatial_dropout = layers.SpatialDropout1D(0.2)(merged_inputs)
flat_embed = layers.Flatten()(spatial_dropout)

# concatenate all of our categorical and numerical features
composite_feats = layers.concatenate([flat_embed, combined_num_inputs])

# custom DNN for regression
x = layers.Dropout(0.3)(layers.Dense(200, activation='elu', 
                                      kernel_initializer='he_normal')(composite_feats))
x = layers.Dropout(0.3)(layers.Dense(100, activation='elu', 
                                      kernel_initializer='he_normal')(x))
x = layers.Dropout(0.3)(layers.Dense(50, activation='elu', 
                                      kernel_initializer='he_normal')(x))

# define model inputs and outputs
output = layers.Dense(1)(x)
model = models.Model(inputs=categorical_inputs + numerical_inputs, outputs=output)
model.compile(loss='mse', optimizer='adam', metrics=['mse'])

With our model compiled, we can view it, along with all of the input layers and embeddings we created, like so:

In [None]:
model.summary()

We'll create an early stopping callback so that training halts automatically once the validation loss stops improving, which helps to limit over-fitting:

In [None]:
# create an early stopper callback
early_stopper = keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)

# list of callbacks to use
trg_callbacks = [early_stopper]
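The patience mechanism can be sketched in plain Python. This is a simplified model of the idea, assuming a loss that should decrease and no minimum improvement threshold; it is not the Keras implementation, and the function name and loss values are hypothetical.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (epoch at which training stops, epoch with the best loss),
    both 0-indexed. Simplified illustration of patience-based stopping."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # improvement: remember it and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # ran out of epochs before patience was exhausted
    return len(val_losses) - 1, best_epoch

losses = [0.90, 0.80, 0.75, 0.76, 0.77, 0.78, 0.74]
stopped, best = simulate_early_stopping(losses, patience=3)
print(stopped, best)
```

With `restore_best_weights=True`, the model is rolled back to the weights from the best epoch when stopping is triggered. Note that with a patience of 20 and a run of only 30 epochs, stopping can only trigger if the best validation loss occurs within roughly the first ten epochs.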

Since we're using the Keras functional API with a separate input layer per feature, as shown in the model summary above, we need to pass our data into the model in a more specific way than usual. We can do this using a dictionary, which contains the feature names as the keys and the corresponding feature arrays as the values, like so:

In [None]:
def get_keras_dataset(df):
    """ Return a dictionary of feature names and associated arrays for
        input into our functional model """
    X = {str(col) : np.array(df[col]) for col in df.columns}
    return X

Now we're ready to train our model:

In [None]:
history = model.fit(x=get_keras_dataset(X_train), y=y_train, 
                    epochs=30, 
                    batch_size=512, 
                    validation_data=(get_keras_dataset(X_val), y_val), 
                    callbacks=trg_callbacks)

In [None]:
def plot_history_results(history, metric='mse', figsize=(12,5)):
    """ Helper function for plotting history from keras model """
    
    # gather desired features
    trg_loss = history.history['loss']
    val_loss = history.history['val_loss']
    epochs = range(1, len(trg_loss) + 1)

    # plot training and validation losses
    fig = plt.figure(figsize=figsize)
    ax = fig.add_subplot(1, 1, 1)
    plt.plot(epochs, trg_loss, marker='o', label='Training Loss')
    plt.plot(epochs, val_loss, marker='x', label='Validation Loss')
    plt.title("Training / Validation Loss")
    ax.set_ylabel("Loss")
    ax.set_xlabel("Epochs")
    plt.legend(loc='best')
    plt.tight_layout()
    plt.ylim(0.0, 2.0)
    plt.show()

In [None]:
plot_history_results(history)

In [None]:
val_preds = model.predict(get_keras_dataset(X_val))
mse = mean_squared_error(y_val, val_preds)
print(f'Validation MSE: {mse:.4f}')

This run gave a validation MSE of 0.7432 with the PCA-reduced features included.

---

<a id="ann-model-2"></a>
## 4. Improving our DNN model with Monte Carlo Dropout

Let's add Monte Carlo dropout to this model, and use it as a means of making ensembled predictions:

In [None]:
class MonteCarloDropout(layers.Dropout):
    """ Class that overrides default call function used by standard 
        dropout to keep dropout active during inference """
    def call(self, inputs):
        return super().call(inputs, training=True)
    
def pred_mc_dropout(model, test_inputs, n_samples=50):
    """ Make a large number of predictions (equal to n_samples) using the 
        passed model and input features """
    # each call uses a different random dropout mask, so predictions vary
    preds = [model.predict(test_inputs) for _ in range(n_samples)]
    return np.mean(preds, axis=0)

We build the model in exactly the same way as before, except that each `Dropout` layer in the dense part of the network is replaced with `MonteCarloDropout`, like so:

In [None]:
def tabular_dnn(X_df, cat_cols, num_cols, embedding_dim=3, dropout=0.3):
    """ Tabular DNN model that supports both categorical and numerical variables. 
        Embedding layers are used for the categorical features."""
    # arrays to store individual input layers for each cat & num feature
    categorical_inputs = []
    numerical_inputs = []

    # arrays to store all our categorical embeddings & layer names
    cat_embeddings = []
    emb_layer_names = []

    # embedding dimension - this is a hyper-parameter that can be tuned
    emb_n = embedding_dim
    
    # create embeddings for each of our categorical features
    for cat_col in cat_cols:
        _in = layers.Input(shape=[1], name=cat_col)
        _emb = layers.Embedding(int(X_df[cat_col].max()) + 1, 
                            emb_n, name=f"{cat_col}_emb")(_in)
        categorical_inputs.append(_in)
        cat_embeddings.append(_emb)
        emb_layer_names.append(f"{cat_col}_emb")
    
    # input layers for the numeric features
    for num_col in num_cols: 
        numeric_input = layers.Input(shape=(1,), name=num_col)
        numerical_inputs.append(numeric_input)
    
    # merge all our numeric inputs into one layer
    combined_num_inputs = layers.concatenate(numerical_inputs)

    # Merge embedding layers, apply dropout for regularisation, and flatten
    merged_inputs = layers.concatenate(cat_embeddings)
    spatial_dropout = layers.SpatialDropout1D(0.2)(merged_inputs)
    flat_embed = layers.Flatten()(spatial_dropout)

    # concatenate all of our categorical and numerical features
    composite_feats = layers.concatenate([flat_embed, combined_num_inputs])

    # custom DNN for regression
    x = MonteCarloDropout(dropout)(layers.Dense(200, activation='elu', 
                                      kernel_initializer='he_normal')(composite_feats))
    x = MonteCarloDropout(dropout)(layers.Dense(100, activation='elu', 
                                      kernel_initializer='he_normal')(x))
    x = MonteCarloDropout(dropout)(layers.Dense(50, activation='elu', 
                                      kernel_initializer='he_normal')(x))

    # define model inputs and outputs
    output = layers.Dense(1)(x)
    model = models.Model(inputs=categorical_inputs + numerical_inputs, outputs=output)
    model.compile(loss='mse', optimizer='adam', metrics=['mse'])
    
    return model

In [None]:
model = tabular_dnn(X_train, data_proc.cat_cols, data_proc.final_num_feats)

In [None]:
# create an early stopper callback
early_stopper = keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)

# list of callbacks to use
trg_callbacks = [early_stopper]

In [None]:
history = model.fit(x=get_keras_dataset(X_train), y=y_train, 
                    epochs=40, 
                    batch_size=512, 
                    validation_data=(get_keras_dataset(X_val), y_val), 
                    callbacks=trg_callbacks)

In [None]:
plot_history_results(history)

When we make predictions with this model, we need to call our Monte Carlo prediction function defined above, rather than the native model prediction. This is because we want to call `model.predict` many times, each with a different random dropout mask, and then average the resulting ensemble of predictions to obtain our final predictions:

In [None]:
val_preds = pred_mc_dropout(model, get_keras_dataset(X_val), n_samples=25)
mse = mean_squared_error(y_val, val_preds)
print(f'Validation MSE: {mse:.4f}')

**Note:** Inference takes much longer than it normally would, since our Monte Carlo prediction function calls `model.predict` once per sample (set by the `n_samples` argument above). We therefore have to trade off inference time against the number of ensembled predictions we want to make.

Generally, we can obtain reasonable performance with 25-50 sampled predictions from a Monte Carlo dropout model, but this varies with the complexity of the underlying model. The more the individual predictions vary from one pass to the next (which tends to increase with the number of Monte Carlo dropout layers and the dropout rate), the more samples we need for the averaged prediction to be stable.
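The effect of the sample count can be illustrated without a neural network. The NumPy sketch below applies random dropout masks to the inputs of a single linear layer with made-up weights; it is a toy stand-in for the Keras model, assuming the usual inverted-dropout scaling, and all names in it are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=8)         # hypothetical weights of a linear output layer
features = rng.normal(size=(5, 8))   # 5 rows of hidden activations
rate = 0.3                           # dropout rate

def stochastic_predict(x):
    """One forward pass with a fresh dropout mask (inverted dropout scaling)."""
    mask = rng.random(x.shape) >= rate
    return (x * mask / (1.0 - rate)) @ weights

def mc_predict(x, n_samples):
    """Average n_samples stochastic passes and also return their spread."""
    draws = np.stack([stochastic_predict(x) for _ in range(n_samples)])
    return draws.mean(axis=0), draws.std(axis=0)

mean_25, std_25 = mc_predict(features, n_samples=25)
mean_400, std_400 = mc_predict(features, n_samples=400)

# prediction of the same layer with dropout switched off, for comparison
deterministic = features @ weights
print(np.abs(mean_25 - deterministic), np.abs(mean_400 - deterministic))
```

The noise in the averaged prediction shrinks roughly with the square root of the number of samples, so doubling from 25 to 50 samples cuts it by about 29% while doubling the inference cost. The per-row spread of the draws can also be read as a rough indication of how uncertain the model is about each prediction.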

---

<a id="test-predictions"></a>
## 5. Final Training and Test Set Predictions

Using the model defined above, let's train on the entire training set and make a set of final predictions.

In [None]:
model = tabular_dnn(X, data_proc.cat_cols, data_proc.final_num_feats)

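# create an early stopper callback; no validation data is passed to fit() below,
# so monitor the training loss rather than the default (unavailable) val_loss
early_stopper = keras.callbacks.EarlyStopping(monitor='loss', patience=20, 
                                              restore_best_weights=True)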

# list of callbacks to use
trg_callbacks = [early_stopper]

In [None]:
history = model.fit(x=get_keras_dataset(X), y=y, 
                    epochs=40, 
                    batch_size=512,  
                    callbacks=trg_callbacks)

In [None]:
test_preds = pred_mc_dropout(model, get_keras_dataset(X_test), n_samples=50)

In [None]:
# save submission in csv format
submission_df = pd.read_csv(os.path.join(data_dir, "sample_submission.csv"))
# predictions have shape (n_rows, 1), so flatten them into a 1D column
submission_df['target'] = test_preds.ravel()
submission_df.to_csv('submission.csv', index=False)

---

<a id="categorical-embeddings"></a>
## 6. Using our categorical embeddings from the trained model

We can list the embedding tensors created for each of our categorical features, and then inspect the trained weights of an individual embedding layer, quite easily:

In [None]:
for embedding in cat_embeddings:
    print(embedding)

In [None]:
# look the layer up by name rather than by its position in model.layers
model.get_layer('cat4_emb').embeddings

In [None]:
train_df['cat4'].value_counts()

As shown, we have an embedding vector of the selected dimension for each unique category within each feature. When a feature has a large number of categories this can be very useful, and we can do things like extract these embeddings for use in classical models and tree-based ensemble models for greater performance.
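One way to reuse the embeddings elsewhere is to replace each integer category code with its embedding vector and pass the result to another model. The sketch below uses a small hand-written matrix as a stand-in for the learned weights (with a trained Keras model these would come from the embedding layer); the column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical learned embedding matrix: 4 categories, embedding dimension 3
emb_matrix = np.array([[ 0.1, -0.2,  0.3],
                       [ 0.0,  0.5, -0.1],
                       [-0.4,  0.2,  0.2],
                       [ 0.3,  0.3, -0.3]])

# one label-encoded categorical column plus one numerical column
df = pd.DataFrame({"cat4": [0, 2, 2, 3, 1],
                   "cont0": [0.5, -1.2, 0.3, 0.0, 1.1]})

# look up each row's embedding vector and expand it into named columns
emb_cols = [f"cat4_emb_{i}" for i in range(emb_matrix.shape[1])]
emb_df = pd.DataFrame(emb_matrix[df["cat4"].to_numpy()],
                      columns=emb_cols, index=df.index)

# swap the integer code for its embedding, ready for a tree-based model
features = pd.concat([df.drop(columns="cat4"), emb_df], axis=1)
print(features)
```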

For this dataset, the cardinality (number of unique categories for each feature) is likely not large enough for this to be useful. In practice, one-hot encoding is easier and more convenient for this challenge.
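For comparison, one-hot encoding a low-cardinality column takes a single pandas call. The toy frame below is an assumption for illustration, not the competition data.

```python
import pandas as pd

toy = pd.DataFrame({"cat0": ["A", "B", "A", "C"],
                    "cont0": [0.1, 0.4, -0.3, 1.2]})

# each category becomes its own 0/1 indicator column
one_hot = pd.get_dummies(toy, columns=["cat0"], dtype=int)
print(one_hot)
```

With only a few categories per feature, the extra indicator columns stay manageable, which is why this simpler encoding is often sufficient here.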