# Representation Learning Background

Representation learning has been used in many Kaggle competitions.  [The Porto Seguro first place finish](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/44629) is a wonderful example of its power.  Beyond Porto Seguro, representation learning has played an outsized role in the monthly tabular competitions.  [January's first place finish](https://www.kaggle.com/springmanndaniel/1st-place-turn-your-data-into-daeta), [February's first place finish](https://www.kaggle.com/c/tabular-playground-series-feb-2021/discussion/222745), [March's first place finish](https://www.kaggle.com/c/tabular-playground-series-mar-2021/discussion/229833), and [April's first place finish](https://www.kaggle.com/c/tabular-playground-series-apr-2021/discussion/235739) all leveraged representation learning to win.

All of the above solutions used a model known as the denoising autoencoder (DAE) to get representations of the data.  The DAE is given corrupted input and is forced to reconstruct the original, uncorrupted input in its final layer.  The reconstructions themselves are thrown away; what matters are the hidden layers' activations, which are extracted to create a new dataset.  Subsequent feed forward neural networks are then trained on that new dataset.  If you want to know more, I strongly recommend you read [January's first place solution](https://www.kaggle.com/springmanndaniel/1st-place-turn-your-data-into-daeta).  This terrific notebook is filled with insight into the workings of the DAE.
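To make the idea concrete, here is a minimal NumPy sketch of a DAE with a single hidden layer.  Everything in it is an assumption for illustration: the toy data, the layer sizes, the plain gradient descent loop, and the Gaussian corruption (a generic stand-in, not the noise this notebook uses).  It only shows the pattern: corrupt the input, reconstruct the clean input, keep the hidden activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 200 rows, 6 features driven by 2 hidden factors.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)

n_hidden = 4
W1 = rng.normal(scale=0.1, size=(6, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 6))
b2 = np.zeros(6)

def encode(x):
    """Hidden-layer activations: these are the learned representations."""
    return np.maximum(x @ W1 + b1, 0.0)

def reconstruction_mse(x_in, x_target):
    """Mean squared error between the reconstruction of x_in and x_target."""
    return float(np.mean((encode(x_in) @ W2 + b2 - x_target) ** 2))

loss_before = reconstruction_mse(X, X)
lr = 0.05
for step in range(300):
    # Corrupt the input, but score the output against the CLEAN data.
    X_noisy = X + rng.normal(scale=0.3, size=X.shape)
    H = encode(X_noisy)
    err = (H @ W2 + b2 - X) / len(X)      # gradient of the squared error w.r.t. the output
    dH = (err @ W2.T) * (H > 0)           # backpropagate through the ReLU
    W2 -= lr * (H.T @ err)
    b2 -= lr * err.sum(axis=0)
    W1 -= lr * (X_noisy.T @ dH)
    b1 -= lr * dH.sum(axis=0)
loss_after = reconstruction_mse(X, X)

# The reconstructions are discarded; the hidden activations become the new dataset.
representations = encode(X)
print(loss_before, loss_after, representations.shape)
```

The `representations` array plays the role of the new dataset that downstream models are trained on.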

# What's New

Why do we need another DAE?  I have 2 things to offer.  

**1) This model is developed in TensorFlow.**   Not that TensorFlow is any better than PyTorch (not trying to start any fights here!), but some of us are more familiar with it and can more easily modify this code.  The implementation is model-centric, which in my opinion makes it easy to read.  Of course, I wrote it, so maybe I'm biased...

**2) This model has some additional noise.**  I recently read [this paper](https://arxiv.org/pdf/2106.01342.pdf) and found an interesting way to add noise.  The authors create noise in two stages.  First, they apply swap noise (also known as [cutmix](https://arxiv.org/abs/1905.04899)) to the raw data.  Then, after an embedding layer, they use [mixup](https://arxiv.org/pdf/1710.09412.pdf) to blend samples together.  My DAE recreates their approach to noise creation.  
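The two noise stages can be sketched in plain NumPy.  This is an illustration rather than the notebook's TensorFlow layers: the function names, the toy data, and the tiny shared-weight embedding are assumptions, and the mixup weighting follows the convention of the `MixUp` layer defined later (`alpha` on the sample itself, `1 - alpha` on its random partner).  In the notebook a dense block also sits between the embedding and mixup; it is left out here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def swap_noise(x, rate, rng):
    """With probability `rate`, replace a value by the value in the same
    column of a randomly paired row (tabular cutmix)."""
    partner = x[rng.permutation(len(x))]
    keep = rng.random(x.shape) >= rate
    return np.where(keep, x, partner)

def embed(x, w, b):
    """Give every scalar feature len(w) embedded dimensions using weights
    shared across features, then flatten to (rows, features * len(w))."""
    return np.maximum(x[..., None] * w + b, 0.0).reshape(len(x), -1)

def mixup(h, alpha, rng):
    """Blend each row with a randomly paired row."""
    partner = h[rng.permutation(len(h))]
    return alpha * h + (1.0 - alpha) * partner

x = rng.normal(size=(8, 3))                    # toy batch: 8 rows, 3 features
w, b = rng.normal(size=5), rng.normal(size=5)  # hypothetical embedding weights

noisy = swap_noise(x, rate=0.2, rng=rng)       # stage 1: noise on the raw data
embedded = embed(noisy, w, b)                  # shape (8, 15)
blended = mixup(embedded, alpha=0.3, rng=rng)  # stage 2: noise on the embeddings
print(noisy.shape, embedded.shape, blended.shape)
```

Because swap noise only moves values within a column, every corrupted value is still a real value of that feature, which is why it suits tabular data.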

The model first trains on 80% of the data with early stopping, holding out the other 20% for validation.  Once it finds the optimal stopping point, it retrains from epoch 0 on all of the data for 10% more epochs than the early-stopped run needed.  
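The epoch arithmetic can be sketched as follows.  The helper name and the made-up loss curve are assumptions for illustration.  The best epoch is the one with the lowest validation loss, which equals the number of epochs run minus the patience whenever early stopping actually fires.

```python
def retrain_epochs(val_losses, extra=0.10):
    """Epochs for the full-data refit: best validation epoch plus `extra`."""
    best_epoch = val_losses.index(min(val_losses)) + 1  # epochs are 1-based
    return int(best_epoch * (1 + extra))

patience = 15  # hypothetical patience, matching PATIENCE below

# Made-up curve: improves until epoch 20, then gets worse for `patience`
# epochs, at which point early stopping would halt training at epoch 35.
val_losses = [abs(epoch - 20) + 1.0 for epoch in range(1, 36)]

print(len(val_losses) - patience)  # best epoch recovered from the run length
print(retrain_epochs(val_losses))  # epochs to use when refitting on all data
```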

In [None]:
###############################
###############################
#Hyperparams
###############################
###############################
#Change this as desired to create your own representations!
#The main sources of noise are DROP_RATE, MIXUP_RATE, and CUTMIX_RATE.  For tuning, I would make sure that the sum of 
# MIXUP_RATE and CUTMIX_RATE < .5.  But that is just me.  You do whatever you want!

LAYER_SIZE=500 #Number of embeddings to be extracted from train and test data.  Size of the hidden dense layers.
BATCH_SIZE = 10000 
EPOCHS = 150 #The network typically early stops far below this number. 
DROP_RATE = .05 #dropout rate for hidden dense layers
PATIENCE = 15 #Number of epochs with no improvement before early stopping kicks in
RESTARTS = 3 #Number of times to fit a model to the data.  The models' representations are averaged.  I was inspired by Chris Deotte in this https://www.kaggle.com/c/lish-moa/discussion/200680
NUM_BLOCKS = 3 #Number of hidden dense layers in the DAE.  This is effectively the size of the network.
MIXUP_RATE = .3 #blend weight passed as alpha to the MixUp layer: output = alpha * sample + (1 - alpha) * random sample
CUTMIX_RATE = .2 #percent of input to be randomly swapped out (swap noise)
EMBEDDED_DIMS = 5 #Number of embeddings each input feature is initially given. i.e. tensor shape (None, num_features) becomes (None, num_features, EMBEDDED_DIMS).
                  #This is the very first layer of the DAE after loading the data and using swapnoise (cutmix).

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import sys
import os
from lightgbm import LGBMRegressor, LGBMClassifier
from sklearn.model_selection import StratifiedKFold
from tqdm.notebook import tqdm
from time import time
import wandb
import tensorflow_addons as tfa
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error

import psutil
import tensorflow as tf
import gc
from sklearn.preprocessing import QuantileTransformer

In [None]:
###################################################
#Layers I took from my github.com/Ottpocket/Neural
####################################################
class CutMix(tf.keras.layers.Layer):
    '''
    Implementation of CutMix for Tabular data

    Args
    _____
    noise: (R in [0,1)) probability that a value is replaced by the value in the same column of a randomly shuffled row

    Application
    ____________
    CM = CutMix(.2)
    x = tf.reshape(tf.range(0,10, dtype=tf.float32), (5,2))
    print(x.numpy())

    y = CM(x,True)
    print(y.numpy())
    '''
    def __init__(self, noise, **kwargs):
        super(CutMix, self).__init__(**kwargs)
        self.noise = noise

    def get_config(self):
        config = super(CutMix, self).get_config()
        config.update({"noise": self.noise})
        return config

    def call(self, inputs, training=None):
        if training:
            shuffled = tf.stop_gradient(tf.random.shuffle(inputs))
            #print(shuffled.numpy())

            msk = tf.keras.backend.random_bernoulli(tf.shape(inputs), p=1 - self.noise, dtype=tf.float32)
            #print(msk)
            return msk * inputs + (tf.ones_like(msk) - msk) * shuffled
        return inputs


class EmbeddingLayer(tf.keras.layers.Layer):
    '''
    Implementation of the Embedding layer.  Embeds all features into `num_dims`
    dimensions.  Takes in (None, FEATURES) tensor and outputs (None, FEATURES * num_dims) size tensor.

    ARGUMENTS
    _____
    num_dims: (int) the number of embedded dimensions.  If None, skips embedding

    Application
    ______________
    EL = EmbeddingLayer(3)
    x = tf.reshape(tf.range(0,10, dtype=tf.float32), (5,2))
    print(x.numpy())

    y = EL(x)
    print(y.numpy())
    '''
    def __init__(self, num_dims=None, **kwargs):
        super(EmbeddingLayer, self).__init__(**kwargs)
        self.num_dims = num_dims
        if self.num_dims is not None:
            self.emb = tf.keras.layers.Conv1D(filters=self.num_dims, kernel_size=1, activation='relu')
            self.Flatten = tf.keras.layers.Flatten()

    def get_config(self):
        config = super(EmbeddingLayer, self).get_config()
        config.update({"num_dims": self.num_dims})
        return config

    def call(self, inputs):
        if self.num_dims is None:
            return inputs

        return self.Flatten(self.emb(tf.expand_dims(inputs, -1)))


class MixUp(tf.keras.layers.Layer):
    '''
    Implementation of MixUp 

    Args
    _____
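    alpha: (R in [0,1)) weight kept on the original input; the randomly shuffled sample gets 1 - alpha

    Application
    ____________
    MU = MixUp(.1)
    x = tf.reshape(tf.range(0,10, dtype=tf.float32), (5,2))
    y = MU(x, training=True)  # without training=True the layer returns x unchanged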
    print(x.numpy())
    print(y.numpy())
    '''
    def __init__(self, alpha, **kwargs):
        super(MixUp, self).__init__(**kwargs)
        self.alpha = alpha
        self.alpha_constant = tf.constant(self.alpha)
        self.one_minus_alpha = tf.constant(1.) - self.alpha

    def get_config(self):
        config = super(MixUp, self).get_config()
        config.update({"alpha": self.alpha})
        return config

    def call(self, inputs, training=None):
        if training:
            shuffled = tf.stop_gradient(tf.random.shuffle(inputs))
            #print(shuffled.numpy())
            return self.alpha_constant * inputs + self.one_minus_alpha * shuffled
        return inputs


In [None]:
train = pd.read_feather('/kaggle/input/september-feather/train_rg_min')
test = pd.read_feather('/kaggle/input/september-feather/test_rg_min')
ss = pd.read_feather('/kaggle/input/september-feather/sample_solution')
FEATURES = [feat for feat in train.columns if 'f' in feat] + ['nan_count']
TARGET = 'loss'

#Getting data ready for pretraining
combined = pd.concat([train[FEATURES], test[FEATURES]]).values
msk = np.random.choice([True, False], size = (combined.shape[0],), p=[.8,.2])
X = combined[msk]
val = combined[~msk]
#Ending with 3.4


#Getting dae-columns ready for predictions
DAE_LAYERS = []
for layer in tqdm(range(LAYER_SIZE)):
    name = f'dae_{layer}'
    train[name] = 0
    test[name] = 0
    train[name] = train[name].astype(np.float16)
    test[name] = test[name].astype(np.float16)
    DAE_LAYERS.append(name)

In [None]:
def Batch_Drop_Dense(x, name, drop_rate, layer_size, activation = 'relu'):
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dropout(drop_rate)(x)
    x = tfa.layers.WeightNormalization(tf.keras.layers.Dense(layer_size, activation=activation), name= f'Dense_{name}')(x)
    #x = tf.keras.layers.Dense(layer_size, activation=activation, name= f'Dense_{name}')(x)
    return x

#Custom model.fit for DAE structure
#stolen from https://www.tensorflow.org/guide/keras/customizing_what_happens_in_fit#a_first_simple_example
class DaeModel(tf.keras.Model):
    def train_step(self, data):
        with tf.GradientTape() as tape:
            pred = self(data, training=True)  # Forward pass
            loss = self.compiled_loss(data, pred, regularization_losses=self.losses)

        # Compute gradients
        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)
        # Update weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))
        # Update metrics (includes the metric that tracks the loss)
        self.compiled_metrics.update_state(data, pred)
        # Return a dict mapping metric names to current value
        return {m.name: m.result() for m in self.metrics}
    
    def test_step(self, data):
        #data = data_adapter.expand_1d(data)
        #x, y, sample_weight = data_adapter.unpack_x_y_sample_weight(data)

        y_pred = self(data, training=False)
        # Updates stateful loss metrics.
        self.compiled_loss(
            data, y_pred, regularization_losses=self.losses)
        self.compiled_metrics.update_state(data, y_pred)
        # Collect metrics to return
        return_metrics = {}
        for metric in self.metrics:
            result = metric.result()
            if isinstance(result, dict):
                return_metrics.update(result)
            else:
                return_metrics[metric.name] = result
        return return_metrics

def dae(num_input_columns, layer_size = 1000, BLOCKS = 3, drop_rate=.05, cutmix_rate=.2,
                mixup_rate=.1, num_embedded_dims=None):
    '''
    Creates a DAE based on architecture and noise from `SAINT: Improved Neural Networks for Tabular Data
    via Row Attention and Contrastive Pre-Training`

    ARGUMENTS:
    -----------
    num_input_columns: (int) number of input (and output) columns in model
    layer_size: (int) # of neurons in hidden layers
    BLOCKS: (int) # BN, dropout, Dense layer blocks
    drop_rate: (float [0,1)) dropout rate
    cutmix_rate: (float [0,1)) percent of input randomly cutmixed
    mixup_rate: (float [0,1)) blend weight (alpha) passed to the MixUp layer.
    num_embedded_dims: (int) how many embedded dimensions do you want the embedder to have.
          If None, just skips the embedder
    '''

    inp = tf.keras.layers.Input(shape=(num_input_columns,))
    x = CutMix(cutmix_rate, name='CutMix')(inp)
    x = EmbeddingLayer(num_dims=num_embedded_dims, name= 'EmbeddingLayer')(x)
    x = Batch_Drop_Dense(x, name='zero', drop_rate= drop_rate, layer_size=layer_size)
    x = MixUp(alpha = mixup_rate, name='MixUp')(x)
    for block in range(1, BLOCKS):  # layers are named Dense_1 ... Dense_{BLOCKS-1}
        x = Batch_Drop_Dense(x, name=block, drop_rate=drop_rate, layer_size=layer_size)
    x = Batch_Drop_Dense(x, name='Final_layer', drop_rate=drop_rate, layer_size=num_input_columns, activation=None)
    model = DaeModel(inputs=inp, outputs=x)
    model.compile(optimizer=tfa.optimizers.Lookahead(tf.optimizers.Adam(), sync_period=10),
                  loss='MeanSquaredError',
                  )
    return model

In [None]:
dae_args = dict(num_input_columns= len(FEATURES), layer_size = LAYER_SIZE, BLOCKS = NUM_BLOCKS, drop_rate=DROP_RATE, cutmix_rate=CUTMIX_RATE,
                mixup_rate=MIXUP_RATE, num_embedded_dims=EMBEDDED_DIMS)

In [None]:
#Callbacks
#Memory was a concern to say the least.  A lot of the callbacks were just to calm me down as the network trained.  
# I wanted to be sure that I was not running out of memory as training progressed
class MemoryUsage(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        keys = list(logs.keys())
        print(f"Epoch {epoch+1}: {psutil.virtual_memory()[3] / 1024**3 :.4f} gb used.")

class CleanMem(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        gc.collect()
        
ES = tf.keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=PATIENCE, verbose=0,
                                                                restore_best_weights=True)

# Eccentricities

The code below is the training loop.  Why didn't I just call `model.predict(train)` instead of the bizarre `for` loop?  Memory was tight towards the end of this.  A (~1 million, LAYER_SIZE) np.float32 ndarray is roughly 2 GB at LAYER_SIZE = 500, and holding it typically gave me memory overload.  As a workaround, I predict in batches and convert each batch of predictions to np.float16.  
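The workaround can be sketched in NumPy as follows.  The helper name, the toy sizes, and the `tanh` stand-ins for the trained embedders are assumptions for illustration.  Only one chunk of float32 predictions exists at a time, and each restart adds its share to a float16 accumulator, so the result is the average over restarts.

```python
import numpy as np

def accumulate_in_chunks(predict_fn, X, out, chunk_size, weight):
    """Add weight * predict_fn(X) into `out`, one chunk of rows at a time."""
    for start in range(0, len(X), chunk_size):
        stop = start + chunk_size
        preds = predict_fn(X[start:stop])   # float32, at most chunk_size rows
        out[start:stop] += (preds * weight).astype(out.dtype)
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8)).astype(np.float32)
n_restarts = 3

# Hypothetical stand-ins for the trained embedders, one per restart.
weights = [rng.normal(size=(8, 4)).astype(np.float32) for _ in range(n_restarts)]

averaged = np.zeros((1000, 4), dtype=np.float16)
for W in weights:
    accumulate_in_chunks(lambda x, W=W: np.tanh(x @ W), X, averaged,
                         chunk_size=300, weight=1 / n_restarts)

# Compare against the average computed in one go at full precision.
exact = np.mean([np.tanh(X @ W) for W in weights], axis=0)
max_error = float(np.abs(averaged.astype(np.float32) - exact).max())
print(averaged.dtype, max_error)
```

Stepping through the rows with `range(0, len(X), chunk_size)` handles a final partial chunk and never yields an empty one.  The price of float16 is a small rounding error in the stored representations.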

In [None]:
#######################################################################################################
#Training loop
#######################################################################################################

for i in range(RESTARTS):
    repeat_start = time()
    print(f'Starting Repeat {i+1} of {RESTARTS}')
    model = dae(**dae_args)
    model.save_weights('no_train.h5')
    embedder = tf.keras.Model(inputs=model.inputs,
                              outputs=model.get_layer(name=f'Dense_{dae_args["BLOCKS"]-1}').output)

    ##########################################################################   
    #Finding number of iterations before early stopping 
    ##########################################################################
    print(f'Mem before training: {psutil.virtual_memory()[3] / 1024**3 :.4f}')        
    H = model.fit(x=X, validation_data = (val,), callbacks = [ES, MemoryUsage(), CleanMem()],
                  epochs=EPOCHS, batch_size=BATCH_SIZE)
    print(f'Mem after training: {psutil.virtual_memory()[3] / 1024**3 :.4f}')

    ##########################################################################   
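    #Training model 10% longer than early stop on all data
    ##########################################################################    
    #Best epoch = lowest val_loss.  Equals len(history) - PATIENCE when early stopping fires,
    #and stays correct if training instead runs for all EPOCHS.
    num_epochs = int(np.argmin(H.history['val_loss'])) + 1
    num_epochs = int(num_epochs * 1.1)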
    model.load_weights('no_train.h5')

    print(f'Mem before training: {psutil.virtual_memory()[3] / 1024**3 :.4f}')
    #No validation data in this fit, so EarlyStopping (which monitors val_loss) is left out
    model.fit(x=combined, epochs= num_epochs, batch_size=BATCH_SIZE, 
              callbacks = [MemoryUsage(), CleanMem()])

    #DONT USE!!!!!! Memory overflow
    #train[DAE_LAYERS] = embedder.predict(train[FEATURES].values, batch_size=10000).astype(np.float16)
    #test[DAE_LAYERS] = embedder.predict(test[FEATURES].values, batch_size= 10000).astype(np.float16)
    
    ####################################################
    #Getting embeddings for train
    ####################################################
    increment = 100000
    columns = [position for position, name in enumerate(train.columns) if name in DAE_LAYERS]
    print(f'Mem usage: {psutil.virtual_memory()[3] / 1024**3 :.4f}')
    #Stepping `start` through the rows avoids overwriting the outer loop's `i`
    #and never produces an empty final chunk
    for start in tqdm(range(0, train.shape[0], increment)):
        finish = start + increment
        train.iloc[start:finish, columns] += embedder.predict(train[FEATURES].iloc[start:finish].values, batch_size=increment).astype(np.float16) / RESTARTS

    print(f'Mem usage: {psutil.virtual_memory()[3] / 1024**3 :.4f}')

    ####################################################
    #Getting embeddings for test
    ####################################################
    columns = [position for position, name in enumerate(test.columns) if name in DAE_LAYERS]
    #Stepping `start` through the rows avoids overwriting the outer loop's `i`
    #and never produces an empty final chunk
    for start in tqdm(range(0, test.shape[0], increment)):
        finish = start + increment
        test.iloc[start:finish, columns] += embedder.predict(test[FEATURES].iloc[start:finish].values, batch_size=increment).astype(np.float16) / RESTARTS

    print(f'Mem usage: {psutil.virtual_memory()[3] / 1024**3 :.4f}')
    
    tf.keras.backend.clear_session() 
    print(f'Repeat {i+1}: {time() - repeat_start :.2f} seconds')

In [None]:
del X, val, combined; gc.collect()
print(f'Mem usage: {psutil.virtual_memory()[3] / 1024**3 :.4f}')

In [None]:
train[DAE_LAYERS].isnull().sum().sum()

In [None]:
for feat in FEATURES:
    del train[feat], test[feat]; gc.collect()

In [None]:
test.head()

In [None]:
train.to_feather(f'train_daeta_{LAYER_SIZE}_{RESTARTS}_{MIXUP_RATE}_{CUTMIX_RATE}_{EMBEDDED_DIMS}')
test.to_feather(f'test_daeta_{LAYER_SIZE}_{RESTARTS}_{MIXUP_RATE}_{CUTMIX_RATE}_{EMBEDDED_DIMS}')