# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint

### Additional Notebook (Ungraded) : Variational Autoencoder

## Learning Objectives

At the end of the experiment, you will be able to:

* perform unsupervised pretraining using an autoencoder
* explain what a convolutional autoencoder is
* explain what variational autoencoders are
* generate Fashion MNIST images using a variational autoencoder

### Unsupervised Pretraining Using Stacked Autoencoders

If we have a large dataset but most of it is unlabeled, we can first train a stacked autoencoder using all the data, then reuse the lower layers to create a neural network for the actual task and train it using the labeled data.

For example, the figure below shows how to use a stacked autoencoder to perform **unsupervised pretraining** for a classification neural network.
<br><br>
<center>
<img src="https://cdn.iisc.talentsprint.com/CDS/Images/Unsupervised%20pre-training%20Autoencoders.png" width=550px/>
</center>

$\hspace{8.6cm} \text{Unsupervised pretraining using autoencoders}$
<br><br>

The implementation is simple: train an autoencoder using all the training data (labeled plus unlabeled), then reuse its encoder layers to create a new neural network.
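As a rough sketch of the reuse step, the snippet below wraps an encoder in a small classifier. It makes several assumptions: `pretrained_encoder` stands in for an encoder that has already been trained as part of an autoencoder (here it is freshly initialized), `classifier`, `X_labeled` and `y_labeled` are hypothetical names, and the labeled data is random. Freezing the encoder is one possible choice, not a requirement.

```python
import numpy as np
from tensorflow import keras

# Stand-in for an encoder already trained inside an autoencoder (assumption:
# in practice its weights would come from the pretraining phase).
pretrained_encoder = keras.Sequential([
    keras.layers.Input(shape=(28, 28)),
    keras.layers.Flatten(),
    keras.layers.Dense(100, activation="selu"),
    keras.layers.Dense(30, activation="selu"),
])

# Freeze the reused layers so that only the new output layer is trained.
pretrained_encoder.trainable = False

# New network for the actual task: reused encoder plus a fresh softmax layer.
classifier = keras.Sequential([
    pretrained_encoder,
    keras.layers.Dense(10, activation="softmax"),
])
classifier.compile(loss="sparse_categorical_crossentropy",
                   optimizer="sgd", metrics=["accuracy"])

# Small synthetic "labeled" set, used only to show that the model trains.
X_labeled = np.random.rand(64, 28, 28).astype(np.float32)
y_labeled = np.random.randint(0, 10, size=64)
classifier.fit(X_labeled, y_labeled, epochs=1, verbose=0)
```

Once the new output layer has settled, the encoder can be unfrozen and the whole network fine-tuned on the labeled data.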

Let’s look at a few techniques for training stacked autoencoders:

* Tying weights
* Training one autoencoder at a time

### Import required packages

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.datasets import fashion_mnist
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Flatten, Dense, Reshape, Conv2D, MaxPool2D, Conv2DTranspose

import tensorflow.keras.backend as K
from tensorflow.keras.layers import Layer

#### Tying weights

When an autoencoder is neatly symmetrical, a common technique is to tie the weights of the decoder layers to the weights of the encoder layers. This halves the number of weights in the model, speeding up training and limiting the risk of overfitting.
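To make the idea concrete, here is a minimal NumPy sketch of a tied encoder and decoder pair. The layer sizes, the random inputs and the `tanh` activation are assumptions chosen for illustration; they are not the settings used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

n_inputs, n_hidden = 6, 3
W = rng.normal(size=(n_inputs, n_hidden))  # encoder kernel, shared with the decoder
b_enc = np.zeros(n_hidden)                 # encoder bias
b_dec = np.zeros(n_inputs)                 # the decoder keeps its own bias

x = rng.normal(size=(4, n_inputs))         # batch of 4 toy inputs

codings = np.tanh(x @ W + b_enc)           # encoder uses W
reconstruction = codings @ W.T + b_dec     # decoder uses W transposed

print(codings.shape)         # (4, 3)
print(reconstruction.shape)  # (4, 6)
```

Only one kernel matrix is stored and updated, which is where the saving in weights comes from.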

In [None]:
# Load Fashion MNIST dataset
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()

# Scale dataset
X_train_full = X_train_full.astype(np.float32) / 255
X_test = X_test.astype(np.float32) / 255

# Training and validation set
X_train = X_train_full[:-5000]
X_valid = X_train_full[-5000:]
y_train = y_train_full[:-5000]
y_valid = y_train_full[-5000:]

To tie weights between layers using Keras, let’s define a custom layer:

In [None]:
class DenseTranspose(keras.layers.Layer):
    def __init__(self, dense, activation=None, **kwargs):
        self.dense = dense
        self.activation = keras.activations.get(activation)
        super().__init__(**kwargs)
    def build(self, batch_input_shape):
        # Access input_shape from dense layer's weights
        self.biases = self.add_weight(name="bias",
                                      shape=[self.dense.weights[0].shape[0]], # get the input shape from weights
                                      initializer="zeros")
        super().build(batch_input_shape)
    def call(self, inputs):
        z = tf.matmul(inputs, self.dense.weights[0], transpose_b=True)
        return self.activation(z + self.biases)

This custom layer acts like a regular Dense layer, but it uses another Dense layer’s weights, transposed. However, it uses its own bias vector. Next, we can build a new stacked autoencoder, with the decoder’s Dense layers tied to the encoder’s Dense layers:

In [None]:
# Create tied stacked autoencoder
dense_1 = Dense(100, activation="selu")
dense_2 = Dense(30, activation="selu")

tied_encoder = Sequential([
                           Flatten(input_shape=[28, 28]),
                           dense_1,
                           dense_2
                           ])

tied_decoder = Sequential([
                           DenseTranspose(dense_2, activation="selu"),
                           DenseTranspose(dense_1, activation="sigmoid"),
                           Reshape([28, 28])
                           ])

tied_ae = Sequential([tied_encoder, tied_decoder])

# Compile model
def rounded_accuracy(y_true, y_pred):
    return keras.metrics.binary_accuracy(tf.round(y_true), tf.round(y_pred))

tied_ae.compile(loss = "binary_crossentropy", optimizer = keras.optimizers.SGD(learning_rate=1.5), metrics = [rounded_accuracy])
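The `rounded_accuracy` metric treats each pixel as a binary target: both the input pixel and the reconstructed pixel are rounded to 0 or 1, and the metric reports the fraction of pixels that agree. The NumPy sketch below mirrors that idea on four made-up pixel values; `rounded_accuracy_np` is a hypothetical helper, not part of Keras.

```python
import numpy as np

def rounded_accuracy_np(y_true, y_pred):
    """Fraction of pixels whose rounded target and rounded prediction agree."""
    return np.mean(np.round(y_true) == np.round(y_pred))

y_true = np.array([0.1, 0.9, 0.6, 0.4])   # rounds to [0, 1, 1, 0]
y_pred = np.array([0.2, 0.7, 0.4, 0.45])  # rounds to [0, 1, 0, 0]

print(rounded_accuracy_np(y_true, y_pred))  # 0.75
```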

In [None]:
# Train model on training set
history = tied_ae.fit(X_train, X_train, epochs=10, validation_data=(X_valid, X_valid))

In [None]:
# Visualize reconstructions

def show_reconstructions(model, images=X_valid, n_images=5):
    ''' Compare inputs and outputs of model using n_images from X_valid dataset '''

    reconstructions = model.predict(images[:n_images])

    fig = plt.figure(figsize=(n_images * 1.5, 3))
    for img_idx in range(n_images):
        plt.subplot(2, n_images, 1 + img_idx)
        plt.imshow(images[img_idx], cmap='binary')
        plt.axis("off")

        plt.subplot(2, n_images, 1 + n_images + img_idx)
        plt.imshow(reconstructions[img_idx], cmap='binary')
        plt.axis("off")

show_reconstructions(model = tied_ae)

Compared with an untied stacked autoencoder of the same architecture (not trained in this notebook), a tied model like this one typically reaches a similar reconstruction error with almost half the number of parameters.
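The parameter saving can be checked with a little arithmetic. The sketch below counts the parameters of a 784-100-30-100-784 dense autoencoder with and without weight tying; `dense_params` is a hypothetical helper introduced for this illustration.

```python
def dense_params(n_in, n_out):
    """Kernel plus bias parameters of one fully connected layer."""
    return n_in * n_out + n_out

sizes = [784, 100, 30, 100, 784]  # 28*28 inputs -> 100 -> 30 -> 100 -> 784

# Untied: every layer owns its kernel and its bias.
untied = sum(dense_params(a, b) for a, b in zip(sizes[:-1], sizes[1:]))

# Tied: only the two encoder kernels are stored; all four layers keep a bias.
encoder_kernels = 784 * 100 + 100 * 30
all_biases = 100 + 30 + 100 + 784
tied = encoder_kernels + all_biases

print(untied)                   # 163814
print(tied)                     # 82414
print(round(tied / untied, 3))  # 0.503
```

The ratio is slightly above one half because the decoder layers still have their own bias vectors.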

#### Training One Autoencoder at a Time

Rather than training the whole stacked autoencoder in one go, it is possible to train one shallow autoencoder at a time, then stack all of them into a single stacked autoencoder, as shown in the figure below. This technique is known as **greedy layerwise training**.

<br><br>
<center>
<img src="https://cdn.iisc.talentsprint.com/CDS/Images/training%20one%20auto-encoder.png" width=500px/>
</center>

$\hspace{9.4cm} \text{Training one autoencoder at a time}$
<br><br>

The steps involved in this method are:

* During the first phase of training, the first autoencoder learns to reconstruct the inputs.

* Then we encode the whole training set using this first autoencoder, and this gives us a new (compressed) training set.

* We then train a second autoencoder on this new dataset. This is the second phase of training.

* Finally, we stack the hidden layers of each autoencoder, followed by the output layers in reverse order. This gives us the final stacked autoencoder.

In [None]:
def train_autoencoder(n_neurons, X_train, X_valid, loss, optimizer, n_epochs=10, output_activation=None, metrics=None):
    ''' Train a shallow autoencoder on X_train and return its encoder, its decoder, and the encoded training and validation sets '''

    n_inputs = X_train.shape[-1]
    encoder = Sequential([Dense(n_neurons, activation="selu", input_shape=[n_inputs])])
    decoder = Sequential([Dense(n_inputs, activation=output_activation)])

    autoencoder = Sequential([encoder, decoder])
    autoencoder.compile(optimizer, loss, metrics=metrics)
    autoencoder.fit(X_train, X_train, epochs=n_epochs, validation_data=(X_valid, X_valid))

    return encoder, decoder, encoder(X_train), encoder(X_valid)

In [None]:
# Reshape training and validation set
X_train_flat = X_train.reshape(-1, 28*28)
X_valid_flat = X_valid.reshape(-1, 28*28)

# First phase of training
enc1, dec1, X_train_enc1, X_valid_enc1 = train_autoencoder(100, X_train_flat, X_valid_flat, loss="binary_crossentropy",
                                                           optimizer=keras.optimizers.SGD(learning_rate=1.5),
                                                           output_activation="sigmoid", metrics=[rounded_accuracy])
# Second phase of training
enc2, dec2, _, _ = train_autoencoder(30, X_train_enc1, X_valid_enc1, "mse",
                                     keras.optimizers.SGD(learning_rate=0.05), output_activation="selu")

In [None]:
# Final stacked autoencoder
stacked_ae_1_by_1 = Sequential([
                                Flatten(input_shape=[28, 28]),
                                enc1, enc2, dec2, dec1,
                                Reshape([28, 28])
                                ])

In [None]:
# Visualize reconstructions
show_reconstructions(model = stacked_ae_1_by_1)

In [None]:
# Training final stacked autoencoder
stacked_ae_1_by_1.compile(loss = "binary_crossentropy",
                          optimizer = keras.optimizers.SGD(learning_rate=0.1), metrics = [rounded_accuracy])

history = stacked_ae_1_by_1.fit(X_train, X_train, epochs=10, validation_data=(X_valid, X_valid))

In [None]:
# Visualize reconstructions
show_reconstructions(model = stacked_ae_1_by_1)

Autoencoders are not limited to dense networks: we can also build convolutional
autoencoders.

### Convolutional Autoencoders

When we are dealing with images, the dense autoencoders we have seen so far will not work well unless the images are very small (as they are in Fashion MNIST): for larger images we will need to build a **convolutional autoencoder**.

In a convolutional autoencoder:

* The **encoder** is a regular CNN composed of convolutional layers and pooling layers. It typically reduces the spatial dimensionality of the inputs (i.e., height and width) while increasing the depth (i.e., the number of feature maps).

* The **decoder** must do the reverse (upscale the image and reduce its depth back to the original dimensions), and for this we can use transpose convolutional layers (alternatively, we could combine upsampling layers with convolutional layers).

Let's see a simple convolutional autoencoder for Fashion MNIST:

In [None]:
# Create convolutional encoder
conv_encoder = Sequential([
                           Reshape([28, 28, 1], input_shape=[28, 28]),
                           Conv2D(16, kernel_size=3, padding="SAME", activation="selu"),
                           MaxPool2D(pool_size=2),
                           Conv2D(32, kernel_size=3, padding="SAME", activation="selu"),
                           MaxPool2D(pool_size=2),
                           Conv2D(64, kernel_size=3, padding="SAME", activation="selu"),
                           MaxPool2D(pool_size=2)
                           ])

# Create convolutional decoder
conv_decoder = Sequential([
                           Conv2DTranspose(32, kernel_size=3, strides=2, padding="VALID", activation="selu", input_shape=[3, 3, 64]),
                           Conv2DTranspose(16, kernel_size=3, strides=2, padding="SAME", activation="selu"),
                           Conv2DTranspose(1, kernel_size=3, strides=2, padding="SAME", activation="sigmoid"),
                           Reshape([28, 28])
                           ])

conv_ae = Sequential([conv_encoder, conv_decoder])

# Compile model
conv_ae.compile(loss="binary_crossentropy", optimizer=keras.optimizers.SGD(learning_rate=1.0), metrics=[rounded_accuracy])

In [None]:
# Model summary for convolutional encoder and decoder
conv_encoder.summary()
conv_decoder.summary()
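The spatial sizes reported by the two summaries can be checked by hand. The sketch below assumes the usual output-length rules for `MaxPool2D` (stride equal to the pool size, no padding) and for `Conv2DTranspose`; `max_pool_out` and `conv_transpose_out` are hypothetical helper names, not Keras functions. The `"SAME"` convolutions with stride 1 leave height and width unchanged, so only the pooling layers shrink the image.

```python
def max_pool_out(size, pool=2):
    """Output length of max pooling with stride == pool size and no padding."""
    return (size - pool) // pool + 1

def conv_transpose_out(size, kernel, stride, padding):
    """Output length of a transposed convolution along one spatial axis."""
    if padding == "valid":
        return size * stride + max(kernel - stride, 0)
    return size * stride  # "same" padding

# Encoder: three pooling layers applied to a 28 x 28 image.
size = 28
encoder_sizes = []
for _ in range(3):
    size = max_pool_out(size)
    encoder_sizes.append(size)
print(encoder_sizes)  # [14, 7, 3]

# Decoder: three transposed convolutions with kernel 3 and stride 2.
decoder_sizes = []
for padding in ["valid", "same", "same"]:
    size = conv_transpose_out(size, kernel=3, stride=2, padding=padding)
    decoder_sizes.append(size)
print(decoder_sizes)  # [7, 14, 28]
```

This is why the first transposed convolution uses `"VALID"` padding: with `"SAME"` padding, 3 would become 6, and the decoder would end at 24 × 24 instead of 28 × 28.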

In [None]:
# Train convolutional autoencoder on training set
history = conv_ae.fit(X_train, X_train, epochs=5, validation_data=(X_valid, X_valid))

In [None]:
# Visualize reconstructions
show_reconstructions(conv_ae)

### Variational Autoencoders

Another important category of autoencoders was introduced in 2013 by Diederik
Kingma and Max Welling and quickly became one of the most popular types of
autoencoders: variational autoencoders.

They are quite different from all other autoencoders, in these particular ways:

* They are probabilistic autoencoders, meaning that their outputs are partly determined by chance, even after training.

* They are generative autoencoders, meaning that they can generate new instances that look like they were sampled from the training set.

**Working of Variational autoencoders:**

The figure below (left) shows a variational autoencoder. The basic structure is similar to all autoencoders, with an encoder followed by a decoder, but instead of directly producing a coding for a given input, the encoder produces a **mean coding μ** and a **standard deviation σ**. The actual coding is then sampled randomly from a Gaussian distribution with mean $μ$ and standard deviation $σ$. After that, the decoder decodes the sampled coding normally.

The right part of the diagram shows a training instance going through this autoencoder. First, the encoder produces $μ$ and $σ$, then a coding is sampled randomly, and finally this coding is decoded; the final output resembles the training instance.

<br><br>
<center>
<img src="https://cdn.iisc.talentsprint.com/CDS/Images/Variational_AE.png" width=500px/>
</center>

$\hspace{6cm} \text{Variational autoencoder (left) and an instance going through it (right)}$
<br><br>

After training a variational autoencoder, we can easily generate a new instance: just sample a random coding from the Gaussian distribution and decode it.

The cost function is composed of two parts.

* The first is the usual reconstruction loss that pushes the autoencoder to reproduce its inputs.

* The second is the latent loss that pushes the autoencoder to have codings that look as though they were sampled from a simple Gaussian distribution.

The variational autoencoder’s latent loss can be computed as:

$$L = -\frac{1}{2}\sum_{i=1}^{n}\left[1 + \log(\sigma_i^2) - \sigma_i^2 - \mu_i^2\right]$$

where $L$ is the latent loss, $n$ is the codings’ dimensionality, and $μ_i$ and $σ_i$ are the mean and standard deviation of the $i^{th}$ component of the codings. The vectors $μ$ and $σ$ are output by the encoder, as shown in the above figure (left).

A common tweak to the variational autoencoder’s architecture is to make the encoder output $γ = \log(σ^2)$ rather than $σ$. The latent loss can then be computed as:

$$L = -\frac{1}{2}\sum_{i=1}^{n}\left[1 + \gamma_i - \exp(\gamma_i) - \mu_i^2\right]$$

This approach is more numerically stable and speeds up training.
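As a quick check that the two forms of the latent loss agree, the NumPy sketch below evaluates both on two made-up codings. The function names and the example values are assumptions for illustration only.

```python
import numpy as np

def latent_loss_sigma(mu, sigma):
    """Latent loss written in terms of the standard deviation sigma."""
    return -0.5 * np.sum(1 + np.log(sigma**2) - sigma**2 - mu**2, axis=-1)

def latent_loss_gamma(mu, gamma):
    """Latent loss written in terms of gamma = log(sigma^2)."""
    return -0.5 * np.sum(1 + gamma - np.exp(gamma) - mu**2, axis=-1)

# Two instances with 2-dimensional codings.
mu = np.array([[0.0, 0.0], [1.0, -0.5]])
sigma = np.array([[1.0, 1.0], [0.5, 2.0]])
gamma = np.log(sigma**2)

print(latent_loss_sigma(mu, sigma))
print(latent_loss_gamma(mu, gamma))
```

Both calls give the same values: zero for the first instance, whose codings already follow a standard Gaussian ($μ = 0$, $σ = 1$), and 1.75 for the second. The loss grows as $μ$ moves away from 0 or $σ$ away from 1.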

Let’s start building a variational autoencoder for Fashion MNIST using the $γ$ tweak.

First, we will need a custom layer to sample the codings, given $μ$ and $γ$:

In [None]:
# Custom Sampling Layer
class Sampling(tf.keras.layers.Layer):
    def call(self, inputs):
        mean, log_var = inputs
        epsilon = tf.random.normal(shape=tf.shape(log_var))
        return mean + epsilon * tf.exp(0.5 * log_var)

This Sampling layer takes two inputs: mean ($μ$) and log_var ($γ$). It uses the function `tf.random.normal()` to sample a random vector (of the same shape as γ) from the Normal distribution, with mean 0 and standard deviation 1. Then it multiplies it by $exp(γ / 2)$ (which is equal to $σ$), and finally it adds $μ$ and returns the result. This samples a codings vector from the Normal distribution with mean $μ$ and standard deviation $σ$.
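The same computation can be sketched in NumPy to confirm that the samples have the intended mean and standard deviation. The values of `mu` and of the standard deviations (0.5 and 3.0) are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([2.0, -1.0])
gamma = np.log(np.array([0.5, 3.0]) ** 2)  # log-variance for sigma = 0.5 and 3.0

# Draw standard Gaussian noise, then scale by sigma = exp(gamma / 2) and shift by mu.
epsilon = rng.standard_normal(size=(100_000, 2))
codings = mu + epsilon * np.exp(0.5 * gamma)

print(codings.mean(axis=0))  # close to mu
print(codings.std(axis=0))   # close to [0.5, 3.0]
```

Because the randomness is confined to `epsilon`, the result is an ordinary differentiable function of $μ$ and $γ$, so gradients can flow back to the encoder during training.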

Next, we can create the encoder, using the Functional API:

In [None]:
codings_size = 10
# Variational Encoder
inputs = Input(shape=[28, 28])
z = Flatten()(inputs)
z = Dense(150, activation="selu")(z)
z = Dense(100, activation="selu")(z)

In [None]:
# Custom Latent Loss Layer
class LatentLossLayer(Layer):
    def call(self, inputs):
        codings_mean, codings_log_var = inputs
        latent_loss = -0.5 * K.sum(
            1 + codings_log_var - K.exp(codings_log_var) - K.square(codings_mean),
            axis=-1
        )
        self.add_loss(K.mean(latent_loss) / 784.0)  # Batch mean, scaled to match the per-pixel reconstruction loss
        return codings_mean  # Pass the mean through unchanged

In [None]:
# Variational Encoder with Latent Loss
codings_mean = Dense(codings_size)(z) # without Latent Loss
codings_log_var = Dense(codings_size)(z)
codings_mean = LatentLossLayer()([codings_mean, codings_log_var])  # Attach Latent Loss
codings = Sampling()([codings_mean, codings_log_var])

variational_encoder = Model(inputs, [codings_mean, codings_log_var, codings])

Note that the Dense layers that output codings_mean ($μ$) and codings_log_var ($γ$) have the same inputs. codings_mean is then routed through LatentLossLayer, which registers the latent loss and returns the mean unchanged, and both codings_mean and codings_log_var are passed to the Sampling layer. Finally, the variational_encoder model has three outputs; the only one we will use is the last (codings).

Now let’s build the decoder:

In [None]:
# Variational Decoder
decoder_inputs = Input(shape=[codings_size])
x = Dense(100, activation="selu")(decoder_inputs)
x = Dense(150, activation="selu")(x)
x = Dense(28 * 28, activation="sigmoid")(x)
outputs = tf.keras.layers.Reshape([28, 28])(x)

variational_decoder = Model(decoder_inputs, outputs)

Finally, let’s build the variational autoencoder model.

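The training loss is the sum of the reconstruction loss and the latent loss. The latent loss is already attached to the model by `LatentLossLayer`: it applies the latent loss equation to each instance in the batch, computes the mean over all the instances in the batch, and divides the result by 784 so that it has the appropriate scale compared to the reconstruction loss. The division is needed because the binary cross-entropy reconstruction loss, which we specify when compiling the model, is averaged over the 784 pixels rather than summed.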
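The role of the factor 784 can be illustrated with a small NumPy sketch. It uses random arrays in place of real images, reconstructions and encoder outputs (all assumptions), and compares a total loss built from the per-pixel mean cross-entropy with one built from the per-image sum.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pixels = 28 * 28

# Random stand-ins for a batch of 8 flattened images and their reconstructions.
x = rng.uniform(0.05, 0.95, size=(8, n_pixels))
x_hat = rng.uniform(0.05, 0.95, size=(8, n_pixels))
# Random stand-ins for the encoder outputs (codings of size 10).
mu = rng.normal(size=(8, 10))
gamma = rng.normal(size=(8, 10))

bce_per_pixel = -(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))
latent = -0.5 * np.sum(1 + gamma - np.exp(gamma) - mu**2, axis=-1)

# Cross-entropy averaged over pixels, latent loss divided by 784.
scaled_total = np.mean(bce_per_pixel.mean(axis=-1) + latent / n_pixels)
# Cross-entropy summed over pixels, latent loss left as is.
summed_total = np.mean(bce_per_pixel.sum(axis=-1) + latent)

print(np.isclose(summed_total, n_pixels * scaled_total))  # True
```

The two totals differ only by a constant factor of 784, so dividing the latent loss by 784 keeps the balance between the two terms the same as in the summed form.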

In [None]:
# Create variational autoencoder using Functional API
_, _, codings = variational_encoder(inputs)
reconstructions = variational_decoder(codings)
variational_ae = Model(inputs=[inputs], outputs=[reconstructions])

In [None]:
# Compile model
variational_ae.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=[rounded_accuracy])

In [None]:
# Train variational autoencoder on training set
history = variational_ae.fit(X_train, X_train, epochs=25, batch_size=128, validation_data=(X_valid, X_valid))

In [None]:
# Visualize reconstructions
show_reconstructions(variational_ae)

#### Generating Fashion MNIST Images using Variational autoencoder

We can use the variational autoencoder to generate images that look like fashion items by sampling random codings from a Gaussian distribution and decoding them:

In [None]:
def plot_multiple_images(images, n_cols=None):
    ''' Plot multiple images '''

    n_cols = n_cols or len(images)
    n_rows = (len(images) - 1) // n_cols + 1
    plt.figure(figsize=(n_cols, n_rows))
    for index, image in enumerate(images):
        plt.subplot(n_rows, n_cols, index + 1)
        plt.imshow(image, cmap="binary")
        plt.axis("off")

In [None]:
# Generate a few random codings, decode them and plot the resulting images
codings = tf.random.normal(shape = [12, codings_size])
images = variational_decoder(codings).numpy()
plot_multiple_images(images, 4)