# Autoencoders

Autoencoders are artificial neural networks capable of learning efficient representations of the input data, called codings, without supervision (i.e., the training set is unlabeled). Autoencoders are useful for:

- Dimensionality reduction
- Feature detection
- New data generation

When the internal representation of the autoencoder has a lower dimensionality than the input data, the autoencoder is said to be *undercomplete*. An undercomplete autoencoder cannot trivially copy its inputs to the codings, yet it must find a way to output a copy of its inputs. It is therefore forced to learn the most important features in the input data (and drop the unimportant ones).


## Performing PCA with an Undercomplete Linear Autoencoder

If the autoencoder uses only linear activation functions and the cost function is the Mean Squared Error (MSE), then it can be shown that it ends up performing Principal Component Analysis (PCA).

The following code builds a simple linear autoencoder to perform PCA on a 3D dataset, projecting it to 2D:

In [1]:
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected

In [4]:
n_inputs = 3 #3D inputs
n_hidden = 2 #2D codings
n_outputs = n_inputs

learning_rate = 0.01

X = tf.placeholder(tf.float32, shape = [None, n_inputs])
hidden = fully_connected(X, n_hidden, activation_fn = None)
outputs = fully_connected(hidden, n_outputs, activation_fn = None)

reconstruction_loss = tf.reduce_mean(tf.square(outputs - X)) #MSE

optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(reconstruction_loss)
init = tf.global_variables_initializer()

This code is similar to the MLPs we built in earlier notebooks. Two things to note:

- The number of outputs equals the number of inputs.
- To perform PCA, we set activation_fn = None (all neurons are linear) and use the MSE as the cost function.

Now let's look at the execution phase:

In [None]:
X_train, X_test = [...] #load dataset

n_iterations = 1000
codings = hidden #the output of the hidden layer provides codings

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        training_op.run(feed_dict = {X: X_train}) #no labels {unsupervised}
    codings_val = codings.eval(feed_dict = {X: X_test})
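
The link to PCA can also be checked without TensorFlow. The sketch below is a NumPy illustration under stated assumptions, not the model above: the dataset is synthetic and the names (`X_demo`, `W_enc`, `W_dec`) are made up for this example. It builds the PCA projection directly with an SVD and treats it as a linear encoder/decoder pair, which is the kind of solution the linear autoencoder is expected to approach (up to an invertible linear transformation of the codings).

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 3D dataset lying close to a 2D plane (illustration only).
m = 200
latent = rng.normal(size=(m, 2))
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.3, 1.0, 0.7]])
X_demo = latent @ mixing + 0.05 * rng.normal(size=(m, 3))
X_centered = X_demo - X_demo.mean(axis=0)

# PCA via SVD: the rows of Vt are the principal directions.
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
W_enc = Vt[:2].T   # linear "encoder" weights, shape (3, 2)
W_dec = W_enc.T    # linear "decoder" weights, shape (2, 3)

codings_demo = X_centered @ W_enc   # 2D codings
X_rec = codings_demo @ W_dec        # reconstruction back in 3D

# Reconstruction MSE, the quantity the autoencoder minimizes.
mse = np.mean((X_rec - X_centered) ** 2)

# For the PCA projection this equals the energy of the discarded direction.
expected_mse = s[2] ** 2 / X_centered.size
print(mse, expected_mse)
```

No linear 3 → 2 → 3 mapping can achieve a lower MSE on this data than the PCA projection, which is why minimizing the MSE with linear layers leads to the same subspace.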

## Stacked Autoencoders

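Like other neural networks, autoencoders can have multiple hidden layers, in which case they are called stacked autoencoders. The extra layers let the autoencoder learn more complex codings. The architecture of a stacked autoencoder is typically symmetrical with regard to the central hidden layer (the coding layer).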

### TensorFlow Implementation

The following code builds a stacked autoencoder for MNIST using He initialization, the ELU activation function, and L2 regularization. It looks much like a regular MLP; the only difference is that there are no labels (no y):

In [23]:
n_inputs = 28*28
n_hidden1 = 300
n_hidden2 = 150 #codings
n_hidden3 = 300
n_outputs = n_inputs 


learning_rate = 0.01
l2_reg = 0.001

X = tf.placeholder(tf.float32, shape = [None, n_inputs])
with tf.contrib.framework.arg_scope(
    [fully_connected],
    activation_fn = tf.nn.elu,
    weights_initializer = tf.contrib.layers.variance_scaling_initializer(),
    weights_regularizer = tf.contrib.layers.l2_regularizer(l2_reg)):
    hidden1 = fully_connected(X, n_hidden1)
    hidden2 = fully_connected(hidden1, n_hidden2) #codings
    hidden3 = fully_connected(hidden2, n_hidden3)
    outputs = fully_connected(hidden3, n_outputs, activation_fn = None)
    
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X)) #MSE
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([reconstruction_loss] + reg_losses)

optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()

You can then train the model normally. Note that the digit labels (y_batch) are unused:

In [19]:
from tensorflow.examples.tutorials.mnist import input_data

In [22]:
mnist = input_data.read_data_sets("tmp/data/")
X_test = mnist.test.images.reshape((-1, n_inputs))

Extracting tmp/data/train-images-idx3-ubyte.gz
Extracting tmp/data/train-labels-idx1-ubyte.gz
Extracting tmp/data/t10k-images-idx3-ubyte.gz
Extracting tmp/data/t10k-labels-idx1-ubyte.gz


In [24]:
n_epochs = 5
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            X_batch, y_batch = mnist.train.next_batch(batch_size) #y_batch is unused
            X_batch = X_batch.reshape((-1, n_inputs))
            sess.run(training_op, feed_dict = {X: X_batch})

### Tying Weights

When an autoencoder is symmetrical, we can tie the weights of the decoder layers to the weights of the encoder layers, thereby halving the number of weights in the model. This speeds up training and reduces the risk of overfitting.
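
To make the saving concrete, the short sketch below counts the weights of the stacked autoencoder defined earlier, with and without tying. It is plain Python arithmetic over the layer sizes already used in this notebook; the helper names (`layer_sizes`, `untied_weights`, `tied_weights`) are introduced only for this illustration.

```python
# Layer sizes from the stacked autoencoder above.
n_inputs, n_hidden1, n_hidden2, n_hidden3 = 28 * 28, 300, 150, 300
n_outputs = n_inputs
layer_sizes = [n_inputs, n_hidden1, n_hidden2, n_hidden3, n_outputs]

# Untied: one weight matrix per pair of consecutive layers.
untied_weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# Tied: only the encoder matrices are stored; the decoder reuses their transposes.
n_encoder_layers = (len(layer_sizes) - 1) // 2
encoder_sizes = layer_sizes[:n_encoder_layers + 1]
tied_weights = sum(a * b for a, b in zip(encoder_sizes, encoder_sizes[1:]))

print(untied_weights, tied_weights)  # 560400 280200
```

Only the weight matrices are halved; each layer still keeps its own bias vector.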

Implementing tied weights in TensorFlow using the fully_connected() function is a bit cumbersome; it is easier to define the layers manually, as follows:

In [28]:
activation = tf.nn.elu
regularizer = tf.contrib.layers.l2_regularizer(l2_reg)
initializer = tf.contrib.layers.variance_scaling_initializer()

X = tf.placeholder(tf.float32, shape = [None, n_inputs])

weights1_init = initializer([n_inputs, n_hidden1])
weights2_init = initializer([n_hidden1, n_hidden2])

weights1 = tf.Variable(weights1_init, dtype = tf.float32, name = "weights1")
weights2 = tf.Variable(weights2_init, dtype = tf.float32, name = "weights2")
weights3 = tf.transpose(weights2, name = "weights3") #tied weights
weights4 = tf.transpose(weights1, name = "weights4") #tied weights

biases1 = tf.Variable(tf.zeros(n_hidden1), name = "biases1")
biases2 = tf.Variable(tf.zeros(n_hidden2), name = "biases2")
biases3 = tf.Variable(tf.zeros(n_hidden3), name = "biases3")
biases4 = tf.Variable(tf.zeros(n_outputs), name = "biases4")

hidden1 = activation(tf.matmul(X, weights1) + biases1)
hidden2 = activation(tf.matmul(hidden1, weights2) + biases2)
hidden3 = activation(tf.matmul(hidden2, weights3) + biases3)
outputs = tf.matmul(hidden3, weights4) + biases4

reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))
reg_loss = regularizer(weights1) + regularizer(weights2)
loss = reconstruction_loss + reg_loss

optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()




The important things to note about the code above:

- weights3 and weights4 are not variables, they are respectively the transpose of weights2 and weights1 (they are "tied" to them).

- Since they are not variables, it is no use regularizing them: we only regularize weights1 and weights2

- Biases are never tied, and never regularized


## Training One Autoencoder at a Time

Rather than training the whole stacked autoencoder in one go, we can train its components separately, in phases, within a single graph.

The code looks as follows:

In [None]:
[...] #Build the whole stacked autoencoder normally.
      #In this example the weights are not tied

optimizer = tf.train.AdamOptimizer(learning_rate)

with tf.name_scope("phase1"):
    phase1_outputs = tf.matmul(hidden1, weights4) + biases4
    phase1_reconstruction_loss = tf.reduce_mean(tf.square(phase1_outputs - X))
    phase1_reg_loss = regularizer(weights1) + regularizer(weights4)
    phase1_loss = phase1_reconstruction_loss + phase1_reg_loss
    phase1_training_op = optimizer.minimize(phase1_loss)
    
with tf.name_scope("phase2"):
    phase2_reconstruction_loss = tf.reduce_mean(tf.square(hidden3 - hidden1))
    phase2_reg_loss = regularizer(weights2) + regularizer(weights3)
    phase2_loss = phase2_reconstruction_loss + phase2_reg_loss
    train_vars = [weights2, biases2, weights3, biases3]
    phase2_training_op = optimizer.minimize(phase2_loss, var_list = train_vars)

The first phase is rather straightforward: we just create an output layer that skips the hidden layers 2 and 3, then build the training operations to minimize the distance between the outputs and the inputs (plus some regularization).

The second phase just adds the operations needed to minimize the distance between the output of hidden layer 3 and hidden layer 1 (also with some regularization). Most importantly, we provide a list of trainable variables to the minimize() method, making sure to leave out weights1 and biases1; this effectively freezes hidden layer 1 during phase 2.

During the execution phase, all you need to do is run the phase 1 training op for a number of epochs, then the phase 2 training op for some more epochs.

## Visualizing the Reconstructions

One way to check that an autoencoder is properly trained is to compare its inputs and outputs. They should be fairly similar, and the differences should be unimportant details. Let's plot two test digits and their reconstructions:

In [None]:
import matplotlib.pyplot as plt

n_test_digits = 2
X_test = mnist.test.images[:n_test_digits]

def plot_image(image, shape = [28, 28]):
    plt.imshow(image.reshape(shape), cmap = "Greys", interpolation = "nearest")
    plt.axis("off")

with tf.Session() as sess:
    [...] #Train the Autoencoder
    #Check: https://github.com/ageron/handson-ml/blob/master/15_autoencoders.ipynb
    outputs_val = outputs.eval(feed_dict = {X: X_test})

for digit_index in range(n_test_digits):
    plt.subplot(n_test_digits, 2, digit_index * 2 + 1)
    plot_image(X_test[digit_index]) #original digit
    plt.subplot(n_test_digits, 2, digit_index * 2 + 2)
    plot_image(outputs_val[digit_index]) #reconstruction
    

## Visualizing Features

For each neuron in the first hidden layer, you can create an image where each pixel's intensity corresponds to the weight of the connection from that pixel to the given neuron. For example, the following code plots the features learned by five neurons in the first hidden layer:

In [None]:
with tf.Session() as sess:
    [...] #train autoencoder
    weights1_val = weights1.eval()

In [None]:
for i in range(5):
    plt.subplot(1,5, i + 1)
    plot_image(weights1_val.T[i])

## Denoising Autoencoders

A denoising autoencoder is trained to recover the original, noise-free inputs from a corrupted version of them, which prevents it from trivially copying its inputs to its outputs. The inputs can be corrupted in two ways: by adding pure Gaussian noise to them, or by randomly switching off some inputs, as in dropout.

The following code adds Gaussian noise to the inputs. It is similar to a regular autoencoder, except that noise is added to the inputs and the reconstruction loss is computed against the original inputs:

In [None]:
X = tf.placeholder(tf.float32, shape = [None, n_inputs])
X_noisy = X + tf.random_normal(tf.shape(X))
#use tf.shape(X) rather than X.get_shape(): the shape of X is only
#partially defined during the construction phase
[...]
hidden1 = activation(tf.matmul(X_noisy, weights1) + biases1)
[...]
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X)) #MSE
[...]

Implementing the dropout version, which is more common, is also easy:

In [None]:
from tensorflow.contrib.layers import dropout

keep_prob = 0.7

is_training = tf.placeholder_with_default(False, shape = (), name = 'is_training')
X = tf.placeholder(tf.float32, shape = [None, n_inputs])
X_drop = dropout(X, keep_prob, is_training = is_training)
[...]
hidden1 = activation(tf.matmul(X_drop, weights1) + biases1)
[...]
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X)) #MSE
[...]

#During training we must set is_training to True using feed_dict

sess.run(training_op, feed_dict = {X:X_batch, is_training: True})
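
The two corruption schemes can also be illustrated outside TensorFlow. The NumPy sketch below uses a made-up mini-batch (`X_batch_demo`), and all names in it are hypothetical. The rescaling by `1 / keep_prob` follows the usual inverted-dropout convention and is an assumption about how a dropout layer behaves, not something shown in the code above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mini-batch: 4 instances with 6 features in [0, 1).
X_batch_demo = rng.random((4, 6))

# 1) Gaussian corruption: add zero-mean, unit-variance noise to every input.
X_noisy_demo = X_batch_demo + rng.normal(size=X_batch_demo.shape)

# 2) Dropout-style corruption: zero each input with probability 1 - keep_prob
#    and rescale the surviving inputs by 1 / keep_prob.
keep_prob = 0.7
mask = rng.random(X_batch_demo.shape) < keep_prob
X_drop_demo = np.where(mask, X_batch_demo / keep_prob, 0.0)

# In both cases the reconstruction target stays the clean batch.
target_demo = X_batch_demo
```

Whichever scheme is used, the network sees the corrupted batch as input while the loss compares its outputs with `target_demo`, the uncorrupted data.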

## Sparse Autoencoders

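By adding a sparsity loss to the cost function, weighted by a hyperparameter (sparsity_weight), we push the autoencoder to activate only a small fraction of the neurons in the coding layer, so that each active neuron represents a useful feature. Here the sparsity loss is the KL divergence between a target sparsity (sparsity_target) and the actual mean activation of each coding neuron over the training batch. Both hyperparameters are set manually, and there is a fine balance: if the sparsity penalty is too strong, the model reconstructs its inputs poorly and is useless; if it is too weak, the model mostly ignores the sparsity objective and loses the benefits of a sparse autoencoder.

The implementation of a sparse autoencoder is below: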

In [42]:
def kl_divergence(p,q):
    return p * tf.log(p / q) + (1 - p) * tf.log((1 - p) / (1 - q))

In [None]:
learning_rate = 0.01
sparsity_target = 0.1
sparsity_weight = 0.2

[...] #Build a normal autoencoder (coding layer is hidden1 here)

optimizer = tf.train.AdamOptimizer(learning_rate)

hidden1_mean = tf.reduce_mean(hidden1, axis = 0) #batch mean
sparsity_loss = tf.reduce_sum(kl_divergence(sparsity_target, hidden1_mean))
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X)) #MSE
loss = reconstruction_loss + sparsity_weight * sparsity_loss
training_op = optimizer.minimize(loss)

An important detail is the fact that the activations of the coding layer must be between 0 and 1 (and not equal to 0 or 1), else KL divergence will return NaN. A simple solution is to use the logistic function for the coding layer:

In [None]:
hidden1 = tf.nn.sigmoid(tf.matmul(X, weights1) + biases1)
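
To see how the KL divergence term behaves, here is a small pure-Python sketch of the same formula for scalar values. The function name `kl_divergence_bernoulli` and the example activations are made up for illustration. It treats the target sparsity and a neuron's mean activation as the parameters of two Bernoulli distributions.

```python
import math

def kl_divergence_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

sparsity_target = 0.1

# Mean activation equal to the target: no penalty.
on_target = kl_divergence_bernoulli(sparsity_target, 0.1)

# Neuron active half of the time: clearly penalized.
too_active = kl_divergence_bernoulli(sparsity_target, 0.5)

print(on_target, too_active)
```

The penalty is zero when the mean activation matches the target and grows as it moves away from it. With a mean activation of exactly 0 or 1 the logarithm or the division is undefined, which is why the coding layer uses the logistic function.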

One simple trick to speed up convergence is using cross entropy as reconstruction loss instead of MSE. To use it, we need to normalize inputs to make them take on values from 0 to 1, and use the logistic activation function in the output layer so the outputs also take on values from 0 to 1. 

TensorFlow's sigmoid_cross_entropy_with_logits function takes care of efficiently applying the logistic (sigmoid) activation function to the outputs and computing cross entropy:

In [None]:
[...]
logits = (tf.matmul(hidden1, weights2) + biases2)
outputs = tf.nn.sigmoid(logits)

reconstruction_loss = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels = X, logits = logits))
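
The reason for computing the loss from the logits rather than from the sigmoid outputs is numerical stability. The pure-Python sketch below compares a naive cross entropy with the rearranged form `max(x, 0) - x * z + log(1 + exp(-|x|))`, which is the formulation described in the TensorFlow documentation for this function, to the best of my knowledge. The function names and example values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def naive_cross_entropy(target, logit):
    """Cross entropy computed from the sigmoid output directly."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def stable_cross_entropy(target, logit):
    """Algebraically equivalent form that never takes the log of a value near 0."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# Hypothetical normalized pixel value and the logit predicted for it.
pixel, logit = 0.8, 1.5
naive_value = naive_cross_entropy(pixel, logit)
stable_value = stable_cross_entropy(pixel, logit)

# For a very confident wrong prediction the stable form still works,
# whereas the naive one would try to compute log(0).
extreme_value = stable_cross_entropy(0.0, 40.0)
```

Both forms agree for moderate logits, but only the second remains finite when the sigmoid saturates.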



## Variational Autoencoders

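Instead of directly producing a coding for a given input, the encoder of a variational autoencoder produces a mean coding $\mu$ and a standard deviation $\sigma$. The actual coding is then sampled from a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$. The cost function has two parts: a reconstruction loss, which pushes the outputs to look as similar to the inputs as possible, and a latent loss, which pushes the codings to look as though they were sampled from a simple Gaussian distribution.

The latent loss can be coded as follows: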

In [None]:
eps = 1e-10 #smoothing term to avoid computing log(0)
latent_loss = 0.5 * tf.reduce_sum(tf.square(hidden3_sigma) + tf.square(hidden3_mean)
                                  - 1 - tf.log(eps + tf.square(hidden3_sigma)))

Or more simply, if the encoder outputs $\gamma = log(\sigma^{2})$ rather than $\sigma$:

In [None]:
latent_loss = 0.5 * tf.reduce_sum(tf.exp(hidden3_gamma) + tf.square(hidden3_mean)
                                  - 1 - hidden3_gamma)
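
For reference, the two snippets above are meant to compute the same quantity. Assuming the usual convention that the latent loss is the KL divergence between the coding distribution $\mathcal{N}(\mu, \sigma^{2})$ and a standard Gaussian, summed over the coding dimensions $i$, it can be written as:

$$
\mathcal{L}_{latent} = \frac{1}{2} \sum_{i} \left( \sigma_i^{2} + \mu_i^{2} - 1 - \log(\sigma_i^{2}) \right) = \frac{1}{2} \sum_{i} \left( \exp(\gamma_i) + \mu_i^{2} - 1 - \gamma_i \right), \qquad \gamma_i = \log(\sigma_i^{2})
$$

Each term is zero when $\mu_i = 0$ and $\sigma_i = 1$, and positive otherwise. In the code, tf.reduce_sum also sums over all instances in the batch, and the eps term in the first variant only guards against computing $\log(0)$.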

The following code builds the variational autoencoder using the $log(\sigma^{2})$ variant:

In [44]:
n_inputs = 28 * 28 #for MNIST
n_hidden1 = 500
n_hidden2 = 500
n_hidden3 = 20 #codings
n_hidden4 = n_hidden2
n_hidden5 = n_hidden1
n_outputs = n_inputs

learning_rate = 0.001

In [None]:
with tf.contrib.framework.arg_scope(
    [fully_connected],
    activation_fn = tf.nn.elu,
    weights_initializer = tf.contrib.layers.variance_scaling_initializer()):
    X = tf.placeholder(tf.float32, [None, n_inputs])
    hidden1 = fully_connected(X, n_hidden1)
    hidden2 = fully_connected(hidden1, n_hidden2)
    hidden3_mean = fully_connected(hidden2, n_hidden3, activation_fn = None)
    hidden3_gamma = fully_connected(hidden2, n_hidden3, activation_fn = None)
    hidden3_sigma = tf.exp(0.5*hidden3_gamma)
    noise = tf.random_normal(tf.shape(hidden3_sigma), dtype = tf.float32)
    hidden3 = hidden3_mean + hidden3_sigma*noise
    hidden4 = fully_connected(hidden3, n_hidden4)
    hidden5 = fully_connected(hidden4, n_hidden5)
    logits = fully_connected(hidden5, n_outputs, activation_fn = None)
    outputs = tf.sigmoid(logits)
    
reconstruction_loss = tf.reduce_sum(
    tf.nn.sigmoid_cross_entropy_with_logits(labels = X, logits = logits))

latent_loss = 0.5*tf.reduce_sum(
    tf.exp(hidden3_gamma) + tf.square(hidden3_mean) - 1 - hidden3_gamma)

cost = reconstruction_loss + latent_loss

optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate)
training_op = optimizer.minimize(cost)

init = tf.global_variables_initializer()
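
The sampling step in the graph above (hidden3 = hidden3_mean + hidden3_sigma * noise) can be checked in isolation. The NumPy sketch below uses made-up encoder outputs for a single instance with two coding dimensions; every name ending in `_demo` is hypothetical. It shows that scaling and shifting standard Gaussian noise yields samples with the requested mean and standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for one instance with 2 coding dimensions.
mean_demo = np.array([1.0, -2.0])
gamma_demo = np.array([0.0, np.log(0.25)])  # gamma = log(sigma^2)
sigma_demo = np.exp(0.5 * gamma_demo)       # sigma = [1.0, 0.5]

# Sample many codings: mean + sigma * noise, with noise ~ N(0, 1).
noise_demo = rng.normal(size=(100_000, 2))
samples_demo = mean_demo + sigma_demo * noise_demo

print(samples_demo.mean(axis=0))  # close to mean_demo
print(samples_demo.std(axis=0))   # close to sigma_demo
```

Because the randomness comes only from the noise term, the coding remains a differentiable function of the mean and of gamma, which is what lets the optimizer train the encoder through the sampling step.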

### Generating Digits

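We can now use the variational autoencoder to generate images that look like handwritten digits. We train the model, then sample random codings from a Gaussian distribution and decode them: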

In [56]:
import numpy as np

In [57]:
n_digits = 60
n_epochs = 50
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict = {X: X_batch})

    #sample random codings and decode them (must run inside the session)
    codings_rnd = np.random.normal(size = [n_digits, n_hidden3])
    outputs_val = outputs.eval(feed_dict = {hidden3: codings_rnd})


Now we can see what the "handwritten" digits produced by the autoencoder look like:

In [None]:
n_rows = n_digits // 10 #10 digits per row
for iteration in range(n_digits):
    plt.subplot(n_rows, 10, iteration + 1)
    plot_image(outputs_val[iteration])