# Variational Autoencoders

Variational autoencoders (VAEs) are generative models that combine neural networks with latent variable models. A VAE consists of an encoder, which compresses the input into a latent representation, and a decoder, which seeks to reconstruct the input from that representation. It is a good example of how variational inference can be used to solve machine learning problems. Here we demonstrate how to implement a VAE in Pyro.


First let's import the packages and modules we need. Only PyTorch and Pyro are strictly required; everything else is optional if you prefer to use your own data loader, visualization library, etc.

In [2]:
import argparse
import numpy as np
import torch
import pyro
from torch.autograd import Variable
from pyro.infer.kl_qp import KL_QP
from pyro.distributions import DiagNormal, Normal
from pyro.util import ng_zeros, ng_ones

# modules from torch
import torch.nn as nn
import torchvision.datasets as dset
import torchvision.transforms as transforms
import torch.nn.functional as F
import torch.optim as optim

import visdom # (optional) for visualization

Next, let's load some data. We'll use [MNIST](http://yann.lecun.com/exdb/mnist/) for simplicity.

In [3]:
# path to data
root = './data'
download = True
trans = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.5,), (1.0,))])
train_set = dset.MNIST(
    root=root,
    train=True,
    transform=trans,
    download=download)
test_set = dset.MNIST(root=root, train=False, transform=trans)

# Use batch size of 128
batch_size = 128
kwargs = {'num_workers': 1, 'pin_memory': True}
train_loader = torch.utils.data.DataLoader(
    dataset=train_set,
    batch_size=batch_size,
    shuffle=True, **kwargs)
test_loader = torch.utils.data.DataLoader(
    dataset=test_set,
    batch_size=batch_size,
    shuffle=False, **kwargs)
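To see what the `Normalize((0.5,), (1.0,))` transform above does to pixel values, here is a minimal sketch in plain Python. It assumes the standard per-channel rule `(x - mean) / std`; the helper name `normalize` is hypothetical and only for illustration.

```python
def normalize(pixels, mean=0.5, std=1.0):
    """Apply the (x - mean) / std rule to a list of pixel intensities."""
    return [(p - mean) / std for p in pixels]

# ToTensor() scales raw MNIST bytes into [0, 1]; these are three such values.
scaled = [0.0, 0.5, 1.0]
shifted = normalize(scaled)
print(shifted)  # [-0.5, 0.0, 0.5]
```

With these settings the inputs are centered around zero and lie in $[-0.5, 0.5]$ rather than $[0, 1]$.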

## Encoder

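We first define our encoder, which takes an input image of size 784 (a flattened $28\times 28$ image) and encodes it to a latent representation of size 20. It returns the mean and standard deviation of each latent dimension.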

In [None]:
class Encoder(nn.Module):

    def __init__(self):
        super(Encoder, self).__init__()
        self.fc1 = nn.Linear(784, 200)
        self.fc21 = nn.Linear(200, 20)
        self.fc22 = nn.Linear(200, 20)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = x.view(-1, 784)
        h1 = self.relu(self.fc1(x))
        return self.fc21(h1), torch.exp(self.fc22(h1))
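The encoder returns a mean and a positive standard deviation for each of the 20 latent dimensions. As a rough sketch of how a latent sample can be drawn from such outputs, the snippet below uses the reparameterization $z = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The tensors `mu` and `log_sigma` are made-up stand-ins for the outputs of `fc21` and `fc22`, and the snippet assumes a recent PyTorch release; it is illustrative only and not part of the Pyro code in this tutorial.

```python
import torch

torch.manual_seed(0)

batch_size, latent_dim = 128, 20

# Stand-ins for the two encoder heads (hypothetical values, not real outputs).
mu = torch.zeros(batch_size, latent_dim)
log_sigma = torch.full((batch_size, latent_dim), -1.0)

# exp() keeps the standard deviation strictly positive, as in Encoder.forward.
sigma = torch.exp(log_sigma)

# Reparameterized sample: the noise is drawn independently of mu and sigma.
eps = torch.randn_like(sigma)
z = mu + sigma * eps

print(z.shape)
```

Because the randomness enters only through `eps`, gradients can flow from `z` back to `mu` and `sigma`.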

## Decoder

Next, our decoder will reconstruct the original input from the latent representation produced by the encoder in the previous step.

In [5]:
class Decoder(nn.Module):
    def __init__(self):
        super(Decoder, self).__init__()
        self.fc3 = nn.Linear(20, 200)
        self.fc4 = nn.Linear(200, 2 * 784)
        self.sigmoid = nn.Sigmoid()
        self.relu = nn.ReLU()

    def forward(self, z):
        h3 = self.relu(self.fc3(z))
        rv = (self.fc4(h3))

        # reshape to capture mu, sigma params for every pixel
        rvs = rv.view(z.size(0), -1, 2)

        # send back two params
        return rvs[:, :, 0], torch.exp(rvs[:, :, 1])
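The reshape in `forward` splits the `2 * 784` outputs of `fc4` into interleaved pairs, so even-indexed entries become means and odd-indexed entries become log standard deviations. A toy example with 3 "pixels" instead of 784 makes the layout visible (assuming a recent PyTorch release; the tensor values are arbitrary):

```python
import torch

# Pretend fc4 produced 2 * 3 values for each of 2 batch elements.
rv = torch.arange(12.0).view(2, 6)

# Same reshape as in Decoder.forward: (batch, 2 * pixels) -> (batch, pixels, 2).
rvs = rv.view(rv.size(0), -1, 2)

mu = rvs[:, :, 0]         # entries 0, 2, 4 of each row
log_sigma = rvs[:, :, 1]  # entries 1, 3, 5 of each row

print(mu.tolist())         # [[0.0, 2.0, 4.0], [6.0, 8.0, 10.0]]
print(log_sigma.tolist())  # [[1.0, 3.0, 5.0], [7.0, 9.0, 11.0]]
```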



## Model
Now we want to define our model and guide to do inference. 

## Inference
For this example, we are going to use variational inference to approximate the posterior. Minimizing the [Kullback–Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) between the guide and the posterior is equivalent to maximizing the evidence lower bound (ELBO). In Pyro, this can be done using `pyro.infer.ELBO`.
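The link between the KL divergence and the ELBO can be made explicit. Writing $q_\phi(\mathbf{z} \mid \mathbf{x})$ for the guide (encoder) and $p_\theta(\mathbf{x}, \mathbf{z})$ for the model (this notation is introduced here for illustration only), the log evidence decomposes as

\begin{align*}
\log p_\theta(\mathbf{x})
&= \underbrace{\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[ \log p_\theta(\mathbf{x}, \mathbf{z}) - \log q_\phi(\mathbf{z} \mid \mathbf{x}) \right]}_{\text{ELBO}}
+ \mathrm{KL}\left( q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x}) \right).
\end{align*}

Because the left-hand side does not depend on $\phi$, increasing the ELBO with respect to $\phi$ is the same as decreasing the KL divergence between the guide and the true posterior.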

In [None]:
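# KL_QP is the inference class imported above; per_param_args holds the
# arguments passed to the Adam optimizer
kl_optim = KL_QP(model, guide, pyro.optim(optim.Adam, per_param_args))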

We now have all the components we need for the VAE. To train it, we loop over the training set in minibatches for a number of epochs, taking one optimization step per minibatch.

In [None]:
def main(num_epochs=10):  # the number of epochs is an assumed default
    for i in range(num_epochs):
        epoch_loss = 0.
        # iterate over minibatches from the train_loader defined above;
        # labels are ignored since the VAE is unsupervised
        for batch_data, _ in train_loader:
            epoch_loss += kl_optim.step(Variable(batch_data))
        # average the accumulated loss over the number of training images
        print("epoch avg loss {}".format(epoch_loss / float(len(train_set))))

Insert results and visualizations here

## Data
We use training data from MNIST, which consists of 55,000 $28\times
28$ pixel images (LeCun, Bottou, Bengio, & Haffner, 1998). Each image is represented
as a flattened vector of 784 elements, and each element is a pixel
intensity between 0 and 1.

![GAN Fig 0](https://raw.githubusercontent.com/blei-lab/edward/master/docs/images/gan-fig0.png)


The goal is to build and infer a model that can generate high quality
images of handwritten digits.

During training we will feed batches of MNIST digits. We instantiate a
TensorFlow placeholder with a fixed batch size of $M$ images.

## Model

GANs posit generative models using an implicit mechanism. Given some
random noise, the data is assumed to be generated by a deterministic
function of that noise.

Formally, the generative process is

\begin{align*}
\mathbf{\epsilon} &\sim p(\mathbf{\epsilon}), \\
\mathbf{x} &= G(\mathbf{\epsilon}; \theta),
\end{align*}

where $G(\cdot; \theta)$ is a neural network that takes the samples
$\mathbf{\epsilon}$ as input. The distribution
$p(\mathbf{\epsilon})$ is interpreted as random noise injected to
produce stochasticity in a physical system; it is typically a fixed
uniform or normal distribution with some latent dimensionality.

In Edward, we build the model as follows, using TensorFlow Slim to
specify the neural network. It defines a 2-layer fully connected neural
network and outputs a vector of length $28\times28$ with values in
$[0,1]$.
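As a rough PyTorch counterpart (not the Edward/TensorFlow Slim code the text refers to), a generator of this shape might look like the sketch below. The noise dimension `noise_dim = 10` and the hidden width `hidden_dim = 128` are assumed values chosen only for illustration.

```python
import torch
import torch.nn as nn

noise_dim, hidden_dim = 10, 128  # hypothetical sizes, not from the text

# Two fully connected layers; the sigmoid maps outputs into [0, 1].
generator = nn.Sequential(
    nn.Linear(noise_dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, 28 * 28),
    nn.Sigmoid(),
)

# Draw a batch of noise vectors and push them through the network.
eps = torch.randn(16, noise_dim)
x = generator(eps)

print(x.shape)
```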

## Inference

A key idea in likelihood-free methods is to learn by
comparison (e.g., Rubin, 1984; Gretton, Borgwardt, Rasch, Schölkopf, & Smola, 2012): by
analyzing the discrepancy between samples from the model and samples
from the true data distribution, we have information on where the
model can be improved in order to generate better samples.

In GANs, a neural network $D(\cdot;\phi)$, known as the discriminator,
makes this comparison.
$D(\cdot;\phi)$ takes data $\mathbf{x}$ as input (either
generations from the model or data points from the data set), and it
calculates the probability that $\mathbf{x}$ came from the true data.
The generator and discriminator are trained jointly through the minimax objective
\begin{equation*}
\min_\theta \max_\phi~
\mathbb{E}_{p^*(\mathbf{x})} [ \log D(\mathbf{x}; \phi) ]
+ \mathbb{E}_{p(\mathbf{x}; \theta)} [ \log (1 - D(\mathbf{x}; \phi)) ].
\end{equation*}
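To get a feel for this objective, the sketch below evaluates it on a handful of discriminator outputs using only the Python standard library. The function name `gan_value` and the probabilities are made up for illustration, and the expectations are replaced by sample averages.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E[log D(x_real)] + E[log(1 - D(x_fake))]."""
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from generated data outputs 0.5 everywhere.
confused = gan_value([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

# A discriminator that separates the two well scores a higher value.
confident = gan_value([0.9, 0.95, 0.99], [0.1, 0.05, 0.01])

print(confused, confident)
```

The discriminator parameters $\phi$ are updated to increase this value, while the generator parameters $\theta$ are updated to decrease it.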

In Edward, we use the following discriminative network. It is simply a
feedforward network with one ReLU hidden layer. It returns the
probability in the logit (unconstrained) scale.

We'll use Adam as the optimizer for both the generator and the discriminator.
We'll run the algorithm for 15,000 iterations and print progress every
1,000 iterations.

We now form the main loop which trains the GAN. At each iteration, it
takes a minibatch and updates the parameters according to the
algorithm. Every 1,000 iterations, it prints progress and
saves a figure of generated samples from the model.