In [None]:
%matplotlib inline

# load some utilities (for loading MNIST and plotting)
# also imports most Python modules
%run utils.py

# load MNIST training and test data sets
train_dataset, test_dataset = MNIST_datasets()

# load tensors with all MNIST images and labels
train_images, train_labels = MNIST()
test_images, test_labels = MNIST(test=True)

# Variational Autoencoder

Is it reasonable to assume that we can generate images similar to the ones in the MNIST data set by a linear transformation of low dimensional latent variables?

Let us reflect on the limitations of PCA and PPCA. In both models, data is encoded in a lower-dimensional space by a linear (or rather affine) transformation, and the lower-dimensional representation is decoded by another affine function. One implication is that relative distances are preserved between the original space and the lower-dimensional latent space: if two data samples are "close" to each other, then their latent encodings are also relatively "close" to each other, and, conversely, two latent encodings that are "close" to each other are decoded to two relatively "close" data points. This limits our ability to reconstruct images from their encodings and to generate new images.

## Task 1

Answer Question 4.7 in the lab instructions.

Let us consider a more flexible non-linear model that is given by 
\begin{align*}
  p(\mathbf{x} \,|\, \mathbf{z}) &= \mathcal{N}\left(\mathbf{x}; f(\mathbf{z}; \boldsymbol{\phi}), \sigma^2 \mathbf{I}_{784}\right), \\
  p(\mathbf{z}) &= \mathcal{N}(\mathbf{z}; \boldsymbol{0}, \mathbf{I}_2),
\end{align*}
where $\sigma^2 > 0$ and $f(\cdot; \boldsymbol{\phi}) \colon \mathbb{R}^2 \to \mathbb{R}^{784}$ is a nonlinear function with parameters $\boldsymbol{\phi}$. In this lab we model $f(\cdot; \boldsymbol{\phi})$ by a neural network with parameters $\boldsymbol{\phi}$, but other models could be used equally well.

This nonlinear model looks similar to the PPCA model above. However, in contrast to the PPCA model, for most classes of functions $f(\cdot; \boldsymbol{\phi})$ the marginal distribution of $\mathbf{x}$ is no longer a normal distribution, and typically there is not even a closed-form expression for its density.
Thus we usually cannot learn the parameters $\boldsymbol{\xi} = (\boldsymbol{\phi}, \sigma^2)$ in the same straightforward way as for the PPCA model, i.e., by minimizing the exact negative log-likelihood with gradient descent.

## Training: First approach ("naive")

If there exists no closed-form expression for $p(\mathbf{x}; \boldsymbol{\xi})$,
one could instead approximate the integral
\begin{equation*}
  p(\mathbf{x}; \boldsymbol{\xi}) = \int p(\mathbf{x} \,|\, \mathbf{z}; \boldsymbol{\xi}) p(\mathbf{z}) \,\mathrm{d}\mathbf{z}
\end{equation*}
with the Monte Carlo method by a finite sum, i.e., one could work with the
estimate
\begin{equation}\label{eq:MC}
  p(\mathbf{x}; \boldsymbol{\xi}) \approx \frac{1}{K} \sum_{n=1}^K p(\mathbf{x} \,|\, \mathbf{z}_n; \boldsymbol{\xi}),
\end{equation}
where $\mathbf{z}_1, \ldots, \mathbf{z}_K$ are i.i.d. samples from the prior $p(\mathbf{z})$. Thus for our nonlinear model we would obtain
\begin{equation*}
  \begin{split}
  \log p(\mathbf{x}; \boldsymbol{\xi}) &\approx \log{\left(\sum_{n=1}^K \mathcal{N}(\mathbf{x}; f(\mathbf{z}_n; \boldsymbol{\phi}), \sigma^2 \mathbf{I}_{784}) \right)} - \log K \\
  &= \log{\left({(2\pi \sigma^2)}^{-392} \sum_{n=1}^K \exp{\left(-\frac{1}{2\sigma^2} {\|\mathbf{x} - f(\mathbf{z}_n; \boldsymbol{\phi})\|}_2^2 \right)}\right)} - \log K \\
  &= - 392 (\log \sigma^2 + \log{(2\pi)}) + \log{\left(\sum_{n=1}^K \exp{\left(-\frac{1}{2\sigma^2} {\|\mathbf{x} - f(\mathbf{z}_n; \boldsymbol{\phi})\|}_2^2 \right)}\right)} - \log K.
  \end{split}
\end{equation*}
Hence, similar to the PPCA model, with a training data set $\mathbf{x}_1, \ldots, \mathbf{x}_N$ we can neglect the additive constants, average the negative estimates over the data set, and try to minimize the cost function
\begin{equation*}
  J(\boldsymbol{\xi}) = 392 \log \sigma^2 - \frac{1}{N} \sum_{n'=1}^N \log{\left(\sum_{n=1}^K \exp{\left(-\frac{1}{2\sigma^2} {\|\mathbf{x}_{n'} - f(\mathbf{z}_n; \boldsymbol{\phi})\|}_2^2 \right)}\right)}
\end{equation*}
using gradient descent.
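As a sanity check of the Monte Carlo estimate, the following sketch uses a hypothetical one-dimensional toy model (not the MNIST model) with a linear decoder $f(z) = a z$, for which the marginal is known exactly: $p(x) = \mathcal{N}(x; 0, a^2 + \sigma^2)$. The names `log_sum_exp`, `log_normal_pdf`, and `mc_log_marginal` are illustrative helpers introduced here. The sum of exponentials is evaluated with the usual max-shift, which is the same idea that `torch.logsumexp` relies on to avoid underflow.

```python
import math
import random


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_normal_pdf(x, mean, var):
    """Log-density of a one-dimensional normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def mc_log_marginal(x, a, sigma2, K, rng):
    """Estimate log p(x) with K prior samples z_n ~ N(0, 1).

    Toy assumption: the decoder is f(z) = a * z, so p(x | z) = N(x; a z, sigma2).
    """
    log_terms = [
        log_normal_pdf(x, a * rng.gauss(0.0, 1.0), sigma2) for _ in range(K)
    ]
    return log_sum_exp(log_terms) - math.log(K)


rng = random.Random(0)
x, a, sigma2 = 0.5, 1.0, 1.0

# exact log-marginal of the toy model: x ~ N(0, a^2 + sigma2)
log_p_exact = log_normal_pdf(x, 0.0, a * a + sigma2)
log_p_mc = mc_log_marginal(x, a, sigma2, 20_000, rng)

print(f"exact: {log_p_exact:.4f}, Monte Carlo: {log_p_mc:.4f}")
```

In this toy setting the estimate is accurate because the prior samples frequently land where $p(x \,|\, z)$ is non-negligible. For a 784-dimensional observation and only a handful of samples, this is typically no longer the case.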

We now consider the following nonlinear model. The function $f(\cdot; \boldsymbol{\phi})$ is modeled by a shallow neural network. The `decode` function outputs the representative decoding $f(\mathbf{z}; \boldsymbol{\phi})$ for a batch of encodings $\mathbf{z}$.

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class NonLinearModel(nn.Module):

    def __init__(self):
        super(NonLinearModel, self).__init__()

        # linear parts of the nonlinear function f
        self.decoder_fc1 = nn.Linear(2, 400)
        self.decoder_fc2 = nn.Linear(400, 784)

        # logarithm of variance sigma^2
        self.logsigma2 = nn.Parameter(torch.zeros(1))

    def decode(self, z):
        h1 = F.relu(self.decoder_fc1(z))
        return self.decoder_fc2(h1)

## Task 2

Read through the definition of the nonlinear model and try to understand how $f(\cdot; \boldsymbol{\phi})$ is defined.

We start by implementing the cost function that we derived above. One important part in the evaluation of the cost function is the sampling of $\mathbf{z}_1, \ldots, \mathbf{z}_K$ and the computation of the decodings $f(\mathbf{z}_i; \boldsymbol{\phi})$.

## Task 3

Implement the following function `sample_decode(model, K)` that returns a matrix of size $K \times 784$ whose rows are the decodings $f(\mathbf{z}_i; \boldsymbol{\phi}) \in \mathbb{R}^{784}$ of samples $\mathbf{z}_1, \ldots, \mathbf{z}_K \in \mathbb{R}^2$ from $\mathcal{N}(\boldsymbol{0}, \mathbf{I}_2)$, where $\boldsymbol{\phi}$ are the parameters of the nonlinear model `model`.

*Hint*: You can sample a PyTorch matrix of size $n \times m$ with standard normally distributed entries with [`torch.randn(n, m)`](https://pytorch.org/docs/stable/torch.html#torch.randn).

In [None]:
def sample_decode(model, K):
    # sample z
    # WRITE YOUR CODE HERE
    
    # compute f(z)
    # WRITE YOUR CODE HERE
    
    return # WRITE YOUR CODE HERE

We use the function `sample_decode` to evaluate the cost function.

In [None]:
def cost_function(X, model, K):
    # decodings f(z) of samples z
    fz = sample_decode(model, K)
    
    # compute y with y[j, i] = - 1/(2 * sigma^2) * || x_j - f(z_i) ||^2_2
    y = - 0.5 * torch.exp(-model.logsigma2) * (X.view(-1, 1, 784) - fz.view(1, -1, 784)).pow(2).sum(dim=2)
    
    # compute loss
    return 392 * model.logsigma2 - torch.logsumexp(y, dim=1).mean()

We train the nonlinear model with stochastic gradient descent using batches of 500 images of the MNIST training data set for 20 epochs. In every iteration we use $K=5$ randomly sampled latent vectors $\mathbf{z}_n$ to evaluate the cost function.

In [None]:
import torch.optim as optim

# define the data loaders
train_data = torch.utils.data.DataLoader(train_dataset, batch_size=500, shuffle=True)
test_data = torch.utils.data.DataLoader(test_dataset, batch_size=500)

# define the model
model = NonLinearModel()

# define the optimizer
optimizer = optim.Adam(model.parameters(), lr=0.01)

# track the training and test loss
training_loss = []
test_loss = []

# optimize parameters for 20 epochs
for i in range(20):

    # for each minibatch
    for x, _ in train_data:

        # evaluate the cost function on the training data set
        loss = cost_function(x, model, 5)

        # update the statistics
        training_loss.append(loss.item())
        test_loss.append(float('nan'))

        # perform backpropagation
        loss.backward()

        # perform a gradient descent step
        optimizer.step()
        
        # reset the gradient information
        optimizer.zero_grad()

    # evaluate the model after every epoch
    with torch.no_grad():

        # evaluate the cost function on the test data set
        accumulated_loss = 0
        for x, _ in test_data:
            loss = cost_function(x, model, 5)
            accumulated_loss += loss.item()
            
        # update the statistics
        test_loss[-1] = accumulated_loss / len(test_data)
            
    print(f"Epoch {i + 1:2d}: training loss {training_loss[-1]: 9.3f}, "
          f"test loss {test_loss[-1]: 9.3f}")
        
# plot loss
plt.figure()
iterations = np.arange(1, len(training_loss) + 1)
plt.scatter(iterations, training_loss, label='training loss')
plt.scatter(iterations, test_loss, label='test loss')
plt.legend()
plt.xlabel('iteration')
plt.show()

## Task 4

Read through and try to understand the implementation of the training procedure above. How does it differ from the implementation of the training procedure of the PPCA model?

We can use the trained model to generate new images, in the same way as we did with the PPCA model. Again we sample 25 encodings $\mathbf{z}_1, \ldots, \mathbf{z}_{25}$ from $\mathcal{N}(\boldsymbol{0}, \mathbf{I}_{2})$, and plot their representative decodings $f(\mathbf{z}_n; \boldsymbol{\phi})$.

In [None]:
with torch.no_grad():
    x = sample_decode(model, 25)

plot_images(x)

## Training: second approach (variational autoencoder)

### Introduction

There are at least two problems with the nonlinear model and our training procedure so far:
- As mentioned in [Doersch's
tutorial](https://arxiv.org/pdf/1606.05908.pdf), for most samples of
$\mathbf{z}$, $p(\mathbf{x} \,|\, \mathbf{z}; \boldsymbol{\xi})$ is almost zero. Hence these
terms will not contribute much to the estimation of $p(\mathbf{x}; \boldsymbol{\xi})$,  which can slow down the training procedure.
- An even more fundamental problem with the nonlinear model is the fact that
usually we cannot compute an analytical expression for
$p(\mathbf{z} \,|\, \mathbf{x}; \boldsymbol{\xi})$. Thus we cannot encode
data in the lower-dimensional latent space and analyze its structure.

The main idea of a so-called variational autoencoder (VAE) is to resolve the first issue by attempting to obtain samples
$\mathbf{z}$ for which $p(\mathbf{x} \,|\, \mathbf{z}; \boldsymbol{\xi})$ is large.

### ELBO

Let us define an encoding distribution $q(\mathbf{z}; \mathbf{x}, \boldsymbol{\zeta})$ that may depend on $\mathbf{x}$ and some parameters $\boldsymbol{\zeta}$. As shown in the lecture, we then have
\begin{equation*}
    \log p(\mathbf{x}; \boldsymbol{\xi}) =
    \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}; \mathbf{x}, \boldsymbol{\zeta})}\big[\log p(\mathbf{x} \,|\, \mathbf{z}; \boldsymbol{\xi})\big]
    + \mathrm{KL}\big[q(\mathbf{z}; \mathbf{x}, \boldsymbol{\zeta}) \,\big\|\, p(\mathbf{z} \,|\,\mathbf{x}; \boldsymbol{\xi})\big] - \mathrm{KL}\big[q(\mathbf{z}; \mathbf{x}, \boldsymbol{\zeta}) \,\big\|\, p(\mathbf{z}; \boldsymbol{\xi})\big],
\end{equation*}
for all $\mathbf{x}$ (we ignore here that the KL-divergence $\mathrm{KL}(p \,\|\, q)$ is only defined if $q(x) = 0$ implies $p(x) = 0$).

The KL divergence of two distributions is always non-negative, and zero if and only if the two distributions are equal. Hence we have for all $\mathbf{x}$
\begin{equation*}
  \log p(\mathbf{x}; \boldsymbol{\xi}) \geq
  \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}; \mathbf{x}, \boldsymbol{\zeta})}\big[\log p(\mathbf{x} \,|\, \mathbf{z}; \boldsymbol{\xi})\big]
  - \mathrm{KL}\big[q(\mathbf{z}; \mathbf{x}, \boldsymbol{\zeta}) \,\big\|\, p(\mathbf{z}; \boldsymbol{\xi})\big]
\end{equation*}
with equality if and only if $q(\mathbf{z}; \mathbf{x}, \boldsymbol{\zeta})$ is equal to
the distribution $p(\mathbf{z} \,|\, \mathbf{x}; \boldsymbol{\xi})$. Since the
right-hand side of this inequality is a lower bound of the evidence $\log p(\mathbf{x}; \boldsymbol{\xi})$, it is called the evidence lower bound (ELBO).
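The decomposition can be checked exactly on a small discrete example. The sketch below assumes a hypothetical toy model in which $\mathbf{z}$ takes only three values and $\mathbf{x}$ is a single fixed observation, so that all expectations are finite sums; the numbers and the helper names `kl` and `elbo` are made up for illustration.

```python
import math

# Hypothetical toy model: z takes three values, x is one fixed observation.
prior = [0.5, 0.3, 0.2]          # p(z)
likelihood = [0.10, 0.40, 0.70]  # p(x | z) for the fixed x
q = [0.2, 0.3, 0.5]              # an arbitrary encoding distribution q(z; x)

# evidence p(x) and posterior p(z | x) by Bayes' rule
evidence = sum(pz * px for pz, px in zip(prior, likelihood))
posterior = [pz * px / evidence for pz, px in zip(prior, likelihood)]


def kl(a, b):
    """KL divergence between two discrete distributions given as lists."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)


def elbo(q_dist):
    """E_q[log p(x | z)] - KL[q || p(z)] for the toy model."""
    expected_loglik = sum(qz * math.log(px) for qz, px in zip(q_dist, likelihood))
    return expected_loglik - kl(q_dist, prior)


log_evidence = math.log(evidence)
gap = kl(q, posterior)

print(f"log evidence:          {log_evidence:.6f}")
print(f"ELBO + KL[q || post.]: {elbo(q) + gap:.6f}")
print(f"ELBO at q = posterior: {elbo(posterior):.6f}")
```

The gap between the log evidence and the ELBO is exactly the KL divergence from $q$ to the posterior, and it vanishes when $q$ is the posterior.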

Remember that we know $p(\mathbf{z}; \boldsymbol{\xi}) = \mathcal{N}(\mathbf{z}; \boldsymbol{0}, \mathbf{I}_2)$ from the model specification. Moreover, $q(\mathbf{z}; \mathbf{x}, \boldsymbol{\zeta})$ is an
arbitrary distribution that we can define in such a way that we can
compute $\mathrm{KL}\big[q(\mathbf{z}; \mathbf{x}, \boldsymbol{\zeta}) \,\big\|\, p(\mathbf{z}; \boldsymbol{\xi})\big]$
analytically. From Question 3.5 in the lab instructions, we know that we obtain
\begin{equation*}
    \mathrm{KL}\big[q(\mathbf{z}; \mathbf{x}, \boldsymbol{\zeta}) \,\|\, p(\mathbf{z}; \boldsymbol{\xi})\big] = \frac{1}{2} \left(\sum_{n=1}^2 \sigma^2_n(\mathbf{x}; \boldsymbol{\zeta}) + \|\mu(\mathbf{x}; \boldsymbol{\zeta})\|^2_2 - 2 - \sum_{n=1}^2 \log \sigma^2_n(\mathbf{x}; \boldsymbol{\zeta})\right),
\end{equation*}
if we choose the encoding distribution
\begin{equation}\label{eq:encoder}
  q(\mathbf{z}; \mathbf{x}, \boldsymbol{\zeta}) = \mathcal{N}(\mathbf{z}; \mu(\mathbf{x}; \boldsymbol{\zeta}), \mathrm{diag}(\sigma^2_1(\mathbf{x}; \boldsymbol{\zeta}), \sigma^2_2(\mathbf{x}; \boldsymbol{\zeta}))),
\end{equation}
where $\mu(\cdot; \boldsymbol{\zeta}) \colon \mathbb{R}^{784} \to \mathbb{R}^2$ defines the mean of the normal distribution and $\sigma^2_1(\mathbf{x}; \boldsymbol{\zeta})$ and $\sigma^2_2(\mathbf{x}; \boldsymbol{\zeta})$ are the entries on the diagonal of the covariance matrix.
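The closed-form expression can be compared with a Monte Carlo estimate of $\mathbb{E}_{\mathbf{z} \sim q}[\log q(\mathbf{z}) - \log p(\mathbf{z})]$. The following sketch does this in plain Python for one hypothetical choice of mean and variances; the helper names are introduced here for illustration and are not part of the lab code.

```python
import math
import random


def kl_diag_gaussian_to_standard(mu, sigma2):
    """Closed-form KL[N(mu, diag(sigma2)) || N(0, I)] for equal-length lists."""
    return 0.5 * sum(s2 + m * m - 1.0 - math.log(s2) for m, s2 in zip(mu, sigma2))


def kl_monte_carlo(mu, sigma2, num_samples, rng):
    """Estimate the same KL divergence as the average of log q(z) - log p(z), z ~ q."""
    total = 0.0
    for _ in range(num_samples):
        # the covariance is diagonal, so the log-densities are sums over coordinates
        for m, s2 in zip(mu, sigma2):
            z = rng.gauss(m, math.sqrt(s2))
            log_q = -0.5 * (math.log(2 * math.pi * s2) + (z - m) ** 2 / s2)
            log_p = -0.5 * (math.log(2 * math.pi) + z * z)
            total += log_q - log_p
    return total / num_samples


rng = random.Random(0)
mu, sigma2 = [1.0, -0.5], [0.25, 2.0]  # hypothetical encoder outputs

kl_exact = kl_diag_gaussian_to_standard(mu, sigma2)
kl_estimate = kl_monte_carlo(mu, sigma2, 50_000, rng)

print(f"closed form: {kl_exact:.4f}, Monte Carlo: {kl_estimate:.4f}")
```

Having the KL term in closed form means that only the expected log-likelihood term of the ELBO has to be estimated by sampling.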

Thus for this encoding distribution the ELBO is
\begin{equation}
 \mathbb{E}_{\mathbf{z} \sim \mathcal{N}(\mathbf{z}; \mu(\mathbf{x}; \boldsymbol{\zeta}), \mathrm{diag}(\sigma^2_1(\mathbf{x}; \boldsymbol{\zeta}), \sigma^2_2(\mathbf{x}; \boldsymbol{\zeta})))}\big[\log p(\mathbf{x} \,|\, \mathbf{z}; \boldsymbol{\xi})\big]
 - \frac{1}{2} \left(\sum_{n=1}^2 \sigma^2_n(\mathbf{x}; \boldsymbol{\zeta}) + \|\mu(\mathbf{x}; \boldsymbol{\zeta})\|^2_2 - 2 - \sum_{n=1}^2 \log \sigma^2_n(\mathbf{x}; \boldsymbol{\zeta})\right),
\end{equation}
which we can approximate with the Monte Carlo estimate
\begin{equation}
  \frac{1}{K} \sum_{n=1}^K \log p(\mathbf{x} \,|\, \mathbf{z}_n; \boldsymbol{\xi}) - \frac{1}{2} \left(\sum_{n=1}^2 \sigma^2_n(\mathbf{x}; \boldsymbol{\zeta}) + \|\mu(\mathbf{x}; \boldsymbol{\zeta})\|^2_2 - 2 - \sum_{n=1}^2 \log \sigma^2_n(\mathbf{x}; \boldsymbol{\zeta})\right),
\end{equation}
where $\mathbf{z}_1, \ldots, \mathbf{z}_K$ are independent samples from $q(\mathbf{z}; \mathbf{x}, \boldsymbol{\zeta})= \mathcal{N}(\mathbf{z}; \mu(\mathbf{x}; \boldsymbol{\zeta}), \mathrm{diag}(\sigma^2_1(\mathbf{x}; \boldsymbol{\zeta}), \sigma^2_2(\mathbf{x}; \boldsymbol{\zeta})))$. The fundamental difference to the Monte Carlo estimate of the log-likelihood above is that previously we had to sample from the prior $p(\mathbf{z}; \boldsymbol{\xi}) = \mathcal{N}(\mathbf{z}; \boldsymbol{0}, \mathbf{I}_2)$, whereas now we sample from the encoding distribution $q$. Thus, hopefully, we can tune the parameters $\boldsymbol{\zeta}$ of the encoding distribution such that $p(\mathbf{x} \,|\, \mathbf{z}_n; \boldsymbol{\xi})$ increases, i.e., such that it becomes more likely to reconstruct $\mathbf{x}$ from the samples $\mathbf{z}_n$.

This observation motivates the idea of maximizing the ELBO by training parameters $\boldsymbol{\xi}$ and $\boldsymbol{\zeta}$ simultaneously instead of maximizing the evidence $\log p(\mathbf{x}; \boldsymbol{\xi})$ by training only $\boldsymbol{\xi}$. If the encoding distribution is flexible enough, we might even be able to obtain the distribution $p(\mathbf{z} \,|\, \mathbf{x}; \boldsymbol{\xi})$, in which case the ELBO is actually
equal to the evidence.

Additionally, if we manage to make the distribution $q$ equal to the distribution $p(\mathbf{z} \,|\, \mathbf{x}; \boldsymbol{\xi})$ (or at least close to it), we have found a way to encode our data: for a given data sample $\mathbf{x}$, we can just sample the latent encoding from $q(\mathbf{z}; \mathbf{x}, \boldsymbol{\zeta})$. So by introducing the encoding distribution and maximizing the ELBO, we might be able to solve both problems of the nonlinear model.

### Cost function

In the same way, for a set of training data $\mathbf{x}_1, \ldots, \mathbf{x}_N$ we obtain
\begin{equation*}
    \begin{split}
    \log p(\mathbf{x}_1, \ldots, \mathbf{x}_N; \boldsymbol{\xi}) &= \sum_{n=1}^N \log p(\mathbf{x}_n; \boldsymbol{\xi}) \\
    &\geq \sum_{n=1}^N \mathbb{E}_{\mathbf{z} \sim \mathcal{N}(\mathbf{z}; \mu(\mathbf{x}_n; \boldsymbol{\zeta}), \mathrm{diag}(\sigma^2_1(\mathbf{x}_n; \boldsymbol{\zeta}), \sigma^2_2(\mathbf{x}_n; \boldsymbol{\zeta})))}\big[\log p(\mathbf{x}_n \,|\, \mathbf{z}; \boldsymbol{\xi})\big] \\
   &\quad - \frac{1}{2} \sum_{n=1}^N \left(\sum_{i=1}^2 \sigma^2_i(\mathbf{x}_n; \boldsymbol{\zeta}) + \|\mu(\mathbf{x}_n; \boldsymbol{\zeta})\|^2_2 - 2 - \sum_{i=1}^2 \log \sigma^2_i(\mathbf{x}_n; \boldsymbol{\zeta})\right),
    \end{split}
\end{equation*}
and thus we can estimate the joint ELBO by
\begin{equation*}
    \begin{split}
     & \sum_{n=1}^N \frac{1}{K} \sum_{n'=1}^K \log p(\mathbf{x}_n \,|\, \mathbf{z}_{n,n'}; \boldsymbol{\xi})
  - \frac{1}{2} \sum_{n=1}^N \left(\sum_{i=1}^2 \sigma^2_i(\mathbf{x}_n; \boldsymbol{\zeta}) + \|\mu(\mathbf{x}_n; \boldsymbol{\zeta})\|^2_2 - 2 - \sum_{i=1}^2 \log \sigma^2_i(\mathbf{x}_n; \boldsymbol{\zeta})\right) \\
     ={}& - \frac{1}{2}\left(784 N \left(\log{(2\pi)} + \log{\sigma^2}\right) + \frac{1}{\sigma^2}\sum_{n=1}^N \frac{1}{K} \sum_{n'=1}^K \|f(\mathbf{z}_{n,n'}; \boldsymbol{\xi}) - \mathbf{x}_n\|^2_2 \right) \\
     & - \frac{1}{2} \sum_{n=1}^N \left(\sum_{i=1}^2 \sigma^2_i(\mathbf{x}_n; \boldsymbol{\zeta}) + \|\mu(\mathbf{x}_n; \boldsymbol{\zeta})\|^2_2 - 2 - \sum_{i=1}^2 \log \sigma^2_i(\mathbf{x}_n; \boldsymbol{\zeta})\right),
     \end{split}
\end{equation*}
where $\mathbf{z}_{n,1}, \ldots, \mathbf{z}_{n,K}$ are independent samples of the encoding distribution $\mathcal{N}(\mathbf{z}; \mu(\mathbf{x}_n; \boldsymbol{\zeta}), \mathrm{diag}(\sigma^2_1(\mathbf{x}_n; \boldsymbol{\zeta}), \sigma^2_2(\mathbf{x}_n; \boldsymbol{\zeta})))$.

Since we use a stochastic optimization algorithm anyway, we choose $K = 1$, i.e., we use only one Monte Carlo sample for each training data sample. After neglecting additive constant terms and multiplying by $-2/N$ (the negative sign turns the maximization of the ELBO into a minimization problem), we obtain the cost function
\begin{equation*}
 J(\boldsymbol{\xi}, \boldsymbol{\zeta}) = 784 \log{\sigma^2} + \frac{1}{N \sigma^2} \sum_{n=1}^N  \|f(\mathbf{z}_n; \boldsymbol{\xi}) - \mathbf{x}_n\|^2_2 + \frac{1}{N} \sum_{n=1}^N \left(\sum_{i=1}^2 \sigma^2_i(\mathbf{x}_n; \boldsymbol{\zeta}) + \|\mu(\mathbf{x}_n; \boldsymbol{\zeta})\|^2_2 - \sum_{i=1}^2 \log \sigma^2_i(\mathbf{x}_n; \boldsymbol{\zeta})\right),
\end{equation*}
where $\mathbf{z}_n$ are samples from $\mathcal{N}(\mathbf{z}; \mu(\mathbf{x}_n; \boldsymbol{\zeta}), \mathrm{diag}(\sigma^2_1(\mathbf{x}_n; \boldsymbol{\zeta}), \sigma^2_2(\mathbf{x}_n; \boldsymbol{\zeta})))$. We will minimize this cost function with stochastic gradient descent.

### Reparameterization trick

However, there is one problem: it is not clear how to differentiate through the sampling operation of $\mathbf{z}_n$ with respect to the parameters $\boldsymbol{\zeta}$ that determine the mean and variance of the distribution we sample from. We can generate samples from an arbitrary normal distribution by an affine transformation of standard normally distributed samples. Thus we can rewrite the cost function as
\begin{equation*}
    \begin{split}
 J(\boldsymbol{\xi}, \boldsymbol{\zeta}) &= 784 \log{\sigma^2} + \frac{1}{N \sigma^2} \sum_{n=1}^N  \|f(\mu(\mathbf{x}_n; \boldsymbol{\zeta}) + \mathrm{diag}(\sigma_1(\mathbf{x}_n; \boldsymbol{\zeta}), \sigma_2(\mathbf{x}_n; \boldsymbol{\zeta})) \boldsymbol{\epsilon}_n; \boldsymbol{\xi}) - \mathbf{x}_n\|^2_2 \\
 &\quad + \frac{1}{N} \sum_{n=1}^N \left(\sum_{i=1}^2 \sigma^2_i(\mathbf{x}_n; \boldsymbol{\zeta}) + \|\mu(\mathbf{x}_n; \boldsymbol{\zeta})\|^2_2 - \sum_{i=1}^2 \log \sigma^2_i(\mathbf{x}_n; \boldsymbol{\zeta})\right),
 \end{split}
\end{equation*}
where $\boldsymbol{\epsilon}_n$ are samples from the normal distribution $\mathcal{N}(\boldsymbol{0}, \mathbf{I}_2)$. Rewriting the cost function in this form is known as the reparameterization trick. It allows us to differentiate the cost function with respect to $\boldsymbol{\zeta}$ in a straightforward way, since the samples $\boldsymbol{\epsilon}_n$ no longer depend on the parameters.
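The effect of the reparameterization can be illustrated with automatic differentiation in PyTorch. The sketch below uses a hypothetical objective, $\mathbb{E}\big[\|\mathbf{z}\|_2^2\big] = \|\mu\|_2^2 + \sum_i \sigma_i^2$, instead of the VAE cost function, because its gradients are known exactly: $2\mu$ with respect to $\mu$, and $\sigma_i^2$ with respect to $\log \sigma_i^2$. The values of `mu` and `logsigma2` are made up for illustration.

```python
import torch

torch.manual_seed(0)

# hypothetical parameters of the sampling distribution (leaf tensors)
mu = torch.tensor([0.5, -1.0], requires_grad=True)
logsigma2 = torch.tensor([0.0, -2.0], requires_grad=True)

# reparameterized samples: z = mu + sigma * eps with eps ~ N(0, I);
# eps does not depend on the parameters, so gradients flow through mu and sigma
eps = torch.randn(100_000, 2)
z = mu + torch.exp(0.5 * logsigma2) * eps

# Monte Carlo estimate of E[||z||^2] = ||mu||^2 + sum_i sigma_i^2
objective = z.pow(2).sum(dim=1).mean()
objective.backward()

# compare the estimated gradients with the exact ones
print("gradient w.r.t. mu:       ", mu.grad, "exact:", 2 * mu.detach())
print("gradient w.r.t. logsigma2:", logsigma2.grad, "exact:", torch.exp(logsigma2.detach()))
```

The same construction, with the noise drawn independently of the parameters and then shifted and scaled, is what makes the sampled encodings differentiable with respect to the encoder outputs.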

We now consider the following model, which extends the nonlinear model above by an encoding distribution. The mean and the logarithm of the diagonal entries of the covariance matrix of the encoding distribution are modeled by a shallow neural network, and the `encode` function returns them for a batch of inputs. As above, the `decode` function outputs the representative decoding for a batch of encodings.

In [None]:
class VAE(nn.Module):

    def __init__(self):
        super(VAE, self).__init__()

        # linear parts of the nonlinear encoder
        self.encoder_fc1 = nn.Linear(784, 400)
        self.encoder_fc2_mean = nn.Linear(400, 2)
        self.encoder_fc2_logsigma2 = nn.Linear(400, 2)

        # linear parts of the nonlinear function f
        self.decoder_fc1 = nn.Linear(2, 400)
        self.decoder_fc2 = nn.Linear(400, 784)

        # logarithm of variance sigma^2
        self.logsigma2 = nn.Parameter(torch.zeros(1))

    def encode(self, x):
        h1 = F.relu(self.encoder_fc1(x))
        return self.encoder_fc2_mean(h1), self.encoder_fc2_logsigma2(h1)

    def decode(self, z):
        h1 = F.relu(self.decoder_fc1(z))
        return self.decoder_fc2(h1)

## Task 5

Read through and try to understand the definition of the nonlinear model.

As discussed above, one term of the cost function is the term
\begin{equation}
\frac{1}{N} \sum_{n=1}^N \left(\sum_{i=1}^2 \sigma^2_i(\mathbf{x}_n; \boldsymbol{\zeta}) + \|\mu(\mathbf{x}_n; \boldsymbol{\zeta})\|^2_2 - \sum_{i=1}^2 \log \sigma^2_i(\mathbf{x}_n; \boldsymbol{\zeta})\right),
\end{equation}
that originates from the KL divergence expression in the ELBO.

## Task 6

Implement the function `kl_divergence_term(Z_mu, Z_logsigma2)` that evaluates the expression above, where the rows of `Z_mu` are $\mu(\mathbf{x}_n; \boldsymbol{\zeta})$ and the rows of `Z_logsigma2` are $\begin{bmatrix} \log \sigma^2_1(\mathbf{x}_n; \boldsymbol{\zeta}) & \log \sigma^2_2(\mathbf{x}_n; \boldsymbol{\zeta})\end{bmatrix}^\intercal$.

*Hint*: You should probably use the PyTorch functions [`torch.mean`](https://pytorch.org/docs/stable/torch.html#torch.mean), [`torch.sum`](https://pytorch.org/docs/stable/torch.html#torch.sum), [`torch.exp`](https://pytorch.org/docs/stable/torch.html#torch.exp), and [`torch.pow`](https://pytorch.org/docs/stable/torch.html#torch.pow) in your implementation.

In [None]:
def kl_divergence_term(Z_mu, Z_logsigma2):
    # compute the average KL divergence to the standard normal
    # distribution, neglecting additive constant terms
    return # WRITE YOUR CODE HERE

We make use of the `kl_divergence_term` function in our implementation of the cost function.

In [None]:
def cost_function(X, model):
    # compute mean and log variance of the normal distribution of
    # the encoding z of input X
    Z_mu, Z_logsigma2 = model.encode(X)

    # compute the average KL divergence to the prior standard normal
    # distribution of z, neglecting additive constant terms
    kl = kl_divergence_term(Z_mu, Z_logsigma2)

    # sample z
    Z = Z_mu + torch.randn(Z_mu.size()) * torch.exp(0.5 * Z_logsigma2)

    # compute the representative decoding of z
    X_decoding = model.decode(Z)

    # compute the (scaled) negative average log-likelihood log p(x | z) of the
    # input x given its sampled encoding z, neglecting additive constant terms
    neg_log_evidence = 784 * model.logsigma2 + \
        torch.sum((X_decoding - X).pow(2) * torch.exp(- model.logsigma2), dim=1).mean()

    return neg_log_evidence + kl

Now we can train the nonlinear model.

In [None]:
# define the data loaders
train_data = torch.utils.data.DataLoader(train_dataset, batch_size=500, shuffle=True)
test_data = torch.utils.data.DataLoader(test_dataset, batch_size=500)

# define the model
model = VAE()

# define the optimizer
optimizer = optim.Adam(model.parameters(), lr=0.01)

# track the training and test loss
training_loss = []
test_loss = []

# optimize parameters for 20 epochs
for i in range(20):

    # for each minibatch
    for x, _ in train_data:

        # evaluate the cost function on the training data set
        loss = cost_function(x, model)

        # update the statistics
        training_loss.append(loss.item())
        test_loss.append(float('nan'))

        # perform backpropagation
        loss.backward()

        # perform a gradient descent step
        optimizer.step()
        
        # reset the gradient information
        optimizer.zero_grad()

    # evaluate the model after every epoch
    with torch.no_grad():

        # evaluate the cost function on the test data set
        accumulated_loss = 0
        for x, _ in test_data:
            loss = cost_function(x, model)
            accumulated_loss += loss.item()
            
        # update statistics
        test_loss[-1] = accumulated_loss / len(test_data)
            
    print(f"Epoch {i + 1:2d}: training loss {training_loss[-1]: 9.3f}, "
          f"test loss {test_loss[-1]: 9.3f}")
        
# plot loss
plt.figure()
iterations = np.arange(1, len(training_loss) + 1)
plt.scatter(iterations, training_loss, label='training loss')
plt.scatter(iterations, test_loss, label='test loss')
plt.legend()
plt.xlabel('iteration')
plt.show()

## Task 7

Read through and try to understand the implementation of the training procedure above. How does it differ from the implementation of the training procedure of the nonlinear model without encoder?

In contrast to the regular nonlinear model without encoder, we can encode the images of the MNIST data set and repeat the same analysis as for the PPCA model.

In [None]:
with torch.no_grad():
    # compute representative encoding of the training images
    train_encoding, _ = model.encode(train_images)

    # compute representative encoding of the test images
    test_encoding, _ = model.encode(test_images)

We visualize the encodings.

In [None]:
plot_encoding((train_encoding, train_labels), (test_encoding, test_labels))

For each of the digits 0, 1, $\ldots$, 9 we compute the average representation in the latent space by taking the mean of the encodings of the MNIST training data set.

In [None]:
# compute mean encoding
train_mean_encodings = mean_encodings(train_encoding, train_labels)

We visualize their location in the latent space.

In [None]:
plot_encoding((train_encoding, train_labels), (test_encoding, test_labels),
              train_mean_encodings, annotate=True)

Of course, we can also decode the latent encodings with our model.

In [None]:
# compute mean images
with torch.no_grad():
    train_mean_images = model.decode(train_mean_encodings)

plot_images(train_mean_images, torch.arange(10))

Let us get a feeling for the distribution in the latent space by defining and analysing a whole grid of encodings, spanned by the mean encodings of the digits "0" and "9".

In [None]:
# compute grid of latent vectors
zgrid = create_grid(train_mean_encodings[0], train_mean_encodings[9])

# visualize it
plot_encoding((train_encoding, train_labels), (test_encoding, test_labels), zgrid)

We show the corresponding decoded images.

In [None]:
# decode the grid of latent vectors
with torch.no_grad():
    xgrid = model.decode(zgrid)

plot_images(xgrid)

As in the previous parts of the lab session, we also compare the test images with their reconstructions to see how much information we lose by encoding the MNIST images in a two-dimensional space. We plot a set of images and their reconstructions, and compute the average squared reconstruction error as a more objective measure.

In [None]:
# compute reconstruction
with torch.no_grad():
    test_reconstruction = model.decode(test_encoding)

# compute average squared reconstruction error
sqerr = (test_images - test_reconstruction).pow(2).sum(dim=1).mean()
print(f"Average squared reconstruction error: {sqerr}")

plot_reconstruction(test_images, test_reconstruction, test_labels)

## Task 8

Now we have performed exactly the same analysis as for the regular and the probabilistic PCA. Compare your results and answer Questions 4.8, 4.9, and 4.10 in the lab instructions.

We can also generate new MNIST-like images in the same way as for the nonlinear model without encoder. Again we sample 25 vectors $\mathbf{z}_1, \ldots, \mathbf{z}_{25}$ from $\mathcal{N}(\boldsymbol{0}, \mathbf{I}_{2})$, and plot their representative decodings $f(\mathbf{z}_n; \boldsymbol{\phi})$.

In [None]:
with torch.no_grad():
    x = sample_decode(model, 25)

plot_images(x)

## Summary

We have trained a variational autoencoder that allows us to generate MNIST-like images. Adding a nonlinearity and an encoder seems to improve the quality of the samples. However, we also notice that the model is not perfect. Many further modifications of the decoder and encoder models are possible and could potentially improve the sampler. For instance, the dimension of the latent space can be increased (then the information loss by encoding the images in the latent space should be reduced). Alternatively the nonlinear decoder model can be changed: a more flexible model with increased number of layers in the neural network or a diagonal (or even full) covariance matrix could be used, or the outputs could be restricted to values between 0 and 1 (since we represent MNIST images as vectors with entries between 0 and 1).