Download the file below to your Google Colab files.


In [0]:
!wget https://raw.githubusercontent.com/tirzaelise/cv_training/master/bmnist.py

In [0]:
import argparse
import os

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from torchvision.utils import make_grid, save_image
import torchvision.transforms as transforms
from torchvision import datasets

from bmnist import bmnist

Initialize the GPU with the code below. Make sure that you are running on GPU.

In [0]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Part 1. Variational Auto Encoders
A VAE is a latent variable model that leverages the flexibility of Neural Networks (NN) in order to learn/specify a latent variable model. We will first briefly discuss Latent Variable Models and then dive into VAEs.

## 1.1 Latent Variable Models

A latent variable model is a statistical model that contains both
observed and unobserved (or latent) variables. Assume a dataset $\mathcal{D} = \{x_n\}^N_{n=1}$, where $x_n \in \{0, 1\}^M$. For example, $x_n$ can be the pixel values of a binary image. The simplest latent variable model for this data can be summarized with the following generative story:

$$
z_n \sim \mathcal{N}(0, I_D) \\
x_n \sim p_X(f_\theta(z_n))
$$

where $f_\theta$ is some function, parameterized by $\theta$, that maps $z_n$ to the parameters of a distribution over $x_n$. For example, if $p_X$ is a Gaussian distribution, we use $f_\theta: \mathbb{R}^D \rightarrow (\mathbb{R}^M, \mathbb{R}^M_+)$, and if $p_X$ is a product of Bernoulli distributions, we use $f_\theta: \mathbb{R}^D \rightarrow [0,1]^M$. Here, $D$ denotes the dimensionality of the latent space. Note that our dataset $\mathcal{D}$ does not contain $z_n$, hence $z_n$ is a latent (or unobserved) variable in our statistical model. In the case of a VAE, a (deep) NN is used for $f_\theta(\cdot)$.
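The generative story above can be sketched in a few lines of PyTorch. This is only an illustration: `f_theta` here is a hypothetical, untrained single-layer stand-in for the mapping from $z_n$ to Bernoulli parameters, and the sizes `D`, `M` and `N` are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

D, M, N = 2, 784, 5  # latent size, number of pixels, number of samples (illustrative)

# Hypothetical stand-in for f_theta: maps z to Bernoulli means in [0, 1]^M.
f_theta = nn.Sequential(nn.Linear(D, M), nn.Sigmoid())

with torch.no_grad():
    z = torch.randn(N, D)        # z_n ~ N(0, I_D)
    probs = f_theta(z)           # parameters of p_X, one per pixel
    x = torch.bernoulli(probs)   # x_n ~ product of Bernoulli distributions

print(z.shape, probs.shape, x.shape)
```

Only `x` would be observed in a dataset; `z` is discarded after sampling, which is what makes it latent.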

## 1.2 Decoder: Generative part of a VAE

In the previous section, a general graphical model for VAEs was given. In this section we define a more specific generative model that we will use throughout this assignment. This will later be referred to as the decoding part (or decoder) of a VAE. For this assignment we assume that the pixels of our images $x_n$ in $\mathcal{D}$ are Bernoulli($p$) distributed.

$$
p(z_n) = \mathcal{N}(0, I_D) \\
p(x_n|z_n) = \prod^M_{m=1} \text{Bern}(x_n^{(m)}|f_\theta(z_n)_m)
$$

where $x_n^{(m)}$ is the $m$th pixel of the $n$th image in $\mathcal{D}$, and $f_\theta: \mathbb{R}^D \rightarrow [0,1]^M$ is a neural network, parameterized by $\theta$, that outputs the mean of the Bernoulli distribution for each pixel in $x_n$.

$p_\theta(x|z)$ is a multivariate Bernoulli whose probabilities are computed from $z$ with a fully-connected neural network with a single hidden layer:

$$
\log p(x|z) = \sum^M_{i=1} \big[ x_i \log y_i + (1 - x_i) \log (1-y_i) \big] \\
\text{where } y = f_\sigma(W_2 f_{\text{ReLU}}(W_1z + b_1) + b_2)
$$

where $f_\sigma(\cdot)$ is the elementwise sigmoid activation function, $f_{\text{ReLU}}$ is the elementwise ReLU activation function, and $\theta = \{W_1, W_2, b_1, b_2\}$ are the weights and biases of the MLP.
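As a small numerical check of the Bernoulli log-likelihood above, the sketch below evaluates it directly and compares it with PyTorch's binary cross-entropy, which is its negative. The tensors `x` and `y` are made-up toy values with four pixels, not decoder outputs.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 0., 1., 1.]])      # toy binary "image" with 4 pixels
y = torch.tensor([[0.9, 0.2, 0.7, 0.6]])  # Bernoulli means, as a decoder would output

# Direct evaluation of sum_i [x_i log y_i + (1 - x_i) log(1 - y_i)]
log_px_given_z = (x * torch.log(y) + (1 - x) * torch.log(1 - y)).sum(dim=1)

# The same quantity via binary cross-entropy, summed over pixels
bce = F.binary_cross_entropy(y, x, reduction="none").sum(dim=1)

print(log_px_given_z, -bce)
```

Using `F.binary_cross_entropy` is one convenient option for the reconstruction term, because it clamps the logarithms internally and so avoids `log(0)`.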

In [0]:
class Decoder(nn.Module):

    def __init__(self, hidden_dim=500, z_dim=20):
        super().__init__()

    def forward(self, input):
        """
        Perform forward pass of decoder.
        Returns mean with shape [batch_size, 784].
        """
        mean = None
        raise NotImplementedError()

        return mean

## 1.3 Encoder: $q_\phi(z_n|x_n)$

We only want to sample $z_n$ for which $p(z_n|x_n)$ is not close to zero. One approach is to simply sample from $p(z_n|x_n)$ instead of sampling from $p(z_n)$. However, sampling from (or obtaining an analytical form of) $p(z_n|x_n)$ is intractable in the case of a VAE. Instead, we sample from a variational distribution $q(z_n|x_n)$, which we use to approximate the (intractable) posterior $p(z_n|x_n)$. One way to measure how close two distributions are to each other is the Kullback-Leibler divergence (KL-divergence):

$$
D_{\text{KL}}(q||p) = - \mathbb{E}_{q(x)} \Big [ \log \frac{p(X)}{q(X)} \Big ] = - \int q(x) \Big [ \log \frac{p(x)}{q(x)} \Big ] dx
$$

where $q$ and $p$ are probability distributions in the space of some random variable $X$.
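To get a feel for the KL-divergence, the sketch below evaluates the discrete analogue of the definition above, with the integral replaced by a sum. The two distributions are arbitrary toy values, and `kl` is a helper name introduced only for this illustration.

```python
import torch

q = torch.tensor([0.5, 0.3, 0.2])
p = torch.tensor([0.4, 0.4, 0.2])


def kl(q, p):
    """D_KL(q || p) = -sum_x q(x) log(p(x) / q(x)) for discrete distributions."""
    return -(q * torch.log(p / q)).sum()


print(kl(q, p), kl(p, q), kl(q, q))
```

The three printed values illustrate that the divergence is non-negative, is not symmetric in its arguments, and is zero when both distributions coincide.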

Now, if we write out the expression for the KL-divergence between our proposal $q(z_n|x_n)$ and our posterior $p(z_n|x_n)$, we can derive an expression for the probability of our data under our model:

$$
\begin{align}
D_{\text{KL}}(q(Z|x_n)||p(Z|x_n)) &= - \mathbb{E}_{q(z|x_n)} \Big [ \log \frac{p(Z|x_n)}{q(Z|x_n)} \Big ] \\
&= - \mathbb{E}_{q(z|x_n)} \Big [ \log \frac{p(x_n|Z)p(Z)}{q(Z|x_n)p(x_n)} \Big ] \\
 &= - \mathbb{E}_{q(z|x_n)} \Big [ \log \frac{p(Z)}{q(Z|x_n)} + \log p(x_n|Z) - \log p(x_n) \Big ] \\
 &= D_{\text{KL}}(q(Z|x_n)||p(Z)) - \mathbb{E}_{q(z|x_n)} [ \log p(x_n|Z)] + \log p(x_n)
\end{align}
$$

Hence,

$$\log p(x_n) - D_{\text{KL}}(q(Z|x_n)||p(Z|x_n)) = \mathbb{E}_{q(z|x_n)} [ \log p(x_n|Z)] - D_{\text{KL}}(q(Z|x_n)||p(Z)) $$

We have arranged the equation above so that directly computable quantities are on the right-hand side. Because the KL-divergence on the left-hand side is non-negative, the right-hand side is a lower bound on the log-probability of the data (the evidence lower bound, or ELBO). This is what we will optimize. We define our loss as the mean negative lower bound:

$$\mathcal{L}(\theta, \phi) = -\frac{1}{N} \sum^N_{n=1} \Big( \mathbb{E}_{q_\phi(z|x_n)} [\log p_\theta(x_n|Z)] - D_{\text{KL}}(q_\phi(Z|x_n) || p(Z)) \Big)$$

## 1.4 Specifying the encoder

In VAEs, we have some freedom to choose the distribution $q_\phi(z_n|x_n)$. In essence we want to choose something that can closely approximate $p(z_n|x_n)$, but we are also free to select a distribution that makes our life easier. We will do exactly that in this case and choose $q_\phi(z_n|x_n)$ to be a factored multivariate normal distribution, i.e.,

$$q_\phi(z_n|x_n) = \mathcal{N}(z_n|\mu_\phi(x_n), \text{diag}(\Sigma_\phi(x_n))),$$

where $\mu_\phi: \mathbb{R}^M \rightarrow \mathbb{R}^D$ maps an input image to the mean of the multivariate normal over $z_n$ and $\Sigma_\phi: \mathbb{R}^M \rightarrow \mathbb{R}^D_{>0}$ maps the input image to the diagonal of the covariance matrix of that same distribution. Moreover, diag($v$) maps a $K$-dimensional (for any $K$) input vector $v$ to a $K \times K$ matrix such that for $i,j \in \{1, ..., K\}$:

$$
\text{diag}(v)_{ij} = 
\begin{cases} 
   v_i & \text{if } i = j \\
   0   & \text{otherwise}
\end{cases}
$$

The encoder is a multivariate Gaussian with a diagonal covariance structure:

$$
\begin{align}
\log q_\phi(z|x) &= \log \mathcal{N}(z; \mu, \sigma^2I) \\
\text{where } \mu &= W_4h+b_4 \\
\log \sigma^2 &= W_5h + b_5 \\
h &= f_{\text{ReLU}}(W_3 x + b_3)
\end{align}
$$

where $\{W_3, W_4, W_5, b_3, b_4, b_5\}$ are the weights and biases of the MLP and $f_{\text{ReLU}}$ is the elementwise ReLU activation function.
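The sketch below traces the encoder equations numerically. It is not the required implementation: the layer names `hidden`, `mean_head` and `log_var_head` are hypothetical, the sizes are illustrative, and the final sampling step uses the standard reparameterization $z = \mu + \sigma \odot \epsilon$, which this section does not spell out and is included here as an assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

M, H, D = 784, 500, 20  # input, hidden and latent sizes (illustrative values)

hidden = nn.Sequential(nn.Linear(M, H), nn.ReLU())  # h = ReLU(W_3 x + b_3)
mean_head = nn.Linear(H, D)                          # mu = W_4 h + b_4
log_var_head = nn.Linear(H, D)                       # log sigma^2 = W_5 h + b_5

x = torch.rand(8, M)                     # stand-in batch of flattened images
h = hidden(x)
mean = mean_head(h)
std = torch.exp(0.5 * log_var_head(h))   # exp keeps std strictly positive

# Assumed reparameterized sample: z = mu + sigma * eps with eps ~ N(0, I)
eps = torch.randn_like(std)
z = mean + std * eps

print(mean.shape, std.shape, z.shape)
```

Predicting $\log \sigma^2$ rather than $\sigma$ itself means the network output is unconstrained, while the exponential enforces the positivity constraint on the standard deviation.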

In [0]:
class Encoder(nn.Module):

    def __init__(self, hidden_dim=500, z_dim=20):
        super().__init__()

    def forward(self, input):
        """
        Perform forward pass of encoder.
        Returns mean and std with shape [batch_size, z_dim]. Make sure
        that any constraints are enforced.
        """
        mean, std = None, None
        raise NotImplementedError()

        return mean, std

## 1.5 Building a VAE

We can now implement a VAE in PyTorch. You may assume that the number of samples used to approximate the expectation in the loss is 1.
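One possible way to compute the loss with a single latent sample is sketched below. It relies on the closed-form KL-divergence between a diagonal Gaussian and a standard normal, $\frac{1}{2}\sum_d (\sigma_d^2 + \mu_d^2 - 1 - \log \sigma_d^2)$. The function names, argument names and the `eps` stabilizer are assumptions made for this illustration, not part of the assignment template.

```python
import torch


def gaussian_kl_to_standard_normal(mean, std):
    """KL( N(mean, diag(std^2)) || N(0, I) ), summed over latent dimensions."""
    var = std.pow(2)
    return 0.5 * (var + mean.pow(2) - 1.0 - torch.log(var)).sum(dim=1)


def negative_elbo(x, recon_mean, mean, std, eps=1e-8):
    """Mean negative ELBO over a batch, using one latent sample per data point.

    x:          binary images, shape [batch_size, M]
    recon_mean: Bernoulli means decoded from one sampled z, shape [batch_size, M]
    mean, std:  parameters of q(z|x), shape [batch_size, D]
    """
    log_px_given_z = (x * torch.log(recon_mean + eps)
                      + (1 - x) * torch.log(1 - recon_mean + eps)).sum(dim=1)
    kl = gaussian_kl_to_standard_normal(mean, std)
    return (kl - log_px_given_z).mean()


# Toy usage with made-up values
x = torch.tensor([[1., 0.]])
loss = negative_elbo(x, torch.tensor([[0.9, 0.1]]),
                     torch.zeros(1, 2), torch.ones(1, 2))
print(loss)
```

In the toy call the approximate posterior equals the prior, so the KL term vanishes and the loss reduces to the negative reconstruction log-likelihood.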

In [0]:
class VAE(nn.Module):

    def __init__(self, hidden_dim=500, z_dim=20):
        super().__init__()

        self.z_dim = z_dim
        self.encoder = Encoder(hidden_dim, z_dim)
        self.decoder = Decoder(hidden_dim, z_dim)

    def forward(self, input):
        """
        Given input, perform an encoding and decoding step and return the
        negative average elbo for the given batch.
        """
        average_negative_elbo = None
        raise NotImplementedError()
        return average_negative_elbo

    def sample(self, n_samples):
        """
        Sample n_samples from the model. Return both the sampled images
        (from bernoulli) and the means for these bernoullis (as these are
        used to plot the data manifold).
        """
        sampled_ims, im_means = None, None
        raise NotImplementedError()

        return sampled_ims, im_means


In [0]:
def epoch_iter(model, data, optimizer):
    """
    Perform a single epoch for either training or validation.
    Use model.training to determine whether the model is in training mode.
    Returns the average elbo for the complete epoch.
    """
    average_epoch_elbo = None
    raise NotImplementedError()

    return average_epoch_elbo


def run_epoch(model, data, optimizer):
    """
    Run a train and validation epoch and return average elbo for each.
    """
    traindata, valdata = data

    model.train()
    train_elbo = epoch_iter(model, traindata, optimizer)

    model.eval()
    val_elbo = epoch_iter(model, valdata, optimizer)

    return train_elbo, val_elbo


def save_elbo_plot(train_curve, val_curve, filename):
    plt.figure(figsize=(12, 6))
    plt.plot(train_curve, label='train elbo')
    plt.plot(val_curve, label='validation elbo')
    plt.legend()
    plt.xlabel('epochs')
    plt.ylabel('ELBO')
    plt.tight_layout()
    plt.savefig(filename)

In [0]:
def vae_main(epochs, zdim):
    data = bmnist()[:2]  # ignore test split
    model = VAE(z_dim=zdim)
    optimizer = torch.optim.Adam(model.parameters())

    train_curve, val_curve = [], []
    for epoch in range(epochs):
        elbos = run_epoch(model, data, optimizer)
        train_elbo, val_elbo = elbos
        train_curve.append(train_elbo)
        val_curve.append(val_elbo)
        print(f"[Epoch {epoch}] train elbo: {train_elbo} val_elbo: {val_elbo}")

        # --------------------------------------------------------------------
        #  Add functionality to plot samples from model during training.
        #  You can use the make_grid functionality that is already imported.
        # --------------------------------------------------------------------

    save_elbo_plot(train_curve, val_curve, 'elbo.pdf')

epochs = 40
zdim = 20
vae_main(epochs, zdim)

Plot the estimated lower-bounds of your training and validation set as training progresses — using a 20-dimensional latent space.

Plot samples from your model at three points throughout training (before training, halfway through training, and after training). You should observe an improvement in the quality of samples.

# Part 2. Generative Adversarial Networks

Generative Adversarial Networks (GANs) are a type of deep generative model. Similar to VAEs, GANs can generate images that mimic images from the dataset by sampling an encoding from a noise distribution. In contrast to VAEs, in vanilla GANs there is no inference mechanism to determine an encoding or latent vector that corresponds to a given data point (or image).

One thing to notice is that a GAN consists of two separate networks (i.e., there is no parameter sharing or the like) called the generator and the discriminator. Training a GAN leverages an adversarial training scheme. In short, that means that instead of defining a loss function by hand (e.g., cross entropy or mean squared error), we train a network that acts as a loss function. In the case of a GAN this network is trained to discriminate between real images and fake (or generated) images, hence the name discriminator. The discriminator (together with the training data) then serves as a loss function for our generator network, which will learn to generate images that are similar to those in the training set. Both the generator and discriminator are trained jointly. In this assignment we will focus on obtaining a generator network that can generate images that are similar to those in the training set.

## 2.1 Training objective: A Minimax Game

In order to train a GAN we have to decide on a noise distribution $p(z)$. In this case we will use a standard Normal distribution. Given this noise distribution, the GAN training procedure is a minimax game between the generator and discriminator. This is best seen by inspecting the loss (or optimization objective):

$$\min_G \max_D V(D,G) = \min_G \max_D \mathbb{E}_{p_{\text{data}}(x)} [\log D(X)] + \mathbb{E}_{p_z(z)}[\log (1-D(G(Z)))]$$
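The objective above can be estimated on a mini-batch as sketched below. The discriminator outputs `d_real` and `d_fake` are made-up numbers standing in for $D(x)$ and $D(G(z))$, and the non-saturating generator loss at the end is a commonly used alternative that is not stated in this text, shown here only as an assumption.

```python
import torch
import torch.nn.functional as F

# Stand-in discriminator outputs in (0, 1) for real and generated images.
d_real = torch.tensor([0.9, 0.8, 0.7])
d_fake = torch.tensor([0.2, 0.1, 0.4])

# Mini-batch estimate of V(D, G)
value = torch.log(d_real).mean() + torch.log(1 - d_fake).mean()

# The discriminator maximizes V, i.e. minimizes the sum of two binary cross-entropies.
loss_D = (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
          + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))

# The generator minimizes the second term of V.
loss_G_minimax = torch.log(1 - d_fake).mean()

# Assumed alternative: the non-saturating loss, which maximizes log D(G(z)).
loss_G_non_saturating = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))

print(value, loss_D, loss_G_minimax, loss_G_non_saturating)
```

Expressing both players' losses through binary cross-entropy makes the roles explicit: the discriminator uses target 1 for real and 0 for generated images, while the non-saturating generator loss uses target 1 for generated images.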

## 2.2 Building a GAN

Now that the objective is specified and it is clear how the generator and discriminator should behave, we are ready to implement a GAN. In this part of the assignment you will implement a GAN in PyTorch.

In [0]:
class Generator(nn.Module):
    def __init__(self):
        super(Generator, self).__init__()

        # Construct generator. You are free to experiment with your model,
        # but the following is a good start:
        #   Linear latent_dim -> 128
        #   LeakyReLU(0.2)
        #   Linear 128 -> 256
        #   Bnorm
        #   LeakyReLU(0.2)
        #   Linear 256 -> 512
        #   Bnorm
        #   LeakyReLU(0.2)
        #   Linear 512 -> 1024
        #   Bnorm
        #   LeakyReLU(0.2)
        #   Linear 1024 -> 784
        #   Output non-linearity

    def forward(self, z):
        # Generate images from z
        pass


class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()

        # Construct discriminator. You are free to experiment with your model,
        # but the following is a good start:
        #   Linear 784 -> 512
        #   LeakyReLU(0.2)
        #   Linear 512 -> 256
        #   LeakyReLU(0.2)
        #   Linear 256 -> 1
        #   Output non-linearity

    def forward(self, img):
        # return discriminator score for img
        pass


def train_gan(dataloader, discriminator, generator, optimizer_G, optimizer_D, n_epochs, save_interval):
    for epoch in range(n_epochs):
        for i, (imgs, _) in enumerate(dataloader):

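            # Tensor.cuda() is not in-place, so the result must be assigned.
            # Use the discriminator's device so images and model always match.
            imgs = imgs.to(next(discriminator.parameters()).device)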

            # Train Generator
            # ---------------

            # Train Discriminator
            # -------------------
            optimizer_D.zero_grad()

            # Save Images
            # -----------
            batches_done = epoch * len(dataloader) + i
            if batches_done % save_interval == 0:
                # You can use the function save_image(Tensor (shape Bx1x28x28),
                # filename, number of rows, normalize) to save the generated
                # images, e.g.:
                # save_image(gen_imgs[:25],
                #            'images/{}.png'.format(batches_done),
                #            nrow=5, normalize=True)
                pass

In [0]:
def gan_main(n_epochs, batch_size, lr, latent_dim, save_interval):
    # Create output image directory
    os.makedirs('images', exist_ok=True)

    # load data
    dataloader = torch.utils.data.DataLoader(
        datasets.MNIST('./data/mnist', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.5,),
                                                (0.5,))])),
        batch_size=batch_size, shuffle=True)

    # Initialize models and optimizers
    generator = Generator()
    discriminator = Discriminator()
    optimizer_G = torch.optim.Adam(generator.parameters(), lr=lr)
    optimizer_D = torch.optim.Adam(discriminator.parameters(), lr=lr)

    # Start training
    train_gan(dataloader, discriminator, generator, optimizer_G, optimizer_D, n_epochs, save_interval)

    # You can save your generator here to re-use it to generate images for your
    # report, e.g.:
    # torch.save(generator.state_dict(), "mnist_generator.pt")

n_epochs = 200
batch_size = 64
lr = 0.0002
latent_dim = 100
save_interval = 500
gan_main(n_epochs, batch_size, lr, latent_dim, save_interval)

Sample 25 images from your trained GAN. Do this at the start of training, halfway through training and after training has terminated.

Edit the architecture of your GAN until you feel that you are getting good images.