# CSC412/2506  Assignment 3: Variational Auto Encoders

In this assignment we will learn how to perform efficient inference and learning in directed graphical models with continuous latent variables. We will use stochastic variational inference with automatic differentiation (SADVI) to approximate intractable posterior distributions.
We will implement the two gradient estimators discussed in lecture, Score Function and Reparameterization, and experimentally demonstrate their properties such as bias and variance.
We will use the reparameterization gradient estimator to optimize the ELBO of our latent variable model.

You can use automatic differentiation in your code.
You may also use a machine learning framework to specify the encoder and decoder neural networks, and to provide gradient-based optimizers such as ADAM.
However, you may not use any probabilistic modelling elements from these frameworks.
In particular, sampling from and evaluating densities under distributions must be written by you.

# Implementing the VAE [20pts]

In this assignment we will implement and investigate the Variational Auto Encoder on Binarized MNIST digits, as detailed in [Auto-Encoding Variational Bayes](https://arxiv.org/pdf/1312.6114.pdf) by Kingma and Welling (2013). Before starting, read this paper. In particular, we will implement the model as described in Appendix C.

## Load and Prepare Data

Load the MNIST dataset, binarize the images, and split them into a training set of 10000 images and a test set of 10000 images. Also partition the training set into minibatches of size M=100.

In [None]:
import math
import numpy as np
import torch
import torchvision
import torch.nn.functional as F
import torch.nn as nn
import torch.distributions as dist
import matplotlib.pyplot as plt

In [None]:
# You may use the script provided in A2 or dataloaders provided by framework
import data
N_data, train_images, train_labels, test_images, test_labels = data.load_mnist()

In [None]:
# Load MNIST and Set Up Data
train_images = np.round(train_images[0:10000])
train_labels = train_labels[0:10000]
test_images = np.round(test_images[0:10000])
test_labels = test_labels[0:10000]

train_images = torch.from_numpy(train_images).float()
train_labels = torch.from_numpy(train_labels).float()
test_images = torch.from_numpy(test_images).float()
test_labels = torch.from_numpy(test_labels).float()
print(f'N_data={N_data}')
print(f'train_images.shape={train_images.shape}')
print(f'train_labels.shape={train_labels.shape}')
print(f'test_images.shape={test_images.shape}')
print(f'test_labels.shape={test_labels.shape}')

In [None]:
def imshow(img):
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.axis('off')
    plt.show()

imshow(torchvision.utils.make_grid(train_images[:5].view(5,1,28,28),padding=1))
print(' '.join(f'{torch.argmax(train_labels[j]).item()}' for j in range(5)))

In [None]:
# Implemented batching for you
batch_size = 100
num_batches = int(np.ceil(len(train_images) / batch_size))

def batch_indices(iter):
    idx = iter % num_batches
    return slice(idx * batch_size, (idx+1) * batch_size)

batch_indices(0), train_images[batch_indices(0)].shape, batch_size, num_batches

## Distributions [5pts]

Implement code to sample from and evaluate the log-pdf of diagonal multivariate Gaussians $\mathcal{N}(x|\mu, \sigma^2 I)$ and Bernoulli distributions. For sampling from these distributions, you have access to samples from uniform and unit Gaussian distributions (`rand` and `randn`, respectively). Make sure to test that you've implemented these correctly by comparing to standard packages!

In [None]:
# sampler from Diagonal Gaussian x~N(μ,σ^2 I) (hint: use reparameterization trick here)

def sample_gaussian(mu,logsigma2):
    # http://blog.shakirm.com/2015/10/machine-learning-trick-of-the-day-4-reparameterisation-tricks/
    #
    epsilon = torch.randn_like(logsigma2)
    return mu + torch.exp(0.5*logsigma2)*epsilon

# sampler from Bernoulli

def sample_bernoulli(p):
    return (torch.rand_like(p) < p).float()

# log-pdf of x under Diagonal Gaussian N(x|μ,σ^2 I)
#
# x           (batch_size, n_x)
# mu          (batch_size, n_x)
# log_sigma2  (batch_size, n_x)
#
def logpdf_gaussian(x,mu,logsigma2):
    # batch dot product: https://github.com/pytorch/pytorch/issues/18027
    #
    # overflow problem fix: put 1/sigma^2 \circ (x-mu) first, ....
    #
    # sum over the last dimension throughout, so 1-D inputs also work
    return (-mu.shape[-1]/2)*math.log(2*math.pi) - \
        (1/2)*torch.sum(logsigma2,dim=-1) - \
        (1/2)*torch.sum(torch.exp(-logsigma2)*(x-mu)*(x-mu),dim=-1)
    

# log-pdf of x under Bernoulli 
#
# x    (batch_size, n_x)
# p    (batch_size, n_x)
#
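def logpdf_bernoulli(x,p):
    # written by hand, since the framework's distribution classes are not allowed here;
    # clamp p away from 0 and 1 so that both logs stay finite
    p = torch.clamp(p,1e-6,1-1e-6)
    return torch.sum(x*torch.log(p) + (1-x)*torch.log(1-p),dim=-1)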
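
One way to run the comparison against a standard package is sketched below. It is a minimal check under stated assumptions: `torch.distributions` is used only as the reference, the two log-pdfs are restated in hand-written form so the snippet runs on its own, and the test inputs are arbitrary random values.

```python
import math
import torch
import torch.distributions as dist

def logpdf_gaussian(x, mu, logsigma2):
    # Hand-written diagonal Gaussian log-density, summed over the last dimension
    return (-mu.shape[-1] / 2) * math.log(2 * math.pi) \
        - 0.5 * torch.sum(logsigma2, dim=-1) \
        - 0.5 * torch.sum(torch.exp(-logsigma2) * (x - mu) ** 2, dim=-1)

def logpdf_bernoulli(x, p):
    # Hand-written Bernoulli log-mass, summed over the last dimension
    return torch.sum(x * torch.log(p) + (1 - x) * torch.log(1 - p), dim=-1)

torch.manual_seed(0)
x = torch.randn(4, 3)
mu = torch.randn(4, 3)
logsigma2 = torch.randn(4, 3)

# Reference: independent Normals with standard deviation exp(0.5 * logsigma2)
ref_gauss = dist.Normal(mu, torch.exp(0.5 * logsigma2)).log_prob(x).sum(dim=-1)
gauss_ok = torch.allclose(logpdf_gaussian(x, mu, logsigma2), ref_gauss, atol=1e-5)

p = torch.rand(4, 3) * 0.98 + 0.01   # keep probabilities away from 0 and 1
b = (torch.rand(4, 3) < p).float()   # binary observations
ref_bern = dist.Bernoulli(probs=p).log_prob(b).sum(dim=-1)
bern_ok = torch.allclose(logpdf_bernoulli(b, p), ref_bern, atol=1e-5)

print(gauss_ok, bern_ok)
```

The samplers can be checked in a similar spirit, for example by comparing the empirical mean and variance of many draws to the requested parameters.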

## Defining Model Architecture [5pts]

Implement the model as defined in Appendix C. The MLPs will have a single hidden layer with Dh=500 hidden units. The dimensionality of the latent space will be Dz=2 for visualization purposes later.

Note that the output of the encoder will be $[\mu,\log\sigma^2]$. Why not output $\sigma^2$ directly? Keep this in mind when you sample from the distribution using your Diagonal Gaussian sampler.
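
As a hint towards this question, the small NumPy sketch below contrasts the two interpretations of an unconstrained network output. The values in `raw` are made up for illustration and do not come from the model.

```python
import numpy as np

raw = np.array([-3.0, -0.5, 0.0, 2.0])   # arbitrary real-valued network outputs

# Interpreted as log-variances, every output maps to a valid standard deviation
sigma = np.exp(0.5 * raw)

# Interpreted as variances, negative outputs have no valid standard deviation
with np.errstate(invalid="ignore"):
    sigma_direct = np.sqrt(raw)          # NaN wherever raw < 0

print(sigma)
print(sigma_direct)
```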

In [None]:
class Encoder(nn.Module):
    
    def __init__(self,n_x,n_hidden,n_z):
        super(Encoder, self).__init__()
        
        self.fc1= nn.Linear(n_x,n_hidden)
        self.fc2_mu = nn.Linear(n_hidden,n_z)
        self.fc2_logsigma2 = nn.Linear(n_hidden,n_z)
        
    def forward(self,x):
        
        h = torch.tanh(self.fc1(x))
        mu = self.fc2_mu(h)
        logsigma2 = self.fc2_logsigma2(h)
        
        return mu, logsigma2
    

class StochasticLayer(nn.Module):
    
    def __init__(self):
        super(StochasticLayer, self).__init__()
        pass
    
    def forward(self,mu,logsigma2):
        z = sample_gaussian(mu,logsigma2)
        return z
    
class Decoder(nn.Module):
    
    def __init__(self,n_x,n_hidden,n_z):
        super(Decoder, self).__init__()
        
        self.fc1 = nn.Linear(n_z,n_hidden)
        self.fc2 = nn.Linear(n_hidden,n_x)
        
    def forward(self, z):
        h = torch.tanh(self.fc1(z))
        y = torch.sigmoid(self.fc2(h))
        
        return y

## Variational Objective [7pts]

Here we will use the log-pdfs, the encoder, gaussian sampler, and decoder to define the Monte Carlo estimator for the mean of the ELBO over the minibatch.
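
For reference, with a single reparameterized sample $z_i = \mu_i + \sigma_i \odot \epsilon_i$, $\epsilon_i \sim \mathcal{N}(0, I)$, for each of the $M$ images $x_i$ in the minibatch, the estimator can be written as follows (the notation $\hat{\mathcal{L}}$ is introduced here only for convenience):

$$\hat{\mathcal{L}} = \frac{1}{M}\sum_{i=1}^{M}\left[\log p(x_i|z_i) + \log p(z_i) - \log q(z_i|x_i)\right]$$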

In [None]:
def to_numpy(tensor):
    return tensor.data.cpu().numpy()

def variational_objective(x,mu,logsigma2,z,y):

    # log_q(z|x) log probability of z under approximate posterior N(μ,σ^2)

    log_approxposterior_prob = logpdf_gaussian(z,mu,logsigma2)

    # log_p_z(z) log probability of z under prior

    log_prior_prob = logpdf_gaussian(z,torch.zeros_like(z),torch.log(torch.ones_like(z)))

    # log_p(x|z) - conditional probability of data given latents.

    log_likelihood_prob = logpdf_bernoulli(x,y)

    # Monte Carlo Estimator of mean ELBO with Reparameterization over M minibatch samples.
    # This is the average ELBO over the minibatch
    # Unlike the paper, do not use the closed form KL between two gaussians,
    # Following eq (2), use the above quantities to estimate ELBO as discussed in lecture

    # number of samples = 1
    elbo = torch.mean(-log_approxposterior_prob + log_likelihood_prob + log_prior_prob)
    
#     print([to_numpy(torch.mean(t)) for t in [-log_approxposterior_prob,log_likelihood_prob,log_prior_prob]])

    return elbo

## Optimize with Gradient Descent

Minimize the negative ELBO with the ADAM optimizer. You may use the optimizer provided by your framework.

In [None]:
# Load Saved Model Parameters (if you've already trained)

trained = True
n_x = 28*28
n_hidden = 500
n_z = 2

encoder = Encoder(n_x,n_hidden,n_z)
stochasticlayer = StochasticLayer()
decoder = Decoder(n_x,n_hidden,n_z)

if trained:
    device = torch.device("cpu")
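    # map_location lets parameters saved on a GPU be loaded on the CPU
    encoder.load_state_dict(torch.load('encoder.pt',map_location=device))
    decoder.load_state_dict(torch.load('decoder.pt',map_location=device))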
else:
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

encoder.to(device)
stochasticlayer.to(device)
decoder.to(device)

print(f'{encoder}\n{stochasticlayer}\n{decoder}')

In [None]:
# Set up ADAM optimizer

torch.manual_seed(0)
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

n_epochs = 200
n_batches_print = 100

for epoch in range(n_epochs):

    running_loss = 0.0
    for it in range(num_batches):
        
        iter_images = train_images[batch_indices(it)].to(device)
        iter_labels = train_labels[batch_indices(it)].to(device)

        optimizer.zero_grad()
        
        mu, logsigma2 = encoder(iter_images)
        z = stochasticlayer(mu, logsigma2)
        y = decoder(z)

        loss = -variational_objective(iter_images,mu,logsigma2,z,y)
        loss.backward()
    
        optimizer.step()
        
        running_loss += loss.item()    # accumulate a Python float, not the autograd graph
        
        if it % n_batches_print == n_batches_print-1:    # print every n_batches_print mini-batches
            print(f'[{epoch+1} {it+1}] loss: {running_loss/n_batches_print}')
            running_loss = 0.0

print('Finished Training')

In [None]:
# Save Optimized Model Parameters
torch.save(encoder.state_dict(), "./encoder.pt")
torch.save(decoder.state_dict(), "./decoder.pt")

## Report ELBO on Training and Test Set [3pts]

In [None]:
# ELBO on training set (evaluation only, so no gradients are tracked)

with torch.no_grad():
    mu, logsigma2 = encoder(train_images.to(device))
    z = stochasticlayer(mu, logsigma2)
    y = decoder(z)
    elbo_training = variational_objective(train_images.to(device),mu,logsigma2,z,y)
print(f"Training set ELBO = {elbo_training}")

# ELBO on test set

with torch.no_grad():
    mu, logsigma2 = encoder(test_images.to(device))
    z = stochasticlayer(mu, logsigma2)
    y = decoder(z)
    elbo_testing = variational_objective(test_images.to(device),mu,logsigma2,z,y)
print(f"Test set ELBO = {elbo_testing}")

# Numerically Computing Intractable Integrals [10pts]

## Numerical Integration over Latent Space [5pts]

Since we chose a low dimensional latent space, we are able to perform [numerical integration](https://en.wikipedia.org/wiki/Riemann_sum) to evaluate integrals which are intractable in higher dimensions.

For instance, we will use this to integrate over the latent space, e.g. to normalize the posterior $$p(z|x) = \frac{p(x|z)p(z)}{p(x)}= \frac{p(x|z)p(z)}{\int p(x|z)p(z) dz}$$

We want to numerically compute that integral. However, since we are parameterizing $\log p(x|z)$ and $\log p(z)$ we will have
$$\log p(z|x) = \log p(x|z) + \log p(z) - \log \int \exp [\log p(x|z)+ \log p(z)] dz$$

You will write code which computes $\log \int \exp \log f(z) dz$ given an equally spaced grid of values $\log f(z)$ as input.
Note that if we approximate that integral with a numerical sum, we will need `logsumexp` in order for it to be numerically stable.
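
Written out for a 2D grid of points $z_{ij}$ with spacing $\Delta z$ (symbols introduced here for illustration), the Riemann sum in log space is

$$\log \int f(z)\, dz \approx \log \sum_{i,j} \exp\left[\log f(z_{ij}) + 2\log \Delta z\right]$$

and the sum can be evaluated stably with the identity

$$\log \sum_k \exp(a_k) = m + \log \sum_k \exp(a_k - m), \qquad m = \max_k a_k$$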

In [None]:
# Implement log sum exp

def logsumexp(x):
    a = torch.max(x)
    return a + torch.log(torch.sum(torch.exp(x-a)))

# Implement stable numerical integration 
#     over a 2d grid of equally spaced (delta_z) evaluations logf(x)
#     sum over volumes over grid points    `delta_z^2 x value`
def integrate(values,delta_z):
    return logsumexp(values + 2*torch.log(delta_z))
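
A quick sanity check of this kind of code is to integrate a density whose normalizer is known. The NumPy sketch below restates `logsumexp` and `integrate` so that it runs on its own, and integrates a standard 2D Gaussian; the grid bounds of [-6, 6] are an arbitrary choice wide enough to capture almost all of the mass.

```python
import numpy as np

def logsumexp(a):
    # Subtract the maximum before exponentiating so exp never overflows
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def integrate(log_values, delta_z):
    # 2D Riemann sum in log space: each grid cell has area delta_z**2
    return logsumexp(log_values + 2 * np.log(delta_z))

delta_z = 0.1
grid = np.arange(-6, 6 + delta_z, delta_z)
zx, zy = np.meshgrid(grid, grid)

# Log-density of a standard 2D Gaussian at every grid point
log_f = -np.log(2 * np.pi) - 0.5 * (zx ** 2 + zy ** 2)
log_total = integrate(log_f.ravel(), delta_z)   # should be close to log(1) = 0

# The naive formula overflows for large inputs, the stable one does not
big = np.array([1000.0, 1000.0])
with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(big)))
stable = logsumexp(big)

print(log_total, naive, stable)
```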

## Compare Numerical Log-Likelihood to ELBO [5pts]

We can use the numerical integration to compute the log-likelihood of an element in our dataset under our model. We can then compare the numerical integration to the estimate given by the ELBO.
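
The reason the two quantities are comparable is the standard decomposition of the log-likelihood:

$$\log p(x) = \mathbb{E}_{q(z|x)}\left[\log p(x|z) + \log p(z) - \log q(z|x)\right] + \mathrm{KL}\left(q(z|x)\,\|\,p(z|x)\right)$$

The first term is the ELBO, and the KL term is non-negative, so the ELBO is at most $\log p(x)$. Keep in mind that the bound holds for the expectation: a Monte Carlo estimate from a single sample of $z$ is noisy and can land above $\log p(x)$ on an individual draw.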

In [None]:
np.random.seed(0)

def compare_loglikelihood_to_elbo():

    # Define the delta_z to be the spacing of the grid  (I used delta_z = 0.1)
    delta_z = 0.1
    delta_z_t = torch.tensor(delta_z)

    # Define a grid of delta_z spaced points [-4,4]x[-4,4]
    x,y = np.meshgrid(np.arange(-4,4+delta_z,delta_z),np.arange(-4,4+delta_z,delta_z))
    z = np.hstack([x.reshape(-1,1),y.reshape(-1,1)])
    z = torch.from_numpy(z).float().to(device)

    # Sample an x from the data to evaluate the likelihood
    x = train_images[int(np.random.rand()*10000),:]
    x = x.float().to(device)

    # Compute log_p(x|z)+log_p(z) for every point on the grid

    log_pxz = logpdf_bernoulli(x,decoder(z))
    log_pz = logpdf_gaussian(z,torch.zeros_like(z),torch.log(torch.ones_like(z)))

    # Using your numerical integration code
    # integrate log_p(x|z)+log_p(z) over z to find log_p(x)

    log_px = integrate(log_pxz + log_pz, delta_z_t)

    # Check that your numerical integration is correct 
    # by integrating log_p(x|z)+log_p(z) - log_p(x)
    # If you've successfully normalized this should integrate to 0 = log 1

    assert(torch.allclose(integrate(log_pxz + log_pz - log_px, delta_z_t), torch.tensor(0).float(),atol=1e-4))

    # Now compute the ELBO on x

    mu, logsigma2 = encoder(x.view(1,-1))
    z = stochasticlayer(mu, logsigma2)
    y = decoder(z)
    elbo = variational_objective(x,mu,logsigma2,z,y)
    
    return log_px, elbo


# Try this for multiple samples of x
# (compare_loglikelihood_to_elbo draws its own random x on every call)
# note that the ELBO is a lower bound to the true log_p(x)!

for _ in range(5):
    log_px, elbo = compare_loglikelihood_to_elbo()
    
    print(f'{elbo:.3f} <= {log_px:.3f}')

# Data Space Visualizations [10pts]

In this section we will investigate our model by visualizing the distributions over data given by the generative model, samples from these distributions, and reconstructions of the data.

In [None]:
# Write a function to reshape 784 array into a 28x28 image for plotting

def imshow(img):
    npimg = img.detach().numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.axis('off')
    plt.show()

## Samples from the generative model [5pts]

Here you will sample from the generative model using ancestral sampling. 

* First sample a z from the prior. 
* Then use the generative model to parameterize a bernoulli distribution over x given z. Plot this distribution.
* Then sample x from the distribution. Plot this sample.

Do this for 10 samples z from the prior.

Concatenate all your plots into one 10x2 figure where the first column is the distribution over x and the second column is a sample from this distribution. Each  row will be a new sample from the prior.

In [None]:
# Sample 10 z from prior

mu = torch.zeros(10,n_z)
logsigma2 = torch.log(torch.ones(10,n_z))
z = sample_gaussian(mu,logsigma2)

# For each z, plot p(x|z)

y = decoder(z)

# Sample x from p(x|z)

xt = sample_bernoulli(y)

# Concatenate plots into a 10x2 figure:
# each row shows p(x|z) (left) and a sample x (right) for one z

ims = torch.stack([y, xt],dim=1).view(-1,1,28,28)
imshow(torchvision.utils.make_grid(ims,padding=1,nrow=2))

## Reconstructions of data [5pts]

Here we will investigate the VAE's ability to reconstruct 10 inputs from the data. For each input you will

* Plot the input $x$
* Use the recognition network to encode $x$ to the parameters for a distribution $q(z|x)$
* Sample $z \sim q(z|x)$
* Use the generative model to decode to the parameters for distribution $p(x|z)$. Plot this
* Sample $\tilde x \sim p(x|z)$. Plot this

Then you will concatenate all your plots into a 10x3 figure where the first column is the input data, the second column is the distribution over x, the third column is a reconstruction of the input. Each row will be a new sample from the data.

In [None]:
# Sample 10 xs from the data, plot.

x = train_images[[int(np.random.rand()*10000) for _ in range(10)],:]

# For each x, encode to distribution q(z|x)

mu,logsigma2 = encoder(x)

# For each x, sample distribution z ~ q(z|x)

z = stochasticlayer(mu,logsigma2)

# For each z, decode to distribution p(x̃|z), plot.

y = decoder(z)

# For each x, sample from the distribution x̃ ~ p(x̃|z), plot.

xt = sample_bernoulli(y)

# Concatenate all plots into a figure.

ims = torch.cat([im.view(-1,1,28,28).transpose(2,3) for im in [x, y, xt]],dim=0)
ims = torchvision.utils.make_grid(ims,padding=1,nrow=10).transpose(1,2)
imshow(ims)

# Latent Space Visualizations [15pts]

In this section we will investigate our model by visualizing the latent space through various methods. These will include encoding the data, decoding along a grid, and linearly interpolating between encoded data.

## Latent embedding of data [5pts]

One way to understand what is represented in the latent space is to consider where it encodes elements of the data. Here we will produce a scatter plot in the latent space, where each point in the plot will be the mean vector for the distribution $q(z|x)$ given by the encoder. Further, we will colour each point in the plot by the class label for the input data. 

Hopefully our latent space will have learned to distinguish between elements from different classes, even though we never provided class labels to the model!

In [None]:
# Encode the training data

mu,_ = encoder(train_images)

# Take the mean vector of each encoding


# Plot these mean vectors in the latent space with a scatter
# Colour each point depending on the class label

mu = mu.detach().numpy()
colors = torch.argmax(train_labels,dim=1).detach().numpy()

plt.scatter(mu[:,0], mu[:,1], c=colors, alpha=0.5)
plt.title('latent space visualization')
plt.xlim((-7,7))
plt.ylim((-7,7))
plt.xlabel('z1')
plt.ylabel('z2')

## Decoding along a lattice [5pts]

We can also understand the "learned manifold" by plotting the generative distribution $p(x|z)$ for each point along a grid in the latent space. We will replicate figure 4b in the paper.

In [None]:
# Create a 20x20 equally spaced grid of z's
# (use the  previous figure to help you decide appropriate bounds for the grid)

x,y = np.meshgrid(np.linspace(-4,4,20),np.linspace(-4,4,20))
z = np.hstack([x.reshape(-1,1),y.reshape(-1,1)])
z = torch.from_numpy(z).float().to(device)

# For each z on the grid plot the generative distribution over x

y = decoder(z)

# concatenate these plots to a lattice of distributions

imshow(torchvision.utils.make_grid(y.view(-1,1,28,28),padding=1,nrow=20))

## Interpolate between two classes [5pts]

A common technique to assess latent representations is to interpolate between two points.

Here we will encode 3 pairs of data points with different classes. Then we will linearly interpolate between the mean vectors of their encodings. We will plot the generative distributions along the linear interpolation.

In [None]:
# Function which gives linear interpolation z_α between za and zb

def interpolate(za,zb,alpha):
    return (1-alpha)*za + alpha*zb

n_alpha = 10

for it in range(3):
    
    # Sample 3 pairs of data with different classes
    
    i = int(np.random.rand()*10000)
    j = int(np.random.rand()*10000)
    while torch.all(torch.eq(train_labels[i,:],train_labels[j,:])):
        j = int(np.random.rand()*10000)
        
    print(f'interpolate between class {torch.argmax(train_labels[i])}, {torch.argmax(train_labels[j])}')
    
    x = train_images[[i,j],:]
    
    # Encode the data in each pair, and take the mean vectors
    
    z,_ = encoder(x)

    # Linearly interpolate between these mean vectors
    
    z_alpha = []
    
    # n_alpha evenly spaced values in [0,1], so that both endpoints za and zb are included
    for alpha in torch.linspace(0,1,n_alpha):
        z_alpha.append(interpolate(z[0,:],z[1,:],alpha).view(1,-1))

    z_alpha = torch.cat(z_alpha,dim=0)
        
    # Along the interpolation, plot the distributions p(x|z_α)

    y = decoder(z_alpha)

    # Concatenate these plots into one figure
    
    imshow(torchvision.utils.make_grid(y.view(-1,1,28,28),padding=1,nrow=10))


# Posteriors and Stochastic Variational Inference [20pts]

Here we will use numerical integration to plot the "true" posterior $p(z|x)$ which is generally intractable. We will compare the intractable true posterior to the variational approximate posterior given by the recognition model $q(z|x)$.

Then we will use the generative model to perform other inference tasks. In particular, we will see that the purpose of the encoder was only to make training the generative model tractable, and that we can do inference using the generative model completely without the encoder. To illustrate this we will perform the inference task of producing a generative distribution over the bottom half of the digit conditioned on the top half. We will achieve this with stochastic variational inference.

## Plotting Posteriors [5pts]

Here we will plot the true posterior by evaluating $\log p(x|z)+\log p(z)$ on an equally spaced grid over z then numerically integrating over this grid to find the log-normalizer $\log p(x)$. This will give us the intractable true posterior $p(z|x)$.

Then we will compare the true posterior to the approximate posterior given by the recognition model $q(z|x)$. Does the recognition model produce a good approximate posterior to the intractable true posterior?

In [None]:
np.random.seed(1)
# Sample an element x from the dataset to plot posteriors for

x = train_images[int(np.random.rand()*10000),:]
mu,logsigma2 = encoder(x)

# Define a grid of equally spaced points in z
# The grid needs to be fine enough that the plot is nice
# To keep the integration tractable,
# I recommend centering your grid at the mean of q(z|x)

n_samples = 100

mu = mu.detach().numpy()
std3 =  0.5

X = np.linspace(mu[0]-std3,mu[0]+std3,n_samples)
Y = np.linspace(mu[1]-std3,mu[1]+std3,n_samples)
delta_z = X[1]-X[0]
delta_z_t = torch.tensor(delta_z)

X,Y= np.meshgrid(X,Y)
z = np.hstack([X.reshape(-1,1),Y.reshape(-1,1)])
z = torch.from_numpy(z).float().to(device)

# Evaluate log_p(x|z) + log_p(z) for every z on the grid

log_pxz = logpdf_bernoulli(x,decoder(z))
log_pz = logpdf_gaussian(z,torch.zeros_like(z),torch.log(torch.ones_like(z)))

# Numerically integrate log_p(x|z) + log_p(z) to get log_p(x)

log_px = integrate(log_pxz + log_pz, delta_z_t)

# Produce a grid of normalized log_p(z|x)

assert(torch.allclose(integrate(log_pxz + log_pz - log_px, delta_z_t), torch.tensor(0).float(),atol=1e-4))
log_pzx = log_pxz + log_pz - log_px

# Plot the contours of p(z|x) (note, not log)

pzx = torch.exp(log_pzx.reshape((n_samples,n_samples)))
pzx = pzx.detach().numpy()
plt.contour(X,Y,pzx,colors='red')

# Evaluate log_q(z|x) recognition network for every z on grid

muu = torch.tensor(mu).view(1,-1).repeat(n_samples*n_samples,1)
logsigma22 = logsigma2.view(1,-1).repeat(n_samples*n_samples,1)

log_qzx = logpdf_gaussian(z,muu,logsigma22)

# Plot the contours of q(z|x) on previous plot

qzx = torch.exp(log_qzx.reshape((n_samples,n_samples)))
qzx = qzx.detach().numpy()
plt.contour(X,Y,qzx,colors='blue')

print('red: p(z|x), blue: q(z|x)')

## True posterior for top of digit [5pts]

In this question we will plot the "true" posterior given only the top of the image, $p(z|x_{top})$. 

Realize that the generative model gives a Bernoulli distribution over each pixel in the image. We can easily evaluate the likelihood of only the top of an image by evaluating under only those corresponding dimensions of the generative model.
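
A minimal NumPy sketch of this selection is given below. It assumes each image is flattened in row-major order, so that the first 14 * 28 = 392 entries of the 784-vector are the top 14 rows; check this against your data loader. The helper names `top_half` and `logpdf_bernoulli_top` are hypothetical, and the image and decoder outputs are random stand-ins.

```python
import numpy as np

def top_half(x):
    # x has shape (..., 784); under row-major flattening of a 28x28 image
    # the first 14 * 28 = 392 entries are the top 14 rows
    return x[..., : 14 * 28]

def logpdf_bernoulli_top(x_top, p):
    # Sum Bernoulli log-probabilities over the top-half pixels only
    p_top = np.clip(top_half(p), 1e-7, 1 - 1e-7)
    return np.sum(x_top * np.log(p_top) + (1 - x_top) * np.log(1 - p_top), axis=-1)

rng = np.random.default_rng(0)
x = (rng.random(784) < 0.5).astype(np.float64)    # stand-in binarized image
p = rng.random((5, 784))                          # stand-in decoder outputs for 5 values of z
log_p_top = logpdf_bernoulli_top(top_half(x), p)  # one value per z, shape (5,)

print(log_p_top.shape)
```

The same slicing applies to torch tensors, so the idea carries over directly to the decoder output in this notebook.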

In [None]:
# Function which returns only the top half of a 28x28 array
# This will be useful for plotting, as well as selecting correct bernoulli params

# log_p(x_top | z) (hint: select top half of 28x28 bernoulli param array)

# Sample an element from the data set and take only its top half: x_top

# Define a grid of equally spaced points in z

# Evaluate log_p(x_top | z) + log_p(z) for every z on grid

# Numerically integrate to get log_p(x_top)

# Normalize to produce grid of log_p(z|x_top)

# Plot the contours of p(z|x_top)


## Learn approximate posterior for top of digit with Stochastic Variational Inference [10 pts]

In this question we will see how we can use SVI to learn an approximate posterior to $p(z|x_{top})$, which we just obtained through numerical integration.

Note that we can't just use our recognition model, because our encoder doesn't know what to do with only top halves of images. Instead, we will initialize a variational distribution $q(z) = \mathcal{N}(z| \mu,\sigma^2 I)$ and optimize the ELBO to minimize the KL divergence between it and the true distribution.
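
The mechanics of this optimization can be sketched on a toy problem. In the snippet below, `log_joint` is a hypothetical stand-in for $\log p(x_{top}|z) + \log p(z)$: an unnormalized Gaussian whose mean and standard deviation we know, so we can see whether $q(z)$ recovers them. The step size, sample count, and number of iterations are arbitrary choices, not values prescribed by the assignment.

```python
import math
import torch

torch.manual_seed(0)

def logpdf_gaussian(z, mu, logsigma2):
    # Diagonal Gaussian log-density, summed over the last dimension
    return (-z.shape[-1] / 2) * math.log(2 * math.pi) \
        - 0.5 * torch.sum(logsigma2, dim=-1) \
        - 0.5 * torch.sum(torch.exp(-logsigma2) * (z - mu) ** 2, dim=-1)

# Toy target: Gaussian centred at (1, -2) with standard deviation 0.5
target_mean = torch.tensor([1.0, -2.0])
target_logvar = torch.full((2,), 2 * math.log(0.5))

def log_joint(z):
    # The added constant mimics an unknown normalizer; it does not affect the optimum
    return logpdf_gaussian(z, target_mean, target_logvar) + 3.0

# Variational parameters of q(z) = N(mu, sigma^2 I), stored as mean and log-variance
mu = torch.zeros(2, requires_grad=True)
logsigma2 = torch.zeros(2, requires_grad=True)
optimizer = torch.optim.Adam([mu, logsigma2], lr=0.02)

n_samples = 128
for step in range(1500):
    optimizer.zero_grad()
    eps = torch.randn(n_samples, 2)
    z = mu + torch.exp(0.5 * logsigma2) * eps   # reparameterized samples from q(z)
    elbo = torch.mean(log_joint(z) - logpdf_gaussian(z, mu, logsigma2))
    (-elbo).backward()                          # loss for SVI is the negative ELBO
    optimizer.step()

sigma = torch.exp(0.5 * logsigma2).detach()
print(mu.detach(), sigma)
```

In the actual question, `log_joint` would instead evaluate the decoder on the top-half pixels and add the log-prior, with the decoder parameters held fixed.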

In [None]:
# Initialize parameters μ and logσ for variational distribution q(z)

# Define mean ELBO over M samples z ~ q(z)
# using log_p(z), log_p(x_top | z), and q(z|x_top)

# Loss for SVI is -1*ELBO

# Set up ADAM to optimize μ and logσ

# Optimize for a few iterations until convergence (you can use a larger stepsize here)


In [None]:
# On previous plot of contours of p(z|x_top) plot the optimized q(z)


In [None]:
# Sample z ~ q(z)

# Use generative model p(x|z) to produce distribution over x

# Extract the bottom half of this generative distribution: p(x_bot| z)

# Concatenate the x_top and p(x_bot | z) and plot.


# Investigating Gradient Estimators [Bonus 5pts]

In this part we will experimentally investigate the difference in variances between the gradient estimates given by the Reparameterization and Score-Function gradient estimators.

Comment on their means and variances.
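
To build intuition before working with the VAE, the two estimators can be compared on a toy problem where the answer is known. The NumPy sketch below is not the assignment's ELBO: it estimates the gradient of $\mathbb{E}_{z \sim \mathcal{N}(\mu, 1)}[z^2] = \mu^2 + 1$ with respect to $\mu$, whose true value is $2\mu$. The function $f(z) = z^2$, the value of $\mu$, and the sample count are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 1.0, 100_000

# Reparameterization: write z = mu + eps, then differentiate f(z) = z**2 through z,
# so each gradient sample is f'(z) = 2 * z
eps = rng.standard_normal(n)
grad_reparam = 2.0 * (mu + eps)

# Score function: sample z directly and weight f(z) by d/dmu log N(z | mu, 1) = (z - mu)
z = rng.normal(mu, 1.0, size=n)
grad_score = z ** 2 * (z - mu)

print("true gradient:", 2 * mu)
print("reparameterization: mean", grad_reparam.mean(), "variance", grad_reparam.var())
print("score function:     mean", grad_score.mean(), "variance", grad_score.var())
```

Both sample means should land near the true gradient, while the spread of the individual gradient samples differs between the two estimators; histograms of `grad_reparam` and `grad_score` make the difference visible.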

In [None]:
# Use Reparameterization Gradient Estimator
# to estimate gradient of mean ELBO wrt μ over M minibatch samples
# hint: this will involve just taking gradients through the code used to train

# Use Score-Function Gradient Estimator
# to estimate gradient of mean ELBO wrt μ over M minibatch samples
# make sure you are not using the reparameterization trick to sample z from q
# you should only be taking gradients through log_q(z|x), no gradients through ELBO or z

# Consider the gradients wrt the first component of μ
# Produce two histograms in two different subplots
# First show the distribution of gradients given by Reparameterization estimator
# Second show the distribution of gradients given by Score Function Estimator
