# Implementation of VAE

## Objectives
By the end of this lesson, students will be able to:
- Connect VAE to the graphical model.
- Understand VAE as a generative model.

## Prerequisite
Before starting this lesson, students should be familiar with the following concepts:
- Basics of VAE


## Connection of Graphical Model to VAE

The graphical model gave us a high-level view: it showed how the random variables are connected through distributions and made the problem easier to interpret. Now, let's connect the graphical model to the neural network structure.

Let's take an example. Say you want to generate an animal. First, you imagine the animal: it has four legs and an orange body with black stripes. With these criteria, you can generate the animal by sampling from the animal kingdom. It's a tiger. Doesn't this imagination sound like a latent variable?

From this example, we can draw an analogy between human thinking and the VAE network.

- $\mathbf{x}$ (Data): The animal we want to generate.

- $\mathbf{h}$ (Latent variable): Imagination.

- $P(x)$ (Evidence/distribution of data): The animal kingdom [Target].

- $P(h)$ (Prior/distribution of latent variable): Imagination source or the human brain.

- $P(x|h)$ (Likelihood): Turning the imagination into a real animal.

Our target is to model the data, i.e. to find $P(x)$ using the equation we studied in the last video. In terms of the graphical model, the VAE's encoder function is analogous to the dotted approximate posterior, which is pushed toward the known prior $P(h)$ by the $\mathbb{KL}$ term of the variational lowerbound. The intuition is that the encoder, kept close to the prior, generates latent variables that are likely under the input data. In terms of the example, we want to limit the imagination to the animal kingdom and not imagine things like a stem, root, table, or computer (whatever lies outside the animal domain), because these have nothing to do with the generation process.

Next, the VAE's decoder function is analogous to the likelihood distribution of the graphical model, which generates data from latent space variables.
Relating this to the example, it is like generating animals from the imagination. Given an imagination, the brain must relate it to an animal. If the brain doesn't relate the imagination to the correct animal, it must be taught the correct recreation. Similarly, increasing the expected likelihood term in the lowerbound formula makes the decoder function generate proper data given the latent variable. This makes the decoder a generative network: it lets us mimic the hidden random process that generates data from a latent space representation and produce artificial data that resembles the real data. The decoder network generates the most likely data from the sampled code.



From the perspective of the neural network, the encoder's approximate posterior $q_\phi(\mathbf{h}|\mathbf{x})$ is parameterized by its weights and biases $\phi$. It takes the input data $\mathbf{x}$ and outputs the variational parameters $\lambda$.

Next, in the decoder, the likelihood function $P_\theta(\mathbf{x}|\mathbf{h})$ is parameterized by its weights and biases $\theta$. It takes as input $\mathbf{h}$, a random sample drawn from the distribution defined by $\lambda$, and outputs $\mathbf{x'}$. Because the latent variable is sampled from the encoder function, the notation changes slightly. The lowerbound derived from our graphical model for each data point $x^{(i)}$ then becomes the following equation in the neural network model.

$$
\mathcal{L}_i(\phi, \theta) = \underbrace{\mathbb{E}_{h \sim q_\phi(\mathbf{h}|x^{(i)})} [\log P_\theta(x^{(i)}|\mathbf{h})]}_{\text{Monte-Carlo estimate this}} - \underbrace{\mathbb{KL}(q_\phi(\mathbf{h}|x^{(i)}) \,||\, P(\mathbf{h}))}_{\text{solved analytically}} \tag{1}
$$

The first part of Eqn (1) is the expected log-likelihood $\log P_\theta(x^{(i)}|h)$ of the data point given a latent variable $\mathbf{h}$ drawn from the approximate posterior, i.e. a measure of how well our input data is reconstructed. This expectation is the negative of the VAE's reconstruction error (negative because in neural networks we minimize errors). To maximize the lowerbound, we have to make this expectation term as large as possible. The likelihood in the first term is a multivariate Gaussian. Replacing the expectation with a Monte-Carlo approximation, we write this term as:

$$
\mathbb{E}_{h \sim q_\phi(\mathbf{h}|x^{(i)})} [\log P_\theta(x^{(i)}|\mathbf{h})] \approx -C ||x^{(i)}-x'^{(i)}||^2 \tag{2}
$$

where the constant term $C = \frac{1}{2\sigma^2}$. Here $\sigma^2$ is the fixed variance of the Gaussian likelihood (not the encoder's $\sigma_j$ used in Eqn (3)), and an additive term that does not depend on the network parameters has been dropped. The expectation is taken with respect to the encoder's distribution over the representation of the $i$-th data point. Since the data points are independent of each other, the total loss is just the sum over all data points $i$. If the decoder fails to reconstruct the data properly, we say, statistically, that the parameters of the decoder's likelihood function are poor. Poor reconstruction results in a larger reconstruction error. The decoder network is responsible for computing $P_\theta(x^{(i)}|h)$, and we can estimate this term through sampling.
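As a worked step, the squared error in Eqn (2) can be derived as follows. This is a sketch under two assumptions: the likelihood is an isotropic Gaussian centered on the decoder output $x'^{(i)}$ with fixed variance $\sigma^2$, and the expectation is approximated with a single sample $\mathbf{h}$. The symbol $D$ (the dimension of the data) is introduced here only for illustration.

$$
\log P_\theta(x^{(i)}|\mathbf{h}) = \log \mathcal{N}(x^{(i)};\, x'^{(i)},\, \sigma^2 I) = -\frac{1}{2\sigma^2} ||x^{(i)}-x'^{(i)}||^2 - \frac{D}{2} \log(2\pi\sigma^2)
$$

The last term does not depend on $\phi$ or $\theta$, so it can be dropped during optimization, which leaves $-C||x^{(i)}-x'^{(i)}||^2$ with $C = \frac{1}{2\sigma^2}$.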

The second part is the KL-divergence term, which needs to be minimized so that our approximate posterior becomes similar to the assumed prior (Gaussian). This term can be thought of as a regularization term added to the loss of our neural network so that the encoder distribution has a form similar to a normal distribution. It also forces the encodings of the data to be spread evenly around the center of the latent space. If the encoder tries to cheat by clustering data far apart in specific regions, this KL term penalizes the network [Kingma et al. (2013)].

$$
- \mathbb{KL}(q_\phi(\mathbf{h}|\mathbf{x}) \,||\, \mathcal{N}(0,I)) = \frac{1}{2} \sum_{j=1}^J (1 + \log((\sigma_j)^2) - (\mu_j)^2 - (\sigma_j)^2) \tag{3}
$$

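where $\mu_j$ and $\sigma_j$ denote the $j^{th}$ elements of the encoder's mean and standard deviation vectors, and $J$ is the dimension of the latent space.

Writing the variational lowerbound in terms of the input $\mathbf{x}$ instead of each data point, Eqn (1) becomes: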
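The closed form in Eqn (3) can be checked numerically with a short NumPy sketch. It assumes the encoder outputs the log-variance $\log \sigma_j^2$ rather than $\sigma_j$ itself, which is a common convention but is not stated in this lesson; the function name `kl_to_standard_normal` is hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over the J latent dimensions.

    Assumption: log_var holds log(sigma_j^2) for each latent dimension j.
    """
    # Eqn (3) gives -KL, so the sign is flipped to return the KL itself.
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

# Two example posteriors with J = 2 latent dimensions.
mu = np.array([[0.0, 0.0], [1.0, -2.0]])
log_var = np.array([[0.0, 0.0], [0.0, np.log(4.0)]])
kl = kl_to_standard_normal(mu, log_var)
print(kl)
```

The first row matches the prior exactly, so its KL is zero; the second row is pulled away from the prior in both mean and variance and is penalized accordingly.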

$$
\mathcal{L}(\phi, \theta) = \underbrace{-C||\mathbf{x}-\mathbf{x'}||^2}_{\text{-ve reconstruction error}} + \underbrace{\frac{1}{2} \sum_{j=1}^J (1 + \log((\sigma_j)^2) - (\mu_j)^2 - (\sigma_j)^2)}_{\text{regularization term}} \tag{4}
$$

Unlike the expectation term, which is estimated with Monte-Carlo sampling, the KL-divergence term is calculated analytically. The Monte-Carlo gradient estimates [Mohamed et al. (2019)] of the lowerbound are beyond the scope of this video.

In neural networks, we tend to minimize a loss. So, a VAE implementation generally minimizes the negative of the variational lowerbound (equivalent to maximizing the lowerbound). This can be written as:

$$
\arg\max_{\phi, \theta} \mathcal{L} (\phi, \theta) = \arg\min_{\phi, \theta} -\mathcal{L} (\phi, \theta) = \arg\min_{\phi, \theta} \text{VAE loss} = \arg\min_{\phi, \theta} C||\mathbf{x}-\mathbf{x'}||^2 - \frac{1}{2} \sum_{j=1}^J (1 + \log((\sigma_j)^2) - (\mu_j)^2 - (\sigma_j)^2)
$$

We define the VAE loss as the negative of the variational lowerbound and minimize it. During implementation, we generally compute the variational lowerbound and multiply it by $-1$ to get the VAE loss.

Finally, we have a loss function written in terms of the estimated probability distributions of the encoder and decoder networks of our VAE.
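A minimal NumPy sketch of this loss, following Eqn (4) term by term, could look like the block below. The function name `vae_loss`, the argument `sigma_dec` (the fixed standard deviation of the Gaussian likelihood, giving $C = \frac{1}{2\sigma^2}$), and the use of log-variance as the encoder output are assumptions made for illustration.

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var, sigma_dec=1.0):
    """Negative variational lowerbound per data point, following Eqn (4).

    Assumptions: sigma_dec is the fixed std of the Gaussian likelihood,
    and log_var holds log(sigma_j^2) produced by the encoder.
    """
    C = 1.0 / (2.0 * sigma_dec**2)
    # Reconstruction error: C * ||x - x'||^2
    recon_error = C * np.sum((x - x_recon) ** 2, axis=-1)
    # Regularization term of Eqn (4), which equals -KL
    neg_kl = 0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)
    lowerbound = -recon_error + neg_kl
    # The loss is the lowerbound multiplied by -1
    return -lowerbound

x = np.array([[1.0, 0.0]])
x_recon = np.array([[0.0, 0.0]])
mu = np.array([[1.0]])
log_var = np.array([[0.0]])
loss = vae_loss(x, x_recon, mu, log_var)
print(loss)
```

Computing the lowerbound first and negating it at the end mirrors the description above and keeps the signs of both terms easy to compare with Eqn (4).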

<center>
<figure>

<p><img src="https://i.postimg.cc/4xSGyp0s/schematics-of-vae.jpg"></p>
<figcaption align="center">Fig: Schematics of VAE</figcaption>
</figure>
</center>

Now, the steps involved in training the VAE as a generative network are as follows:

1. First, encode the input $\mathbf{x}$ as a latent distribution $q_\phi(\mathbf{h}|\mathbf{x}) = \mathcal{N}(\mu_x, \sigma_x^2)$.
2. Second, sample a latent space point $\mathbf{h} \sim \mathcal{N}(\mu_x, \sigma_x^2)$ from the distribution.
3. Third, decode the sample point $\mathbf{h}$ into the reconstructed output $\mathbf{x'}$ using the decoder's distribution $P_\theta(\mathbf{x}|\mathbf{h})$.
4. Finally, compute the loss term and backpropagate it through the network.
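The forward part of these steps can be sketched in NumPy as follows. This is only an illustration under several assumptions: the encoder and decoder are single linear layers with random (hypothetical) weights, the sizes `D` and `J` are toy values, the encoder outputs the log-variance, and $C = \frac{1}{2}$. Backpropagation is not shown, since NumPy has no automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, J = 6, 2  # toy sizes (assumptions): data dimension and latent dimension

# Hypothetical single-layer linear encoder and decoder weights
W_mu = rng.normal(scale=0.1, size=(D, J))
W_log_var = rng.normal(scale=0.1, size=(D, J))
W_dec = rng.normal(scale=0.1, size=(J, D))

x = rng.random((4, D))  # batch of 4 toy inputs

# 1. Encode x as the parameters of N(mu_x, sigma_x^2)
mu = x @ W_mu
log_var = x @ W_log_var
sigma = np.exp(0.5 * log_var)

# 2. Sample a latent point h ~ N(mu_x, sigma_x^2)
h = rng.normal(loc=mu, scale=sigma)

# 3. Decode h into the reconstruction x'
x_recon = h @ W_dec

# 4. Compute the loss (reconstruction error + KL), summed over the batch
recon_error = 0.5 * np.sum((x - x_recon) ** 2, axis=1)
kl = -0.5 * np.sum(1.0 + log_var - mu**2 - sigma**2, axis=1)
loss = np.sum(recon_error + kl)
print(loss)
```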

The conditions applied to our VAE model so far enable us to compute the loss term, but the Monte-Carlo estimate of the expectation is not differentiable with respect to the inference network parameters $\phi$. This is because the latent variable $\mathbf{h}$ is obtained by random sampling from the approximate posterior distribution, which exhibits large variance. To make this term differentiable, the reparameterization trick is used. In the next chapter, we will go thoroughly through the process of calculating the error and its gradients and backpropagating them through the network to optimize the parameters $\phi$ and $\theta$.

Up to now, we have learned how the VAE model's probability distributions are represented and how the loss is calculated. But how does this make the VAE a generative model? This is discussed in the next subsection with an example of generating handwritten digits, using the MNIST dataset as input data for training the network.


## VAE as a Generative Model




After training the VAE model, the encoder gives us a mean vector and a variance (a diagonal covariance matrix) that describe the latent distribution for the input data. If the latent vectors have only 2 dimensions, varying them and decoding the result produces different images that weren't present in our original input dataset. Plotting this 2-dimensional manifold shows how changing the latent variables changes the generated output.

Here, the MNIST dataset of handwritten digits, with 60000 training samples and 10000 test samples, is used as input data for training the network. The encoder and decoder networks are structured as follows:



<center>
<figure>


<p><img src="https://i.postimg.cc/gJdgW2W8/vae-generative-model.png"></p>
<figcaption align="center">Fig: VAE as a Generative Model</figcaption>
</figure>
</center>

Encoder network's shape:

1. Input layer = 784-dimensional input (28x28 image)

2. Hidden layers
   - Dense layer = 512x1 dimension
   - Mean vector = 2x1 dimension (connected to the dense layer)
   - Variance = 2x1 dimension (connected to the dense layer)
   - Lambda layer / sampled latent vector = 2x1 dimension (connected to the mean and variance; as part of the reparameterization trick, it produces the latent variable using noise sampled from $\mathcal{N}(0,1)$)

Decoder network's shape:

1. Hidden layers
   - Dense = 512x1 dimension
2. Dense (output) = 784x1 dimension (later reshaped to 28x28 images for plotting)

The activation functions used are similar to those of a vanilla autoencoder, with ReLU on the hidden layers and sigmoid on the output layer.

Since the network has a 2-dimensional latent space, plotting the 2-D manifold of the latent space as a 20x20 grid of generated outputs results in:
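To make the layer shapes concrete, the architecture above can be sketched as a NumPy forward pass. This is not the trained network from the figures: the weights are randomly initialized, the helper names (`dense`, `affine`, `encode`, `decode`) are hypothetical, and treating the variance layer as a log-variance output is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    """Return (weights, bias) for one fully connected layer (random init, an assumption)."""
    return rng.normal(scale=0.01, size=(n_in, n_out)), np.zeros(n_out)

params = {
    "enc_hidden": dense(784, 512),   # encoder dense layer
    "enc_mean": dense(512, 2),       # mean vector
    "enc_log_var": dense(512, 2),    # variance, stored as log-variance (assumption)
    "dec_hidden": dense(2, 512),     # decoder dense layer
    "dec_output": dense(512, 784),   # output layer
}

def relu(a):
    return np.maximum(a, 0.0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def affine(a, layer):
    W, b = layer
    return a @ W + b

def encode(x):
    hidden = relu(affine(x, params["enc_hidden"]))
    return affine(hidden, params["enc_mean"]), affine(hidden, params["enc_log_var"])

def decode(h):
    hidden = relu(affine(h, params["dec_hidden"]))
    return sigmoid(affine(hidden, params["dec_output"]))

x = rng.random((3, 784))  # stand-in for 3 flattened 28x28 images
mu, log_var = encode(x)
# Lambda layer: mean plus standard deviation times N(0, 1) noise
h = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
x_recon = decode(h)
n_params = sum(W.size + b.size for W, b in params.values())
print(mu.shape, h.shape, x_recon.shape, n_params)
```

Counting `n_params` is a quick way to confirm that the layer sizes are wired as listed.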



<p><img src="https://i.postimg.cc/T3cHkbmg/2d-manifold-learned-repr-digits.png"  ></p>
<center>Learned representation of digits over 2-D manifold</center>

Each digit image in the above plot is created by the decoder network from a different sample of the unit normal distribution. The transitions between digits as the latent variables change are smooth. For example, look at how 0 changes to 6, then to 4, then to 7. Some of the generated images might not look like actual digits, because the input to the decoder was taken from a unit normal distribution and not sampled from our encoder.
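A plot like the one above can be assembled by decoding a grid of latent points and tiling the results into one image. The sketch below makes several assumptions: it uses evenly spaced latent values in $[-3, 3]$ rather than random draws, the helper names `latent_grid` and `tile_images` are hypothetical, and `decode` is a stand-in with random weights where a trained decoder would normally go.

```python
import numpy as np

def latent_grid(n=20, limit=3.0):
    """n x n grid of 2-D latent points covering [-limit, limit] on each axis."""
    axis = np.linspace(-limit, limit, n)
    h1, h2 = np.meshgrid(axis, axis)
    return np.stack([h1.ravel(), h2.ravel()], axis=1)  # shape (n*n, 2)

def tile_images(images, n=20, side=28):
    """Arrange n*n flattened side x side images into one (n*side, n*side) canvas."""
    grid = images.reshape(n, n, side, side)
    # Reorder to (grid row, pixel row, grid column, pixel column) before flattening
    return grid.transpose(0, 2, 1, 3).reshape(n * side, n * side)

# Stand-in decoder (assumption): random weights instead of a trained network
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 784))

def decode(h):
    return 1.0 / (1.0 + np.exp(-(h @ W)))

points = latent_grid()
canvas = tile_images(decode(points))
print(points.shape, canvas.shape)
```

The resulting `canvas` array can be passed to any image plotting routine; with a trained decoder, neighboring tiles would show the smooth transitions described above.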



<p><img src="https://i.postimg.cc/kXHbCBmt/Latent-Space.png"  ></p>
<center>Scatter diagram showing distribution of data</center>

The scatter diagram above shows how the digit image representations are distributed over the 2-dimensional latent space. Each dot color denotes one digit's representation over the 2-D manifold. Notice how the same digits are clustered together, while different digits lie apart from each other. We can easily see that the 0's and 1's are on opposite sides, while the 0's and 6's are near each other, corresponding to the image plot we saw before.

A VAE can also reconstruct its input, similar to normal autoencoders. Instead of taking the input from the unit normal distribution, after training the VAE we take an input dataset, pass it through the encoder, sample from the corresponding distribution, and reconstruct the output digits with the decoder. You can see what the reconstructed digits look like below.



<p><img src="https://i.postimg.cc/26YQg9D5/digit-reconstruction.png" ></p>
<center>Digit reconstruction from VAE model</center>

The top row shows the original input digits, while the row below shows the reconstructed output digits. A reconstruction might not always match its input; nevertheless, reconstruction of digits/images is possible using the VAE model. So, the VAE can be used both as a generative model and as a compression-reconstruction model.

In this notebook, we learned how a VAE works differently from normal autoencoders, constructed its loss function by relating the neural network to the graphical model, and finally saw an example of how it can act as a generative model. Next, we will learn about the reparameterization trick, which is used to compute the gradients of our loss function on the VAE architecture.




### Key Takeaways

1. A Variational Autoencoder minimizes the KL-divergence between two distributions, which is similar to maximizing the variational lowerbound.
2. Maximizing the variational lowerbound is equivalent to minimizing the negative variational lowerbound.
3. In neural networks, we minimize a loss function. The VAE minimizes the negative variational lowerbound.
4. Random sampling blocks backpropagation through the VAE network, so the network cannot be trained directly. The reparameterization trick is used to solve this problem.
5. A VAE can be used both as a generative model and as a compression-reconstruction model.


### Additional Resources

* Papers
   * Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. https://arxiv.org/pdf/1312.6114.pdf
       * Appendix B analytically solves the KL divergence term.

   * Mohamed, S., Rosca, M., Figurnov, M., & Mnih, A. (2019). Monte Carlo gradient estimation in machine learning. arXiv preprint arXiv:1906.10652. https://arxiv.org/pdf/1906.10652.pdf
       * Chapter 2 covers the Monte-Carlo estimate of gradients.

   * Roeder, G., Wu, Y., & Duvenaud, D. K. (2017). Sticking the landing: Simple, lower-variance gradient estimators for variational inference. In Advances in Neural Information Processing Systems (pp. 6925-6934). https://arxiv.org/pdf/1703.09194.pdf
       * Chapter 2 explains why the proposed gradient estimate has lower variance than the original estimator.

* Articles
    * Weng, Lilian. *From Autoencoder to Beta-VAE*. [lilianweng.github.io](https://lilianweng.github.io/posts/2018-08-12-vae/), 2018.
       
