Helpful resources:
1. https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html#vae-variational-autoencoder
2. https://arxiv.org/pdf/1606.05908.pdf

### Probabilistic model
Variational autoencoders are based on the principles of variational inference and graphical models. We want to generate data $x \in \mathcal{X}$; to do so, we first factor the joint probability of the observable variables $x$ and the latent variables $z$ into likelihood and prior
\begin{align}
    p(x,z) = p(x \mid z) p(z).
\end{align}
If we have many latent variables, marginalizing the joint distribution over $z$ to obtain the evidence $p(x) = \int p(x,z) \, dz$ is intractable, and so is the posterior $p(z \mid x) = p(x,z) / p(x)$.
#### Model assumption and loss function
To avoid the intractable marginalization, we approximate the posterior probability with a parametric model $q_\phi$:
\begin{align}
    q_{\phi}(z \mid x) \approx p(z \mid x).
\end{align}
We want the model $q_\phi(z \mid x)$ to be as close as possible to $p(z \mid x)$; a measure of closeness is given by the Kullback--Leibler divergence
\begin{align}
    D_{\text{KL}}(q_\phi || p) &= \int q_\phi \log \frac{q_\phi}{p(z \mid x)} \, dz \\
                                 &= \int q_\phi \log \frac{q_\phi p(x)}{p(x,z)} \, dz \\
                                 &= \log p(x) + \int q_\phi \log \frac{q_\phi}{p(x \mid z) p(z)} \, dz \\
                                 &= \log p(x) + D_{\text{KL}}(q_\phi || p(z)) - \mathbb{E}_{q_\phi}[\log p(x \mid z)].
\end{align}
Since we want to minimize both the negative log-likelihood $- \log p(x)$ and the divergence between $q_\phi$ and $p(z \mid x)$, we set the loss of our model to
\begin{align}
    L_{\text{VAE}} &= - \log p(x) + D_{\text{KL}}(q_\phi || p) \\
                   &= \underbrace{D_{\text{KL}}(q_\phi || p(z)) - \mathbb{E}_{q_\phi}[\log p(x \mid z)]}_{\text{Negative evidence lower bound (ELBO)}}. \quad \text{(using equation above)}
\end{align}
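The decomposition of the KL divergence and the resulting loss can be checked numerically on a small example. The sketch below assumes a hypothetical toy model with a discrete latent variable taking `K` values and a single fixed observation, so that the integrals over $z$ become sums and the true posterior is available exactly. All names (`prior`, `likelihood`, `q`, `kl`) are illustrative and not part of the model above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: z takes K discrete values, x is one fixed observation.
K = 4
prior = rng.dirichlet(np.ones(K))        # p(z)
likelihood = rng.uniform(0.1, 1.0, K)    # p(x | z) for the fixed x, one entry per z
q = rng.dirichlet(np.ones(K))            # q_phi(z | x), an arbitrary approximation

evidence = np.sum(likelihood * prior)        # p(x) = sum_z p(x | z) p(z)
posterior = likelihood * prior / evidence    # p(z | x) by Bayes' rule


def kl(a, b):
    """KL divergence D_KL(a || b) between two discrete distributions."""
    return np.sum(a * np.log(a / b))


# Left- and right-hand side of the KL decomposition.
lhs = kl(q, posterior)
rhs = np.log(evidence) + kl(q, prior) - np.sum(q * np.log(likelihood))

# The loss: KL to the prior minus the expected log-likelihood.
loss = kl(q, prior) - np.sum(q * np.log(likelihood))

print(lhs, rhs)                    # the two values should agree
print(loss, -np.log(evidence))     # the loss should not fall below -log p(x)
```

Because the KL divergence is non-negative, the loss is bounded below by $-\log p(x)$, with equality only when $q_\phi$ matches the true posterior.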
#### Computing the KL divergence
If we let both $p(z)$ and $q_\phi$ be normally distributed, we can compute $D_{\text{KL}}(q_\phi || p(z))$ in closed form. Let $q_\phi = \mathcal{N}(\mu_q, \Sigma_q)$ and $p(z)=\mathcal{N}(\mu_p, \Sigma_p)$ with $z \in \mathbb{R}^d$, then
\begin{align}
    D_{\text{KL}}(q_\phi || p(z)) = \frac{1}{2} \left[ \log \frac{|\Sigma_p|}{|\Sigma_q|} - d + \text{tr} \left(\Sigma_p^{-1} \Sigma_q\right) + (\mu_p - \mu_q)^T \Sigma_p^{-1} (\mu_p - \mu_q) \right].
\end{align}
Generally, we choose $\mu_p = 0$ and $\Sigma_p = Id$; the KL divergence then simplifies to
\begin{align}
\frac{1}{2} \left[ - \log |\Sigma_q| - d + \text{tr} \left(\Sigma_q\right) + \mu_q^T \mu_q \right].
\end{align}
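As a sanity check, the general and the simplified closed-form expressions can be compared numerically. The following is a minimal NumPy sketch; the function names `kl_gaussians` and `kl_to_standard_normal` and the example values are made up for illustration.

```python
import numpy as np


def kl_gaussians(mu_q, cov_q, mu_p, cov_p):
    """D_KL(N(mu_q, cov_q) || N(mu_p, cov_p)) for full covariance matrices."""
    d = mu_q.shape[0]
    cov_p_inv = np.linalg.inv(cov_p)
    diff = mu_p - mu_q
    # slogdet returns (sign, log|det|); it is more stable than log(det(...)).
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (logdet_p - logdet_q - d
                  + np.trace(cov_p_inv @ cov_q)
                  + diff @ cov_p_inv @ diff)


def kl_to_standard_normal(mu_q, cov_q):
    """Simplified form for the prior N(0, Id)."""
    d = mu_q.shape[0]
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (-logdet_q - d + np.trace(cov_q) + mu_q @ mu_q)


# Illustrative values only.
mu_q = np.array([0.5, -1.0])
cov_q = np.array([[1.5, 0.3],
                  [0.3, 0.8]])

general = kl_gaussians(mu_q, cov_q, np.zeros(2), np.eye(2))
simplified = kl_to_standard_normal(mu_q, cov_q)
print(general, simplified)  # both expressions should give the same value
```

In practice the encoder often outputs a diagonal $\Sigma_q$, in which case the determinant and trace reduce to a product and a sum over the diagonal entries.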
#### Computing the expected value and the reparameterization trick
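Computing the expected value $\mathbb{E}_{q_\phi}[\log p(x \mid z)]$ requires sampling $z$ from $q_\phi$. Gradients cannot be backpropagated through the sampling step directly, because a sample is not a deterministic, differentiable function of the parameters $\phi$. The reparameterization trick moves the randomness into an auxiliary variable that does not depend on $\phi$: instead of sampling $z \sim q_\phi = \mathcal{N}(z; \mu_q, \Sigma_q)$ directly, we sample $\epsilon \sim \mathcal{N}(0, Id)$ and compute $z=\mu_q + L_q \epsilon$, where $L_q$ is any matrix with $L_q L_q^T = \Sigma_q$ (for example the Cholesky factor, or the elementwise standard deviations if $\Sigma_q$ is diagonal). The sample $z$ is then a differentiable function of $\mu_q$ and $L_q$.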
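The reparameterization can be sketched in NumPy as follows. To obtain samples with covariance $\Sigma_q$, the sketch scales the noise by a Cholesky factor $L$ with $L L^T = \Sigma_q$. The function name `reparameterize` and the example values are assumptions made for illustration. NumPy does not compute gradients, so this only demonstrates that the transformed noise has the intended distribution.

```python
import numpy as np

rng = np.random.default_rng(0)


def reparameterize(mu_q, cov_q, n_samples, rng):
    """Draw samples z = mu_q + L @ eps with eps ~ N(0, Id) and cov_q = L @ L.T."""
    chol = np.linalg.cholesky(cov_q)                        # lower-triangular L
    eps = rng.standard_normal((n_samples, mu_q.shape[0]))   # noise independent of mu_q, cov_q
    return mu_q + eps @ chol.T                              # row i is mu_q + L @ eps[i]


# Illustrative values only.
mu_q = np.array([1.0, -2.0])
cov_q = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

z = reparameterize(mu_q, cov_q, 200_000, rng)
sample_mean = z.mean(axis=0)
sample_cov = np.cov(z, rowvar=False)
print(sample_mean)  # should be close to mu_q
print(sample_cov)   # should be close to cov_q
```

In an automatic differentiation framework the same expression would let gradients flow to the mean and to the covariance factor, since `eps` is the only random quantity and does not depend on them.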

In [1]:
import numpy as np

In [None]:
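class VariationalAutoencoder():
    def __init__(self,
                 encoder_network,
                 decoder_network,
                 ):
        # Keep references to both networks; the encoder is meant to
        # parameterize q_phi(z | x) and the decoder p(x | z).
        self.encoder_network = encoder_network
        self.decoder_network = decoder_network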