Variational autoencoders use gaussian models to generate images.  

### Gaussian distribution
Before going into the details of VAEs, we discuss the use of gaussian distribution for data modeling. In the following diagram, we assume the probability of $X$ equal to a certain value $x$, $p(X=x)$, follows a gaussian distribution: 

<img src="images/g0.png" width="60%">


$$
\text{Probability density function (PDF)} = p(X=x) = f(x) = \frac{e^{-(x - \mu)^{2}/(2\sigma^{2}) }} {\sigma\sqrt{2\pi}}
$$

We can sample data using the PDF above. We use the following notation for data sampled from a gaussian distribution with mean $ \mu $ and standard deviation $ \sigma $:

$$
x \sim \mathcal{N}{\left(
\mu 
,
\sigma
\right)}
$$

In the example above, the mean is $ \mu=0 $ and the standard deviation is $ \sigma=0.1 $.

> In many real world examples, the data sample distribution follows a gaussian distribution. 
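
As a minimal sketch of the formula above, we can evaluate the PDF and draw samples using only the Python standard library. The helper name `gaussian_pdf`, the seed and the sample count are assumptions made for illustration.

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x, where sigma is the standard deviation."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 0.0, 0.1

# Density at the mean. This is a density, not a probability, so it may exceed 1.
peak = gaussian_pdf(mu, mu, sigma)

# Draw samples x ~ N(mu, sigma) and check their empirical mean and spread.
rng = random.Random(0)  # fixed seed, chosen only for reproducibility
samples = [rng.gauss(mu, sigma) for _ in range(10_000)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))

print(peak, sample_mean, sample_std)
```

The empirical mean and standard deviation should land close to $ \mu $ and $ \sigma $ as the number of samples grows.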

Now, we generalize it to multiple variables. For example, we want to model the relationship between body height and body weight for San Francisco residents. We collect the information from 1000 adult residents and plot the data below, with each red dot representing 1 person:

<img src="images/auto.png" width="60%">

We can plot the corresponding probability density function 

$$ PDF = probability(height=h, weight=w)$$ 

in 3D:

<img src="images/auto2.png" width="60%">

We can model such a probability density function using a gaussian distribution. The PDF with $p$ variables is:

$$
x = \begin{pmatrix}
x_1 \\
\vdots \\
x_p
\end{pmatrix}
$$

<img src="images/g1.png" width="60%">


with covariance matrix $ \Sigma $:

$$
\Sigma = \begin{pmatrix}
    E[(x_{1} - \mu_{1})(x_{1} - \mu_{1})] & E[(x_{1} - \mu_{1})(x_{2} - \mu_{2})] & \dots  & E[(x_{1} - \mu_{1})(x_{p} - \mu_{p})] \\
    E[(x_{2} - \mu_{2})(x_{1} - \mu_{1})] & E[(x_{2} - \mu_{2})(x_{2} - \mu_{2})] & \dots  & E[(x_{2} - \mu_{2})(x_{p} - \mu_{p})] \\
    \vdots & \vdots & \ddots & \vdots \\
    E[(x_{p} - \mu_{p})(x_{1} - \mu_{1})] & E[(x_{p} - \mu_{p})(x_{2} - \mu_{2})] & \dots  & E[(x_{p} - \mu_{p})(x_{p} - \mu_{p})]
\end{pmatrix}
$$

The notation for sampling $x$ is:

$$
x =
\begin{pmatrix}
x_1 \\
\vdots \\
x_p
\end{pmatrix}
\sim \mathcal{N}{\left(
\mu
,
\Sigma
\right)}
= \mathcal{N}{\left(
\begin{pmatrix}
\mu_1 \\
\vdots \\
\mu_p
\end{pmatrix}
,
\Sigma
\right)}
$$

Let's go back to our weight and height example to illustrate it.

$$
x = \begin{pmatrix}
weight \\
height 
\end{pmatrix}
$$

From the data, we compute a mean weight of 190 lb and a mean height of 70 inches:

$$
\mu = \begin{pmatrix}
190 \\
70
\end{pmatrix}
$$

For the covariance matrix $ \Sigma $, we illustrate how to compute one of its elements, $ E_{21} $:

$$
 E_{21} = E[(x_{2} - \mu_{2})(x_{1} - \mu_{1})] = E[(x_{height} - 70)(x_{weight} - 190)]
$$

where $E$ is the expected value. Let's say we have only 2 data points, (200 lb, 80 inches) and (180 lb, 60 inches):

$$
E_{21} = E[(x_{height} - 70)(x_{weight} - 190)] = \frac{1}{2} \left( ( 80 - 70) \times (200 - 190)  + ( 60 - 70) \times (180 - 190)  \right) = 100
$$
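
The same arithmetic can be sketched in a few lines of Python. The function name `covariance_matrix` is an assumption for illustration. It computes every element of the covariance matrix as an average over the data points, exactly as done by hand for $ E_{21} $ above.

```python
def covariance_matrix(points, mean):
    """Population covariance: E[(x_i - mu_i)(x_j - mu_j)] averaged over the points."""
    n, p = len(points), len(mean)
    return [
        [sum((pt[i] - mean[i]) * (pt[j] - mean[j]) for pt in points) / n
         for j in range(p)]
        for i in range(p)
    ]

# Each point is (weight in lb, height in inches), matching x = (weight, height).
points = [(200.0, 80.0), (180.0, 60.0)]
mean = (190.0, 70.0)

cov = covariance_matrix(points, mean)
e21 = cov[1][0]  # row 2, column 1: E[(height - 70)(weight - 190)]
print(e21)
```

The matrix is symmetric, so `cov[0][1]` and `cov[1][0]` hold the same value.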

After computing with all 1000 data points, here is the value of $ \Sigma $:

$$
\Sigma = \begin{pmatrix}
    100 & 25 \\
    25 & 50 \\
\end{pmatrix}
$$

$$
x \sim \mathcal{N}{\left(
\begin{pmatrix}
190 \\
70
\end{pmatrix}
,
\begin{pmatrix}
    100 & 25 \\
    25 & 50 \\
\end{pmatrix}
\right)}
$$
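
To make the notation concrete, here is a rough sketch of drawing samples from this 2-D gaussian distribution with the standard library only. It assumes a hand-written Cholesky factorization for the 2x2 case. The function name, seed and sample count are illustrative choices.

```python
import math
import random

def sample_gaussian_2d(mu, cov, rng):
    """Draw one sample from a 2-D Gaussian using a hand-written Cholesky factor."""
    # Factor cov = L L^T with L lower triangular.
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 ** 2)
    e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)  # independent standard normals
    return (mu[0] + l11 * e1, mu[1] + l21 * e1 + l22 * e2)

mu = (190.0, 70.0)                    # (weight, height)
cov = [[100.0, 25.0], [25.0, 50.0]]

rng = random.Random(0)
xs = [sample_gaussian_2d(mu, cov, rng) for _ in range(20_000)]

# Empirical statistics of the samples should approach mu and cov.
n = len(xs)
mean_w = sum(x[0] for x in xs) / n
mean_h = sum(x[1] for x in xs) / n
cov_wh = sum((x[0] - mean_w) * (x[1] - mean_h) for x in xs) / n
print(mean_w, mean_h, cov_wh)
```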

$ E_{21} $ measures the correlation between variables $x_2$ and $x_1$. A positive value means the two are positively related. Not surprisingly, $ E_{21} $ is positive because weight increases with height. If two variables are independent of each other, their covariance is 0, for example:

$$
\Sigma = \begin{pmatrix}
    100 & 0 \\
    0 & 50 \\
\end{pmatrix}
$$

and we will simplify the gaussian distribution notation here as:

$$
x \sim \mathcal{N}{\left(
\begin{pmatrix}
190 \\
70
\end{pmatrix}
,
\begin{pmatrix}
    100 \\
    50 \\
\end{pmatrix}
\right)}
$$

### Autoencoders

In an autoencoder, we use a deep network to map the input image (for example, 256x256 pixels = 65,536 dimensions) to lower-dimensional **latent variables** (a latent vector, say a 100-D vector $ (x_1, x_2, \cdots, x_{100}) $). We use another deep network to decode the latent variables and restore the image. We train both the encoder and decoder networks to minimize the difference between the original image and the decoded image. By forcing the image into a lower dimension, we hope the network learns to encode the image by extracting its core features.

<img src="images/auto3.jpg" width="80%">


For example, we feed a 256x256 image to the encoder, and a CNN encodes the image to 20-D latent variables $ (x_1, x_2, ... x_{20}) = (0.1, 0, ..., -0.05) $. We use another network to decode the latent variables into a 256x256 image. We use backpropagation, with a cost function comparing the decoded and input images, to train both the encoding and decoding networks.

### Variational Autoencoders (VAEs)

For VAEs, we replace the middle part with a stochastic model using a gaussian distribution. Let's get into an example to demonstrate the flow:

<img src="images/auto4.jpg" width="80%">


For a variational autoencoder, we replace the middle part with 2 separate steps. A VAE does not generate the latent vector directly. It generates one Gaussian distribution per latent dimension (20 in the example below), each represented by a mean $ (\mu_i) $ and a standard deviation $ (\sigma_i) $. Then it samples a latent vector, say (0.1, 0.03, ..., -0.01), from these distributions. For example, if element $ x_i $ of the latent vector has $ \mu_i=0.1 $ and $ \sigma_i=0.5 $, we randomly select $ x_i $ with probability based on this Gaussian distribution:

$$
p(X=x_{i}) = \frac{e^{-(x_{i} - \mu_i)^{2}/(2\sigma_i^{2}) }} {\sigma_i\sqrt{2\pi}}
$$

$$
z = 
\begin{pmatrix}
z_1 \\
\vdots \\
z_{20}
\end{pmatrix}
\sim \mathcal{N}{\left(
\begin{pmatrix}
\mu_1 \\
\vdots \\
\mu_{20}
\end{pmatrix}
,
\begin{pmatrix}
\sigma_{1}\\
\vdots\\
\sigma_{20}\\
\end{pmatrix}
\right)}
$$

Say, the encoder generates $ \mu=(0, -0.01, ..., 0.2) $ and $ \sigma=(0.05, 0.01, ..., 0.02) $ 

We can sample a value from this distribution:

$$
\mathcal{N}{\left(
\begin{pmatrix}
0 \\
-0.01 \\
\vdots \\
0.2
\end{pmatrix}
,
\begin{pmatrix}
0.05 \\
0.01 \\
\vdots \\
0.02
\end{pmatrix}
\right)}
$$

with the latent variables as (say) :

$$
z = 
\begin{pmatrix}
z_1 \\
\vdots \\
z_{20}
\end{pmatrix}
=
\begin{pmatrix}
0.03 \\
-0.015 \\
\vdots \\
0.197
\end{pmatrix}
$$

The autoencoder in the previous section is very hard to train, with little guarantee that the network generalizes well enough to make good predictions. (In that case, we say the network simply memorizes the training data.)
In VAEs, we add a constraint to make sure:
1. The latent variables are relatively independent of each other, i.e. the 20 variables are not correlated. This maximizes what a 20-D latent vector can represent.
1. Latent variables $z$ that are similar in value should generate similar-looking images. This is a good indication that the network is not trying to memorize individual images.

To achieve this, we want the gaussian distribution generated by the encoder to be as close as possible to a standard normal distribution. We penalize the cost function if the encoder's distribution deviates from it. This is very similar to L2 regularization in a fully connected network, which helps avoid overfitting.

$$
z \sim \mathcal{N}{\left(
0
,
1
\right)}
= \text{standard normal distribution}
$$

In a standard normal distribution, the covariance $ E_{ij} $ is 0 for $ i \neq j $. That is, the latent variables are independent of each other. If the distribution is normalized, the distance between different $z$ is a good measure of their similarity. With sampling and the gaussian distribution, we encourage the network to produce similar values of $z$ for similar images.

### Encoder

Here we have a CNN with 2 convolution layers using leaky ReLU, followed by one fully connected layer to generate 20 $ \mu $ and another fully connected layer for 20 $ \sigma $.

```python
def recognition(self, input_images):
    # lrelu, conv2d and linear are helper functions defined elsewhere (not shown here).
    with tf.variable_scope("recognition"):
        h1 = lrelu(conv2d(input_images, 1, 16, "d_h1"))   # Shape: (?, 28, 28, 1) -> (?, 14, 14, 16)
        h2 = lrelu(conv2d(h1, 16, 32, "d_h2"))            # (?, 7, 7, 32)
        h2_flat = tf.reshape(h2, [self.batchsize, 7 * 7 * 32])  # (100, 1568)

        # Two separate fully connected layers: one for the means, one for the standard deviations.
        w_mean = linear(h2_flat, self.n_z, "w_mean")      # (100, 20)
        w_stddev = linear(h2_flat, self.n_z, "w_stddev")  # (100, 20)

    return w_mean, w_stddev
```

### Decoder
The decoder feeds the 20 latent variables to a fully connected layer followed by 2 transpose convolution layers. ReLU is applied after the fully connected layer and the first transpose convolution, and the output of the second one is fed into a sigmoid to generate the image.

```python
def generation(self, z):
    with tf.variable_scope("generation"):
        z_develop = linear(z, 7 * 7 * 32, scope='z_matrix')  # (100, 20) -> (100, 1568)
        z_matrix = tf.nn.relu(tf.reshape(z_develop, [self.batchsize, 7, 7, 32]))  # (100, 7, 7, 32)
        h1 = tf.nn.relu(conv_transpose(z_matrix, [self.batchsize, 14, 14, 16], name="g_h1"))  # (100, 14, 14, 16)
        h2 = conv_transpose(h1, [self.batchsize, 28, 28, 1], name="g_h2")  # (100, 28, 28, 1)
        out = tf.nn.sigmoid(h2)  # (100, 28, 28, 1)

    return out     
```

### Building the VAE

We use the encoder to encode the input image. We then sample $ z $ using the mean and standard deviation of the gaussian distribution and decode it.

```python
# Encode the image
z_mean, z_stddev = self.recognition(image_matrix)
		
# Sampling z
samples = tf.random_normal([self.batchsize, self.n_z], 0, 1, dtype=tf.float32)
guessed_z = z_mean + (z_stddev * samples)

# Decode the image
self.generated_images = self.generation(guessed_z)
generated_flat = tf.reshape(self.generated_images, [self.batchsize, 28 * 28])
```
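
The sampling step above draws standard normal noise and then shifts and scales it. Here is a framework-free sketch of the same idea, assuming plain Python lists instead of tensors. The function name `sample_latent` and the 3-D values are illustrative only.

```python
import random

def sample_latent(z_mean, z_stddev, rng):
    """z_i = mean_i + stddev_i * eps_i with eps_i ~ N(0, 1), one draw per dimension."""
    return [m + s * rng.gauss(0, 1) for m, s in zip(z_mean, z_stddev)]

z_mean = [0.0, -0.01, 0.2]     # illustrative encoder outputs
z_stddev = [0.05, 0.01, 0.02]

rng = random.Random(0)
draws = [sample_latent(z_mean, z_stddev, rng) for _ in range(5_000)]

# Averaged over many draws, each dimension should sit near its mean.
avg = [sum(d[i] for d in draws) / len(draws) for i in range(len(z_mean))]
print(avg)
```

Writing the sample as `mean + stddev * noise` keeps the randomness in the noise term, so the mean and standard deviation remain ordinary inputs to a differentiable expression.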

### Cost function & training

We define a generation loss which measures the difference between the original and the decoded image using the per-pixel binary cross-entropy. The latent loss measures how far the gaussian distribution produced for the image is from a standard normal distribution, using the KL-divergence.

```python
self.generation_loss = -tf.reduce_sum(
    self.images * tf.log(1e-8 + generated_flat) + (1 - self.images) * tf.log(1e-8 + 1 - generated_flat), 1)

self.latent_loss = 0.5 * tf.reduce_sum(
    tf.square(z_mean) + tf.square(z_stddev) - tf.log(tf.square(z_stddev)) - 1, 1)

self.cost = tf.reduce_mean(self.generation_loss + self.latent_loss)
```
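
To see what the two terms compute for a single image, here is a small sketch in plain Python that mirrors the expressions above. The function names and the 4-pixel toy image are assumptions for illustration, not part of the TensorFlow model.

```python
import math

def generation_loss(image, generated, eps=1e-8):
    """Binary cross-entropy between pixel values, summed over pixels."""
    return -sum(
        x * math.log(eps + g) + (1 - x) * math.log(eps + 1 - g)
        for x, g in zip(image, generated)
    )

def latent_loss(z_mean, z_stddev):
    """KL divergence from N(mean, stddev^2) to N(0, 1), summed over dimensions."""
    return 0.5 * sum(
        m ** 2 + s ** 2 - math.log(s ** 2) - 1
        for m, s in zip(z_mean, z_stddev)
    )

image = [1.0, 0.0, 1.0, 0.0]
good = [0.9, 0.1, 0.8, 0.2]   # reconstruction close to the image
bad = [0.1, 0.9, 0.2, 0.8]    # reconstruction far from the image

print(generation_loss(image, good), generation_loss(image, bad))
print(latent_loss([0.0, 0.0], [1.0, 1.0]), latent_loss([0.5], [1.0]))
```

The latent loss vanishes only when every mean is 0 and every standard deviation is 1, which is what pulls the encoder towards the standard normal distribution.
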
We use the Adam optimizer to train both networks.

```python
self.optimizer = tf.train.AdamOptimizer(0.001).minimize(self.cost)
```

## Cost function in detail

In a VAE, we want to model the data distribution $p(x)$ with an encoder $ q_\phi(z \vert x)$, a decoder $p_\theta(x  \vert z) $ and a prior over the latent variables $p(z)$ through the VAE objective function:

$$
\log p(x) \approx \mathbb{E}_q [   \log p_\theta (x \vert z)] - D_{KL} [q_\phi (z \vert x) \Vert p(z)]
$$

To derive this, we start with the KL divergence, which measures the difference between 2 distributions. It is defined as:

$$
\begin{align}
D_{KL}\left(q \Vert p\right) & = \sum_{x} q(x) \log \left(\frac{q(x)}{p(x)}\right) \\
& = \mathbb{E}_q[\log q(x) - \log p(x)] \\
\end{align}
$$
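
For discrete distributions the definition translates directly into code. The sketch below assumes distributions given as lists of probabilities. The function name and the two example distributions are illustrative.

```python
import math

def kl_divergence(q, p):
    """D_KL(q || p) = sum_x q(x) * log(q(x) / p(x)) for discrete distributions."""
    # Terms with q(x) = 0 contribute nothing to the sum.
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.5, 0.5]
p = [0.9, 0.1]

kl_qp = kl_divergence(q, p)
kl_pq = kl_divergence(p, q)
print(kl_qp, kl_pq)
```

The two directions give different values, so the KL divergence is not symmetric, and it is 0 when both distributions are identical.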


Apply it with:

$$
\begin{align}
D_{KL}[q(z \vert x) \Vert p(z \vert x)] &= \mathbb{E}[\log q(z \vert x) - \log p(z \vert x)] \\
\end{align}
$$


Let $ q_\lambda (z \vert x) $ be the distribution of $ z $ predicted by our encoder deep network. We want it to match the true distribution $ p(z \vert x) $, i.e. we want the distribution approximated by the deep network to have little divergence from the true one. So we optimize $\lambda $ to get the smallest KL divergence:

$$
D_{KL} [ q_λ (z \vert x) \Vert p(z \vert x) ] = \mathbb{E}_q [ \log q_λ (z \vert x)  -   \log p (z \vert x) ]
$$

Apply:

$$
p(z \vert x) = \frac{p(x \vert z) p(z)}{p(x)}
$$

$$
\begin{align}
D_{KL} [ q_\lambda (z \vert x) \Vert p(z \vert x)  ] & = \mathbb{E}_q [ \log q_λ (z \vert x) - \log \frac{ p (x \vert z) p(z)}{p(x)}  ] \\
& = \mathbb{E}_q [ \log q_λ (z \vert x)  - \log p (x \vert z) - \log p(z)  + \log p(x)]   \\
& = \mathbb{E}_q [ \log q_λ (z \vert x)  - \log p (x \vert z) - \log p(z) ] + \log p(x) \\
 D_{KL} [ q_\lambda (z \vert x) \Vert p(z \vert x)  ]  - \log p(x) & = \mathbb{E}_q [ \log q_λ (z \vert x)  - \log p (x \vert z) - \log p(z) ] \\
 \log p(x) - D_{KL} [ q_\lambda (z \vert x) \Vert p(z \vert x)  ]  & = \mathbb{E}_q [   \log p (x \vert z) - ( \log q_λ (z \vert x) - \log p(z)) ] \\
&=  \mathbb{E}_q [   \log p (x \vert z)] - \mathbb{E}_q [ \log q_λ (z \vert x) - \log p(z) ] \\
&=  \mathbb{E}_q [   \log p (x \vert z)] - D_{KL} [q_λ (z \vert x) \Vert p(z)] \\
\end{align}
$$

Define the ELBO (evidence lower bound) as:

$$
\begin{align}
ELBO(λ) & =  \mathbb{E}_q [   \log p (x \vert z)] - D_{KL} [q_λ (z \vert x) \Vert p(z)] \\
\log p(x) - D_{KL} [ q_\lambda (z \vert x) \Vert p(z \vert x)  ] & = ELBO(λ)  \\
\end{align}
$$

We call ELBO the evidence lower bound because:

$$
\begin{align}
\log p(x) - D_{KL} [ q_\lambda (z \vert x) \Vert p(z \vert x)  ] & = ELBO(λ) \\
\log p(x) & \geqslant ELBO(λ) \quad \text{since KL is always non-negative} \\
\end{align}
$$
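
We can check this identity numerically on a toy model. The sketch below assumes a binary latent variable and a single fixed observation $x$. All probabilities are made-up illustrative numbers.

```python
import math

# Toy model with a binary latent variable z and one fixed observation x.
p_z = [0.5, 0.5]            # prior p(z)
p_x_given_z = [0.8, 0.1]    # likelihood p(x | z) for the fixed x
q = [0.6, 0.4]              # an arbitrary approximate posterior q(z | x)

# Evidence p(x) and the true posterior p(z | x) via Bayes' rule.
p_x = sum(l * pz for l, pz in zip(p_x_given_z, p_z))
posterior = [l * pz / p_x for l, pz in zip(p_x_given_z, p_z)]

def kl(a, b):
    """Discrete KL divergence D_KL(a || b)."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

# ELBO = E_q[log p(x | z)] - D_KL(q(z | x) || p(z))
elbo = sum(qi * math.log(l) for qi, l in zip(q, p_x_given_z)) - kl(q, p_z)
gap = kl(q, posterior)      # D_KL(q(z | x) || p(z | x))

print(math.log(p_x), elbo, gap)
```

The gap between $ \log p(x) $ and the ELBO is exactly the KL divergence to the true posterior, so the bound becomes tight when $q$ matches $ p(z \vert x) $.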

Here, we define our VAE objective function

> $$ \log p(x) - D_{KL} [ q_\lambda (z \vert x) \Vert p(z \vert x)  ] = \mathbb{E}_q [   \log p (x \vert z)] - D_{KL} [q_λ (z \vert x) \Vert p(z)]  $$


Instead of the distribution $p(x)$, we can model the data $x$ with $ \log p(x) $. Because the error term $D_{KL} [ q_\lambda (z \vert x) \Vert p(z \vert x)  ]$ is non-negative, the $ELBO$ is a lower bound for $ \log p(x) $, which in practice is good enough for modeling the data distribution. In the VAE objective function, maximizing our model probability $ \log p(x) $ is approximated by maximizing $ \mathbb{E}_q [\log p (x \vert z)]$ while minimizing the divergence $D_{KL} [q_λ (z \vert x) \Vert p(z)] $.

Maximizing $\mathbb{E}_q [\log p (x \vert z)]$ can be done by building a decoder network and maximizing its likelihood. So with an encoder $ q_\phi(z \vert x)$ and a decoder $p_\theta(x  \vert z) $, our objective becomes optimizing:

$$
ELBO(\theta, \phi) = \mathbb{E}_{q_\phi(z \vert x) }  [  \log p_{\theta}(x \vert z)  ] - D_{KL} [ q_\phi (z \vert x) \Vert p(z) ]
$$

We can apply a constraint to $ p(z) $ such that we can evaluate $D_{KL} [ q_\phi (z \vert x) \Vert p(z) ]$ easily. In a VAE, we use  $ p(z) = \mathcal{N} (0, 1) $. For the optimal solution, we want $ q_\phi (z \vert x) $ to be as close as possible to $\mathcal{N} (0, 1) $.

In a VAE, we model $ q_\phi (z \vert x) $ as $ \mathcal{N} (\mu, \Sigma)$ with a diagonal covariance $ \Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_k^2) $, where $k$ is the number of latent variables:

$$
\begin{align}
D_{KL} [ q_\phi (z \vert x) \Vert p(z) ] &= D_{KL}[N(\mu, \Sigma) \Vert N(0, 1)] \\
& = \frac{1}{2} \, ( \textrm{tr}(\Sigma) + \mu^T\mu - k - \log \, \det(\Sigma) ) \\
& = \frac{1}{2} \, ( \sum_{i=1}^k \sigma_i^2 + \sum_{i=1}^k \mu_i^2 - \sum_{i=1}^k 1 - \log \, \prod_{i=1}^k \sigma_i^2 ) \\
& = \frac{1}{2} \, \sum_{i=1}^k ( \sigma_i^2 + \mu_i^2 - 1 - \log \sigma_i^2 )
\end{align}
$$

### KL-divergence of 2 Gaussian distributions

Here is an exercise in computing the KL divergence of 2 simple gaussian distributions:

$$
p(x) = N(\mu_1, \sigma_1) \\
q(x) = N(\mu_2, \sigma_2)
$$

$$
\begin{align}
KL(p, q) &= \int \left[\log( p(x)) - \log( q(x)) \right] p(x) dx \\
& = E_1 \left[ -\frac{1}{2} \log(2\pi) - \log(\sigma_1) - \frac{1}{2} \left(\frac{x-\mu_1}{\sigma_1}\right)^2 + \frac{1}{2}\log(2\pi) + \log(\sigma_2) + \frac{1}{2} \left(\frac{x-\mu_2}{\sigma_2}\right)^2  \right] \\
&=E_{1} \left\{\log\left(\frac{\sigma_2}{\sigma_1}\right) + \frac{1}{2} \left[ \left(\frac{x-\mu_2}{\sigma_2}\right)^2 - \left(\frac{x-\mu_1}{\sigma_1}\right)^2 \right]\right\} \\
& =\log\left(\frac{\sigma_2}{\sigma_1}\right) + \frac{1}{2\sigma_2^2} E_1 \left\{(X-\mu_2)^2\right\} - \frac{1}{2\sigma_1^2} E_1 \left\{(X-\mu_1)^2\right\} \\
& =\log\left(\frac{\sigma_2}{\sigma_1}\right) + \frac{1}{2\sigma_2^2} E_1 \left\{(X-\mu_2)^2\right\} - \frac{1}{2} \quad \text{ because } E_1 \left\{(X-\mu_1)^2\right\} = \sigma_1^2\\
\text{Note: } & (X - \mu_2)^2 = (X-\mu_1+\mu_1-\mu_2)^2 = (X-\mu_1)^2 + 2(X-\mu_1)(\mu_1-\mu_2) + (\mu_1-\mu_2)^2 \\
KL(p, q) & = \log\left(\frac{\sigma_2}{\sigma_1}\right) + \frac{1}{2\sigma_2^2}
\left[E_1\left\{(X-\mu_1)^2\right\} + 2(\mu_1-\mu_2)E_1\left\{X-\mu_1\right\} + (\mu_1-\mu_2)^2\right] - \frac{1}{2} \\
& = \log\left(\frac{\sigma_2}{\sigma_1}\right) + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2} \\
\end{align}
$$
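
As a quick sanity check of the final expression, the sketch below implements it and compares it with the per-dimension term $ \frac{1}{2}(\sigma^2 + \mu^2 - 1 - \log \sigma^2) $ used for the latent loss, which is the special case $ \mu_2 = 0, \sigma_2 = 1 $. Function names and the test values are assumptions for illustration.

```python
import math

def kl_gaussians(mu1, sigma1, mu2, sigma2):
    """KL(p, q) for p = N(mu1, sigma1) and q = N(mu2, sigma2); sigma is the std dev."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

def kl_to_standard_normal(mu, sigma):
    """Special case q = N(0, 1), written in the form used by the latent loss."""
    return 0.5 * (sigma ** 2 + mu ** 2 - 1 - math.log(sigma ** 2))

mu, sigma = 0.3, 0.7
general = kl_gaussians(mu, sigma, 0.0, 1.0)
special = kl_to_standard_normal(mu, sigma)
print(general, special)
```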


# Model 2

Autoencoders are a family of neural network models aiming to learn compressed latent variables of high-dimensional data. Starting from the basic autoencoder model, this post reviews several variations, including denoising, sparse, and contractive autoencoders, and then Variational Autoencoder (VAE) and its modification beta-VAE.

The autoencoder was invented to reconstruct high-dimensional data using a neural network model with a narrow bottleneck layer in the middle (oops, this is probably not true for [Variational Autoencoder](#vae-variational-autoencoder), and we will investigate it in detail in later sections). A nice byproduct is dimension reduction: the bottleneck layer captures a compressed latent encoding. Such a low-dimensional representation can be used as an embedding vector in various applications (e.g. search), help data compression, or reveal the underlying data generative factors.


| Symbol | Meaning |
| --- | --- |
| $$\mathcal{D}$$ | The dataset, $$\mathcal{D} = \{ \mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(n)} \}$$, contains $$n$$ data samples; $$\vert\mathcal{D}\vert =n $$. |
| $$\mathbf{x}^{(i)}$$ | Each data point is a vector of $$d$$ dimensions, $$\mathbf{x}^{(i)} = [x^{(i)}_1, x^{(i)}_2, \dots, x^{(i)}_d]$$. |
| $$\mathbf{x}$$ | One data sample from the dataset, $$\mathbf{x} \in \mathcal{D}$$. |
| $$\mathbf{x}'$$ | The reconstructed version of $$\mathbf{x}$$. |
| $$\tilde{\mathbf{x}}$$ | The corrupted version of $$\mathbf{x}$$. |
| $$\mathbf{z}$$ | The compressed code learned in the bottleneck layer. |
| $$a_j^{(l)}$$ | The activation function for the $$j$$-th neuron in the $$l$$-th hidden layer. |
| $$g_{\phi}(.)$$ | The **encoding** function parameterized by $$\phi$$. |
| $$f_{\theta}(.)$$ | The **decoding** function parameterized by $$\theta$$. |
| $$q_{\phi}(\mathbf{z}\vert\mathbf{x})$$ | Estimated posterior probability function, also known as **probabilistic encoder**. |
| $$p_{\theta}(\mathbf{x}\vert\mathbf{z})$$ | Likelihood of generating true data sample given the latent code, also known as **probabilistic decoder**. |

## Autoencoder

**Autoencoder** is a neural network designed to learn an identity function in an unsupervised way to reconstruct the original input while compressing the data in the process so as to discover a more efficient and compressed representation. The idea originated in [the 1980s](https://en.wikipedia.org/wiki/Autoencoder), and was later promoted by the seminal paper by [Hinton & Salakhutdinov, 2006](https://pdfs.semanticscholar.org/c50d/ca78e97e335d362d6b991ae0e1448914e9a3.pdf).

It consists of two networks:
- *Encoder* network: It translates the original high-dimension input into the latent low-dimensional code. The input size is larger than the output size.
- *Decoder* network: The decoder network recovers the data from the code, likely with larger and larger output layers.


<img src="images/autoencoder-architecture.png" width="80%">


The encoder network essentially accomplishes [dimensionality reduction](https://en.wikipedia.org/wiki/Dimensionality_reduction), just like how we would use Principal Component Analysis (PCA) or Matrix Factorization (MF). In addition, the autoencoder is explicitly optimized for the data reconstruction from the code. A good intermediate representation not only can capture latent variables, but also benefits a full [decompression](https://ai.googleblog.com/2016/09/image-compression-with-neural-networks.html) process.

The model contains an encoder function $g(.)$ parameterized by $\phi$ and a decoder function $f(.)$ parameterized by $\theta$. The low-dimensional code learned for input $\mathbf{x}$ in the bottleneck layer is $\mathbf{z} = g_\phi(\mathbf{x})$ and the reconstructed input is $\mathbf{x}' = f_\theta(g_\phi(\mathbf{x}))$.

The parameters $(\theta, \phi)$ are learned together to output a reconstructed data sample same as the original input, $\mathbf{x} \approx f_\theta(g_\phi(\mathbf{x}))$, or in other words, to learn an identity function. There are various metrics to quantify the difference between two vectors, such as cross entropy when the activation function is sigmoid, or as simple as MSE loss:

$$
L_\text{AE}(\theta, \phi) = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}^{(i)} - f_\theta(g_\phi(\mathbf{x}^{(i)})))^2
$$
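
To make the loss concrete, here is a tiny sketch with a hand-written encoder and decoder standing in for the two networks. Both functions are hypothetical stand-ins chosen so the numbers are easy to follow. A real autoencoder would learn $g_\phi$ and $f_\theta$ instead.

```python
def encode(x):
    """Hypothetical encoder g_phi: pairwise averages (4-D -> 2-D)."""
    return [(x[0] + x[1]) / 2, (x[2] + x[3]) / 2]

def decode(z):
    """Hypothetical decoder f_theta: repeat each code value (2-D -> 4-D)."""
    return [z[0], z[0], z[1], z[1]]

def ae_loss(data):
    """L_AE = (1/n) * sum_i ||x_i - f(g(x_i))||^2."""
    total = 0.0
    for x in data:
        x_rec = decode(encode(x))
        total += sum((a - b) ** 2 for a, b in zip(x, x_rec))
    return total / len(data)

data = [
    [1.0, 1.0, 0.0, 0.0],   # fits the bottleneck perfectly, zero error
    [1.0, 0.0, 0.0, 1.0],   # loses information in the 2-D code
]
loss = ae_loss(data)
print(loss)
```

The first sample survives the 2-D bottleneck unchanged while the second does not, which is the kind of reconstruction error that training tries to reduce.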

## Denoising Autoencoder

Since the autoencoder learns the identity function, we are facing the risk of “overfitting” when there are more network parameters than the number of data points. 

To avoid overfitting and improve the robustness, **Denoising Autoencoder** (Vincent et al. 2008) proposed a modification to the basic autoencoder. The input is partially corrupted by adding noises to or masking some values of the input vector in a stochastic manner, $\tilde{\mathbf{x}} \sim \mathcal{M}_\mathcal{D}(\tilde{\mathbf{x}} \vert \mathbf{x})$. Then the model is trained to recover the original input (**Note: Not the corrupt one!**).


$$
\begin{aligned}
\tilde{\mathbf{x}}^{(i)} &\sim \mathcal{M}_\mathcal{D}(\tilde{\mathbf{x}}^{(i)} \vert \mathbf{x}^{(i)})\\
L_\text{DAE}(\theta, \phi) &= \frac{1}{n} \sum_{i=1}^n (\mathbf{x}^{(i)} - f_\theta(g_\phi(\tilde{\mathbf{x}}^{(i)})))^2
\end{aligned}
$$


where $\mathcal{M}_\mathcal{D}$ defines the mapping from the true data samples to the noisy or corrupted ones.

<img src="images/denoising-autoencoder-architecture.png" width="80%">


This design is motivated by the fact that humans can easily recognize an object or a scene even when the view is partially occluded or corrupted. To “repair” the partially destroyed input, the denoising autoencoder has to discover and capture the relationship between dimensions of the input in order to infer missing pieces.

For high dimensional input with high redundancy, like images, the model is likely to depend on evidence gathered from a combination of many input dimensions to recover the denoised version (sounds like the [attention]({{ site.baseurl }}{% post_url 2018-06-24-attention-attention %}) mechanism, right?) rather than to overfit one dimension. This builds up a good foundation for learning *robust* latent representation.

The noise is controlled by a stochastic mapping $$\mathcal{M}_\mathcal{D}(\tilde{\mathbf{x}} \vert \mathbf{x})$$, and it is not specific to a particular type of corruption process (e.g. masking noise, Gaussian noise, salt-and-pepper noise, etc.). Naturally the corruption process can be equipped with prior knowledge.

In the experiment of the original DAE paper, the noise is applied in this way: a fixed proportion of input dimensions are selected at random and their values are forced to 0. Sounds a lot like dropout, right? Well, the denoising autoencoder was proposed in 2008, 4 years before the dropout paper ([Hinton, et al. 2012](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf)) ;)
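
The masking corruption described above can be sketched in a few lines. The function name `mask_noise`, the proportion of 0.3 and the seed are assumptions for illustration, not values taken from the paper.

```python
import random

def mask_noise(x, proportion, rng):
    """Return a copy of x with a fixed proportion of randomly chosen entries set to 0."""
    n_masked = int(round(proportion * len(x)))
    masked_idx = set(rng.sample(range(len(x)), n_masked))
    return [0.0 if i in masked_idx else v for i, v in enumerate(x)]

rng = random.Random(0)
x = [0.2, 0.9, 0.5, 0.7, 0.1, 0.3, 0.8, 0.6, 0.4, 1.0]
x_tilde = mask_noise(x, 0.3, rng)   # corrupted input fed to the encoder
print(x_tilde)
```

During training the network receives `x_tilde`, but the loss compares its output against the clean `x`.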

**Stacked Denoising Autoencoder**: In the old days when it was still hard to train deep neural networks, stacking denoising autoencoders was a way to build deep models ([Vincent et al., 2010](http://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf)). The denoising autoencoders are trained layer by layer. Once one layer has been trained, it is fed with clean, uncorrupted inputs to learn the encoding in the next layer.

<img src="images/stacking-dae.png" width="80%">

*(Image source: [Vincent et al., 2010](http://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf))*

## Sparse Autoencoder

**Sparse Autoencoder** applies a "sparse" constraint on the hidden unit activation to avoid overfitting and improve robustness. It forces the model to only have a small number of hidden units being activated at the same time, or in other words, one hidden neuron should be inactive most of the time.

Recall that common [activation functions](http://cs231n.github.io/neural-networks-1/#actfun) include sigmoid, tanh, relu, leaky relu, etc. A neuron is activated when the value is close to 1 and inactive when the value is close to 0.

Let’s say there are $s_l$ neurons in the $l$-th hidden layer and the activation function for the $j$-th neuron in this layer is labelled as $a^{(l)}_j(.)$, $j=1, \dots, s_l$. The fraction of activation of this neuron $\hat{\rho}_j$ is expected to be a small number $\rho$, known as *sparsity parameter*; a common config is $\rho = 0.05$.


$$
\hat{\rho}_j^{(l)} = \frac{1}{n} \sum_{i=1}^n [a_j^{(l)}(\mathbf{x}^{(i)})] \approx \rho
$$

This constraint is achieved by adding a penalty term into the loss function. The KL-divergence $D_\text{KL}$ measures the difference between two Bernoulli distributions, one with mean $$\rho$$ and the other with mean $\hat{\rho}_j^{(l)}$. The hyperparameter $\beta$ controls how strong a penalty we want to apply on the sparsity loss.


$$
\begin{aligned}
L_\text{SAE}(\theta) 
&= L(\theta) + \beta \sum_{l=1}^L \sum_{j=1}^{s_l} D_\text{KL}(\rho \| \hat{\rho}_j^{(l)}) \\
&= L(\theta) + \beta \sum_{l=1}^L \sum_{j=1}^{s_l} \Big( \rho\log\frac{\rho}{\hat{\rho}_j^{(l)}} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j^{(l)}} \Big)
\end{aligned}
$$
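
Here is a small sketch of the penalty term for a single hidden layer, assuming activations in $(0, 1)$ such as sigmoid outputs. The function names and the toy activation values are illustrative assumptions.

```python
import math

def bernoulli_kl(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))

def sparsity_penalty(activations, rho=0.05, beta=1.0):
    """activations[i][j]: activation of hidden unit j on sample i, in (0, 1)."""
    n = len(activations)
    n_units = len(activations[0])
    # Average activation of each unit over the n samples.
    rho_hats = [sum(row[j] for row in activations) / n for j in range(n_units)]
    return beta * sum(bernoulli_kl(rho, r) for r in rho_hats)

sparse = [[0.05, 0.04], [0.05, 0.06]]   # units are mostly inactive
dense = [[0.9, 0.8], [0.7, 0.95]]       # units are active most of the time

print(sparsity_penalty(sparse), sparsity_penalty(dense))
```

The penalty stays near 0 when the average activations are close to $\rho$ and grows as the units become active more often.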


![KL divergence]({{ '/assets/images/kl-metric-sparse-autoencoder.png' | relative_url }})
{: style="width: 80%;" class="center"}
*Fig. 4. The KL divergence between a Bernoulli distribution with mean $$\rho=0.25$$ and a Bernoulli distribution with mean $$0 \leq \hat{\rho} \leq 1$$.*

**$$k$$-Sparse Autoencoder**

In $$k$$-Sparse Autoencoder ([Makhzani and Frey, 2013](https://arxiv.org/abs/1312.5663)), the sparsity is enforced by only keeping the top k highest activations in the bottleneck layer with linear activation function.

1. First, run feedforward through the encoder network to get the compressed code: $$\mathbf{z} = g(\mathbf{x})$$.
2. Sort the values in the code vector $$\mathbf{z}$$. Only the k largest values are kept while the other neurons are set to 0. This can be done in a ReLU layer with an adjustable threshold too. Now we have a sparsified code: $$\mathbf{z}' = \text{Sparsify}(\mathbf{z})$$.
3. Compute the output and the loss from the sparsified code, $$L = \|\mathbf{x} - f(\mathbf{z}') \|_2^2$$.
4. Back-propagation only goes through the top k activated hidden units!
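
The sparsification step can be sketched as follows. The function name `sparsify` and the example code vector are assumptions for illustration.

```python
def sparsify(z, k):
    """Keep the k largest values of z and set every other entry to 0."""
    top_idx = set(sorted(range(len(z)), key=lambda i: z[i], reverse=True)[:k])
    return [v if i in top_idx else 0.0 for i, v in enumerate(z)]

z = [0.1, 2.0, -0.5, 1.2, 0.7]
z_sparse = sparsify(z, k=2)
print(z_sparse)
```

Entries that are zeroed out contribute no gradient, which is why back-propagation only flows through the k surviving units.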


![k-sparse autoencoder]({{ '/assets/images/k-sparse-autoencoder.png' | relative_url }})
{: style="width: 100%;" class="center"}
*Fig. 5. Filters of the k-sparse autoencoder for different sparsity levels k, learnt from MNIST with 1000 hidden units. (Image source: [Makhzani and Frey, 2013](https://arxiv.org/abs/1312.5663))*

## Contractive Autoencoder

Similar to sparse autoencoder, **Contractive Autoencoder** ([Rifai, et al, 2011](http://www.icml-2011.org/papers/455_icmlpaper.pdf)) encourages the learned representation to stay in a contractive space for better robustness. 

It adds a term in the loss function to penalize the representation being too sensitive to the input,  and thus improve the robustness to small perturbations around the training data points. The sensitivity is measured by the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input:


$$
\|J_f(\mathbf{x})\|_F^2 = \sum_{ij} \Big( \frac{\partial h_j(\mathbf{x})}{\partial x_i} \Big)^2
$$

where $$h_j$$ is one unit output in the compressed code $$\mathbf{z} = f(x)$$. 

This penalty term is the sum of squares of all partial derivatives of the learned encoding with respect to input dimensions. The authors claimed that empirically this penalty was found to carve a representation that corresponds to a lower-dimensional non-linear manifold, while staying more invariant to the majority of directions orthogonal to the manifold.
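
For a single-layer sigmoid encoder the Jacobian has a closed form, which makes the penalty cheap to compute. The sketch below assumes such an encoder, $h_j = \text{sigmoid}(\sum_i W_{ji} x_i + b_j)$. The weights, bias and input are made-up illustrative numbers.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def encoder(x, W, b):
    """h_j = sigmoid(sum_i W[j][i] * x[i] + b[j])."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of dh/dx, using dh_j/dx_i = h_j * (1 - h_j) * W[j][i]."""
    h = encoder(x, W, b)
    return sum((hj * (1 - hj)) ** 2 * sum(w ** 2 for w in row)
               for hj, row in zip(h, W))

W = [[0.5, -1.0, 0.3], [1.5, 0.2, -0.7]]   # hypothetical 3-D -> 2-D encoder weights
b = [0.1, -0.2]
x = [0.4, 0.6, -0.1]

penalty = contractive_penalty(x, W, b)
print(penalty)
```

Units that saturate (with $h_j$ near 0 or 1) contribute almost nothing, since $h_j(1 - h_j)$ is then close to 0.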


## VAE: Variational Autoencoder

The idea of **Variational Autoencoder** ([Kingma & Welling, 2014](https://arxiv.org/abs/1312.6114)), short for **VAE**, is actually less similar to all the autoencoder models above, but deeply rooted in the methods of variational Bayesian inference and graphical models.

Instead of mapping the input into a *fixed* vector, we want to map it into a distribution. Let’s label this distribution as $$p_\theta$$, parameterized by $$\theta$$.  The relationship between the data input $$\mathbf{x}$$ and the latent encoding vector $$\mathbf{z}$$ can be fully defined by:
- Prior $$p_\theta(\mathbf{z})$$
- Likelihood $$p_\theta(\mathbf{x}\vert\mathbf{z})$$
- Posterior $$p_\theta(\mathbf{z}\vert\mathbf{x})$$


Assuming that we know the real parameter $$\theta^{*}$$ for this distribution, in order to generate a sample that looks like a real data point $$\mathbf{x}^{(i)}$$, we follow these steps:
1. First, sample a $$\mathbf{z}^{(i)}$$ from a prior distribution $$p_{\theta^*}(\mathbf{z})$$. 
2. Then a value $$\mathbf{x}^{(i)}$$ is generated from a conditional distribution $$p_{\theta^*}(\mathbf{x} \vert \mathbf{z} = \mathbf{z}^{(i)})$$.


The optimal parameter $$\theta^{*}$$ is the one that maximizes the probability of generating real data samples:


$$
\theta^{*} = \arg\max_\theta \prod_{i=1}^n p_\theta(\mathbf{x}^{(i)})
$$

Commonly we use the log probabilities to convert the product on RHS to a sum:


$$
\theta^{*} = \arg\max_\theta \sum_{i=1}^n \log p_\theta(\mathbf{x}^{(i)})
$$

Now let’s update the equation to better demonstrate the data generation process so as to involve the encoding vector:


$$
p_\theta(\mathbf{x}^{(i)}) = \int p_\theta(\mathbf{x}^{(i)}\vert\mathbf{z}) p_\theta(\mathbf{z}) d\mathbf{z} 
$$

Unfortunately it is not easy to compute $$p_\theta(\mathbf{x}^{(i)})$$ in this way, as it is very expensive to check all the possible values of $$\mathbf{z}$$ and sum them up. To narrow down the value space to facilitate faster search, we would like to introduce a new approximation function to output what is a likely code given an input $$\mathbf{x}$$, $$q_\phi(\mathbf{z}\vert\mathbf{x})$$, parameterized by $$\phi$$.
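
To get a feel for the integral, here is a rough Monte Carlo sketch on a toy model where the answer is known. It assumes $z \sim \mathcal{N}(0, 1)$ and $x \vert z \sim \mathcal{N}(z, 1)$, so that the marginal is exactly $x \sim \mathcal{N}(0, 2)$. The model, seed and sample count are illustrative choices.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy model (an assumption for illustration): z ~ N(0, 1), x | z ~ N(z, 1).
x = 1.0
rng = random.Random(0)
n_samples = 20_000

# p(x) = E_{z ~ p(z)}[p(x | z)], estimated by averaging over samples from the prior.
estimate = sum(normal_pdf(x, rng.gauss(0, 1), 1.0) for _ in range(n_samples)) / n_samples
exact = normal_pdf(x, 0.0, 2.0)

print(estimate, exact)
```

In one dimension a few thousand prior samples are enough. With a high-dimensional $$\mathbf{z}$$ most prior samples explain a given $$\mathbf{x}$$ poorly, which is the motivation for proposing likely codes with $$q_\phi(\mathbf{z}\vert\mathbf{x})$$.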


![Distributions in VAE]({{ '/assets/images/VAE-graphical-model.png' | relative_url }})
{: style="width: 80%;" class="center"}
*Fig. 6. The graphical model involved in Variational Autoencoder.  Solid lines denote the generative distribution $$p_\theta(.)$$ and dashed lines denote the distribution $$q_\phi (\mathbf{z}\vert\mathbf{x})$$ to approximate the intractable posterior $$p_\theta (\mathbf{z}\vert\mathbf{x})$$.*

Now the structure looks a lot like an autoencoder:
- The conditional probability $$p_\theta(\mathbf{x} \vert \mathbf{z})$$ defines a generative model, similar to the decoder $$f_\theta(\mathbf{x} \vert \mathbf{z})$$ introduced above. $$p_\theta(\mathbf{x} \vert \mathbf{z})$$ is also known as *probabilistic decoder*. 
- The approximation function $$q_\phi(\mathbf{z} \vert \mathbf{x})$$ is the *probabilistic encoder*, playing a similar role as $$g_\phi(\mathbf{z} \vert \mathbf{x})$$ above.


### Loss Function

The estimated posterior $$q_\phi(\mathbf{z}\vert\mathbf{x})$$ should be very close to the real one $$p_\theta(\mathbf{z}\vert\mathbf{x})$$. We can use [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) to quantify the distance between these two distributions. KL divergence $$D_\text{KL}(X\|Y)$$ measures how much information is lost if the distribution Y is used to represent X.

In our case we want to minimize $$D_\text{KL}( q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z}\vert\mathbf{x}) )$$ with respect to $$\phi$$.

But why use $$D_\text{KL}(q_\phi \| p_\theta)$$ (reversed KL) instead of $$D_\text{KL}(p_\theta \| q_\phi)$$ (forward KL)? Eric Jang has a great explanation in his [post](https://blog.evjang.com/2016/08/variational-bayes.html) on Bayesian Variational methods. As a quick recap:


![Forward vs reversed KL]({{ '/assets/images/forward_vs_reversed_KL.png' | relative_url }})
{: style="width: 100%;" class="center"}
*Fig. 7. Forward and reversed KL divergence have different demands on how to match two distributions. (Image source: [blog.evjang.com/2016/08/variational-bayes.html](https://blog.evjang.com/2016/08/variational-bayes.html))*

- Forward KL divergence: $$D_\text{KL}(P\|Q) = \mathbb{E}_{z\sim P(z)} \log\frac{P(z)}{Q(z)}$$; we have to ensure that $$Q(z)>0$$ wherever $$P(z)>0$$. The optimized variational distribution $$Q(z)$$ has to cover the entire $$P(z)$$.
- Reversed KL divergence: $$D_\text{KL}(Q\|P) = \mathbb{E}_{z\sim Q(z)} \log\frac{Q(z)}{P(z)}$$; minimizing the reversed KL divergence squeezes $$Q(z)$$ under $$P(z)$$.
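
To make the asymmetry concrete, here is a minimal NumPy sketch on a discrete toy example. The distributions `P`, `Q_cover`, and `Q_mode` are made-up values chosen for illustration, not anything from the VAE itself: `P` has two modes, `Q_cover` spreads mass over everything, and `Q_mode` commits to a single mode.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as probability arrays."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(z) = 0 contribute nothing to the sum
    with np.errstate(divide="ignore"):  # log(0) = -inf is intended here
        return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

# Illustrative distributions over three states (assumed values).
P = np.array([0.5, 0.0, 0.5])        # bimodal target
Q_cover = np.array([0.4, 0.2, 0.4])  # covers both modes, and the gap between them
Q_mode = np.array([1.0, 0.0, 0.0])   # sits entirely on one mode

forward_cover = kl_divergence(P, Q_cover)  # finite: Q > 0 wherever P > 0
forward_mode = kl_divergence(P, Q_mode)    # infinite: Q = 0 where P > 0
reverse_cover = kl_divergence(Q_cover, P)  # infinite: Q > 0 where P = 0
reverse_mode = kl_divergence(Q_mode, P)    # finite: Q stays under P

print(forward_cover, forward_mode, reverse_cover, reverse_mode)
```

In this toy setting the forward KL only tolerates the covering distribution, while the reversed KL only tolerates the one that stays within the support of `P`, which matches the two bullet points above.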


Let's now expand the equation:

$$
\begin{aligned}
& D_\text{KL}( q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z}\vert\mathbf{x}) ) & \\
&=\int q_\phi(\mathbf{z} \vert \mathbf{x}) \log\frac{q_\phi(\mathbf{z} \vert \mathbf{x})}{p_\theta(\mathbf{z} \vert \mathbf{x})} d\mathbf{z} & \\
&=\int q_\phi(\mathbf{z} \vert \mathbf{x}) \log\frac{q_\phi(\mathbf{z} \vert \mathbf{x})p_\theta(\mathbf{x})}{p_\theta(\mathbf{z}, \mathbf{x})} d\mathbf{z} & \scriptstyle{\text{; Because }p(z \vert x) = p(z, x) / p(x)} \\
&=\int q_\phi(\mathbf{z} \vert \mathbf{x}) \big( \log p_\theta(\mathbf{x}) + \log\frac{q_\phi(\mathbf{z} \vert \mathbf{x})}{p_\theta(\mathbf{z}, \mathbf{x})} \big) d\mathbf{z} & \\
&=\log p_\theta(\mathbf{x}) + \int q_\phi(\mathbf{z} \vert \mathbf{x})\log\frac{q_\phi(\mathbf{z} \vert \mathbf{x})}{p_\theta(\mathbf{z}, \mathbf{x})} d\mathbf{z} & \scriptstyle{\text{; Because }\int q(z \vert x) dz = 1}\\
&=\log p_\theta(\mathbf{x}) + \int q_\phi(\mathbf{z} \vert \mathbf{x})\log\frac{q_\phi(\mathbf{z} \vert \mathbf{x})}{p_\theta(\mathbf{x}\vert\mathbf{z})p_\theta(\mathbf{z})} d\mathbf{z} & \scriptstyle{\text{; Because }p(z, x) = p(x \vert z) p(z)} \\
&=\log p_\theta(\mathbf{x}) + \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z} \vert \mathbf{x})}[\log \frac{q_\phi(\mathbf{z} \vert \mathbf{x})}{p_\theta(\mathbf{z})} - \log p_\theta(\mathbf{x} \vert \mathbf{z})] &\\
&=\log p_\theta(\mathbf{x}) + D_\text{KL}(q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z})) - \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}\vert\mathbf{x})}\log p_\theta(\mathbf{x}\vert\mathbf{z}) &
\end{aligned}
$$


So we have:

$$
D_\text{KL}( q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z}\vert\mathbf{x}) ) =\log p_\theta(\mathbf{x}) + D_\text{KL}(q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z})) - \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}\vert\mathbf{x})}\log p_\theta(\mathbf{x}\vert\mathbf{z})
$$


Rearranging the left- and right-hand sides of the equation gives:

$$
\log p_\theta(\mathbf{x}) - D_\text{KL}( q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z}\vert\mathbf{x}) ) = \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}\vert\mathbf{x})}\log p_\theta(\mathbf{x}\vert\mathbf{z}) - D_\text{KL}(q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z}))
$$

The LHS of the equation is exactly what we want to maximize when learning the true distributions: we want to maximize the (log-)likelihood of generating real data (that is $$\log p_\theta(\mathbf{x})$$) and also minimize the difference between the real and estimated posterior distributions (the term $$D_\text{KL}$$ works like a regularizer).  Note that $$p_\theta(\mathbf{x})$$ is fixed with respect to $$q_\phi$$.

The negation of the above defines our loss function:

$$
\begin{aligned}
L_\text{VAE}(\theta, \phi) 
&= -\log p_\theta(\mathbf{x}) + D_\text{KL}( q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z}\vert\mathbf{x}) )\\
&= - \mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z}\vert\mathbf{x})} \log p_\theta(\mathbf{x}\vert\mathbf{z}) + D_\text{KL}( q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z}) ) \\
\theta^{*}, \phi^{*} &= \arg\min_{\theta, \phi} L_\text{VAE}
\end{aligned}
$$

In Variational Bayesian methods, the negation of this loss function, $$-L_\text{VAE}$$, is known as the *variational lower bound*, or *evidence lower bound*. The “lower bound” part in the name comes from the fact that KL divergence is always non-negative and thus $$-L_\text{VAE}$$ is a lower bound of $$\log p_\theta (\mathbf{x})$$.

$$
-L_\text{VAE} = \log p_\theta(\mathbf{x}) - D_\text{KL}( q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z}\vert\mathbf{x}) ) \leq \log p_\theta(\mathbf{x})
$$

Therefore by minimizing the loss, we are maximizing the lower bound of the probability of generating real data samples.
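
The identity behind the loss can be checked numerically on a case small enough to enumerate. The sketch below assumes a discrete latent variable with three states and a single fixed observation; the numbers in `prior`, `likelihood`, and `q` are arbitrary illustrative values, not taken from any trained model.

```python
import numpy as np

# Toy model: discrete latent z in {0, 1, 2}, one fixed observation x.
prior = np.array([0.5, 0.3, 0.2])          # p(z)
likelihood = np.array([0.10, 0.60, 0.30])  # p(x | z) evaluated at the observed x
q = np.array([0.2, 0.5, 0.3])              # q(z | x), an arbitrary approximation

evidence = float(np.sum(likelihood * prior))  # p(x) = sum_z p(x | z) p(z)
posterior = likelihood * prior / evidence     # p(z | x) via Bayes' rule

def kl(a, b):
    """D_KL(a || b) for strictly positive discrete distributions."""
    return float(np.sum(a * np.log(a / b)))

def elbo(q_dist):
    """E_q[log p(x | z)] - D_KL(q || p(z)), i.e. -L_VAE for this toy model."""
    return float(np.sum(q_dist * np.log(likelihood))) - kl(q_dist, prior)

lhs = np.log(evidence) - kl(q, posterior)  # log p(x) - KL(q || true posterior)
rhs = elbo(q)                              # the tractable side of the identity
tight = elbo(posterior)                    # bound when q equals the true posterior

print(lhs, rhs, np.log(evidence), tight)
```

Both sides agree up to floating-point error, the bound never exceeds $$\log p_\theta(\mathbf{x})$$, and it becomes tight when $$q$$ equals the true posterior.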

### Reparameterization Trick

The expectation term in the loss function involves generating samples from $$\mathbf{z} \sim q_\phi(\mathbf{z}\vert\mathbf{x})$$. Sampling is a stochastic process, and therefore we cannot backpropagate the gradient through it. To make it trainable, the reparameterization trick is introduced: it is often possible to express the random variable $$\mathbf{z}$$ as a deterministic variable $$\mathbf{z} = \mathcal{T}_\phi(\mathbf{x}, \boldsymbol{\epsilon})$$, where $$\boldsymbol{\epsilon}$$ is an auxiliary independent random variable, and the transformation function $$\mathcal{T}_\phi$$, parameterized by $$\phi$$, converts $$\boldsymbol{\epsilon}$$ to $$\mathbf{z}$$.

For example, a common choice of the form of $$q_\phi(\mathbf{z}\vert\mathbf{x})$$ is a multivariate Gaussian with a diagonal covariance structure:


$$
\begin{aligned}
\mathbf{z} &\sim q_\phi(\mathbf{z}\vert\mathbf{x}^{(i)}) = \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}^{(i)}, \boldsymbol{\sigma}^{2(i)}\boldsymbol{I}) & \\
\mathbf{z} &= \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon} \text{, where } \boldsymbol{\epsilon} \sim \mathcal{N}(0, \boldsymbol{I}) & \scriptstyle{\text{; Reparameterization trick.}}
\end{aligned}
$$

where $$\odot$$ refers to element-wise product.
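
As a rough illustration, the NumPy sketch below draws $$\mathbf{z}$$ through the reparameterized form and uses it to estimate gradients of a toy objective $$f(\mathbf{z}) = \sum_j z_j^2$$. The objective and the values of `mu` and `sigma` are assumptions picked so the exact answer is known ($$\mathbb{E}[z_j^2] = \mu_j^2 + \sigma_j^2$$); in a real VAE an autodiff framework would compute these gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative distribution parameters (in a VAE the encoder would output these).
mu = np.array([1.0, -2.0])
sigma = np.array([0.5, 2.0])

# All randomness lives in epsilon; z is a deterministic function of mu, sigma, epsilon.
eps = rng.standard_normal(size=(200_000, 2))
z = mu + sigma * eps  # element-wise product, broadcast over samples

# Toy objective f(z) = sum_j z_j^2. Because dz/dmu = 1 and dz/dsigma = eps,
# the chain rule gives per-sample gradients that can simply be averaged.
grad_mu = np.mean(2.0 * z, axis=0)           # estimates d E[f] / d mu = 2 * mu
grad_sigma = np.mean(2.0 * z * eps, axis=0)  # estimates d E[f] / d sigma = 2 * sigma

print(grad_mu, 2.0 * mu)
print(grad_sigma, 2.0 * sigma)
```

The Monte Carlo estimates land close to the analytic gradients, which is possible only because the sample is written as a differentiable function of the parameters.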


![Reparameterization trick]({{ '/assets/images/reparameterization-trick.png' | relative_url }})
{: style="width: 80%;" class="center"}
*Fig. 8. Illustration of how the reparameterization trick makes the $$\mathbf{z}$$ sampling process trainable. (Image source: Slide 12 in Kingma’s NIPS 2015 workshop [talk](http://dpkingma.com/wordpress/wp-content/uploads/2015/12/talk_nips_workshop_2015.pdf))*

The reparameterization trick works for other types of distributions too, not only Gaussian.
In the multivariate Gaussian case, we make the model trainable by learning the mean and standard deviation of the distribution, $$\boldsymbol{\mu}$$ and $$\boldsymbol{\sigma}$$, explicitly using the reparameterization trick, while the stochasticity remains in the random variable $$\boldsymbol{\epsilon} \sim \mathcal{N}(0, \boldsymbol{I})$$.


![Gaussian VAE]({{ '/assets/images/vae-gaussian.png' | relative_url }})
{: style="width: 100%;" class="center"}
*Fig. 9. Illustration of variational autoencoder model with the multivariate Gaussian assumption.*
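
Under the Gaussian assumption, the KL term of the loss has a well-known closed form if we additionally assume a standard normal prior $$p_\theta(\mathbf{z}) = \mathcal{N}(0, \boldsymbol{I})$$ (an assumption for this sketch; the prior is not fixed in the text above). The snippet compares that closed form with a Monte Carlo estimate drawn via the reparameterization trick. The values of `mu` and `log_var` are hypothetical encoder outputs, and working with the log-variance is a common convention rather than a requirement.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * float(np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))

def gaussian_log_density(z, mu, log_var):
    """Log density of a diagonal Gaussian, summed over the last axis."""
    return -0.5 * np.sum(
        np.log(2.0 * np.pi) + log_var + (z - mu) ** 2 / np.exp(log_var), axis=-1
    )

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])       # hypothetical encoder mean for one input
log_var = np.array([-0.5, 0.3])  # hypothetical encoder log-variance

# Monte Carlo estimate of E_q[log q(z | x) - log p(z)] with reparameterized samples.
eps = rng.standard_normal(size=(200_000, 2))
z = mu + np.exp(0.5 * log_var) * eps
kl_mc = float(np.mean(
    gaussian_log_density(z, mu, log_var)
    - gaussian_log_density(z, np.zeros(2), np.zeros(2))
))
kl_exact = kl_to_standard_normal(mu, log_var)

print(kl_exact, kl_mc)
```

Because this term is available analytically, only the reconstruction term $$\mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}\vert\mathbf{x})}\log p_\theta(\mathbf{x}\vert\mathbf{z})$$ typically needs to be estimated by sampling.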