# Variational AutoEncoders (VAE)

## Objectives
By the end of this lesson, students will be able to:
- Understand the limitations of standard autoencoders.
- Describe how a VAE works.

## Prerequisites
Before starting this lesson, students should be familiar with the following concepts:
- Probability distributions
- Bayes' theorem


## **Limitations of Standard Autoencoders**

While standard autoencoders effectively compress and reconstruct data, they have key limitations:

* They can **overfit** by perfectly reconstructing training data, even from a low-dimensional latent space, leading to meaningless outputs for unseen latent points. This means the autoencoder essentially memorizes the training data, failing to learn generalized features that would be useful for new, unseen data.


* They are **not generative models**—they lack a probabilistic structure, so they can’t generate new data samples from the latent space. Without a structured, meaningful latent space, simply sampling random points from it and decoding them will result in arbitrary or nonsensical outputs, not new, coherent data.


These issues are addressed by generative models such as the **Variational Autoencoder (VAE)** and the **Generative Adversarial Network (GAN)**. This lesson focuses on the VAE.



### Illustration: What’s Going Wrong?
<center>

<img src="https://i.postimg.cc/qv2MHGyD/Problem-with-standard-AE.png" height="300" width="900">
<figcaption align="center">Fig: Problem with Vanilla Autoencoders</figcaption>
</center>

* In the figure, the latent space is assumed to be mapped onto a 1D real axis without any loss of information.
* Suppose the autoencoder encodes training data onto this axis and decodes it back perfectly, achieving zero reconstruction loss.

In such a scenario, the encoder has too much flexibility. Despite using a low-dimensional latent space, it can still encode and decode the data without any information loss. This leads to a latent space that lacks meaningful structure.

As a result, the model is prone to overfitting. It memorizes the training data rather than learning generalizable patterns. Points in the latent space that were not covered during training may produce unrealistic or meaningless outputs when decoded.

This issue arises from the absence of regularization in standard autoencoders. Essentially, the standard autoencoder prioritizes perfect reconstruction of the training data over learning a smooth and continuous latent representation that could generalize to new data.
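
To make this failure mode concrete, here is a minimal sketch under stated assumptions rather than a trained network. Four hypothetical 2-D training points on the unit circle are assigned arbitrary codes on a 1-D latent axis, and the "decoder" is a cubic polynomial fitted exactly through them, standing in for an over-flexible, unregularized decoder. The names `train_x`, `codes`, `encode`, and `decode` are illustrative and do not come from the lesson.

```python
import numpy as np

# Hypothetical training set: four 2-D points on the unit circle.
angles = np.array([0.0, 0.5, 1.0, 1.5]) * np.pi
train_x = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# A memorizing "encoder": each training sample gets an arbitrary code on the
# 1-D latent axis. Nothing forces neighbouring codes to be related.
codes = np.array([3.0, -7.0, 0.5, 12.0])

def encode(x):
    """Return the code of the training sample closest to x (pure lookup)."""
    idx = np.argmin(np.linalg.norm(train_x - x, axis=1))
    return codes[idx]

# An over-flexible "decoder": one cubic per output coordinate, fitted so that
# it passes exactly through all four (code, sample) pairs.
coefs = [np.polyfit(codes, train_x[:, k], deg=len(codes) - 1) for k in range(2)]

def decode(h):
    """Map a scalar latent code h back to a 2-D point."""
    return np.array([np.polyval(c, h) for c in coefs])

# Reconstruction of the training data is essentially perfect ...
recon = np.array([decode(encode(x)) for x in train_x])
train_error = np.max(np.abs(recon - train_x))

# ... but a latent point never seen in training decodes far off the circle.
unseen = decode(8.0)
unseen_radius = np.linalg.norm(unseen)

print(f"max training reconstruction error: {train_error:.2e}")
print(f"radius of decoded unseen code (training data has radius 1): {unseen_radius:.2f}")
```

The training error is at the level of floating-point noise, yet the code `8.0`, which lies between two training codes, decodes to a point well outside the unit circle on which all the training data lives.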

## **Conceptual Insight**

Autoencoders tend to capture **all available information**, not just the **relevant features**. As a result, important and unimportant details are treated equally in the latent space. This leads to compressed representations that may not generalize well, causing **overfitting**, especially on unseen data.

This is a critical point for representation learning: we want the model to extract the most salient features that define the data, not just compress everything indiscriminately.


Regularization helps address this by encouraging the model to focus on **meaningful patterns**.

> Can we generate new data if the latent space is well-regularized?

In theory, yes, but the training objective of a standard autoencoder does nothing to enforce such regularity, so in practice its latent space is rarely structured well enough for reliable data generation. Variational Autoencoders address this by building the regularization into training.



## **Introduction to VAE**

A VAE puts a probabilistic spin on the autoencoder so that new data can be generated by sampling. It can be defined as an autoencoder whose training procedure is regularised to avoid overfitting and to ensure that the latent space has properties that enable the generative process.

As in a vanilla autoencoder, the encoder and decoder are typically fully connected or convolutional networks. The diagram below compares the encode-decode process of an autoencoder with that of a VAE.


<center>

<p><img src="https://i.postimg.cc/90MNZHL0/image.png"></p>
<figcaption align="center">Fig: Comparison of encode-decode process between autoencoder and VAE</figcaption>

</center>

Instead of encoding an input as a single point in the latent space, as a standard autoencoder does, a VAE encodes it as a distribution over the latent space, which regularises the code at the bottleneck. Points are then drawn as random samples from this distribution (each sampled point decodes to a different output) and passed through the decoder to produce the reconstructed output.

This is a fundamental shift: a standard autoencoder maps an input to a fixed point in latent space, whereas a VAE maps it to a distribution (e.g., a Gaussian) defined by a mean and variance. This "fuzziness" is key to its generative capabilities and regularization.
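
The difference between a point code and a distributional code can be sketched in a few lines. This is an illustration under assumptions, not the lesson's implementation: the encoder output is hard-coded as a hypothetical mean `mu` and log-variance `log_var` of a 2-D diagonal Gaussian, and samples are drawn as `mu + sigma * eps` with `eps` standard normal, a common way to sample from such a Gaussian that the text above does not prescribe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder output for ONE input: parameters of a diagonal Gaussian.
# The log-variance parameterization is an assumption made for this sketch.
mu = np.array([0.5, -1.0])
log_var = np.array([-2.0, 0.0])

def sample_latent(mu, log_var, rng, n=1):
    """Draw n latent points from N(mu, diag(exp(log_var)))."""
    sigma = np.exp(0.5 * log_var)                    # standard deviation per dimension
    eps = rng.standard_normal((n, mu.shape[0]))      # noise from N(0, I)
    return mu + sigma * eps

# A standard autoencoder would hand the decoder a single fixed point ...
point_code = mu

# ... whereas a VAE hands it random draws from the encoded distribution.
samples = sample_latent(mu, log_var, rng, n=5000)

print("fixed code:        ", point_code)
print("sample mean:       ", samples.mean(axis=0))
print("sample std:        ", samples.std(axis=0))
print("encoder std (sigma):", np.exp(0.5 * log_var))
```

The empirical mean and standard deviation of the samples should land close to `mu` and `sigma`, which is the "fuzziness" around the point code described above.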



<center>

<p><img src="https://i.postimg.cc/Kv930SJF/Visualization-AE-VAE-Encode-Decode-Process.png" width=60% ></p>
<figcaption align="center">Fig: Visualization of the encoding difference between a standard autoencoder and a variational autoencoder</figcaption>
</center>




### **Advantages of VAEs over Traditional Autoencoders**

1. **Generative Capabilities**

   * VAEs can **generate new data samples** by sampling from the learned latent distribution, enabling tasks like image synthesis or data augmentation.
   * Traditional autoencoders cannot generate meaningful new samples — they can only reconstruct inputs.

2. **Structured & Continuous Latent Space**

   * VAEs learn a **smooth and continuous latent space**, making it possible to interpolate between points (e.g., morphing one digit to another).
   * This structure makes the latent space more **meaningful and navigable**.
   * This "**smoothness**" means that small changes in the latent space correspond to small, meaningful changes in the generated output, allowing for coherent interpolations.

3. **Probabilistic Interpretation**

   * VAEs model uncertainty by learning distributions (mean and variance) rather than fixed encodings.
   * This allows **Bayesian reasoning**, confidence estimation, and principled handling of variability in data.

4. **Improved Latent Sampling**

   * Because the latent space follows a known distribution, **sampling is well-defined**, unlike in standard autoencoders where the latent space is arbitrary and disorganized.

   *  Specifically, VAEs often regularize the latent space to approximate a standard normal distribution, making it easy and meaningful to sample new points for generation.
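
The mechanics of generation and interpolation listed above can be sketched as follows. The decoder here is a hypothetical stand-in (a fixed random linear map followed by `tanh`), not a trained network, so its outputs are not meaningful data. The sketch only shows where prior sampling and latent interpolation enter; `latent_dim`, `W`, and `decode` are names assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim = 2, 4

# Hypothetical decoder: an untrained, fixed map from latent space to data space.
W = rng.standard_normal((latent_dim, data_dim))

def decode(h):
    """Stand-in decoder; a real VAE would use its trained decoder network here."""
    return np.tanh(h @ W)

# Generation: draw latent points from the standard normal prior and decode them.
h_new = rng.standard_normal((3, latent_dim))
x_new = decode(h_new)

# Interpolation: walk on a straight line between two latent codes and decode
# each intermediate point.
h_a = np.array([-1.0, 0.5])
h_b = np.array([1.0, -0.5])
t = np.linspace(0.0, 1.0, 5)[:, None]     # interpolation weights, shape (5, 1)
path = (1.0 - t) * h_a + t * h_b          # shape (5, latent_dim)
frames = decode(path)

# Size of the change in the output between consecutive interpolation steps.
step_sizes = np.linalg.norm(np.diff(frames, axis=0), axis=1)
print("generated samples:\n", x_new)
print("output change per interpolation step:", step_sizes)
```

With a continuous decoder, small steps in the latent space give small steps in the output. A trained VAE adds the property that the intermediate outputs are also plausible data.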




## Probability Model Perspective

<center>
<figure>
<img src="https://i.postimg.cc/LsDz8mBf/vae-prob-model-perspective.png">
<figcaption align="center">Fig: VAE network's structure</figcaption>
</figure>
</center>

The foundation here is that we want to **learn how to represent and generate data** using a latent variable representation, denoted by $\mathbf{h}$, from which the input data $\mathbf{x}$ can be reconstructed as $\mathbf{x}'$.

Think of this like a **compression and generation factory**: we feed in raw material (data samples $\mathbf{x}$) and the encoder tries to distill it into meaningful “codes” $\mathbf{h}$ — compact summaries of the input that still allow reconstruction.

Formally, this requires finding how to **map** or **encode** a dataset

$$
\mathbf{x} = \{{x^{(i)}}\}_{i=1}^N
$$

(where $N$ is the number of i.i.d. samples) into a **latent distribution** using the encoder (aka inference network). The encoder’s task is to approximate the **posterior probability**: the probability of a latent representation $\mathbf{h}$ **given** the observed data $\mathbf{x}$. The encoder’s approximation, written $q_\phi(\mathbf{h}|\mathbf{x})$, is parameterized by the encoder’s weights and biases $\phi$. The true posterior it targets is given by Bayes' theorem:

$$
P(\mathbf{h}|\mathbf{x}) = \frac{P(\mathbf{x}|\mathbf{h}) \cdot P(\mathbf{h})}{P(\mathbf{x})} \tag{1}
$$
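
When the latent variable takes only a handful of values, Bayes' theorem can be evaluated directly. The following toy calculation uses made-up numbers for a latent variable with three possible values and a single observed $\mathbf{x}$. The arrays `prior` and `likelihood` are assumptions chosen for illustration.

```python
import numpy as np

# Made-up numbers for a latent variable h with three possible values.
prior = np.array([0.5, 0.3, 0.2])          # P(h)
likelihood = np.array([0.10, 0.40, 0.70])  # P(x | h) for one observed x

joint = likelihood * prior                 # numerator: P(x | h) * P(h)
evidence = joint.sum()                     # denominator: P(x), summed over all h
posterior = joint / evidence               # P(h | x)

print("P(x)     =", evidence)
print("P(h | x) =", posterior)
```

Here the denominator is a sum over three terms. For a continuous, high-dimensional $\mathbf{h}$ the same quantity becomes an integral over the whole latent space.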

> ### Analogy: The Hidden Artist Behind the Scenes
>
> Imagine you're watching a performance (your observed data $\mathbf{x}$), and you're trying to **guess the script** that was used to generate it — that script is the latent variable $\mathbf{h}$. You know the general style of the writer (the prior $P(\mathbf{h})$), and you know how scripts turn into performances (the likelihood $P(\mathbf{x}|\mathbf{h})$). But the full probability of the performance, $P(\mathbf{x})$, is like trying to imagine **every possible script** that could've created what you saw.

The challenge comes from the denominator, the **marginal likelihood** (also called **evidence**):

$$
P(\mathbf{x}) = \underbrace{\int P(\mathbf{x}|\mathbf{h}) \cdot P(\mathbf{h}) \, d\mathbf{h}}_{\text{intractable}} \tag{2}
$$

If the latent space $\mathbf{h}$ is **high-dimensional** (i.e., has many features or neurons), then computing this marginal likelihood involves a **complex, high-dimensional integration** — a mathematically heavy task that quickly becomes **intractable** (computationally impossible in practice).
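
A small numerical experiment can illustrate both sides of this argument. It assumes a toy model that is not part of the lesson: a 1-D latent $h \sim \mathcal{N}(0, 1)$ with likelihood $x \mid h \sim \mathcal{N}(h, 1)$, for which the evidence is known in closed form, $x \sim \mathcal{N}(0, 2)$. In one dimension, summing the integrand over a grid recovers it easily. The last lines count how many grid points the same approach would need as the latent dimension grows, with `points_per_axis` an arbitrary choice.

```python
import numpy as np

def normal_pdf(v, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return np.exp(-0.5 * (v - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Assumed toy model: h ~ N(0, 1), x | h ~ N(h, 1)  =>  x ~ N(0, 2).
x = 1.3

# Brute-force integration over a 1-D grid of latent values (Riemann sum).
h_grid = np.linspace(-10.0, 10.0, 2001)
dh = h_grid[1] - h_grid[0]
integrand = normal_pdf(x, h_grid, 1.0) * normal_pdf(h_grid, 0.0, 1.0)
evidence_grid = np.sum(integrand) * dh

# Closed-form evidence, available only because the toy model is so simple.
evidence_exact = normal_pdf(x, 0.0, 2.0)

print("grid estimate of P(x):", evidence_grid)
print("closed-form P(x):     ", evidence_exact)

# The same grid in d latent dimensions needs points_per_axis ** d evaluations.
points_per_axis = 100
grid_sizes = {d: points_per_axis ** d for d in (1, 2, 10, 100)}
for d, size in grid_sizes.items():
    print(f"latent dim {d:>3}: about 10^{len(str(size)) - 1} integrand evaluations")
```

The grid estimate matches the closed form in one dimension, but the evaluation count grows exponentially with the latent dimension, which is why direct integration is abandoned in favour of approximation.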

> **Analogy:** It’s like trying to guess the total number of combinations of secret recipes that could possibly result in a dish — when each recipe has 100+ ingredients, it’s practically impossible to taste your way to the exact number.

### How Do We Solve This?

There are two main strategies to approximate the **posterior probability**:

* **1. Monte Carlo Integration** — which relies on **random sampling** of possible $\mathbf{h}$ values (like taste-testing random secret recipes).
* **2. Variational Inference** — which instead **learns a simpler, approximate distribution** $q_\phi(\mathbf{h}|\mathbf{x})$ to mimic the true posterior (like building a model that can generalize the taste with fewer samples).

We won’t discuss Monte Carlo methods here. Instead, the next notebook focuses on **Variational Inference**, which is the **core technique behind VAEs** ([Kingma & Welling, 2013](https://arxiv.org/pdf/1312.6114.pdf)).

### Key Takeaways

1. A VAE is a deep learning model that assigns a probability distribution to its latent representation.
2. It generates new data by learning a distribution over latent representations of the input data, sampling from that distribution, and decoding the samples.



### Additional Resources

* Papers
   * Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. https://arxiv.org/pdf/1312.6114.pdf
       * Appendix B analytically solves the KL divergence term.
