# **Introduction to KL Divergence and Reconstruction Loss**

## Prerequisites

Before starting this lesson, students should be familiar with:
- Probability
- Bayes' Theorem







### **Information**

Suppose we want to measure how much information a sentence carries. In real life:

- If a sentence is **unique or rare**, it carries **more information**.  
- If it is **common or obvious**, it carries **less information**.

For example:  
* "**The Sun rises in the East**" — very common → **low information**  
* "**The President will get killed tomorrow**" — rare and shocking → **high information**

**Information** quantifies how *unexpected* an event is. Less likely events carry more information.

Mathematically,

$$
I(x) = -\log P(x)
$$

> As $P(x) \to 0$, $I(x)$ grows without bound. If $P(x) = 1$, then $I(x) = 0$, meaning a certain event carries no new information.
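
The formula above can be tried out numerically. The following is a minimal sketch, not a reference implementation: the function name `self_information` is made up for illustration, and the choice of base 2 (so the result is in bits) is an assumption, since the formula does not fix a base.

```python
import math

def self_information(p: float, base: float = 2.0) -> float:
    """Return I(x) = -log P(x) = log(1 / P(x)) for an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in the interval (0, 1]")
    # Base 2 is an assumption; it expresses the result in bits.
    return math.log(1.0 / p, base)

certain = self_information(1.0)     # a certain event carries no information
coin_flip = self_information(0.5)   # one outcome of a fair coin
rare = self_information(0.001)      # a rare event carries much more

print(certain, coin_flip, rare)
```

The rarer the event, the larger the returned value, which matches the intuition from the two example sentences.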



### **Entropy (Shannon Entropy)**

Entropy can be interpreted as the **average level of information** (or surprise, or uncertainty) across all possible outcomes.

* Think of it as the **expected information (surprise)** from a distribution.
* Higher randomness = higher entropy.

$$
H(P) = -\sum_x P(x) \log P(x)
$$

$$
\text{or}
$$

$$
H(P) = -\mathbb{E}_{P}[ \log P(x) ]
$$
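
To make "higher randomness = higher entropy" concrete, here is a small sketch for a discrete distribution given as a list of probabilities. The helper name `entropy`, the base-2 logarithm, and the convention that outcomes with zero probability contribute nothing are assumptions made for this illustration.

```python
import math

def entropy(p, base=2.0):
    """H(P) = -sum_x P(x) log P(x), written as sum_x P(x) log(1 / P(x))."""
    # Outcomes with P(x) = 0 are skipped (0 * log 0 is treated as 0 by convention).
    return sum(px * math.log(1.0 / px, base) for px in p if px > 0.0)

fair_coin = [0.5, 0.5]      # maximally uncertain for two outcomes
biased_coin = [0.9, 0.1]    # more predictable, so lower entropy
sure_thing = [1.0, 0.0]     # no uncertainty at all

print(entropy(fair_coin), entropy(biased_coin), entropy(sure_thing))
```

The fair coin gives the largest value of the three, and the deterministic distribution gives zero.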




### **Cross-Entropy**

Cross-entropy compares two distributions:

* True distribution $P$
* Estimated model distribution $q$

$$
H(P, q) = -\sum_x P(x) \log q(x)
$$

> Commonly used as a **loss function** in machine learning to measure how close predictions are to ground truth.
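
As a rough illustration of cross-entropy used as a loss, the sketch below compares a one-hot "ground truth" distribution with two hypothetical model predictions. The function name `cross_entropy`, the base-2 logarithm, and the example probabilities are all assumptions chosen for the example.

```python
import math

def cross_entropy(p, q, base=2.0):
    """H(P, q) = -sum_x P(x) log q(x) for two discrete distributions."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue            # outcomes the true distribution never produces add nothing
        if qx == 0.0:
            return math.inf     # the model assigns zero probability to a possible outcome
        total -= px * math.log(qx, base)
    return total

truth = [0.0, 1.0, 0.0]             # one-hot label: the second class is correct
good_model = [0.1, 0.7, 0.2]        # hypothetical prediction, mostly right
poor_model = [0.6, 0.1, 0.3]        # hypothetical prediction, mostly wrong

print(cross_entropy(truth, good_model))
print(cross_entropy(truth, poor_model))
```

The prediction that puts more probability on the correct class gets the lower cross-entropy, which is why minimizing it pushes predictions toward the ground truth.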


## **Kullback-Leibler Divergence (KL Divergence)**

Also known as **relative entropy**, KL divergence measures how dissimilar two distributions are. For two distributions $P$ and $q$, the KL divergence of $P$ with respect to $q$ is:

$$
\mathbb{KL}(P||q) = H(P, q)\ (\text{cross-entropy})\ -\ H(P)\ (\text{entropy})
$$

$$
\mathbb{KL}(P||q) = -\sum_x P(x) \log q(x) + \sum_x P(x) \log P(x)
$$

$$
\mathbb{KL}(P||q) = -\sum_x P(x) \log \frac{q(x)}{P(x)} = \sum_x P(x) \log \frac{P(x)}{q(x)}
$$

Things to note about KL divergence:

* KL divergence is always greater than or equal to zero: $\mathbb{KL}(P||q) \geq 0$. It equals zero if and only if the two distributions are identical, because then $q(x)/P(x) = 1$ for every $x$ and $\log(1) = 0$.
* KL divergence is not symmetric: in general, $\mathbb{KL}(P||q) \neq \mathbb{KL}(q||P)$, so it is not a true distance metric.

> Note: The summation terms are converted to integrals for continuous data.
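
The two properties listed above can be checked numerically for small discrete distributions. This is a sketch under stated assumptions: `kl_divergence` is a hypothetical helper, it uses the natural logarithm, and the two example distributions are arbitrary.

```python
import math

def kl_divergence(p, q):
    """KL(P || q) = sum_x P(x) log(P(x) / q(x)), using the natural logarithm."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue            # terms with P(x) = 0 contribute nothing
        if qx == 0.0:
            return math.inf     # P puts mass where q has none
        total += px * math.log(px / qx)
    return total

p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, p))  # identical distributions
print(kl_divergence(p, q))  # KL(P || q)
print(kl_divergence(q, p))  # KL(q || P), generally a different number
```

Swapping the arguments changes the result, which is the asymmetry noted in the list.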

## **Reconstruction Loss**

Reconstruction loss measures how well a model can recreate the original input from its internal representation. It is commonly used in models like **autoencoders** and **variational autoencoders (VAEs)**.

For input $x$ and its reconstruction $\hat{x}$, common reconstruction loss functions include:

* **Mean Squared Error (MSE):**

$$
\mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2
$$

* **Binary Cross-Entropy (BCE):**

$$
\mathcal{L}_{\text{BCE}} = -\sum_i \left[x_i \log \hat{x}_i + (1 - x_i) \log(1 - \hat{x}_i)\right]
$$
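
Both losses can be sketched in a few lines of plain Python. The function names, the toy input, and the small `eps` clamp that keeps the logarithm finite are assumptions made for this illustration; deep learning libraries ship their own versions of these losses.

```python
import math

def mse_loss(x, x_hat):
    """Mean squared error: (1/n) * sum_i (x_i - x_hat_i)^2."""
    return sum((xi - xhi) ** 2 for xi, xhi in zip(x, x_hat)) / len(x)

def bce_loss(x, x_hat, eps=1e-12):
    """Binary cross-entropy, summed over elements as in the formula above."""
    total = 0.0
    for xi, xhi in zip(x, x_hat):
        # Clamp to (0, 1) so log never receives 0 (an assumption, not part of the formula).
        xhi = min(max(xhi, eps), 1.0 - eps)
        total -= xi * math.log(xhi) + (1.0 - xi) * math.log(1.0 - xhi)
    return total

x = [1.0, 0.0, 1.0]             # toy input with values in [0, 1]
close = [0.9, 0.1, 0.8]         # hypothetical reconstruction near the input
far = [0.2, 0.7, 0.4]           # hypothetical reconstruction far from the input

print(mse_loss(x, close), mse_loss(x, far))
print(bce_loss(x, close), bce_loss(x, far))
```

Under either loss, the reconstruction that is closer to the input scores lower.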

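In **VAEs**, reconstruction loss is the expected negative log-likelihood of the observed data under the decoder’s output distribution, where the expectation is taken over latent codes $z$ sampled from the encoder distribution $q(z|x)$: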

$$
\mathcal{L}_{\text{recon}} = -\mathbb{E}_{q(z|x)} \left[\log p(x|z)\right]
$$

It encourages the model to preserve important information about the input during encoding and accurately regenerate it during decoding.
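
The expectation in the VAE reconstruction loss is usually not computed exactly; one common approach is to estimate it by sampling $z$ from $q(z|x)$. The sketch below is a toy illustration only: `toy_decoder` is a made-up stand-in for a trained decoder network, the encoder is assumed to output a scalar Gaussian with mean `mu` and standard deviation `sigma`, and the decoder is assumed to output Bernoulli means.

```python
import math
import random

def bernoulli_nll(x, x_hat, eps=1e-12):
    """-log p(x|z) when the decoder outputs Bernoulli means x_hat."""
    total = 0.0
    for xi, xhi in zip(x, x_hat):
        xhi = min(max(xhi, eps), 1.0 - eps)  # keep log finite (assumption)
        total -= xi * math.log(xhi) + (1.0 - xi) * math.log(1.0 - xhi)
    return total

def toy_decoder(z):
    """Hypothetical decoder: maps a scalar latent z to two Bernoulli means."""
    s = 1.0 / (1.0 + math.exp(-z))
    return [s, 1.0 - s]

def reconstruction_loss(x, mu, sigma, num_samples=1000, seed=0):
    """Monte Carlo estimate of -E_{q(z|x)}[log p(x|z)] with q(z|x) = N(mu, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(mu, sigma)             # sample z ~ q(z|x)
        total += bernoulli_nll(x, toy_decoder(z))
    return total / num_samples

x = [1.0, 0.0]
good_encoding = reconstruction_loss(x, mu=3.0, sigma=0.5)    # decodes close to x
bad_encoding = reconstruction_loss(x, mu=-3.0, sigma=0.5)    # decodes far from x
print(good_encoding, bad_encoding)
```

A latent code that decodes back to something near the input yields a much lower loss, which is the pressure that keeps useful information in the encoding.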
