## Lecture Note on KL Divergence

yndk@sogang.ac.kr

Mainly from
**Geometric Modeling in Probability and Statistics, Calin and Udrişte, 2014**

# Kullback-Leibler Divergence
The Kullback–Leibler divergence was introduced by Solomon Kullback and Richard Leibler in 1951 as the directed divergence between two distributions; Kullback preferred the term discrimination information. The divergence is discussed in Kullback's 1959 book, *Information Theory and Statistics* [wikipedia.org].

- Also known as the Kullback-Leibler Relative Entropy

- The KL relative entropy is a non-commutative (non-symmetric) measure of the difference between two probability densities $p$ and $q$ on the same statistical manifold: in general $D_{KL}(p||q) \neq D_{KL}(q||p)$.
    - Definition:

    $$
        D_{KL}(p||q) = \mathbb{E}_p\big[ \log \frac{p}{q} \big] = \sum_{i \in X} p_i \log\frac{p_i}{q_i}
    $$

- In information theory the density $p$ is considered to be the true density (normally unknown), while $q$ is the theoretical model density.
- The KL relative entropy may be regarded as a measure of inefficiency of assuming data distributed according to $q$, when actually it is distributed as $p$.
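
The definition above can be evaluated directly for small discrete densities. The following is a minimal sketch, not a reference implementation; the function name `kl_divergence` and the example densities `p` and `q` are chosen here for illustration only. It assumes $q_i > 0$ wherever $p_i > 0$ and treats terms with $p_i = 0$ as contributing zero.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete densities given as equal-length sequences.

    Assumes q_i > 0 wherever p_i > 0; terms with p_i == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative densities on three outcomes (hypothetical values).
p = [0.5, 0.4, 0.1]
q = [1 / 3, 1 / 3, 1 / 3]

print("D_KL(p||q) =", kl_divergence(p, q))
print("D_KL(q||p) =", kl_divergence(q, p))  # differs from D_KL(p||q) here
print("D_KL(p||p) =", kl_divergence(p, p))  # zero for identical densities
```

Swapping the arguments changes the value, which is why the order of $p$ and $q$ matters in everything that follows.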

## Entropy
- The entropy of a discrete density $p=\{p_1, ..., p_n\}$ is defined to be:

$$
    H(p) = \sum_i p_i \log\frac{1}{p_i} = -\sum_i p_i\log p_i
$$

## Cross Entropy
- The cross entropy $H(p,q)$ of two densities $p$ and $q$ is defined to be

$$
    H(p, q) = \sum_i p_i \log\frac{1}{q_i} = -\sum_i p_i\log q_i
$$
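
As a quick illustration of the two definitions, the sketch below computes $H(p)$ and $H(p,q)$ with natural logarithms. The helper names `entropy` and `cross_entropy` and the densities `uniform` and `skewed` are assumptions made for this example.

```python
import math

def entropy(p):
    """H(p) = -sum_i p_i log p_i, skipping zero-probability outcomes."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i, assuming q_i > 0 wherever p_i > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative densities on four outcomes (hypothetical values).
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

print("H(uniform)         =", entropy(uniform))  # equals log(4)
print("H(skewed)          =", entropy(skewed))   # smaller than log(4)
print("H(skewed, uniform) =", cross_entropy(skewed, uniform))
```

The uniform density has the largest entropy among densities on four outcomes, so the more concentrated `skewed` density yields a smaller value.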

## Lemma: $D_{KL}(p||q) \geq 0$

Let $p=\{p_1, ..., p_n\}$ and $q=\{ q_1, ..., q_n \}$ be two strictly positive discrete densities on the same event space. Then

$$    
    \begin{align}
        H(p,p)  \leq H(p,q) &\quad\Leftrightarrow\quad
        -\sum_i p_i\log p_i  \leq -\sum_i p_i\log q_i
    \end{align}
$$

Or

$$    
    \begin{align}
        \sum_i p_i\log p_i  \geq \sum_i p_i\log q_i
    \end{align}
$$

**Proof** Using the inequality $\log x \leq x - 1$ (prove?) for $x>0$, applied with $x = q_i/p_i$, we find

$$
    \begin{align}
        \sum_i p_i \log q_i - \sum_i p_i \log p_i 
            = \sum_i p_i \log\frac{q_i}{p_i}
            & \leq \sum_i p_i \left( \frac{q_i}{p_i} - 1 \right) \\
            & = \sum_i q_i - \sum_i p_i = 0
    \end{align}
$$

Equality holds if and only if $q_i/p_i = 1$ for all $i$, i.e., in the case of equal densities.
Therefore

$$
    D_{KL}(p||q) = -\sum_i p_i \log q_i + \sum_i p_i \log p_i \geq 0
$$
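
The inequality $\log x \leq x - 1$ used in the proof follows from elementary calculus. One possible sketch: define $f(x) = x - 1 - \log x$ for $x > 0$ (the symbol $f$ is introduced here only for this argument). Then

$$
    \begin{align}
        f'(x) & = 1 - \frac{1}{x}, \qquad f''(x) = \frac{1}{x^2} > 0
    \end{align}
$$

so $f$ is strictly convex with its unique minimum at $x = 1$, where $f(1) = 0$. Hence $f(x) \geq 0$, that is, $\log x \leq x - 1$, with equality only at $x = 1$.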

## Proposition

The relative entropy $D_{KL}(p||q)$, the entropy $H(p)$ and the cross entropy $H(p,q)$ are related by

$$
    D_{KL}(p||q) = H(p,q) - H(p)
$$
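
The relation can be checked numerically on any pair of strictly positive densities. The sketch below uses two illustrative densities (hypothetical values) and computes the three quantities independently from their definitions.

```python
import math

# Illustrative strictly positive densities (hypothetical values).
p = [0.5, 0.4, 0.1]
q = [0.2, 0.3, 0.5]

h_p = -sum(pi * math.log(pi) for pi in p)                    # entropy H(p)
h_pq = -sum(pi * math.log(qi) for pi, qi in zip(p, q))       # cross entropy H(p, q)
d_kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))   # D_KL(p || q)

print("D_KL(p||q)    =", d_kl)
print("H(p,q) - H(p) =", h_pq - h_p)  # agrees up to floating-point rounding
```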

## Corollary
The entropy $H(p)$ and the cross entropy $H(p,q)$ satisfy the inequality

$$
    H(p) \leq H(p,q)
$$

with equality if and only if $p = q$.

* This shows that, for a fixed $p$, the cross entropy $H(p,q)$ is minimized when the two distributions are identical.

* This can also be stated as:

$$
    \min_q H(p,q) = H(p)
$$

It is worth noting that the KL relative entropy can also be written as a difference of two expected log-likelihood functions, where $l_p = \log p$ and $l_q = \log q$:

$$
    D_{KL}(p||q) = \mathbb{E}_p[l_p] - \mathbb{E}_p[l_q]
$$
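
Because this form is an expectation under $p$, it suggests a simple Monte Carlo estimate: draw samples from $p$ and average $l_p - l_q$. The sketch below assumes $l_p = \log p$ and $l_q = \log q$, and uses illustrative densities, a fixed seed, and a sample size chosen only for this example.

```python
import math
import random

# Illustrative strictly positive densities (hypothetical values).
p = [0.5, 0.4, 0.1]
q = [0.2, 0.3, 0.5]

# Exact value from the definition, for comparison.
exact = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Monte Carlo estimate of E_p[l_p] - E_p[l_q] using outcomes drawn from p.
random.seed(0)
n_samples = 100_000
draws = random.choices(range(len(p)), weights=p, k=n_samples)
estimate = sum(math.log(p[i]) - math.log(q[i]) for i in draws) / n_samples

print("exact    =", exact)
print("estimate =", estimate)  # close to the exact value for large n_samples
```

The same idea carries over to continuous densities, where the sum in the definition becomes an integral that may not have a closed form.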

## KL Divergence of Two Gaussian Distributions
Let us consider two Gaussian distributions $p$ and $q$:

$$
p(x) = \frac{1}{\sqrt{2\pi}\sigma_1} \exp \bigg[ -\frac{(x-\mu_1)^2}{2\sigma_1^2} \bigg]
\quad\mbox{and}\quad
q(x) = \frac{1}{\sqrt{2\pi}\sigma_2} \exp \bigg[ -\frac{(x-\mu_2)^2}{2\sigma_2^2} \bigg]
$$

The KL divergence of $p$ and $q$ is
$$
    D_{KL}(p||q) = \frac{1}{2}
                    \bigg[
                        \big( \frac{\sigma_1}{\sigma_2} \big)^2
                        - 
                        \ln\frac{\sigma_1^2}{\sigma_2^2}
                        -
                        1
                    \bigg]
                    +
                    \frac{(\mu_1 - \mu_2)^2}{2\sigma_2^2}
$$

* Plot $D_{KL}(p||q)$ when $p(x) = \mathcal{N}(x|0,1)$. It is a function of $\sigma_2$ and $\mu_2$. Do the same when $q(x) = \mathcal{N}(x|0,1)$.
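
As a starting point for the exercise, the closed form above can be evaluated on a grid. This is a sketch under stated assumptions: the function name `kl_gauss` and the grid values `mus` and `sigmas` are chosen here for illustration, and the resulting tables can be passed to any plotting library.

```python
import math

def kl_gauss(mu1, sigma1, mu2, sigma2):
    """Closed-form D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2))."""
    ratio_sq = (sigma1 / sigma2) ** 2
    return 0.5 * (ratio_sq - math.log(ratio_sq) - 1.0) \
        + (mu1 - mu2) ** 2 / (2.0 * sigma2 ** 2)

# Illustrative grid of means and standard deviations (hypothetical values).
mus = [-2.0, -1.0, 0.0, 1.0, 2.0]
sigmas = [0.5, 1.0, 2.0]

# Case 1: p = N(0, 1) fixed; rows index sigma_2, columns index mu_2.
forward = [[kl_gauss(0.0, 1.0, mu2, s2) for mu2 in mus] for s2 in sigmas]

# Case 2: q = N(0, 1) fixed; rows index sigma_1, columns index mu_1.
reverse = [[kl_gauss(mu1, s1, 0.0, 1.0) for mu1 in mus] for s1 in sigmas]

for name, table in (("p = N(0,1)", forward), ("q = N(0,1)", reverse)):
    print(name)
    for s, row in zip(sigmas, table):
        print(f"  sigma = {s}: " + "  ".join(f"{v:.3f}" for v in row))
```

Both tables vanish at $\mu = 0$, $\sigma = 1$, but they differ elsewhere, which reflects the asymmetry of the divergence.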

## KL Divergence of Two Multivariate Gaussians

For $p = \mathcal{N}(\mu_1, \Sigma_1)$ and $q = \mathcal{N}(\mu_2, \Sigma_2)$ in $k$ dimensions:

$$
    D_{KL}(p||q) = \frac{1}{2}\bigg[
                        \ln\frac{|\Sigma_2|}{|\Sigma_1|}
                        - k
                        + \mathrm{trace}\big(\Sigma_2^{-1}\Sigma_1\big)
                        +
                        \big(\mu_2 - \mu_1\big)^\top\Sigma_2^{-1}\big(\mu_2 - \mu_1\big)
                    \bigg]
$$

- https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Multivariate_normal_distributions
- https://stats.stackexchange.com/questions/60680/kl-divergence-between-two-multivariate-gaussians
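
The multivariate formula translates into a few lines of NumPy. The following is a minimal sketch, assuming both covariance matrices are symmetric positive definite; the function name `kl_mvn` and the example parameters are hypothetical. It uses `np.linalg.slogdet` for the log-determinants and `np.linalg.solve` in place of an explicit inverse.

```python
import numpy as np

def kl_mvn(mu1, cov1, mu2, cov2):
    """D_KL(N(mu1, cov1) || N(mu2, cov2)) for positive definite covariances."""
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    cov1 = np.asarray(cov1, dtype=float)
    cov2 = np.asarray(cov2, dtype=float)
    k = mu1.shape[0]
    diff = mu2 - mu1
    # slogdet returns (sign, log|det|); the sign is +1 for positive definite input.
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    trace_term = np.trace(np.linalg.solve(cov2, cov1))  # trace(cov2^{-1} cov1)
    quad_term = diff @ np.linalg.solve(cov2, diff)      # diff^T cov2^{-1} diff
    return 0.5 * (logdet2 - logdet1 - k + trace_term + quad_term)

# Illustrative two-dimensional example (hypothetical values).
mu_a = [0.0, 0.0]
cov_a = [[1.0, 0.3], [0.3, 2.0]]
mu_b = [1.0, -1.0]
cov_b = [[2.0, 0.0], [0.0, 1.0]]

print("D_KL(a||b) =", kl_mvn(mu_a, cov_a, mu_b, cov_b))
print("D_KL(a||a) =", kl_mvn(mu_a, cov_a, mu_a, cov_a))  # zero up to rounding
```

For $k = 1$ the function reduces to the univariate closed form given earlier.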

## KL Divergence when $q(x)=\mathcal{N}(x|0,I)$

A special case, and a common quantity in variational inference, is the KL divergence between a multivariate normal with diagonal covariance and a standard normal:

$$
    \begin{align}
        \mu & = \big(\mu_1, ..., \mu_k\big)  \\
        \sigma & = \big(\sigma_1, ..., \sigma_k\big)
    \end{align}
$$

$$
    D_{KL}\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma_1^2,...,\sigma_k^2)) \big|\big| \mathcal{N}(0,I)\big)
    =
    \frac{1}{2}\sum_{i=1}^k \bigg(\sigma_i^2 + \mu_i^2 - \ln\sigma_i^2 - 1 \bigg)
$$

**Prove**
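
One possible route, sketched here as a hint rather than a full proof, is to substitute $\mu_1 = \mu$, $\Sigma_1 = \mathrm{diag}(\sigma_1^2, ..., \sigma_k^2)$, $\mu_2 = 0$ and $\Sigma_2 = I$ into the multivariate formula. Each term then reduces to a sum over coordinates:

$$
    \begin{align}
        \ln\frac{|\Sigma_2|}{|\Sigma_1|} & = -\sum_{i=1}^k \ln\sigma_i^2 \\
        \mathrm{trace}\big(\Sigma_2^{-1}\Sigma_1\big) & = \sum_{i=1}^k \sigma_i^2 \\
        \big(\mu_2 - \mu_1\big)^\top\Sigma_2^{-1}\big(\mu_2 - \mu_1\big) & = \sum_{i=1}^k \mu_i^2
    \end{align}
$$

Writing $k = \sum_{i=1}^k 1$ and collecting the four terms under a single sum gives the stated result.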

## References

- Bishop
- MacKay
- Murphy
- Geometric Modeling in Probability and Statistics, Calin and Udrişte, 2014