## Kullback-Leibler Divergence

- Let's suppose we have two distributions $P(X)$ and $Q(X)$ over the same random variable $X$

- We wish to evaluate how far $Q(X)$ differs (diverges) from $P(X)$. Let's denote this quantity as $D_{KL}(P || Q)$

- Then KL divergence is defined as:
$$\begin{aligned}
    D_{KL}(P || Q) &= \sum_{x \in X} P(x) \log\left(\frac{P(x)}{Q(x)}\right)
\end{aligned}$$

- You can pretty much take this formula and run with it: it's easy to apply as long as $Q(x) > 0$ wherever $P(x) > 0$

- But building up this formula from its information theory foundations is extremely instructive. So we'll have a `Theory` section below to deal with exactly this

## Implementation

- In this section, we will compute the KL divergence between an unbiased die and a biased die

- The point is to see how, as the die gets increasingly biased, the divergence gets larger!

In [6]:
import numpy as np

In [17]:
unbiased_die = {x+1: 1/6 for x in range(6)}
biased_die = {x+1: ((x+1)/6)**2 / np.sum([((x+1)/6)**2 for x in range(6)]) for x in range(6)}
biased_die2 = {x+1: ((x+1)/6)**3 / np.sum([((x+1)/6)**3 for x in range(6)]) for x in range(6)}
biased_die3 = {x+1: ((x+1)/6)**4 / np.sum([((x+1)/6)**4 for x in range(6)]) for x in range(6)}

# unbiased_die, biased_die, biased_die2, biased_die3

In [25]:
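def kl_divergence(p, q):
    # Both distributions must be defined over the same outcomes
    assert p.keys() == q.keys()

    res = 0
    for k in p.keys():  # iterate over the arguments' own keys, not the global unbiased_die
        # natural log, so the result is in nats
        res += p[k] * np.log(p[k]/q[k])

    return res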

In [48]:
kl_divergence(unbiased_die, biased_die)
# kl_divergence(unbiased_die, biased_die2) ## divergence increases as the die moves further from the original
# kl_divergence(unbiased_die, biased_die3) ## divergence increases as the die moves further from the original

# ## However, the divergence measure is NOT symmetric!
# kl_divergence(biased_die, unbiased_die) == kl_divergence(unbiased_die, biased_die)

np.float64(0.5260162999520948)
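
The two claims left commented out in the cell above (divergence grows with bias, and the measure is not symmetric) can be checked with a minimal sketch like the one below. The helper `make_die` and the array-based `kl_divergence` are illustrative assumptions, not part of the notebook's own cells; `make_die(2)` is meant to reproduce `biased_die`, with face $k$ weighted by $k^2$.

```python
import numpy as np

def make_die(power):
    """Die whose face k has probability proportional to k**power (power=0 is fair)."""
    weights = np.arange(1, 7, dtype=float) ** power
    return weights / weights.sum()

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two probability vectors with no zero entries."""
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * np.log(p / q)))

fair = make_die(0)

# Divergence from the fair die for increasingly biased dice
divergences = [kl_divergence(fair, make_die(power)) for power in (2, 3, 4)]
print(divergences)  # each value is larger than the previous one

# Swapping the arguments gives a different number: KL is not symmetric
forward = kl_divergence(fair, make_die(2))
reverse = kl_divergence(make_die(2), fair)
print(forward, reverse)
```

Because the two directions disagree, it matters which distribution is treated as the "true" $P$ and which as the approximation $Q$.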

In [49]:
from scipy.special import kl_div
np.sum(kl_div([1/6]*6, list(biased_die.values())))

np.float64(0.5260162999520948)

## Theory

- KL Divergence has a straightforward implementation, and the theory builds up from **Shannon Entropy**. If you're not 100% familiar, go read the notebook on the topic. The stuff below assumes full knowledge of Shannon Entropy.

### KL Divergence from Shannon Entropy

- We've previously established how Shannon entropy is derived. KL divergence is nothing more than an extension of the entropy definition!

- Shannon Entropy Recap: 
    - For a given count of outcomes $N$, entropy measures the minimum average number of bits needed to encode an outcome
    - If all outcomes are equally likely, each outcome requires $\log_2(N)$ bits (i.e. for 8 outcomes, you need 3 bits, because $2^3 = 8$)
        - So the weighted average bits needed (or the Shannon Entropy) across all outcomes is simply $N \cdot \frac{1}{N} \log_2(N)$
            - $\frac{1}{N}$ is the probability of observing each of $N$ equally likely outcomes
            - $\frac{1}{N} \log_2(N)$ is the probability weighted number of bits needed for 1 outcome
            - $N \cdot \frac{1}{N} \log_2(N)$ is the weighted average bits needed
        - Note that $\log_2(N)$ can be written as $- \log_2(\frac{1}{N})$, and $P(X) = \frac{1}{N}$, which gives us $- \log_2(P(X))$
    - If outcomes are not equally likely, each outcome needs $-\log_2(P(X))$ bits, and we weight these by the outcome probabilities to get the weighted average bits required
        - $-\sum_X P(X) \log_2(P(X))$
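
The recap above can be sketched in a few lines. The function name `shannon_entropy` and the `skewed_8` distribution are made up for illustration; the uniform case is the 8-outcome example from the recap.

```python
import numpy as np

def shannon_entropy(p):
    """Entropy in bits of a discrete distribution given as a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # outcomes with zero probability contribute nothing
    return float(-np.sum(p * np.log2(p)))

# 8 equally likely outcomes: N * (1/N) * log2(N) = log2(8) bits
uniform_8 = np.full(8, 1 / 8)
print(shannon_entropy(uniform_8))  # 3.0

# An illustrative, unequal distribution over the same 8 outcomes
skewed_8 = np.array([0.5, 0.2, 0.1, 0.05, 0.05, 0.05, 0.03, 0.02])
print(shannon_entropy(skewed_8))  # fewer than 3 bits
```

The uniform distribution has the highest entropy for a given $N$, so any unequal weighting needs fewer bits on average.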

- KL Divergence from Entropy 
    - Suppose we have some set of discrete events that follows a probability distribution $P(X)$
    - Suppose we mistakenly attribute a different probability distribution $Q(X)$
    - Then the average number of bits we need when we encode using $Q(X)$, while outcomes actually follow $P(X)$, is
    $$\begin{aligned}
        -\sum_X P(X) \log_2(Q(X))
    \end{aligned}$$

    - This is also known as **cross entropy**

    - For a fixed $P(X)$, cross entropy is minimised when $Q(X) = P(X)$. This result is known as Gibbs' inequality. Intuitively, $-\log_2(Q(X))$ is the code length we assign to an outcome, and because the $Q(X)$ must sum to 1, shortening the code for one outcome lengthens it for others. The average length under $P(X)$ is smallest when the code lengths match the true probabilities

    - Therefore, any mistaken assumption of $Q(X) \neq P(X)$ will **create** uncertainty, which raises the cross entropy above the entropy of $P(X)$.
    $$\begin{aligned}
        -\sum_X P(X) \log_2(Q(X)) \ge -\sum_X P(X) \log_2(P(X))
    \end{aligned}$$

    - This created uncertainty can be treated as a measure of how different $P(X)$ and $Q(X)$ are from each other, which is given by
    $$\begin{aligned}
        -\sum_X P(X) \log_2(Q(X)) + \sum_X P(X) \log_2(P(X)) &= \sum_X P(X) \log_2(\frac{P(X)}{Q(X)}) \\
        &= \text{KL Divergence}
    \end{aligned}$$

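- This gives us the KL divergence exactly!! The base of the logarithm only changes the units: $\log_2$ gives bits, while the natural log used in the `kl_divergence` function above gives nats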
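
The claim that cross entropy is never smaller than entropy is Gibbs' inequality. One standard way to see it, sketched here under the assumption that $Q(X) > 0$ wherever $P(X) > 0$, uses the bound $\ln t \le t - 1$, which holds with equality only at $t = 1$:

$$\begin{aligned}
    D_{KL}(P || Q) &= -\sum_X P(X) \log_2\left(\frac{Q(X)}{P(X)}\right) \\
    &\ge -\frac{1}{\ln 2} \sum_X P(X) \left(\frac{Q(X)}{P(X)} - 1\right) \\
    &= \frac{1}{\ln 2} \left(\sum_X P(X) - \sum_X Q(X)\right) \\
    &\ge 0
\end{aligned}$$

The sums run over outcomes with $P(X) > 0$, where $\sum_X P(X) = 1$ and $\sum_X Q(X) \le 1$. Equality holds only when $Q(X) = P(X)$ for every outcome, so the divergence is zero exactly when the two distributions agree.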

In [59]:
import numpy as np

unbiased_die = {x+1: 1/6 for x in range(6)}
biased_die = {x+1: ((x+1)/6)**2 / np.sum([((x+1)/6)**2 for x in range(6)]) for x in range(6)}
biased_die2 = {x+1: ((x+1)/6)**3 / np.sum([((x+1)/6)**3 for x in range(6)]) for x in range(6)}
biased_die3 = {x+1: ((x+1)/6)**4 / np.sum([((x+1)/6)**4 for x in range(6)]) for x in range(6)}

biased2_unbiased_cross_entropy = 0
for key in unbiased_die:
    biased2_unbiased_cross_entropy += -biased_die2[key] * np.log2(unbiased_die[key])

biased2_biased_cross_entropy = 0
for key in unbiased_die:
    biased2_biased_cross_entropy += -biased_die2[key] * np.log2(biased_die[key])

biased2_biased2_cross_entropy = 0
for key in unbiased_die:
    biased2_biased2_cross_entropy += -biased_die2[key] * np.log2(biased_die2[key])

biased2_biased3_cross_entropy = 0
for key in unbiased_die:
    biased2_biased3_cross_entropy += -biased_die2[key] * np.log2(biased_die3[key])

### Cross entropy is minimised when P(X) = Q(X)
print(biased2_unbiased_cross_entropy, biased2_biased_cross_entropy, biased2_biased2_cross_entropy, biased2_biased3_cross_entropy)

2.5849625007211556 1.8484436218941096 1.7956083181006413 1.8329487933642483
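
To tie the printed cross entropies back to the derivation, the sketch below checks numerically that the gap between cross entropy and entropy equals the KL divergence in bits. The helpers `make_die`, `cross_entropy` and `kl_divergence_bits` are illustrative names rather than functions from the cells above; `make_die(3)` is meant to play the role of `biased_die2`.

```python
import numpy as np

def make_die(power):
    """Die whose face k has probability proportional to k**power (power=0 is fair)."""
    weights = np.arange(1, 7, dtype=float) ** power
    return weights / weights.sum()

def cross_entropy(p, q):
    """H(P, Q) in bits: average code length when P is true but Q is assumed."""
    return float(-np.sum(p * np.log2(q)))

def kl_divergence_bits(p, q):
    """D_KL(P || Q) in bits, computed directly from the definition."""
    return float(np.sum(p * np.log2(p / q)))

p = make_die(3)               # the "true" distribution
entropy = cross_entropy(p, p)  # H(P, P) is just the Shannon entropy of P

gaps = {}
for power in (0, 2, 3, 4):
    q = make_die(power)
    gaps[power] = cross_entropy(p, q) - entropy  # extra bits paid for assuming Q
    print(power, gaps[power], kl_divergence_bits(p, q))
```

The gap is zero only for `power == 3`, where $Q(X) = P(X)$, and matches the directly computed divergence in every row.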


## Resources

- https://www.cs.cmu.edu/~dst/Tutorials/Info-Theory/