# An information-theoretic introduction to KL Divergence

### Formulas

- Entropy : $H(p) = - \sum_{x \in X} p(x) \log(p(x))$


- Cross-Entropy : $C(p|q) = - \sum_{x \in X} p(x) \log(q(x))$


- Kullback-Leibler Divergence : $KL(p|q) = \sum_{x \in X} p(x) \log(\frac{p(x)}{q(x)})$

Where does the $\log$ come from, and what do these quantities measure?

![Claude Shannon](shannon.jpg)

### Information content and entropy



How much information is contained in the realization of an event?
How surprising is an event?

#### Equiprobable case


![image0](image.png)

$P(\text{"rain"}) = 0.5 = \frac{1}{2^1}$
The user's uncertainty is divided by 2, so they receive 1 bit of information.

On average, the realization of an event provides the user with $- (0.5 \times \log_2(0.5) + 0.5 \times \log_2(0.5)) = 1$ bit. This is the entropy of the distribution.

#### Non-equiprobable case

![image1](image1.png)

This is equivalent to a uniform distribution over the 4 events {"sun", "sun", "sun", "rain"}.

$P(\text{"rain"}) = 0.25 = \frac{1}{2^2}$

The user's uncertainty is divided by 4, so they receive 2 bits of information.

On average, the realization of an event provides the user with $- (0.75 \times \log_2(0.75) + 0.25 \times \log_2(0.25)) \approx 0.81$ bits $= H(p)$.

**Information Content** associated with an event $x$ under the probability $p$: $$I_p(x) = log(\frac{1}{p(x)})$$

Equivalently, with a base-2 logarithm, this is the minimum number of bits needed to encode the realization of the event.

- **Bridge** between the worlds of probabilities and information theory.
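As a quick check of the two weather examples, here is a minimal sketch of the information content in bits. The function name `information_content` is an assumption made for illustration, and the base-2 logarithm is chosen so that results are expressed in bits.

```python
import math


def information_content(prob: float) -> float:
    """Information content, in bits, of an event with probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return math.log2(1.0 / prob)


# Probabilities taken from the two weather examples above.
rain_fair = information_content(0.5)    # uncertainty halved
rain_rare = information_content(0.25)   # uncertainty divided by 4
sun_common = information_content(0.75)  # a likely event is less informative

print(rain_fair, rain_rare, round(sun_common, 3))
```

The rarer the event, the larger its information content: "rain" at probability 0.25 carries 2 bits, while "sun" at probability 0.75 carries less than half a bit.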

### Entropy

Average of the Information Content over all the possible events:
$$H(p) = E_p(I_p(x)) = - \sum_{x \in X} p(x) log(p(x))$$
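The expectation above can be sketched in a few lines of Python. The helper name `entropy` is an assumption for illustration, and events with zero probability are skipped by convention (they contribute nothing to the sum).

```python
import math


def entropy(probs):
    """Shannon entropy in bits: the average information content under `probs`."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


h_fair = entropy([0.5, 0.5])      # equiprobable rain/sun
h_skewed = entropy([0.75, 0.25])  # sun is three times as likely as rain

print(h_fair, round(h_skewed, 2))
```

The skewed distribution has a lower entropy than the equiprobable one, because its outcome is easier to predict on average.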


### Cross-Entropy

How should I encode these events to communicate them over a channel?

![image1](image7.png)

Each event $x$ is encoded with a binary code of length $m(x)$. Here $m(x) = 3$ for all events.

Here the event "sun" occurs more frequently, so we should encode it with fewer bits than a rarer event such as "storm".
Quantitatively, I could reduce the **average message length** by changing the encodings.


- $H(p) = 2.23$ bits
- $C(p|q) = 3$ bits

This average message length under the **true distribution** is the **Cross Entropy** of my encoding.

Cross-Entropy : $C(p|q) = E_p(m(x)) = E_p(I_q(x))$ where $q(x) = \frac{1}{2^{m(x)}}$ is the distribution underlying the binary messages.


Indeed, choosing an encoding is equivalent to making assumptions about the event probabilities (here a uniform distribution, with probabilities equal to $\frac{1}{2^3}$).

This encoding is better: ![image1](image4.png)

$C(p|q) = 2.42$ bits, which is closer to the entropy than the first encoding.

**Cross-Entropy** : $$C(p|q) = - \sum_{x \in X} p(x) log(q(x))$$
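The link between code lengths and cross-entropy can be sketched as follows. The figures are not reproduced in the text, so the distribution `p` and the variable-length code below are assumptions, chosen only because they reproduce the 2.23, 3 and 2.42 bits quoted above; the actual values in the figures may differ.

```python
import math

# ASSUMPTION: hypothetical true distribution over 8 events (not read from the figures).
p = [0.35, 0.35, 0.10, 0.10, 0.04, 0.04, 0.01, 0.01]
# First encoding: every event uses 3 bits.
fixed_lengths = [3] * 8
# ASSUMPTION: hypothetical variable-length code, shorter for frequent events.
variable_lengths = [2, 2, 3, 3, 4, 4, 5, 5]


def entropy(probs):
    """Shannon entropy in bits."""
    return sum(p_x * math.log2(1.0 / p_x) for p_x in probs if p_x > 0)


def average_length(probs, lengths):
    """Average message length E_p[m(x)] in bits."""
    return sum(p_x * m_x for p_x, m_x in zip(probs, lengths))


def cross_entropy(probs, q):
    """Cross-entropy E_p[I_q(x)] in bits."""
    return sum(p_x * math.log2(1.0 / q_x) for p_x, q_x in zip(probs, q) if p_x > 0)


# Distribution implied by the code lengths: q(x) = 1 / 2^m(x).
q_variable = [2.0 ** -m for m in variable_lengths]

h_p = entropy(p)
avg_fixed = average_length(p, fixed_lengths)
avg_variable = average_length(p, variable_lengths)
ce_variable = cross_entropy(p, q_variable)

print(round(h_p, 2), round(avg_fixed, 2), round(avg_variable, 2), round(ce_variable, 2))
```

Under these assumptions the average message length and the cross-entropy computed from $q(x) = \frac{1}{2^{m(x)}}$ coincide, which is the identity $E_p(m(x)) = E_p(I_q(x))$ stated above.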


### Kullback-Leibler divergence

How much extra information do I send with my encoding? 

Instead of computing the length of each message, we measure the difference between our message length and the information content under the true distribution $p$:
$\delta_{p,q}(x) = m(x)-I_p(x) = I_q(x)-I_p(x)$

The **Kullback-Leibler divergence** is the average of these additional message lengths:

$$\begin{align}
KL(p|q) &= E_p[\delta_{p,q}(x)] \\
&= \sum_{x \in X} p(x) \left(\log(\frac{1}{q(x)}) - \log(\frac{1}{p(x)})\right) \\
&= \sum_{x \in X} p(x) \log(\frac{p(x)}{q(x)}) \\
&= - \sum_{x \in X} p(x) \log(q(x)) - \left(- \sum_{x \in X} p(x) \log(p(x))\right) \\
&= C(p|q) - H(p)
\end{align}$$

![image1](image4.png)

On average, I send $KL(p|q) = 2.42-2.23 = 0.19$ more bits than necessary into my channel.

Link with Cross-Entropy:
$$C(p|q) = H(p) + KL(p|q) $$
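The decomposition can be checked numerically. This is a minimal sketch: the distribution `p` and the code lengths are assumptions (the figures are not reproduced in the text), chosen because they give the 2.23 and 2.42 bits quoted above.

```python
import math

# ASSUMPTION: hypothetical true distribution and code lengths, not read from the figures.
p = [0.35, 0.35, 0.10, 0.10, 0.04, 0.04, 0.01, 0.01]
lengths = [2, 2, 3, 3, 4, 4, 5, 5]
# Distribution implied by the encoding: q(x) = 1 / 2^m(x).
q = [2.0 ** -m for m in lengths]


def entropy(probs):
    """H(p) in bits."""
    return -sum(p_x * math.log2(p_x) for p_x in probs if p_x > 0)


def cross_entropy(probs, q_probs):
    """C(p|q) in bits."""
    return -sum(p_x * math.log2(q_x) for p_x, q_x in zip(probs, q_probs) if p_x > 0)


def kl_divergence(probs, q_probs):
    """KL(p|q) in bits, computed directly from its definition."""
    return sum(p_x * math.log2(p_x / q_x) for p_x, q_x in zip(probs, q_probs) if p_x > 0)


h = entropy(p)
ce = cross_entropy(p, q)
kl = kl_divergence(p, q)

print(round(h, 2), round(ce, 2), round(kl, 2))
```

The directly computed divergence matches $C(p|q) - H(p)$ up to floating-point error, so the extra bits sent are exactly the gap between the cross-entropy and the entropy.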

### In machine learning

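In multi-class classification, $q$ is a distribution over classes estimated from sample data. Because $H(p)$ does not depend on $q$, minimizing the cross-entropy is equivalent to minimizing the KL divergence:
$$ \arg\min_q C(p|q) \Leftrightarrow \arg\min_q KL(p|q) $$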

$C(p|q)$ is also easier to compute, since it has no entropy term.
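To illustrate the equivalence, here is a small sketch that ranks a few candidate distributions by both criteria. The true distribution `p` and the candidates are hypothetical values made up for illustration; natural logarithms are used, as is common in machine learning, so the results are in nats rather than bits.

```python
import math


def cross_entropy(p, q):
    """C(p|q) in nats."""
    return -sum(p_x * math.log(q_x) for p_x, q_x in zip(p, q) if p_x > 0)


def kl_divergence(p, q):
    """KL(p|q) in nats."""
    return sum(p_x * math.log(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0)


# ASSUMPTION: hypothetical true class distribution and candidate estimates.
p = [0.7, 0.2, 0.1]
candidates = {
    "uniform": [1 / 3, 1 / 3, 1 / 3],
    "close": [0.6, 0.3, 0.1],
    "wrong": [0.1, 0.2, 0.7],
}

best_by_ce = min(candidates, key=lambda name: cross_entropy(p, candidates[name]))
best_by_kl = min(candidates, key=lambda name: kl_divergence(p, candidates[name]))

# The gap C(p|q) - KL(p|q) is H(p), the same constant for every candidate q.
gaps = [cross_entropy(p, q) - kl_divergence(p, q) for q in candidates.values()]

print(best_by_ce, best_by_kl)
```

Both criteria select the same candidate, because they differ only by the constant $H(p)$. With one-hot labels, $H(p) = 0$ and the two quantities are equal.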

- Largely inspired by the video "A Short Introduction to Entropy, Cross-Entropy and KL-Divergence", by Aurélien Géron
- https://en.wikipedia.org/wiki/Information_content
- https://en.wikipedia.org/wiki/Entropy_(information_theory)