# Information, Entropy and KL-divergence

[Back to index](https://shotahorii.github.io/math-for-ds/)

---

## Table of contents
1. **Introduction**
2. **Self-Information**  
2.1. Definition  
2.2. Example  
3. **Entropy**  
3.1. Definition  
3.2. Example  
4. **Cross Entropy**  
4.1. Definition  
4.2. Example  
5. **KL-divergence**  
5.1. Definition  
5.2. Example  
6. **Mutual Information**  
6.1. Definition  
6.2. Example

---

## 1. Introduction
The minimum average number of bits (or nats) needed to convey an observed event $ X=x $ drawn from a probability distribution $P(X)$ depends only on $P(X)$.  
This minimum is achieved by assigning a code of length $-logP(\omega)$ to an event that occurs with probability $P(\omega)$. The unit of length is the **bit** when the base of $log$ is $2$ and the **nat** when the base of $log$ is $e$.  
The bit-size (or nat-size) used to convey an event $X=\omega$ under this encoding is called the **self-information** of the event. Its average over all possible realisations of $X$ under $P(X)$ is called the **entropy** of the probability distribution $P(X)$.  

I'll use the following example throughout this notebook.  
> Assume that we observe the weather in 3 different countries (A, B and C) every morning and report it to the meteorological bureau.  
We know the probability distribution of each country's weather, as below.  
**(Country A) Sunny=1/4, Cloudy=1/4, Rain=1/4, Snow=1/4**  
**(Country B) Sunny=1/2, Cloudy=1/4, Rain=1/8, Snow=1/8**  
**(Country C) Sunny=1, Cloudy=0, Rain=0, Snow=0**

---

## 2. Self-Information
### 2.1. Definition
Given a random variable $X$ with probability mass function $P_X(x)$, the self-information of observing the outcome $X=x$ is

$I_X(x) = -logP_X(x)$ 
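
As a minimal sketch of this definition, the function below computes $-logP_X(x)$ for a single probability and shows how the choice of base switches between bits and nats. The function name `self_information` and its `base` parameter are illustrative choices, not part of the original text.

```python
import math

def self_information(p, base=2.0):
    """Return -log_base(p) for a probability p in (0, 1]."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # log(1/p) equals -log(p); writing it this way avoids returning -0.0 for p = 1.
    return math.log(1.0 / p, base)

bits = self_information(1 / 8)           # base 2: measured in bits
nats = self_information(1 / 8, math.e)   # base e: measured in nats

# Dividing nats by ln(2) converts back to bits.
print(bits, nats, nats / math.log(2))
```

The same event carries the same information in either unit; only the scale changes, by a constant factor of $ln2$.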

### 2.2. Example
When the actual weather is observed in each of the 3 countries, the bit-size needed to convey it is as below.
#### Country A
$I_A(Sunny) = -log_2P_A(Sunny) = -log_2\frac{1}{4} = 2$  
$I_A(Cloudy) = -log_2P_A(Cloudy) = -log_2\frac{1}{4} = 2$  
$I_A(Rain) = -log_2P_A(Rain) = -log_2\frac{1}{4} = 2$  
$I_A(Snow) = -log_2P_A(Snow) = -log_2\frac{1}{4} = 2$

#### Country B
$I_B(Sunny) = -log_2P_B(Sunny) = -log_2\frac{1}{2} = 1$  
$I_B(Cloudy) = -log_2P_B(Cloudy) = -log_2\frac{1}{4} = 2$  
$I_B(Rain) = -log_2P_B(Rain) = -log_2\frac{1}{8} = 3$  
$I_B(Snow) = -log_2P_B(Snow) = -log_2\frac{1}{8} = 3$

#### Country C
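$I_C(Sunny) = -log_2P_C(Sunny) = -log_2 1 = 0$ (As this country is always Sunny, no need to send any information.)  
Cloudy, Rain and Snow have probability $0$ in this country, so they never occur and their self-information $-log_2 0$ is not finite.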
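
The values above can be reproduced with a short script. This is a sketch only: the dictionary `weather` and the helper `self_information_bits` are names introduced here for illustration, and reporting zero-probability events as `math.inf` is an assumption of this sketch (since $-log_2 0$ diverges), not something the example states.

```python
import math

# Weather distributions from the running example.
weather = {
    "A": {"Sunny": 1 / 4, "Cloudy": 1 / 4, "Rain": 1 / 4, "Snow": 1 / 4},
    "B": {"Sunny": 1 / 2, "Cloudy": 1 / 4, "Rain": 1 / 8, "Snow": 1 / 8},
    "C": {"Sunny": 1.0, "Cloudy": 0.0, "Rain": 0.0, "Snow": 0.0},
}

def self_information_bits(p):
    # An event with probability 0 never occurs; -log2(0) diverges,
    # so this sketch reports infinity for it.
    if p == 0:
        return math.inf
    return math.log2(1.0 / p)

# Self-information (in bits) of every weather event in every country.
info = {
    country: {event: self_information_bits(p) for event, p in dist.items()}
    for country, dist in weather.items()
}

for country, table in info.items():
    print(country, table)
```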

---

## 3. Entropy
### 3.1. Definition

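The entropy of a probability mass function $P_X$ is the expected self-information. The sum runs over all values $x$ of $X$, and terms with $P_X(x)=0$ are taken to be $0$.

$H[P_X] = - \sum P_X(x) logP_X(x) = E_{P_X}[-logP_X(x)] = \sum P_X(x) I_X(x)$ 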
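
A minimal sketch of this definition in code, assuming the distribution is passed as a plain list of probabilities. The function name `entropy` and the coin inputs are illustrative only. Zero-probability terms are skipped, which corresponds to the usual convention that $0 \cdot log0$ counts as $0$.

```python
import math

def entropy(probs, base=2.0):
    """Entropy of a discrete distribution given as a sequence of probabilities."""
    probs = list(probs)
    if not math.isclose(sum(probs), 1.0):
        raise ValueError("probabilities must sum to 1")
    # Skip p = 0 terms: they contribute nothing to the expectation.
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])     # illustrative input: maximally uncertain coin
certain_coin = entropy([1.0, 0.0])  # illustrative input: outcome known in advance

print(fair_coin, certain_coin)
```

A uniform distribution gives the largest entropy for a fixed number of outcomes, and a distribution that puts all its mass on one outcome gives entropy $0$.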

### 3.2. Example
The average bit-size needed to convey the weather information in each country is as below. 
#### Country A
$H[P_A] = P_A(Sunny)I_A(Sunny) + P_A(Cloudy)I_A(Cloudy) + P_A(Rain)I_A(Rain) + P_A(Snow)I_A(Snow) = 
\frac{1}{4}\cdot2 + \frac{1}{4}\cdot2 + \frac{1}{4}\cdot2 + \frac{1}{4}\cdot2 = 2$

**Encoding**: $Sunny=00, Cloudy=01, Rain=10, Snow=11$

#### Country B
$H[P_B] = P_B(Sunny)I_B(Sunny) + P_B(Cloudy)I_B(Cloudy) + P_B(Rain)I_B(Rain) + P_B(Snow)I_B(Snow) = 
\frac{1}{2}\cdot1 + \frac{1}{4}\cdot2 + \frac{1}{8}\cdot3 + \frac{1}{8}\cdot3 = 1.75$

**Encoding**: $Sunny=0, Cloudy=10, Rain=110, Snow=111$

#### Country C
$H[P_C] = P_C(Sunny)I_C(Sunny) = 1\cdot0 = 0$ (the zero-probability events contribute nothing to the sum)

**Encoding**: N/A
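
The encodings for Countries A and B can be checked numerically: the sketch below compares the expected code length with the entropy. The helper names are illustrative. The prefix-free check is an extra added here, on the assumption that the codes are meant to be decodable when messages are sent back to back.

```python
import math

dists = {
    "A": {"Sunny": 1 / 4, "Cloudy": 1 / 4, "Rain": 1 / 4, "Snow": 1 / 4},
    "B": {"Sunny": 1 / 2, "Cloudy": 1 / 4, "Rain": 1 / 8, "Snow": 1 / 8},
}
codes = {
    "A": {"Sunny": "00", "Cloudy": "01", "Rain": "10", "Snow": "11"},
    "B": {"Sunny": "0", "Cloudy": "10", "Rain": "110", "Snow": "111"},
}

def entropy_bits(dist):
    # Entropy in bits; zero-probability events are skipped.
    return sum(p * math.log2(1.0 / p) for p in dist.values() if p > 0)

def expected_length(dist, code):
    # Average number of bits sent per observation under the given code.
    return sum(p * len(code[event]) for event, p in dist.items())

def is_prefix_free(code):
    # True if no codeword is a prefix of a different codeword.
    words = list(code.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

for country in dists:
    print(
        country,
        entropy_bits(dists[country]),
        expected_length(dists[country], codes[country]),
        is_prefix_free(codes[country]),
    )
```

For both countries the expected code length matches the entropy, because every probability is a power of $\frac{1}{2}$ and so every $-log_2P(x)$ is a whole number of bits.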

---

## 4. Cross Entropy
### 4.1. Definition

$H(p,q) = E_p[-logq] = -\sum p(x)logq(x)$
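
A minimal sketch of this formula in code, assuming $p$ and $q$ are given as dictionaries over the same events and that the result is wanted in bits. The function name `cross_entropy_bits` is illustrative, and returning infinity when $q(x)=0$ for an event with $p(x)>0$ is a choice made for this sketch. The inputs are the Country A and Country B distributions from the introduction.

```python
import math

def cross_entropy_bits(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x), for dicts mapping events to probabilities."""
    total = 0.0
    for event, px in p.items():
        if px == 0:
            continue  # events that never occur under p contribute nothing
        qx = q.get(event, 0.0)
        if qx == 0:
            return math.inf  # p can produce an event that q treats as impossible
        total += px * math.log2(1.0 / qx)
    return total

p_a = {"Sunny": 1 / 4, "Cloudy": 1 / 4, "Rain": 1 / 4, "Snow": 1 / 4}
p_b = {"Sunny": 1 / 2, "Cloudy": 1 / 4, "Rain": 1 / 8, "Snow": 1 / 8}

print(cross_entropy_bits(p_b, p_b))  # equals the entropy of P_B
print(cross_entropy_bits(p_b, p_a))  # Country B weather, code lengths from P_A
print(cross_entropy_bits(p_a, p_b))  # Country A weather, code lengths from P_B
```

When $q = p$ the cross entropy reduces to the entropy of $p$. With a mismatched $q$ it is never smaller than that entropy.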