<h1 align="center">Information Theory</h1>

In **Statistics** and **Machine Learning**, we often need a notion of **distance** between probability distributions.

In other words, we want to know **how similar two different probability distributions are**, and, moreover, we want to **quantify this similarity**.

<h3 align="center">Measure of Information Content</h3>

Let $(\Omega, \Sigma, P)$ be a probability space and let $A \in \Sigma$ be some event.

$\textbf{Definition}$. The **information content**, or **self-information**, of an event $A$ is defined as:

$$I(A) = -\log P(A),$$

where the logarithm is taken to base 2 throughout, so information is measured in bits.

**Information content** satisfies the following **three** properties:
- Likely events have low information content, and events that are guaranteed to happen have no information content whatsoever.
- Less likely events have higher information content.
- Independent events have additive information.
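
The three properties above can be checked numerically. The following is a minimal sketch, assuming base-2 logarithms (information measured in bits); the function name `information_content` is hypothetical and chosen only for illustration.

```python
import math

def information_content(p: float) -> float:
    """Self-information, in bits, of an event with probability p in (0, 1]."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return -math.log2(p)

# Property 1: an event that is guaranteed to happen carries no information.
certain = information_content(1.0)

# Property 2: the less likely event carries more information.
likely = information_content(0.9)
unlikely = information_content(0.1)

# Property 3: for independent events P(A and B) = P(A) * P(B),
# so the information of the joint event is the sum of the parts.
p_a, p_b = 0.5, 0.25
joint = information_content(p_a * p_b)                            # 3 bits
separate = information_content(p_a) + information_content(p_b)    # 1 + 2 bits
```

The additivity in the last step is just the identity $\log(ab) = \log a + \log b$ applied to $P(A)P(B)$.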



<h3 align="center">Example</h3>

Let's assume we roll a fair die and observe the outcome.

If events are defined as follows:
<br> &emsp; $\bullet$ $A = \{1, 2, 3, 4, 5, 6\}$;
<br> &emsp; $\bullet$ $B = \{2, 4, 6\}$;
<br> &emsp; $\bullet$ $C = \{6\}$,

then the **information content** of these events is:
- $I(A)=-\log_2(1)=0$, i.e. event $A$ carries **no surprise** because we already knew that $A$ is guaranteed to happen.
- $I(B)=-\log_2(0.5)=1$, i.e. event $B$ has an **information content** of 1 bit. We gain some information: we learn that the outcome is even, so we can rule out the odd outcomes.
- $I(C)=-\log_2(\frac{1}{6})\approx2.58$, i.e. event $C$ has the smallest probability, so it carries the most **surprise**.

**Conclusion**: Rare events have high information content.
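
The die example can be reproduced with a few lines of Python. This is a sketch under the assumption of a uniform distribution over the six faces and base-2 logarithms; `event_information` is a hypothetical helper name, not part of any library.

```python
import math

def event_information(event: set, sample_space: set) -> float:
    """Information content, in bits, of `event` when every outcome
    in `sample_space` is equally likely. Raises ValueError for an
    impossible event, since log2(0) is undefined."""
    p = len(event & sample_space) / len(sample_space)
    return -math.log2(p)

die = {1, 2, 3, 4, 5, 6}
events = {
    "A": {1, 2, 3, 4, 5, 6},  # guaranteed to happen
    "B": {2, 4, 6},           # outcome is even
    "C": {6},                 # a single face
}

info = {name: event_information(ev, die) for name, ev in events.items()}
```

Here `info["A"]` is zero, `info["B"]` is one bit, and `info["C"]` equals $\log_2 6 \approx 2.58$ bits, matching the values above.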




<h3 align="center">Shannon entropy</h3>

We can quantify the amount of uncertainty in an entire probability distribution using the **Shannon entropy**.
<br>
In other words, it's the expected amount of information in an event drawn from that distribution.

Let $p(X)$ be a **probability mass function** for some random variable $X$.

$\textbf{Definition}$. The **entropy** of a random variable $X$ is defined as the expectation of the information content of its outcomes:

$$H(X)=\mathbb{E}_p[I_p(X)]=-\sum_{i=1}^{n}p(x_i)\log p(x_i),$$

where $x_i$ are the possible values of $X$.
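
As an illustration of the definition, the sketch below computes entropy for a fair and a loaded die. It assumes base-2 logarithms and the usual convention that terms with $p(x_i)=0$ contribute nothing; the function name `entropy` and the loaded-die probabilities are made up for this example.

```python
import math

def entropy(p) -> float:
    """Shannon entropy, in bits, of a PMF given as a sequence of probabilities."""
    # Skip zero-probability outcomes (convention: 0 * log 0 = 0).
    return -sum(px * math.log2(px) for px in p if px > 0)

fair_die = [1 / 6] * 6
loaded_die = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]  # hypothetical probabilities

h_fair = entropy(fair_die)      # equals log2(6), the maximum for six outcomes
h_loaded = entropy(loaded_die)  # lower: the outcome is easier to predict
```

The uniform distribution is the most uncertain one over a fixed set of outcomes, so `h_loaded` is smaller than `h_fair`.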

<h3 align="center">Cross entropy</h3>

Let $p(X)$ and $q(X)$ be two **probability mass functions** over the same set of events.

$\textbf{Definition}$. The **cross-entropy** between distributions $p$ and $q$ is defined as the following expectation:

$$H(p, q)=\mathbb{E}_p[I_q(X)]=-\sum_{i=1}^{n}p(x_i)\log q(x_i),$$

where $x_i$ are the possible values of $X$.

If $p=q$, then $H(p, q)=H(p)=H(q)$, i.e. cross-entropy becomes just entropy.
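
A minimal sketch of the cross-entropy formula follows, assuming base-2 logarithms. Two conventions are assumptions of this sketch rather than statements from the text: outcomes with $p(x_i)=0$ are skipped, and the result is infinite if $q(x_i)=0$ while $p(x_i)>0$. The name `cross_entropy` and the example distributions are hypothetical.

```python
import math

def cross_entropy(p, q) -> float:
    """Cross-entropy H(p, q), in bits, of two PMFs over the same outcomes."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue          # zero-probability outcomes contribute nothing
        if qx == 0:
            return math.inf   # q rules out an outcome that p allows
        total -= px * math.log2(qx)
    return total

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]

h_pp = cross_entropy(p, p)  # reduces to the entropy of p
h_pq = cross_entropy(p, q)  # larger, because q does not match p
```

With these inputs `h_pp` is 1.5 bits, which is the entropy of `p`, while `h_pq` is $\log_2 3$ bits.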


<h3 align="center">Kullback-Leibler (KL) divergence</h3>

Let $p(X)$ and $q(X)$ be two **probability mass functions** over the same set of events.

$\textbf{Definition}$. The **Kullback-Leibler divergence** between distributions $p$ and $q$ is defined as the following expectation:

$$KL(p, q)=\mathbb{E}_p\left[\log\frac{p(X)}{q(X)}\right]=\sum_{i=1}^{n}p(x_i)\log\frac{p(x_i)}{q(x_i)},$$

where $x_i$ are the possible values of $X$. The formula is valid when $q(x_i)\neq0$, or when $p(x_i)=q(x_i)=0$, in which case the term is taken to be $0$.
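
The definition can be sketched in Python as follows, assuming base-2 logarithms. Returning infinity when $q(x_i)=0$ but $p(x_i)>0$ is a common convention adopted here as an assumption; the name `kl_divergence` and the example distributions are hypothetical.

```python
import math

def kl_divergence(p, q) -> float:
    """KL(p, q), in bits, for two PMFs over the same outcomes."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue          # term is taken to be 0, including when qx == 0
        if qx == 0:
            return math.inf   # outside the range where the formula is valid
        total += px * math.log2(px / qx)
    return total

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]

kl_pp = kl_divergence(p, p)  # zero: a distribution does not diverge from itself
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)  # generally differs from kl_pq
```

Swapping the arguments changes the result, so the divergence is not a distance in the metric sense.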

- There is a relation between **Entropy**, **Cross-Entropy** and **KL divergence**:
$$H(p, q)=H(p)+KL(p, q).$$
- Like cross-entropy, KL divergence is used in Statistics and Machine Learning to measure how much one probability distribution differs from another. It is not symmetric: in general $KL(p, q)\neq KL(q, p)$.
- KL divergence is used as a loss function in approximate variational inference, a powerful unsupervised technique in ML for learning complex distributions.
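
The relation between entropy, cross-entropy, and KL divergence stated in the first bullet can be verified numerically. The sketch below assumes base-2 logarithms and that $q(x_i)>0$ for every outcome; all function names and the two example distributions are hypothetical.

```python
import math

def entropy(p) -> float:
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q) -> float:
    # Assumes q(x) > 0 wherever p(x) > 0.
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q) -> float:
    # Assumes q(x) > 0 wherever p(x) > 0.
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)

# True up to floating-point rounding: H(p, q) = H(p) + KL(p, q).
relation_holds = math.isclose(lhs, rhs)
```

Because $H(p)$ does not depend on $q$, minimizing the cross-entropy over $q$ is equivalent to minimizing $KL(p, q)$.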