## Information Theory for Language
* Text = String = Sequence of Chars
* How many bits per character are needed to encode English?

#### Entropy
$$H(X) = - \sum_{x \in X} p(x) \cdot \text{log}_2 \ p(x)$$

Example: a horse race with 8 horses, all equally likely to win, $p(x_i) = 1/8$.
Therefore $H(X) = - \sum_{i=1}^8 \frac{1}{8} \text{log}_2 \ \frac{1}{8} = 3$ bits.

_When all 8 outcomes are equally likely, we need 3 bits on average to convey the outcome (3 bits can represent 8 different values)._

However, say we have the following probability distribution:
* $p(x_1) = 1/2$
* $p(x_2) = 1/4$
* $p(x_3) = 1/8$
* $p(x_4) = 1/16$
* $p(x_5) = p(x_6) = p(x_7) = p(x_8) = 1/64$

Then the entropy becomes:

$H(X) = -\left(\frac{1}{2} \text{log}_2 \frac{1}{2} + \frac{1}{4} \text{log}_2 \frac{1}{4} + \frac{1}{8} \text{log}_2 \frac{1}{8} + \frac{1}{16} \text{log}_2 \frac{1}{16} + 4\cdot\frac{1}{64} \text{log}_2 \frac{1}{64}\right) = 2$ bits

Thus, when the distribution heavily favors certain outcomes, we need only 2 bits on average to convey the outcome.

In [7]:
import numpy as np
# Uniform distribution: all 8 horses equally likely
X = np.ones(8) / 8
print(np.sum(-X * np.log2(X)))

# Skewed distribution: a few horses are much more likely to win
X = 1 / np.array([2, 4, 8, 16, 64, 64, 64, 64])
print(np.sum(-X * np.log2(X)))

3.0
2.0
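One way to read the 2-bit result is as an average code length: if each outcome gets a code word of length $-\text{log}_2 \ p(x_i)$ bits, likely horses get short code words and unlikely ones get long code words. The sketch below is an illustration added here, not part of the original cells; the names `lengths`, `expected_bits` and `kraft_sum` are made up for this example.

```python
import numpy as np

# Skewed horse-race distribution used in this section
p = 1 / np.array([2, 4, 8, 16, 64, 64, 64, 64])

# Ideal code-word length for each outcome: -log2 p(x_i) bits
lengths = -np.log2(p)

# Expected number of bits per message; this is the entropy H(X)
expected_bits = np.sum(p * lengths)

# Kraft sum: a prefix-free binary code with these lengths exists if this is <= 1
kraft_sum = np.sum(2.0 ** -lengths)

print(lengths)
print(expected_bits)
print(kraft_sum)
```

The most likely horse needs a single bit, while each of the four unlikely horses needs six, and the probability-weighted average is the entropy.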


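Entropy over a __sequence__ of $n$ words, summing over all sequences $W = w_1 ... w_n$ of length $n$ in the language $L$:

$$H(w_1,w_2,...,w_n) = -\sum_{W \in L} p(W) \ \text{log} \ p(W)$$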

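__Entropy Rate $H(L)$__

The entropy rate is the per-word entropy in the limit of infinitely long sequences:

$$H(L) := \text{lim}_{n \rightarrow \infty} \ \frac{1}{n} H(w_1,...,w_n) = - \text{lim}_{n \rightarrow \infty} \ \frac{1}{n} \sum_{W \in L} p(w_1,...,w_n) \cdot \text{log} \ p(w_1,...,w_n)$$

Shannon-McMillan-Breiman (SMB) theorem: if the language is stationary and ergodic, a single sufficiently long sequence can replace the sum over all sequences:

$$H(L) = \text{lim}_{n \rightarrow \infty} \ - \frac{1}{n} \ \text{log} \ p(w_1 w_2 ... w_n)$$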

__Cross Entropy $H(p,m)$__

Upper bound on the Entropy:

$$H(p) \leq H(p,m)$$

__Idea__: Estimate the true Entropy by using a very long sequence, rather than summing over _all_ possible sequences in some language $L$.

Useful when we do not know the probability distribution $p$ which generated the data.

Instead, use a model $m$ as an approximation of the unknown distribution, $m \approx p$. The cross entropy of $m$ on $p$ is

$$H(p,m) = \text{lim}_{n \rightarrow \infty} \ - \frac{1}{n} \sum_{W \in L} p(w_1,...,w_n) \cdot \text{log} \ m(w_1,...,w_n)$$

Using the SMB theorem (for a stationary ergodic process), this simplifies to an expression that evaluates only $m$, on a single long sequence generated by $p$:

$$H(p,m) = \text{lim}_{n \rightarrow \infty} \ - \frac{1}{n} \text{log} \ m(w_1,...,w_n)$$
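A small numerical sketch of both claims, under a strong simplifying assumption: words are drawn independently from $p$ (an i.i.d. source), so the sequence probability factorizes and the exact cross entropy is $-\sum_x p(x) \ \text{log}_2 \ m(x)$. The model `m` below is a made-up distribution chosen only for illustration, and `rng`, `sample` and `estimate` are names introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# True distribution p (the skewed horse race) and a hypothetical model m
p = 1 / np.array([2, 4, 8, 16, 64, 64, 64, 64])
m = 1 / np.array([4, 4, 8, 8, 16, 16, 16, 16])

# Exact values for an i.i.d. source
entropy = -np.sum(p * np.log2(p))        # H(p)
cross_entropy = -np.sum(p * np.log2(m))  # H(p, m)

# Estimate from one long sample drawn from p: -1/n * log2 m(w_1 ... w_n)
n = 100_000
sample = rng.choice(len(p), size=n, p=p)
estimate = -np.sum(np.log2(m[sample])) / n

print(entropy, cross_entropy, estimate)
```

The estimate uses $p$ only to generate the sample, never to evaluate a probability, which is what makes it usable when $p$ is unknown. The exact cross entropy is larger than the entropy because $m \neq p$.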

#### Cross Entropy and Perplexity
We need an estimate of the cross entropy we can actually work with. Given a finite sequence of $N$ words $W = w_1 w_2 ... w_N$, the CE approximation is:

$$H(W) = - \frac{1}{N} \text{log}_2 \ P(w_1 w_2 ... w_N)$$

Perplexity definition:

$$Perp(W) = 2^{H(W)}$$

$$= P(w_1 w_2 ... w_N) ^{-1/N}$$

since $H(W)$ uses $\text{log}_2$ and $2 ^ {\text{log}_2 \ x} = x$. Applying the chain rule to $P(w_1 w_2 ... w_N)$:

$$= (\prod_{i=1}^N P(w_i | w_1 ... w_{i-1}))^{-1/N}$$

$$= \sqrt[N]{\frac{1}{\prod_{i=1}^N P(w_i | w_1 ... w_{i-1})}}$$
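The two forms of the perplexity can be compared numerically. This is a minimal sketch, assuming the conditional probabilities $P(w_i | w_1 ... w_{i-1})$ are already available as a list; the function name `perplexity` and the values in `probs` are invented for the example.

```python
import math

def perplexity(cond_probs):
    """Perplexity of a sequence, given P(w_i | w_1 ... w_{i-1}) for each word."""
    n = len(cond_probs)
    # H(W) = -1/N * log2 P(w_1 ... w_N); the product becomes a sum of logs
    h = -sum(math.log2(p) for p in cond_probs) / n
    return 2 ** h

# Hypothetical conditional probabilities for a 3-word sequence
probs = [0.5, 0.25, 0.125]

# Definition via the entropy estimate
print(perplexity(probs))

# N-th root of the inverse sequence probability
print((1 / math.prod(probs)) ** (1 / len(probs)))
```

Summing logs rather than multiplying probabilities avoids numerical underflow on long sequences.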

#### Perplexity for Uniform Distribution
If the distribution is uniform then $Perp(X) = \vert X \vert$

Proof: 

Let $X$ be a random variable with $N$ possible outcomes $x_1, ..., x_N$, so that $\vert X \vert = N$ and $p(x)=\frac{1}{N}$ for all $x\in X$.

The entropy of $X$ is then:

$$H(X) = -\sum_{i=1}^N p(x_i) \cdot \text{log}_2 p(x_i)$$
$$H(X) = - N \cdot (\frac{1}{N} \cdot \text{log}_2 \frac{1}{N}) = - \text{log}_2 \frac{1}{N} = \text{log}_2 N - \text{log}_2 1 = \text{log}_2 N$$

(since $\text{log}_2 1 = 0$)

The perplexity of $X$ then becomes:

$$2^ {\text{log}_2 N} = N $$

## N-Gram Models
Goal: Predict the next word in a sequence by choosing the most probable word $w_i$ given the preceding words $w_1 ... w_{i-1}$


### Probability
Probability for a sequence $P(w_1, w_2, ..., w_n)$

Notation: Let the sequence $w_1...w_n = w_1^n$

Chain rule for a sequence:
$P(w_1 ... w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) ... P(w_n | w_1^{n-1})$

$P(w_1 ... w_n) = \prod_{i=1}^n P(w_i | w_1^{i-1})$

### N-gram models
__Bigram model (N=2)__

Markov assumption: the probability of a word $w_i$ depends only on the previous word $w_{i-1}$

$$P(w_i | w_1^{i-1}) \approx P(w_i | w_{i-1})$$
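Under the bigram approximation, the chain rule product only needs a table of $P(w_i | w_{i-1})$. The sketch below uses a hand-written table with made-up probabilities. The start symbol `"<s>"`, used as the context of the first word, is an assumption of this example and is not defined in these notes.

```python
import math

# Hypothetical bigram probabilities P(w_i | w_{i-1}); "<s>" marks the sequence start
bigram = {
    ("<s>", "the"): 0.5,
    ("the", "horse"): 0.25,
    ("horse", "wins"): 0.5,
}

def bigram_log2_prob(words, table):
    """log2 P(w_1 ... w_n) under the bigram (Markov) approximation."""
    total = 0.0
    prev = "<s>"
    for w in words:
        # Each factor depends only on the previous word
        total += math.log2(table[(prev, w)])
        prev = w
    return total

log_p = bigram_log2_prob(["the", "horse", "wins"], bigram)
print(log_p, 2 ** log_p)
```

A word pair missing from the table raises a `KeyError` here; how to assign probabilities to unseen pairs is a separate question that this sketch does not address.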

__N-Gram model__

The bigram model can be generalized to use the previous $n$ words to predict the next word. This is called an __N-gram model__, where $N=n+1$

$$P(w_i | w_{1}^{i-1}) \approx P(w_i | w_{i-n}^{i-1})$$
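The context $w_{i-n}^{i-1}$ is simply a window of the $n$ preceding words. A minimal sketch of extracting it, assuming 0-based list positions (the formulas use 1-based indices) and a context that is shortened near the start of the sequence; `context` is a helper name introduced here.

```python
def context(words, i, n):
    """Return the (up to) n words preceding position i, with i counted from 0."""
    return tuple(words[max(0, i - n):i])

words = ["the", "horse", "wins", "the", "race"]

# n = 2 previous words, i.e. a trigram model (N = n + 1 = 3)
for i, w in enumerate(words):
    print(w, "|", context(words, i, n=2))
```

With `n=1` the same helper returns the single previous word, which is the bigram case.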

### Estimating Probabilities



In [11]:
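# Numerical check of Perp(X) = |X| for a uniform distribution over 10,000 outcomes
X = np.ones(10000)
X = X / len(X)

def H(X):
    # Entropy in bits
    return np.sum(-X * np.log2(X))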

In [13]:
2** H(X)

9999.99999999997

In [14]:
2**(np.log2(10000) -np.log2(1))

9999.999999999995