## Information Theory for Language
* Text = String = Sequence of Chars
* How many bits per character are needed to encode English?

#### Entropy
$$H(X) = - \sum_{x \in X} p(x) \cdot \text{log} \ p(x)$$

Example: a horse race with 8 horses, each equally likely to win, $p(x_i) = 1/8$.
Taking logarithms base 2 (so that entropy is measured in bits), $H(X) = - \sum_{i=1}^8 \frac{1}{8} \text{log}_2 \ \frac{1}{8} = 3$

_This means that when all 8 outcomes are equally likely, we need 3 bits on average to convey which outcome occurred (3 bits can represent 8 different values)._

However, say we have the following probability distribution:
* $p(x_1) = 1/2$
* $p(x_2) = 1/4$
* $p(x_3) = 1/8$
* $p(x_4) = 1/16$
* $p(x_5) = p(x_6) = p(x_7) = p(x_8) = 1/64$

Then the entropy becomes:

$H(X) = -\left(\frac{1}{2} \text{log} \frac{1}{2} + \frac{1}{4} \text{log} \frac{1}{4} + \frac{1}{8} \text{log} \frac{1}{8}+ \frac{1}{16} \text{log} \frac{1}{16} + 4\cdot\frac{1}{64} \text{log} \frac{1}{64}\right) = 2$

Thus, if the distribution heavily favors certain outcomes, we need only 2 bits on average to convey the outcome.

```python
import numpy as np

# Uniform distribution: all 8 outcomes equally likely
X = np.full(8, 1 / 8)
print(np.sum(-X * np.log2(X)))

# Skewed distribution from the example above
X = 1 / np.array([2, 4, 8, 16, 64, 64, 64, 64])
print(np.sum(-X * np.log2(X)))
```

```
3.0
2.0
```


Entropy over a __sequence__

$$H(w_1,w_2,...,w_n) = -\sum_W p(W) \ \text{log} \ p(W)$$

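__Entropy Rate $H(L)$__

The per-word entropy of a language $L$, in the limit of infinitely long sequences:

$$H(L) := - \text{lim}_{n \rightarrow \infty} \ \frac{1}{n} \sum_{W \in L} p(w_1,...,w_n) \cdot \text{log} \ p(w_1,...,w_n)$$

Shannon-McMillan-Breiman theorem: if the language is stationary and ergodic, a single sufficiently long sequence is enough:

$$H(L) = \text{lim}_{n \rightarrow \infty} \ - \frac{1}{n} \ \text{log} \ p(w_1 w_2 ... w_n)$$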

__Cross Entropy $H(p,m)$__

Upper bound on the Entropy:

$$H(p) \leq H(p,m)$$

__Idea__: Estimate the true Entropy by using a very long sequence, rather than summing over _all_ possible sequences in some language $L$.

Useful when we don't know the probability distribution $p$ which generated the data.

Instead, use a model $m$ as an approximation of the unknown distribution, $m \approx p$. The cross entropy of $m$ on $p$ is

$$H(p,m) = \text{lim}_{n \rightarrow \infty} \ - \frac{1}{n} \sum_{W \in L} p(w_1,...,w_n) \cdot \text{log} \ m(w_1,...,w_n)$$

Using the SMB theorem, the sum over all sequences weighted by $p$ can be replaced by a single long sequence drawn from $p$, so that only $m$ has to be evaluated:

$$H(p,m) = \text{lim}_{n \rightarrow \infty} \ - \frac{1}{n} \text{log} \ m(w_1,...,w_n)$$
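
As a rough illustration of the two formulas above, the following sketch compares the exact cross entropy with the single-sequence estimate. It assumes an i.i.d. source, so that a long sequence is just many independent draws from $p$. The distribution `m`, the sample size `n`, and the random seed are made-up values for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# True distribution p (the skewed horse race) and a hypothetical model m.
p = 1 / np.array([2, 4, 8, 16, 64, 64, 64, 64])
m = np.array([0.4, 0.2, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05])

entropy = -np.sum(p * np.log2(p))        # H(p)
cross_entropy = -np.sum(p * np.log2(m))  # H(p, m), needs p explicitly

# Single-sequence estimate: draw one long sequence from p and
# evaluate only the model m on it (i.i.d. symbols assumed).
n = 100_000
sequence = rng.choice(len(p), size=n, p=p)
estimate = -np.mean(np.log2(m[sequence]))

print(entropy, cross_entropy, estimate)
```

The estimate should land close to the exact $H(p,m)$, and both stay above $H(p)$, in line with the upper bound.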

#### Cross Entropy and Perplexity
We need an estimate of the cross entropy we can actually work with. Given a sequence of words $W$ the CE approximation is:

$$H(W) = - \frac{1}{N} \text{log} \ P(w_1 w_2 ... w_N)$$

Perplexity Definition: 

$$Perp(W) = 2^{H(W)}$$

$$= P(w_1 w_2 ... w_N) ^{-1/N}$$

since we used $\text{log}_2$ and $2 ^ {\text{log}_2 \ x} = x$. Expanding the joint probability with the chain rule:

$$= (\prod_{i=1}^N P(w_i | w_1 ... w_{i-1}))^{-1/N}$$

$$= \sqrt[N]{\frac{1}{\prod_{i=1}^N P(w_i | w_1 ... w_{i-1})}}$$
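
The equivalence of the two forms can be checked numerically. This is a minimal sketch: the function name `perplexity` and the conditional probabilities are illustrative assumptions, not values from a trained model.

```python
import math

def perplexity(cond_probs):
    """Perplexity of a sequence from its per-word conditional probabilities."""
    n = len(cond_probs)
    log_prob = sum(math.log2(p) for p in cond_probs)  # log2 P(w_1 ... w_N)
    cross_entropy = -log_prob / n                     # H(W)
    return 2 ** cross_entropy

# Hypothetical values of P(w_i | w_1 ... w_{i-1}) for a 4-word sequence.
cond_probs = [1 / 3, 1 / 2, 2 / 3, 1 / 2]

perp = perplexity(cond_probs)
# Direct form: N-th root of the inverse product.
direct = math.prod(cond_probs) ** (-1 / len(cond_probs))
print(perp, direct)
```

Both numbers agree up to floating point rounding. Summing log probabilities, as in `perplexity`, is usually preferable for long sequences because the raw product underflows quickly.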

#### Perplexity for Uniform Distribution
If the distribution is uniform then $Perp(X) = \vert X \vert$

Proof: 

Let $X$ take values in $\{x_1, ..., x_N\}$, so that $\vert X \vert = N$ and $p(x)=\frac{1}{N}$ for all $x\in X$.

The entropy of $X$ is then:

$$H(X) = -\sum_{i=1}^N p(x_i) \cdot \text{log}_2 p(x_i)$$
$$H(X) = - N \cdot (\frac{1}{N} \cdot \text{log}_2 \frac{1}{N}) = - \text{log}_2 \frac{1}{N} = \text{log}_2 N - \text{log}_2 1 = \text{log}_2 N$$

(We used $\text{log}_2 1 = 0$)

The perplexity of $X$ then becomes:

$$2^ {\text{log}_2 N} = N \quad \square$$
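
A quick numerical check of this result, reusing the two horse race distributions. The helper name `perplexity` is an assumption made for this sketch.

```python
import numpy as np

def perplexity(p):
    """Perplexity 2^H(X) of a discrete distribution p."""
    entropy = -np.sum(p * np.log2(p))
    return 2 ** entropy

uniform = np.full(8, 1 / 8)
skewed = 1 / np.array([2, 4, 8, 16, 64, 64, 64, 64])

print(perplexity(uniform))  # equals the number of outcomes, 8
print(perplexity(skewed))   # smaller: 2^2 = 4
```

One way to read this: the skewed distribution is as hard to predict as a uniform choice among 4 outcomes, even though 8 outcomes are possible.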

### Evaluating language models
Extrinsic evaluation: Compare models by their accuracy on a downstream task, e.g. run two language models over a text and see which one has the higher accuracy. Usually very expensive and time consuming.

Intrinsic evaluation: Quality of the model independent of any application. Whichever model assigns the highest probability to the test set, i.e. has the lowest perplexity, is the best.


## N-Gram Models
Goal: Predict the next word in a sequence by choosing the most probable word $w_i$ given the preceding sequence $w_1 ... w_{i-1}$


### Probability
Probability for a sequence $P(w_1, w_2, ..., w_n)$

Notation: Let the sequence $w_1...w_n = w_1^n$

Chain rule for a sequence:
$P(w_1 ... w_n) = P(w_1) P(w_2|w_1) P(w_3 | w_1^2) ... P(w_n | w_1^{n-1})$

$P(w_1 ... w_n) = \prod_{i=1}^n P(w_i | w_1^{i-1})$

### N-gram models
__Bigram model ($N=2$)__

Markov assumption: The probability of a word $w_i$ depends only on the previous word $w_{i-1}$

$$P(w_i | w_1^{i-1}) \approx P(w_i | w_{i-1})$$

__N-Gram model__

The bigram model can be generalized to use the previous $n$ words to predict the next word. This is called an __N-gram model__, where $N=n+1$

$$P(w_i | w_{1}^{i-1}) \approx P(w_i | w_{i-n}^{i-1})$$




### Estimating Probabilities
Model: $p(w_i | w_1^{i-1}) \approx p(w_i | w_{i-n}^{i-1})$

Empirically: $$p(w_i | w_{i-n}^{i-1}) = \frac{\text{count}(w_{i-n}^{i})}{\sum_{w'} \text{count} (w_{i-n}^{i-1}, w')} = \frac{\text{count}(w_{i-n}^{i})}{\text{count} (w_{i-n}^{i-1})}$$

#### Bigram example
Example on the following corpus:
```
<S> I am Sam </S>
<S> Sam I am </S>
<S> I do not like green eggs and ham </S>
```

Using a bigram model, the probabilities of words become:
* p(I | `<S>`) = 2/3
* p(Sam | `<S>`) = 1/3
* p(am | I) = 2/3
* p(`</S>` | Sam) = 1/2
* p(Sam | am) = 1/2
* p(do | I) = 1/3
* p(I | Sam) = 1/2
* p(`</S>` | am) = 1/2

Probability for a sentence:

p(`<S> Sam I am </S>`) = p(`Sam`|`<S>`) p(`I`|`Sam`) p(`am`|`I`) p(`</S>`|`am`)

$= \frac{1}{3} \cdot \frac{1}{2} \cdot \frac{2}{3} \cdot \frac{1}{2} = \frac{2}{36} = \frac{1}{18}$
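
The counting behind these estimates can be sketched in a few lines. The helper names `p` and `sentence_prob` are assumptions for this example, and `Fraction` is used only to keep the results exact.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "<S> I am Sam </S>",
    "<S> Sam I am </S>",
    "<S> I do not like green eggs and ham </S>",
]

bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1  # how often prev is followed by any word

def p(word, prev):
    """Maximum-likelihood estimate of p(word | prev); prev must have been seen."""
    return Fraction(bigram_counts[(prev, word)], context_counts[prev])

def sentence_prob(sentence):
    """Product of bigram probabilities over a sentence with <S> and </S> markers."""
    tokens = sentence.split()
    prob = Fraction(1)
    for prev, word in zip(tokens, tokens[1:]):
        prob *= p(word, prev)
    return prob

print(p("Sam", "<S>"))                     # 1/3
print(sentence_prob("<S> Sam I am </S>"))  # 1/18
```

Any sentence containing a bigram that never occurs in the corpus gets probability 0 under this estimate.
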
#### N-Gram example
???

### Unknown Words
Problem: Basic language models need a fixed vocabulary. Unknown words are words that are not in the vocabulary.

What is the probability of seeing an unknown word next?

Idea: Treat all unknown words in the training corpus the same by assigning the $UNK$ token to them. All future unknown words are then also treated as the $UNK$ token.

Problem: By choosing a small vocabulary, and thus mapping a lot of words to $UNK$, the perplexity score will be misleadingly low: the model mostly predicts $UNK$, i.e. that it doesn't know the next word, which makes it a really bad model. Perplexities are therefore only comparable between models that share the same vocabulary.
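
A minimal sketch of the $UNK$ replacement. It assumes the vocabulary is chosen by a frequency threshold, which is one common option but not the only one. The token spelling `<UNK>`, the helper names, and `min_count=2` are assumptions for the example.

```python
from collections import Counter

UNK = "<UNK>"  # assumed spelling of the unknown-word token

def build_vocab(tokens, min_count=2):
    """Keep only words seen at least min_count times (hypothetical threshold)."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= min_count}

def replace_unknown(tokens, vocab):
    """Map every out-of-vocabulary word to the UNK token."""
    return [word if word in vocab else UNK for word in tokens]

train = "I am Sam Sam I am I do not like green eggs and ham".split()
vocab = build_vocab(train, min_count=2)

train_unk = replace_unknown(train, vocab)               # used to train the model
test_unk = replace_unknown("I am hungry".split(), vocab)  # same mapping at test time
print(train_unk)
print(test_unk)
```

Raising `min_count` shrinks the vocabulary and maps more of the text to `<UNK>`, which is exactly the effect that makes the perplexity look better than the model really is.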


### Smoothing
Problem: N-grams that consist of known words but never occur in the training corpus get a probability of zero. Any sequence containing one of them then also gets probability zero, destroying the probabilistic modelling.

Idea: Smooth the probability distribution slightly in favor of rarely seen n-grams/words, i.e. take some mass away from frequent n-grams and give it to infrequent ones.

#### Laplace/Add-One Smoothing
A poor estimator in practice, but useful as a baseline.

Idea: For a vocab of size $V$
* Unsmoothed n-gram probability: $$P(w_i | w_{i-n}^{i-1}) = \frac{\text{count}(w_{i-n}^{i})}{\text{count} (w_{i-n}^{i-1})}$$
* Laplace smoothed n-gram probability: $$P(w_i | w_{i-n}^{i-1}) = \frac{\text{count}(w_{i-n}^{i}) + 1}{\text{count} (w_{i-n}^{i-1}) +V}$$

However, when there are many zero counts, this method moves too much probability mass to unseen n-grams and can really mess up the distribution.

Idea: Can be generalized to add-k smoothing, where a fractional count $k \in (0,1]$ is added rather than 1.

$$P(w_i | w_{i-n}^{i-1}) = \frac{\text{count}(w_{i-n}^{i}) + k}{\text{count} (w_{i-n}^{i-1}) +kV}$$
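
A small sketch of add-k smoothing for bigrams on the three-sentence "I am Sam" toy corpus. The function name `p_add_k` is an assumption, and so is the choice to count the sentence markers as part of the vocabulary.

```python
from collections import Counter

corpus = [
    "<S> I am Sam </S>",
    "<S> Sam I am </S>",
    "<S> I do not like green eggs and ham </S>",
]

bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

# Assumption: the vocabulary is every distinct token, markers included.
vocab = sorted({tok for sentence in corpus for tok in sentence.split()})
V = len(vocab)

def p_add_k(word, prev, k=1.0):
    """Add-k smoothed bigram probability; k=1 gives Laplace smoothing."""
    return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * V)

print(p_add_k("Sam", "am"))         # seen bigram, lower than the unsmoothed 1/2
print(p_add_k("ham", "am"))         # unseen bigram, now nonzero
print(p_add_k("ham", "am", k=0.5))  # smaller k moves less mass to unseen bigrams
```

With only two observations of the context `am` and 12 vocabulary entries, add-one smoothing drops the seen bigram from 1/2 to 2/14, which illustrates how strongly it distorts sparse counts.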

#### Good-Turing Estimator





#### Backoff and Interpolation
Rather than using a constant frequency for all missing n-grams, we can instead make a qualified estimate by using the probability of an (n-1)-gram.

Example: estimating a trigram probability from the bigram:
$$P(w_n | w_{n-2} w_{n-1}) \approx P(w_n | w_{n-1})$$

Similarly, we can estimate bigrams via unigrams (word counts):
$$P(w_n |w_{n-1}) \approx P(w_n)$$

__Backoff__

Only use the n-gram if the evidence is sufficient, i.e. its count is nonzero
* Otherwise back off to the next lower order: trigram, bigram, unigram...

__Interpolation__

Mix the probability estimates from all lower-order n-gram estimators, weighting each estimate:

$$\hat{P}(w_n | w_{n-2} w_{n-1}) = \lambda_1 P(w_n | w_{n-2} w_{n-1}) + \lambda_2 P(w_n | w_{n-1}) + \lambda_3 P(w_n) $$

$$\sum_i \lambda_i = 1, \quad \lambda_i \geq 0$$
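
The interpolation formula can be sketched on the "I am Sam" toy corpus as follows. The $\lambda$ values are hypothetical; in practice they are usually tuned on held-out data. The helper names `mle` and `p_interp` are assumptions for this example.

```python
from collections import Counter

corpus = [
    "<S> I am Sam </S>",
    "<S> Sam I am </S>",
    "<S> I do not like green eggs and ham </S>",
]

uni, bi, tri = Counter(), Counter(), Counter()
for sentence in corpus:
    t = sentence.split()
    uni.update(t)
    bi.update(zip(t, t[1:]))
    tri.update(zip(t, t[1:], t[2:]))
total = sum(uni.values())

def mle(count, context_count):
    """Relative frequency; 0 if the context was never seen."""
    return count / context_count if context_count else 0.0

def p_interp(w, u, v, lambdas=(0.6, 0.3, 0.1)):
    """Interpolated estimate of P(w | u v); lambdas are hypothetical weights."""
    l1, l2, l3 = lambdas
    p_tri = mle(tri[(u, v, w)], bi[(u, v)])
    p_bi = mle(bi[(v, w)], uni[v])
    p_uni = uni[w] / total
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

print(p_interp("Sam", "I", "am"))  # trigram "I am Sam" was seen
print(p_interp("ham", "I", "am"))  # unseen trigram and bigram, unigram keeps it nonzero
```

Because each component distribution sums to 1 for a seen context and the weights sum to 1, the interpolated estimate is again a probability distribution.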


#### Backoff with Discounting (Katz Backoff)
Discount the higher-order n-grams in order to transfer probability mass to the lower-order n-grams. This is called __Katz Backoff__, with the discounted probability denoted $P^*$.

Without discounting, replacing a zero-probability n-gram with a nonzero-probability lower-order n-gram would make the total probability mass exceed 1.

Concretely: use the discounted probability $P^*$ if the n-gram has been seen before (i.e. nonzero count); otherwise, recursively back off to the lower-order n-gram probability.
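
A rough sketch of backoff with discounting for bigrams, on the "I am Sam" toy corpus. To keep it short it uses a fixed absolute discount `D` rather than the Good-Turing discounts that Katz backoff is normally defined with, so it illustrates the mechanism and is not the exact Katz estimator. `D = 0.5` and all helper names are assumptions.

```python
from collections import Counter

corpus = [
    "<S> I am Sam </S>",
    "<S> Sam I am </S>",
    "<S> I do not like green eggs and ham </S>",
]

uni, bi = Counter(), Counter()
for sentence in corpus:
    t = sentence.split()
    uni.update(t)
    bi.update(zip(t, t[1:]))
total = sum(uni.values())
vocab = sorted(uni)

# How often each word occurs as a bigram context.
context = Counter()
for (prev, _), count in bi.items():
    context[prev] += count

D = 0.5  # hypothetical absolute discount (stand-in for Good-Turing)

def p_unigram(w):
    return uni[w] / total

def p_star(w, prev):
    """Discounted estimate P*(w | prev) for a seen bigram."""
    return (bi[(prev, w)] - D) / context[prev]

def alpha(prev):
    """Backoff weight: spreads the discounted mass over unseen continuations.
    Assumes prev was seen as a context and has at least one unseen continuation."""
    seen = [w for w in vocab if bi[(prev, w)] > 0]
    left_over = 1.0 - sum(p_star(w, prev) for w in seen)
    unseen_unigram_mass = 1.0 - sum(p_unigram(w) for w in seen)
    return left_over / unseen_unigram_mass

def p_backoff(w, prev):
    if bi[(prev, w)] > 0:
        return p_star(w, prev)            # seen: use discounted bigram
    return alpha(prev) * p_unigram(w)     # unseen: back off to unigram

print(p_backoff("Sam", "am"))  # seen bigram, discounted
print(p_backoff("I", "am"))    # unseen bigram, backed off
print(sum(p_backoff(w, "am") for w in vocab))
```

The last line should print a value of 1 up to rounding, showing that the discount removes exactly as much mass from seen bigrams as the backoff hands to unseen ones.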

## Neural Models
Vocab $V$ of size $m$, embedding dimension of $d$

__word embeddings (vector encodings)__ 

Indicator/one-hot encoding of a word: $w \in \{0,1\}^m$

Word embedding: $x: w \rightarrow x_w \quad x_w \in R^d$

__Sequence embedding__

Concatenate word embedded vectors $s = [x_{w_1} ... x_{w_n}]$ such that $s \in R ^{nd}$

__Conv layers__

Window size of the filter, e.g. $w=3$

For a single channel $f_j : R^{3d} \rightarrow R$
* Apply filter to input
* Parameter sharing for filters
* Apply activation function in the end

For $k$ independent channels
* Use padding
* Compute $n\times k$ numbers

__Pooling layers__

No parameters!

Goal: Reduce each channel output to a single number: $R^{n\times k} \rightarrow R^k$
* Typically max-over-time pooling
* Condense history into vector $h \rightarrow z_h \in R^k$

__Softmax__
Convert to word-probabilities

Each word having an associated weight-vector $\beta_w \in R^k$

$$p(w |h) = \frac{\exp(\langle \beta_w, z_h \rangle)}{\sum_{w'} \exp(\langle \beta_{w'}, z_h \rangle)}$$
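
Putting the pieces together, the forward pass from a word history to next-word probabilities might look like the NumPy sketch below. All sizes, the random untrained weights, the `tanh` activation, and the function name `next_word_probs` are assumptions made for illustration. No training is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: vocab m, embedding dim d, channels k, filter window 3.
m, d, k, window = 10, 4, 5, 3

E = rng.normal(size=(m, d))           # embedding matrix, row w is x_w
F = rng.normal(size=(k, window * d))  # one filter per channel, f_j: R^{3d} -> R
b = rng.normal(size=k)                # filter biases (assumption)
beta = rng.normal(size=(m, k))        # softmax weight vectors beta_w

def next_word_probs(history):
    """history: list of word ids. Returns p(w | h) over the vocabulary."""
    n = len(history)
    x = E[history]                                   # (n, d) word embeddings
    pad = np.zeros((window // 2, d))
    x = np.vstack([pad, x, pad])                     # zero padding keeps n positions
    # Each row is one concatenated window of 3 embeddings, shape (n, 3d).
    windows = np.stack([x[i:i + window].ravel() for i in range(n)])
    conv = np.tanh(windows @ F.T + b)                # (n, k), filters shared over positions
    z = conv.max(axis=0)                             # (k,), max-over-time pooling
    scores = beta @ z                                # <beta_w, z_h> for every word w
    scores = scores - scores.max()                   # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()             # softmax

probs = next_word_probs([1, 4, 2, 7])
print(probs.shape, probs.sum())
```

The pooling step is what makes the output size independent of the history length $n$: any history is condensed into a vector $z_h \in R^k$ before the softmax.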