# Perplexity and Information Theory
Ian Tenney, September 8, 2016

Consider our language modeling task, where the goal is to predict the next word in a sentence:
```
I have a pet ____
```
We want to model, for all possible words $w$:
$$ P(w_i = w | w_{i-1}, w_{i-2}, ..., w_0) $$

Of course, we can't predict with certainty what word should go here, and there are plenty of valid options. If we knew the true distribution, we could say precisely what the probabilities should be - but we don't, so we'll have to estimate it from data.

Suppose our corpus is ten sentences:

```
I have a pet dog
I have a pet dog
I have a pet dog
I have a pet dog
I have a pet cat
I have a pet cat
I have a pet cat
I have a pet cat
I have a pet gecko
I have a pet rock
```

We can get our maximum likelihood estimate from counting words. We'll use $Q$ here for our model distributions; you'll see why soon:
$$ Q(\text{dog}\ |\ \text{I have a pet}) = 4/10 = 0.4$$
$$ Q(\text{cat}\ |\ \text{I have a pet}) = 4/10 = 0.4$$
$$ Q(\text{gecko}\ |\ \text{I have a pet}) = 1/10 = 0.1$$
$$ Q(\text{rock}\ |\ \text{I have a pet}) = 1/10 = 0.1$$

In [1]:
import numpy as np
words = ['dog', 'cat', 'gecko', 'rock']
c = np.array([4,4,1,1])
q = c / np.sum(c, dtype=float)
print("  ".join("Q(%s)=%.02f" % wp for wp in zip(words, q)))

Q(dog)=0.40  Q(cat)=0.40  Q(gecko)=0.10  Q(rock)=0.10
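
The counts in the cell above are typed in by hand. As a minimal sketch of where they come from (Python 3, standard library only), the same estimate can be obtained by counting the word that follows the context in each sentence. The names `corpus`, `context`, and `q_mle` are illustrative and not part of the original notebook:

```python
from collections import Counter

# The ten-sentence toy corpus from above.
corpus = (["I have a pet dog"] * 4 + ["I have a pet cat"] * 4
          + ["I have a pet gecko", "I have a pet rock"])

context = "I have a pet"  # the fixed context we condition on

# Count the word that immediately follows the context in each sentence.
next_word_counts = Counter(
    s[len(context):].split()[0] for s in corpus if s.startswith(context + " ")
)
total = sum(next_word_counts.values())

# Maximum likelihood estimate: relative frequency of each next word.
q_mle = {w: n / total for w, n in next_word_counts.items()}
for w, prob in q_mle.items():
    print("Q(%s)=%.02f" % (w, prob))
```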


## Entropy

Entropy is a measure of information. When we're dealing with random variables, entropy tells us how much information is "contained" in an instance of that variable.

For example, for a variable that's always equal to 1, it doesn't take any information at all to specify the value. So, entropy is zero bits.

For a variable that's equal to 0 or 1 with equal probability, we need one bit. For one that can be in `{0,1,2,3}` with equal probability, we need two bits to specify the value: `00`, `01`, `10`, or `11`.
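
As a quick check of these three cases (a sketch in Python 3; the helper name `bits_for_uniform` is made up for illustration), the number of bits needed for $k$ equally likely values is $\log_2 k$:

```python
import math

def bits_for_uniform(k):
    """Bits needed to specify one of k equally likely values."""
    return math.log2(k)

# k=1: a constant; k=2: a fair coin; k=4: uniform over {0,1,2,3}
for k in (1, 2, 4):
    print("k=%d -> %.1f bits" % (k, bits_for_uniform(k)))
```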

What if we get a skewed distribution? It turns out we can compress things a bit. Let's try a [Huffman code](https://en.wikipedia.org/wiki/Huffman_coding): we'll assign shorter codes to the more common entries, and longer ones to things we see less often:  
`"dog": 0`  
`"cat": 10`  
`"gecko": 110`  
`"rock": 111`  
Now on average, we need:  

$$ E[\text{bits}] = Q(\text{dog}) \cdot 1 + Q(\text{cat}) \cdot 2 + Q(\text{gecko}) \cdot 3 + Q(\text{rock}) \cdot 3 = 1.8\ \text{bits} $$

So we didn't need two full bits after all! Because some elements are more frequent, there's less information we need to store overall.
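
The arithmetic above can be reproduced with a few lines of Python 3. This is only a sketch: the dictionaries `code` and `q_dist` restate the code table and model probabilities from the text, and `is_prefix_free` is a hypothetical helper that checks no codeword is a prefix of another, which is what lets us decode a bit stream unambiguously:

```python
# Code table and model distribution from the text.
code = {"dog": "0", "cat": "10", "gecko": "110", "rock": "111"}
q_dist = {"dog": 0.4, "cat": 0.4, "gecko": 0.1, "rock": 0.1}

# Expected code length: probability-weighted average of codeword lengths.
expected_bits = sum(q_dist[w] * len(code[w]) for w in code)
print("E[bits] = %.2f" % expected_bits)

def is_prefix_free(codewords):
    """True if no codeword is a prefix of a different codeword."""
    return not any(a != b and b.startswith(a)
                   for a in codewords for b in codewords)

print("prefix-free:", is_prefix_free(list(code.values())))
```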

We can formalize this with the notion of **entropy**. According to the [Shannon source coding theorem](https://en.wikipedia.org/wiki/Shannon%27s_source_coding_theorem), the optimal number of bits for each symbol $x$ is $-\log_2 Q(x)$. Suppose we had that. Then the expected number of bits would be:

$$ E[\text{bits}] = H(Q) = -\sum_x Q(x) \log_2 Q(x) $$

where for our example $x \in \{\text{dog, cat, gecko, rock}\}$. Let's calculate it:

In [2]:
# Use q for this, as it's a predicted distribution
sum(-q * np.log2(q))

1.7219280948873623

Our Huffman code gets close to optimal. In this case, it's just limited by discretization - we have to use a whole number of bits, so we might need to round up from the optimum.
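
To see the discretization effect symbol by symbol, here is a small sketch (Python 3; variable names are illustrative) comparing the ideal length $-\log_2 Q(x)$ with the whole-bit length our code actually uses:

```python
import math

q_dist = {"dog": 0.4, "cat": 0.4, "gecko": 0.1, "rock": 0.1}
huffman_len = {"dog": 1, "cat": 2, "gecko": 3, "rock": 3}  # lengths of the codes above

for w in q_dist:
    ideal = -math.log2(q_dist[w])  # fractional "optimal" length
    print("%-5s ideal=%.3f bits  Huffman=%d bits" % (w, ideal, huffman_len[w]))

entropy = sum(-prob * math.log2(prob) for prob in q_dist.values())
expected_huffman = sum(q_dist[w] * huffman_len[w] for w in q_dist)
overhead = expected_huffman - entropy  # cost of rounding to whole bits
print("overhead = %.3f bits per symbol" % overhead)
```

Some symbols end up with a shorter code than the ideal length and others with a longer one; the overhead is the net effect in expectation.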

## Cross-Entropy and KL Divergence

Suppose we have our language model, trained to predict the above distribution. Of course, the corpus was just a sample from some underlying true distribution. Suppose we know that of all pet owners, 40% have dogs, 50% have cats, and only 5% each have geckos or pet rocks.

Now we can ask, how well are we approximating our true distribution with our model? More precisely: suppose we come up with a code optimized for our model distribution $Q$, like the one above. How many extra bits will we need if we use this code on the true distribution $P$?

Using our *optimal* code from $Q$ on samples from $P$, the expected number of bits per symbol is the **cross-entropy** between $P$ and $Q$:

$$ CE(P,Q) = -\sum_x P(x) \log_2 Q(x) $$

The number of bits we would need with an optimal code from $P$, on $P$, would be the **entropy** of P:

$$ H(P) = -\sum_x P(x) \log_2 P(x) $$

Now the difference - how much "extra" information we would need, or equivalently, the information lost when we approximate $P$ by $Q$ - is the **Kullback-Leibler (KL) divergence:**

$$ D_{KL}(P||Q) = CE(P,Q) - H(P) = -\sum_x P(x) \log_2 Q(x) + \sum_x P(x) \log_2 P(x) $$
$$ D_{KL}(P||Q) = \sum_x P(x) \log_2 \frac{P(x)}{Q(x)} $$

Note that KL divergence is not symmetric; it assumes one "true" distribution ($P$) and an approximation to it ($Q$).
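
The asymmetry is easy to check numerically. The sketch below (Python 3; `kl_divergence`, `p_true`, and `q_model` are illustrative names) uses the pet-owner distribution $P$ and the count-based model $Q$ from this section and swaps their roles:

```python
import math

# Order: dog, cat, gecko, rock
p_true = [0.4, 0.5, 0.05, 0.05]   # "true" distribution P
q_model = [0.4, 0.4, 0.1, 0.1]    # model distribution Q

def kl_divergence(a, b):
    """D_KL(a || b) in bits; assumes b[i] > 0 wherever a[i] > 0."""
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

d_pq = kl_divergence(p_true, q_model)
d_qp = kl_divergence(q_model, p_true)
print("D_KL(P||Q) = %.5f" % d_pq)
print("D_KL(Q||P) = %.5f" % d_qp)
```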

Also observe that the difference between the cross-entropy and the KL divergence only depends on $P$. So if $P$ is a fixed "true" distribution that we want to approximate, then it's equivalent to optimize either the KL divergence or the cross-entropy. The latter has a simpler form, so that's what we use in practice.
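
A small sketch of this observation (Python 3; the uniform model is a hypothetical second candidate added only for comparison): whichever $Q$ we plug in, the gap between cross-entropy and KL divergence comes out the same, namely $H(P)$.

```python
import math

def cross_entropy(p, q):
    """CE(p, q) in bits."""
    return sum(-pi * math.log2(qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """D_KL(p || q) in bits, computed directly from the log-ratio form."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

p_true = [0.4, 0.5, 0.05, 0.05]
candidates = {
    "count-based Q": [0.4, 0.4, 0.1, 0.1],
    "uniform Q (hypothetical)": [0.25, 0.25, 0.25, 0.25],
}

gaps = {}
for name, q_dist in candidates.items():
    ce = cross_entropy(p_true, q_dist)
    dkl = kl_divergence(p_true, q_dist)
    gaps[name] = ce - dkl  # should not depend on Q
    print("%-25s CE=%.5f  D_KL=%.5f  CE-D_KL=%.5f" % (name, ce, dkl, gaps[name]))
```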

In [3]:
p = np.array([0.4, 0.5, 0.05, 0.05])
ce_pq = sum(-p * np.log2(q))
h_p = sum(-p * np.log2(p))
dkl_pq = ce_pq - h_p
print("   CE(P,Q) = %.05f" % ce_pq)
print("      H(P) = %.05f" % h_p)
print("D_KL(P||Q) = %.05f" % dkl_pq)

   CE(P,Q) = 1.52193
      H(P) = 1.46096
D_KL(P||Q) = 0.06096


The KL divergence isn't very high in this toy example, because our sample wasn't that far off from the true distribution. More interestingly though, note that the cross-entropy loss is still fairly high. What if we had a perfect model, and set $Q = P$?

In [4]:
ce_pp = sum(-p * np.log2(p))
print("CE(P,P) = %.05f" % ce_pp)
print("D_KL(P||P) = %.05f" % (ce_pp - h_p))

CE(P,P) = 1.46096
D_KL(P||P) = 0.00000


KL divergence is 0, but we still get a loss of 1.46 - not great! But we can't possibly do better than the true distribution; there's just 1.46 bits of inherent uncertainty in the next word here. This is exactly the same situation you might have seen with noisy training labels, where we can't do better than the [Bayes error rate](https://en.wikipedia.org/wiki/Bayes_error_rate).
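
One way to convince yourself that no model beats the true distribution is a brute-force search. The sketch below (Python 3; the grid resolution of 1/20 is an arbitrary choice for illustration) tries every model $Q$ whose probabilities are positive multiples of 0.05 and keeps the one with the lowest cross-entropy against $P$:

```python
import itertools
import math

p_true = (0.4, 0.5, 0.05, 0.05)

def cross_entropy(p, q):
    return sum(-pi * math.log2(qi) for pi, qi in zip(p, q))

steps = 20  # grid resolution: probabilities are multiples of 1/20
best_q, best_ce = None, float("inf")
for a, b, c in itertools.product(range(1, steps), repeat=3):
    d = steps - a - b - c
    if d < 1:
        continue  # the fourth probability must stay positive
    q_dist = (a / steps, b / steps, c / steps, d / steps)
    ce = cross_entropy(p_true, q_dist)
    if ce < best_ce:
        best_q, best_ce = q_dist, ce

print("best Q on the grid:", best_q)
print("its cross-entropy:  %.5f" % best_ce)
```

The winner on this grid is $Q = P$ itself, with a cross-entropy equal to $H(P)$.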

## Perplexity

For machine learning, it's not always useful to think about "encoding". Instead, we're just interested in how well our predictions match up with *data*. We don't get to know the true value of $P(x)$ - instead, we can evaluate on a sample from it. 

Let $x_i : i = 1,...,N$ be our samples. The cross-entropy is an expectation over the true distribution, so we can approximate it by averaging over the samples:

$$ CE(P,Q) = -\sum_x P(x) \log_2 Q(x) = E_{P}\left[ -\log_2 Q(x) \right] \approx \frac{1}{N} \sum_i -\log_2 Q(x_i) $$

This is our usual machine learning objective:

In [5]:
test_set = [
"I have a pet dog",
"I have a pet cat",
"I have a pet dog",
"I have a pet dog",
"I have a pet cat",
"I have a pet cat",
"I have a pet gecko",
"I have a pet cat",
"I have a pet gecko",
"I have a pet rock",
]

word_to_q = dict(zip(words, q))
q_xi = np.array([word_to_q[s.split()[-1]] for s in test_set])
est_ce = np.mean(-np.log2(q_xi))
print("Estimated CE(P,Q) = %.05f" % est_ce)

Estimated CE(P,Q) = 1.92193
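
With only ten sentences, this estimate (1.92) is well above the exact $CE(P,Q)$ of 1.52 computed earlier. As a sketch of how the sample average behaves with more data, assuming the samples really are drawn from the pet-owner distribution $P$ (Python 3; the seed and sample sizes are arbitrary):

```python
import math
import random

words = ["dog", "cat", "gecko", "rock"]
p_true = [0.4, 0.5, 0.05, 0.05]
q_model = dict(zip(words, [0.4, 0.4, 0.1, 0.1]))

# Exact cross-entropy, for reference.
exact_ce = sum(-prob * math.log2(q_model[w]) for w, prob in zip(words, p_true))

rng = random.Random(0)  # fixed seed so the run is repeatable

def estimate_ce(n):
    """Average -log2 Q(x_i) over n samples x_i drawn from P."""
    samples = rng.choices(words, weights=p_true, k=n)
    return sum(-math.log2(q_model[x]) for x in samples) / n

print("exact CE(P,Q) = %.5f" % exact_ce)
for n in (10, 1000, 100000):
    print("N=%6d  estimate = %.5f" % (n, estimate_ce(n)))
```

Small samples can land far from the exact value; the estimate tightens as $N$ grows.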


For language modeling purposes, we tend to prefer an equivalent, but easier to interpret measure: **perplexity**. For the cross-entropy loss, we took the *arithmetic mean* of the *negative log probabilities*. We can instead think of the *inverse probabilities*, and take the *geometric* mean:

$$ \text{Perplexity} = \left( \prod_i \frac{1}{Q(x_i)} \right)^{1/N} $$

You can think of this as averaging how confused the model is, over all the words (hence the name, "perplexity"). The more probability the model puts on the right word, the lower the perplexity - down to, at best, 1. As a mental model, you can think of this as how many words the model thinks are plausible at each point: suppose the model suggests $k$ words - including the correct one - and assigns each $Q(x) = 1/k$. Then the perplexity would be $1/Q(x) = k$.
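
The "$k$ plausible words" picture can be checked directly from the definition. A minimal sketch (Python 3; `perplexity` is an illustrative helper, and it multiplies probabilities directly, which is only safe for short sequences like this one):

```python
def perplexity(probs):
    """Geometric mean of the inverse probabilities Q(x_i)."""
    inv_product = 1.0
    for prob in probs:
        inv_product *= 1.0 / prob
    return inv_product ** (1.0 / len(probs))

k = 5
uniform_guesses = [1.0 / k] * 20   # model spreads mass evenly over k words
certain_guesses = [1.0] * 20       # model is always sure, and always right

print("uniform over %d words: %.3f" % (k, perplexity(uniform_guesses)))
print("always certain:       %.3f" % perplexity(certain_guesses))
```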

Just as cross-entropy won't go below $H(P)$, perplexity won't go below $2^{H(P)}$ (about 2.75 for the pet-owner distribution above) - even the best model has to spread its probability over several plausible candidates.

In fact, perplexity is just the exponentiated cross-entropy loss:

$$ \text{Perplexity} = \left( \prod_i \frac{1}{Q(x_i)} \right)^{1/N} = \left( \prod_i 2^{- \log_2 Q(x_i)} \right)^{1/N} = 2^{\left(\frac{1}{N} \sum_i -\log_2 Q(x_i)\right)} \approx 2^{CE(P,Q)}$$

If we multiplied all the probabilities together directly, the product would quickly underflow for any text of realistic length, so it's much more common to compute perplexity in log-space and exponentiate at the end:

In [6]:
est_ce = np.mean(-np.log2(q_xi))
print("Perplexity: %.02f" % 2**est_ce)

Perplexity: 3.79
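
To make the underflow concrete, here is a sketch with a made-up sequence of 400 words, each assigned probability 0.1 by the model (Python 3; the sequence length and probability are arbitrary). The direct product collapses to zero in double precision, while the log-space computation recovers the expected perplexity of 10:

```python
import math

probs = [0.1] * 400  # hypothetical Q(x_i) for a 400-word text

# Direct product of probabilities: 10^-400 is far below the float64 range.
product = 1.0
for prob in probs:
    product *= prob
print("direct product:", product)  # prints 0.0

# Log-space: average the negative log-probabilities, exponentiate once.
avg_neg_log2 = sum(-math.log2(prob) for prob in probs) / len(probs)
log_space_perplexity = 2 ** avg_neg_log2
print("log-space perplexity: %.4f" % log_space_perplexity)
```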


### One last note: Conditional probabilities

Just like when we average our loss over training examples with different features, we average perplexity over predicted words with different contexts. Just replace all samples $x_i$ with (sample, context) pairs $x_i, c_i$, where $c_i = (w_{i-1}, w_{i-2}, ...)$, and replace $P(x_i)$ and $Q(x_i)$ with the conditional probabilities $P(x_i | c_i)$ and $Q(x_i | c_i)$, and all the above results still hold.
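
As a sketch of what this looks like in code (Python 3), suppose the model is stored as a table of conditional distributions keyed by context. Everything here is hypothetical: the table `q_cond`, its probabilities, the choice to truncate each context to the two previous words, and the evaluation pairs are all made up for illustration.

```python
import math

# Hypothetical conditional model Q(word | context); contexts are the two
# previous words, an assumption made only to keep the table small.
q_cond = {
    ("have", "a"): {"pet": 0.5, "car": 0.5},
    ("a", "pet"): {"dog": 0.4, "cat": 0.4, "gecko": 0.1, "rock": 0.1},
}

# Evaluation data as (context, word) pairs (c_i, x_i).
pairs = [
    (("have", "a"), "pet"),
    (("a", "pet"), "dog"),
    (("have", "a"), "pet"),
    (("a", "pet"), "gecko"),
]

# Same recipe as before, with Q(x_i) replaced by Q(x_i | c_i).
neg_log_probs = [-math.log2(q_cond[c][x]) for c, x in pairs]
est_ce_cond = sum(neg_log_probs) / len(neg_log_probs)
perplexity_cond = 2 ** est_ce_cond

print("Estimated CE = %.5f" % est_ce_cond)
print("Perplexity   = %.2f" % perplexity_cond)
```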