# Perplexity

In [1]:
import numpy as np

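def calculate_perplexity(probabilities):
    """Return the base-2 perplexity of a sequence of per-word probabilities."""
    log_probs = np.log2(probabilities)
    # Average log-probability per word
    avg_log_prob = np.mean(log_probs)
    # Perplexity is 2 raised to the negative average log-probability
    perplexity = 2 ** (-avg_log_prob)
    return perplexity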

In [2]:
true_sentence = "The quick brown fox jumps over the lazy dog"
sentence_1 = "The fast black cat jumps over the lazy dog"

s1_word_proba = [0.99, 0.85, 0.89, 0.99, 0.99, 0.99, 0.99, 0.99]
perplexity = calculate_perplexity(s1_word_proba)
print("Perplexity sentence 1:", perplexity)

Perplexity sentence 1: 1.04333190315947


In [3]:
sentence_2 = "The bold orange car drove by the lazy dog"

s2_word_proba = [0.99, 0.65, 0.13, 0.05, 0.21, 0.99, 0.99, 0.99]
perplexity = calculate_perplexity(s2_word_proba)
print("Perplexity sentence 2:", perplexity)

Perplexity sentence 2: 2.419227171949897
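
One way to read these two numbers is as an effective number of equally likely choices per word: the closer to 1, the more certain the model. The sketch below illustrates this with uniform probabilities. It redefines `calculate_perplexity` only so that the block runs on its own, and the values of `k` are arbitrary choices made for illustration.

```python
import numpy as np

def calculate_perplexity(probabilities):
    # Same definition as above: 2 ** (negative mean log2-probability)
    log_probs = np.log2(probabilities)
    return 2 ** (-np.mean(log_probs))

# A model that spreads probability uniformly over k candidates at every
# position has a perplexity of (approximately, up to rounding) k,
# regardless of the sentence length.
for k in (2, 4, 10):
    uniform = [1.0 / k] * 8
    print(k, calculate_perplexity(uniform))

# A model that is certain about every word reaches the minimum perplexity, 1.
certain = calculate_perplexity([1.0] * 8)
print(certain)
```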


## Relationship to Cross Entropy

In [4]:
def cross_entropy(p, q):
    # Clip q to avoid log2(0), which is undefined
    q = np.clip(q, 1e-10, 1.0)
    # Cross entropy in bits: -sum_i p_i * log2(q_i)
    H = -np.sum(p * np.log2(q))

    return H

n = len(s1_word_proba)

In [5]:
cross_entropy(np.ones(n), s1_word_proba)

0.48958543061604043

In [6]:
2 ** (cross_entropy(np.ones(n), s1_word_proba) / n)

1.04333190315947

In [7]:
calculate_perplexity(s1_word_proba)

1.04333190315947
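
The agreement is expected. Writing $p_i$ for the probability assigned to the $i$-th word of an $N$-word sequence, and weighting every observed word by 1 (the `np.ones(n)` argument above), the cross entropy $H$ computed by `cross_entropy` and the perplexity computed by `calculate_perplexity` are related as follows. The symbols $p_i$, $N$, and $\mathrm{PPL}$ are notation introduced here, not taken from the notebook:

$$
H = -\sum_{i=1}^{N} 1 \cdot \log_2 p_i,
\qquad
\mathrm{PPL} = 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 p_i} = 2^{H / N}
$$

With $N = 8$ and $H \approx 0.4896$ bits, this gives $2^{0.0612} \approx 1.0433$, matching both cells above.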

## Perplexity with TorchMetrics

In [8]:
from torchmetrics.text import Perplexity

TorchMetrics' `Perplexity` metric takes a predictions tensor (`preds`) and a `target` tensor.

It expects `preds` to have the shape `[batch_size, seq_len, vocab_size]` and `target` to have the shape `[batch_size, seq_len]`.

If we are only looking at one sentence, the batch size is 1.


In [9]:
sentence_1 = "The fast black cat jumps over the lazy dog"

Now, in this notebook, we haven't constructed a vocabulary, which is the set of all unique words in the training set. For simplicity, let's assume the vocabulary contains the following words:

In [10]:
vocab = {
    0: "The",
    1: "quick",
    2: "brown",
    3: "fox",
    4: "jumps",
    5: "over",
    6: "the",
    7: "lazy",
    8: "dog",
    9: "fast",
    10: "black",
    11: "cat",
}

Since the vocabulary has 12 words, the model's output at each position is a 12-dimensional probability vector. So, for a sentence of 9 words ("The fast black cat jumps over the lazy dog"), we have a tensor of shape 1x9x12.
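
To make the shapes concrete, the sketch below maps each word of the sentence to its vocabulary index and reports the resulting dimensions. The helper name `word_to_index` and the whitespace tokenization are introduced here for illustration and are not part of the original notebook.

```python
vocab = {
    0: "The", 1: "quick", 2: "brown", 3: "fox", 4: "jumps", 5: "over",
    6: "the", 7: "lazy", 8: "dog", 9: "fast", 10: "black", 11: "cat",
}
sentence_1 = "The fast black cat jumps over the lazy dog"

# Invert the index -> word mapping (hypothetical helper for this example)
word_to_index = {word: index for index, word in vocab.items()}

# One vocabulary index per word; splitting on whitespace is a simplification
target_indices = [word_to_index[word] for word in sentence_1.split()]
print(target_indices)  # [0, 9, 10, 11, 4, 5, 6, 7, 8]

# [batch_size, seq_len, vocab_size] for a single sentence
preds_shape = (1, len(target_indices), len(vocab))
print(preds_shape)  # (1, 9, 12)
```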

Also, previously, we considered the word probabilities

```python
s1_word_proba = [0.99, 0.85, 0.89, 0.99, 0.99, 0.99, 0.99, 0.99]
```

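In the representation below, each row places that probability at the vocabulary index of the word at that position. Note that the list above has eight entries while the sentence has nine words; the tensor below has nine rows, one per word, so it contains one more 0.99 entry than the list.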

```python

vocab = {
    0: "The",
    1: "quick",
    2: "brown",
    3: "fox",
    4: "jumps",
    5: "over",
    6: "the",
    7: "lazy",
    8: "dog",
    9: "fast",
    10: "black",
    11: "cat",
}
```

In [11]:
import torch

model_outputs = torch.tensor([[
    [0.99, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.01], # The, index 0
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.02, 0.01, 0.00, 0.85, 0.00, 0.00], # fast, index 9
    [0.01, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.89, 0.0], # black, 10
    [0.0, 0.0, 0.0, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.99], # cat, 11
    [0.0, 0.01, 0.0, 0.0, 0.99, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], # jumps, 4
    [0.0, 0.0, 0.01, 0.0, 0.0, 0.99, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], # over, 5
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.99, 0.0, 0.0, 0.01, 0.0, 0.0], # the, 6
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.99, 0.01, 0.0, 0.0, 0.0], # lazy, 7
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.99, 0.0, 0.0, 0.01], # dog, 8
]])

The vectors above stand in for the probability vectors that an LLM would return, one vector per word. The probabilities in each row should sum to one.

For example, looking at the first row

```
[0.99, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.01], # The, index 0
```

this means the model assigns a probability of 0.99 to the first word in the vocabulary ("The"). The probabilities for the other words are 0, except for the last word ("cat"), which is 0.01 in this case.

**Note that these probabilities are arbitrarily assigned by me. In an application, they would be returned by an actual LLM, which we omit here for simplicity.**
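
A quick sanity check for hand-written rows like these is to sum each one. The sketch below copies the first two rows of `model_outputs` into NumPy, an assumption made only to keep the block self-contained (`model_outputs.sum(dim=-1)` would do the same on the tensor). It shows that the second row, as typed above, sums to 1.08 rather than 1.

```python
import numpy as np

rows = np.array([
    [0.99, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.01],  # The
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.02, 0.01, 0.0, 0.85, 0.0, 0.0],  # fast
])

# Sum over the vocabulary dimension: one total per row
row_sums = rows.sum(axis=-1)
print(row_sums)

# Flag rows that are not valid probability distributions
is_normalized = np.isclose(row_sums, 1.0)
print(is_normalized)
```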



Then, with the target vector containing the word indices, we can gather the probabilities corresponding to the target word indices:

In [12]:
targets = torch.tensor([[0, 9, 10, 11, 4, 5, 6, 7, 8]])

# Gather the probabilities
probabilities = torch.gather(model_outputs, 2, targets.unsqueeze(2))

print(probabilities)

tensor([[[0.9900],
         [0.8500],
         [0.8900],
         [0.9900],
         [0.9900],
         [0.9900],
         [0.9900],
         [0.9900],
         [0.9900]]])
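
With one probability per target token in hand, perplexity follows from the same formula as before. The sketch below is a NumPy restatement rather than part of the original notebook: it hard-codes the nine gathered values printed above and uses the natural logarithm with `exp`, which gives the same result as the base-2 version. There are nine tokens here, whereas `s1_word_proba` earlier listed eight values, so the result differs slightly from 1.0433.

```python
import numpy as np

# The nine probabilities gathered above, one per word of sentence_1
gathered = np.array([0.99, 0.85, 0.89, 0.99, 0.99, 0.99, 0.99, 0.99, 0.99])

# Perplexity = exp(mean negative log-likelihood)
perplexity_9 = np.exp(-np.mean(np.log(gathered)))

# The base of the logarithm cancels out, so the base-2 form agrees
perplexity_9_base2 = 2 ** (-np.mean(np.log2(gathered)))

print(perplexity_9)  # approximately 1.0396
```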


According to the [TorchMetrics perplexity documentation](https://torchmetrics.readthedocs.io/en/stable/text/perplexity.html), the input is a probability score,

> - `preds` (`torch.Tensor`): Probabilities assigned to each token in a sequence with shape `[batch_size, seq_len, vocab_size]`

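but the results are inflated when the probabilities are passed in directly. This is consistent with the metric applying a softmax to `preds` internally in this version, which would turn raw probabilities into a much flatter distribution. Passing log-probabilities instead gets us close to the earlier result. The match is not exact (1.0485 versus 1.0433) because the tensor covers nine tokens rather than eight and its second row sums to 1.08 rather than 1: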

In [13]:
import torchmetrics
from torchmetrics.text import Perplexity

print("torchmetrics version:", torchmetrics.__version__)

perp = Perplexity()
perp(torch.log(model_outputs), targets)

torchmetrics version: 0.11.4


tensor(1.0485)
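
The value 1.0485 can be traced by hand. The sketch below assumes, based on the behavior observed above rather than on anything stated in this notebook, that this version of the metric applies a softmax over the last dimension of `preds` before looking up the target probabilities. Under that assumption, passing `log(p)` yields `p / sum(p)` per row, so rows that already sum to one pass through unchanged, while the second row (which sums to 1.08) is rescaled. Only the target-token probability and the row sum matter, so the rows are not repeated in full.

```python
import numpy as np

# Probability of the target word in each of the nine rows of model_outputs
target_probs = np.array([0.99, 0.85, 0.89, 0.99, 0.99, 0.99, 0.99, 0.99, 0.99])
# Sum of each row as typed in model_outputs (the "fast" row sums to 1.08)
row_sums = np.array([1.0, 1.08, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

# softmax(log p)_i = p_i / sum_j p_j, i.e. a per-row renormalization
renormalized = target_probs / row_sums

# Mean negative log-likelihood over the nine tokens, then exponentiate
perplexity = np.exp(-np.mean(np.log(renormalized)))
print(round(float(perplexity), 4))  # 1.0485
```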

In [14]:
import numpy as np

def calculate_perplexity(probabilities):
    log_probs = np.log2(probabilities)
    avg_log_prob = np.mean(log_probs)
    perplexity = 2 ** (-avg_log_prob)
    return perplexity

true_sentence = "The quick brown fox jumps over the lazy dog"
sentence_1 = "The fast black cat jumps over the lazy dog"

s1_word_proba = [0.99, 0.85, 0.89, 0.99, 0.99, 0.99, 0.99, 0.99]
perplex = calculate_perplexity(s1_word_proba)
print("Perplexity sentence 1:", perplex)

Perplexity sentence 1: 1.04333190315947
