# Unit 3

## Calculating Perplexity in Language Models

### Introduction to Perplexity

Welcome back to the "Scoring LLM Outputs with Logprobs and Perplexity" course. In the previous lesson, you explored how to compare sentence likelihoods using log probabilities. In this lesson, you’ll learn about perplexity—a metric used to evaluate how well a language model predicts text.

Perplexity gives us a sense of how “surprised” a model is by a sequence. A lower value means the model finds the sentence more natural; a higher value suggests the model finds it awkward or unexpected.

By the end of this lesson, you'll understand what perplexity means, how it's related to log probabilities, and how to approximate it using OpenAI’s API.

### Understanding Perplexity

In traditional NLP, perplexity is computed as the exponential of the average negative log-likelihood of a token sequence:

$$\text{Perplexity} = e^{-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i)}$$

Let’s break this down:

  * $N$ is the total number of tokens in the sequence.
  * $P(w_i)$ is the probability that the model assigns to the $i$-th token, given the previous tokens.
  * $\log P(w_i)$ is the (natural) log probability of the $i$-th token.
  * The sum $\sum_{i=1}^{N}\log P(w_i)$ adds up the log probabilities for all tokens in the sequence.
  * Dividing by $N$ gives the average log probability per token.
  * Multiplying by $-1$ turns this into the average negative log probability (also called the average "surprisal").
  * Taking the exponential ($e^{x}$) converts this average surprisal out of log space, giving us the perplexity. Equivalently, perplexity is the reciprocal of the geometric mean of the token probabilities.

A lower perplexity means the model is less "surprised" by the sequence (it assigns higher probabilities to the tokens), while a higher perplexity means the model finds the sequence less predictable.
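The steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming you already have a list of natural-log token probabilities; the function name `sequence_perplexity` and the two-token example values are hypothetical and not part of the course code.

```python
import math

def sequence_perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities (illustrative helper)."""
    if not token_logprobs:
        raise ValueError("Need at least one token log probability")
    n = len(token_logprobs)
    # Average negative log probability, i.e. the average surprisal per token
    avg_surprisal = -sum(token_logprobs) / n
    # Exponentiate to leave log space
    return math.exp(avg_surprisal)

# Hypothetical two-token sequence whose tokens have probabilities 0.5 and 0.25
example_logprobs = [math.log(0.5), math.log(0.25)]
print(f"{sequence_perplexity(example_logprobs):.3f}")  # 2.828
```

The result, about 2.828, equals $1/\sqrt{0.5 \times 0.25}$, the reciprocal of the geometric mean of the two probabilities.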

### Example: Approximating Perplexity

We’ll compare two sentences, one fluent and one awkward. The code sends each sentence as a prompt, asks the model to generate a single next token, and uses that token's log probability to approximate perplexity.

```python
import math
from openai import OpenAI

client = OpenAI()

sentences = [
    "Cats sleep on the windowsill.",
    "Cats windowsill the on sleep."
]

for sentence in sentences:
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=sentence,
        max_tokens=1,
        logprobs=5
    )
    # Get the logprob of the generated token (first in the list)
    logprob = response.choices[0].logprobs.token_logprobs[0]
    perplexity = math.exp(-logprob)
    print(f"Sentence: {sentence}")
    print(f"Log probability: {logprob:.4f}")
    print(f"Approximate Perplexity: {perplexity:.2f}\n")
```

Example output:

```
Sentence: Cats sleep on the windowsill.
Log probability: -3.8244
Approximate Perplexity: 45.80

Sentence: Cats windowsill the on sleep.
Log probability: -5.7439
Approximate Perplexity: 312.27
```

Explanation of the output:

  * For "Cats sleep on the windowsill.", the log probability is higher (less negative), and the perplexity is much lower (45.80). This means the model finds this sentence more natural and likely.
  * For "Cats windowsill the on sleep.", the log probability is lower (more negative), and the perplexity is much higher (312.27). This means the model finds this sentence much less likely or more surprising.

The absolute values of perplexity may vary depending on the model and API version, but the key point is that the fluent sentence has a significantly lower perplexity than the awkward one. This demonstrates that perplexity can be used to compare the relative fluency or naturalness of different sentences according to the language model.

### Interpreting Results and Common Pitfalls

  * Lower perplexity means the sentence is more natural to the model.
  * Higher perplexity indicates the sentence is confusing or unlikely.
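  * We’re approximating perplexity with one generated token, not the full sequence. The score reflects how confidently the model continues the sentence rather than the likelihood of the sentence's own tokens, so results are directional, not absolute.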
  * Make sure `max_tokens` is set to 1 or more to ensure a generated token is returned.
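One way to guard against the last pitfall is to check that a log probability actually came back before exponentiating it. The sketch below is an assumption-laden illustration: `first_token_perplexity` is a hypothetical helper that works on a plain list of log probabilities, such as the `token_logprobs` list read in the example above.

```python
import math

def first_token_perplexity(token_logprobs):
    """Single-token perplexity proxy with a basic sanity check (hypothetical helper)."""
    # An empty list or a missing first entry means there is nothing to score
    if not token_logprobs or token_logprobs[0] is None:
        raise ValueError("No generated-token logprob found; check that max_tokens >= 1")
    return math.exp(-token_logprobs[0])

# Stand-in values instead of a live API response
print(first_token_perplexity([0.0]))  # 1.0
```

In the earlier example, this helper could be called as `first_token_perplexity(response.choices[0].logprobs.token_logprobs)`.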

### Summary and Next Steps

In this lesson, you learned how to approximate perplexity using log probabilities from OpenAI’s API. While not a full traditional measure, this technique provides a useful proxy for evaluating sentence fluency. In the next unit, you’ll compare multiple models using perplexity to evaluate their fluency on the same sentence.



## Implementing the Perplexity Formula

Now that you understand the concept of perplexity and have seen how it relates to log probabilities, let's focus on the core mathematical relationship between them. In this exercise, you'll create a function that calculates perplexity directly from a log probability value.

Your task is to implement the `calculate_perplexity` function so that it:

  * Takes a log probability value as input
  * Applies the formula: `perplexity = exp(-logprob)`
  * Includes error handling for invalid inputs
  * Returns the calculated perplexity

This exercise isolates the mathematical operation behind perplexity calculation, helping you build a solid foundation before we move on to comparing models using this metric in the next lesson.

```python
import math

def calculate_perplexity(log_probability):
    """
    Calculate perplexity from a single log probability value.
    
    Args:
        log_probability (float): The log probability value
        
    Returns:
        float: The calculated perplexity
        
    Raises:
        TypeError: If input is not a number
    """
    # TODO: Check if input is a valid number and raise TypeError with message
    # "Log probability must be a number" if it's not a number
    
    # TODO: Calculate perplexity using the formula: exp(-logprob)
    
    # TODO: Return the calculated perplexity value
    pass

# Test the function with sample values
if __name__ == "__main__":
    test_values = [
        -1.5,    # Should give perplexity of about 4.48
        -0.5,    # Should give perplexity of about 1.65
        -3.0,    # Should give perplexity of about 20.09
        0.0      # Should give perplexity of 1.0
    ]
    
    print("Testing perplexity calculation:")
    print("-" * 40)
    
    for logprob in test_values:
        try:
            ppl = calculate_perplexity(logprob)
            print(f"Log probability: {logprob:.4f} | Perplexity: {ppl:.2f}")
        except Exception as e:
            print(f"Error with input {logprob}: {e}")
    
    # Try with an invalid input
    try:
        calculate_perplexity("not a number")
    except Exception as e:
        print(f"\nExpected error with invalid input: {e}")
```
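If you want to check your work after attempting the exercise, here is one possible way to fill in the TODOs. It is a sketch rather than an official solution. Rejecting `bool` values is an extra assumption on my part (in Python, `bool` is a subclass of `int`); the exercise itself only asks for a `TypeError` on non-numeric input.

```python
import math
from numbers import Real

def calculate_perplexity(log_probability):
    """Calculate perplexity from a single log probability value."""
    # Assumption: treat True/False as invalid even though bool subclasses int
    if isinstance(log_probability, bool) or not isinstance(log_probability, Real):
        raise TypeError("Log probability must be a number")
    # perplexity = exp(-logprob)
    return math.exp(-log_probability)

print(f"{calculate_perplexity(-1.5):.2f}")  # 4.48
print(f"{calculate_perplexity(0.0):.2f}")   # 1.00
```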



## Applying Perplexity to Real Sentences

## Flexible Token Generation for Perplexity Analysis

## Error Handling for Perplexity Calculations
