# Unit 3

## Calculating Perplexity in Language Models

### Introduction to Perplexity

Welcome back to the "Scoring LLM Outputs with Logprobs and Perplexity" course. In the previous lesson, you explored how to compare sentence likelihoods using log probabilities. In this lesson, you’ll learn about perplexity—a metric used to evaluate how well a language model predicts text.

Perplexity gives us a sense of how “surprised” a model is by a sequence. A lower value means the model finds the sentence more natural; a higher value suggests the model finds it awkward or unexpected.

By the end of this lesson, you'll understand what perplexity means, how it's related to log probabilities, and how to approximate it using OpenAI’s API.

### Understanding Perplexity

In traditional NLP, perplexity is computed as the exponential of the average negative log-likelihood of a token sequence:

$$\text{Perplexity} = e^{-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i)}$$

Let’s break this down:

  * $N$ is the total number of tokens in the sequence.
  * $P(w_i)$ is the probability that the model assigns to the $i$-th token, given the previous tokens.
  * $\log P(w_i)$ is the log probability of the $i$-th token.
  * The sum $\sum_{i=1}^{N}\log P(w_i)$ adds up the log probabilities for all tokens in the sequence.
  * Dividing by $N$ gives the average log probability per token.
  * Multiplying by $-1$ turns this into the average negative log probability (also called the average "surprisal").
  * Taking the exponential ($e^{x}$) undoes the logarithm. The result is the inverse of the geometric mean of the token probabilities, which is the perplexity.

A lower perplexity means the model is less "surprised" by the sequence (it assigns higher probabilities to the tokens), while a higher perplexity means the model finds the sequence less predictable.
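To make the steps above concrete, here is a minimal sketch that walks through the formula on made-up values. The function name `sequence_perplexity` and the numbers in `token_logprobs` are illustrative assumptions, not output from a real model.

```python
import math

def sequence_perplexity(token_logprobs):
    """Perplexity = exp of the average negative log probability per token."""
    if not token_logprobs:
        raise ValueError("Need at least one token log probability")
    n = len(token_logprobs)
    total_logprob = sum(token_logprobs)     # sum of log P(w_i)
    avg_neg_logprob = -total_logprob / n    # average surprisal per token
    return math.exp(avg_neg_logprob)

# Hypothetical log probabilities for a 4-token sequence (made up for illustration)
token_logprobs = [-0.5, -1.0, -2.0, -0.5]
ppl = sequence_perplexity(token_logprobs)
print(f"Perplexity: {ppl:.4f}")

# Sanity check: a model that gives every token probability 1/4 has perplexity 4
uniform = [math.log(1 / 4)] * 10
uniform_ppl = sequence_perplexity(uniform)
print(f"Uniform over 4 choices: {uniform_ppl:.4f}")
```

In the first case the log probabilities sum to -4.0, the average surprisal is 1.0, and the perplexity is $e \approx 2.72$. The uniform case illustrates a common reading of the metric: a perplexity of $k$ means the model is about as uncertain as if it were choosing evenly among $k$ options at each step.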

### Example: Approximating Perplexity

We’ll compare two sentences, one fluent and one awkward. Each sentence is sent as the prompt, and the log probability of the single token the model generates next is used to approximate perplexity.

```python
import math
from openai import OpenAI

client = OpenAI()

sentences = [
    "Cats sleep on the windowsill.",
    "Cats windowsill the on sleep."
]

for sentence in sentences:
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=sentence,
        max_tokens=1,
        logprobs=5
    )
    # Get the logprob of the generated token (first in the list)
    logprob = response.choices[0].logprobs.token_logprobs[0]
    perplexity = math.exp(-logprob)
    print(f"Sentence: {sentence}")
    print(f"Log probability: {logprob:.4f}")
    print(f"Approximate Perplexity: {perplexity:.2f}\n")
```

Example output:

```
Sentence: Cats sleep on the windowsill.
Log probability: -3.8244
Approximate Perplexity: 45.80

Sentence: Cats windowsill the on sleep.
Log probability: -5.7439
Approximate Perplexity: 312.27
```

Explanation of the output:

  * For "Cats sleep on the windowsill.", the log probability is higher (less negative), and the perplexity is much lower (45.80). This means the model finds this sentence more natural and likely.
  * For "Cats windowsill the on sleep.", the log probability is lower (more negative), and the perplexity is much higher (312.27). This means the model finds this sentence much less likely or more surprising.

The absolute values of perplexity may vary depending on the model and API version, but the key point is that the fluent sentence has a significantly lower perplexity than the awkward one. This demonstrates that perplexity can be used to compare the relative fluency or naturalness of different sentences according to the language model.

### Interpreting Results and Common Pitfalls

  * Lower perplexity means the sentence is more natural to the model.
  * Higher perplexity indicates the sentence is confusing or unlikely.
  * We’re approximating perplexity with the log probability of one generated token, not a full sequence, so results are directional, not absolute.
  * Make sure `max_tokens` is set to 1 or more to ensure a generated token is returned.
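The single-token approximation is the $N = 1$ case of the full formula. The sketch below uses made-up log probabilities (an assumption for illustration only) to show why such a score is only directional: one unlikely token fully determines a one-token score, but is averaged with the other tokens in a longer sequence.

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log probability over the given tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical values: one surprising token, alone and inside a longer sequence
single = [-5.0]
longer = [-5.0, -0.2, -0.3, -0.1, -0.4]

one_token_ppl = perplexity(single)    # identical to math.exp(5.0)
five_token_ppl = perplexity(longer)   # average surprisal is 6.0 / 5 = 1.2

print(f"One token:   {one_token_ppl:.2f}")
print(f"Five tokens: {five_token_ppl:.2f}")
```

With a single token, `perplexity([lp])` is exactly `math.exp(-lp)`, which is the formula used throughout this lesson.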

### Summary and Next Steps

In this lesson, you learned how to approximate perplexity using log probabilities from OpenAI’s API. While not a full traditional measure, this technique provides a useful proxy for evaluating sentence fluency. In the next unit, you’ll compare multiple models using perplexity to evaluate their fluency on the same sentence.



## Implementing the Perplexity Formula

Now that you understand the concept of perplexity and have seen how it relates to log probabilities, let's focus on the core mathematical relationship between them. In this exercise, you'll create a function that calculates perplexity directly from a log probability value.

Your task is to implement the `calculate_perplexity` function that:

  * Takes a log probability value as input
  * Applies the formula: `perplexity = exp(-logprob)`
  * Includes error handling for invalid inputs
  * Returns the calculated perplexity

This exercise isolates the mathematical operation behind perplexity calculation, helping you build a solid foundation before we move on to comparing models using this metric in the next lesson.

```python
import math

def calculate_perplexity(log_probability):
    """
    Calculate perplexity from a single log probability value.
    
    Args:
        log_probability (float): The log probability value
        
    Returns:
        float: The calculated perplexity
        
    Raises:
        TypeError: If input is not a number
    """
    # TODO: Check if input is a valid number and raise TypeError with message
    # "Log probability must be a number" if it's not a number
    
    # TODO: Calculate perplexity using the formula: exp(-logprob)
    
    # TODO: Return the calculated perplexity value
    pass

# Test the function with sample values
if __name__ == "__main__":
    test_values = [
        -1.5,    # Should give perplexity of about 4.48
        -0.5,    # Should give perplexity of about 1.65
        -3.0,    # Should give perplexity of about 20.09
        0.0      # Should give perplexity of 1.0
    ]
    
    print("Testing perplexity calculation:")
    print("-" * 40)
    
    for logprob in test_values:
        try:
            ppl = calculate_perplexity(logprob)
            print(f"Log probability: {logprob:.4f} | Perplexity: {ppl:.2f}")
        except Exception as e:
            print(f"Error with input {logprob}: {e}")
    
    # Try with an invalid input
    try:
        calculate_perplexity("not a number")
    except Exception as e:
        print(f"\nExpected error with invalid input: {e}")
```
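The exercise only asks for a `TypeError` on non-numeric input, but a few other edge cases are worth knowing about. The sketch below is an optional, stricter variant. The name `safe_perplexity` and the choice to return infinity on overflow are assumptions made for illustration, not part of the exercise specification.

```python
import math

def safe_perplexity(log_probability):
    """Stricter variant of calculate_perplexity (hypothetical helper)."""
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(log_probability, bool) or not isinstance(log_probability, (int, float)):
        raise TypeError("Log probability must be a number")
    if math.isnan(log_probability):
        raise ValueError("Log probability must not be NaN")
    try:
        return math.exp(-log_probability)
    except OverflowError:
        # math.exp raises OverflowError once the result exceeds the float range,
        # which happens for log probabilities below roughly -709
        return math.inf

print(safe_perplexity(-1.5))      # ordinary case
print(safe_perplexity(-1000.0))   # extremely unlikely token
```

Whether to return `math.inf` or re-raise on overflow is a design choice. Returning infinity keeps a batch loop running, while re-raising makes the extreme value harder to overlook.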



## Applying Perplexity to Real Sentences

Excellent job implementing the perplexity calculation function! Now let's put that function to work with real language model outputs. In this exercise, you'll apply your perplexity function to analyze how "surprised" a language model is by different sentences.

Your tasks are to:

  * Complete the `calculate_perplexity` function you created earlier.
  * Extract log probabilities from the OpenAI API responses.
  * Calculate perplexity for each sentence in the list.
  * Format and display the results clearly.

By comparing perplexity scores between grammatical and ungrammatical sentences, you'll see firsthand how this metric reflects the model's understanding of natural language patterns, a key skill for evaluating language model performance.

```python
import math
from openai import OpenAI

def calculate_perplexity(log_probability):
    """
    Calculate perplexity from a single log probability value.
    
    Args:
        log_probability (float): The log probability value
        
    Returns:
        float: The calculated perplexity
        
    Raises:
        TypeError: If input is not a number
    """
    # TODO: Check if input is a valid number and raise TypeError with message
    # "Log probability must be a number" if it's not a number
    
    # TODO: Calculate perplexity using the formula: exp(-logprob)
    
    # TODO: Return the calculated perplexity value
    pass

# Initialize the OpenAI client
client = OpenAI()

# List of sentences to analyze
sentences = [
    "Cats sleep on the windowsill.",
    "Cats windowsill the on sleep.",
    "The weather is nice today.",
    "Today nice the weather is."
]

# Process each sentence
for sentence in sentences:
    # Call the API to get log probabilities
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": sentence}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5
    )

    # TODO: Extract the log probability of the first token from the response
    
    # TODO: Calculate perplexity using our function
    
    # TODO: Print the results in a clear format with the sentence, log probability 
    # (formatted to 4 decimal places), and perplexity (formatted to 2 decimal places)
    # Add a separator line between sentences for readability
```

```python
import math
from openai import OpenAI

def calculate_perplexity(log_probability):
    """
    Calculate perplexity from a single log probability value.
    
    Args:
        log_probability (float): The log probability value
        
    Returns:
        float: The calculated perplexity
        
    Raises:
        TypeError: If input is not a number
    """
    # Check if input is a valid number and raise TypeError with message
    # "Log probability must be a number" if it's not a number
    if not isinstance(log_probability, (int, float)):
        raise TypeError("Log probability must be a number")
    
    # Calculate perplexity using the formula: exp(-logprob)
    perplexity = math.exp(-log_probability)
    
    # Return the calculated perplexity value
    return perplexity

# Initialize the OpenAI client
# OpenAI() reads the API key from the OPENAI_API_KEY environment variable by default
client = OpenAI()

# List of sentences to analyze
sentences = [
    "Cats sleep on the windowsill.",
    "Cats windowsill the on sleep.",
    "The weather is nice today.",
    "Today nice the weather is."
]

# Process each sentence
for sentence in sentences:
    # Call the API to get log probabilities
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": sentence}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5
    )

    # Extract the log probability of the first token from the response
    # The response structure for logprobs is a bit nested.
    # We are interested in the log probability of the single token we requested.
    log_probability = response.choices[0].logprobs.content[0].logprob

    # Calculate perplexity using our function
    perplexity = calculate_perplexity(log_probability)
    
    # Print the results in a clear format with the sentence, log probability 
    # (formatted to 4 decimal places), and perplexity (formatted to 2 decimal places)
    print(f"Sentence: {sentence}")
    print(f"Log Probability: {log_probability:.4f}")
    print(f"Perplexity: {perplexity:.2f}")
    
    # Add a separator line between sentences for readability
    print("-" * 30)
```
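The request above asks for `top_logprobs=5`, but the solution only reads the log probability of the token that was actually generated. As a rough sketch, the alternatives can be converted back to probabilities as shown below. The helper name `summarize_alternatives` is hypothetical, and the `SimpleNamespace` objects are stand-ins with made-up values that mimic the `token`, `logprob`, and `top_logprobs` attributes of `response.choices[0].logprobs.content[0]`, so the example runs without an API call.

```python
import math
from types import SimpleNamespace

def summarize_alternatives(token_info):
    """Return (token, probability) pairs for the candidates in top_logprobs."""
    return [(alt.token, math.exp(alt.logprob)) for alt in token_info.top_logprobs]

# Stand-in for response.choices[0].logprobs.content[0]; values are made up
fake_token_info = SimpleNamespace(
    token="That",
    logprob=math.log(0.5),
    top_logprobs=[
        SimpleNamespace(token="That", logprob=math.log(0.5)),
        SimpleNamespace(token="Yes", logprob=math.log(0.25)),
    ],
)

alternatives = summarize_alternatives(fake_token_info)
for token, prob in alternatives:
    print(f"{token!r}: {prob:.2%}")
```

With a real response, passing `response.choices[0].logprobs.content[0]` instead of the stand-in should work the same way, assuming the response has the shape used in the solution above.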

## Flexible Token Generation for Perplexity Analysis

Fantastic work applying perplexity to real sentences! Now let's make our code more flexible by allowing for different token generation settings. In our previous examples, we used a fixed value of 1 for the max_tokens parameter, but what happens if we generate more tokens?

Your task is to modify the perplexity analysis code to make the `max_tokens` parameter configurable. This will allow you to:

  * Experiment with different token generation settings
  * Compare how perplexity changes with different `max_tokens` values
  * Create more versatile code that can be adapted to different analysis needs

By making this parameter configurable, you'll gain deeper insights into how language models respond to different contexts and learn an important programming principle: parameterization makes code more reusable and powerful for experimentation.

```python
import math
from openai import OpenAI

def calculate_perplexity(log_probability):
    """
    Calculate perplexity from a log probability value.
    
    Args:
        log_probability (float): The log probability value
        
    Returns:
        float: The calculated perplexity
    """
    if not isinstance(log_probability, (int, float)):
        raise TypeError("Log probability must be a number")
    
    return math.exp(-log_probability)

# TODO: Modify this function to accept a max_tokens parameter with a default value of 1
def analyze_sentence_perplexity(sentence):
    """
    Analyze the perplexity of a sentence using the OpenAI API.
    
    Args:
        sentence (str): The sentence to analyze
        
    Returns:
        tuple: (log_probability, perplexity)
    """
    client = OpenAI()
    
    # TODO: Update this API call to use the max_tokens parameter instead of the fixed value
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": sentence}],
        max_tokens=1,  # Fixed value that needs to be configurable
        logprobs=True,
        top_logprobs=5
    )

    # Get logprob of the first token generated
    logprob = response.choices[0].logprobs.content[0].logprob

    # Compute perplexity
    perplexity = calculate_perplexity(logprob)
    
    return logprob, perplexity

# Test sentences
sentences = [
    "Cats sleep on the windowsill.",
    "Cats windowsill the on sleep."
]

# TODO: Create a list of different max_tokens values to test (e.g., [1, 3, 5])

for sentence in sentences:
    print(f"Analyzing: \"{sentence}\"")
    print("-" * 50)
    
    # TODO: Replace this single call with a loop that tests different max_tokens values
    logprob, perplexity = analyze_sentence_perplexity(sentence)
    print(f"Logprob: {logprob:.4f} | Perplexity: {perplexity:.2f}")
    print("-" * 30)
    
    print("\n")
```

```python
import math
from openai import OpenAI

def calculate_perplexity(log_probability):
    """
    Calculate perplexity from a log probability value.
    
    Args:
        log_probability (float): The log probability value
        
    Returns:
        float: The calculated perplexity
    """
    if not isinstance(log_probability, (int, float)):
        raise TypeError("Log probability must be a number")
    
    return math.exp(-log_probability)

def analyze_sentence_perplexity(sentence, max_tokens=1):
    """
    Analyze the perplexity of a sentence using the OpenAI API.
    
    Args:
        sentence (str): The sentence to analyze
        max_tokens (int): The number of tokens to generate. Defaults to 1.
        
    Returns:
        tuple: (log_probability, perplexity)
    """
    client = OpenAI()
    
    # Update this API call to use the max_tokens parameter
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": sentence}],
        max_tokens=max_tokens,
        logprobs=True,
        top_logprobs=5
    )

    # Get logprob of the first token generated
    # Note: For perplexity of a full sentence, you would typically sum logprobs and divide by the number of tokens.
    # However, this exercise focuses on the logprob of the *first* generated token to show flexibility.
    logprob = response.choices[0].logprobs.content[0].logprob

    # Compute perplexity
    perplexity = calculate_perplexity(logprob)
    
    return logprob, perplexity

# Test sentences
sentences = [
    "Cats sleep on the windowsill.",
    "Cats windowsill the on sleep."
]

# Create a list of different max_tokens values to test
max_tokens_values = [1, 3, 5]

for sentence in sentences:
    print(f"Analyzing: \"{sentence}\"")
    print("-" * 50)
    
    # Replace this single call with a loop that tests different max_tokens values
    for max_tokens in max_tokens_values:
        logprob, perplexity = analyze_sentence_perplexity(sentence, max_tokens=max_tokens)
        print(f"max_tokens: {max_tokens} | Logprob: {logprob:.4f} | Perplexity: {perplexity:.2f}")
    
    print("\n")
```
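Because the solution reads only `content[0]`, raising `max_tokens` should not systematically change the reported value. Any difference between runs comes from sampling a different first token. To make the extra tokens count, one option is to apply the full formula from earlier in this unit to every generated token. The sketch below assumes the same response shape as above. The helper name `perplexity_over_generated_tokens` is hypothetical, and the `SimpleNamespace` list is a made-up stand-in for `response.choices[0].logprobs.content`.

```python
import math
from types import SimpleNamespace

def perplexity_over_generated_tokens(content):
    """Average the negative logprobs of all generated tokens, then exponentiate."""
    logprobs = [item.logprob for item in content]
    if not logprobs:
        raise ValueError("No token log probabilities to average")
    return math.exp(-sum(logprobs) / len(logprobs))

# Stand-in for response.choices[0].logprobs.content with max_tokens=3 (made-up values)
fake_content = [
    SimpleNamespace(token="Cats", logprob=-0.5),
    SimpleNamespace(token=" are", logprob=-1.0),
    SimpleNamespace(token=" great", logprob=-1.5),
]

multi_token_ppl = perplexity_over_generated_tokens(fake_content)
first_token_ppl = math.exp(-fake_content[0].logprob)

print(f"First token only: {first_token_ppl:.2f}")
print(f"All three tokens: {multi_token_ppl:.2f}")
```

Note that this still scores the tokens the model generated in reply, not the tokens of the input sentence itself.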

## Error Handling for Perplexity Calculations

Now that you've made your perplexity code more flexible, let's focus on making it more reliable. When working with external APIs like OpenAI, many things can go wrong — network issues, rate limits, or unexpected response formats.

Your task is to enhance the perplexity calculation code with proper error handling:

  * Add try-except blocks around the API call to catch potential exceptions.
  * Handle cases where log probability data might be missing.
  * Ensure the program continues processing other sentences even if one fails.
  * Provide clear error messages that help diagnose problems.

In real-world applications, robust error handling is just as important as the core functionality. By completing this exercise, you'll learn how to make your perplexity calculations work reliably even when things don't go as planned.

```python
import math
from openai import OpenAI
import time

def calculate_perplexity(log_probability):
    """
    Calculate perplexity from a log probability value.
    
    Args:
        log_probability (float): The log probability value
        
    Returns:
        float: The calculated perplexity
    """
    if not isinstance(log_probability, (int, float)):
        raise TypeError("Log probability must be a number")
    
    return math.exp(-log_probability)

# Initialize the OpenAI client
client = OpenAI()

# List of sentences to analyze
sentences = [
    "Cats sleep on the windowsill.",
    "Cats windowsill the on sleep.",
    "The weather is nice today.",
    "Today nice the weather is."
]

# Process each sentence
for sentence in sentences:
    print(f"Processing: \"{sentence}\"")
    
    # TODO: Add a try-except block to handle all errors for this sentence
    
    # TODO: Add a try-except block specifically for the API call
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": sentence}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5
    )
        
    # TODO: Add a try-except block for safely extracting the log probability
    # TODO: Add checks to verify response contains the expected data
    logprob = response.choices[0].logprobs.content[0].logprob
    
    # Calculate perplexity
    perplexity = calculate_perplexity(logprob)
    
    # Print the results
    print(f"Log probability: {logprob:.4f}")
    print(f"Perplexity: {perplexity:.2f}")
    
    print("-" * 50 + "\n")
```

To improve the reliability of your perplexity calculation code, you need to add several layers of error handling. This includes catching issues with the API call itself and handling cases where the API response might be missing data.

Here is the revised code with proper `try-except` blocks and data validation.

```python
import math
from openai import OpenAI
import time

def calculate_perplexity(log_probability):
    """
    Calculate perplexity from a log probability value.
    
    Args:
        log_probability (float): The log probability value
        
    Returns:
        float: The calculated perplexity
    """
    if not isinstance(log_probability, (int, float)):
        raise TypeError("Log probability must be a number")
    
    return math.exp(-log_probability)

# Initialize the OpenAI client
client = OpenAI()

# List of sentences to analyze
sentences = [
    "Cats sleep on the windowsill.",
    "Cats windowsill the on sleep.",
    "The weather is nice today.",
    "Today nice the weather is."
]

# Process each sentence
for sentence in sentences:
    print(f"Processing: \"{sentence}\"")
    
    # Outer try-except: catches anything the inner handlers miss for this sentence
    try:
        # Try-except block for the API call to handle network issues, rate limits, etc.
        try:
            response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": sentence}],
                max_tokens=1,
                logprobs=True,
                top_logprobs=5
            )
        except Exception as api_error:
            print(f"ERROR: API call failed for sentence. Details: {api_error}")
            # Skip to the next sentence
            print("-" * 50 + "\n")
            continue
            
        # Try-except block for safely extracting the log probability
        try:
            # Check if the response contains the expected data
            if (response.choices and 
                response.choices[0].logprobs and 
                response.choices[0].logprobs.content and 
                response.choices[0].logprobs.content[0].logprob is not None):
                
                logprob = response.choices[0].logprobs.content[0].logprob
                
                # Calculate perplexity
                perplexity = calculate_perplexity(logprob)
                
                # Print the results
                print(f"Log probability: {logprob:.4f}")
                print(f"Perplexity: {perplexity:.2f}")
            else:
                print("ERROR: Incomplete or missing log probability data in the API response.")
                
        except (AttributeError, IndexError, TypeError) as data_error:
            print(f"ERROR: Failed to parse API response. Details: {data_error}")

    except Exception as overall_error:
        print(f"An unexpected error occurred: {overall_error}")
    
    print("-" * 50 + "\n")
```

-----

### Key Improvements

  - **Nested `try-except` Blocks**: The code now uses an outer `try-except` block so that an unexpected error in one sentence does not stop the loop. Inside it, there are two separate `try-except` blocks: one for the API call and another for parsing the response. This helps you pinpoint the exact cause of the failure.
  - **Specific Error Messages**: The `except` blocks now provide more descriptive error messages, differentiating between an API failure and a data parsing error. This is crucial for diagnosing problems.
  - **Robust Data Validation**: The `if` statement checks for the existence of `response.choices`, `logprobs`, and `logprobs.content` before attempting to access them. This prevents `AttributeError` and `IndexError` when the API returns an unexpected or empty response.
  - **Graceful Handling**: The `continue` statement within the API error handling block ensures that if one sentence fails, the program simply moves on to the next one instead of halting completely. This is essential for processing large datasets.
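
The code above skips a sentence as soon as the API call fails. For transient problems such as network issues or rate limits, a common complement is to retry with exponential backoff before giving up. The following is a minimal sketch under stated assumptions: the helper `call_with_retries`, its parameters, and the simulated `flaky` function are all hypothetical and not part of the OpenAI library.

```python
import time

def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure, wait base_delay * 2**attempt and try again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as error:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller's handler report it
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({error}); retrying in {delay:.1f}s")
            sleep(delay)

# Simulated call that fails twice and then succeeds (stands in for the API call)
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"

waits = []  # record the delays instead of actually sleeping in this demo
result = call_with_retries(flaky, max_attempts=3, base_delay=0.5, sleep=waits.append)
print(result, waits)
```

In the lesson's loop, the API call could be wrapped as `call_with_retries(lambda: client.chat.completions.create(...))` inside the existing inner `try` block. Catching `Exception` here retries every kind of failure, including ones that will never succeed, such as an invalid API key. In practice you would likely narrow this to the specific transient errors you expect.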