# Unit 4

## Model Fluency Comparison in Language Models

### Introduction to Model Fluency

Welcome to the final lesson of the "Scoring LLM Outputs with Logprobs and Perplexity" course. In previous lessons, you explored how to extract log probabilities and calculate perplexity to evaluate language models. Building on that foundation, this lesson will focus on comparing the **fluency** of different language models. Model fluency is a crucial aspect of evaluating how well a model can generate coherent and natural-sounding text. By the end of this lesson, you will be able to assess model fluency using log probabilities and perplexity, providing you with a deeper understanding of model performance.

### Setting Up the Environment

Before we dive into the code, let's ensure your environment is ready. If you're working on your local machine, you'll need to install the `openai` library. You can do this using `pip`:

```bash
pip install openai
```

The `math` library is part of Python's standard library, so no installation is needed for it. However, if you're using the CodeSignal environment, these libraries are already pre-installed, allowing you to focus on the code without worrying about setup.

### Understanding the Code Structure

Let's break down the code snippet you'll be working with. This code is designed to evaluate the fluency of a sentence across different language models using log probabilities obtained from OpenAI's API. We start by importing the necessary libraries: `math` for mathematical operations and `OpenAI` for interacting with the language model. Next, we initialize the `OpenAI` client, which allows us to send requests to the model. We define a list of models and a sentence that we want to evaluate. The code processes the sentence for each model individually, creating a chat completion request for each one. This request specifies the model to use, the message content, and parameters such as `max_tokens`, `logprobs`, and `top_logprobs`. These parameters control the number of tokens generated, whether to return log probabilities, and how many top token probabilities to retrieve, respectively.

### Example: Evaluating Sentence Fluency Across Models

Now, let's see the code in action with a practical example. We have a sentence: "The president addressed the nation on live television." By running the code, we can evaluate the fluency of this sentence across different models based on the log probabilities of the first token generated by each model.

```python
import math
from openai import OpenAI

client = OpenAI()
models = ["gpt-3.5-turbo", "gpt-4"]
sentence = "The president addressed the nation on live television."

print(f"Evaluating sentence fluency: \"{sentence}\"\n")

for model in models:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": sentence}],
        max_tokens=1,  # Required, can't be 0
        logprobs=True,
        top_logprobs=5
    )
    
    # Use logprob of first generated token as approximation
    logprob = response.choices[0].logprobs.content[0].logprob
    perplexity = math.exp(-logprob)
    
    print(f"{model}:")
    print(f"  Log Probability (1st token): {logprob:.4f}")
    print(f"  Approx. Perplexity: {perplexity:.2f}\n")
```

When you run this code, you might see an output similar to:

```
Evaluating sentence fluency: "The president addressed the nation on live television."

gpt-3.5-turbo:
  Log Probability (1st token): -0.1500
  Approx. Perplexity: 1.16

gpt-4:
  Log Probability (1st token): -0.1000
  Approx. Perplexity: 1.11
```

In this example, `gpt-4` has the lower approximate perplexity (1.11 versus 1.16), meaning it assigned a higher probability to the first token it generated. This lesson uses that single-token value as a rough proxy for how fluent and predictable each model finds the sentence. The same pattern lets you use log probabilities and perplexity to compare the fluency of different models.
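The single-token figure above is the length-one case of the usual perplexity formula, which exponentiates the negative mean log probability over all scored tokens. The sketch below illustrates that relationship, assuming you already have a list of per-token log probabilities. The helper name `sequence_perplexity` and the three-token sample values are illustrative only and are not part of the lesson code.

```python
import math

def sequence_perplexity(logprobs):
    """Perplexity = exp(-(mean of the token log probabilities))."""
    if not logprobs:
        raise ValueError("need at least one log probability")
    return math.exp(-sum(logprobs) / len(logprobs))

# With a single token this reduces to math.exp(-logprob), as in the lesson code.
single = sequence_perplexity([-0.15])

# Hypothetical log probabilities for a three-token continuation.
multi = sequence_perplexity([-0.15, -0.40, -0.05])

print(f"Single-token perplexity: {single:.2f}")
print(f"Three-token perplexity:  {multi:.2f}")
```

Averaging before exponentiating keeps the result comparable across sequences of different lengths, which a raw sum of log probabilities would not.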

## Extracting Token Text from API Responses

Now that you've seen how to extract log probabilities from model responses, let's dig deeper into the API response structure. In this exercise, you'll enhance our fluency comparison by extracting not just the log probability but also the actual text of the first token generated by each model.

Look for the TODO comment in the code. Your task is to add a line that extracts the token's text from the response object, similar to how we're already extracting the log probability. The token text is stored in the same nested object where we find the log probability.

By displaying both the token text and its probability, you'll gain more insight into how different models interpret and continue the same input text. This skill of navigating complex API responses will be valuable as you work with more advanced language model applications.

```python
from openai import OpenAI

client = OpenAI()

models = ["gpt-3.5-turbo", "gpt-4"]
sentence = "The president addressed the nation on live television."

print(f"Evaluating sentence fluency: \"{sentence}\"\n")

for model in models:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": sentence}],
        max_tokens=1,  # Required, can't be 0
        logprobs=True,
        top_logprobs=5
    )

    # TODO: Extract the token text from the response, similar to how we extract the log probability below
    
    # Use logprob of first generated token as approximation
    logprob = response.choices[0].logprobs.content[0].logprob

    print(f"{model}:")
    print(f"  First token: \"[Your code should extract this]\"")
    print(f"  Log Probability (1st token): {logprob:.4f}\n")

```

The text of the token is located in the same nested object as the log probability. You can access it using the `.token` attribute.

Here is the updated code:

```python
from openai import OpenAI

client = OpenAI()

models = ["gpt-3.5-turbo", "gpt-4"]
sentence = "The president addressed the nation on live television."

print(f"Evaluating sentence fluency: \"{sentence}\"\n")

for model in models:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": sentence}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5
    )

    # Extract the token text and its log probability
    first_token_info = response.choices[0].logprobs.content[0]
    token_text = first_token_info.token
    logprob = first_token_info.logprob

    print(f"{model}:")
    print(f"  First token: \"{token_text}\"")
    print(f"  Log Probability (1st token): {logprob:.4f}\n")
```

### Explanation

The `response` object from the OpenAI API has a structured hierarchy. When you request `logprobs=True`, a `logprobs` object is added to each choice in the response. The data for the first generated token is then found at `response.choices[0].logprobs.content[0]`.

This object contains two key pieces of information:

  - `.token`: The actual text of the token (e.g., "The").
  - `.logprob`: The log probability of that token.

By extracting both pieces of information, you can now see exactly what the model predicts as the next token and how confident it is in that prediction.
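Because the requests in this lesson also set `top_logprobs=5`, each entry in `content` additionally carries a `.top_logprobs` list of alternative tokens, each with its own `.token` and `.logprob`. The sketch below shows one way to turn those alternatives into plain probabilities. It uses a hand-built stand-in object instead of a live API response, so the tokens, the values, and the helper name `as_probabilities` are all assumptions made for illustration.

```python
import math
from types import SimpleNamespace

# Hand-built stand-in for response.choices[0].logprobs.content[0];
# the tokens and log probabilities are made up for illustration.
first_token_info = SimpleNamespace(
    token="The",
    logprob=-0.15,
    top_logprobs=[
        SimpleNamespace(token="The", logprob=-0.15),
        SimpleNamespace(token="That", logprob=-2.30),
        SimpleNamespace(token="It", logprob=-3.10),
    ],
)

def as_probabilities(token_info):
    """Map each candidate token to its probability, exp(logprob)."""
    return {alt.token: math.exp(alt.logprob) for alt in token_info.top_logprobs}

for token, prob in as_probabilities(first_token_info).items():
    print(f"{token!r}: {prob:.3f}")
```

Looking at the alternatives alongside the chosen token can show whether a model was torn between several continuations or strongly preferred one.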

## Calculating Perplexity for Model Comparison

Perfect! You've learned how to work with API response structures and extract log probabilities from model outputs. Now it's time to complete the fluency comparison by adding the missing piece — perplexity calculation.

Your objective is to calculate the approximate perplexity for each model's first token using the formula `math.exp(-logprob)`. Look for the TODO comment in the code and add the perplexity calculation right after you extract the log probability.

Remember that lower perplexity values indicate better fluency — they show that a model finds the text more predictable and natural. This calculation will give you the complete picture for comparing how different models perceive the same input sentence.

```python
import math
from openai import OpenAI

client = OpenAI()

models = ["gpt-3.5-turbo", "gpt-4"]
sentence = "The president addressed the nation on live television."

print(f"Evaluating sentence fluency: \"{sentence}\"\n")

for model in models:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": sentence}],
        max_tokens=1,  # Required, can't be 0
        logprobs=True,
        top_logprobs=5
    )

    # Use logprob of first generated token as approximation
    logprob = response.choices[0].logprobs.content[0].logprob
    # TODO: Calculate the approximate perplexity using math.exp

    print(f"{model}:")
    print(f"  Log Probability (1st token): {logprob:.4f}")
    print(f"  Approx. Perplexity: {perplexity:.2f}\n")
```

The final step is to calculate the perplexity with `math.exp(-logprob)`.

Here is the completed code:

```python
import math
from openai import OpenAI

client = OpenAI()

models = ["gpt-3.5-turbo", "gpt-4"]
sentence = "The president addressed the nation on live television."

print(f"Evaluating sentence fluency: \"{sentence}\"\n")

for model in models:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": sentence}],
        max_tokens=1,  # Required, can't be 0
        logprobs=True,
        top_logprobs=5
    )

    # Use logprob of first generated token as approximation
    logprob = response.choices[0].logprobs.content[0].logprob
    # Calculate the approximate perplexity
    perplexity = math.exp(-logprob)

    print(f"{model}:")
    print(f"  Log Probability (1st token): {logprob:.4f}")
    print(f"  Approx. Perplexity: {perplexity:.2f}\n")
```

When you run this code, you'll see the perplexity values for each model, which give you a direct comparison of their fluency for the given sentence. The model with the lower perplexity value is considered more "fluent", or less "surprised" by the text.
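To make the comparison explicit, you could collect the log probabilities and sort the models by perplexity. The sketch below does this offline using the sample values from the example output earlier in this lesson. The dictionary `sample_logprobs` and the helper `rank_by_perplexity` are hypothetical names, and real values will differ from run to run.

```python
import math

# Sample first-token log probabilities taken from the example output
# earlier in this lesson; a live run will produce different numbers.
sample_logprobs = {"gpt-3.5-turbo": -0.15, "gpt-4": -0.10}

def rank_by_perplexity(logprobs_by_model):
    """Return (model, perplexity) pairs sorted from lowest to highest perplexity."""
    scored = {model: math.exp(-lp) for model, lp in logprobs_by_model.items()}
    return sorted(scored.items(), key=lambda item: item[1])

for model, ppl in rank_by_perplexity(sample_logprobs):
    print(f"{model}: {ppl:.2f}")
```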

## Evaluating Multiple Sentences for Fluency

Excellent work on mastering single-sentence fluency evaluation! Now it's time to scale up your analysis and see how model fluency varies across different types of text.

Your objective is to modify the code to evaluate multiple sentences at once, giving you a broader view of model performance patterns. This will help you understand how the same model might handle different sentence structures or topics with varying levels of fluency.

Here's what you need to do:

  - Replace the single-sentence variable with a list of sentences.
  - Add a loop to iterate through each sentence.
  - Update the print statements to show which sentence is being evaluated.
  - **Restrict your evaluation to only the `"gpt-4"` model.**
  - Add error handling for `math.exp(-logprob)` using a `try`/`except` block that catches `OverflowError` and sets perplexity to `float('inf')` if it occurs.

By testing multiple sentences with the "gpt-4" model, you'll develop a more complete understanding of how a state-of-the-art model handles various text types and gain valuable experience working with nested data structures and robust error handling.

```python
import math
from openai import OpenAI

client = OpenAI()

models = ["gpt-4"]
# TODO: Replace this single sentence with a list of sentences to evaluate
sentence = "The president addressed the nation on live television."

# TODO: Add a loop here to iterate through each sentence in your list

print(f"Evaluating sentence fluency: \"{sentence}\"\n")

for model in models:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": sentence}],
        max_tokens=1,  # Required, can't be 0
        logprobs=True,
        top_logprobs=5
    )

    # Use logprob of first generated token as approximation
    logprob = response.choices[0].logprobs.content[0].logprob
    # TODO: Add error handling for math.exp(-logprob) to catch OverflowError
    perplexity = math.exp(-logprob)

    print(f"{model}:")
    print(f"  Log Probability (1st token): {logprob:.4f}")
    print(f"  Approx. Perplexity: {perplexity:.2f}\n")
```

Here is the modified code that evaluates multiple sentences using a list, loops through them, and includes error handling for the `math.exp()` function.

```python
import math
from openai import OpenAI

client = OpenAI()

models = ["gpt-4"]
# Replace the single sentence with a list of sentences to evaluate
sentences = [
    "The president addressed the nation on live television.",
    "The sun rises in the east and sets in the west.",
    "A computer is a machine that can be programmed.",
    "Unusual phrases often lead to high perplexity."
]

# Process each sentence for the specified model
for model in models:
    for sentence in sentences:
        print(f"Evaluating sentence fluency for '{model}': \"{sentence}\"")
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": sentence}],
                max_tokens=1,
                logprobs=True,
                top_logprobs=5
            )

            # Use logprob of first generated token as approximation
            logprob = response.choices[0].logprobs.content[0].logprob

            # Add error handling for math.exp(-logprob) to catch OverflowError
            try:
                perplexity = math.exp(-logprob)
            except OverflowError:
                # Set perplexity to infinity if the logprob is too negative
                perplexity = float('inf')

            print(f"  Log Probability (1st token): {logprob:.4f}")
            print(f"  Approx. Perplexity: {perplexity:.2f}\n")
            
        except Exception as e:
            print(f"An error occurred while processing the sentence: {e}\n")
```

### Key Changes

  * **List of Sentences**: The `sentence` variable has been replaced with a list called `sentences`, containing several different examples.
  * **Nested Loop**: A new `for sentence in sentences:` loop has been added. This loop is nested inside the `for model in models:` loop, ensuring that each sentence is processed by each model in the list.
  * **Updated Print Statements**: The print statements have been modified to clearly indicate which model and sentence are currently being evaluated.
  * **OverflowError Handling**: A `try`/`except OverflowError` block now wraps the `math.exp(-logprob)` calculation. This prevents the program from crashing if `logprob` is an extremely large negative number, which corresponds to a token the model considered almost impossible. If an `OverflowError` occurs, the perplexity is set to `float('inf')`, signaling that the model found the token extremely unpredictable and that the true value exceeded the float limit.
  * **General Exception Handling**: A broader `try-except` block has been added to catch any other potential API or data parsing errors, ensuring the script continues to run even if a request fails.
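
The overflow case can be reproduced without calling the API. Python's `math.exp` raises `OverflowError` once its argument exceeds roughly 709.78, so `math.exp(-logprob)` fails for log probabilities below about -709.78. The sketch below wraps the calculation in a small helper. The name `safe_perplexity` and the test values are illustrative and do not come from the lesson code.

```python
import math

def safe_perplexity(logprob):
    """Return exp(-logprob), or infinity when the result overflows a float."""
    try:
        return math.exp(-logprob)
    except OverflowError:
        return float("inf")

print(safe_perplexity(-0.10))    # ordinary case: a finite perplexity
print(safe_perplexity(-1000.0))  # exp(1000) overflows, so this prints inf
```

Returning `float('inf')` keeps downstream formatting such as `f"{perplexity:.2f}"` working, since it renders as `inf` and does not raise.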