# Unit 2

## Comparing Sentence Likelihoods Using Log Probabilities

### Introduction to Sentence Likelihoods

Welcome back to the course "Scoring LLM Outputs with Logprobs and Perplexity." In the previous lesson, we explored how log probabilities provide insights into a language model’s confidence when generating tokens. Now, we’ll build on that foundation by comparing sentence likelihoods using log probabilities.

Evaluating sentence likelihoods helps us understand how models judge different formulations of language. In this lesson, you’ll learn how to use the OpenAI API to score sentences based on the log probability of the model’s next token prediction.

### The Importance and Applications of Likelihoods

Likelihoods are a fundamental concept in language modeling and natural language processing. They measure how probable a sequence of words is according to a model, allowing us to quantify how “natural” or “expected” a sentence is. This is crucial for a variety of tasks:

  * **Model Evaluation:** Likelihoods are widely used to compare different language models and assess their performance on tasks like text generation, translation, and summarization.
  * **Error Detection:** By identifying sentences or tokens with unusually low likelihoods, we can spot errors, anomalies, or unnatural phrasing in generated text.
  * **Data Filtering:** Likelihood scores help filter out low-quality or irrelevant data when building datasets for training or evaluation.
  * **Downstream Applications:** Many applications—such as speech recognition, machine translation, and autocomplete—rely on likelihoods to rank candidate outputs and select the most plausible one.

Because of their versatility and interpretability, likelihoods (and their log-transformed versions, log probabilities) are a standard tool for both researchers and practitioners working with language models.
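
To make the link between likelihoods and log probabilities concrete, here is the standard chain-rule factorization used by autoregressive language models. The symbols $w_i$ (the $i$-th token) and $n$ (the number of tokens) are notation introduced here for illustration only.

```latex
\[
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
\quad\Longrightarrow\quad
\log P(w_1, \dots, w_n) = \sum_{i=1}^{n} \log P(w_i \mid w_1, \dots, w_{i-1})
\]
```

Taking the logarithm turns a product of many small probabilities into a sum, which is easier to work with numerically and is why APIs typically report log probabilities rather than raw probabilities.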

### Understanding the Code Structure

Let’s break down the code you’ll use in this unit. We begin by initializing the OpenAI client and defining a list of candidate sentences. For each sentence, we’ll pass it to the model and extract the log probability of the first predicted token, which gives us a proxy for how likely or “natural” the sentence feels to the model.

We use:

  * `logprobs=True` to return log probability data.
  * `top_logprobs=5` to retrieve scores for the top 5 candidate tokens.
  * `max_tokens=1` to generate exactly one token prediction.

### Extracting and Interpreting Log Probabilities

When you request `logprobs=True` from the OpenAI API, the response includes a log probability entry for each generated token. Each entry contains the token, its log probability, and a list of the top alternative tokens with their logprobs. The entry for the first generated token, `response.choices[0].logprobs.content[0]`, has this structure:

```json
{
    "token": "<generated_token>",
    "logprob": <log_probability>,
    "top_logprobs": [
        {"token": "<token_1>", "logprob": <logprob_1>},
        {"token": "<token_2>", "logprob": <logprob_2>},
        ...
    ]
}
```
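
As a rough sketch of how the alternatives could be inspected, the snippet below walks over `top_logprobs` for one token entry. In the OpenAI Python SDK, each alternative is an object with `token` and `logprob` attributes. To keep the example runnable without an API key, it uses a hypothetical `mock_entry` built with `SimpleNamespace`; the helper name `top_alternatives` and all tokens and values are invented for illustration.

```python
import math
from types import SimpleNamespace


def top_alternatives(token_entry):
    """Return (token, logprob, probability) tuples, most likely first."""
    rows = [
        (alt.token, alt.logprob, math.exp(alt.logprob))
        for alt in token_entry.top_logprobs
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical stand-in for response.choices[0].logprobs.content[0].
# The tokens and log probabilities below are made up for illustration.
mock_entry = SimpleNamespace(
    token="Yes",
    logprob=-0.25,
    top_logprobs=[
        SimpleNamespace(token="Yes", logprob=-0.25),
        SimpleNamespace(token="The", logprob=-2.0),
        SimpleNamespace(token="That", logprob=-3.5),
    ],
)

for token, logprob, prob in top_alternatives(mock_entry):
    print(f"{token!r}: logprob={logprob:.3f}, probability={prob:.3f}")
```

With a real response, you would pass `response.choices[0].logprobs.content[0]` in place of `mock_entry`.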

A log probability closer to 0 means the model is more confident in that token. The table below shows how log probability values relate to model confidence:

```
Confidence (probability)   Log Probability
-----------------------    ---------------
      1.0                        0
      0.5                   -0.693
      0.1                   -2.303
      0.01                  -4.605
```
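
The values in this table follow directly from the natural logarithm. Here is a minimal sketch that reproduces them and converts back again; the helper names `prob_to_logprob` and `logprob_to_prob` are illustrative and not part of any library.

```python
import math


def prob_to_logprob(probability):
    """Convert a probability in (0, 1] to a natural-log probability."""
    return math.log(probability)


def logprob_to_prob(logprob):
    """Convert a natural-log probability back to a probability."""
    return math.exp(logprob)


# Reproduce the rows of the table above.
for probability in (1.0, 0.5, 0.1, 0.01):
    print(f"{probability:<6} {prob_to_logprob(probability):.3f}")
```

Because the conversion is just `exp` and `log`, you can always turn a logprob returned by the API into a probability when that is easier to reason about.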

By comparing logprob values for different sentences, you can infer which one the model finds more plausible.

### Example: Comparing Sentence Fluency

Let’s look at two example sentences:

  * "The sun is a star."
  * "The sun is a sandwich."

While both are syntactically valid, one is clearly more semantically coherent. We’ll use logprobs to see how the model scores them.

```python
import openai
client = openai.OpenAI()

sentences = [
    "The sun is a star.",
    "The sun is a sandwich."
]

for sentence in sentences:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": sentence}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5
    )
    # Extract logprob for the first generated token
    logprob = response.choices[0].logprobs.content[0].logprob
    print(f"Sentence: {sentence}")
    print(f"Log probability of next token: {logprob:.3f}\n")
```

You might get an output like:

```
Sentence: The sun is a star.
Log probability of next token: -0.276

Sentence: The sun is a sandwich.
Log probability of next token: -1.841
```

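In this sample output, the first token generated after the first sentence has a much higher log probability (closer to 0), which this lesson treats as a sign that the model finds that sentence more plausible. Exact values will vary between runs and models.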
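
To get a feel for the size of this gap, it can help to convert the two sample values back into probabilities. The sketch below uses only the numbers from the sample output above; the variable names are illustrative.

```python
import math

# Sample values copied from the example output above; real runs will differ.
logprob_star = -0.276
logprob_sandwich = -1.841

prob_star = math.exp(logprob_star)
prob_sandwich = math.exp(logprob_sandwich)

# A difference of log probabilities equals the log of the probability ratio.
ratio = math.exp(logprob_star - logprob_sandwich)

print(f"Probability after 'star' sentence:     {prob_star:.3f}")
print(f"Probability after 'sandwich' sentence: {prob_sandwich:.3f}")
print(f"Ratio between the two:                 {ratio:.2f}")
```

With these sample numbers, the first generated token is roughly 76% likely in the first case and roughly 16% likely in the second, a ratio of about 4.8.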

### Summary and Next Steps

In this lesson, you learned how to compare sentence plausibility by examining token-level log probabilities. This method allows you to go beyond just generating responses—now you can measure how confident the model is in its next move.

In the next unit, you’ll use this idea to calculate perplexity, a popular metric that quantifies overall sentence fluency using log probability averages. You're getting closer to evaluating language models like a pro, so let’s keep going!


## Extracting Log Probabilities from Responses

Now that you've learned about log probabilities and how they help measure a model's confidence, let's put this knowledge into practice! In this exercise, you'll work with a single sentence and extract its log probability score from the OpenAI API response.

Your tasks are to:

  * Extract the log probability from the nested API response structure.
  * Format and print this value to three decimal places.
  * Display both the sentence and its corresponding log probability.

This hands-on experience will help you understand how to access the specific fields in the API response that contain log probability data. By completing this exercise, you'll take your first step toward comparing different sentences based on their likelihood scores.

```python
from openai import OpenAI

client = OpenAI()

sentence = "The sun is a star."

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": sentence}],
    max_tokens=1,  
    logprobs=True,
    top_logprobs=5
)

# TODO: Extract the logprob of the first generated token from the response
logprob = None

# TODO: Print the sentence and its log probability formatted to 3 decimal places
```

One possible solution:

```python
from openai import OpenAI

client = OpenAI()

sentence = "The sun is a star."

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": sentence}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5
)

# TODO: Extract the logprob of the first generated token from the response
logprob = response.choices[0].logprobs.content[0].logprob

# TODO: Print the sentence and its log probability formatted to 3 decimal places
print(f"Sentence: {sentence}")
print(f"Log probability of the next token: {logprob:.3f}")
```

## Comparing Sentences with Log Probabilities

Excellent work extracting log probabilities from a single sentence! Now let's take it a step further by comparing multiple sentences. In this exercise, you'll implement the example we discussed in the lesson to see how log probabilities reveal a model's preference for plausible statements.

Your tasks are to:

  * Create a list with two sentences: one factual and one nonsensical.
  * Write a loop to process each sentence with the API.
  * Extract and display the log probability for each sentence.

When you complete this exercise, you'll see firsthand how the model assigns higher log probabilities (values closer to 0) to sentences that make more sense. This practical comparison will deepen your understanding of how language models evaluate different statements based on their likelihood.

```python
from openai import OpenAI

client = OpenAI()

# TODO: Create a list containing two sentences:
# 1. "The sun is a star." (semantically coherent)
# 2. "The sun is a sandwich." (nonsensical)
sentences = []

# TODO: Write a loop to process each sentence in your list
# The code below needs to be inside your loop

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "The sun is a star."}],
    max_tokens=1,  # Must be >= 1
    logprobs=True,
    top_logprobs=5
)

# TODO: Extract the logprob of the first generated token
logprob = None

# TODO: Print the sentence and its log probability formatted to 3 decimal places
# Make sure to add a blank line between different sentence results

```

One possible solution:

```python
from openai import OpenAI

client = OpenAI()

# TODO: Create a list containing two sentences:
# 1. "The sun is a star." (semantically coherent)
# 2. "The sun is a sandwich." (nonsensical)
sentences = [
    "The sun is a star.",
    "The sun is a sandwich."
]

# TODO: Write a loop to process each sentence in your list
# The code below needs to be inside your loop
for sentence in sentences:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": sentence}],
        max_tokens=1,  # Must be >= 1
        logprobs=True,
        top_logprobs=5
    )

    # TODO: Extract the logprob of the first generated token
    logprob = response.choices[0].logprobs.content[0].logprob

    # TODO: Print the sentence and its log probability formatted to 3 decimal places
    # Make sure to add a blank line between different sentence results
    print(f"Sentence: {sentence}")
    print(f"Log probability of the next token: {logprob:.3f}\n")
```

## Finding the Most Plausible Sentence

Fantastic job comparing sentences with log probabilities! Now let's put your skills to practical use by creating a function that automatically finds the most likely sentence from a group of options. In this exercise, you'll build a reusable tool that can identify which sentence a language model considers most natural.

Your tasks are to:

  * Complete the `find_most_likely_sentence` function that processes multiple sentences.
  * Make API calls to get log probabilities for each sentence.
  * Track and update which sentence has the highest log probability.
  * Return both the winning sentence and its log probability score.

This function represents a real-world application of what you've learned: it could be used in systems that need to choose the most natural-sounding option from several alternatives. By completing this exercise, you'll have a practical tool that demonstrates how log probabilities can guide decision-making in language processing applications.


```python
from openai import OpenAI

client = OpenAI()

def find_most_likely_sentence(sentences):
    """
    Find the sentence with the highest log probability of the first token.
    
    Args:
        sentences: A list of sentences to compare
        
    Returns:
        A tuple containing (most_likely_sentence, highest_logprob)
    """
    # TODO: Initialize variables to track the highest logprob and most likely sentence
    
    for sentence in sentences:
        # TODO: Make an API call to get the log probability for this sentence
        
        # TODO: Extract the logprob of the first generated token
        
        print(f"Sentence: {sentence}")
        print(f"Log probability: {logprob:.3f}\n")
        
        # TODO: Update tracking variables if this sentence has a higher logprob
    
    # TODO: Return the most likely sentence and its log probability

# Test sentences with varying degrees of plausibility
test_sentences = [
    "The sun is a star in our solar system.",
    "The sun is a planet in our solar system.",
    "The sun is a sandwich in our solar system.",
    "The sun provides light and heat to Earth."
]

# TODO: Call the function and store the result

# TODO: Print the most likely sentence and its log probability
```

One possible solution:

```python
from openai import OpenAI

client = OpenAI()

def find_most_likely_sentence(sentences):
    """
    Find the sentence with the highest log probability of the first token.
    
    Args:
        sentences: A list of sentences to compare
        
    Returns:
        A tuple containing (most_likely_sentence, highest_logprob)
    """
    # TODO: Initialize variables to track the highest logprob and most likely sentence
    highest_logprob = float('-inf')
    most_likely_sentence = ""

    for sentence in sentences:
        # TODO: Make an API call to get the log probability for this sentence
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": sentence}],
            max_tokens=1,
            logprobs=True,
            top_logprobs=5
        )
        
        # TODO: Extract the logprob of the first generated token
        logprob = response.choices[0].logprobs.content[0].logprob
        
        print(f"Sentence: {sentence}")
        print(f"Log probability: {logprob:.3f}\n")
        
        # TODO: Update tracking variables if this sentence has a higher logprob
        if logprob > highest_logprob:
            highest_logprob = logprob
            most_likely_sentence = sentence
    
    # TODO: Return the most likely sentence and its log probability
    return (most_likely_sentence, highest_logprob)

# Test sentences with varying degrees of plausibility
test_sentences = [
    "The sun is a star in our solar system.",
    "The sun is a planet in our solar system.",
    "The sun is a sandwich in our solar system.",
    "The sun provides light and heat to Earth."
]

# TODO: Call the function and store the result
most_likely, likelihood_score = find_most_likely_sentence(test_sentences)

# TODO: Print the most likely sentence and its log probability
print("---")
print(f"The most plausible sentence is:\n'{most_likely}'")
print(f"With a log probability of: {likelihood_score:.3f}")
```
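
The tracking loop in the solution above can also be written with Python's built-in `max`, once the scores have been collected. The sketch below separates the selection step from the API calls so it can run offline; the helper name `pick_most_likely` and the scores in `example_scores` are hypothetical and invented for illustration.

```python
def pick_most_likely(scored_sentences):
    """Return the (sentence, logprob) pair with the highest logprob."""
    if not scored_sentences:
        raise ValueError("scored_sentences must not be empty")
    return max(scored_sentences, key=lambda pair: pair[1])


# Hypothetical scores for illustration; real values come from the API.
example_scores = [
    ("The sun is a star in our solar system.", -0.31),
    ("The sun is a planet in our solar system.", -0.92),
    ("The sun is a sandwich in our solar system.", -2.10),
]

best_sentence, best_logprob = pick_most_likely(example_scores)
print(f"Most plausible: {best_sentence!r} ({best_logprob:.3f})")
```

Keeping selection separate from scoring makes the logic easy to test without network access, and raising an error on an empty list avoids silently returning an empty string.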