# Evaluating Model Outputs

This notebook demonstrates how to obtain and use logprobs from the Responses API. Many of the examples are adapted from ["Using logprobs" in the OpenAI Cookbook](https://cookbook.openai.com/examples/using_logprobs#0-imports-and-utils).


## Logprobs and Classification

Both the Responses and Completions APIs can return logprobs, although not every model provider exposes them. Two key parameters control this:

- `logprobs`: whether to return the log probabilities of the output tokens. If true, the response includes the logprob of each output token in the message content. In the Responses API, which the code below uses, this is requested by adding `"message.output_text.logprobs"` to `include`.
- `top_logprobs`: an integer between 0 and 20 specifying the number of most likely tokens to return at each token position, each with an associated log probability. This parameter requires `logprobs=True`.

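Log probabilities, or logprobs, are $\log(p)$, where $p$ is the probability of a token occurring at a specific position given the previous tokens in the context. Because $p \le 1$, a logprob is never positive, and values closer to 0 indicate more likely tokens.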
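
To make the definition concrete, here is a minimal sketch that converts a few logprobs back to linear probabilities. The values are made up for illustration and are not output from any model, and `logprob_to_prob` is a hypothetical helper name. It assumes natural logarithms, which matches the `np.exp` conversion used later in this notebook.

```python
import math

# Hypothetical logprobs for three candidate tokens (illustrative values only).
example_logprobs = {"Sports": -0.01, "Politics": -4.8, "Art": -7.5}


def logprob_to_prob(logprob: float) -> float:
    """Convert a natural-log probability back to a linear probability."""
    return math.exp(logprob)


for token, lp in example_logprobs.items():
    print(f"{token}: logprob={lp}, probability={logprob_to_prob(lp):.2%}")
```

A logprob of 0 corresponds to a probability of 1, and each additional unit below 0 divides the probability by roughly 2.718.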

In [None]:
%load_ext dotenv
%dotenv ../../05_src/.secrets

In [None]:
from openai import OpenAI
import numpy as np
import os
client = OpenAI()

First, we establish a simple interface that we can use for our experiments.

In [None]:
def get_completion(
    input: list[dict[str, str]],
    model: str = "gpt-4o-mini",
    max_tokens=500,
    temperature=0,
    tools=None,
    logprobs=None,  # if truthy, request the log probability of each output token
    top_logprobs=None,  # number of most likely alternatives to return per token position
):
    # Returns the full response object (not a string) so callers can read the logprobs.
    params = {
        "model": model,
        "input": input,
        "max_output_tokens": max_tokens,
        "temperature": temperature,
    }
    # Add optional parameters only when they are set, so no explicit None values are sent.
    if logprobs:
        params["include"] = ["message.output_text.logprobs"]
    if top_logprobs is not None:
        params["top_logprobs"] = top_logprobs
    if tools:
        params["tools"] = tools

    completion = client.responses.create(**params)
    return completion

In [None]:
headlines = [
    "War and Peace in the Modern Era",
    "'War and Peace' in the Modern Era",
    "The Art of the Deal",
    # NYT
    "Louvre Closed After Thieves Steal ‘Priceless’ Jewels in Brazen Daylight Robbery",
    "The Risk That Built America",
    "Who should the Dodgers rather face in the World Series, the Mariners or the Blue Jays?",
    # New Yorker
    "Justin Trudeau and Katy Perry's Teen-Age Dream",
    "Shohei Ohtani and the Dodgers Are a Sight to Behold",
    "A Tech Millionaire's Costly Quest to Prove His Brother Was Murdered",
    "The AI Boom and the Spectre of 1929"
]

In [None]:
CLASSIFICATION_PROMPT = """You will be given a headline of a news article.
Classify the article into one of the following categories: Business, Politics, Sports, and Art.
Return only the name of the category, and nothing else.
MAKE SURE your output is one of the four categories stated.
Article headline: {headline}"""

We can use this interface to obtain the requested classification. Without `logprobs` enabled, however, the response does not show how probable the alternative categories were.

In [None]:
for headline in headlines:
    print(f"\nHeadline: {headline}")
    response = get_completion(
        [{"role": "user", "content": CLASSIFICATION_PROMPT.format(headline=headline)}],
        model="gpt-4o-mini",
    )
    print(f"Category: {response.output_text}\n")

Next, we request logprobs and show the top two candidates for the first output token, with their log and linear probabilities.

In [None]:
for headline in headlines:
    print(f"\nHeadline: {headline}")
    API_RESPONSE = get_completion(
        [{"role": "user", "content": CLASSIFICATION_PROMPT.format(headline=headline)}],
        model="gpt-4o-mini",
        logprobs=True,
        top_logprobs=2,
    )
    top_n_logprobs = API_RESPONSE.output[0].content[0].logprobs[0].top_logprobs
    output_content = ""
    for i, logprob in enumerate(top_n_logprobs, start=1):
        output_content += (
            f"Output token {i}: {logprob.token}, "
            f"logprobs: {logprob.logprob}, "
            f"linear probability: {np.round(np.exp(logprob.logprob)*100,2)}%\n"
        )
    print(output_content)
    print("\n")

This classification task shows why logprobs are useful:
+ We can gauge how "sure" the model is about the classification it proposed.
+ We can set a logprob threshold below which human assistance is needed.
+ Alternatively, our code can present several options when the top candidates' logprobs are within a threshold of each other.
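
The three ideas above can be sketched as a small routing function. This is an illustration rather than a recommended policy: `route_classification`, `accept_threshold`, and `margin` are hypothetical names and values, and the logprobs in the usage lines are made up.

```python
import math


def route_classification(candidates, accept_threshold=0.90, margin=0.20):
    """Decide what to do with a classification given (token, logprob) pairs.

    accept_threshold and margin are hypothetical values to tune per application.
    """
    # Convert logprobs to linear probabilities and sort from most to least likely.
    ranked = sorted(
        ((token, math.exp(logprob)) for token, logprob in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    best_token, best_prob = ranked[0]

    # Confident enough: accept the top category as-is.
    if best_prob >= accept_threshold:
        return {"action": "accept", "options": [best_token]}

    # Categories within `margin` of the best one are offered as options.
    close = [token for token, prob in ranked if best_prob - prob <= margin]
    if len(close) > 1:
        return {"action": "offer_options", "options": close}

    # Low confidence and no close alternative: ask a human.
    return {"action": "human_review", "options": [best_token]}


# Hypothetical logprobs, not real model output.
print(route_classification([("Sports", -0.02), ("Business", -4.0)]))
print(route_classification([("Business", -0.60), ("Politics", -0.85)]))
print(route_classification([("Art", -0.50), ("Politics", -3.0)]))
```

With the loop above, the input could be built as `[(lp.token, lp.logprob) for lp in top_n_logprobs]`. Comparing linear probabilities rather than raw logprobs keeps the thresholds easier to interpret, at the cost of one `exp` per candidate.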