# Evaluating Model Outputs

We can use logprobs to evaluate model outputs and reduce the risk of providing our users with hallucinated or inconsistent responses.

Assume a simple retrieval-augmented generation (RAG) setup in which a single document has been retrieved.

In [1]:
%load_ext dotenv
%dotenv ../../05_src/.secrets

In [None]:
# Article retrieved
ada_lovelace_article = """Augusta Ada King, Countess of Lovelace (née Byron; 10 December 1815 - 27 November 1852) was an English mathematician and writer, chiefly known for her work on Charles Babbage's proposed mechanical general-purpose computer, the Analytical Engine. She was the first to recognise that the machine had applications beyond pure calculation.
Ada Byron was the only legitimate child of poet Lord Byron and reformer Lady Byron. All Lovelace's half-siblings, Lord Byron's other children, were born out of wedlock to other women. Byron separated from his wife a month after Ada was born and left England forever. He died in Greece when Ada was eight. Her mother was anxious about her upbringing and promoted Ada's interest in mathematics and logic in an effort to prevent her from developing her father's perceived insanity. Despite this, Ada remained interested in him, naming her two sons Byron and Gordon. Upon her death, she was buried next to him at her request. Although often ill in her childhood, Ada pursued her studies assiduously. She married William King in 1835. King was made Earl of Lovelace in 1838, Ada thereby becoming Countess of Lovelace.
Her educational and social exploits brought her into contact with scientists such as Andrew Crosse, Charles Babbage, Sir David Brewster, Charles Wheatstone, Michael Faraday, and the author Charles Dickens, contacts which she used to further her education. Ada described her approach as "poetical science" and herself as an "Analyst (& Metaphysician)".
When she was eighteen, her mathematical talents led her to a long working relationship and friendship with fellow British mathematician Charles Babbage, who is known as "the father of computers". She was in particular interested in Babbage's work on the Analytical Engine. Lovelace first met him in June 1833, through their mutual friend, and her private tutor, Mary Somerville.
Between 1842 and 1843, Ada translated an article by the military engineer Luigi Menabrea (later Prime Minister of Italy) about the Analytical Engine, supplementing it with an elaborate set of seven notes, simply called "Notes".
Lovelace's notes are important in the early history of computers, especially since the seventh one contained what many consider to be the first computer program—that is, an algorithm designed to be carried out by a machine. Other historians reject this perspective and point out that Babbage's personal notes from the years 1836/1837 contain the first programs for the engine. She also developed a vision of the capability of computers to go beyond mere calculating or number-crunching, while many others, including Babbage himself, focused only on those capabilities. Her mindset of "poetical science" led her to ask questions about the Analytical Engine (as shown in her notes) examining how individuals and society relate to technology as a collaborative tool.
"""

# Questions that can be easily answered given the article
easy_questions = [
    "What nationality was Ada Lovelace?",
    "What was an important finding from Lovelace's seventh note?",
]

# Questions that are not fully covered in the article
medium_questions = [
    "Did Lovelace collaborate with Charles Dickens",
    "What concepts did Lovelace build with Charles Babbage",
]

In [13]:
from openai import OpenAI
import numpy as np
client = OpenAI()

We use our simple interface, a thin wrapper around the Responses API.

In [14]:
def get_completion(
    input: list[dict[str, str]],
    model: str = "gpt-4o-mini",
    max_tokens=500,
    temperature=0,
    tools=None,
    logprobs=None,  # if truthy, request the log probability of each output token
    top_logprobs=None,  # number of most likely alternative tokens to return per position
):
    """Call the Responses API and return the full response object."""
    params = {
        "model": model,
        "input": input,
        "max_output_tokens": max_tokens,
        "temperature": temperature,
    }
    # Only send optional parameters that were actually set.
    if tools:
        params["tools"] = tools
    if logprobs:
        params["include"] = ["message.output_text.logprobs"]
    if top_logprobs is not None:
        params["top_logprobs"] = top_logprobs

    completion = client.responses.create(**params)
    return completion

The prompt below asks the model to decide whether the article contains enough information to answer a question, and to reply with a single word: True or False. We can then use the logprob of that single token to gauge how confident the model is in its assessment.

In [3]:
PROMPT = """You retrieved this article: {article}. The question is: {question}.
Before even answering the question, consider whether you have sufficient information in the article to answer the question fully.
Your output should JUST be the boolean true or false, indicating whether you have sufficient information in the article to answer the question.
Respond with just one word, the boolean true or false. You must output the word 'True', or the word 'False', nothing else.
"""

In [30]:

def has_sufficient_context_for_answer(article, question):
    # Ask the model whether the article contains enough information to answer.
    response = get_completion(
        [
            {
                "role": "user",
                "content": PROMPT.format(article=article, question=question),
            }
        ],
        model="gpt-4o-mini",
        logprobs=True,
    )
    # Log probability of the first output token, which should be 'True' or 'False'.
    logprob = response.output[0].content[0].logprobs[0]
    probability = np.round(np.exp(logprob.logprob) * 100, 2)
    return (
        f"Question: {question}\n"
        f"has_sufficient_context_for_answer: {logprob.token}, "
        f"logprobs: {logprob.logprob}, linear probability: {probability}%\n"
    )
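
The function above reads only the logprob of the token the model actually produced. If `top_logprobs` were also requested, the alternatives at that position could be combined into a single estimate of how strongly the model favours True over False. The sketch below is one possible way to do that. It works on plain `(token, logprob)` pairs rather than on the API response, and both the helper name `probability_of_true` and the input shape are assumptions made for illustration, not part of the notebook or the OpenAI SDK.

```python
import math


def probability_of_true(candidates):
    """Estimate P(True) from (token, logprob) pairs for one token position.

    Hypothetical helper: `candidates` is assumed to hold the alternatives
    returned for the first output token. Variants such as ' true' or 'TRUE'
    are folded into the same bucket before normalising.
    """
    p_true = 0.0
    p_false = 0.0
    for token, logprob in candidates:
        label = token.strip().lower()
        if label == "true":
            p_true += math.exp(logprob)
        elif label == "false":
            p_false += math.exp(logprob)
    total = p_true + p_false
    if total == 0:
        return None  # neither answer appeared among the candidates
    return p_true / total


# Made-up candidates, for illustration only.
example = [("True", math.log(0.6)), (" true", math.log(0.2)), ("False", math.log(0.2))]
print(probability_of_true(example))
```

Normalising over only the True and False buckets ignores any probability mass the model placed on other tokens, which is usually negligible when the prompt constrains the answer this tightly.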

### Easy Questions

In [31]:

for qn in easy_questions:
    output = has_sufficient_context_for_answer(ada_lovelace_article, qn)
    print(output)

Question: What nationality was Ada Lovelace?
has_sufficient_context_for_answer: True, logprobs: -0.0, linear probability: 100.0%

Question: What was an important finding from Lovelace's seventh note?
has_sufficient_context_for_answer: True, logprobs: -2e-06, linear probability: 100.0%
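
Both answers are reported as 100.0% even though only one logprob is exactly zero. A logprob is the natural logarithm of the token's probability, so exponentiating it recovers the probability, and rounding to two decimals hides very small differences. The sketch below repeats that conversion with the standard library only; `logprob_to_percent` is a hypothetical helper name, and the inputs are the logprobs printed in this notebook.

```python
import math


def logprob_to_percent(logprob: float, ndigits: int = 2) -> float:
    """Convert a natural-log probability to a percentage, rounded for display."""
    return round(math.exp(logprob) * 100, ndigits)


# Logprobs taken from the outputs printed in this notebook.
for lp in (-0.0, -2e-06, -0.000262):
    unrounded = math.exp(lp) * 100
    print(f"logprob={lp} -> {logprob_to_percent(lp)}% (unrounded: {unrounded}%)")
```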



### Medium Questions

In [32]:
for qn in medium_questions:
    output = has_sufficient_context_for_answer(ada_lovelace_article, qn)
    print(output)

Question: Did Lovelace collaborate with Charles Dickens
has_sufficient_context_for_answer: False, logprobs: -1e-06, linear probability: 100.0%

Question: What concepts did Lovelace build with Charles Babbage
has_sufficient_context_for_answer: True, logprobs: -0.000262, linear probability: 99.97%
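
One way to act on these results in a RAG pipeline is to answer only when the model says True with high confidence, and otherwise fall back to a response such as "I don't know". The sketch below is a minimal version of that gate. The helper name `should_answer` and the 0.95 threshold are assumptions chosen for illustration and would need tuning against real data.

```python
import math


def should_answer(token: str, logprob: float, min_confidence: float = 0.95) -> bool:
    """Return True only if the model said 'True' with enough confidence.

    Hypothetical helper: `token` and `logprob` are the first output token
    and its log probability, as read in has_sufficient_context_for_answer.
    """
    says_true = token.strip().lower() == "true"
    confidence = math.exp(logprob)  # convert logprob to linear probability
    return says_true and confidence >= min_confidence


# Token/logprob pairs taken from the medium-question outputs above.
print(should_answer("False", -1e-06))
print(should_answer("True", -0.000262))
```

Note that the second medium question passes this gate even though it was listed as not fully covered by the article: a high token probability reflects the model's confidence, not the correctness of its judgement.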

