# Evaluating Model Outputs

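We can estimate a model's confidence in its output using perplexity. Perplexity is a measure of uncertainty, calculated by exponentiating the negative of the average of the output tokens' log probabilities (logprobs). A perplexity of 1 means the model assigned probability 1 to every token it produced; higher values mean more uncertainty.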

+ Perplexity can be used to assess the result of an individual model run.
+ It can also be used to compare the relative confidence of results between model runs. 

Low perplexity (that is, high confidence) does not guarantee accuracy, but it can be a helpful signal when paired with other evaluation metrics.
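As a minimal sketch of the calculation, the snippet below computes perplexity from a list of natural-log token probabilities and compares two runs. The helper name `perplexity` and the logprob values are illustrative assumptions, not real model output.

```python
import math


def perplexity(logprobs: list[float]) -> float:
    """Return exp(-mean(logprobs)) for a non-empty list of natural-log token probabilities."""
    if not logprobs:
        raise ValueError("logprobs must be non-empty")
    return math.exp(-sum(logprobs) / len(logprobs))


# Hypothetical logprobs for two runs (illustrative values, not real model output).
confident_run = [-0.05, -0.10, -0.02, -0.03]
uncertain_run = [-1.20, -0.90, -2.10, -0.60]

print(f"confident run: {perplexity(confident_run):.3f}")
print(f"uncertain run: {perplexity(uncertain_run):.3f}")

# Sanity check: if every token had probability 1/4, perplexity is 4,
# i.e. the model was as uncertain as a uniform choice among 4 options.
uniform_run = [math.log(0.25)] * 10
print(f"uniform over 4 choices: {perplexity(uniform_run):.3f}")
```

The run with logprobs closer to zero gets the lower perplexity, which is how the score can be used to compare relative confidence between runs.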

In [None]:
%load_ext dotenv
%dotenv ../../05_src/.secrets

In [None]:
from openai import OpenAI
import numpy as np
import os 

client = OpenAI(base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1', 
                api_key='any value',
                default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')})

In [None]:
prompts = [
    # Low perplexity: Clear topic, common structure, highly predictable vocabulary
    "Explain how photosynthesis works in simple terms.",
    # Medium perplexity: Narrative freedom, but familiar theme and constraints.
    "Write a short story about a traveler who realizes the journey mattered more than the destination.",
    # High perplexity: Abstract concept, creative freedom, unpredictable vocabulary
    "Describe the taste of a color that only exists for one second at dusk, using metaphors from mathematics and weather."
]

In [None]:
def get_completion(
    input: list[dict[str, str]],
    model: str = "gpt-4o-mini",
    max_tokens=500,
    temperature=0,
    tools=None,
    logprobs=None,  # if true, include the log probability of each output token in the response
    top_logprobs=None,  # number of most likely alternatives to return at each token position
):
    """Call the Responses API and return the full response object (not just the text)."""
    params = {
        "model": model,
        "input": input,
        "max_output_tokens": max_tokens,
        "temperature": temperature,
    }
    # Only send optional parameters that were actually set, rather than passing None.
    if logprobs:
        params["include"] = ["message.output_text.logprobs"]
    if top_logprobs is not None:
        params["top_logprobs"] = top_logprobs
    if tools:
        params["tools"] = tools

    completion = client.responses.create(**params)
    return completion

In [None]:

for prompt in prompts:
    response = get_completion(
        [{"role": "user", "content": prompt}],
        model="gpt-4o-mini",
        logprobs=True,
    )
    # Assumes the first output item is a message whose first content part holds the text.
    output_text = response.output[0].content[0]
    logprobs = [token.logprob for token in output_text.logprobs]
    response_text = output_text.text
    response_text_tokens = [token.token for token in output_text.logprobs]
    max_starter_length = max(len(s) for s in ["Prompt:", "Response:", "Tokens:", "Logprobs:", "Perplexity:"])
    max_token_length = max(len(s) for s in response_text_tokens)

    # Right-align tokens and logprobs to the same width so they line up column by column.
    formatted_response_tokens = [s.rjust(max_token_length) for s in response_text_tokens]
    formatted_lps = [f"{lp:.2f}".rjust(max_token_length) for lp in logprobs]

    # Perplexity is the exponential of the negative mean logprob.
    perplexity_score = np.exp(-np.mean(logprobs))

    # Print blank lines separately: newlines inside a label count toward its
    # length and would throw off the ljust padding.
    print("\n\n")
    print("Prompt:".ljust(max_starter_length), prompt)
    print("Response:".ljust(max_starter_length), response_text, "\n")
    print("Tokens:".ljust(max_starter_length), " ".join(formatted_response_tokens))
    print("Logprobs:".ljust(max_starter_length), " ".join(formatted_lps))
    print()
    print("Perplexity:".ljust(max_starter_length), perplexity_score, "\n")


## 💡 Frequently Asked Questions (FAQ)

**Q: How do we evaluate Retrieval?**
A: We check if the "Gold Standard" documents (the ones we know contain the answer) appear in the top-k results returned by the vector DB.
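- **Recall@k**: Did the right doc appear in the top k? When a query has several gold documents, this is the fraction of them found in the top k.
- **MRR (Mean Reciprocal Rank)**: How high up on the list was the right doc? Each query scores 1/rank of its first relevant document, and the scores are averaged over all queries.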
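
The two retrieval metrics can be sketched as follows. This is a minimal illustration, assuming each query has a ranked list of retrieved document ids and a set of gold ids; the function names and the sample ids are hypothetical. With a single gold document per query, `recall_at_k` reduces to the yes/no question of whether it appeared in the top k.

```python
def recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold documents that appear in the top-k retrieved results."""
    if not gold:
        raise ValueError("gold must be non-empty")
    top_k = set(retrieved[:k])
    return len(top_k & gold) / len(gold)


def reciprocal_rank(retrieved: list[str], gold: set[str]) -> float:
    """1 / rank of the first gold document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, gold) pairs, one pair per query."""
    if not runs:
        raise ValueError("runs must be non-empty")
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)


# Hypothetical document ids and rankings, for illustration only.
runs = [
    (["d3", "d1", "d7"], {"d1"}),        # gold doc at rank 2
    (["d2", "d9", "d4"], {"d2", "d4"}),  # first gold doc at rank 1
    (["d5", "d6", "d8"], {"d0"}),        # gold doc not retrieved
]

for retrieved, gold in runs:
    print(recall_at_k(retrieved, gold, k=2), reciprocal_rank(retrieved, gold))
print("MRR:", mean_reciprocal_rank(runs))
```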

**Q: What is Perplexity?**
A: A measure of how "surprised" a model is by the text. Lower perplexity means the model predicts the text well. It's often used to measure the fluency of a model, though it correlates poorly with factual accuracy.

**Q: What is Hallucination rate?**
A: The percentage of generated answers that contain information not supported by the retrieved context. Reducing it is one of the main goals of RAG.
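
As a rough sketch, the rate can be computed from per-answer judgments. This assumes each answer has already been labeled (by a human reviewer or a separate grading step) as supported or not; the function name and the labels below are hypothetical.

```python
def hallucination_rate(supported: list[bool]) -> float:
    """Percentage of answers judged to contain information not supported by the context."""
    if not supported:
        raise ValueError("need at least one judged answer")
    unsupported = sum(1 for ok in supported if not ok)
    return 100.0 * unsupported / len(supported)


# Hypothetical judgments: True means every claim in the answer
# was supported by the retrieved context.
judgments = [True, True, False, True, False]
print(f"{hallucination_rate(judgments):.1f}%")
```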
