# GenAI Evaluation: A Quick Demo

Welcome! This notebook is a brief demonstration of several common techniques used to evaluate the output of Generative AI models. We'll walk through simple, clear examples for each method. Pay close attention to how each technique measures a different aspect of quality.

In [9]:
import litellm
import os

if os.getenv("OPENAI_API_KEY"):
    litellm.openai_key = os.getenv("OPENAI_API_KEY")

# If using Vocareum, you can also set your API key here directly
# Uncomment and replace the string with your actual Vocareum API key
# litellm.openai_key = "voc-**********"

if (litellm.openai_key or "").startswith("voc-"):
    litellm.api_base = "https://openai.vocareum.com/v1"
    print("Detected vocareum API key. Using Vocareum OpenAI API base URL.")

Detected vocareum API key. Using Vocareum OpenAI API base URL.


## 1. Setup: Imports and Configuration

First, we import the necessary libraries. This cell brings in `numpy` and `pandas` for data handling and math, the `evaluate` library for metrics such as ROUGE, and `sentence-transformers` for sentence embeddings.

In [1]:
import math, random, re, sys, platform, time, json
from typing import List, Dict

import numpy as np
import pandas as pd
from evaluate import load
from sentence_transformers import SentenceTransformer

# Set a seed for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

## 2. Exact Match (EM)

**Exact Match** is the simplest evaluation metric. It checks if the model's output is identical to the reference answer. We often apply a simple normalization step (like lowercasing) first.

It's useful for tasks with a single, clear correct answer, like multiple-choice questions.

In [None]:
# Let's compare predicted fruit names with the correct labels.
preds = ["Apple", "banana ", " Orange"]
labels = ["apple", "banana", "grape"]


def normalize(s: str) -> str:
    """Normalize a string by lowercasing and stripping whitespace."""
    return s.lower().strip()


def exact_match(pred: str, label: str) -> int:
    """Return 1 if normalized strings are identical, else 0."""
    return int(normalize(pred) == normalize(label))


# Calculate EM score for each pair
em_scores = [exact_match(p, y) for p, y in zip(preds, labels)]

# The final score is the average of individual scores
em_accuracy = sum(em_scores) / len(em_scores)

print(f"Individual Scores: {em_scores}")
print(f"Average Exact Match Accuracy: {em_accuracy:.2f}")

Individual Scores: [1, 1, 0]
Average Exact Match Accuracy: 0.67
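
Lowercasing and stripping whitespace handle the fruit example, but real answers often also differ in punctuation or articles. The block below is a minimal sketch of a stricter normalization step, loosely in the style of question-answering benchmarks. The names `normalize_answer` and `exact_match_strict` and the sample pairs are illustrative assumptions, not part of this notebook.

```python
import re
import string


def normalize_answer(s: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match_strict(pred: str, label: str) -> int:
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(pred) == normalize_answer(label))


# Hypothetical (prediction, label) pairs for illustration
examples = [("The Apple.", "apple"), ("an  orange", "Orange"), ("grape", "grapes")]
strict_scores = [exact_match_strict(p, y) for p, y in examples]
print(f"Strict EM scores: {strict_scores}")
```

How aggressive the normalization should be depends on the task: removing articles is reasonable for short factual answers but could hide real errors elsewhere.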


## 3. Lexical Similarity (ROUGE)

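When answers can be phrased differently, Exact Match is too strict. **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)** measures the overlap of n-grams (word sequences) between the prediction and the label. The scores printed below are F-measures, which balance precision (how much of the prediction appears in the label) and recall (how much of the label appears in the prediction).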

It's commonly used for evaluating text summarization.

In [3]:
# Define a prediction and a reference text
pred = "the quick brown fox"
label = "the fox is quick and brown"

# Load the ROUGE metric from the 'evaluate' library
rouge = load("rouge")

# Compute the scores
results = rouge.compute(predictions=[pred], references=[label])

# ROUGE-1 measures unigram (single word) overlap.
# ROUGE-L measures the longest common subsequence.
print(f"ROUGE-1 Score: {results['rouge1']:.4f}")
print(f"ROUGE-L Score: {results['rougeL']:.4f}")

ROUGE-1 Score: 0.8000
ROUGE-L Score: 0.6000
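
To see where these two numbers come from, here is a hand-rolled sketch of ROUGE-1 and ROUGE-L as F-measures. It assumes plain whitespace tokenization, which is simpler than what the `evaluate` implementation does, so results may differ on text with punctuation or mixed case. The helper names `rouge1_f1`, `lcs_length`, and `rouge_l_f1` are our own.

```python
from collections import Counter
from typing import List


def f1(overlap: int, pred_len: int, label_len: int) -> float:
    """Combine precision and recall into an F-measure."""
    if overlap == 0:
        return 0.0
    precision = overlap / pred_len
    recall = overlap / label_len
    return 2 * precision * recall / (precision + recall)


def rouge1_f1(pred: str, label: str) -> float:
    """Unigram overlap, counting each word at most as often as it appears in both."""
    p, y = pred.split(), label.split()
    overlap = sum((Counter(p) & Counter(y)).values())
    return f1(overlap, len(p), len(y))


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, w in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == w else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(pred: str, label: str) -> float:
    """F-measure based on the longest common subsequence of words."""
    p, y = pred.split(), label.split()
    return f1(lcs_length(p, y), len(p), len(y))


manual_r1 = rouge1_f1("the quick brown fox", "the fox is quick and brown")
manual_rl = rouge_l_f1("the quick brown fox", "the fox is quick and brown")
print(f"Manual ROUGE-1: {manual_r1:.4f}")
print(f"Manual ROUGE-L: {manual_rl:.4f}")
```

All 4 predicted words appear in the 6-word label, so precision is 1.0 and recall is 4/6, giving an F-measure of 0.8. The longest common subsequence is "the quick brown" (3 words), giving precision 3/4, recall 3/6, and an F-measure of 0.6.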


## 4. Semantic Similarity

What if two sentences mean the same thing but use different words? **Semantic Similarity** addresses this. We convert sentences into numerical vectors (embeddings) and measure the cosine similarity between them. Cosine similarity ranges from -1 to 1: a score near 1.0 means the sentences are very similar in meaning, while a score near 0.0 means they are largely unrelated.

This is great for evaluating paraphrasing or nuanced answers.

In [None]:
# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# 2. Define prediction and label sentences
labels = ["A dog is a loyal pet", "Cats are independent animals", "The sky is blue"]
preds = [
    "Dogs make great companions",
    "A cat is a solitary creature",
    "The ocean is vast",
]

# 3. Generate embeddings for each list
pred_embeddings = model.encode(preds)
label_embeddings = model.encode(labels)

# 4. Calculate cosine similarity for each pair
for i in range(len(preds)):
    # The formula is: (A dot B) / (||A|| * ||B||)
    similarity = np.dot(pred_embeddings[i], label_embeddings[i]) / (
        np.linalg.norm(pred_embeddings[i]) * np.linalg.norm(label_embeddings[i])
    )
    print(
        f"Pair {i + 1}:\n  Pred:  '{preds[i]}'\n  Label: '{labels[i]}'\n  Similarity: {similarity:.4f}\n"
    )

Pair 1:
  Pred:  'Dogs make great companions'
  Label: 'A dog is a loyal pet'
  Similarity: 0.6147

Pair 2:
  Pred:  'A cat is a solitary creature'
  Label: 'Cats are independent animals'
  Similarity: 0.6848

Pair 3:
  Pred:  'The ocean is vast'
  Label: 'The sky is blue'
  Similarity: 0.3098
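
The per-pair loop above can also be written as one vectorized operation. The sketch below uses small hand-made vectors instead of real embeddings so that it runs without downloading a model. The function name `pairwise_cosine` and the toy arrays are assumptions for illustration.

```python
import numpy as np


def pairwise_cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between matching rows of two 2-D arrays."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    # The dot product of unit vectors equals their cosine similarity
    return np.sum(a_unit * b_unit, axis=1)


# Toy 2-D "embeddings": same direction, 45 degrees apart, orthogonal, opposite
toy_preds = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0], [-1.0, 0.0]])
toy_labels = np.array([[2.0, 0.0], [1.0, 0.0], [3.0, 0.0], [1.0, 0.0]])

toy_sims = pairwise_cosine(toy_preds, toy_labels)
print(np.round(toy_sims, 4))
```

The four toy pairs score roughly 1, 0.71, 0, and -1, which covers the full range of the metric. With real embeddings, the same function could be called as `pairwise_cosine(pred_embeddings, label_embeddings)`, assuming no embedding is the zero vector.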



## 5. Functional Correctness

For code generation, we need to know if the code actually works. **Functional Correctness** evaluates this by running the generated code against a set of unit tests. The score is the proportion of tests that pass. In the cell below, `reverse_and_capitalize` stands in for a piece of model-generated code.

In [None]:
# This function is supposed to reverse and capitalize a string,
# but it has a bug: it fails if the string contains a number.
def reverse_and_capitalize(s: str) -> str:
    """Reverse and capitalize a string, with a hidden bug."""
    if any(char.isdigit() for char in s):
        return "ERROR - CONTAINS DIGITS"
    return s[::-1].upper()


# Test cases: the second input contains a digit and will trigger the bug
code_preds = ["hello", "world1", "python"]
test_labels = ["OLLEH", "1DLROW", "NOHTYP"]

# Run the generated code against the test labels
results = []
for pred_code, label in zip(code_preds, test_labels):
    output = reverse_and_capitalize(pred_code)
    print(f"Input: '{pred_code}' -> Output: '{output}', Expected: '{label}'")
    results.append(output == label)

pass_rate = sum(results) / len(results)
print(f"\nProportion of tests passed: {pass_rate:.2f}")

Input: 'hello' -> Output: 'OLLEH', Expected: 'OLLEH'
Input: 'world1' -> Output: 'ERROR - CONTAINS DIGITS', Expected: '1DLROW'
Input: 'python' -> Output: 'NOHTYP', Expected: 'NOHTYP'

Proportion of tests passed: 0.67
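
In practice the generated code arrives as a string, not as a function already defined in the notebook. The sketch below shows one way to load such a string and score it against unit tests, counting crashes as failures. `GENERATED_CODE`, `CRASHING_CODE`, `TESTS`, and `run_unit_tests` are hypothetical names, and `exec` is used here only for brevity: untrusted model output should run in a sandboxed process.

```python
from typing import List, Tuple

# Hypothetical model outputs, received as source-code strings
GENERATED_CODE = """
def reverse_and_capitalize(s):
    return s[::-1].upper()
"""

CRASHING_CODE = """
def reverse_and_capitalize(s):
    return s[::-1].upper() + 1  # TypeError: cannot add str and int
"""

TESTS: List[Tuple[str, str]] = [("hello", "OLLEH"), ("world1", "1DLROW"), ("", "")]


def run_unit_tests(code: str, func_name: str, tests: List[Tuple[str, str]]) -> float:
    """Return the fraction of tests passed; any exception counts as a failure."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # WARNING: only acceptable for trusted code
        func = namespace[func_name]
    except Exception:
        return 0.0  # code did not compile or did not define the function
    passed = 0
    for arg, expected in tests:
        try:
            passed += int(func(arg) == expected)
        except Exception:
            pass  # a crash on this test is a failed test
    return passed / len(tests)


good_rate = run_unit_tests(GENERATED_CODE, "reverse_and_capitalize", TESTS)
crash_rate = run_unit_tests(CRASHING_CODE, "reverse_and_capitalize", TESTS)
print(f"Working candidate: {good_rate:.2f}")
print(f"Crashing candidate: {crash_rate:.2f}")
```

Catching exceptions per test keeps one crash from hiding the results of the remaining tests.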


## 6. Pass@k

Sometimes we ask a model to generate multiple (`k`) possible answers for one problem. The **Pass@k** metric measures whether at least one of these `k` attempts is correct: if any sample is correct, the problem counts as a "pass" (score of 1). Over a benchmark, Pass@k is the fraction of problems that pass.

In [None]:
def pass_at_k(samples: List[str], label: str) -> int:
    """Return 1 if any sample in the list matches the label, else 0."""
    return int(any(s == label for s in samples))


# The model generated 4 possible answers for "Name a primary color."
label = "blue"
samples = ["red", "yellow", "green", "blue"]

# Check if any of the 4 samples is correct
pass_score = pass_at_k(samples, label)

print(f"Samples: {samples}")
print(f"Label: {label}")
print(f"Pass@4 Score: {pass_score}")

Samples: ['red', 'yellow', 'green', 'blue']
Label: blue
Pass@4 Score: 1
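
The function above scores a single set of exactly `k` samples. Code-generation benchmarks often draw `n >= k` samples per problem, count the `c` correct ones, and estimate the probability that a random subset of `k` samples contains at least one correct answer: $1 - \binom{n-c}{k} / \binom{n}{k}$. The sketch below implements that estimator. The function name `pass_at_k_estimate` is our own.

```python
from math import comb


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("Require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every subset of size k contains a correct one
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect
    return 1.0 - comb(n - c, k) / comb(n, k)


# Using the example above: 4 samples, 1 of them correct
for k in (1, 2, 4):
    print(f"pass@{k}: {pass_at_k_estimate(n=4, c=1, k=k):.2f}")
```

With 4 samples and 1 correct, this gives 0.25 for `k=1`, 0.50 for `k=2`, and 1.00 for `k=4`, so the last value agrees with the Pass@4 score computed above.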


## 7. LLM-as-a-Judge

For complex, subjective tasks (like creativity or helpfulness), we can use another powerful LLM to act as a judge. We provide the judge with the prediction, the reference (if any), and a detailed **rubric**. The judge then provides a score and reasoning.

**Note**: The cell below calls a real model through `litellm`, so it needs the API key configured at the top of this notebook.

In [14]:
# This is our rubric for the judge.
RUBRIC = """
Score 1.0 if the predicted animal is the same as the label.
Score 0.5 if the prediction is a different animal but from the same biological class (e.g., both are mammals).
Score 0.0 otherwise (e.g., a mammal and a reptile).
"""


from typing import Optional


# Ask an LLM judge to score a prediction against the rubric.
# The score is parsed from the "FINAL SCORE: <score>" line of the judge's reply.
def llm_as_judge(pred: str, label: str, rubric: str) -> Optional[float]:
    """Score an animal prediction with an LLM judge; return None if no score is found."""
    from litellm import completion
    import re

    resp = completion(
        model="gpt-5-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert judge that scores answers based on a given rubric. "
                    "You respond with a reason, and then FINAL SCORE: <score>."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Rubric:\n{rubric}\n\nPrediction: {pred}\nLabel: {label}\n\nWhat score should be assigned"
                    " based on the rubric? Respond with scores in the rubric."
                ),
            },
        ],
    )

    resp_text = resp.choices[0].message.content
    print(f"LLM Judge Response:\n{resp_text}\n")

    pattern = r"FINAL SCORE:\s*([0-9]*\.?[0-9]+)"
    match = re.search(pattern, resp_text)
    if match:
        score = float(match.group(1))
        return score
    else:
        print(
            "Warning: Could not find FINAL SCORE in the response. Defaulting to None."
        )
        return None


# --- Test Case 1: Perfect Match ---
score1 = llm_as_judge(pred="Lion", label="Lion", rubric=RUBRIC)
print(f"--> Final Score: {score1}\n")

# --- Test Case 2: Same Class ---
score2 = llm_as_judge(pred="Tiger", label="Lion", rubric=RUBRIC)
print(f"--> Final Score: {score2}\n")

# --- Test Case 3: Different Class ---
score3 = llm_as_judge(pred="Snake", label="Lion", rubric=RUBRIC)
print(f"--> Final Score: {score3}\n")

LLM Judge Response:
Reason: The prediction "Lion" exactly matches the label "Lion," so it meets the criterion for a full match.

FINAL SCORE: 1.0

--> Final Score: 1.0

LLM Judge Response:
Reason: Tiger and lion are different animals, but both are mammals (both big cats), so they fall under the "different animal but same biological class" case.

FINAL SCORE: 0.5

--> Final Score: 0.5

LLM Judge Response:
The prediction "Snake" is a reptile while the label "Lion" is a mammal, so they are different biological classes and do not match the same animal - score 0.0.

FINAL SCORE: 0.0

--> Final Score: 0.0
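
If no API key is available, a rule-based stand-in can mimic the rubric offline, which is also handy for testing the surrounding pipeline. The sketch below is a mock and not an LLM: `ANIMAL_CLASS` is a small hypothetical lookup table, and `mock_judge` is a name we introduce here.

```python
from typing import Optional

# Hypothetical lookup table; a real judge would rely on the model's knowledge
ANIMAL_CLASS = {
    "lion": "mammal",
    "tiger": "mammal",
    "snake": "reptile",
    "eagle": "bird",
}


def mock_judge(pred: str, label: str) -> Optional[float]:
    """Apply the animal rubric with fixed rules; return None for unknown animals."""
    p, y = pred.strip().lower(), label.strip().lower()
    if p == y:
        return 1.0  # same animal
    if p not in ANIMAL_CLASS or y not in ANIMAL_CLASS:
        return None  # cannot decide without class information
    return 0.5 if ANIMAL_CLASS[p] == ANIMAL_CLASS[y] else 0.0


for pred_animal in ["Lion", "Tiger", "Snake", "Dragon"]:
    print(f"{pred_animal} vs Lion -> {mock_judge(pred_animal, 'Lion')}")
```

Returning `None` for unknown inputs mirrors the real judge above, which returns `None` when no score can be parsed, so downstream code can handle both cases the same way.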



## Conclusion

That's a quick tour of several key evaluation methods! Each one has its strengths and is suited for different types of tasks. Choosing the right metric is crucial for understanding your model's performance and making improvements.
