# Exercise: GenAI Evaluation Medley

Let's practice exact evaluation, AI-as-judge mechanics, and benchmarking by completing small, focused coding tasks.

## Outline

We will cover the following evaluation techniques:
1. Exact Match: Implement a function to compute the exact match score between predicted and reference answers.
2. Lexical Similarity: Calculate ROUGE scores to assess the overlap between predicted and reference texts.
3. Semantic Similarity: Use embeddings to compute cosine similarity between predicted and reference texts.
4. Functional Correctness: Evaluate code generation by executing predicted code and comparing outputs.
5. Pass@K: Implement the Pass@K metric.
6. LLM-as-a-Judge or AI-as-a-Judge: Use a language model to evaluate the quality of predictions based on a rubric.

## Setup
Now we import standard libraries used across exercises and set basic configuration.

In [None]:
# Student Task: Set up the OpenAI API key and base URL from environment variables
# TODO: If using Vocareum, set the API key directly in the code below

import litellm
import os
import numpy as np

if os.getenv("OPENAI_API_KEY"):
    litellm.openai_key = os.getenv("OPENAI_API_KEY")

# If using Vocareum, you can also set your API key here directly
# Uncomment and replace the string with your actual Vocareum API key
# litellm.openai_key = "voc-**********"

if (litellm.openai_key or "").startswith("voc-"):
    litellm.api_base = "https://openai.vocareum.com/v1"
    print("Using Vocareum OpenAI API base URL")

Using Vocareum OpenAI API base URL


## Exact Match (EM)
Let's compute exact-match accuracy after simple normalization (lowercase and trim).

In [105]:
# Student task: Implement exact_match and compute EM
# TODO: Complete the sections marked with **********

preds = ["Lima", "ayacucho", "Cusco", "Arequipa"]
labels = ["lima", "Ayacucho", "Cusco", "Trujillo"]


def normalize(s: str) -> str:
    """Normalize the string by lowercasing and stripping whitespace."""
    return s.lower().strip()


def exact_match(pred: str, label: str) -> int:
    # return 1 if normalized strings are identical, else 0
    return_value = "**********"

    # <<< START SOLUTION SECTION
    return_value = int(normalize(pred) == normalize(label))
    # >>> END SOLUTION SECTION

    return return_value


em_scores = [exact_match(p, l) for p, l in zip(preds, labels)]
em = sum(em_scores) / len(em_scores)
print("EM:", em)

assert em == 0.75, (
    f"EM should be 0.75, but got {em}. Please check your exact_match function."
)


EM: 0.75
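
Lowercasing and trimming still count answers as wrong when they differ only in punctuation or articles. Here is a minimal sketch of a stricter normalizer, assuming SQuAD-style normalization suits your task. The names `normalize_strict` and `exact_match_strict` are hypothetical and not part of the exercise.

```python
import re
import string


def normalize_strict(s: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match_strict(pred: str, label: str) -> int:
    # 1 if the strictly normalized strings are identical, else 0
    return int(normalize_strict(pred) == normalize_strict(label))


print(exact_match_strict("The city of Lima.", "city of lima"))  # 1
print(exact_match_strict("Lima", "Cusco"))  # 0
```

Whether to strip articles and punctuation is a design choice: it raises EM for harmless surface differences, but it can also hide real errors in tasks where punctuation matters.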


## Lexical Similarity (ROUGE)

Let's compute ROUGE scores using the `evaluate` library.

Read more at: https://huggingface.co/spaces/evaluate-metric/rouge/blob/main/README.md

In [106]:
# Student task: Compute ROUGE scores with the evaluate library
# Complete the sections with **********


# Define candidate and reference texts
pred = "The capital of Peru is Lima"
label = "Lima is the capital of Peru"


# Import the evaluate library
# **************

# Load the ROUGE metric
# **************

# Compute ROUGE scores
# **************

# <<< START SOLUTION SECTION

# Import the evaluate library
from evaluate import load

# Load the ROUGE metric
rouge = load("rouge")

# Compute ROUGE scores
results = rouge.compute(predictions=[pred], references=[label])


# >>> END SOLUTION SECTION

assert isinstance(results, dict), (
    f"Results should be a dictionary, but got {type(results)}. See the evaluate library documentation for ROUGE usage."
)
keys = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
for key in keys:
    assert key in results, (
        f"Missing key '{key}' in results. Expected keys: {keys}. See the evaluate library documentation for ROUGE usage."
    )

results


{'rouge1': np.float64(1.0),
 'rouge2': np.float64(0.6),
 'rougeL': np.float64(0.6666666666666666),
 'rougeLsum': np.float64(0.6666666666666666)}
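
To see where the `rougeL` value comes from, here is a minimal from-scratch sketch of ROUGE-L as an F1 score over the longest common subsequence (LCS). It assumes lowercase whitespace tokenization, which is a simplification: the `evaluate` implementation uses its own tokenizer, so results may differ on text with punctuation. The names `lcs_length` and `rouge_l_f1` are hypothetical helpers.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l_f1(pred: str, label: str) -> float:
    """ROUGE-L F1 with simple whitespace tokenization (an assumption)."""
    pred_tokens = pred.lower().split()
    label_tokens = label.lower().split()
    lcs = lcs_length(pred_tokens, label_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(label_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("The capital of Peru is Lima", "Lima is the capital of Peru")
print(round(score, 4))  # 0.6667
```

The LCS here is "the capital of peru" (4 tokens out of 6 on each side), so precision and recall are both 4/6.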

## Semantic Similarity using Cosine Similarity

We'll use the `sentence-transformers` library to compute semantic similarity between predicted and reference sentences. The model "all-MiniLM-L6-v2" is lightweight enough to run on a CPU, so no GPU is required.

Read more here: https://sbert.net/docs/quickstart.html

In [107]:
from sentence_transformers import SentenceTransformer

# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# 2. Some example sentences
sentences = [
    "Hi there!",
    "This is a test sentence.",
]

# 3. Generate embeddings
embeddings = model.encode(sentences)

# 4. Verify we have 2 embeddings of dimension 384 each
assert embeddings.shape == (2, 384)


In [108]:
# Student task: Write a semantically different prediction sentence and compute embeddings
# Complete the sections with **********

labels = ["Cusco is in Peru", "Ayacucho is a region", "Trujillo beaches are marvelous"]
preds = [
    "Peru includes Cusco",
    "Ayacucho is a department",
    # Write a sentence that is very semantically different from the corresponding label
    # "***********"
    # <<< START SOLUTION SECTION
    "Turquoise is a marvelous color",
    # >>> END SOLUTION SECTION
]


# Get the embeddings for each sentence
pred_embeddings = "**********"
label_embeddings = "**********"

# <<< START SOLUTION SECTION
pred_embeddings = model.encode(preds)
label_embeddings = model.encode(labels)
# >>> END SOLUTION SECTION

assert pred_embeddings.shape == (3, 384), (
    f"Expected shape (3, 384), got {pred_embeddings.shape}"
)
assert label_embeddings.shape == (3, 384), (
    f"Expected shape (3, 384), got {label_embeddings.shape}"
)

pred_embeddings, label_embeddings

(array([[ 0.06145667, -0.06237286, -0.03735719, ..., -0.00960599,
          0.03519065, -0.01402609],
        [-0.01852836, -0.03180068, -0.07411851, ..., -0.00197019,
          0.01199768,  0.01130261],
        [-0.01416574,  0.04564881,  0.05952034, ..., -0.03004984,
         -0.02417881, -0.04183076]], dtype=float32),
 array([[ 0.06701853, -0.04063964, -0.06178867, ...,  0.01089184,
         -0.01366578, -0.02568761],
        [ 0.0846475 ,  0.00272666, -0.06455816, ...,  0.04696646,
         -0.06039636, -0.00335864],
        [ 0.03886199, -0.02831236, -0.02234175, ...,  0.00904637,
         -0.02847461, -0.0095209 ]], dtype=float32))

In [110]:
# Calculate the cosine similarity for each pair of embeddings
# No changes needed in this cell, but if it fails, check the above cell

cosine_similarity = [
    # Cosine similarity for two vectors a and b is defined as:
    # cos_sim(a, b) = (a . b) / (||a|| * ||b||)
    # where (a . b) is the dot product of a and b,
    # and ||a|| and ||b|| are the magnitudes (norms) of vectors a and b respectively.
    float(
        np.dot(pred_embeddings[i], label_embeddings[i])
        / np.linalg.norm(pred_embeddings[i])
        / np.linalg.norm(label_embeddings[i])
    )
    for i in range(len(preds))
]

# Print the cosine similarity for each prediction/label pair
for i, (p, l, cos_sim) in enumerate(zip(preds, labels, cosine_similarity)):
    print(f"Pair {i + 1}:")
    print(f"  Pred: {p}")
    print(f"  Label: {l}")
    print(f"  Cosine Similarity: {cos_sim:.4f}\n")

# Check that the last pair has the lowest similarity
assert cosine_similarity[-1] < cosine_similarity[0], (
    "The last pair should have the lowest cosine similarity. Please check your prediction sentence."
)
assert cosine_similarity[-1] < cosine_similarity[1], (
    "The last pair should have the lowest cosine similarity. Please check your prediction sentence."
)


Pair 1:
  Pred: Peru includes Cusco
  Label: Cusco is in Peru
  Cosine Similarity: 0.9358

Pair 2:
  Pred: Ayacucho is a department
  Label: Ayacucho is a region
  Cosine Similarity: 0.7663

Pair 3:
  Pred: Turquoise is a marvelous color
  Label: Trujillo beaches are marvelous
  Cosine Similarity: 0.2680
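
The per-pair loop above can also be written as one vectorized NumPy expression. This is a minimal sketch using toy 2-D vectors in place of real embeddings, so it runs without downloading a model. The name `pairwise_cosine` and the vector values are illustrative assumptions.

```python
import numpy as np


def pairwise_cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two arrays of shape (n, d)."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    # The dot product of unit vectors is the cosine of the angle between them
    return np.sum(a_unit * b_unit, axis=1)


# Toy vectors standing in for embeddings (illustrative values, not model output)
a = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0]])
b = np.array([[2.0, 0.0], [1.0, 0.0], [0.0, 3.0]])

sims = pairwise_cosine(a, b)
print(sims)  # approximately 1.0, 0.707, and 0.0
```

With real embeddings you would pass `pred_embeddings` and `label_embeddings` instead of the toy arrays.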



## Functional Correctness
Let's evaluate code generation by running a tiny function against unit tests (execution accuracy).

In [112]:
# Student task: Complete the evaluation of the sort_and_normalize function
# Complete the sections with **********


def sort_and_normalize(s: str) -> str:
    """Sort the words in the string"""

    # Our toy function will fail on this edge case
    if "armadillo" in s:
        s = s.replace("armadillo", "kitty")

    return " ".join(sorted(s.split()))


preds = [
    "the capybara is the largest rodent",
    "an armadillo has a hard shell",
    "elephants are the largest land animals",
]
labels = [
    "capybara is largest rodent the the",
    "a an armadillo hard has shell",
    "animals are elephants land largest the",
]

# Write tests to check if sort_and_normalize works correctly
results = [
    # "**********"
    # <<< START SOLUTION SECTION
    sort_and_normalize(p) == l
    # >>> END SOLUTION SECTION
    for p, l in zip(preds, labels)
]

print("Proportion of tests passed:", sum(results) / len(results))

assert sum(results) == 2, (
    f"2 tests should pass, but got {sum(results)}. Please check how you are evaluating the results."
)

Proportion of tests passed: 0.6666666666666666
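
In a real code-generation evaluation, the function under test arrives as source text from a model rather than as a Python function you wrote. Here is a minimal sketch of that flow: execute the candidate source in a separate namespace and count passing test cases. The name `run_candidate` and the candidate string are hypothetical. Note that `exec` on untrusted code is unsafe, so this assumes you run it inside a sandbox.

```python
def run_candidate(source: str, func_name: str, cases: list[tuple[str, str]]) -> float:
    """Execute candidate source and return the fraction of test cases passed."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # WARNING: only run untrusted code in a sandbox
        func = namespace[func_name]
    except Exception:
        return 0.0  # code that fails to load passes no tests

    passed = 0
    for arg, expected in cases:
        try:
            passed += int(func(arg) == expected)
        except Exception:
            pass  # a runtime error counts as a failed test
    return passed / len(cases)


# Hypothetical model output, given as source text
candidate = '''
def sort_words(s):
    return " ".join(sorted(s.split()))
'''
cases = [
    ("the capybara is the largest rodent", "capybara is largest rodent the the"),
    ("an armadillo has a hard shell", "a an armadillo hard has shell"),
]
print(run_candidate(candidate, "sort_words", cases))  # 1.0
```

Treating load errors and runtime errors as failures, rather than letting them crash the harness, keeps one bad sample from stopping the whole evaluation run.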


## Pass@k

Let's simulate multiple samples for a single task and compute pass@k, which is 1 if any of the k samples equals the gold label and 0 otherwise.

In [115]:
# Student task: Implement pass_at_k
# Complete the sections with **********

label = "Lima"
samples = ["Lima", "Arequipa", "Cusco", "Lima"]


# Implement pass_at_k with signature (samples: List[str], label: str) -> int
# **********


# <<< START SOLUTION SECTION
from typing import List


def pass_at_k(samples: List[str], label: str) -> int:
    return int(any(s == label for s in samples))


# >>> END SOLUTION SECTION

print("pass@4 =", pass_at_k(samples, label))

assert pass_at_k(samples, label) == 1, (
    f"pass@4 should be 1, but got {pass_at_k(samples, label)}. Please check your pass_at_k function."
)


pass@4 = 1
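
The function above checks all drawn samples at once, so k equals the number of samples. When you draw n samples and want pass@k for a smaller k, a commonly used unbiased estimator (popularized by the HumanEval benchmark) is 1 - C(n-c, k) / C(n, k), where c is the number of correct samples. Here is a minimal sketch; the name `pass_at_k_estimate` is hypothetical.

```python
from math import comb


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every subset of size k contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


samples = ["Lima", "Arequipa", "Cusco", "Lima"]
n = len(samples)
c = sum(s == "Lima" for s in samples)

for k in (1, 2, 4):
    print(f"pass@{k} = {pass_at_k_estimate(n, c, k):.4f}")
# pass@1 = 0.5000
# pass@2 = 0.8333
# pass@4 = 1.0000
```

With 2 correct samples out of 4, a single random draw succeeds half the time, which is why pass@1 is 0.5 even though pass@4 is 1.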


## LLM as a Judge

Let's create a function that calls an LLM to compare predicted values and reference values (if applicable) and return a score.

In [118]:
# Student task: Complete the LLM-as-a-judge function
# Complete the sections with **********


def llm_as_judge(pred: str, rubric: str, label: str | None = None) -> float:
    """Use an LLM to judge the quality of a prediction against a rubric and optional label."""
    from litellm import completion

    # Write a system prompt that instructs the LLM to use the rubric to score the prediction
    # The response should be formatted as:
    # <reasoning>...</reasoning>
    # <score>FLOAT_ANSWER</score>
    # where FLOAT_ANSWER is a float between 0 and 1.
    # We will extract FLOAT_ANSWER from the response later

    SYSTEM_PROMPT = "**********"

    # <<< START SOLUTION SECTION
    SYSTEM_PROMPT = f"""You are an expert evaluator. Use the following rubric to score the prediction.
    Format your response as:
    <reasoning>...</reasoning>
    <score>FLOAT_ANSWER</score>

    where FLOAT_ANSWER is a float between 0 and 1.

    RUBRIC:
    {rubric}
    """
    # >>> END SOLUTION SECTION

    # Create a user prompt with the prediction and, optionally, the label
    # **********

    # <<< START SOLUTION SECTION
    USER_PROMPT = f"Prediction: {pred}\n"
    if label is not None:
        USER_PROMPT += f"Label: {label}\n"
    # >>> END SOLUTION SECTION

    # Call the LLM using litellm with the system and user prompts (use the model gpt-5-nano)
    # See: https://github.com/BerriAI/litellm

    # response = completion(**********)

    # <<< START SOLUTION SECTION
    response = completion(
        model="gpt-5-nano",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_PROMPT},
        ],
    )
    # >>> END SOLUTION SECTION

    text_response = response["choices"][0]["message"]["content"]
    print("LLM response:", text_response)

    # Extract FLOAT_ANSWER from the response

    # float_answer = **********

    # <<< START SOLUTION SECTION
    float_answer = float(
        text_response.split("<score>")[-1].split("</score>")[0].strip()
    )
    # >>> END SOLUTION SECTION

    return float_answer


# Write a rubric for evaluating if the prediction is the capital of the label country
# 1.0 if correct, 0.5 if a city in the same country, 0.0 otherwise

# **********

# <<< START SOLUTION SECTION
RUBRIC = """
* Return 1.0 if the prediction is the capital of the label value,
* Otherwise, 0.5 if the prediction is a city in the same country as the label value,
* Otherwise, return 0.0
"""
# >>> END SOLUTION SECTION

assert (
    llm_as_judge(
        pred="Manila",
        label="Philippines",
        rubric=RUBRIC,
    )
    == 1.0
), "Manila is the capital of the Philippines"

assert (
    llm_as_judge(
        pred="Cebu",
        label="Philippines",
        rubric=RUBRIC,
    )
    == 0.5
), "Cebu is a city in the Philippines, but not the capital"

assert (
    llm_as_judge(
        pred="Tokyo",
        label="Philippines",
        rubric=RUBRIC,
    )
    == 0.0
), "Tokyo is not in the Philippines"


LLM response: <reasoning>Manila is the capital city of the Philippines, which exactly matches the label value. Therefore, the prediction is correct.</reasoning>
<score>1.0</score>
LLM response: <reasoning> The label is Philippines (a country). The prediction Cebu is a city in the Philippines but not the capital (Manila). Therefore it matches the country but not the capital, yielding a score of 0.5. </reasoning>
<score>0.5</score>
LLM response: <reasoning>Tokyo is not the capital of the Philippines (the capital is Manila). It is not in the same country as the label, so the prediction is incorrect.</reasoning>
<score>0.0</score>

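The split-based parsing in `llm_as_judge` raises a `ValueError` when the model omits the `<score>` tag or puts non-numeric text inside it. Here is a minimal sketch of a more defensive parser that returns `None` instead, so the caller can retry or skip the example. The name `parse_score` is hypothetical, and rejecting out-of-range scores is an assumption about the desired behavior.

```python
import re
from typing import Optional


def parse_score(text: str) -> Optional[float]:
    """Return the score inside <score>...</score>, or None if absent or invalid."""
    match = re.search(r"<score>\s*([-+]?\d*\.?\d+)\s*</score>", text)
    if match is None:
        return None
    score = float(match.group(1))
    # The rubric asks for a value between 0 and 1, so reject anything else
    return score if 0.0 <= score <= 1.0 else None


print(parse_score("<reasoning>ok</reasoning>\n<score>0.5</score>"))  # 0.5
print(parse_score("I cannot score this."))  # None
print(parse_score("<score>7</score>"))  # None
```

Returning `None` makes malformed judge outputs visible as a separate failure mode, so they are not silently mixed in with low scores.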

Congrats! You have completed the evaluation exercise. Proper evaluation is the bedrock for building reliable AI systems. Great job! 👏👏👏