# Unit 1

## Introduction to LLM Benchmarking & Basic QA Evaluation

# Introduction to LLM Benchmarking

-----

Welcome to the first lesson of our course, "Benchmarking LLMs with QA." In this lesson, we will explore the fundamentals of benchmarking large language models (LLMs). **Benchmarking** is the process of evaluating the performance of a system or component by comparing it against a set of predefined standards or datasets. It is crucial for understanding how well a model performs on various tasks, identifying its strengths and weaknesses, and guiding improvements.

## Why Benchmarking?

Benchmarking is essential for the development and refinement of LLMs, as it provides a systematic way to measure their capabilities and progress over time. By using standardized datasets and evaluation metrics, we can objectively assess the performance of different models and make informed decisions about their deployment and further development.

Some common types of LLM benchmarks include:

  * **Factual QA** (like TriviaQA, SQuAD)
  * **Multiple-choice reasoning** (like MMLU, ARC)
  * **Truthfulness & bias detection** (like TruthfulQA)
  * **Perplexity-based evaluation** (language fluency prediction)
  * **Semantic similarity** (embedding-based matching)
  * **Domain-specific tests** (custom internal benchmarks)

In this course, we’ll begin with factual QA benchmarks before expanding to other types in later lessons.

## Working with the TriviaQA Dataset

We will use the **TriviaQA** dataset, which contains a large collection of real-world question-answer pairs gathered from trivia websites. While TriviaQA is not a multiple-choice dataset, it is well-suited for evaluating factual question-answering capabilities.

For simplicity and performance, we’ve pre-selected and stored a 100-example subset for you, available at:

`triviaqa.csv`

This subset contains pairs of factual questions and short answers. Here are a few sample entries from the dataset:

```text
Question: What is the capital of France?
Answer: Paris

Question: In which year did the Titanic sink?
Answer: 1912
```

## Setting Up the Environment

Before we dive into the code, let's ensure that your environment is ready. For this lesson, you will need the `openai` and `csv` libraries. If you are working on your local machine, you can install the `openai` library using pip:

```bash
pip install openai
```

The `csv` module is part of Python's standard library, so no additional installation is needed. If you are using the CodeSignal environment, the `openai` library is already pre-installed, so you can focus on the lesson without worrying about setup.

## Loading and Understanding the TriviaQA Dataset

To load the dataset, we’ll use Python’s built-in `csv` module. Here is how you can read it:

```python
import csv

with open("triviaqa.csv") as f:
    qa_pairs = list(csv.DictReader(f))
```

This code opens the `triviaqa.csv` file and reads its contents into a list of dictionaries, where each dictionary represents a question-answer pair. Understanding the structure of this dataset is crucial, as it will be the basis for our evaluation.
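
To make the resulting structure concrete, here is a minimal sketch that parses the two sample entries shown earlier from an in-memory string instead of the real file. It assumes the CSV has a header row with `question` and `answer` columns, which matches the keys used later in this lesson; the exact layout of `triviaqa.csv` may differ, and `sample_csv` and `sample_pairs` are illustrative names.

```python
import csv
import io

# Stand-in for the contents of triviaqa.csv (assumed layout: a header row
# followed by one question-answer pair per line).
sample_csv = (
    "question,answer\n"
    "What is the capital of France?,Paris\n"
    "In which year did the Titanic sink?,1912\n"
)

# csv.DictReader uses the header row as the dictionary keys.
sample_pairs = list(csv.DictReader(io.StringIO(sample_csv)))

for pair in sample_pairs:
    print(f"Question: {pair['question']}")
    print(f"Answer: {pair['answer']}\n")
```

Note that every value comes back as a string, so the answer `1912` is the text `"1912"`, not an integer.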

## Implementing Normalized Match Evaluation

To evaluate the performance of an LLM, we will use a technique called **normalized match**. This involves normalizing both the model's response and the correct answer and then checking whether the two results are identical. Normalization removes discrepancies due to capitalization, punctuation, and spacing.

Let's look at the code that implements this evaluation:

```python
import re
from openai import OpenAI

# Initialize the OpenAI client
client = OpenAI()

def normalize(text):
    return re.sub(r'[^a-z0-9]', '', text.lower())

correct = 0
for q in qa_pairs:
    prompt = f"Answer the following question with a short and direct fact:\n\n{q['question']}"
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content.strip()
    if normalize(q['answer']) == normalize(response):
        correct += 1
```

In this code, the `normalize` function converts the text to lowercase and then removes every character other than `a-z` and `0-9`, including spaces. This ensures that the comparison between the model's response and the correct answer is fair and consistent. We then iterate over each question-answer pair, generate a response using the `openai` library, and increment `correct` whenever the normalized texts are identical.
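
Normalized match is strict: the whole response must equal the reference once both are normalized. The short sketch below reuses the `normalize` function on a few made-up responses (illustrative strings, not real model output) to show which variations it tolerates and which it does not.

```python
import re

def normalize(text):
    # Same function as above: lowercase, then drop everything except a-z and 0-9.
    return re.sub(r'[^a-z0-9]', '', text.lower())

reference = "Paris"
# Hypothetical responses, invented for illustration.
candidate_responses = [
    "Paris",
    "paris.",
    " PARIS! ",
    "The capital of France is Paris.",
]

for response in candidate_responses:
    is_match = normalize(response) == normalize(reference)
    print(f"{response!r:40} match={is_match}")
```

The first three responses match the reference, while the full sentence does not, even though it contains the right answer. This is why the prompt asks the model for a short and direct fact.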

## Example: Calculating Normalized Accuracy

The evaluation loop above only counts matches in `correct`; we will not calculate the normalized accuracy just yet. The results with the current setup may not be optimal, since even a correct but wordy response counts as a miss. In the next unit, we will explore techniques like one-shot or few-shot learning to improve the model's performance, and you will see how much the measured result depends on how the model is prompted.
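
For reference, once such a loop has finished, normalized accuracy is the number of matches divided by the number of questions. The sketch below shows that arithmetic offline: `normalized_accuracy`, `canned_responses`, and `ask_model` are hypothetical names, the canned responses stand in for the API call, and the two questions are the sample entries from earlier, so the resulting number says nothing about a real model.

```python
import re

def normalize(text):
    return re.sub(r'[^a-z0-9]', '', text.lower())

def normalized_accuracy(qa_pairs, ask_model):
    """Fraction of questions whose normalized response equals the normalized answer."""
    if not qa_pairs:
        return 0.0
    correct = sum(
        normalize(ask_model(q["question"])) == normalize(q["answer"])
        for q in qa_pairs
    )
    return correct / len(qa_pairs)

# Hypothetical stand-in for the model: canned responses instead of an API call.
canned_responses = {
    "What is the capital of France?": "Paris.",
    "In which year did the Titanic sink?": "The Titanic sank in 1912.",
}

def ask_model(question):
    return canned_responses[question]

sample_pairs = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "In which year did the Titanic sink?", "answer": "1912"},
]

accuracy = normalized_accuracy(sample_pairs, ask_model)
print(f"Normalized accuracy: {accuracy:.0%}")
```

Passing the model call in as a function keeps the scoring logic separate from the API, so the same function could later be given a wrapper around `client.chat.completions.create`.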

## Summary and Next Steps

In this lesson, we introduced the concept of LLM benchmarking and demonstrated how to evaluate a language model using the TriviaQA dataset. We covered the setup of the environment, loading the dataset, and implementing a normalized match evaluation. Although we have not calculated the normalized accuracy yet, we will explore more advanced techniques in the next unit to improve the model's performance. As you move forward, practice these concepts with the exercises provided. These will reinforce your understanding and prepare you for more advanced evaluation techniques in future lessons. Remember, benchmarking is a powerful tool in improving language models, and mastering it will enhance your skills in working with LLMs.



## Loading and Exploring TriviaQA Dataset

Now that you've learned about the importance of benchmarking LLMs, let's get hands-on with the TriviaQA dataset! In this exercise, you'll take your first step toward implementing a benchmark by loading and exploring the dataset.

Your task is to:

  * Open the TriviaQA dataset file at "triviaqa.csv".
  * Load the data using Python's csv module as a list of dictionaries.
  * Print the first three question-answer pairs in a readable format.

This practical experience with dataset handling will build the foundation for the evaluation techniques we'll explore next. Understanding your benchmark data is the first crucial step in any effective LLM evaluation process.

```python
# Loading and exploring the TriviaQA dataset

import csv

# TODO: Open the TriviaQA dataset file and read it using csv.DictReader
# The file is located at "triviaqa.csv"


# TODO: Print the first 3 question-answer pairs from the dataset
# Format each pair as "Question: [question]" and "Answer: [answer]"
# Add an empty line between pairs for better readability


```

```python
# Loading and exploring the TriviaQA dataset

import csv

# Open the TriviaQA dataset file and read it using csv.DictReader
with open("triviaqa.csv") as f:
    qa_pairs = list(csv.DictReader(f))

# Print the first 3 question-answer pairs from the dataset
# Format each pair as "Question: [question]" and "Answer: [answer]"
# Add an empty line between pairs for better readability
for i in range(3):
    print(f"Question: {qa_pairs[i]['question']}")
    print(f"Answer: {qa_pairs[i]['answer']}\n")
```

## Text Normalization for Fair Comparisons

Now that you've successfully loaded the TriviaQA dataset, let's dive into a key component of LLM evaluation: text normalization! In benchmarking, we need to compare model outputs with reference answers fairly, regardless of capitalization or punctuation differences.

Your task is to:

  * Review the provided normalize function, which converts text to lowercase and removes non-alphanumeric characters.
  * Test this function with the provided example strings that contain various formatting styles.
  * Print both the original and normalized versions of each test string.
  * Observe how different text formats are standardized to a common form.

Understanding text normalization will help you build more robust evaluation systems that focus on content rather than formatting when comparing LLM outputs to reference answers.

```python
# Implementing and testing the normalize function for LLM evaluation

import re

# TODO: Review this normalize function and make sure it correctly converts text 
# to lowercase and removes all non-alphanumeric characters
def normalize(text):
    """
    Normalizes text by converting to lowercase and removing all non-alphanumeric characters.
    This helps in fair comparison of LLM outputs with reference answers.
    """
    return re.sub(r'[^a-z0-9]', '', text.lower())

# Test cases with varied punctuation, spaces, and capitalization
test_strings = [
    "Paris",
    "Paris, France",
    "PARIS",
    "paris",
    "1912",
    "Year 1912!",
    "Albert Einstein",
    "Albert   Einstein",
    "albert-einstein",
    "The Pacific Ocean",
    "42.195 kilometers",
    "H2O (water)"
]

# TODO: Implement code to test the normalize function with each string
# For each test string, print both the original and normalized versions

```

```python
# Implementing and testing the normalize function for LLM evaluation

import re

# The normalize function converts text to lowercase
# and removes all non-alphanumeric characters
def normalize(text):
    """
    Normalizes text by converting to lowercase and removing all non-alphanumeric characters.
    This helps in fair comparison of LLM outputs with reference answers.
    """
    return re.sub(r'[^a-z0-9]', '', text.lower())

# Test cases with varied punctuation, spaces, and capitalization
test_strings = [
    "Paris",
    "Paris, France",
    "PARIS",
    "paris",
    "1912",
    "Year 1912!",
    "Albert Einstein",
    "Albert   Einstein",
    "albert-einstein",
    "The Pacific Ocean",
    "42.195 kilometers",
    "H2O (water)"
]

# Test the normalize function with each string:
# print both the original and normalized versions

print("--- Text Normalization Examples ---")
print("Original Text                      | Normalized Text")
print("-----------------------------------|------------------")

for s in test_strings:
    normalized_s = normalize(s)
    # Using ljust for alignment
    print(f"{s.ljust(35)}| {normalized_s}")

print("-----------------------------------|------------------")
print("\nObservations:")
print("1. All characters are converted to lowercase.")
print("2. Punctuation (commas, exclamation marks, parentheses, hyphens) and spaces are removed.")
print("3. Only alphanumeric characters (a-z, 0-9) remain.")
print("4. This process standardizes different variations of the same answer, allowing for fair comparison.")
```
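
One thing the test strings above do not exercise is non-ASCII text. Because the pattern keeps only `a-z` and `0-9`, accented letters are dropped rather than converted to their base letter. The sketch below illustrates this and shows one possible workaround using the standard-library `unicodedata` module; `normalize_unicode` is a hypothetical variant for illustration, not part of the course code.

```python
import re
import unicodedata

def normalize(text):
    return re.sub(r'[^a-z0-9]', '', text.lower())

def normalize_unicode(text):
    # Decompose accented characters (e.g. "ã" becomes "a" plus a combining
    # tilde) so the base letter survives the ASCII-only filter below.
    decomposed = unicodedata.normalize("NFKD", text)
    return re.sub(r'[^a-z0-9]', '', decomposed.lower())

# Escapes are used so the examples do not depend on the file's encoding:
# "S\u00e3o Paulo" is "São Paulo" and "Pel\u00e9" is "Pelé".
examples = ["S\u00e3o Paulo", "Sao Paulo", "Pel\u00e9"]

for s in examples:
    print(f"{s.ljust(12)}| {normalize(s).ljust(10)}| {normalize_unicode(s)}")
```

With the original function, the accented and unaccented spellings of the same name normalize to different strings, so a benchmark with many non-English answers may need a variant along these lines.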

## Comparing Answers Beyond Surface Formatting

Now that you've explored text normalization with test strings, let's put this knowledge into practice! In this exercise, you'll see how normalization helps us accurately compare answers that look different but contain the same information.

Your task is to:

  * Observe two strings that represent the same answer but with different formatting (one as an expected answer and one as a model's response).
  * Apply the normalize function to both strings.
  * Compare the normalized versions to check if they match.
  * Print the original strings, their normalized forms, and the comparison result.

This hands-on practice with string comparison demonstrates why normalization is essential for fair LLM evaluation: it helps us focus on the content rather than superficial differences in formatting when determining if an answer is correct.

```python
# Comparing answers using text normalization

import re

def normalize(text):
    """
    Normalizes text by converting to lowercase and removing all non-alphanumeric characters.
    This helps in fair comparison of LLM outputs with reference answers.
    """
    return re.sub(r'[^a-z0-9]', '', text.lower())

# Define the expected answer and model response
expected_answer = "Albert Einstein born in 1879, Germany"
model_response = "ALBERT EINSTEIN (Born in 1879, Germany)"

# TODO: Normalize both strings using the normalize function


# TODO: Compare the normalized strings and store the result in a boolean variable


# TODO: Print the original strings, their normalized versions, and whether they match

```

```python
# Comparing answers using text normalization

import re

def normalize(text):
    """
    Normalizes text by converting to lowercase and removing all non-alphanumeric characters.
    This helps in fair comparison of LLM outputs with reference answers.
    """
    return re.sub(r'[^a-z0-9]', '', text.lower())

# Define the expected answer and model response
expected_answer = "Albert Einstein born in 1879, Germany"
model_response = "ALBERT EINSTEIN (Born in 1879, Germany)"

# Normalize both strings using the normalize function
normalized_expected = normalize(expected_answer)
normalized_response = normalize(model_response)

# Compare the normalized strings and store the result in a boolean variable
answers_match = normalized_expected == normalized_response

# Print the original strings, their normalized versions, and whether they match
print("--- Answer Comparison with Normalization ---")
print(f"Original Expected Answer: '{expected_answer}'")
print(f"Original Model Response:  '{model_response}'")
print("-" * 40)
print(f"Normalized Expected: '{normalized_expected}'")
print(f"Normalized Response: '{normalized_response}'")
print("-" * 40)
print(f"Do the normalized answers match? {answers_match}")
print("\nConclusion:")
print("Although the original strings look different due to capitalization and punctuation,")
print("their normalized forms are identical. This demonstrates that text normalization is crucial")
print("for evaluating LLM outputs fairly, as it allows us to correctly identify answers that")
print("contain the same text, regardless of minor formatting variations.")
```

## Evaluating a Single LLM Response

Excellent work with text normalization! Now let's take the next step and apply what you've learned to evaluate an actual LLM response. In this exercise, you'll work with a real model to see how your normalization techniques help in fair answer comparison.

Your task is to:

  * Load the first three questions from the TriviaQA dataset.
  * Send each question to an LLM using the OpenAI API.
  * Compare the model's response with the expected answer using normalization.
  * Print the results to see how well the model performed.

This hands-on experience with a real LLM will show you how the evaluation techniques we've been discussing work in practice, bringing you one step closer to building complete benchmarking systems.

```python
# Evaluating a single LLM response against a reference answer

import csv
import re
from openai import OpenAI

def normalize(text):
    """
    Normalizes text by converting to lowercase and removing all non-alphanumeric characters.
    This helps in fair comparison of LLM outputs with reference answers.
    """
    return re.sub(r'[^a-z0-9]', '', text.lower())

# Initialize the OpenAI client
client = OpenAI()

# Load the first three questions from the TriviaQA dataset
with open("triviaqa.csv") as f:
    qa_pairs = list(csv.DictReader(f))
    questions_data = qa_pairs[:3]  # Take the first three questions

for question_data in questions_data:
    question = question_data['question']
    expected_answer = question_data['answer']

    # TODO: Create a prompt for the LLM that includes the question
    prompt = ""

    # TODO: Send the prompt to the OpenAI API and get the response
    response = ""

    # TODO: Normalize both the expected answer and the model's response

    # TODO: Check if the normalized texts match

    # TODO: Print the results including the question, expected answer, model response,
    # normalized versions of both, and whether they match
```

The completed code below requires an OpenAI API key, which is a private credential that you must supply yourself. To run it, install the OpenAI Python library (`pip install openai`) and set up your API key, for example, by setting an environment variable named `OPENAI_API_KEY`.
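
Before making API calls, it can help to confirm that the key is actually visible to Python. The following is a minimal sketch; `api_key_configured` is a hypothetical helper, and it only checks that the variable is set and non-empty, not that the key is valid.

```python
import os

def api_key_configured(env=None):
    """Return True if OPENAI_API_KEY is set to a non-empty value in the given mapping."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())

if api_key_configured():
    print("OPENAI_API_KEY is set.")
else:
    print("OPENAI_API_KEY is missing; set it before running the evaluation.")
```

The optional `env` argument exists only so the check can be tried against a plain dictionary without touching the real environment.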

```python
# Evaluating a single LLM response against a reference answer

import csv
import re
from openai import OpenAI

def normalize(text):
    """
    Normalizes text by converting to lowercase and removing all non-alphanumeric characters.
    This helps in fair comparison of LLM outputs with reference answers.
    """
    return re.sub(r'[^a-z0-9]', '', text.lower())

# Initialize the OpenAI client
# Ensure your OPENAI_API_KEY environment variable is set
client = OpenAI()

# Load the first three questions from the TriviaQA dataset
with open("triviaqa.csv") as f:
    qa_pairs = list(csv.DictReader(f))
    questions_data = qa_pairs[:3]  # Take the first three questions

print("--- LLM Evaluation Results ---")
for question_data in questions_data:
    question = question_data['question']
    expected_answer = question_data['answer']

    # Create a prompt for the LLM that includes the question
    prompt = f"Answer the following question concisely: {question}"

    # Send the prompt to the OpenAI API and get the response
    # We use a try-except block to handle potential API errors
    try:
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",  # Or another suitable model
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}
            ]
        )
        model_response = completion.choices[0].message.content
    except Exception as e:
        model_response = f"API Error: {e}"

    # Normalize both the expected answer and the model's response
    normalized_expected = normalize(expected_answer)
    normalized_response = normalize(model_response)

    # Check if the normalized texts match
    answers_match = normalized_expected == normalized_response

    # Print the results
    print("-" * 40)
    print(f"Question: {question}")
    print(f"Expected Answer: {expected_answer}")
    print(f"Model Response: {model_response}")
    print(f"Normalized Expected: {normalized_expected}")
    print(f"Normalized Response: {normalized_response}")
    print(f"Match: {'Yes' if answers_match else 'No'}")

```
