# Unit 1

## Introduction to LLM Benchmarking & Basic QA Evaluation

# Introduction to LLM Benchmarking

-----

Welcome to the first lesson of our course, "Benchmarking LLMs with QA." In this lesson, we will explore the fundamentals of benchmarking large language models (LLMs). **Benchmarking** is the process of evaluating the performance of a system or component by comparing it against a set of predefined standards or datasets. It is crucial for understanding how well a model performs across tasks, identifying its strengths and weaknesses, and guiding improvements.

## Why Benchmarking?

Benchmarking is essential for the development and refinement of LLMs, as it provides a systematic way to measure their capabilities and progress over time. By using standardized datasets and evaluation metrics, we can objectively assess the performance of different models and make informed decisions about their deployment and further development.

Some common types of LLM benchmarks include:

  * **Factual QA** (like TriviaQA, SQuAD)
  * **Multiple-choice reasoning** (like MMLU, ARC)
  * **Truthfulness & bias detection** (like TruthfulQA)
  * **Perplexity-based evaluation** (language fluency prediction)
  * **Semantic similarity** (embedding-based matching)
  * **Domain-specific tests** (custom internal benchmarks)

In this course, we’ll begin with factual QA benchmarks before expanding to other types in later lessons.

## Working with the TriviaQA Dataset

We will use the **TriviaQA** dataset, which contains a large collection of real-world question-answer pairs gathered from trivia websites. While TriviaQA is not a multiple-choice dataset, it is well-suited for evaluating factual question-answering capabilities.

For simplicity and performance, we’ve pre-selected and stored a 100-example subset for you, available at:

`triviaqa.csv`

This subset contains pairs of factual questions and short answers. Here are a few sample entries from the dataset:

```text
Question: What is the capital of France?
Answer: Paris

Question: In which year did the Titanic sink?
Answer: 1912
```

## Setting Up the Environment

Before we dive into the code, let's ensure that your environment is ready. For this lesson, you will need the `openai` library along with Python's `csv` and `re` modules. If you are working on your local machine, you can install the `openai` library using pip:

```bash
pip install openai
```

The `csv` module is part of Python's standard library, so no additional installation is needed. If you are using the CodeSignal environment, these libraries are already pre-installed, so you can focus on the lesson without worrying about setup.
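One setup detail worth knowing for local work: when `OpenAI()` is created without arguments, the client looks for an API key in the `OPENAI_API_KEY` environment variable. The sketch below is a quick pre-flight check under that assumption. The helper name `api_key_configured` is made up for this illustration, and a hosted environment may supply credentials differently.

```python
import os

def api_key_configured(env=os.environ):
    # Hypothetical helper: True if a non-empty OPENAI_API_KEY is present.
    return bool(env.get("OPENAI_API_KEY", "").strip())

if api_key_configured():
    print("OPENAI_API_KEY is set.")
else:
    print("OPENAI_API_KEY is missing; set it before creating the client.")
```

Running this before the evaluation loop lets a missing key surface as a clear message rather than as an authentication error partway through a run.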

## Loading and Understanding the TriviaQA Dataset

To load the dataset, we’ll use Python’s built-in `csv` module. Here is how you can read it:

```python
import csv

with open("triviaqa.csv", newline="", encoding="utf-8") as f:
    qa_pairs = list(csv.DictReader(f))
```

This code opens the `triviaqa.csv` file and reads its contents into a list of dictionaries, where each dictionary represents a question-answer pair. Understanding the structure of this dataset is crucial, as it will be the basis for our evaluation.
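Before evaluating anything, it helps to confirm the column names and the number of rows. The following sketch uses a small in-memory CSV as a stand-in for `triviaqa.csv`, so it runs without the file. The two rows are the sample entries shown earlier, and the `question` and `answer` column names are assumed from the evaluation code later in this lesson.

```python
import csv
import io

# Stand-in for triviaqa.csv: assumed columns, two sample rows.
sample_csv = io.StringIO(
    "question,answer\n"
    "What is the capital of France?,Paris\n"
    "In which year did the Titanic sink?,1912\n"
)

reader = csv.DictReader(sample_csv)
sample_pairs = list(reader)

print("Columns:", reader.fieldnames)
print("Rows:", len(sample_pairs))
print("First row:", sample_pairs[0])
```

The same three `print` calls can be pointed at the `qa_pairs` list loaded from the real file to check that it has the expected columns and 100 rows.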

## Implementing Normalized Match Evaluation

To evaluate the performance of an LLM, we will use a technique called **normalized match**. We normalize both the model's response and the correct answer, then check whether the two normalized strings are identical. Normalization removes discrepancies caused by letter case, whitespace, and punctuation.

Let's look at the code that implements this evaluation:

```python
import re
from openai import OpenAI

# Initialize the OpenAI client
client = OpenAI()

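def normalize(text):
    # Lowercase the text, then drop every character except a-z and 0-9
    return re.sub(r'[^a-z0-9]', '', text.lower())

# qa_pairs is the list of dictionaries loaded from triviaqa.csv earlier
correct = 0  # number of responses that match after normalization
for q in qa_pairs: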
    prompt = f"Answer the following question with a short and direct fact:\n\n{q['question']}"
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content.strip()
    if normalize(q['answer']) == normalize(response):
        correct += 1
```

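In this code, the `normalize` function converts the text to lowercase and removes every character that is not a lowercase letter or digit, including spaces and punctuation. As a result, differences such as `Paris`, `paris.`, and ` PARIS ` no longer affect the comparison. The check is still an exact match on the normalized strings, so a response that wraps the answer in a full sentence will not count as correct, which is why the prompt asks for a short and direct fact. We then iterate over each question-answer pair, generate a response using the `openai` library, and compare the normalized texts.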
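To see how this comparison behaves, here is a small standalone sketch that runs `normalize` on a few hand-written strings. These are illustrative inputs, not real model output.

```python
import re

def normalize(text):
    # Lowercase, then drop everything except a-z and 0-9.
    return re.sub(r'[^a-z0-9]', '', text.lower())

# Case, whitespace, and punctuation differences disappear.
print(normalize("Paris."))                          # paris
print(normalize(" PARIS! ") == normalize("paris"))  # True

# A full-sentence response does not match the short reference answer.
print(normalize("The capital of France is Paris.") == normalize("Paris"))  # False

# Accented letters fall outside a-z and are removed as well.
print(normalize("Pelé"))                            # pel
```

The last two cases show the limits of this metric: it is strict about extra words, and it silently drops non-ASCII letters.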

## Example: Calculating Normalized Accuracy

The loop above counts how many responses match their reference answers. Normalized accuracy is that count divided by the total number of questions. We will not calculate it just yet: the results with the current bare prompt may not be optimal, and in the next unit we will explore prompting techniques such as one-shot and few-shot prompting to improve the model's performance on this benchmark.
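For reference, the calculation itself is just the share of matching answers. Below is a minimal sketch that uses two hypothetical responses rather than real model output, so it runs without any API calls. The helper name `normalized_accuracy` is an assumption of this example.

```python
import re

def normalize(text):
    # Lowercase, then drop everything except a-z and 0-9.
    return re.sub(r'[^a-z0-9]', '', text.lower())

def normalized_accuracy(pairs):
    """Fraction of (expected_answer, model_response) pairs that match
    after normalization. Returns 0.0 for an empty input."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(normalize(expected) == normalize(response)
                  for expected, response in pairs)
    return correct / len(pairs)

# Hypothetical responses for illustration only; not real model output.
example = [
    ("Paris", "Paris."),                    # matches after normalization
    ("1912", "The Titanic sank in 1912."),  # extra words, so no match
]
print(f"Normalized accuracy: {normalized_accuracy(example):.2%}")  # 50.00%
```

In the evaluation loop, the equivalent value would be `correct / len(qa_pairs)` once the loop has finished.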

## Summary and Next Steps

In this lesson, we introduced the concept of LLM benchmarking and demonstrated how to evaluate a language model using the TriviaQA dataset. We covered the setup of the environment, loading the dataset, and implementing a normalized match evaluation. Although we have not calculated the normalized accuracy yet, we will explore more advanced techniques in the next unit to improve the model's performance. As you move forward, practice these concepts with the exercises provided. These will reinforce your understanding and prepare you for more advanced evaluation techniques in future lessons. Remember, benchmarking is a powerful tool in improving language models, and mastering it will enhance your skills in working with LLMs.



## Loading and Exploring TriviaQA Dataset

Now that you've learned about the importance of benchmarking LLMs, let's get hands-on with the TriviaQA dataset! In this exercise, you'll take your first step toward implementing a benchmark by loading and exploring the dataset.

Your task is to:

  * Open the TriviaQA dataset file at `triviaqa.csv`.
  * Load the data using Python's `csv` module as a list of dictionaries.
  * Print the first three question-answer pairs in a readable format.

This practical experience with dataset handling will build the foundation for the evaluation techniques we'll explore next. Understanding your benchmark data is the first crucial step in any effective LLM evaluation process.

```python
# Loading and exploring the TriviaQA dataset

import csv

# TODO: Open the TriviaQA dataset file and read it using csv.DictReader
# The file is located at "triviaqa.csv"


# TODO: Print the first 3 question-answer pairs from the dataset
# Format each pair as "Question: [question]" and "Answer: [answer]"
# Add an empty line between pairs for better readability


```

```python
# Loading and exploring the TriviaQA dataset

import csv

# Open the TriviaQA dataset file and read it using csv.DictReader
with open("triviaqa.csv", newline="", encoding="utf-8") as f:
    qa_pairs = list(csv.DictReader(f))

# Print the first 3 question-answer pairs from the dataset
# Format each pair as "Question: [question]" and "Answer: [answer]"
# Add an empty line between pairs for better readability
# Slicing avoids an IndexError if the file has fewer than 3 rows
for pair in qa_pairs[:3]:
    print(f"Question: {pair['question']}")
    print(f"Answer: {pair['answer']}\n")
```

## Text Normalization for Fair Comparisons

## Comparing Answers Beyond Surface Formatting

## Evaluating a Single LLM Response