# Week 6.1: Introduction to Evaluation

## Objective:
* Understand the fundamentals of LLM evaluation
* Learn how different prompt formulations affect model performance
* Compare Multi-Choice Question Answering (MCQA) vs Generative evaluation approaches
* Explore how to use lighteval for systematic evaluation
* Analyze evaluation results using metrics like loglikelihood accuracy and exact match

NOTE: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4.

## Introduction to LLM Evaluation

Evaluating Large Language Models is crucial for understanding their capabilities and limitations. In this notebook, we'll explore how different prompt formulations can dramatically affect model performance on the same task.

### Key Terminology

Before we begin, let's clarify two important evaluation paradigms we'll be using:

#### 1. **Multi-Choice Question Answering (MCQA) Evaluation**
- **What it means**: The model calculates probability scores for each predefined choice
- **How it works**: Given a question and choices A, B, C, D, the model assigns a likelihood score to each option
- **What we measure**: Whether the model assigns the highest probability to the correct choice
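- **Metric used**: `loglikelihood_acc_norm` (accuracy based on each choice's log-probability, normalized by the choice's length so that longer answers are not penalized simply for being longer)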
- **Key point**: The model doesn't generate new text - it only scores existing options
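
To make the scoring step concrete, here is a minimal sketch of picking the highest-scoring choice from per-choice log-probabilities, with and without length normalization. The log-probability values and the helper name `pick_choice` are made up for illustration, and dividing by character length is an assumption of this sketch; lighteval computes the scores from the model itself and its exact normalization may differ.

```python
def pick_choice(choices, logprobs, normalize=True):
    """Return the index of the highest-scoring choice.

    logprobs[i] is a (hypothetical) summed log-probability that the model
    assigns to choices[i] as a continuation of the prompt.
    """
    scores = []
    for choice, logprob in zip(choices, logprobs):
        # Assumption for this sketch: normalize by character length
        score = logprob / len(choice) if normalize else logprob
        scores.append(score)
    return max(range(len(scores)), key=scores.__getitem__)


choices = ["Gravity", "The gravitational pull of the Moon"]
logprobs = [-6.0, -17.0]  # made-up values, not real model output

raw_pick = pick_choice(choices, logprobs, normalize=False)
norm_pick = pick_choice(choices, logprobs, normalize=True)
print("raw pick:", choices[raw_pick])
print("normalized pick:", choices[norm_pick])
```

With these made-up numbers the raw score favors the short choice, while the normalized score favors the longer one, which is the effect normalization is meant to expose.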

#### 2. **Generative Evaluation**
- **What it means**: The model generates free-form text as its answer
- **How it works**: Given a question, the model produces its own text response token by token
- **What we measure**: Whether the generated text matches the expected answer
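- **Metric used**: `quasi_exact_match` (exact string match after the prediction and the reference are normalized, for example by lowercasing and removing punctuation and extra whitespace)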
- **Key point**: The model creates new text rather than selecting from choices
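
As a rough illustration of what "flexible string matching" means, the sketch below normalizes both strings before comparing them. The function names and the specific normalization steps are assumptions chosen for this example; they are similar in spirit to, but not a copy of, what lighteval does internally.

```python
import re
import string


def normalize_text(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def quasi_exact_match_sketch(prediction, gold):
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize_text(prediction) == normalize_text(gold))


same_answer = quasi_exact_match_sketch(" The air stays cleaner.", "the air stays cleaner")
letter_only = quasi_exact_match_sketch("B", "The air stays cleaner")
print(same_answer, letter_only)
```

The second call shows why generative scores can be low even when the model "knows" the answer: a letter such as `B` never matches a gold answer written out as full text.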

### Important Distinction
"Generative evaluation" refers to **how the model produces answers** (by generating text), not how we judge those answers. It's different from "LLM-as-a-judge" evaluation, where another LLM evaluates the quality of responses. We will cover LLM-as-a-judge evaluation in another notebook.

### Why Compare Both Approaches?
Comparing the two is useful because:
- The same model can perform very differently when asked to select the best option (MCQA) vs. generate an answer (generative)
- Different prompt formulations can favor one approach over the other
- Understanding these differences helps us design better evaluation strategies

Let's explore these concepts through hands-on examples!

## Install Dependencies

In [None]:
!pip install lighteval==0.6.2
!pip install great-tables
!pip install polars

## Import Required Libraries

In [None]:
import string
import os
from datetime import timedelta
from types import ModuleType
from ast import literal_eval

# For data visualization
from great_tables import GT
import polars as pl
import polars.selectors as cs
from datasets import load_dataset

# For evaluation
import lighteval
from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.model_config import BaseModelConfig, VLLMModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig, Doc
from lighteval.utils.utils import as_list, EnvConfig
from lighteval.utils.imports import is_accelerate_available, is_tgi_available

## Configuration

In [None]:
# Define cache directory and sample size for demonstration
cache_dir = "tmp"
max_samples = 10  # Small sample for demonstration; remove for full evaluation

## Understanding the ARC Dataset

We'll use the AI2 Reasoning Challenge (ARC) dataset for our experiments. ARC is a multiple-choice science question dataset designed to test AI systems' reasoning abilities.

Each question in ARC has:
- A question text
- Multiple choice options (typically 4)
- A correct answer key

Let's explore the dataset structure:

In [None]:
# Load and examine a sample from the ARC dataset
arc_dataset = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")
print("Sample ARC question:")
print(arc_dataset[0])
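
The prompt functions later in this notebook locate the correct answer by looking up `answerKey` in the list of choice labels. The sketch below shows that lookup on a hand-written record; the record is hypothetical and only mirrors the fields those functions read (`question`, `choices["text"]`, `choices["label"]`, `answerKey`).

```python
# Hypothetical record, shaped like the fields the prompt functions read
record = {
    "question": "Which gas do plants absorb from the air?",
    "choices": {
        "text": ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"],
        "label": ["A", "B", "C", "D"],
    },
    "answerKey": "B",
}

# Position of the correct answer within the choices list
gold_index = record["choices"]["label"].index(record["answerKey"])
gold_text = record["choices"]["text"][gold_index]
print(gold_index, gold_text)
```

Looking the key up in the label list, rather than assuming the labels are always letters in order, keeps the lookup correct for any labeling scheme.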

## Defining Our Evaluation Task

We'll create a custom task configuration that allows us to test different prompt formulations on the same dataset.

In [None]:
class ArcExplorationTask(LightevalTaskConfig):
    """Custom task configuration for exploring different ARC prompt formulations."""
    
    def __init__(self, name, prompt_function, metric):
        super().__init__(
            name=name,
            prompt_function=prompt_function,
            metric=as_list(metric),
            # It's a custom defined task
            suite=["custom"],
            # This defines our dataset and subsets
            hf_repo="allenai/ai2_arc",
            hf_subset="ARC-Challenge",
            hf_avail_splits=["train", "validation", "test"],
            evaluation_splits=["test"],
            # The few shot sample selection parameters
            few_shots_split="validation",
            few_shots_select="random", 
            # Other task parameters
            stop_sequence=[".", "\n"],
            generation_size=100,
        )

## Defining Evaluation Metrics

We'll use two key metrics to evaluate our models:

1. **Loglikelihood Accuracy (Normalized)**: For MCQA evaluation - measures if the model assigns the highest probability to the correct choice
2. **Quasi-Exact Match**: For generative evaluation - measures if the generated text matches the expected answer

In [None]:
# Define our evaluation metrics
metric_mcqa = Metrics.loglikelihood_acc_norm  # For multiple choice
metric_gen = Metrics.quasi_exact_match        # For generation

## Prompt Formulations

Now let's define different ways to present the same question to the model. Each formulation represents a different hypothesis about what might help the model perform better.

### 1. Base Formulation
The simplest possible prompt - just the question itself.

In [None]:
def arc_base(line, task_name: str = None):
    """Base prompt: just the question.
    
    Example output:
    'Cities control the amount of pollution?'
    """
    query = f"{line['question']}"
    choices = line["choices"]["text"]
    
    return Doc(
        task_name=task_name,
        query=query,
        choices=choices,
        gold_index=line["choices"]["label"].index(line["answerKey"]),
    )

### 2. Context Formulation
Add explicit "Question:" and "Answer:" labels to provide structure.

In [None]:
def arc_context(line, task_name: str = None):
    """Add context with Question/Answer structure.
    
    Example output:
    'Question: Cities control the amount of pollution?
     Answer: '
    """
    query = f"Question: {line['question']}"
    query += "\nAnswer: "
    choices = line["choices"]["text"]
    
    return Doc(
        task_name=task_name,
        query=query,
        choices=choices,
        gold_index=line["choices"]["label"].index(line["answerKey"]),
    )

### 3. Context with Choices
Include the multiple choice options directly in the prompt.

In [None]:
letters = list(string.ascii_uppercase)

def arc_context_choices(line, task_name: str = None):
    """Include choices in the prompt.
    
    Example output:
    'Question: Cities control the amount of pollution?
     A. The air stays cleaner
     B. The air becomes more polluted
     C. Cars run more efficiently
     D. It becomes safer to drive
     Answer: '
    """
    query = f"Question: {line['question']}\n"
    query += "\n".join([f"{letters[ix]}. {choice}" 
                       for ix, choice in enumerate(line["choices"]["text"])])
    query += "\nAnswer: "
    choices = line["choices"]["text"]
    
    return Doc(
        task_name=task_name,
        query=query,
        choices=choices,
        gold_index=line["choices"]["label"].index(line["answerKey"]),
    )

### 4. Context with Letter Labels
Show choices in the prompt but expect letter responses (A, B, C, D).

In [None]:
def arc_context_labels(line, task_name: str = None):
    """Show choices but evaluate on letter labels.
    
    Same prompt as arc_context_choices, but expects 'A', 'B', 'C', or 'D' as answer.
    """
    query = f"Question: {line['question']}\n"
    query += "\n".join([f"{letters[ix]}. {choice}" 
                       for ix, choice in enumerate(line["choices"]["text"])])
    query += "\nAnswer: "
    # Key difference: choices are now letters instead of full text
    choices = [letters[ix] for ix in range(len(line["choices"]["text"]))]
    
    return Doc(
        task_name=task_name,
        query=query,
        choices=choices,
        gold_index=line["choices"]["label"].index(line["answerKey"]),
    )
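
To see the four formulations side by side without running a model, the sketch below rebuilds the query strings and the scored choices for one hand-written record. It does not use lighteval's `Doc`; the record, the helper `lettered_block`, and the `formulations` dictionary are hypothetical names introduced only for this illustration.

```python
import string

letters = list(string.ascii_uppercase)

# Hypothetical record in the same shape as an ARC row
line = {
    "question": "Which gas do plants absorb from the air?",
    "choices": {
        "text": ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"],
        "label": ["A", "B", "C", "D"],
    },
    "answerKey": "B",
}


def lettered_block(texts):
    """Render choices as 'A. ...' lines, one per choice."""
    return "\n".join(f"{letters[ix]}. {text}" for ix, text in enumerate(texts))


texts = line["choices"]["text"]
with_choices = f"Question: {line['question']}\n{lettered_block(texts)}\nAnswer: "

# Each entry maps a formulation name to (query, choices the model is scored on)
formulations = {
    "arc_base": (line["question"], texts),
    "arc_context": (f"Question: {line['question']}\nAnswer: ", texts),
    "arc_context_choices": (with_choices, texts),
    "arc_context_labels": (with_choices, letters[: len(texts)]),
}

for name, (query, choices) in formulations.items():
    print(f"--- {name} ---\n{query}\nscored choices: {choices}\n")
```

Note that the last two formulations share the same query and differ only in what is scored: full answer text in one case, single letters in the other.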

## Creating Our Task Suite

Now let's combine all our prompt formulations into a suite of tasks to evaluate.

In [None]:
# Create a module to hold our tasks
task_module = ModuleType("task_module")
task_module.__file__ = "."  # must be a string; a trailing comma here would make it a tuple
task_module.TASKS_TABLE = [
    ArcExplorationTask(
        name="arc_base", 
        prompt_function=arc_base, 
        metric=[metric_mcqa, metric_gen]
    ),
    ArcExplorationTask(
        name="arc_context", 
        prompt_function=arc_context, 
        metric=[metric_mcqa, metric_gen]
    ),
    ArcExplorationTask(
        name="arc_context_choice", 
        prompt_function=arc_context_choices, 
        metric=[metric_mcqa, metric_gen]
    ),
    ArcExplorationTask(
        name="arc_context_labels", 
        prompt_function=arc_context_labels, 
        metric=[metric_mcqa, metric_gen]
    )
]

task_names = ["arc_base", "arc_context", "arc_context_choice", "arc_context_labels"]

## Running the Evaluation

Now let's set up and run our evaluation pipeline. We'll use a small model (SmolLM-1.7B) for demonstration purposes.

In [None]:
# Initialize accelerator if available
if is_accelerate_available():
    from accelerate import Accelerator, InitProcessGroupKwargs
    accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=3000))])
else:
    accelerator = None

In [ ]:
# Set up evaluation tracking with save_details enabled
evaluation_tracker = EvaluationTracker(
    output_dir=cache_dir,
    save_details=True,  # IMPORTANT: This saves full prompts and other details
    # Optionally push results to HuggingFace Hub
    # push_to_hub=True,
    # hub_results_org="your_username", 
)

# Configure pipeline parameters
pipeline_params = PipelineParameters(
    launcher_type=ParallelismManager.ACCELERATE,
    env_config=EnvConfig(cache_dir=cache_dir),
    override_batch_size=1,
    max_samples=max_samples,  # Remove this for full evaluation
    custom_tasks_directory=task_module
)

# Configure the model
model_config = BaseModelConfig(
    pretrained="HuggingFaceTB/SmolLM-1.7B",
    dtype="bfloat16",
    use_chat_template=False,
)

# Format each task as suite|task|num_few_shots|truncate_few_shots
# (3 few-shot examples; 0 = do not automatically reduce the number of few-shots)
tasks = ",".join([f"custom|{task}|3|0" for task in task_names])
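
Each entry in `tasks` is a pipe-separated string. The sketch below splits such strings into named fields so the structure is easier to read. The function `parse_task_spec` and the field names are our own labels, not part of lighteval, and the reading of the last field (whether the few-shot count may be reduced automatically when a prompt is too long) is an assumption to check against the documentation for the installed version.

```python
def parse_task_spec(spec):
    """Split 'suite|task|num_few_shots|flag' into a dictionary (illustrative only)."""
    suite, task, num_few_shots, truncate_flag = spec.split("|")
    return {
        "suite": suite,
        "task": task,
        "num_few_shots": int(num_few_shots),
        # Assumed meaning: "1" allows automatic few-shot reduction, "0" does not
        "truncate_few_shots": truncate_flag == "1",
    }


task_names = ["arc_base", "arc_context", "arc_context_choice", "arc_context_labels"]
tasks = ",".join(f"custom|{task}|3|0" for task in task_names)

parsed = [parse_task_spec(spec) for spec in tasks.split(",")]
for entry in parsed:
    print(entry)
```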

In [None]:
# Create and run the evaluation pipeline
print("Starting evaluation...")
pipeline = Pipeline(
    tasks=tasks,
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    model_config=model_config,
)

pipeline.evaluate()
pipeline.save_and_push_results()
print("Evaluation complete!")

In [ ]:
# Extract and process results
results = pipeline.get_results()["results"]
results_processed = []

for eval_name, eval_results in results.items():
    results_processed.append({
        "Prompt function": (eval_name.split(":")[1] if ":" in eval_name else eval_name).replace("_", " "), 
        "Quasi Exact Match": eval_results["qem"], 
        "Normalized Accuracy": eval_results["acc_norm"]
    })

# Create a polars DataFrame
results_data = pl.from_dicts(results_processed, strict=False)

# Display results in a nice table
(GT(results_data)
    .tab_header("Evaluation Results by Prompt Formulation")
    .tab_spanner(label="Metrics", columns=["Quasi Exact Match", "Normalized Accuracy"])
)

## Analyzing Results

Let's examine how different prompt formulations affected model performance.

## Viewing the Full Prompts Sent to the Model

One of the most valuable features of lighteval is the ability to see exactly what prompts are sent to the model. This is crucial for debugging and understanding model behavior. Let's examine the full prompts that were used in our evaluation.

In [ ]:
# Load the detailed results which contain the full prompts
import glob

# Find the latest evaluation results
detail_files = glob.glob(f"{cache_dir}/details/HuggingFaceTB/SmolLM-1.7B/**/*.parquet", recursive=True)

# Load one of the detail files to examine the full prompts
if detail_files:
    # Let's look at the arc_base task details
    arc_base_file = [f for f in detail_files if "arc_base" in f][0]
    details_df = load_dataset("parquet", data_files=arc_base_file, split="train")
    
    print(f"Loaded details from: {arc_base_file}")
    print(f"\nColumns available in details: {details_df.column_names}")
else:
    print("No detail files found. Make sure the evaluation has completed.")

### Examining the Full Prompts

Now let's look at the actual prompts that were sent to the model. The `full_prompt` column contains exactly what the model received, including any few-shot examples.

In [ ]:
# Let's examine the first full prompt from the arc_base task
if 'full_prompt' in details_df.column_names:
    print("=== FULL PROMPT SENT TO MODEL (arc_base formulation) ===\n")
    print(details_df[0]['full_prompt'])
    print("\n" + "="*50 + "\n")
    
    # Also show the individual components
    print("Components of this evaluation sample:")
    print(f"- Example (query only): {details_df[0]['example']}")
    print(f"- Gold answer: {details_df[0]['gold']}")
    print(f"- Model prediction: {details_df[0]['predictions']}")
    print(f"- Choices provided: {details_df[0]['choices']}")
else:
    print("The 'full_prompt' column is not available. Make sure save_details=True was set.")

### Comparing Full Prompts Across Different Formulations

Let's compare the full prompts for the same question across all our different formulations to see exactly how they differ.

In [ ]:
# Load full prompts from all formulations and compare
prompt_comparison = {}

for detail_file in detail_files:
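    # Extract task name from filename. Match the "|name|" segment so that
    # "arc_context" does not also match "arc_context_choice" or "arc_context_labels"
    task_name = None
    for name in task_names:
        if f"|{name}|" in detail_file:
            task_name = name
            break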
    
    if task_name:
        # Load the details
        task_details = load_dataset("parquet", data_files=detail_file, split="train")
        
        if 'full_prompt' in task_details.column_names:
            # Store the first prompt for comparison
            prompt_comparison[task_name] = task_details[0]['full_prompt']

# Display the prompts side by side
print("=== COMPARING FULL PROMPTS ACROSS FORMULATIONS ===\n")

for task_name, full_prompt in sorted(prompt_comparison.items()):
    print(f"\n{'='*60}")
    print(f"TASK: {task_name}")
    print(f"{'='*60}")
    print(full_prompt)
    print(f"{'='*60}\n")

### Understanding Few-Shot Examples in the Prompts

Notice that the full prompts include few-shot examples (3 in our case). These examples help the model understand the task format. Let's analyze how few-shot examples are included.

In [ ]:
# Analyze the structure of few-shot examples
if prompt_comparison:
    # Take the arc_context_choice prompt as an example
    example_prompt = prompt_comparison.get('arc_context_choice', list(prompt_comparison.values())[0])
    
    # Count the number of "Question:" occurrences to understand few-shot structure
    num_questions = example_prompt.count("Question:")
    
    print(f"Number of questions in the full prompt: {num_questions}")
    print(f"This includes {num_questions - 1} few-shot examples + 1 actual test question\n")
    
    # Split the prompt to show structure
    prompt_parts = example_prompt.split("Question:")
    
    if len(prompt_parts) > 2:
        print("Structure of the prompt:")
        print("1. Few-shot examples (with answers)")
        print("2. Test question (without answer - model needs to complete this)")
        
        # Show the last question (the actual test question)
        print(f"\nThe actual test question starts with:\nQuestion:{prompt_parts[-1][:200]}...")
        
    # Check the metadata
    if detail_files:
        task_details = load_dataset("parquet", data_files=detail_files[0], split="train")
        print(f"\nMetadata from evaluation:")
        print(f"- Number of few-shot examples requested: {task_details[0].get('num_asked_few_shots', 'N/A')}")
        print(f"- Number of effective few-shot examples: {task_details[0].get('num_effective_few_shots', 'N/A')}")
        print(f"- Was the prompt truncated?: {task_details[0].get('truncated', 'N/A')}")
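
As a mental model for what the cell above counts, the sketch below assembles a few-shot prompt by hand: solved examples first, then the unsolved test question. The example questions, the helper `build_few_shot_prompt`, and the blank-line separator are assumptions made for illustration; the prompts lighteval actually builds may be laid out differently, which is why inspecting `full_prompt` matters.

```python
def build_few_shot_prompt(few_shot_examples, test_query, separator="\n\n"):
    """Join solved examples (query + answer), ending with the unsolved test query."""
    solved = [query + answer for query, answer in few_shot_examples]
    return separator.join(solved + [test_query])


# Made-up few-shot examples as (query, answer) pairs
few_shots = [
    ("Question: What do bees collect from flowers?\nAnswer: ", "Nectar"),
    ("Question: Which planet is closest to the Sun?\nAnswer: ", "Mercury"),
    ("Question: What is frozen water called?\nAnswer: ", "Ice"),
]
test_query = "Question: Which gas do plants absorb from the air?\nAnswer: "

prompt = build_few_shot_prompt(few_shots, test_query)
num_questions = prompt.count("Question:")

print(prompt)
print(f"\n{num_questions - 1} few-shot examples + 1 test question")
```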

### Creating a Summary Table of Full Prompts

Let's create a more structured view to understand the differences between formulations.

In [ ]:
# Create a summary of how each formulation structures its prompts
prompt_summaries = []

for detail_file in detail_files:
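    # Extract task name. Match the "|name|" segment so that
    # "arc_context" does not also match "arc_context_choice" or "arc_context_labels"
    task_name = None
    for name in task_names:
        if f"|{name}|" in detail_file:
            task_name = name
            break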
    
    if task_name:
        # Load details
        task_details = load_dataset("parquet", data_files=detail_file, split="train")
        
        if len(task_details) > 0:
            sample = task_details[0]
            
            # Extract just the test question part (without few-shot examples)
            full_prompt = sample.get('full_prompt', '')
            # Find the last occurrence of the question pattern
            test_question_start = full_prompt.rfind(sample['example']) if 'example' in sample else -1
            
            if test_question_start != -1:
                test_question_only = full_prompt[test_question_start:]
            else:
                test_question_only = "Could not extract test question"
            
            prompt_summaries.append({
                "Task": task_name.replace("_", " ").title(),
                "Full Prompt Length": len(full_prompt),
                # Few-shot examples are different questions, so look for text before the test question
                "Contains Few-Shot": "Yes" if test_question_start > 0 else "No",
                "Test Question Format": test_question_only[:150] + "..." if len(test_question_only) > 150 else test_question_only,
                "Choices Format": sample.get('choices', 'N/A'),
                "Gold Answer": sample.get('gold', 'N/A'),
                "Model Prediction": sample.get('predictions', 'N/A')
            })

# Display as a table
if prompt_summaries:
    summary_df = pl.from_dicts(prompt_summaries)
    display(GT(summary_df)
        .tab_header("Full Prompt Analysis Across Formulations")
        .fmt_number(columns=["Full Prompt Length"], decimals=0)
    )

## Detailed Analysis: Looking at Individual Examples

Let's examine specific examples to understand how the model responds to different prompt formulations.

In [None]:
# Load detailed results
path = f"{cache_dir}/details/HuggingFaceTB/SmolLM-1.7B/"
results = {}

for root, _, files in os.walk(path):
    for file in files:
        if "|" in file:
            eval_name = file.split("|")[1]
            results[eval_name] = load_dataset("parquet", data_files=f"{root}/{file}")["train"]

In [None]:
# Process and display detailed results
transformed_data = []
keys = ["example", "gold", "predictions", "metrics"]

# Look at first few examples
for ix in range(min(5, max_samples)):
    for key in keys:
        cur_sample = {"Sample": f"Sample {ix}", "Type": key.capitalize()}
        
        for eval_name, df in sorted(results.items()):
            try:
                cur_result = literal_eval(results[eval_name][ix][key])
                if isinstance(cur_result, list):
                    if len(cur_result) == 1:
                        cur_sample[eval_name] = cur_result[0]
                    else:
                        cur_sample[eval_name] = "\n".join([str(i) for i in cur_result])
                elif isinstance(cur_result, dict):
                    for metric, value in cur_result.items():
                        cur_sample[eval_name] = str(value)
                        cur_sample["Type"] = f"{key.capitalize()}: {metric}"
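            except (ValueError, SyntaxError, TypeError):
                # The field is not a Python literal (e.g. plain text); keep the raw value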
                cur_sample[eval_name] = results[eval_name][ix][key]
                
        # Replace newlines for better display
        for k, v in cur_sample.items():
            if isinstance(v, str):
                cur_sample[k] = v.replace("\n", "<br />")
        
        transformed_data.append(cur_sample)

In [None]:
# Create and display detailed comparison table
pl_data = pl.from_dicts(transformed_data, strict=False, infer_schema_length=200)

(GT(pl_data.head(20))
    .tab_header("Comparing Different Prompt Formulations - Detailed Examples")
    .tab_spanner(label="Prompt Variations", columns=cs.starts_with("arc"))
    .tab_stub(rowname_col="Type", groupname_col="Sample")
    .fmt_markdown(columns=cs.starts_with("arc"))
)

## Key Insights

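From experiments like these, several patterns tend to emerge. With `max_samples = 10` the scores are noisy, so rerun on the full test split before drawing firm conclusions: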

1. **Base Format Challenges**: The simplest prompt (just the question) often struggles, especially in generative mode where the model may not understand it should provide a specific answer.

2. **Context Helps**: Adding "Question:" and "Answer:" labels provides structure that helps the model understand the task format.

3. **Choices in Prompt**: Including the multiple choice options directly in the prompt significantly helps the model select from valid options.

4. **Label vs Full Text**: When choices are shown but the model must predict letters (A, B, C, D), performance may differ from predicting the full text answer.

5. **MCQA vs Generation**: The same prompt can perform very differently when evaluated as multiple choice (loglikelihood) versus generation (exact match).

## Exercises

1. **Try Different Models**: Replace `SmolLM-1.7B` with other models (e.g., `microsoft/phi-2`, `TinyLlama/TinyLlama-1.1B-Chat-v1.0`) and compare results.

2. **Create New Prompt Formulations**: Design your own prompt templates. For example:
   - Add instructions like "Choose the best answer:"
   - Use different formatting (numbered lists instead of letters)
   - Add few-shot examples in the prompt

3. **Explore Other Datasets**: Adapt this code to work with other multiple-choice datasets like:
   - `commonsense_qa`
   - `winogrande`
   - `hellaswag`

4. **Analyze Error Patterns**: Look at which types of questions benefit most from different prompt formulations.

5. **Statistical Significance**: With full evaluation (remove `max_samples`), calculate confidence intervals for the performance differences.
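
For exercise 5, one possible starting point is a confidence interval around a single accuracy score. The sketch below uses the Wilson score interval, which is one common choice rather than the only option, and the counts passed in are made up. It shows how much wider the interval is with 10 samples than with 1,000 at the same accuracy.

```python
import math


def wilson_interval(num_correct, num_total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    p_hat = num_correct / num_total
    denom = 1 + z**2 / num_total
    center = (p_hat + z**2 / (2 * num_total)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / num_total + z**2 / (4 * num_total**2)
    )
    return center - half_width, center + half_width


# Made-up counts: the same 60% accuracy at two sample sizes
low_small, high_small = wilson_interval(6, 10)
low_large, high_large = wilson_interval(600, 1000)

print(f"n=10:   [{low_small:.3f}, {high_small:.3f}]")
print(f"n=1000: [{low_large:.3f}, {high_large:.3f}]")
```

If the intervals for two prompt formulations overlap heavily, the observed difference may be noise. Because every formulation is scored on the same questions, a paired comparison would be a more sensitive follow-up.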

## Additional Readings

For more in-depth understanding of LLM evaluation concepts, especially LLM-as-a-Judge approaches:

- [LLM & VLM-as-a-Judge: A Comprehensive Guide](https://dylandigitalgarden.com/Dylan/2024/July/July+31%2C+2024+LLM+%26+VLM-as-a-Judge) - This article provides detailed insights into using LLMs as evaluators, including best practices, limitations, and real-world applications. While this notebook focused on traditional metrics (exact match, loglikelihood), LLM-as-a-Judge represents a more sophisticated evaluation paradigm that we'll explore in later notebooks.

## Conclusion

This notebook demonstrated how seemingly minor changes in prompt formulation can significantly impact model performance. Key takeaways:

- **Prompt engineering matters**: The way we present tasks to LLMs can be as important as the model itself
- **Evaluation methodology affects results**: MCQA vs generative evaluation can yield very different insights
- **Systematic testing is crucial**: Tools like lighteval enable rigorous comparison of different approaches
- **Context and structure help**: Adding labels and formatting generally improves model understanding

In the next notebooks, we'll explore more advanced evaluation techniques including synthetic data generation and agentic evaluation methods.