# CS 4650 - Natural Language Processing - HW - 0 
Georgia Tech, Summer 2025 (Instructor: Kartik Goyal)

Welcome to the first full programming assignment for CS 4650! 

In this assignment, you will be implementing different evaluation methodologies for Large Language Models (LLMs) across three key tasks: Multiple Choice Questions (MCQ), Machine Translation (MT), and Short Story Generation (SSG). This assignment will cover fundamental evaluation metrics, prompt engineering techniques, and analysis of model performance across different domains.

DO NOT CHANGE the names of any files, or any content outside the cells where you are asked to write code.

NOTE: DO NOT USE ANY OTHER EXTERNAL LIBRARIES FOR THIS ASSIGNMENT

<!-- TODO: add deadlines -->

The assignment is broken down into 6 Sections. The sections are as follows:

| Section | Part                                      | Points |
|---------|-------------------------------------------|--------|

<!-- TODO: assign points appropriately. -->



All the best and happy coding!

## 0. Setup

In [None]:
%load_ext autoreload
%autoreload 2

# Check what version of Python is running
import sys 
print(sys.version)

In [2]:
# Importing required libraries - DO NOT CHANGE THIS CELL

import torch
import torch.nn.functional as F
import random
import numpy as np
from collections import Counter, defaultdict
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple, Optional
from tqdm import tqdm

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer

In [3]:
# Defining global constants - DO NOT CHANGE THESE VALUES

RANDOM_SEED = 42
PADDING_VALUE = 0
UNK_VALUE     = 1
BATCH_SIZE = 128

torch.manual_seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
# This is how we select a GPU if it's available on your computer or in the Colab environment.
print('Device of execution - ', device)

## 1. Utility Classes [15 points - Programming]

### 1.1. LLM Generation Configuration [0 points]

The following cell defines a configuration class for LLM generation. It makes the generation parameters easier to keep track of in later parts of the notebook.


Here are the generation hyperparameters we will be using: 

- temperature: Controls randomness in token selection. A value of 0 means deterministic (always pick highest probability token), while higher values (e.g., 0.7-1.0) increase randomness. Lower temperature produces more focused/predictable text, higher temperature produces more diverse/creative text. 

- top_p (nucleus sampling): Restricts sampling to the smallest set of highest-probability tokens whose cumulative probability reaches the threshold p.
  For example, if top_p=0.9, the model samples from the smallest set of top-ranked tokens whose probabilities sum to at least 90%.
  This balances diversity and quality by letting the candidate pool grow or shrink with the model's confidence.

- top_k: Limits token selection to only the k highest probability tokens. For example, if top_k=40,
  only the 40 most likely next tokens are considered. This prevents the model from selecting very low probability tokens.

- max_new_tokens: Sets the maximum length of generated text by limiting the number of tokens the model will produce.
  This prevents excessively long outputs and controls computational resources.
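
To make the interaction between these hyperparameters concrete, here is a minimal pure-Python sketch that applies temperature, top-k, and top-p to a toy logit vector. The function name `next_token_distribution` and the 4-token vocabulary are hypothetical, and real libraries may apply or renormalize the filters in a slightly different order.

```python
import math

def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering."""
    if temperature <= 0:
        # Temperature 0 is treated as greedy decoding: all mass on the argmax.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank tokens from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # top-k: keep only the k most likely tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # top-p: keep the smallest prefix whose cumulative probability reaches p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept

    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in ranked)
    return {i: probs[i] / mass for i in ranked}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical 4-token vocabulary
print(next_token_distribution(toy_logits, temperature=0))
print(next_token_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.9))
```

Sampling then draws one token from the returned distribution; with temperature 0 there is nothing left to sample, which is why greedy decoding is deterministic.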


In [5]:
# DO NOT CHANGE THIS CELL

@dataclass
class LLMGenerationConfig:
    """
    Configuration class for LLM generation parameters.
    
    Args:
        temperature (float): Controls randomness in sampling. Higher values make output more random.
        top_p (float): Nucleus sampling parameter. Only tokens with cumulative probability <= top_p are considered.
        top_k (int): Only the top k tokens are considered for sampling.
        max_new_tokens (int): Maximum number of new tokens to generate.
    """
    temperature: float = 0.7
    top_p: Optional[float] = 0.95
    top_k: Optional[int] = 40
    max_new_tokens: int = 100




### 1.2. LLM Wrapper Class [8 points]

In the following cell, implement a wrapper around the Huggingface transformers API so that we can run inference on the model. 

You need to implement the following: 

- Get the tokenizer and model from Huggingface and assign them to the `tokenizer` and `model` attributes.
- Implement the `generate()` method to run inference on the model. We need this for generating the completions from the LLMs. 
- Implement the `logits()` method to get the logits for the next token. We will need this for the MCQ task.  
- Implement the `perplexity()` method to get the perplexity of the model on a given text. We will need this for evaluating the SSG task. 
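
As a reminder of what perplexity measures, the sketch below computes it from a list of per-token probabilities: the exponential of the average negative log-likelihood. The helper name `perplexity_from_token_probs` and the probability values are made up for illustration; in the wrapper itself these quantities would come from the model's logits.

```python
import math

def perplexity_from_token_probs(token_probs):
    """Perplexity = exp(mean negative log-likelihood) over the predicted tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities a model assigned to each actual next token.
uniform_guess = [0.25, 0.25, 0.25, 0.25]  # as uncertain as a 4-way coin flip
confident = [0.9, 0.8, 0.95]              # model mostly predicts the right token

print(perplexity_from_token_probs(uniform_guess))
print(perplexity_from_token_probs(confident))
```

A model that spreads its probability uniformly over four options has a perplexity of about 4, which is why perplexity is often read as an "effective number of choices" per token.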



In [6]:
class LLM:
    """
    A wrapper class for Hugging Face language models that provides a unified interface
    for text generation, logit computation, and perplexity calculation.
    
    If transformers library is not available, falls back to deterministic stubs.
    """

    def __init__(self, hf_id: str = "gpt2", device: str = None):
        """
        Initialize the LLM wrapper.
        
        Args:
            hf_id (str): Hugging Face model identifier
            device (str): Device to load model on ('cuda', 'cpu', 'mps')
        """      
        self.hf_id = hf_id
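        # Normalize to a torch.device so that checks such as `self.device.type` work for both str and torch.device inputs
        self.device = torch.device(device or ("cuda" if torch.cuda.is_available() else "cpu"))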
        
        # Use auto-loading with device_map="auto" for faster loading and automatic memory management
        self.tokenizer = AutoTokenizer.from_pretrained(hf_id)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
            
        self.model = AutoModelForCausalLM.from_pretrained(
            hf_id,
            device_map="auto",  # Automatically determine optimal device mapping
            torch_dtype=torch.float16  # Use half precision for faster loading and less memory
        ).eval()
                

    @torch.inference_mode()
    def generate(self, prompt: str, config: LLMGenerationConfig = None) -> str:
        """
        Generate text continuation for the given prompt using the underlying language model.
        
        This method takes a text prompt and generates additional text that continues from
        the prompt in a coherent manner. The generation process can be controlled through
        various parameters specified in the config object.
        
        The method tokenizes the input prompt, passes it through the model, and then
        decodes the generated token IDs back to text, excluding the original prompt tokens.
        
        Args:
            prompt (str): Input text prompt that the model will continue from
            config (LLMGenerationConfig): Configuration object containing generation parameters
                such as temperature, top_p, top_k, and max_new_tokens. If None, default
                parameters will be used.
            
        Returns:
            str: Generated text continuation without the original prompt. The text is
                stripped of leading/trailing whitespace and special tokens are removed
                during decoding.
        """
        if self.model is None:
            # Deterministic stub - always returns "A" for MCQ compatibility
            return "A"

        config = config or LLMGenerationConfig()
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids.to(self.device)
            
        if config.temperature > 0:
            output = self.model.generate(
                input_ids,
                do_sample=True, 
                temperature=config.temperature,
                top_p=config.top_p,
                top_k=config.top_k,
                max_length=input_ids.shape[1] + config.max_new_tokens,
                pad_token_id=self.tokenizer.eos_token_id,
            )
        else:
            output = self.model.generate(
                input_ids,
                do_sample=False, 
                max_length=input_ids.shape[1] + config.max_new_tokens,
                pad_token_id=self.tokenizer.eos_token_id,
            )
        
        generated = self.tokenizer.decode(
            output[0, input_ids.shape[1]:], 
            skip_special_tokens=True
        )
        return generated.strip()

    @torch.inference_mode()
    def logits(self, prompt: str) -> torch.Tensor:
        """
        Get next-token logits for the given prompt.
        
        This method computes and returns the logits (raw, unnormalized prediction scores) 
        for the next token that would follow the given prompt. These logits represent the model's
        prediction distribution over the entire vocabulary for the next token position.
        
        Args:
            prompt (str): Input text prompt for which to compute next-token predictions
            
        Returns:
            torch.Tensor: A tensor of shape (vocab_size,) containing the logits for each
                possible next token in the vocabulary. Higher values indicate tokens the
                model considers more likely to follow the prompt.
        """
        tokens = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        outputs = self.model(**tokens)
        # Return logits for the last token position
        return outputs.logits[0, -1].cpu()

    @torch.inference_mode()
    def perplexity(self, text: str) -> float:
        """
        Calculate perplexity of the given text.
        
        Perplexity is a measurement of how well a probability model predicts a sample.
        Lower perplexity indicates the model is better at predicting the text.
        It is calculated as the exponentiated average negative log-likelihood of a sequence.
        
        Args:
            text (str): Input text for which to calculate perplexity
        Returns:
            float: Perplexity value
        """
        if self.model is None:
            return 100.0  # Fixed stub value

        encodings = self.tokenizer(text, return_tensors="pt").to(self.device)
        
        with torch.no_grad():
            outputs = self.model(**encodings)
            logits = outputs.logits
        
        # Shift logits and labels for next-token prediction
        shift_logits = logits[:, :-1].contiguous()
        shift_labels = encodings.input_ids[:, 1:].contiguous()
        
        # Calculate cross-entropy loss
        loss_fct = torch.nn.CrossEntropyLoss(ignore_index=self.tokenizer.pad_token_id)
        loss = loss_fct(
            shift_logits.view(-1, shift_logits.size(-1)), 
            shift_labels.view(-1)
        )
        
        return np.exp(loss.item())

In the following cells, let's load two different LLMs and test each of the methods we implemented in the previous cell. 

In [None]:
# DO NOT CHANGE THIS CELL

llama = LLM(hf_id="meta-llama/Llama-3.1-8B-Instruct", device=device)

In [None]:
# DO NOT CHANGE THIS CELL
prompt = "Hello, how are you?"

# Greedy decoding 
generation_config = LLMGenerationConfig(
    temperature=0,
    top_p=None,
    top_k=None,
    max_new_tokens=10
)

# Let's test the generate method using greedy decoding 
assert llama.device.type == "cuda", "Device is not loaded to cuda"
assert llama.generate(prompt, generation_config) == "I am doing well, thanks for asking. I", "Greedy decoding is incorrect"
# TODO: does temp=0 case return the same outputs even with different hardware? 
# TODO: how to test temp > 0 case? 
# TODO: make sure device is loaded to cuda 


assert torch.argmax(llama.logits(prompt)) == 358, "Logit is incorrect"
# assert llama.perplexity(prompt) == 1.0, "Perplexity is incorrect"
assert np.isclose(llama.perplexity(prompt), 17.969428099556087, atol=1e-1), "Perplexity is incorrect"
print("All tests passed!")

In [9]:
# DO NOT CHANGE THIS CELL

# qwen = LLM(hf_id="Qwen/Qwen2.5-7B-Instruct", device=device)

In [10]:
# DO NOT CHANGE THIS CELL

# # Let's test the generate method using greedy decoding 
# assert qwen.device.type == "cuda", "Device is not loaded to cuda"
# assert qwen.generate(prompt, generation_config) == "I'm sorry, but as a language model,", "Greedy decoding is incorrect"
# assert torch.argmax(qwen.logits(prompt)) == 358, "Logit is incorrect"
# assert np.isclose(qwen.perplexity(prompt), 7.505416801107283, atol=1e-1), "Perplexity is incorrect"
# print("All tests passed!")

### 1.3. Embedding Model [4 points]

Next, we will implement a wrapper around the Huggingface SentenceTransformer API for generating embeddings. 
Embedding models convert text into dense vector representations (embeddings) that capture semantic meaning.
These vectors allow us to measure similarity between texts in a high-dimensional space.

Key points about embedding models:
1. They transform variable-length text inputs into fixed-dimension vectors
2. They have a maximum context length, so longer inputs will be truncated
3. Similar texts will have embeddings that are close to each other in the vector space

We'll use these embeddings when evaluating LLM outputs based on semantic similarity rather than
exact string matching, which is particularly useful for tasks like machine translation and
short story generation where multiple valid outputs are possible.
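
Closeness in the vector space is usually measured with cosine similarity. The following sketch computes it for plain Python lists; the 3-dimensional vectors are toy values chosen for illustration, not real model embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" (hypothetical values).
greeting = [0.9, 0.1, 0.0]
similar_greeting = [0.8, 0.2, 0.1]
unrelated = [0.0, 0.1, 0.9]

print(cosine_similarity(greeting, similar_greeting))  # close to 1
print(cosine_similarity(greeting, unrelated))         # close to 0
```

Because cosine similarity depends only on direction and not on vector length, it is a common choice for comparing sentence embeddings.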


In [11]:
class EmbeddingModel:
    """
    A wrapper around the Huggingface SentenceTransformer API for generating embeddings.
    This model creates semantic embeddings that can be used for measuring similarity
    between texts.
    """

    def __init__(self, hf_id: str = "sentence-transformers/all-MiniLM-L6-v2", dim: int = None):
        """
        Initialize the embedding model with the specified model ID.
        
        Args:
            hf_id (str): Hugging Face model ID for the SentenceTransformer model.
                         Default is "sentence-transformers/all-MiniLM-L6-v2".
            dim (int): Not used for SentenceTransformer models as the dimension is
                       determined by the model itself, but kept for API compatibility.
        """
        self.model = SentenceTransformer(hf_id,  trust_remote_code=True)
        self.dim = self.model.get_sentence_embedding_dimension()

    def embed(self, text: str) -> torch.Tensor:
        """
        Create an embedding for the given text using the SentenceTransformer model.
        
        Args:
            text (str): Input text to embed. Inputs longer than the model's maximum
                        sequence length are truncated.
            
        Returns:
            torch.Tensor: Embedding vector representing the semantic content of the input text.
        """
        # encode() returns a NumPy array by default; convert_to_tensor=True makes it return a torch.Tensor
        embedding = self.model.encode(text, convert_to_tensor=True)
        return embedding

LaBSE (Language-agnostic BERT Sentence Embedding) is a multilingual sentence embedding model trained on translation pairs, so that a sentence and its translation land close together in the embedding space. It encodes sentences from 109 languages into a shared embedding space, allowing for effective cross-lingual similarity comparison. This makes it particularly useful for evaluating machine translation outputs by measuring semantic similarity between translations and references, rather than relying on exact string matching.


In [None]:
# DO NOT CHANGE THIS CELL 

labse = EmbeddingModel(hf_id="sentence-transformers/LaBSE")


embedding1 = labse.embed("Hello, how are you?")
embedding2 = labse.embed("Goodbye, see you later!")


assert embedding1.shape == embedding2.shape
assert embedding1.shape == (768,)
assert embedding2.shape == (768,)

assert torch.isclose(torch.cosine_similarity(embedding1, embedding2, dim=0), torch.tensor(0.4207), atol=1e-1)
assert torch.isclose(torch.cosine_similarity(embedding1, embedding1, dim=0), torch.tensor(1.0), atol=1e-3)
assert torch.isclose(torch.cosine_similarity(embedding2, embedding2, dim=0), torch.tensor(1.0), atol=1e-3)

print("All tests passed!")

The following is an embedding model from Alibaba's GTE (General Text Embedding) family. 
It is a multilingual embedding model supporting 70 languages.

In [None]:
# DO NOT CHANGE THIS CELL 

gte = EmbeddingModel(hf_id="Alibaba-NLP/gte-multilingual-base")


# TODO: implement test.

embedding1 = gte.embed("Hello, how are you?")
embedding2 = gte.embed("Goodbye, see you later!")

assert embedding1.shape == embedding2.shape
assert embedding1.shape == (768,)
assert embedding2.shape == (768,)

assert torch.isclose(torch.cosine_similarity(embedding1, embedding2, dim=0), torch.tensor(0.5835), atol=1e-1)
assert torch.isclose(torch.cosine_similarity(embedding1, embedding1, dim=0), torch.tensor(1.0), atol=1e-3)
assert torch.isclose(torch.cosine_similarity(embedding2, embedding2, dim=0), torch.tensor(1.0), atol=1e-3)
print("All tests passed!")

## 2. Multiple Choice Questions (MCQ) Evaluation [20 points - Programming]

In this section, we'll be evaluating LLMs on the [MMLU (Massive Multitask Language Understanding) benchmark](https://arxiv.org/abs/2009.03300), a multiple choice question benchmark. It covers a wide range of subjects including mathematics, history, computer science, law, and more, making it a comprehensive test of an LLM's knowledge and reasoning abilities.

We'll implement two different evaluation approaches:

1. **Regex-based Evaluation**: A simple approach that extracts answers from model outputs using regular expressions, then compares them to ground truth answers.

2. **Logit-based Evaluation**: A more sophisticated approach that directly accesses the model's internal probability distributions (logits) to determine which answer choice the model predicts as most likely.

Each approach has its advantages and limitations, which we'll explore throughout this section.





### 2.1. Helper Functions [3 points]

In this section, we define some useful helper functions for evaluating the LLMs. 


N-grams are contiguous sequences of n items from a given sample of text or speech. 
For example, in the sentence "I love natural language processing":
- 1-grams (unigrams): [("I",), ("love",), ("natural",), ("language",), ("processing",)]
- 2-grams (bigrams): [("I", "love"), ("love", "natural"), ("natural", "language"), ("language", "processing")]
- 3-grams (trigrams): [("I", "love", "natural"), ("love", "natural", "language"), ("natural", "language", "processing")]

N-grams are useful for various NLP tasks including:
- Language modeling: Predicting the next word based on previous words
- Text similarity: Comparing documents based on shared n-grams
- Machine translation evaluation: Metrics like BLEU use n-gram overlap

Implement the function below which extracts all possible n-grams from a sequence of tokens. 


In [24]:
def _ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """
    Extract n-grams from a sequence of tokens.
    
    Args:
        tokens (Sequence[str]): A sequence of tokens (words, characters, etc.)
        n (int): The size of each n-gram
        
    Returns:
        List[Tuple[str, ...]]: A list of tuples, where each tuple contains n consecutive tokens
                              from the input sequence
    """
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

### 2.2. Regex-based Accuracy [4 points]

In the following cell, implement a function which extracts the first standalone answer letter (A-D, case-insensitive) from the LLM's generation and compares it with the reference answer. We implement this using a regular expression. 

In [25]:
_mcq_regex = re.compile(r"\b([A-D])\b", re.IGNORECASE)

def compute_regex_accuracy(generation: str, reference: str) -> float:
    """
    Extract the first letter A-D from the generation and compare with reference.
    
    Args:
        generation (str): Generated text from the model
        reference (str): Ground truth answer (A, B, C, or D)
        
    Returns:
        float: 1.0 if correct, 0.0 if incorrect
    """
    match = _mcq_regex.search(generation)
    pred = match.group(1).upper() if match else None
    return float(pred == reference.strip().upper())

In [310]:
# TODO: implement test. Have a dummy generated text, feed it into the function, and see if it matches the reference answer correctly. 
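
The snippet below is a small, self-contained sketch of how a word-boundary regex of this kind behaves on a few made-up generations. The helper name `first_choice_letter` and the example strings are hypothetical. Note the fifth example: because matching is case-insensitive, the article "a" is picked up before the intended answer.

```python
import re

# Same pattern shape as in the cell above: a standalone letter A-D, any case.
mcq_regex = re.compile(r"\b([A-D])\b", re.IGNORECASE)

def first_choice_letter(generation):
    """Return the first standalone A-D letter (uppercased), or None if absent."""
    match = mcq_regex.search(generation)
    return match.group(1).upper() if match else None

examples = [
    "B",
    "The answer is C.",
    "(D) Paris",
    "I think it is b",
    "Probably a tough one: C",  # pitfall: the article "a" matches first
    "None of these",
]

for text in examples:
    print(repr(text), "->", first_choice_letter(text))
```

Keeping `max_new_tokens` small and prompting the model to answer with a single letter reduces the chance of hitting such false matches.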

### 2.3. Logit-based Accuracy

In the following cell, implement a function which calculates the accuracy based on the logits for the first token generated by the LLM.
That is, calculate the most likely token based on the model's logits and compare it with the reference answer. 
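
To illustrate the idea without loading a model, here is a toy sketch with a hypothetical 7-token vocabulary and hand-written logits. It takes the argmax, decodes it, strips whitespace (many tokenizers have separate tokens for "B" and " B"), and compares the result with the reference. All names here are assumptions made for the example.

```python
# Hypothetical vocabulary: note the separate tokens with a leading space.
toy_vocab = ["A", " A", "B", " B", "C", "D", "The"]

def predicted_letter(logits, vocab):
    """Decode the highest-scoring token and strip surrounding whitespace."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best].strip()

def logit_accuracy(logits, reference, vocab):
    """1.0 if the most likely next token is the reference letter, else 0.0."""
    return float(predicted_letter(logits, vocab) == reference.strip().upper())

# Made-up logits: the model prefers " B" (index 3).
logits_prefers_b = [0.1, 0.3, 0.2, 2.5, 0.0, -1.0, 1.9]
print(logit_accuracy(logits_prefers_b, "B", toy_vocab))

# Made-up logits: the model wants to start a sentence with "The" (index 6).
logits_prefers_the = [0.1, 0.3, 0.2, 0.5, 0.0, -1.0, 1.9]
print(logit_accuracy(logits_prefers_the, "B", toy_vocab))
```

The second case shows one limitation of taking the argmax over the whole vocabulary: if the model's most likely first token is not an answer letter at all, the item is scored as incorrect regardless of which letter it ranks highest.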


In [27]:
def compute_logit_accuracy(logits: torch.Tensor, reference: str, model: LLM) -> float:
    """
    Compute accuracy based on logits for the first token generated by the LLM.
    
    Args:
        logits (torch.Tensor): Logits from the model for the first token
        reference (str): Ground truth answer (A, B, C, or D)
        model (LLM): The language model used for generation, needed for tokenizer access
        
    Returns:
        float: 1.0 if the token with highest logit matches reference, 0.0 otherwise
    """
    # get the predicted letter from the logits
    pred_idx = int(torch.argmax(logits))
    pred = model.tokenizer.decode([pred_idx]).strip()

    # TODO: am I handling whitespace correctly? 

    # compute accuracy
    return float(pred == reference)


In [312]:
# TODO: implement test. Have a test case prompt for which the answer is very obvious. Make sure logit is correctly extracted.  

### 2.5. Run MCQ Evaluation

In the following cells we will run evaluation on the LLMs with the MMLU benchmark.

In [28]:
# DO NOT CHANGE THIS CELL

# Load MMLU dataset from Huggingface
mmlu_test_raw = load_dataset("lighteval/mmlu", "all", split="test")

# Convert to format expected by evaluate_pair_metric
mmlu_test_data = []
LETTERS = ["A", "B", "C", "D"]

# See: `prompts/mcq/default.txt` for the prompt template format 
for item in mmlu_test_raw:
    eval_item = {
        "question": item["question"], 
        "option_A": item["choices"][0],
        "option_B": item["choices"][1],
        "option_C": item["choices"][2],
        "option_D": item["choices"][3],
        "reference": LETTERS[item["answer"]],
        "reference_idx": item["answer"]
    }
    mmlu_test_data.append(eval_item)


In [None]:
# Acc: 
# - regex: 0.967 for 30 samples. 
# - logit: 0.933 for 30 samples. 

# TODO: See how long evals take for the Llama and Qwen on PACE GPUs. 
# This shouldn't take long given that students have access to PACE GPUs. 
# ~120 seconds for Llama on A40 for 30 samples. 
# TODO: [the reported scores](https://huggingface.co/Qwen/Qwen2-7B-Instruct#evaluation) 
# are around ~70% for both Llama and Qwen. Let's prompt/hp tune to match these scores? 

# DO NOT CHANGE THIS CELL

config = LLMGenerationConfig(
    temperature=0,
    top_p=None,
    top_k=None,
    max_new_tokens=5
)
prompt_template = open("prompts/mcq/default.txt").read()

llama_mmlu_scores = defaultdict(list)
for item in tqdm(mmlu_test_data[:30], desc="Evaluating dataset"):
    prompt = prompt_template.format(**item) if prompt_template else item["question"]
    reference, reference_idx = item["reference"], item["reference_idx"]
    
    # compute regex accuracy
    hypothesis = llama.generate(prompt, config)
    regex_accuracy = compute_regex_accuracy(hypothesis, reference)
    llama_mmlu_scores["regex_accuracy"].append(regex_accuracy)

    # compute logit accuracy
    logits = llama.logits(prompt)
    logit_accuracy = compute_logit_accuracy(logits, reference, llama)
    llama_mmlu_scores["logit_accuracy"].append(logit_accuracy)


for metric_name, metric_scores in llama_mmlu_scores.items():
    print(f"{metric_name}: {np.mean(metric_scores):.3f}")

In [None]:
# Acc: 5 seconds for 30 samples. 
# - regex: 0.933 for 30 samples. 
# - logit: 0.900 for 30 samples. 

# TODO: See how long evals take for the Llama and Qwen on PACE GPUs. 
# This shouldn't take long given that students have access to PACE GPUs. 
# ~120 seconds for Llama on A40 for 30 samples. 
# TODO: [the reported scores](https://huggingface.co/Qwen/Qwen2-7B-Instruct#evaluation) 
# are around ~70% for both Llama and Qwen. Let's prompt/hp tune to match these scores? 

# DO NOT CHANGE THIS CELL

config = LLMGenerationConfig(
    temperature=0,
    top_p=None,
    top_k=None,
    max_new_tokens=5
)
prompt_template = open("prompts/mcq/default.txt").read()

qwen_mmlu_scores = defaultdict(list)
for item in tqdm(mmlu_test_data[:30], desc="Evaluating dataset"):
    prompt = prompt_template.format(**item) if prompt_template else item["question"]
    reference, reference_idx = item["reference"], item["reference_idx"]
    
    # compute regex accuracy
    hypothesis = qwen.generate(prompt, config)
    regex_accuracy = compute_regex_accuracy(hypothesis, reference)
    qwen_mmlu_scores["regex_accuracy"].append(regex_accuracy)

    # compute logit accuracy
    logits = qwen.logits(prompt)
    logit_accuracy = compute_logit_accuracy(logits, reference, qwen)
    qwen_mmlu_scores["logit_accuracy"].append(logit_accuracy)


for metric_name, metric_scores in qwen_mmlu_scores.items():
    print(f"{metric_name}: {np.mean(metric_scores):.3f}")

### 2.6. Shuffling Choices

When evaluating LLMs on multiple choice questions, we need to be careful about memorization effects.
Memorization occurs when an LLM has seen very similar questions during training,
and can simply recall the correct answer rather than reasoning about the question.

For example, if an LLM was trained on practice tests that contained the exact same multiple choice
question with answers in the same order (A, B, C, D), it might memorize that "A" was correct without
actually understanding the question.

To help distinguish between true reasoning ability and memorization, we can randomly shuffle the order
of answer choices while preserving which answer is correct. This way, even if the LLM has seen the
question before, it needs to identify the correct answer based on content rather than position.

Re-evaluate the two models in the shuffled case and report the resulting accuracies. 
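
The key detail when shuffling is that the reference letter has to move together with the correct choice. A minimal sketch, assuming a hypothetical helper `shuffle_choices` and a made-up question:

```python
import random

LETTERS = ["A", "B", "C", "D"]

def shuffle_choices(choices, answer_idx, rng):
    """Shuffle the choices and return (shuffled_choices, new_answer_idx)."""
    order = list(range(len(choices)))  # order[j] = original index shown at position j
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    new_answer_idx = order.index(answer_idx)  # where the correct choice ended up
    return shuffled, new_answer_idx

rng = random.Random(42)  # local RNG so the example is reproducible
choices = ["Paris", "London", "Berlin", "Madrid"]  # correct answer: "Paris" (index 0)

shuffled, new_idx = shuffle_choices(choices, 0, rng)
print(shuffled, "-> reference:", LETTERS[new_idx])
```

If the reference were left at its original letter, the shuffled evaluation would mostly measure noise, since the labelled letter would no longer point at the correct choice.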

In [344]:
# DO NOT CHANGE THIS CELL

# Load MMLU dataset from Huggingface
mmlu_test_raw = load_dataset("lighteval/mmlu", "all", split="test")

# Convert to format expected by evaluate_pair_metric
shuffled_mmlu_test_data = []
LETTERS = ["A", "B", "C", "D"]

# See: `prompts/mcq/default.txt` for the prompt template format 
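for item in mmlu_test_raw:
    # Shuffle the choice order; shuffled_indices[j] is the original index shown at position j
    shuffled_indices = list(range(len(item["choices"])))
    random.shuffle(shuffled_indices)
    # The correct answer moves to wherever its original index landed
    shuffled_answer = shuffled_indices.index(item["answer"])
    eval_item = {
        "question": item["question"], 
        "option_A": item["choices"][shuffled_indices[0]],
        "option_B": item["choices"][shuffled_indices[1]],
        "option_C": item["choices"][shuffled_indices[2]],
        "option_D": item["choices"][shuffled_indices[3]],
        "reference": LETTERS[shuffled_answer],
        "reference_idx": shuffled_answer
    }
    shuffled_mmlu_test_data.append(eval_item)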


In [None]:
# Acc: 5 seconds for 30 samples. 
# - regex: 0.933 for 30 samples. 
# - logit: 0.900 for 30 samples. 

# TODO: See how long evals take for the Llama and Qwen on PACE GPUs. 
# This shouldn't take long given that students have access to PACE GPUs. 
# ~120 seconds for Llama on A40 for 30 samples. 
# TODO: [the reported scores](https://huggingface.co/Qwen/Qwen2-7B-Instruct#evaluation) 
# are around ~70% for both Llama and Qwen. Let's prompt/hp tune to match these scores? 

# DO NOT CHANGE THIS CELL

config = LLMGenerationConfig(
    temperature=0,
    top_p=None,
    top_k=None,
    max_new_tokens=5
)
prompt_template = open("prompts/mcq/default.txt").read()

shuffled_llama_mmlu_scores = defaultdict(list)
for item in tqdm(shuffled_mmlu_test_data[:30], desc="Evaluating dataset"):
    prompt = prompt_template.format(**item) if prompt_template else item["question"]
    reference, reference_idx = item["reference"], item["reference_idx"]
    
    # compute regex accuracy
    hypothesis = llama.generate(prompt, config)
    regex_accuracy = compute_regex_accuracy(hypothesis, reference)
    shuffled_llama_mmlu_scores["regex_accuracy"].append(regex_accuracy)

    # compute logit accuracy
    logits = llama.logits(prompt)
    logit_accuracy = compute_logit_accuracy(logits, reference, llama)
    shuffled_llama_mmlu_scores["logit_accuracy"].append(logit_accuracy)


for metric_name, metric_scores in shuffled_llama_mmlu_scores.items():
    print(f"{metric_name}: {np.mean(metric_scores):.3f}")

In [None]:
# Acc: 5 seconds for 30 samples. 
# - regex: 0.933 for 30 samples. 
# - logit: 0.900 for 30 samples. 

# TODO: See how long evals take for the Llama and Qwen on PACE GPUs. 
# This shouldn't take long given that students have access to PACE GPUs. 
# ~120 seconds for Llama on A40 for 30 samples. 
# TODO: [the reported scores](https://huggingface.co/Qwen/Qwen2-7B-Instruct#evaluation) 
# are around ~70% for both Llama and Qwen. Let's prompt/hp tune to match these scores? 

# DO NOT CHANGE THIS CELL

config = LLMGenerationConfig(
    temperature=0,
    top_p=None,
    top_k=None,
    max_new_tokens=5
)
prompt_template = open("prompts/mcq/default.txt").read()

# Use a separate name so the unshuffled Qwen scores are not overwritten
shuffled_qwen_mmlu_scores = defaultdict(list)
for item in tqdm(shuffled_mmlu_test_data[:30], desc="Evaluating dataset"):
    prompt = prompt_template.format(**item) if prompt_template else item["question"]
    reference, reference_idx = item["reference"], item["reference_idx"]
    
    # compute regex accuracy
    hypothesis = qwen.generate(prompt, config)
    regex_accuracy = compute_regex_accuracy(hypothesis, reference)
    shuffled_qwen_mmlu_scores["regex_accuracy"].append(regex_accuracy)

    # compute logit accuracy
    logits = qwen.logits(prompt)
    logit_accuracy = compute_logit_accuracy(logits, reference, qwen)
    shuffled_qwen_mmlu_scores["logit_accuracy"].append(logit_accuracy)


for metric_name, metric_scores in shuffled_qwen_mmlu_scores.items():
    print(f"{metric_name}: {np.mean(metric_scores):.3f}")

## 3. Machine Translation Evaluation [25 points - Programming]

In this section, we'll be evaluating LLMs on German-to-English machine translation. In machine translation, models automatically convert text from one language to another while preserving meaning and fluency.

We'll implement three different evaluation metrics:

1. **BLEU Score**: A precision-based metric that measures the overlap of n-grams between the machine translation and reference translations. It's one of the most widely used metrics in machine translation evaluation.

2. **N-gram Overlap (BLEU)**: BLEU works by counting matching sequences of words (n-grams) between the candidate and reference translations. For example, a 1-gram (unigram) checks individual word matches, while a 4-gram checks matches of four consecutive words. This approach focuses on precision but can miss semantic equivalence when different words express the same meaning.

3. **Embedding-based Similarity**: Instead of direct word matching, we use an embedding model to convert sentences into vector representations that capture semantic meaning. 

Each metric has its strengths and limitations, which we'll explore throughout this section. We'll use these metrics to evaluate the translation capabilities of our LLMs.


### 3.1. BLEU Score [8 points]

BLEU (Bilingual Evaluation Understudy) is a widely used metric for evaluating machine translation quality. It works by:

1. Comparing n-grams (sequences of n consecutive words) between the candidate translation and reference translation
2. Calculating precision scores for different n-gram sizes (typically 1-4)
3. Applying a brevity penalty to penalize translations that are too short

The final BLEU score is a geometric mean of these precision scores, multiplied by the brevity penalty.
(In practice, there are a lot more bells and whistles. See [SacreBLEU](https://github.com/mjpost/sacrebleu/tree/master) for a popular implementation.)

In the following cells, we'll implement:
- A helper function to generate n-grams from a sequence of tokens
- A function to calculate modified precision for each n-gram size
- The main BLEU score computation function


In [347]:
def _modified_precision(candidate: List[str], reference: List[str], n: int) -> float:
    """
    Calculate modified precision for n-grams used in the BLEU score calculation.
    
    This function computes the clipped count of n-grams in the candidate translation
    that appear in the reference translation, divided by the total number of n-grams
    in the candidate.
    
    Args:
        candidate (List[str]): List of tokens from the candidate translation
        reference (List[str]): List of tokens from the reference translation
        n (int): The n-gram size to consider
            
    Returns:
        float: The modified precision score for the specified n-gram size
    """
    cand_ngrams = Counter(_ngrams(candidate, n))
    ref_ngrams = Counter(_ngrams(reference, n))
    
    overlap = {ng: min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items()}
    num = sum(overlap.values())
    denom = sum(cand_ngrams.values())
    
    return (num / denom) if denom > 0 else 0.0

def compute_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """
    Compute BLEU score between candidate and reference translations.
    
    The BLEU (Bilingual Evaluation Understudy) score is a metric for evaluating
    machine translation quality. It measures the precision of n-grams in the
    candidate translation with respect to the reference translation, combined
    with a brevity penalty for short translations.
    
    Args:
        candidate (str): Generated translation text
        reference (str): Reference translation text
        max_n (int): Maximum n-gram order to consider (default: 4)
        
    Returns:
        float: BLEU score between 0 and 1, where higher values indicate better translations
    """
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    
    # Calculate modified precision for each n-gram order
    precisions = []
    for n in range(1, max_n + 1):
        precisions.append(_modified_precision(cand_tokens, ref_tokens, n))
    
    # Geometric mean with smoothing
    eps = 1e-9
    geo_mean = np.exp(sum(np.log(p + eps) for p in precisions) / max_n)
    
    # Brevity penalty
    c, r = len(cand_tokens), len(ref_tokens)
    bp = 1.0 if c > r else np.exp(1 - r / max(c, 1))
    
    return bp * geo_mean

In [None]:
# TODO: implement test. 

### 3.2. N-gram Overlap

N-gram overlap is a simple metric that measures how many n-grams (sequences of n consecutive words) are shared between the generated text and the reference text. Unlike BLEU, which combines precision across multiple n-gram lengths with a brevity penalty, n-gram overlap focuses on a single n-gram size and computes recall: the fraction of unique reference n-grams that also appear in the candidate text.

This metric is useful for evaluating:
- Content coverage: How much of the reference content is captured in the generation
- Lexical similarity: The degree to which the same word sequences are used

N-gram overlap is particularly helpful for tasks where we want to ensure specific information from a reference is included in the generated text, such as summarization or story generation.

Implement the `compute_ngram_overlap()` function below.
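
As a small worked example of the recall computation, the sketch below compares the bigrams of two toy sentences. The helper name `bigram_set` and the sentences are illustrative assumptions, not part of the assignment data.

```python
from typing import Set, Tuple


def bigram_set(text: str) -> Set[Tuple[str, str]]:
    """Return the set of unique bigrams in a whitespace-tokenized text (hypothetical helper)."""
    tokens = text.split()
    return set(zip(tokens, tokens[1:]))


reference_bigrams = bigram_set("the cat is on the mat")
candidate_bigrams = bigram_set("the cat sat on the mat")

# Bigrams present in both texts: (the, cat), (on, the), (the, mat)
shared = reference_bigrams & candidate_bigrams

# Recall: shared bigrams divided by the number of unique reference bigrams (3 / 5)
recall = len(shared) / len(reference_bigrams)
```

Because the denominator is the reference set, a candidate cannot raise this score by adding extra words, only by covering more of the reference.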

In [348]:
def compute_ngram_overlap(candidate: str, reference: str, n: int = 2) -> float:
    """
    This function computes the n-gram overlap between a candidate text and a reference text.
    
    The n-gram overlap is calculated as the fraction of unique reference n-grams that are also
    found in the candidate text. This metric helps evaluate how well the generated text
    captures the content of the reference text.
    
    Args:
        candidate (str): Generated text to be evaluated
        reference (str): Reference text to compare against
        n (int): The size of n-grams to consider (default: 2, which means bigrams)
        
    Returns:
        float: Fraction of reference n-grams found in candidate, ranging from 0.0 to 1.0,
              where higher values indicate better overlap
    """
    cand_ngrams = set(_ngrams(candidate.split(), n))
    ref_ngrams = set(_ngrams(reference.split(), n))
    
    if not ref_ngrams:
        return 0.0
        
    return len(cand_ngrams & ref_ngrams) / len(ref_ngrams)

In [278]:
# TODO: implement test. 

### 3.3. Embedding-based Similarity


Embedding-based similarity metrics go beyond surface-level text matching by capturing semantic relationships between words and phrases. Unlike n-gram overlap or BLEU, which rely on exact matches, embedding-based methods can recognize when different words express similar meanings.

This approach works by:
1. Converting texts into dense vector representations (embeddings) using pre-trained models
2. Measuring the similarity between these vectors using metrics like cosine similarity
3. Producing a score that reflects semantic similarity rather than lexical overlap

Implement the `compute_embedding_similarity()` function below to calculate the semantic similarity between generated and reference texts.
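
To build intuition for step 2, here is a minimal pure-Python sketch of cosine similarity on toy vectors. The 2-dimensional vectors stand in for real sentence embeddings, and `cosine_similarity_sketch` is a hypothetical name used only for this illustration.

```python
import math
from typing import Sequence


def cosine_similarity_sketch(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Same direction, different magnitude: similarity close to 1
same_direction = cosine_similarity_sketch([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity 0
orthogonal = cosine_similarity_sketch([1.0, 0.0], [0.0, 1.0])
# Opposite directions: similarity -1
opposite = cosine_similarity_sketch([1.0, 0.0], [-1.0, 0.0])
```

Cosine similarity ignores vector length and depends only on direction, which is why it is a common choice for comparing embeddings.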


In [349]:
def compute_embedding_similarity(
    generation: str, 
    reference: str, 
    embedder: EmbeddingModel = None
) -> float:
    """
    This function computes the semantic similarity between a generated text and a reference text
    using embeddings and cosine similarity.
    
    Embedding-based similarity captures semantic meaning beyond exact word matches, allowing
    for evaluation of paraphrases and texts that convey similar meaning with different words.
    
    Args:
        generation (str): Generated text to be evaluated
        reference (str): Reference text to compare against
        embedder (EmbeddingModel): Model to create text embeddings. If None, a default model is used.
        
    Returns:
        float: Cosine similarity between the embeddings, ranging from -1.0 to 1.0,
              where higher values indicate greater semantic similarity
    """
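    # Fall back to a default model, as described in the docstring
    embedder = embedder or EmbeddingModel()

    # Keep the embeddings as torch tensors: F.cosine_similarity does not accept NumPy arrays.
    # This assumes embed() returns a torch tensor, as the original .numpy() calls imply.
    gen_emb = embedder.embed(generation)
    ref_emb = embedder.embed(reference)

    # dim=-1 handles both 1-D embeddings and (1, d) batches; .item() returns a Python float
    return F.cosine_similarity(gen_emb, ref_emb, dim=-1).item()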

In [None]:
# TODO: implement test. 

### 3.4. Test Machine Translation Metrics

Now let's test our machine translation metrics on a small dataset of German-to-English translations.

In this section, we will:
1. Load a test dataset containing German source sentences and English reference translations
2. Use our LLM to generate English translations from the German source
3. Evaluate the translations using the metrics we've implemented (BLEU, n-gram overlap, and embedding similarity)

This will demonstrate how these metrics can be used to assess machine translation quality.
Note that in a real-world scenario, you would typically compare multiple MT systems or
evaluate against human judgments to get a more comprehensive assessment.


In [15]:
# DO NOT CHANGE THIS CELL
# load data

source_text = open("data/mt/test.de-en.de", 'r').read().split("\n")
target_text = open("data/mt/test.de-en.en", 'r').read().split("\n")

mt_test_raw = list(zip(source_text, target_text))


In [16]:
# DO NOT CHANGE THIS CELL

# Convert to format expected by evaluate_pair_metric
mt_test_data = []

# See: `prompts/mt/default.txt` for the prompt template format 
for item in mt_test_raw:
    eval_item = {
        "source_language": "de",
        "target_language": "en",
        "source_text": item[0],
        "target_text": item[1]
    }
    mt_test_data.append(eval_item)


In [None]:
# Acc: 
# - regex: 0.967 for 30 samples. 
# - logit: 0.933 for 30 samples. 

# TODO: See how long evals take for the Llama and Qwen on PACE GPUs. 
# This shouldn't take long given that students have access to PACE GPUs. 
# ~120 seconds for Llama on A40 for 30 samples. 
# TODO: [the reported scores](https://huggingface.co/Qwen/Qwen2-7B-Instruct#evaluation) 
# are around ~70% for both Llama and Qwen. Let's prompt/hp tune to match these scores? 

# DO NOT CHANGE THIS CELL


max_new_tokens = max(len(item["target_text"].split()) for item in mt_test_data)
print(f"max_new_tokens: {max_new_tokens}")

config = LLMGenerationConfig(
    temperature=0.7,
    top_p=0.9,
    top_k=None,
    max_new_tokens=max_new_tokens
)
prompt_template = open("prompts/mt/default.txt").read()

llama_mt_scores = defaultdict(list)
for item in tqdm(mt_test_data[:5], desc="Evaluating dataset"):
    prompt = prompt_template.format(**item) if prompt_template else item["source_text"]
    reference = item["target_text"]

    completion = llama.generate(prompt, config)[:len(reference)]
    print(f"prompt: {prompt}")
    print(f"COMPLETION: {completion}")
    print(f"REFERENCE: {reference}")
    print("-"*100)

    # # compute BLEU score accuracy 
    # bleu_score = compute_bleu(completion, reference)
    # llama_mt_scores["bleu"].append(bleu_score)

    # # compute n-gram overlap
    # ngram_overlap = compute_ngram_overlap(completion, reference)
    # llama_mt_scores["ngram_overlap"].append(ngram_overlap)

    # compute embedding similarity
    embedding_similarity = compute_embedding_similarity(completion, reference)
    llama_mt_scores["embedding_similarity"].append(embedding_similarity)


for metric_name, metric_scores in llama_mt_scores.items():
    print(f"{metric_name}: {np.mean(metric_scores):.3f}")

In [159]:
# TODO: semantic similarity 

## 4. Short Story Generation Evaluation [25 points - Programming]

### 4.1. Distinct N-grams [6 points]

In this section, we'll implement a metric to evaluate the diversity of generated text.

Distinct n-grams is a common metric used to measure the diversity and repetitiveness of generated text.
It calculates the ratio of unique n-grams to the total number of n-grams in the text.

A higher distinct n-gram ratio indicates:
- More diverse vocabulary usage
- Less repetition in the generated text
- Potentially more creative and interesting content

This metric is particularly useful for evaluating story generation, where we want
the model to produce varied and engaging content rather than repetitive patterns.
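
The ratio can be illustrated with a short sketch on toy strings. The function name `distinct_ratio_sketch` is hypothetical, and returning 0.0 for texts shorter than n tokens is a convention assumed here.

```python
def distinct_ratio_sketch(text: str, n: int = 2) -> float:
    """Unique n-grams divided by total n-grams (hypothetical helper)."""
    tokens = text.split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    # Assumed convention: texts with no n-grams score 0.0
    return len(set(grams)) / len(grams) if grams else 0.0


# Five bigrams, but only (a, b) and (b, a) are distinct: 2 / 5
repetitive = distinct_ratio_sketch("a b a b a b")
# Five bigrams, all distinct: 5 / 5
varied = distinct_ratio_sketch("a b c d e f")
```

The repetitive string scores much lower than the varied one, even though both have the same length.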


In [160]:
def compute_distinct_ngrams(text: str, n: int = 2) -> float:
    """
    This function computes the distinct n-gram ratio, which is a measure of text diversity.
    
    The distinct n-gram ratio is calculated by dividing the number of unique n-grams
    by the total number of n-grams in the text. A higher ratio indicates more diverse text.
    
    Args:
        text (str): The generated text to evaluate
        n (int): The n-gram order (default: 2 for bigrams)
        
    Returns:
        float: The ratio of unique n-grams to total n-grams, ranging from 0.0 to 1.0
              where 1.0 means all n-grams are unique
    """
    tokens = text.split()
    total = max(len(tokens) - n + 1, 1)
    unique = len(set(_ngrams(tokens, n)))
    
    return unique / total

### 4.2. Embedding Diversity [7 points]

In this section, we'll implement a metric to evaluate the diversity of generated text using embeddings.

While distinct n-grams measure lexical diversity (word-level), embedding diversity captures semantic diversity.
This metric uses vector representations (embeddings) of the generated texts to measure how different they are
from each other in the semantic space.

The embedding diversity metric works by:
1. Converting each generated text into an embedding vector
2. Computing the cosine similarity between the first generation (used as the reference) and each of the other generations
3. Calculating 1 minus the average similarity as the diversity score

A higher embedding diversity score indicates:
- More varied semantic content across generations
- Less thematic repetition between different generated texts
- Potentially more creative exploration of the topic space
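
The sketch below walks through these steps on toy 2-dimensional vectors that stand in for real embeddings. It compares every vector against the first one, which is one possible way to pair the embeddings. The names `_cos` and `diversity_sketch` are hypothetical.

```python
import math
from typing import List, Sequence


def _cos(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 for zero vectors (assumed convention)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def diversity_sketch(embeddings: List[Sequence[float]]) -> float:
    """1 minus the mean cosine similarity between the first embedding and the rest."""
    if len(embeddings) < 2:
        return 0.0
    reference = embeddings[0]
    sims = [_cos(reference, emb) for emb in embeddings[1:]]
    return 1.0 - sum(sims) / len(sims)


# All generations point the same way: no diversity
identical = diversity_sketch([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
# One orthogonal and one identical generation: similarities 0 and 1, diversity 0.5
mixed = diversity_sketch([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```

Note that a score built this way depends on which generation comes first; averaging over all pairs would remove that dependence at a higher computational cost.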


In [161]:
def compute_embedding_diversity(
    generations: List[str], 
    embedder: EmbeddingModel = None
) -> float:
    """
    This function computes the diversity of multiple generated texts using embeddings.
    
    The diversity score is calculated by embedding each generation, computing the cosine
    similarity between the first generation (used as reference) and all other generations,
    and then returning 1 minus the mean similarity. A higher score indicates more diverse
    generations.
    
    Args:
        generations (List[str]): A list of generated texts to evaluate for diversity
        embedder (EmbeddingModel): The embedding model to use for converting text to vectors.
                                  If None, a default EmbeddingModel will be instantiated.
        
    Returns:
        float: Diversity score (1 - mean cosine similarity), where higher values indicate
              more diverse generations. Since cosine similarity lies in [-1, 1], the score
              lies in [0, 2]. Returns 0.0 if fewer than 2 generations are provided.
    """
    if len(generations) < 2:
        return 0.0
        
    embedder = embedder or EmbeddingModel()
    
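    # Use first generation as reference.
    # Keep embeddings as torch tensors: F.cosine_similarity does not accept NumPy arrays.
    ref_emb = embedder.embed(generations[0])
    
    similarities = []
    for gen in generations[1:]:
        gen_emb = embedder.embed(gen)
        # dim=-1 handles 1-D embeddings and (1, d) batches; .item() yields a Python float
        similarities.append(F.cosine_similarity(gen_emb, ref_emb, dim=-1).item())
    
    # Diversity = 1 - mean similarity
    return 1.0 - float(np.mean(similarities))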

In [162]:
# TODO: add test 

### 4.3. Perplexity Computation [6 points]

In this section, we'll implement a function to compute the perplexity of generated text.
Perplexity is a common metric used to evaluate language models, measuring how "surprised" 
a model is by a given text. Lower perplexity indicates that the model finds the text more 
predictable and natural. This metric gives you a quantitative way to assess the quality 
of language model outputs, complementing human evaluation.

Perplexity is defined as the exponentiated average negative log-likelihood of a sequence:

$$\mathrm{PPL}(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^N \log P(w_i \mid w_1,\dots,w_{i-1})\right)$$

where $W = (w_1, w_2, \dots, w_N)$ is a sequence of $N$ words (or tokens), and $P(w_i \mid w_1,\dots,w_{i-1})$ is the
conditional probability of word $w_i$ given the preceding words $w_1$ through $w_{i-1}$.
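
As a quick numerical check of the definition, the sketch below computes perplexity directly from a list of per-token conditional probabilities. The probabilities are made up for illustration, and `perplexity_from_probs` is a hypothetical helper, not the `LLM.perplexity` method used later.

```python
import math
from typing import Sequence


def perplexity_from_probs(token_probs: Sequence[float]) -> float:
    """Exponentiated average negative log-likelihood of the given token probabilities."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that assigns probability 1/4 to every token is as uncertain as
# a uniform choice among 4 options, so its perplexity is 4
uniform = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])

# Higher per-token probabilities give a perplexity closer to 1
confident = perplexity_from_probs([0.9, 0.8, 0.95])
```

One way to read the value is as an effective branching factor: the number of equally likely choices the model behaves as if it faces at each step.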



In [163]:
def compute_perplexity(text: str, big_llm: LLM = None) -> float:
    """
    This function computes the perplexity of a given text using a language model.
    
    Perplexity is a measurement of how well a probability model predicts a sample.
    Lower perplexity indicates the language model is better at predicting the text.
    
    Args:
        text (str): The generated text to evaluate. This should be a coherent piece
                   of text that we want to measure the perplexity of.
        big_llm (LLM): A language model instance used for perplexity computation.
                      If None is provided, a default LLM instance will be created.
        
    Returns:
        float: The perplexity value of the input text. Lower values indicate the text
              is more predictable according to the language model.
    """
    big_llm = big_llm or LLM()  # Will fall back to stub if HF missing
    return big_llm.perplexity(text)

In [164]:
# TODO: add test

### 4.4. Coherence Metric [6 points]

Coherence is another important metric for evaluating language model outputs, particularly for
tasks like story generation. It measures how well the generated text maintains consistent meaning,
logical flow, and semantic relatedness to a reference text or prompt.

In this implementation, we use embedding-based semantic similarity to quantify coherence.
The model computes embeddings for both the generated text and reference text, then calculates
their cosine similarity. 

Higher similarity scores indicate better coherence between the generated and reference texts.


In [165]:
def compute_coherence(generations: List[str], reference: str, model: EmbeddingModel) -> float:
    """
    This function computes the coherence between generated text and a reference text
    by computing the embedding-based semantic similarity between the two texts.
    
    Coherence measures how well the generated text maintains consistent meaning and
    logical flow compared to the reference. Higher similarity suggests the
    generated text follows similar patterns to the reference.
    
    Args:
        generations (List[str]): The list of generated texts to evaluate for coherence
        reference (str): The reference text (gold standard) to compare against
        model (EmbeddingModel): The embedding model to use for computing text embeddings
        
    Returns:
        float: Coherence score based on embedding similarity. Higher values indicate
              better coherence between the generated and reference texts.
    """
    # Get the embedding for the reference text
    reference_embedding = model.embed(reference)
    
    # Get embeddings for all generated texts
    generation_embeddings = [model.embed(gen) for gen in generations]
    
    # Calculate cosine similarity between reference and each generation
    similarities = []
    for gen_embedding in generation_embeddings:
        # dim=-1 works for 1-D embeddings and (1, d) batches; .item() converts the
        # single-element result to a Python float, so np.mean also works for GPU tensors
        similarity = F.cosine_similarity(reference_embedding, gen_embedding, dim=-1).item()
        similarities.append(similarity)
    
    # Return the average similarity as the coherence score
    return float(np.mean(similarities))

In [166]:
# TODO: add test

### 4.5. Ensemble Evaluation Pipeline

Now we'll implement a function to evaluate ensemble metrics for short story generation.
This function will:
1. Generate multiple completions for each example in the dataset
2. Apply a metric function that works on multiple generations (like diversity or coherence)
3. Return the mean score across all examples and per-example scores

This is similar to `evaluate_pair_metric()` but designed for metrics that operate on
multiple generations rather than pairs of generations and references.


In [167]:
def evaluate_ensemble_metric(
    dataset: List[Dict],
    generate_fn: Callable[[Dict], str],
    metric_fn: Callable[[List[str], str], float],
    num_completions: int = 5,
) -> Dict[str, object]:
    """
    Evaluate a dataset using metrics that work on multiple generations per example.
    
    Args:
        dataset (List[Dict]): List of evaluation examples
        generate_fn (Callable): Function that generates text from an example
        metric_fn (Callable): Function that computes metric from list of generations and reference
        num_completions (int): Number of completions to generate per example
        
    Returns:
        Dict[str, object]: Results with 'mean' (float) and 'per_example' (List[float]) scores
    """
    scores = []
    for item in dataset:
        # Generate multiple completions
        hypotheses = [generate_fn(item) for _ in range(num_completions)]
        reference = item["reference"]
        scores.append(metric_fn(hypotheses, reference))
    
    return {
        "mean": float(np.mean(scores)),
        "per_example": scores
    }

### 4.6. Test Short Story Generation Metrics

In [168]:
# TODO: 

## 5. Sampling Hyperparameters and Prompt Optimization

In this section, we'll explore how different sampling hyperparameters and prompt engineering techniques affect the LLM's performance on MMLU. 


Sampling hyperparameters like temperature, top_p, and top_k control how the model selects the next token during generation:
- **Temperature**: Controls randomness. Higher values (e.g., 1.5) produce more diverse outputs, while lower values (e.g., 0.1) make outputs more deterministic.
- **Top_p (nucleus sampling)**: Only considers the smallest set of most likely tokens whose cumulative probability reaches the threshold p. Lower values restrict sampling to fewer, more likely tokens.
- **Top_k**: Only considers the k most likely next tokens. Lower values are more restrictive.
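
The three hyperparameters above can be sketched on a toy next-token distribution. This is a simplified illustration in pure Python, not the implementation used by any particular library. The function names and the toy vocabulary are assumptions.

```python
import math
from typing import Dict


def softmax_with_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Divide logits by the temperature, then normalize with a softmax."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_scaled = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - max_scaled) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest set of most likely tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


toy_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
nucleus = top_p_filter(toy_probs, p=0.9)  # keeps a, b, c (cumulative 0.95)
top2 = top_k_filter(toy_probs, k=2)       # keeps a, b

toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0}
sharp = softmax_with_temperature(toy_logits, temperature=0.5)  # more peaked
flat = softmax_with_temperature(toy_logits, temperature=2.0)   # more uniform
```

After filtering, the remaining probabilities are renormalized so that sampling still draws from a valid distribution.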




### 5.1. Sampling Hyperparameters

In this section, you will experiment with different sampling hyperparameters to see how they affect model performance. Try varying temperature, top_p, and top_k values and observe the impact on generation quality. Report the combinations that resulted in the highest and lowest performance on MMLU.

In [169]:
## YOUR CODE HERE

### 5.2. Chain-of-Thought Prompting

Chain-of-Thought (CoT) prompting ([Wei et al., 2022](https://arxiv.org/abs/2201.11903)) is a technique that encourages the model to break down complex reasoning tasks into intermediate steps. By asking the model to "think step by step" or show its reasoning process, CoT can significantly improve performance on tasks that require multi-step reasoning.

Implement CoT prompting for MMLU. Compare the model's performance with and without CoT prompting. Analyze how explicitly asking the model to reason through each answer choice affects its accuracy.
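
One possible starting point is sketched below: a prompt builder that asks for step-by-step reasoning, and a parser that pulls the final letter out of the completion. The prompt wording, the `Answer:` output convention, and both function names are assumptions made for illustration, not the assignment's required format.

```python
import re
from typing import List, Optional


def build_cot_prompt(question: str, choices: List[str]) -> str:
    """Format a four-choice question and ask for reasoning before the final answer."""
    letters = "ABCD"  # MMLU questions have four answer choices
    lines = [f"Question: {question}"]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append(
        "Let's think step by step. Finish with a final line that starts with "
        "'Answer:' followed by a single letter."
    )
    return "\n".join(lines)


def extract_answer(completion: str) -> Optional[str]:
    """Return the last letter that follows 'Answer:' in the completion, or None."""
    matches = re.findall(r"Answer:\s*([A-D])\b", completion)
    return matches[-1] if matches else None


prompt = build_cot_prompt("What is 2 + 2?", ["3", "4", "5", "6"])
parsed = extract_answer("2 + 2 equals 4, which is option B.\nAnswer: B")
```

Taking the last match rather than the first guards against the model mentioning a candidate answer in the middle of its reasoning.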



In [None]:
## YOUR CODE HERE

### 5.3. Few-shot Prompting

Implement few-shot prompting for MMLU using the provided dev sets. Show how providing examples in the prompt can improve performance.
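
A few-shot prompt can be assembled by placing solved dev examples before the unsolved test question. The sketch below assumes each example is a dict with hypothetical keys `question`, `choices`, and `answer`. The actual field names in the provided dev sets may differ.

```python
from typing import Dict, List


def format_example(example: Dict, include_answer: bool) -> str:
    """Render one four-choice question, optionally followed by its gold answer."""
    letters = "ABCD"
    lines = [f"Question: {example['question']}"]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(example["choices"])]
    lines.append(f"Answer: {example['answer']}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(dev_examples: List[Dict], test_example: Dict, k: int = 2) -> str:
    """Concatenate k solved dev examples followed by the unsolved test example."""
    shots = [format_example(ex, include_answer=True) for ex in dev_examples[:k]]
    shots.append(format_example(test_example, include_answer=False))
    return "\n\n".join(shots)


# Toy data for illustration only
dev = [
    {"question": "What is 1 + 1?", "choices": ["1", "2", "3", "4"], "answer": "B"},
    {"question": "What is 3 - 1?", "choices": ["2", "3", "4", "5"], "answer": "A"},
]
test_item = {"question": "What is 2 + 2?", "choices": ["3", "4", "5", "6"]}

few_shot_prompt = build_few_shot_prompt(dev, test_item, k=2)
```

Ending the prompt with a bare `Answer:` line nudges the model to continue with just the letter, mirroring the format of the solved examples.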

In [170]:
## YOUR CODE HERE

### 5.4. Maximize Performance

Design prompt templates that maximize performance for each task (MCQ, MT, SSG). Consider instruction clarity, example quality, and output format specification.

In [171]:
## YOUR CODE HERE

## 6. Submitting Your Assignment

This is the end. Congratulations!  

Now, follow the steps below to submit your homework on Gradescope.

### 6.1. Programming

The programming portion will be evaluated by an autograder. To create the files to submit to the autograder, follow the steps below:
1. Open a terminal in the root directory of the project
2. Run the `collect_submission.py` file
3. Agree to the Late Policy and Honor Pledge
4. After the script finishes, your project root will contain a submission directory
5. Submit all the contents of this directory to Gradescope

### 6.2. Non-Programming

The analysis parts will be evaluated manually. For this, export the notebook to a PDF file and submit it on Gradescope. Please ensure no code or output is clipped when you create your PDF. One reliable way to do this is to first download the notebook as HTML through Jupyter Notebook and then print it to PDF.