## Lab 1 – Evaluating Language Models: hands-on with MMLU

The objective of this lab session is to get familiar with autoregressive language models. In particular, we will work with a historically relevant model: GPT-2. You can find the Hugging Face model card [here](https://huggingface.co/openai-community/gpt2).

We are going to set up the basic code to _evaluate_ GPT-2 on a multiple-choice question-answering (MCQA) task. This involves downloading the data (2.) and inspecting it so that we understand how to build a prompt (3.). In other tutorials we will experiment more with how different prompts interact with different models, so it is important to understand how to generate prompts dynamically. In this lab we introduce `langchain`'s `PromptTemplate`.

Finally, there are different ways in which one could evaluate the model on MCQA. We will choose to implement one that compares the average log-probabilities over the tokens of the different possible answers (4.).

The lab should run on a CPU (albeit somewhat slowly). You can also try the free-tier GPU on [Google Colab](https://colab.research.google.com/).

### 1. Imports & Global Config

In [1]:
!pip -q install datasets transformers langchain --upgrade


In [2]:
import random
import os
from typing import List, Dict

import datasets
from datasets import load_dataset, Dataset, DatasetDict

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

from langchain_core.prompts.prompt import PromptTemplate

MODEL_NAME = "openai-community/gpt2"  # the GPT-2 checkpoint from the model card linked above



### 2. Loading the dataset

We are going to work with a dataset called Measuring Massive Multitask Language Understanding (MMLU); you can find the paper [here](https://arxiv.org/pdf/2009.03300).

MMLU was designed as a challenging evaluation for language models, at a time when the first large-scale models were appearing and saturating many existing benchmarks. As such, this will be a difficult task for GPT-2!

As we will see below, it is a multiple-choice question-answering (MCQA) dataset. As with any new dataset, it is worth spending some time looking at the data to get a feel for it!

Go to Hugging Face and find the `cais/mmlu` dataset. Note down the structure of the dataset.

We will work with the `high_school_biology` task for now. Note that this dataset does not have a training split (they provide QA data from other datasets instead), so we will download just the `test` split.

In [3]:

data = load_dataset("cais/mmlu", "high_school_biology", split="test")
data

Dataset({
    features: ['question', 'subject', 'choices', 'answer'],
    num_rows: 310
})
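To make the structure concrete, here is a minimal sketch that uses a hand-written record with the same four fields. The `sample` dictionary and the `answer_letter` helper are illustrative assumptions (the record is not taken from the dataset); the point is only that `answer` is an integer index into `choices`, which has to be mapped to a letter before comparing it with a predicted label.

```python
from string import ascii_uppercase

# Hypothetical record that mimics the schema of one MMLU row
# (fields: question, subject, choices, answer). It is NOT taken from the dataset.
sample = {
    "question": "Which organelle is the main site of ATP synthesis in animal cells?",
    "subject": "high_school_biology",
    "choices": ["Ribosome", "Mitochondrion", "Golgi apparatus", "Lysosome"],
    "answer": 1,  # integer index into `choices`, not a letter
}


def answer_letter(record: dict) -> str:
    """Map the integer `answer` index to its letter label (0 -> 'A', 1 -> 'B', ...)."""
    return ascii_uppercase[record["answer"]]


letter = answer_letter(sample)
gold_text = sample["choices"][sample["answer"]]
print(f"{letter}. {gold_text}")
```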

### 3. From data to prompts

We are going to use the `PromptTemplate` class of the `langchain` library to create prompts given a sample.

The template you have to implement for now is: ```{question}\n\nA. {A}\nB. {B}\nC. {C}\nD. {D}\n\nAnswer:```

For the first sample in our high school biology split, your function should return

```
In a population of giraffes, an environmental change occurs that favors individuals that are tallest. As a result, more of the taller individuals are able to obtain nutrients and survive to pass along their genetic information. This is an example of\n\nA. directional selection.\nB. stabilizing selection.\nC. sexual selection.\nD. disruptive selection.\n\nAnswer:
```


In [4]:
def convert_to_MCQA_prompt(sample: dict) -> str:
    """Convert an MMLU sample into a multiple-choice prompt string using LangChain's PromptTemplate.
    The prompt contains the question followed by the labeled choices A., B., C., D. and "Answer:".
    """
    template = PromptTemplate.from_template(
        "{question}\n\nA. {A}\nB. {B}\nC. {C}\nD. {D}\n\nAnswer:"
    )
    choices = sample['choices']
    # Fill the template variables with the question and the four choices
    return template.format(
        question=sample['question'],
        A=choices[0], B=choices[1], C=choices[2], D=choices[3],
    )

convert_to_MCQA_prompt(data[0])

'In a population of giraffes, an environmental change occurs that favors individuals that are tallest. As a result, more of the taller individuals are able to obtain nutrients and survive to pass along their genetic information. This is an example of\n\nA. directional selection.\nB. stabilizing selection.\nC. sexual selection.\nD. disruptive selection.\n\nAnswer:'

### 4. Evaluating GPT-2

Using this MCQA prompt, we want to evaluate GPT-2 by looking at the log-probabilities it assigns to each possible answer when appended to the end of the prompt. We will use `AutoModelForCausalLM` to load the model and then, for each variation of MCQA prompt + choice, average the log-probabilities of the tokens in the choice.

In [5]:
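tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    # Some tokenizers (GPT-2 included) define no padding token; reuse end-of-sequence
    tokenizer.pad_token = tokenizer.eos_token

# Use the GPU (in half precision) when one is available, otherwise fall back to the CPU
use_gpu = torch.cuda.is_available()
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    dtype=torch.float16 if use_gpu else torch.float32,
)
model.to("cuda" if use_gpu else "cpu")
model.eval()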


Given the MCQA prompt, there are different ways to extract an answer from an autoregressive language model. One would be to choose a sampling algorithm, decode one token, and check whether this token is A, B, C or D. You could try this and think about the problems of this decoding approach. In this lab we will use an alternative method that lets us practice handling the outputs of an autoregressive language model like GPT-2.
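As a rough illustration of the single-token approach, and of one way it can go wrong, the sketch below uses a made-up next-token distribution instead of a real model. The tiny vocabulary and the logit values are invented for the example; with a real model they would come from the logits at the last position of the prompt.

```python
import math

# Invented next-token logits after "Answer:" (NOT produced by any model).
next_token_logits = {" A": 1.2, " B": 0.7, " C": 0.9, " D": 0.4, " The": 2.5, "\n": 1.8}


def log_softmax(logits: dict) -> dict:
    """Turn a token -> logit mapping into a token -> log-probability mapping."""
    max_logit = max(logits.values())
    log_norm = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits.values()))
    return {tok: v - log_norm for tok, v in logits.items()}


log_probs = log_softmax(next_token_logits)

# Unconstrained greedy decoding: the most likely token need not be an answer letter.
greedy_token = max(log_probs, key=log_probs.get)

# Constrained variant: compare only the four letter tokens.
letter_tokens = [" A", " B", " C", " D"]
constrained_token = max(letter_tokens, key=lambda tok: log_probs[tok])

print(repr(greedy_token), repr(constrained_token))
```

Restricting the comparison to the letter tokens sidesteps the first problem, but the score then depends on how the model treats the letters themselves rather than the text of the answers.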

The idea is to 'force decode' the prompt + answer pair for each possible choice. For each token in the choice, we extract the logits and convert them into log-probabilities using `torch.nn.functional.log_softmax`. At each position this gives a vector with the dimension of the vocabulary, and we need to look up the log-probability of the appropriate answer token (through its token ID, which is the index of that token in the log-probabilities vector). Keep in mind that the logits at position `t` are the model's prediction for the token at position `t + 1`. Finally we average over the tokens of the choice, which gives us a score for every answer.

The recommendation is to start by experimenting with the outputs of the model's forward pass:
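Written as a formula, the score described above is a length-normalised log-likelihood. The notation is an assumption introduced here, not taken from the lab: $x$ is the prompt, $c = (c_1, \dots, c_{|c|})$ are the tokens of one candidate answer, and $p_\theta$ is the model's next-token distribution.

$$
s(c) = \frac{1}{|c|} \sum_{t=1}^{|c|} \log p_\theta\left(c_t \mid x, c_{<t}\right),
\qquad
\hat{c} = \arg\max_{c \in \{A, B, C, D\}} s(c)
$$

Dividing by $|c|$ keeps longer answers from being penalised simply because they contain more tokens.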
```
outputs = model(input_ids)
logits = outputs.logits  # (1, seq_len, vocab_size)
log_probs = F.log_softmax(logits, dim=-1)
```
and try to extract the probabilities of different tokens and understand the dimensions of these tensors.
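One detail that is easy to get wrong is the alignment between logits and tokens. The sketch below shows the indexing with plain Python lists and invented numbers standing in for `outputs.logits[0]`; the token IDs, the vocabulary size of 4, and the helper name `average_choice_log_prob` are all assumptions for illustration.

```python
import math


def log_softmax(row):
    """Log-softmax over one row of logits."""
    m = max(row)
    log_norm = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_norm for v in row]


# Invented logits of shape (seq_len=5, vocab_size=4), standing in for outputs.logits[0].
logits = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],  # predicts the token AFTER the sequence, so it is never used
]
log_probs = [log_softmax(row) for row in logits]

input_ids = [3, 0, 1, 2, 3]  # prompt tokens followed by choice tokens
prompt_len = 3               # the first three IDs belong to the prompt


def average_choice_log_prob(log_probs, input_ids, prompt_len):
    """Average log-probability of the choice tokens given everything before them."""
    choice_positions = range(prompt_len, len(input_ids))
    total = 0.0
    for pos in choice_positions:
        # The distribution over the token at `pos` is stored in the row at `pos - 1`.
        total += log_probs[pos - 1][input_ids[pos]]
    return total / len(choice_positions)


score = average_choice_log_prob(log_probs, input_ids, prompt_len)
print(round(score, 4))
```

With tensors, the same selection can be written as `log_probs[0, prompt_len - 1:-1].gather(-1, input_ids[0, prompt_len:].unsqueeze(-1))`.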

In [None]:

def run_gpt2(prompt: str, choices: List[str], model, tokenizer, device=None) -> str:
    """Return the label (A/B/C/D) of the choice that GPT-2 prefers.
    Each choice is appended to the prompt and scored by the average log-probability of its tokens.
    The scores are collected as {'A': logp_A, 'B': logp_B, 'C': logp_C, 'D': logp_D}
    and the label with the highest score is returned."""
    if device is None:
        device = model.device  # default to wherever the model lives
    with torch.no_grad():
        results = {}
        # Number of prompt tokens, used to locate where the choice starts
        prompt_len = tokenizer(prompt, return_tensors="pt")['input_ids'].size(1)
        for idx, choice in enumerate(choices):
            label = chr(ord('A') + idx)
            # The separating space keeps the prompt tokens unchanged under GPT-2's BPE
            input_ids = tokenizer(f"{prompt} {choice}", return_tensors="pt")['input_ids'].to(device)
            logits = model(input_ids).logits  # (1, seq_len, vocab_size)
            log_probs = F.log_softmax(logits, dim=-1)
            # Logits at position t predict token t + 1, so shift the window back by one
            choice_log_probs = log_probs[:, prompt_len - 1:-1, :]
            choice_ids = input_ids[:, prompt_len:].unsqueeze(-1)
            force_decoded_log_probs = choice_log_probs.gather(-1, choice_ids).squeeze(-1)
            results[label] = force_decoded_log_probs.mean().item()

        return max(results, key=results.get)


predicted = run_gpt2(convert_to_MCQA_prompt(data[0]), data[0]['choices'], model, tokenizer)
print(predicted)

Now we can write a small loop to evaluate the model. 

- Can you think of any other ways of evaluating GPT-2 on a QA dataset?

- How is the performance? Why is it that way?

- Can you modify your code to be more efficient by batching the calculation?

- Can you visualize the results using `matplotlib` and `seaborn`?

In [None]:
labels = ['A', 'B', 'C', 'D']
correct = 0
for d in data:
    prompt = convert_to_MCQA_prompt(d)
    choices = list(d['choices']) if isinstance(d['choices'], (list, tuple)) else list(d['choices'].values())
    predicted = run_gpt2(prompt, choices, model, tokenizer)
    gold = labels[d['answer']]  # `answer` is an integer index into the choices
    correct += int(predicted == gold)
    print(d['question'])
    print(d['choices'])
    print('Predicted:', predicted)
    print('Answer:', gold)
    print()

print(f"Accuracy on {len(data)} samples: {correct / len(data):.4f}")
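When interpreting the final number, it helps to compare it with random guessing, which scores 25% on four-way questions. The sketch below computes a normal-approximation standard error for an accuracy estimate. The value of `num_correct` is a placeholder, not a measured result, and `accuracy_with_stderr` is a helper name introduced here for illustration.

```python
import math


def accuracy_with_stderr(num_correct: int, num_total: int) -> tuple:
    """Return the accuracy and its normal-approximation standard error."""
    acc = num_correct / num_total
    stderr = math.sqrt(acc * (1 - acc) / num_total)
    return acc, stderr


num_total = 310   # size of the high_school_biology test split
num_correct = 90  # PLACEHOLDER for illustration only, not a real result
chance = 0.25     # expected accuracy of uniform guessing over four choices

acc, stderr = accuracy_with_stderr(num_correct, num_total)
# Distance from the chance level, measured in standard errors
distance = (acc - chance) / stderr
print(f"accuracy={acc:.3f} +/- {stderr:.3f}, distance from chance={distance:.2f} standard errors")
```

An accuracy within a couple of standard errors of 0.25 is hard to distinguish from guessing on a split of this size.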