You must run this notebook on a GPU. A T4 is sufficient. It's free on [Google
Colab](https://stackoverflow.com/questions/62596466/how-can-i-run-notebooks-of-a-github-project-in-google-colab/67344477#67344477).

**Description**: for a [4 GB quantized Llama 2 chat
model](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ) and the
[COPA](https://people.ict.usc.edu/~gordon/copa.html) classification task, this notebook
demonstrates that

1. multiple choice text generation is not that effective; it recovers 77% of the
   accuracy of OpenAI's `gpt-3.5-turbo` (70% vs. 91%)
2. CAPPr is effective; for this dataset, CAPPr recovers 91% of the accuracy of OpenAI's
   `gpt-3.5-turbo` (83% vs. 91%), and never requires post-processing of the output.

**Contamination notice**: I don't know whether Llama 2 was trained on any COPA data. If
it was, the absolute accuracies may be inflated. But as long as contamination doesn't
help one method (CAPPr vs text generation) more than the other, the difference between
their performances is still meaningful.

**Estimated run time**: ~6 min.

[Install packages](#install-packages)

[Pick chat vs no chat](#pick-chat-vs-no-chat)

[Utils](#utils)

[Load data](#load-data)

[The problem](#the-problem)

[Write prompt](#write-prompt)

[Load model](#load-model)

[The solution](#the-solution)

# Install packages

In [1]:
# check correct CUDA version
import torch

_cuda_version = torch.version.cuda
_msg = (
    "Change the pip install auto-gptq command to the one for "
    f"{_cuda_version} based on the list here: "
    "https://github.com/PanQiWei/AutoGPTQ#quick-installation"
)

assert _cuda_version == "11.8", _msg

I don't wanna pay for renting an A100 so I need to use a semi-aggressively quantized
model. Something which fits on a T4. Need the latest `transformers`, `auto-gptq`, and
`optimum` according to this [HF blog
post](https://huggingface.co/blog/gptq-integration#autogptq-library--the-one-stop-library-for-efficiently-leveraging-gptq-for-llms).

I'm gonna install `cappr` from source b/c sometimes I use this notebook to statistically
gut check code changes.

I'll also install the `demos` extras for NLP datasets.

In your local env, you'd just do:

```
pip install "cappr[hf]"
```

In [None]:
!python -m pip install "cappr[demos] @ git+https://github.com/kddubey/cappr.git" \
auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ \
optimum

In [3]:
from __future__ import annotations
from pprint import pprint
from typing import Collection, Literal, Sequence

import datasets
import pandas as pd
from tqdm.auto import tqdm

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    GenerationConfig,
    pipeline,
)

from cappr import Example
from cappr.huggingface import classify

In [4]:
_msg = (
    "This notebook must run on a GPU. A T4 instance is sufficient for the models "
    "tested here."
)
assert torch.cuda.is_available(), _msg

In [5]:
DEVICE = "cuda"

# Pick chat vs no chat

In [6]:
model_id = "TheBloke/Llama-2-7b-Chat-GPTQ"
# model_id = "TheBloke/Llama-2-7B-GPTQ"

Bonus: CAPPr performs well with the non-chat model as well, which makes sense since CAPPr prompts can look like pretraining data.

# Utils

Copied from [here](https://github.com/kddubey/cappr/blob/main/demos/utils.py) so that this notebook can be run anywhere.

In [7]:
from __future__ import annotations
from typing import Optional, Union

from IPython.display import display
import pandas as pd


def display_df(
    df: pd.DataFrame,
    columns: Optional[list[str]] = None,
    num_rows: Union[int, None] = 3,
):
    """
    Displays `df.head(num_rows)[columns]` without truncating columns. If
    possible, render any newlines.
    """
    if columns is None:
        columns = df.columns
    if num_rows is None:
        num_rows = len(df)
    df_head_styled = df.head(num_rows)[columns].style
    # None means no truncation. Passing -1 is deprecated in pandas >= 1.0
    with pd.option_context("display.max_colwidth", None):
        # I'm not sure why try-except doesn't work w/ display(), so instead
        # check the necessary uniqueness condition before running it
        if df.index.is_unique:
            display(
                df_head_styled.set_properties(
                    **{"text-align": "left", "white-space": "pre-wrap"}
                )
            )
        else:
            # `Styler.apply` and `.applymap` are not compatible with non-unique
            # index or columns
            display(df_head_styled)


def remove_suffix(string: str, suffix: str) -> str:
    # guard against an empty suffix: string[:-0] would return ""
    if suffix and string.endswith(suffix):
        return string[: -len(suffix)]
    return string


def remove_prefix(string: str, prefix: str) -> str:
    if string.startswith(prefix):
        return string[len(prefix) :]
    return string

# Load data

For this MVP, let's evaluate on the [Choice of Plausible Alternatives (COPA) task](https://people.ict.usc.edu/~gordon/copa.html). I picked this first b/c I read it has multi-token labels, in some sense. It also looks cool.

The classification problem is to pick 1 of 2 alternatives which caused or resulted in the premise. Here are two examples pulled from the website:

Example 1

> Premise: The man broke his toe. What was the CAUSE of this?
>
> Alternative 1: He got a hole in his sock.
>
> Alternative 2: He dropped a hammer on his foot.


Example 2

> Premise: I tipped the bottle. What happened as a RESULT?
>
> Alternative 1: The liquid in the bottle froze.
>
> Alternative 2: The liquid in the bottle poured out.

A classifier should predict Alternative 2 for Example 1, and Alternative 2 for Example 2.

The test set labels are hidden, so I'll score this zero-shot classifier on the train and validation sets.

In [8]:
def load_super_glue(task_id: str, split: str):
    return pd.DataFrame(datasets
                        .load_dataset('super_glue', task_id, split=split))


df = (pd.concat((load_super_glue('copa', 'train'),
                 load_super_glue('copa', 'validation')))
      .reset_index(drop=True)) # idx column is only unique w/in splits! fuhgetaboutit

In [9]:
len(df)

500

In [10]:
df.head()

|   | premise | choice1 | choice2 | question | idx | label |
|---|---------|---------|---------|----------|-----|-------|
| 0 | My body cast a shadow over the grass. | The sun was rising. | The grass was cut. | cause | 0 | 0 |
| 1 | The woman tolerated her friend's difficult beh... | The woman knew her friend was going through a ... | The woman felt that her friend took advantage ... | cause | 1 | 0 |
| 2 | The women met for coffee. | The cafe reopened in a new location. | They wanted to catch up with each other. | cause | 2 | 1 |
| 3 | The runner wore shorts. | The forecast predicted high temperatures. | She planned to run along the beach. | cause | 3 | 0 |
| 4 | The guests of the party hid behind the couch. | It was a surprise party. | It was a birthday party. | cause | 4 | 0 |


# The problem

The most straightforward way to solve this task is to point to each alternative with a
single letter, and hope that the LM samples/generates the correct letter. The prompt is
effectively a multiple choice question.

For example:

```
The man broke his toe because
A. He got a hole in his sock.
B. He dropped a hammer on his foot.
Answer A or B.
```

In [11]:
def _conjunction(question: Literal["cause", "effect"]):
    if question == "cause":
        return " because"
    elif question == "effect":
        return ", so"
    else:
        raise ValueError("question must be 'cause' or 'effect'. Got " f"{question}.")


def prompt(premise: str, question: Literal["cause", "effect"]):
    conjunction = _conjunction(question)
    return f'{premise.strip(". ")}{conjunction}'

In [12]:
def prompt_mc(
    premise: str, question: Literal["cause", "effect"], choice1: str, choice2: str
):
    return (
        f"{prompt(premise, question)}\n"
        f"A. {choice1}\n"
        f"B. {choice2}\n"
        "Answer A or B."
    )


df["prompt_mc"] = [
    prompt_mc(
        record["premise"], record["question"], record["choice1"], record["choice2"]
    )
    for record in df.to_dict("records")
]


display_df(df, columns=["prompt_mc", "label"])

|   | prompt_mc | label |
|---|-----------|-------|
| 0 | My body cast a shadow over the grass because A. The sun was rising. B. The grass was cut. Answer A or B. | 0 |
| 1 | The woman tolerated her friend's difficult behavior because A. The woman knew her friend was going through a hard time. B. The woman felt that her friend took advantage of her kindness. Answer A or B. | 0 |
| 2 | The women met for coffee because A. The cafe reopened in a new location. B. They wanted to catch up with each other. Answer A or B. | 1 |


(It turns out that GitHub doesn't render the newlines, but I promise they're there!)

The [notebook
here](https://github.com/kddubey/cappr/blob/main/demos/openai/superglue/copa.ipynb)
demonstrates that this prompt is effective for bigger OpenAI models: `text-davinci-003`
and `gpt-3.5-turbo`. Let's see how it performs for Llama 2.

Note after experiments: this prompt seems to be the best I can do w/ text generation.

In [13]:
# if we're using a chat model, let's make sure we're formatting the prompt correctly
llama_chat_template = """
<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_message} [/INST]
""".lstrip(
    "\n"
)

system_prompt_copa = (
    "Identify the cause or effect of a premise given two choices. Each choice "
    "is identified by a letter, A or B.\n"
    "Respond only with the letter corresponding to the correct cause or effect."
)

df["prompt_mc_chat"] = [
    llama_chat_template.format(
        system_prompt=system_prompt_copa, user_message=prompt_mc
    )
    for prompt_mc in df["prompt_mc"]
]

if "chat" in model_id.lower():
    prompt_mc_column = "prompt_mc_chat"
else:
    prompt_mc_column = "prompt_mc"

In [14]:
print(df[prompt_mc_column].iloc[0])

<s>[INST] <<SYS>>
Identify the cause or effect of a premise given two choices. Each choice is identified by a letter, A or B.
Respond only with the letter corresponding to the correct cause or effect.
<</SYS>>

My body cast a shadow over the grass because
A. The sun was rising.
B. The grass was cut.
Answer A or B. [/INST]



Set up a huggingface text generator:

In [15]:
generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
# pad to allow batching. these get masked out ofc
generator.tokenizer.pad_token_id = generator.model.config.eos_token_id

We need to create a PyTorch Dataset to batch the inputs.

In [16]:
class TextsDataset(torch.utils.data.Dataset):
    def __init__(self, texts: list[str]):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index: int):
        return self.texts[index]

In [17]:
# we'll do greedy decoding by default
generation_config = GenerationConfig(
    max_new_tokens=5,  # in case there are completions like "\nAnswer: A"
    num_return_sequences=1,
    eos_token_id=generator.tokenizer.eos_token_id,
)

sequences = generator(
    TextsDataset(df[prompt_mc_column].tolist()),
    generation_config=generation_config,
    batch_size=1,  # batch_size is a pipeline argument, not a GenerationConfig field
    pad_token_id=generator.tokenizer.eos_token_id,  # suppress "Setting ..."
)

Welp, setting `batch_size` > 1 causes numerical errors. This next cell is gonna take 2-5 minutes.

In [18]:
completions_raw = []
for seq in tqdm(sequences, total=len(df), desc="Sampling"):
    completions_raw.append(seq[0]["generated_text"])


In [19]:
completions = [
    remove_prefix(completion, prompt_mc)
    for prompt_mc, completion in zip(df[prompt_mc_column], completions_raw)
]

In [20]:
pd.Series(completions).sample(n=10)

61          A. I forgot to
488    B. The woman needed
169       A. The dust came
301                      B
123        B. He collapsed
367                      A
237     B. He admitted his
268           B. The dough
413                      B
118           B. His palms
dtype: object

When you're doing text generation, you often have to write this sort of data-dependent
and model-dependent function:

In [21]:
def process_completion(
    completion: str,
    class_chars: Sequence[str],
    prefixes_remove: Collection[str] = ("Answer ",),
    strip_chars: str = " \n.",
    default=-1,
) -> int:
    if any(len(class_char) != 1 for class_char in class_chars):
        raise ValueError("Elements of class_chars must be a single character.")

    completion_stripped = completion.strip(strip_chars)
    if not completion_stripped:
        return default
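    # apply every prefix removal in turn, starting from the stripped completion
    completion_stripped_rm = completion_stripped
    for prefix_remove in prefixes_remove:
        completion_stripped_rm = remove_prefix(completion_stripped_rm, prefix_remove)
    if not completion_stripped_rm:
        return default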
    completion_char_lower = completion_stripped_rm[0].lower()
    class_chars_lower = [class_char.lower() for class_char in class_chars]
    try:
        return class_chars_lower.index(completion_char_lower)
    except ValueError:
        return default
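
To see why this kind of post-processing is fiddly, here's a small self-contained sketch. `parse_letter` is a hypothetical, simplified stand-in for `process_completion` (it's not part of CAPPr or `transformers`). The first three completions are copied from the sample above; the last three are made up to show cases a parser has to think about.

```python
from __future__ import annotations

import re


def parse_letter(
    completion: str, class_chars: tuple[str, ...] = ("A", "B"), default: int = -1
) -> int:
    """Map a raw completion to a class index, or `default` if no letter is found."""
    letters = "".join(re.escape(char) for char in class_chars)
    # Allow an optional "Answer" / "Answer:" lead-in, then require a single class
    # letter which isn't the start of a longer word
    match = re.match(
        rf"\s*(?:answer:?\s*)?([{letters}])\b", completion, flags=re.IGNORECASE
    )
    if match is None:
        return default
    return [char.lower() for char in class_chars].index(match.group(1).lower())


sample_completions = [
    "A. I forgot to",      # from the sample above
    "B",                   # from the sample above
    "B. He collapsed",     # from the sample above
    "\nAnswer: A",         # made up: lead-in before the letter
    "Sure! The answer",    # made up: no letter within the token budget
    "Both are plausible",  # made up: starts with "B" but isn't an answer
]
parsed = [parse_letter(completion) for completion in sample_completions]
print(parsed)  # [0, 1, 1, 0, -1, -1]
```

Every rule in the regex is a guess about how this particular model phrases its answers. A different model or prompt may need different rules.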

In [22]:
class_chars = ('A', 'B')

In [23]:
pred_classes_text_gen = [
    process_completion(completion, class_chars)
    for completion in completions
]

How many of the sampled completions could be mapped to a label 0 or 1?

In [24]:
(pd.Series(pred_classes_text_gen) != -1).mean()

0.994

That's nice. But how accurate are the predictions?

In [25]:
(pred_classes_text_gen == df['label']).mean()

0.7

If you don't use an instruction-trained model, only a fraction of the LM's responses
contain a letter which is easy to parse. Computing accuracy is kind of pointless; we
don't even have real predictions.

If you do use an instruction-trained model, you get real predictions, but they're
incorrect too often.

Unfortunately, for smaller or less-instruction-trained LMs, text generation can raise
more problems than it solves. This result aligns with the one found in the [demo for
`text-curie-001`](https://github.com/kddubey/cappr/blob/main/demos/superglue/copa.ipynb).
Sampling structured outputs from smaller language models tends to cost accuracy.

What do we do?

# Write prompt

We should take advantage of the fact that models like Llama 2 were extensively trained for a simple task: predict the next token.

A simple way to model COPA is to prompt an LM with (for Example 1):

```
The man broke his toe because
```

and use the LM to estimate the probabilities of the 2 alternatives conditional on this prompt.
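
Here's a toy sketch of that idea in plain Python. The per-token log-probabilities are made up for illustration; in practice they come from the LM. Averaging the token log-probs before exponentiating is one reasonable way to aggregate them, and is an assumption here rather than a statement of what CAPPr does internally.

```python
from __future__ import annotations

import math

# Hypothetical log-probabilities of each token in each alternative, given the
# prompt "The man broke his toe because". These numbers are invented.
token_log_probs = {
    "he got a hole in his sock.": [-2.1, -3.0, -1.2, -4.5, -0.9, -0.4, -3.8, -0.6],
    "he dropped a hammer on his foot.": [-2.1, -2.4, -0.8, -3.1, -0.7, -0.3, -1.0, -0.5],
}


def completion_probs(token_log_probs: dict[str, list[float]]) -> dict[str, float]:
    # Average over tokens so that a completion isn't penalized just for being
    # longer, then exponentiate to get a positive score
    scores = {
        completion: math.exp(sum(log_probs) / len(log_probs))
        for completion, log_probs in token_log_probs.items()
    }
    # Normalize so that the scores over the alternatives sum to 1
    total = sum(scores.values())
    return {completion: score / total for completion, score in scores.items()}


probs = completion_probs(token_log_probs)
prediction = max(probs, key=probs.get)
print(prediction)
```

The prediction is whichever alternative gets the higher probability, so there's no completion to parse.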

In [26]:
df["prompt"] = [
    prompt(premise, question)
    for premise, question in zip(df["premise"], df["question"])
]

In [27]:
display_df(df, columns=["prompt", "choice1", "choice2", "label"])

|   | prompt | choice1 | choice2 | label |
|---|--------|---------|---------|-------|
| 0 | My body cast a shadow over the grass because | The sun was rising. | The grass was cut. | 0 |
| 1 | The woman tolerated her friend's difficult behavior because | The woman knew her friend was going through a hard time. | The woman felt that her friend took advantage of her kindness. | 0 |
| 2 | The women met for coffee because | The cafe reopened in a new location. | They wanted to catch up with each other. | 1 |


Note: we need to lowercase the choices, since each choice continues the sentence which the prompt starts.
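
As a quick illustration of why lowercasing matters, this sketch builds the full text the LM effectively scores for Example 1. The variable names are just for this illustration.

```python
premise = "The man broke his toe."
choices = ("He got a hole in his sock.", "He dropped a hammer on his foot.")

# Same construction as the prompt function: drop the trailing period, add the
# conjunction for a "cause" question
prompt_text = premise.strip(". ") + " because"

# Without lowercasing, there's a capital letter in the middle of the sentence
raw_texts = [f"{prompt_text} {choice}" for choice in choices]
# With lowercasing, prompt + choice reads like one natural sentence
full_texts = [f"{prompt_text} {choice.lower()}" for choice in choices]

for text in full_texts:
    print(text)
```

Note that `str.lower` lowercases the whole string, so a choice containing a proper noun would lose its capitalization too.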

# Load model

In [28]:
model = generator.model
tokenizer = generator.tokenizer

Or if you need to load it from scratch, uncomment and run this cell:

In [29]:
# model = AutoModelForCausalLM.from_pretrained(
#     model_id, torch_dtype=torch.float16, device_map="auto"
# )
# tokenizer = AutoTokenizer.from_pretrained(model_id)

In [30]:
# warm up model
_ = model(**tokenizer(["warm up"], return_tensors="pt").to(DEVICE))

# The solution

Observe that CAPPr's interface may feel simpler than HuggingFace's generator pipeline, which exposes many ways to sample text.

In [31]:
examples = [
    Example(
        prompt=record["prompt"] + " ",
        completions=(record["choice1"].lower(), record["choice2"].lower()),
        prior=None,
        end_of_prompt="",
    )
    for record in df.to_dict("records")
]

In [32]:
pprint(examples[0])

Example(prompt='My body cast a shadow over the grass because ',
        completions=('the sun was rising.', 'the grass was cut.'),
        prior=None,
        end_of_prompt='',
        normalize=True)


Set the batch size to something that'll work on your machine. On a T4, `batch_size = 32` works for this data.

In [33]:
batch_size = 32

In [34]:
pred_probs = classify.predict_proba_examples(
    examples, model_and_tokenizer=(model, tokenizer), batch_size=batch_size
)


For COPA, the scoring metric is accuracy.

In [35]:
(pred_probs.argmax(axis=1) == df["label"]).mean()

0.83

This 4 GB open source model beats OpenAI's `text-curie-001`, which was 80% accurate
according to the CAPPr demo
[here](https://github.com/kddubey/cappr/blob/main/demos/openai/superglue/copa.ipynb).
OpenAI's `gpt-3.5-turbo` is 91% accurate. So we've recovered 91% of its performance by
using CAPPr.
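
For reference, the "recovered" percentages are just ratios of accuracies. A minimal sketch using the numbers reported in this notebook and in the linked OpenAI demo (the dictionary keys are labels I made up):

```python
accuracy_gpt35 = 0.91  # gpt-3.5-turbo, from the linked OpenAI demo

accuracies = {
    "Llama 2 chat, text generation": 0.70,
    "Llama 2 chat, CAPPr": 0.83,
}

# Percent of gpt-3.5-turbo's accuracy that each method recovers
recovered = {
    name: round(100 * accuracy / accuracy_gpt35)
    for name, accuracy in accuracies.items()
}
print(recovered)
# {'Llama 2 chat, text generation': 77, 'Llama 2 chat, CAPPr': 91}
```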