You must run this notebook on a GPU. A T4 is sufficient. It's free on [Google
Colab](https://stackoverflow.com/questions/62596466/how-can-i-run-notebooks-of-a-github-project-in-google-colab/67344477#67344477).

**Description**: for a [GPTQd Mistral
7B](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GPTQ) and the
[SciQ](https://allenai.org/data/sciq) classification task, this notebook demonstrates
that CAPPr and text generation perform similarly. In general, you should expect similar
or identical performance when every completion is 1 token long. The main takeaway from
this demo is that multiple choice prompts can be very effective in the right setting.
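
To see why the two approaches tend to agree for 1-token completions, consider a toy next-token distribution. The sketch below uses made-up logits (not taken from the model) and assumes greedy decoding: text generation picks the highest-scoring token over the whole vocabulary, while a completion-probability classifier picks the most probable token among the allowed choices. The two agree whenever the overall top token is one of the allowed letters.

```python
import math

# Hypothetical next-token logits over a tiny vocabulary (made-up numbers).
logits = {"A": 1.2, "B": 3.4, "C": 0.3, "D": -0.5, "The": 2.0, "\n": 0.1}
choices = ["A", "B", "C", "D"]

# Greedy decoding: take the single most likely token in the whole vocabulary.
greedy_token = max(logits, key=logits.get)

# Completion-probability view: softmax over the vocabulary, then keep only the
# allowed choices and renormalize so that they sum to 1.
normalizer = sum(math.exp(value) for value in logits.values())
probs = {token: math.exp(value) / normalizer for token, value in logits.items()}
total = sum(probs[choice] for choice in choices)
choice_probs = [probs[choice] / total for choice in choices]
predicted_choice = choices[choice_probs.index(max(choice_probs))]

print(greedy_token, predicted_choice)
```

If the top token were something outside the choices (for example `"The"`), text generation would produce an unusable completion, while the restricted view would still return one of the four letters.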

This notebook also demonstrates using the HF
[`cache`](https://cappr.readthedocs.io/en/latest/cappr.huggingface.classify_no_batch.html#cappr.huggingface.classify_no_batch.cache)
context manager.

**Estimated run time**: ~1 min.

In [1]:
!pip install torch==2.2.1 torchaudio torchvision transformers

In [None]:
!pip install auto-gptq --no-build-isolation \
optimum

In [None]:
!pip install "cappr[demos] @ git+https://github.com/kddubey/cappr.git"

In [3]:
from __future__ import annotations
from string import ascii_uppercase as alphabet
from typing import cast, Sequence

import datasets
import numpy as np
import pandas as pd
import torch
from tqdm.auto import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline

from cappr import Example
from cappr import huggingface as hf

# Load data

In [4]:
df_tr = pd.DataFrame(datasets.load_dataset("sciq", split="train"))

In [5]:
df_tr.head()

|   | question | distractor3 | distractor1 | distractor2 | correct_answer | support |
|---|---|---|---|---|---|---|
| 0 | What type of organism is commonly used in prep... | viruses | protozoa | gymnosperms | mesophilic organisms | Mesophiles grow best in moderate temperature, ... |
| 1 | What phenomenon makes global winds blow northe... | tropical effect | muon effect | centrifugal effect | coriolis effect | Without Coriolis Effect the global winds would... |
| 2 | Changes from a less-ordered state to a more-or... | endothermic | unbalanced | reactive | exothermic | Summary Changes of state are examples of phase... |
| 3 | What is the least dangerous radioactive decay? | zeta decay | beta decay | gamma decay | alpha decay | All radioactive decay is dangerous to living t... |
| 4 | Kilauea in hawaii is the world’s most continuo... | magma | greenhouse gases | carbon and smog | smoke and ash | Example 3.5 Calculating Projectile Motion: Hot... |


In [6]:
rng = np.random.default_rng(seed=123)

In [7]:
df_tr["possible_answers"] = [
    rng.choice([d1, d2, d3, correct_answer], size=4, replace=False).tolist()  # shuffle
    for d1, d2, d3, correct_answer in zip(
        df_tr["distractor1"],
        df_tr["distractor2"],
        df_tr["distractor3"],
        df_tr["correct_answer"],
    )
]

df_tr["label"] = [
    cast(list, possible_answers).index(correct_answer)
    for possible_answers, correct_answer in zip(
        df_tr["possible_answers"], df_tr["correct_answer"]
    )
]
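
The shuffle-then-index logic can be checked on a single row. This sketch uses the standard library's `random` module instead of NumPy's `Generator` (an assumption made to keep it dependency-free), and the name `rng_demo` is hypothetical. The point is only that `label` always points at the correct answer after shuffling.

```python
import random

# One row, using the answer strings from the first row of the training data.
distractors = ["protozoa", "gymnosperms", "viruses"]
correct_answer = "mesophilic organisms"

rng_demo = random.Random(123)  # seeded for reproducibility
# sample(..., k=4) returns a shuffled copy of the 4 answers, without replacement
possible_answers = rng_demo.sample(distractors + [correct_answer], k=4)
# The label is the position of the correct answer after shuffling
label = possible_answers.index(correct_answer)

print(possible_answers, label)
```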

In [8]:
len(df_tr)

11679

We don't need this much data. Sample down to 500 questions.

In [9]:
df_tr_mini = df_tr.sample(n=500, random_state=123).reset_index(drop=True)

# Load model

In [10]:
model_name = "TheBloke/Mistral-7B-OpenOrca-GPTQ"
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", trust_remote_code=False, revision="main"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

Using `disable_exllama` is deprecated and will be removed in version 4.37. Use `use_exllama` instead and specify the version with `exllama_config`.The value of `use_exllama` will be overwritten by `disable_exllama` passed in `GPTQConfig` or stored in your config file.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [11]:
# warm up model
_ = model(**tokenizer(["warm up"], return_tensors="pt").to(model.device))

# Shared prompt utilities

It turns out that we need a multiple choice prompt for this task, probably because it
resembles a school exam. Every question has exactly 4 choices.

In [12]:
system_message = (
    "You are an expert at answering science questions. Choose the best answer "
    "for the multiple choice question. Respond only with the letter "
    "corresponding to the correct answer."
)

In [13]:
mc_choice_letters = alphabet[:4]

In [14]:
def multiple_choice(*choices) -> str:
    if len(choices) > len(alphabet):
        raise ValueError("There are more choices than letters.")
    letters_and_choices = [
        f"{letter}. {choice}" for letter, choice in zip(alphabet, choices)
    ]
    return "\n".join(letters_and_choices)
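
As a quick usage check, here is the function applied to the four answers from the first sampled question. The function is restated so that the snippet runs on its own.

```python
from string import ascii_uppercase as alphabet


def multiple_choice(*choices) -> str:
    if len(choices) > len(alphabet):
        raise ValueError("There are more choices than letters.")
    letters_and_choices = [
        f"{letter}. {choice}" for letter, choice in zip(alphabet, choices)
    ]
    return "\n".join(letters_and_choices)


formatted = multiple_choice("volume", "weight", "density", "length")
print(formatted)
# A. volume
# B. weight
# C. density
# D. length
```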

# Text generation

In [15]:
chat_template = """
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
""".lstrip("\n")


def prompt_text_gen(question: str, possible_answers: Sequence[str]):
    mc = multiple_choice(*possible_answers)
    return chat_template.format(
        system_message=system_message, prompt=question + "\n" + mc
    )


df_tr_mini["prompt_text_gen"] = [
    prompt_text_gen(record["question"], record["possible_answers"])
    for record in df_tr_mini.to_dict("records")
]
print(df_tr_mini["prompt_text_gen"].iloc[0])

<|im_start|>system
You are an expert at answering science questions. Choose the best answer for the multiple choice question. Respond only with the letter corresponding to the correct answer.<|im_end|>
<|im_start|>user
Calculations are described showing conversions between molar mass and what for gases?
A. volume
B. weight
C. density
D. length<|im_end|>
<|im_start|>assistant



We need to create a PyTorch Dataset to batch the inputs.

In [16]:
class TextsDataset(torch.utils.data.Dataset):
    def __init__(self, texts: list[str]):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index: int):
        return self.texts[index]
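
A map-style dataset only has to expose `__len__` and `__getitem__`. A batching loop can then index it in chunks. The sketch below is a plain-Python stand-in for that idea: the names `TextsDatasetDemo` and `iter_batches` are hypothetical, and the pipeline's real batching logic is more involved.

```python
class TextsDatasetDemo:
    """Plain-Python stand-in with the same __len__/__getitem__ interface."""

    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index):
        return self.texts[index]


def iter_batches(dataset, batch_size):
    # Walk over the dataset in steps of batch_size; the last batch may be smaller
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


demo_dataset = TextsDatasetDemo([f"prompt {i}" for i in range(5)])
batches = list(iter_batches(demo_dataset, batch_size=2))
print([len(batch) for batch in batches])  # [2, 2, 1]
```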

We'll use greedy decoding to better ensure that each completion can be mapped to one of the choices.

In [17]:
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
    max_new_tokens=2,
    return_full_text=False,
)
# Set a pad token to allow batching. Pad tokens get masked out
generator.tokenizer.pad_token_id = generator.model.config.eos_token_id

In [18]:
text_gen_dataset_tr = TextsDataset(df_tr_mini["prompt_text_gen"].tolist())

In [19]:
completions = []
for seq in tqdm(
    generator(
        text_gen_dataset_tr,
        # suppress "Setting pad_token_id..." stdout
        pad_token_id=generator.tokenizer.eos_token_id,
        batch_size=16,
    ),
    total=len(text_gen_dataset_tr),
    desc="Sampling",
):
    completions.append(seq[0]["generated_text"])




Let's see if the model generated answers like we asked.

In [20]:
pd.Series(completions).sample(n=10)

475     B.
145     B.
353     B.
182     A.
121     B.
411     C.
359     A.
287     D.
77      D.
5       C.
dtype: object

In [21]:
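def process_completion_mc(
    completion: str,
    class_chars: Sequence[str],
    class_names: Sequence[str],
    default=-1,
) -> int:
    # First, check whether the completion spells out one of the answers.
    # Compare case-insensitively on both sides
    for i, name in enumerate(class_names):
        if name.lower() in completion.lower():
            return i
    # Otherwise, look for one of the choice letters
    for i, char in enumerate(class_chars):
        if char in completion:
            return i
    # The completion couldn't be mapped to any choice
    return default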
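
A few hand-written completions show how the mapping behaves. The function is restated here, with the answer-name comparison lowercased on both sides, so that the snippet runs on its own. The example completions are made up.

```python
from typing import Sequence


def process_completion_mc(
    completion: str,
    class_chars: Sequence[str],
    class_names: Sequence[str],
    default=-1,
) -> int:
    for i, name in enumerate(class_names):
        if name.lower() in completion.lower():
            return i
    for i, char in enumerate(class_chars):
        if char in completion:
            return i
    return default


letters = "ABCD"
answers = ["volume", "weight", "density", "length"]
demo_completions = ["C.", "density", "I'm not sure"]
results = [process_completion_mc(c, letters, answers) for c in demo_completions]
print(results)  # [2, 2, -1]
```

Note that the letter check is a substring match, so it relies on completions being short. A long free-text completion containing a stray capital "A" would be mapped to the first choice.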

In [22]:
preds_text_gen = [
    process_completion_mc(
        completion, class_chars=mc_choice_letters, class_names=possible_answers
    )
    for completion, possible_answers in zip(completions, df_tr_mini["possible_answers"])
]

How many of the completions could be mapped to a label?

In [23]:
(pd.Series(preds_text_gen) != -1).mean()

1.0

How accurate are the predictions?

In [24]:
(df_tr_mini["label"] == preds_text_gen).mean()

0.902

# CAPPr

In [25]:
chat_template_shared_instructions = """
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_prompt}
""".strip("\n")

chat_template_prompt = """
{prompt}<|im_end|>
<|im_start|>assistant
""".lstrip("\n")

In [26]:
prompt_prefix = chat_template_shared_instructions.format(
    system_message=system_message,
    user_prompt="",
)


def prompt(question: str, possible_answers: Sequence[str]) -> str:
    mc = multiple_choice(*possible_answers)
    return chat_template_prompt.format(prompt=question + "\n" + mc)


prompts = [
    prompt(record["question"], record["possible_answers"])
    for record in df_tr_mini.to_dict("records")
]

In [27]:
# Here's what the model will see
print(prompt_prefix + " " + prompts[0])

<|im_start|>system
You are an expert at answering science questions. Choose the best answer for the multiple choice question. Respond only with the letter corresponding to the correct answer.<|im_end|>
<|im_start|>user
 Calculations are described showing conversions between molar mass and what for gases?
A. volume
B. weight
C. density
D. length<|im_end|>
<|im_start|>assistant
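
The split between the shared prefix and the per-question prompt can be checked with plain string formatting. The sketch below uses a shortened, hypothetical system message and question, and shows that joining the two pieces with a single space reproduces the text generation prompt except for that one extra space after `user\n`.

```python
system_message_demo = "Answer with a single letter."  # shortened, hypothetical
question_block = "Which gas do plants absorb?\nA. oxygen\nB. carbon dioxide"

# Full prompt, laid out like the text generation template
full = (
    "<|im_start|>system\n"
    f"{system_message_demo}<|im_end|>\n"
    "<|im_start|>user\n"
    f"{question_block}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Shared prefix (identical for every question) and the per-question remainder
prefix = f"<|im_start|>system\n{system_message_demo}<|im_end|>\n<|im_start|>user\n"
per_question = f"{question_block}<|im_end|>\n<|im_start|>assistant\n"

joined = prefix + " " + per_question
# Removing the single joining space recovers the full prompt exactly
same_up_to_space = joined.replace("user\n ", "user\n", 1) == full
print(same_up_to_space)  # True
```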



In [28]:
with hf.classify.cache((model, tokenizer), prompt_prefix) as cached:
    pred_probs = hf.classify.predict_proba(
        prompts=prompts,
        completions=list(mc_choice_letters),
        model_and_tokenizer=cached,
        batch_size=16,
    )


In [29]:
(df_tr_mini["label"] == pred_probs.argmax(axis=1)).mean()

0.908
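
To put the two accuracies (0.902 and 0.908 on 500 questions) in perspective, a rough binomial standard error can be computed by hand. This is only a yardstick: both methods were evaluated on the same questions, so the predictions are paired rather than independent, and the helper name `binomial_se` is hypothetical.

```python
import math

n = 500
acc_text_gen = 0.902
acc_cappr = 0.908


def binomial_se(accuracy: float, n: int) -> float:
    # Standard error of a proportion estimated from n independent trials
    return math.sqrt(accuracy * (1 - accuracy) / n)


se_text_gen = binomial_se(acc_text_gen, n)
se_cappr = binomial_se(acc_cappr, n)
diff = acc_cappr - acc_text_gen
n_questions = round(diff * n)  # how many questions the gap corresponds to

print(f"difference: {diff:.3f} ({n_questions} questions)")
print(f"standard errors: {se_text_gen:.4f}, {se_cappr:.4f}")
```

The gap of 3 questions is smaller than one standard error (about 0.013), which is consistent with the two methods performing similarly here.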