This notebook requires a GPU. A T4 is sufficient, and one is available for free on [Google
Colab](https://stackoverflow.com/questions/62596466/how-can-i-run-notebooks-of-a-github-project-in-google-colab/67344477#67344477).

**Description**: for a [GPTQd Mistral
7B](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GPTQ) and the [Craigslist
Bargains](https://huggingface.co/datasets/craigslist_bargains) classification task, this
notebook demonstrates that CAPPr gets you about +1% absolute accuracy compared to text
generation (90.5% vs. 89.5% on the test set). In general, you should expect similar or
identical performance when every completion is 1 token long. This notebook also
demonstrates using the `prior` keyword argument.

**Estimated run time**: ~15 min.
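
One way to see why single-token completions tend to give similar results: with greedy decoding, text generation picks the most likely next token, while scoring completions picks the most likely class name. The toy sketch below uses made-up next-token probabilities and assumes every class name is a single token; it is an illustration of the idea, not CAPPr's implementation.

```python
# Hypothetical next-token probabilities after the prompt (made-up numbers)
next_token_probs = {"phone": 0.55, "car": 0.30, "The": 0.10, "bike": 0.05}
class_names = ["bike", "car", "phone"]  # assumed to be 1 token each

# Text generation with greedy decoding: most likely token over the whole vocabulary
greedy_token = max(next_token_probs, key=next_token_probs.get)

# Completion scoring: most likely token among the class names only
class_scores = {name: next_token_probs[name] for name in class_names}
scored_choice = max(class_scores, key=class_scores.get)

print(greedy_token, scored_choice)
```

The two choices agree whenever the greedy token is one of the class names. They can differ when the model's most likely token is something else, e.g., `"The"`.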

In [1]:
!pip install torch==2.2.1 torchaudio torchvision transformers

In [None]:
!pip install auto-gptq --no-build-isolation \
optimum

In [None]:
!pip install "cappr[demos] @ git+https://github.com/kddubey/cappr.git"

In [3]:
from __future__ import annotations
from typing import Sequence

import datasets
import pandas as pd
import torch
from tqdm.auto import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

from cappr import huggingface as hf

# Load training data

In [None]:
df_tr = pd.DataFrame(datasets.load_dataset("aladar/craigslist_bargains", split="train"))

In [5]:
len(df_tr)

5247

In [6]:
CLASS_NAMES: list[str] = sorted(set(df_tr["class_name"]))

# Write prompt

Instruction format pulled from
[here](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GPTQ#you-can-then-use-the-following-code).

In [7]:
chat_template = """
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
""".lstrip("\n")

In [8]:
class_names_str = ", ".join(CLASS_NAMES)

In [9]:
system_message = (
    "You will be given a dialogue between a seller and a buyer about the sale of "
    "an item.\n"
    "Your task is to categorize the item being sold as one of these categories: "
    f"{class_names_str}\n\n"
    "You will answer with just the the correct category and nothing else."
)

In [10]:
def prompt(dialogue: str) -> str:
    user_message = (
        "Dialogue:\n"
        f'"""\n{dialogue}\n"""\n'
        "The item discussed in the dialogue above belongs to the category:"
    )
    return chat_template.format(
        system_message=system_message, prompt=user_message
    )

# Process training data

To ensure train and test data processing is equivalent, we'll apply the same function:
`process`.

In [11]:
def process(df: pd.DataFrame) -> pd.DataFrame:
    df["text"] = df["text"].fillna("")
    df["prompt"] = [prompt(dialogue) for dialogue in df["text"]]
    return df[["text", "label", "prompt", "class_name"]]

In [12]:
df_tr = process(df_tr)

In [13]:
len(df_tr)

5247

For quick, cheap experiments, we don't need all of this data. A random sample of 100 rows
will do.

In [14]:
df_tr_mini = df_tr.copy().sample(n=100, random_state=123).reset_index(drop=True)
len(df_tr_mini)

100

# Preview prompt

In [15]:
print(df_tr["prompt"].iloc[0])

<|im_start|>system
You will be given a dialogue between a seller and a buyer about the sale of an item.
Your task is to categorize the item being sold as one of these categories: bike, car, electronics, furniture, housing, phone

You will answer with just the the correct category and nothing else.<|im_end|>
<|im_start|>user
Dialogue:
"""
Buyer: Hi, not sure if the charger would work for my car. Can you sell it to me for $5?
Seller: It will work, i have never seen a car without a cigarette lighter port.\
Buyer: Still, can I buy it for $5? I'm on a tight budge
Seller: I think the lowest I would want to go is 8. 
Buyer: How about $6 and I pick it up myself? It'll save you shipping to me.
Seller: 7, and we have a deal.
Buyer: Eh, fine. $7.
"""
The item discussed in the dialogue above belongs to the category:<|im_end|>
<|im_start|>assistant



In [16]:
print(df_tr["class_name"].iloc[0])

phone


Ooh, a little tricky, since the buyer and seller only mention a charger. I bet most models
will incorrectly classify the dialogue as a car sale.

# Load model

In [None]:
model_name = "TheBloke/Mistral-7B-OpenOrca-GPTQ"
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", trust_remote_code=False, revision="main"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

In [18]:
_ = model(**tokenizer(["warm up"], return_tensors="pt").to(model.device))

# CAPPr

Many research datasets are intentionally balanced, so that predictions depend only on the
model's likelihood. Real applications rarely have balanced classes. Let's see what the
distribution looks like for this more realistic dataset:

In [19]:
df_tr["class_name"].value_counts(normalize=True).sort_index()

bike           0.179150
car            0.133028
electronics    0.130741
furniture      0.247951
housing        0.202783
phone          0.106346
Name: class_name, dtype: float64

Indeed, not really balanced. Let's supply this prior to CAPPr and see how it does on a
small subset of the training data.

In [20]:
prior = (
    df_tr["class_name"]
    .value_counts(normalize=True)
    [CLASS_NAMES]
)
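
To build intuition for what a prior can do, here is a minimal sketch that assumes the prior is combined with the model's likelihoods via Bayes' rule (posterior proportional to likelihood times prior). The prior values are the class frequencies printed above; the likelihoods are made up for illustration, and the exact computation inside CAPPr may differ.

```python
class_names = ["bike", "car", "electronics", "furniture", "housing", "phone"]

# Class frequencies from the training data (same order as class_names)
prior = [0.179150, 0.133028, 0.130741, 0.247951, 0.202783, 0.106346]

# Hypothetical likelihoods Pr(completion | prompt) for one prompt (made-up numbers)
likelihood = [0.10, 0.12, 0.15, 0.28, 0.05, 0.30]

# Bayes' rule: multiply elementwise, then normalize so the result sums to 1
unnormalized = [lik * pri for lik, pri in zip(likelihood, prior)]
total = sum(unnormalized)
posterior = [value / total for value in unnormalized]

pred_without_prior = class_names[likelihood.index(max(likelihood))]
pred_with_prior = class_names[posterior.index(max(posterior))]
print(pred_without_prior, pred_with_prior)
```

In this made-up case the likelihoods slightly favor `phone`, the rarest class, but the prior tips the prediction to `furniture`, the most common class.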

Let's also cache the shared prompt instructions, so that the model processes them once
instead of once per prompt.

In [21]:
chat_template_shared_instructions = """
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_prompt}
""".strip("\n")

chat_template_prompt = """
{prompt}<|im_end|>
<|im_start|>assistant
""".lstrip("\n")

In [22]:
user_prompt = "Dialogue:\n"
prompt_prefix = chat_template_shared_instructions.format(
    system_message=system_message,
    user_prompt=user_prompt,
)


def prompt_suffix(dialogue: str):
    prompt = (
        f'"""\n{dialogue}\n"""\n'
        "The item discussed in the dialogue above belongs to the category:"
    )
    return chat_template_prompt.format(prompt=prompt)


df_tr_mini["prompt_suffix"] = [
    prompt_suffix(dialogue) for dialogue in df_tr_mini["text"]
]

In [23]:
# Here's what the model will see
print(prompt_prefix + " " + df_tr_mini["prompt_suffix"].iloc[0])

<|im_start|>system
You will be given a dialogue between a seller and a buyer about the sale of an item.
Your task is to categorize the item being sold as one of these categories: bike, car, electronics, furniture, housing, phone

You will answer with just the the correct category and nothing else.<|im_end|>
<|im_start|>user
Dialogue:
 """
Buyer: hi i am interested in a bike
Seller: Great! It's a terrific bike and I'm asking $200.
Buyer: It is quite a bike but my budget is closer to 100. Don't get me wrong is nice but is not new.
Seller: That's true, it's not new and does have some scuff marks. But it's in solid condition mechanically. How about $150?
Buyer: I see no problem with this. I agree.
"""
The item discussed in the dialogue above belongs to the category:<|im_end|>
<|im_start|>assistant
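
Note the space before the opening `"""` in the printed prompt. The sketch below rebuilds the templates with placeholder text (the system message and dialogue here are stand-ins, not the real ones) to check how the prefix/suffix split relates to the original single-string prompt.

```python
chat_template = (
    "<|im_start|>system\n{system_message}<|im_end|>\n"
    "<|im_start|>user\n{prompt}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
system_message = "Categorize the item."  # placeholder
dialogue = "Buyer: hi\nSeller: hello"  # placeholder

question = "The item discussed in the dialogue above belongs to the category:"
body = f'"""\n{dialogue}\n"""\n{question}'

# Original, single-string prompt
full_prompt = chat_template.format(
    system_message=system_message, prompt="Dialogue:\n" + body
)

# Split version: shared prefix (cached) and per-dialogue suffix
prefix = (
    f"<|im_start|>system\n{system_message}<|im_end|>\n"
    "<|im_start|>user\nDialogue:\n"
)
suffix = f"{body}<|im_end|>\n<|im_start|>assistant\n"

with_space = prefix + " " + suffix
print(prefix + suffix == full_prompt)
print(with_space == full_prompt)
```

Concatenating the two pieces directly reproduces the original prompt. Joining them with a space, as in the preview above, adds one extra space after `Dialogue:\n`, so the cached and uncached prompts are not character-for-character identical.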



In [24]:
cached = hf.classify.cache_model((model, tokenizer), prompt_prefix)

In [25]:
pred_probs = hf.classify.predict_proba(
    prompts=df_tr_mini["prompt_suffix"],
    completions=CLASS_NAMES,
    model_and_tokenizer=cached,
    prior=prior,
)


Accuracy:

In [26]:
(pred_probs.argmax(axis=1) == df_tr_mini["label"]).mean()

0.92

# Text generation

We need to create a PyTorch Dataset to batch the inputs.

In [27]:
class TextsDataset(torch.utils.data.Dataset):
    def __init__(self, texts: list[str]):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index: int):
        return self.texts[index]

In [28]:
text_gen_dataset_tr = TextsDataset(df_tr_mini["prompt"].tolist())

We'll use greedy decoding to better ensure that each completion is one of the class names.

In [29]:
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
    max_new_tokens=5,
    return_full_text=False,
)
# pad to allow batching. these get masked out ofc
generator.tokenizer.pad_token_id = generator.model.config.eos_token_id

In [30]:
completions = []
for seq in tqdm(
    generator(
        text_gen_dataset_tr,
        # suppress "Setting pad_token_id..." stdout
        pad_token_id=generator.tokenizer.eos_token_id,
        batch_size=2,
    ),
    total=len(text_gen_dataset_tr),
    desc="Sampling",
):
    completions.append(seq[0]['generated_text'])




Let's see if the model generated categories like we asked.

In [31]:
pd.Series(completions).sample(n=10)

10       furniture
26            bike
67       furniture
5              car
44         housing
7      electronics
27             car
87           phone
58            bike
74         housing
dtype: object

Nice, text generation works well here. Writing the post-processor is trivial.

In [32]:
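def process_completion(completion: str, class_names: Sequence[str], default=-1) -> int:
    # Return the index of the first class name found in the completion
    # (case-insensitive substring match), or `default` if none is found
    completion = completion.lower()
    for i, name in enumerate(class_names):
        if name in completion:
            return i
    return default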
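
A few made-up completions show how this post-processor behaves, including a pitfall of substring matching. The function is repeated so that the example is self-contained; the input strings are hypothetical, not model outputs from this notebook.

```python
from typing import Sequence

CLASS_NAMES = ["bike", "car", "electronics", "furniture", "housing", "phone"]


def process_completion(completion: str, class_names: Sequence[str], default=-1) -> int:
    # Same logic as above: first class name that appears as a substring wins
    for i, name in enumerate(class_names):
        if name in completion.lower():
            return i
    return default


print(process_completion(" Furniture", CLASS_NAMES))  # case and whitespace don't matter
print(process_completion("I'm not sure", CLASS_NAMES))  # no class name: default
print(process_completion("carpet", CLASS_NAMES))  # pitfall: "car" is a substring
print(process_completion("a car phone charger", CLASS_NAMES))  # earlier class name wins
```

The last two cases are why it's worth checking what the model actually generated before trusting a substring-based mapping.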

In [33]:
preds_text_gen = [
    process_completion(completion, CLASS_NAMES)
    for completion in completions
]

How many of the completions could be mapped to a label?

In [34]:
(pd.Series(preds_text_gen) != -1).mean()

1.0

How accurate are the predictions?

In [35]:
(df_tr_mini["label"] == preds_text_gen).mean()

0.9

# Evaluate on test data

It's important to do some things honorably.

<span style="font-family: Baskerville; font-size: 18px;">I solemnly swear that I
evaluated on the test set twice (once per pre-selected method), running only the
following cells in sequence once.</span>

<img src="../signature.png" alt="drawing" width="200"/>
<div style="width:200px"><hr/></div>

In [36]:
_df_raw_te = pd.DataFrame(
    datasets.load_dataset("aladar/craigslist_bargains", split="test")
)
len(_df_raw_te)

838

In [38]:
df_te = process(_df_raw_te)

In [39]:
df_te = df_te.assign(
    prompt_suffix=[prompt_suffix(dialogue) for dialogue in df_te["text"]]
)


## Text generation

In [40]:
text_gen_dataset_te = TextsDataset(df_te["prompt"].tolist())

In [41]:
completions_te = []
for seq in tqdm(
    generator(
        text_gen_dataset_te,
        # suppress "Setting pad_token_id..." stdout
        pad_token_id=generator.tokenizer.eos_token_id,
        batch_size=2,
    ),
    total=len(text_gen_dataset_te),
    desc="Sampling",
):
    completions_te.append(seq[0]['generated_text'])


In [42]:
preds_text_gen_te = [
    process_completion(completion, CLASS_NAMES)
    for completion in completions_te
]

How many of the completions could be mapped to a label?

In [43]:
(pd.Series(preds_text_gen_te) != -1).mean()

0.9976133651551312

How accurate are the predictions?

In [44]:
(preds_text_gen_te == df_te["label"]).mean()

0.8949880668257757

## CAPPr

In [45]:
pred_probs_te = hf.classify.predict_proba(
    prompts=df_te["prompt_suffix"],
    completions=CLASS_NAMES,
    model_and_tokenizer=cached,  # cached prompt_prefix
    prior=prior,  # estimated from independent training data
)


How accurate are the predictions?

In [46]:
(pred_probs_te.argmax(axis=1) == df_te["label"]).mean()

0.9045346062052506

Hmm, [Refuel AI's few-shot
experiment](https://github.com/refuel-ai/autolabel/blob/main/examples/craigslist/example_craigslist.ipynb)
with `gpt-3.5-turbo` got 89% accuracy. But the way that experiment processed dialogues
seems off. So there isn't good evidence that this 4 GB model beats `gpt-3.5-turbo` just yet.
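
To put the roughly one-point gap between the two test accuracies above in context, here is a rough back-of-the-envelope calculation. It converts each accuracy to a count of correct predictions and computes a simple binomial standard error. This treats the two methods as independent samples, which is a simplifying assumption; a paired comparison on per-example predictions would be more appropriate.

```python
import math

n_test = 838
acc_text_gen = 0.8949880668257757
acc_cappr = 0.9045346062052506

# Convert accuracies back to counts of correct predictions
n_correct_text_gen = round(acc_text_gen * n_test)
n_correct_cappr = round(acc_cappr * n_test)
n_diff = n_correct_cappr - n_correct_text_gen


def standard_error(acc: float, n: int) -> float:
    # Binomial standard error of an accuracy estimated from n examples
    return math.sqrt(acc * (1 - acc) / n)


se_text_gen = standard_error(acc_text_gen, n_test)
se_cappr = standard_error(acc_cappr, n_test)

print(n_correct_text_gen, n_correct_cappr, n_diff)
print(round(se_text_gen, 4), round(se_cappr, 4))
```

The gap amounts to a handful of test examples, and each standard error is about one percentage point, so the difference should be read as suggestive rather than conclusive.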