This notebook runs a tiny demo of Llama 2 (chat or raw) on a classification task using
CAPPr and then via sampling.

In [None]:
!pip install torch==2.2.1 torchaudio torchvision transformers

I don't wanna pay to rent an A100, so I need a semi-aggressively quantized model:
something which fits on a T4. That requires the latest `transformers`, `auto-gptq`, and
`optimum`, according to this [HF blog
post](https://huggingface.co/blog/gptq-integration#autogptq-library--the-one-stop-library-for-efficiently-leveraging-gptq-for-llms).
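
A rough back-of-the-envelope check of why quantization matters here. This sketch assumes roughly 7 billion parameters, 4-bit quantized weights, and a 16 GB T4. It ignores quantization metadata, activations, and the KV cache, so treat the numbers as lower bounds rather than measurements.

```python
# Rough weight-memory estimate. Both constants below are assumptions.
N_PARAMS = 7e9       # approximate parameter count of a "7B" model
T4_MEMORY_GB = 16    # nominal T4 memory; usable memory is somewhat less


def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


fp16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)
print(f"fp16 weights:  ~{fp16_gb:.1f} GB of {T4_MEMORY_GB} GB")
print(f"4-bit weights: ~{int4_gb:.1f} GB of {T4_MEMORY_GB} GB")
```

Under these assumptions the fp16 weights alone take up most of the card, while the 4-bit weights leave room for activations and the cache.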

In [None]:
!pip install auto-gptq --no-build-isolation \
optimum

In [None]:
!pip install "cappr[demos] @ git+https://github.com/kddubey/cappr.git"

In [2]:
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    GenerationConfig,
    pipeline,
)

from cappr.huggingface import classify as fast
from cappr.huggingface import classify_no_cache as slow

In [3]:
_msg = (
    "This notebook must run on a GPU. A T4 instance is sufficient for the models "
    "tested here."
)
assert torch.cuda.is_available(), _msg

In [None]:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [4]:
# model_id = "TheBloke/Llama-2-7B-GPTQ"
model_id = "TheBloke/Llama-2-7b-Chat-GPTQ"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [5]:
# warm up model
_ = model(**tokenizer(["warm up"], return_tensors="pt").to(DEVICE))

Chat format is pulled from [this HF blog post](https://huggingface.co/blog/llama2#how-to-prompt-llama-2). I'm not sure why a `<s>` token is already included. The tokenizer sets `add_bos_token=True` by default, so the literal `<s>` seems redundant. It doesn't end up making a significant difference.

In [6]:
llama_chat_template = """
<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_message} [/INST]
""".lstrip(
    "\n"
)
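
If the doubled `<s>` is a concern, one option is to strip the literal token from the prompt text and let the tokenizer add it. The helper below is hypothetical (it is not part of `transformers` or `cappr`) and assumes the BOS token is the string `<s>`. The template is repeated so the snippet runs on its own.

```python
def strip_leading_bos(prompt: str, bos_token: str = "<s>") -> str:
    """Remove one literal BOS token from the start of the prompt, if present.

    Hypothetical helper: assumes the tokenizer will add its own BOS token.
    """
    return prompt.removeprefix(bos_token)  # requires Python 3.9+


# Same template as in the notebook, written as a single string
template = (
    "<s>[INST] <<SYS>>\n"
    "{system_prompt}\n"
    "<</SYS>>\n"
    "\n"
    "{user_message} [/INST]\n"
)

template_no_bos = strip_leading_bos(template)
print(template_no_bos)
```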

In [7]:
# Define a classification task
feedback_types = (
    "the product is too expensive",
    "the product uses low quality materials",
    "the product is difficult to use",
    "the product is great",
)

# Write a prompt
def prompt_func(product_review: str) -> str:
    system_prompt = "You are an expert at summarizing product reviews."
    user_message = f"This product review: {product_review}\n" "is best summarized as:"
    if "chat" not in model_id.lower():
        return user_message
    return llama_chat_template.format(
        system_prompt=system_prompt, user_message=user_message
    )

# Supply the texts you wanna classify
product_reviews = [
    "I can't figure out how to integrate it into my setup.",
    "Yeah it's pricey, but it's definitely worth it.",
]
prompts = [prompt_func(product_review) for product_review in product_reviews]
completions = feedback_types

In [8]:
print(prompts[0])

<s>[INST] <<SYS>>
You are an expert at summarizing product reviews.
<</SYS>>

This product review: I can't figure out how to integrate it into my setup.
is best summarized as: [/INST]



In [9]:
pred_probs_fast = fast.predict_proba(
    prompts, completions, model_and_tokenizer=(model, tokenizer)
)

log-probs:   0%|          | 0/2 [00:00<?, ?it/s]

In [10]:
pred_probs_slow = slow.predict_proba(
    prompts, completions, model_and_tokenizer=(model, tokenizer)
)

log-probs (no cache):   0%|          | 0/2 [00:00<?, ?it/s]

In [11]:
pred_probs_fast.round(3)

array([[0.011, 0.013, 0.968, 0.009],
       [0.767, 0.029, 0.148, 0.056]])

In [12]:
pred_probs_slow.round(3)

array([[0.011, 0.013, 0.968, 0.008],
       [0.767, 0.029, 0.148, 0.056]])
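
The cached and uncached implementations should agree up to small numerical noise. A minimal check, using the rounded values printed above (copied by hand into plain lists so the snippet runs without a GPU). The tolerance is an assumption chosen to allow for rounding at three decimals.

```python
# Rounded probabilities copied from the two outputs above
probs_fast = [[0.011, 0.013, 0.968, 0.009], [0.767, 0.029, 0.148, 0.056]]
probs_slow = [[0.011, 0.013, 0.968, 0.008], [0.767, 0.029, 0.148, 0.056]]

ATOL = 1.5e-3  # assumed tolerance: two values rounded to 3 decimals can differ by 0.001

# Largest element-wise absolute difference between the two results
max_diff = max(
    abs(fast_prob - slow_prob)
    for row_fast, row_slow in zip(probs_fast, probs_slow)
    for fast_prob, slow_prob in zip(row_fast, row_slow)
)
agree = max_diff < ATOL
print(f"max abs difference: {max_diff:.4f}, agree: {agree}")
```

On the real arrays, `np.allclose(pred_probs_fast, pred_probs_slow, atol=...)` does the same comparison in one call.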

IMO the second classification is wrong (it should be "the product is great"). It's at
least an understandable mistake.
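
To see which label each row of probabilities corresponds to, take the argmax over the completions. This sketch hard-codes the rounded probabilities, the reviews, and the `feedback_types` tuple from above so it runs on its own.

```python
feedback_types = (
    "the product is too expensive",
    "the product uses low quality materials",
    "the product is difficult to use",
    "the product is great",
)
product_reviews = [
    "I can't figure out how to integrate it into my setup.",
    "Yeah it's pricey, but it's definitely worth it.",
]
# Rounded values copied from the pred_probs_fast output above
pred_probs = [[0.011, 0.013, 0.968, 0.009], [0.767, 0.029, 0.148, 0.056]]

# For each row, pick the index of the most probable completion
pred_labels = [
    feedback_types[max(range(len(row)), key=row.__getitem__)] for row in pred_probs
]
for review, label in zip(product_reviews, pred_labels):
    print(f"{review!r} -> {label!r}")
# First review  -> 'the product is difficult to use'
# Second review -> 'the product is too expensive'
```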

The baseline to beat is sampling from the LM:

In [13]:
generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

generation_config = GenerationConfig(
    max_new_tokens=100,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

In [16]:
user_message = """
Every product review belongs to one of these lettered categories:
A. the product is too expensive
B. the product uses low quality materials
C. the product is difficult to use
D. the product is great

Which category does the following product review belong to?
Product review: "I can't figure out how to integrate it into my setup."

Respond only with the letter choice: A or B or C or D.
"""

system_prompt = (
    "You are an expert at categorizing product reviews. "
    "First, respond with the letter corresponding to the category "
    "which matches the given product review. And then provide an "
    "explanation."
)

if "chat" not in model_id.lower():
    prompt = user_message
else:
    prompt = llama_chat_template.format(
        system_prompt=system_prompt, user_message=user_message
    )

sequences = generator(
    prompt,
    generation_config=generation_config,
    pad_token_id=generator.tokenizer.eos_token_id,  # suppress "Setting ..."
)
for seq in sequences:
    response = seq["generated_text"].removeprefix(prompt)
    print(response)

A. The product is too expensive


I've run this many times and get the desired result (the letter C) pretty rarely. Also,
setting `num_return_sequences` to a bigger number causes numerical instability.
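
Comparing sampled outputs against the CAPPr predictions requires mapping free text back to a category. A minimal sketch, assuming the response starts with the letter choice. `extract_letter` is a hypothetical helper, not part of `transformers` or `cappr`, and only the first response below comes from this notebook. The other two are made up for illustration.

```python
import re
from collections import Counter
from typing import Optional


def extract_letter(response: str) -> Optional[str]:
    """Return the leading letter choice (A-D), or None if there isn't one.

    Only matches a standalone capital letter at the start of the response,
    so "Absolutely..." is rejected but "A. ..." is accepted.
    """
    match = re.match(r"\s*([ABCD])\b", response)
    return match.group(1) if match else None


responses = [
    "A. The product is too expensive\n",  # output seen in this notebook
    " C",                                 # made-up example
    "I think it is difficult to use.",    # made-up example with no letter
]
letters = [extract_letter(response) for response in responses]
counts = Counter(letters)
print(letters)
print(counts)
```

Tallying the letters over repeated runs gives a rough estimate of how often sampling lands on the desired category.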