You must run this notebook on a GPU. A T4 is sufficient, and you can get one for free on [Google
Colab](https://stackoverflow.com/questions/62596466/how-can-i-run-notebooks-of-a-github-project-in-google-colab/67344477#67344477).

This notebook runs a tiny demo of a [GPTQ-quantized StableLM
3B](https://huggingface.co/ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g) on a
classification task, first using CAPPr and then via sampling.
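
For intuition, here's a rough sketch of the idea behind classifying with completion probabilities instead of sampling. It assumes each candidate completion is scored by its exponentiated average token log-probability given the prompt, and that scores are then normalized across candidates. The token log-probabilities below are made up for illustration; in practice they come from the language model, and CAPPr's actual computation may differ in its details.

```python
import numpy as np

# Hypothetical per-token log-probabilities of each completion given one prompt.
# These numbers are made up; a real run would get them from the language model.
token_log_probs = {
    "the product is too expensive": [-2.1, -0.4, -0.3, -1.9, -1.2],
    "the product is great": [-2.1, -0.4, -0.3, -2.8],
}

# Score each completion by its average token log-probability, exponentiated,
# so that longer completions aren't penalized just for having more tokens.
scores = np.array([np.exp(np.mean(lps)) for lps in token_log_probs.values()])

# Normalize the scores so they sum to 1 across the candidate completions.
pred_probs = scores / scores.sum()
for completion, prob in zip(token_log_probs, pred_probs):
    print(f"{completion}: {prob:.3f}")
```

The point is that every candidate gets a probability, so there's no free-form output to parse.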

In [1]:
!pip install torch==2.2.1 torchaudio torchvision transformers

In [None]:
!pip install auto-gptq --no-build-isolation \
optimum

In [None]:
!pip install "cappr[demos] @ git+https://github.com/kddubey/cappr.git"

In [3]:
!git lfs install
!git clone https://huggingface.co/ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g

Git LFS initialized.
Cloning into 'stablelm-tuned-alpha-3b-gptq-4bit-128g'...
remote: Enumerating objects: 23, done.
remote: Total 23 (delta 0), reused 0 (delta 0), pack-reused 23
Unpacking objects: 100% (23/23), 593.64 KiB | 2.95 MiB/s, done.


In [4]:
!ls stablelm-tuned-alpha-3b-gptq-4bit-128g

config.json			  README.md
generation_config.json		  special_tokens_map.json
gptq_model-4bit-128g.safetensors  tokenizer_config.json
quantize_config.json		  tokenizer.json


In [5]:
from auto_gptq import AutoGPTQForCausalLM
import torch
from transformers import AutoTokenizer, GenerationConfig, pipeline

from cappr.huggingface import classify as fast
from cappr.huggingface import classify_no_cache as slow

In [6]:
_msg = (
    "This notebook must run on a GPU. A T4 instance is sufficient for the models "
    "tested here."
)
assert torch.cuda.is_available(), _msg

In [7]:
quantized_model_dir = "stablelm-tuned-alpha-3b-gptq-4bit-128g"
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir, use_triton=False, use_safetensors=True
)
tokenizer = AutoTokenizer.from_pretrained("StabilityAI/stablelm-tuned-alpha-7b")



Downloading (…)okenizer_config.json:   0%|          | 0.00/264 [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

In [8]:
# warm up model
_ = model(**tokenizer(["warm up"], return_tensors="pt").to(model.device))

Chat format is pulled from [the HF page](https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b).

In [9]:
stablelm_chat_template = """
<|SYSTEM|># {system_prompt}
<|USER|>{user_message}<|ASSISTANT|>
""".strip("\n")

In [10]:
# Define a classification task
feedback_types = (
    "the product is too expensive",
    "the product uses low quality materials",
    "the product is difficult to use",
    "the product is great",
)


# Write a prompt
def prompt_func(product_review: str) -> str:
    system_prompt = "You are an expert at summarizing product reviews."
    user_message = f"This product review: {product_review}\nis best summarized as"
    return stablelm_chat_template.format(
        system_prompt=system_prompt, user_message=user_message
    )


# Supply the texts you want to classify
product_reviews = [
    "I can't figure out how to integrate it into my setup.",
    "Yeah it's pricey, but it's definitely worth it.",
]
prompts = [prompt_func(product_review) for product_review in product_reviews]
completions = feedback_types

In [11]:
print(prompts[0])

<|SYSTEM|># You are an expert at summarizing product reviews.
<|USER|>This product review: I can't figure out how to integrate it into my setup.
is best summarized as<|ASSISTANT|>


In [12]:
pred_probs_fast = fast.predict_proba(
    prompts, completions, model_and_tokenizer=(model, tokenizer)
)

In [13]:
pred_probs_slow = slow.predict_proba(
    prompts, completions, model_and_tokenizer=(model, tokenizer)
)

In [None]:
# pred_probs_slow.round(3)
# Differences in the `fast` and `slow` module probabilities are not as small as I was
# hoping.
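
One way to quantify how far apart the two modules' outputs are is to look at the largest absolute difference in probabilities and at whether the top class ever changes. The sketch below uses made-up arrays as stand-ins for `pred_probs_fast` and `pred_probs_slow`; they are not the values produced by this notebook.

```python
import numpy as np

# Hypothetical stand-ins for the two modules' outputs (made-up numbers).
# Rows are prompts, columns are completions.
probs_fast = np.array([[0.22, 0.02, 0.50, 0.26], [0.33, 0.05, 0.20, 0.42]])
probs_slow = np.array([[0.23, 0.02, 0.50, 0.25], [0.32, 0.05, 0.21, 0.42]])

# Largest elementwise gap between the two sets of probabilities.
max_abs_diff = np.abs(probs_fast - probs_slow).max()

# Whether both modules pick the same class for every prompt.
same_predictions = bool(
    (probs_fast.argmax(axis=1) == probs_slow.argmax(axis=1)).all()
)

print(f"max abs diff: {max_abs_diff:.3f}")
print(f"same predictions: {same_predictions}")
```

If the predicted classes agree, small numerical gaps matter less for classification, though they may still matter if you use the probabilities themselves.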

In [14]:
pred_probs_fast.round(3)

array([[0.221, 0.023, 0.502, 0.254],
       [0.327, 0.053, 0.204, 0.416]])

In [16]:
product_reviews

["I can't figure out how to integrate it into my setup.",
 "Yeah it's pricey, but it's definitely worth it."]

In [17]:
feedback_types  # AKA the "classes" of our classification problem

('the product is too expensive',
 'the product uses low quality materials',
 'the product is difficult to use',
 'the product is great')

Both predictions are correct.

For the first product review, the 3rd class ("the product is difficult to use") has
the highest probability.

For the second product review, the 4th class ("the product is great") has the highest
probability, followed closely by the 1st class ("the product is too expensive").

Pretty impressive for an aggressively quantized 3B parameter model!
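
To turn the probabilities into predicted classes programmatically, take the argmax of each row. The sketch below hard-codes the rounded probabilities printed above so that it runs on its own; in the notebook you'd use `pred_probs_fast` directly. CAPPr's `classify` module also has a `predict` function that should return the predicted completions directly, but check its docs for the exact signature.

```python
import numpy as np

feedback_types = (
    "the product is too expensive",
    "the product uses low quality materials",
    "the product is difficult to use",
    "the product is great",
)

# Rounded probabilities copied from the output above.
# Rows are product reviews, columns are feedback types.
pred_probs = np.array(
    [[0.221, 0.023, 0.502, 0.254],
     [0.327, 0.053, 0.204, 0.416]]
)

# Index of the most probable class for each review
pred_idxs = pred_probs.argmax(axis=1)
preds = [feedback_types[idx] for idx in pred_idxs]
print(preds)
```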

The baseline to beat is sampling from the LM:

In [18]:
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

generation_config = GenerationConfig(
    max_new_tokens=100,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    batch_size=1,
)

The model 'GPTNeoXGPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PersimmonForCausalLM', 'PLBartForCausalLM', 'Prop

^ The warning is wrong: the pipeline still generates text with this model, as the cells below show.

In [19]:
user_message = """
Every product review belongs to one of these lettered categories:
A. the product is too expensive
B. the product uses low quality materials
C. the product is difficult to use
D. the product is great

Which category does the following product review belong to?
Product review: "I can't figure out how to integrate it into my setup."

Respond only with the letter choice: A or B or C or D.
""".strip("\n")

system_prompt = (
    "You are an expert at categorizing product reviews. "
    "Respond with the letter corresponding to the category which the given "
    "product review belongs to."
)

prompt = stablelm_chat_template.format(
    system_prompt=system_prompt, user_message=user_message
)
print(prompt)

<|SYSTEM|># You are an expert at categorizing product reviews. Respond with the letter corresponding to the category which the given product review belongs to.
<|USER|>Every product review belongs to one of these lettered categories:
A. the product is too expensive
B. the product uses low quality materials
C. the product is difficult to use
D. the product is great

Which category does the following product review belong to?
Product review: "I can't figure out how to integrate it into my setup."

Respond only with the letter choice: A or B or C or D.<|ASSISTANT|>


In [20]:
sequences = generator(
    prompt,
    generation_config=generation_config,
    pad_token_id=generator.tokenizer.eos_token_id,  # suppress "Setting ..."
)
for seq in sequences:
    response = seq["generated_text"].removeprefix(prompt)
    print(response)

A. The product is too expensive.


Wrong. The correct answer is C (the product is difficult to use). You can try setting
`do_sample=True` in the `GenerationConfig`, but it rarely gives results that are both
correct and consistently parseable.
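
To make "parseable" concrete, here's a minimal sketch of the post-processing that sampling requires. The helper `parse_letter` is hypothetical (it's not part of CAPPr or `transformers`), and it assumes the first standalone capital letter A-D in the response is the model's answer. That assumption breaks on responses like "A good fit is C", which is part of why parsing sampled text is brittle.

```python
import re
from typing import Optional

letter_to_category = {
    "A": "the product is too expensive",
    "B": "the product uses low quality materials",
    "C": "the product is difficult to use",
    "D": "the product is great",
}


def parse_letter(response: str) -> Optional[str]:
    """Return the first standalone letter A-D in the response, or None."""
    match = re.search(r"\b([ABCD])\b", response)
    return match.group(1) if match else None


letter = parse_letter("A. The product is too expensive.")
print(letter, "->", letter_to_category.get(letter))
```

A `None` result means the response couldn't be mapped to any category, a failure mode that scoring completions directly doesn't have.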