## LLM Compressor Workbench -- Getting Started

This notebook demonstrates how to run common [LLM Compressor](https://github.com/vllm-project/llm-compressor) flows on the [opendatahub/llmcompressor-workbench](https://quay.io/repository/opendatahub/llmcompressor-workbench) image.

We will compress and evaluate a large language model, first without data and then with a calibration dataset.

If you are not using the Workbench image, make sure a recent llmcompressor release is installed: `pip install llmcompressor~=0.5`. The evaluation step in section 2 additionally needs `lm_eval` and `vllm`.

### 1\) Compress a model

In [None]:
from llmcompressor.modifiers.quantization import QuantizationModifier

# model to compress
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# This recipe will quantize all Linear layers except those in the `lm_head`,
#  which is often sensitive to quantization. The W4A16 scheme compresses
#  weights to 4-bit integers while retaining 16-bit activations.
recipe = QuantizationModifier(
    targets="Linear", scheme="W4A16", ignore=["lm_head"]
)
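
To make the W4A16 description concrete, the sketch below applies symmetric round-to-nearest 4-bit quantization to a tiny group of weights in plain Python. It is an illustration only, not LLM Compressor's implementation: the helper names, the four example weights, and the scale rule (largest magnitude mapped to 7) are assumptions made for readability.

```python
# Illustrative sketch only: symmetric round-to-nearest quantization to signed
# 4-bit integers. Helper names and the scale rule are assumptions of this
# sketch, not LLM Compressor internals.

INT4_MIN, INT4_MAX = -8, 7  # value range of a signed 4-bit integer


def quantize_int4(weights):
    """Return (integer codes, scale) for one group of float weights."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
    codes = [min(INT4_MAX, max(INT4_MIN, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


weights = [0.7, -0.32, 0.12, -0.04]
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("scale:", scale)
print("largest rounding error:", max_error)
```

Each weight is stored as one of 16 integer codes plus a shared scale, and the rounding error per weight is at most half the scale. Activations are left untouched, which is the "A16" half of the scheme name.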

In [None]:
# Load the model and tokenizer using the Hugging Face transformers API
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

In [None]:
# Run compression using `oneshot`
from llmcompressor import oneshot

model = oneshot(model=model, recipe=recipe, tokenizer=tokenizer)

In [None]:
# Save model and tokenizer
model_dir = "./" + model_id.split("/")[-1] + "-W4A16"
model.save_pretrained(model_dir)
tokenizer.save_pretrained(model_dir)
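
As a rough sanity check on the saved checkpoint, the following back-of-the-envelope sketch estimates how much storage a 4-bit weight scheme saves. Every number in it is an assumption chosen for illustration: the parameter counts are round figures rather than measurements of this model, and the sketch assumes one 16-bit scale per group of 128 weights. The real on-disk size depends on the storage format.

```python
# Back-of-the-envelope checkpoint size estimate. All parameter counts and the
# group size below are illustrative assumptions, not measured values.

def estimate_bytes(n_quantized, n_unquantized, weight_bits=4,
                   group_size=128, scale_bytes=2, full_precision_bytes=2):
    """Approximate storage for a model with some weights quantized."""
    packed_weights = n_quantized * weight_bits // 8
    scales = (n_quantized // group_size) * scale_bytes
    untouched = n_unquantized * full_precision_bytes
    return packed_weights + scales + untouched


n_quantized = 1_000_000_000   # assumed: parameters in quantized Linear layers
n_unquantized = 100_000_000   # assumed: embeddings, lm_head, norms

baseline = (n_quantized + n_unquantized) * 2  # everything stored in 16 bits
compressed = estimate_bytes(n_quantized, n_unquantized)

print(f"16-bit baseline: {baseline / 1e9:.2f} GB")
print(f"W4A16 estimate:  {compressed / 1e9:.2f} GB")
print(f"ratio:           {baseline / compressed:.2f}x")
```

The ratio stays below 4x because the layers left unquantized (such as `lm_head`) and the per-group scales are still stored at 16 bits. Comparing an estimate like this with the actual size of `model_dir` is a quick way to confirm that compression was applied.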

### 2\) Evaluate compressed model

In [None]:
# Evaluate the model we just compressed using the open-source LM Evaluation Harness
import lm_eval
from lm_eval.utils import make_table

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args={
        "pretrained": model_dir,
        "add_bos_token": True,
    },
    # gsm8k details: https://paperswithcode.com/dataset/gsm8k
    # wikitext: language-modeling benchmark scored by perplexity
    tasks=["gsm8k", "wikitext"],
)
# make_table returns a string, so print it to render the table
print(make_table(results))
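
When reading the table, keep in mind that the two tasks point in opposite directions: gsm8k reports accuracy-style scores (higher is better), while wikitext reports perplexity-style scores (lower is better). The sketch below shows what perplexity measures, using made-up per-token log-probabilities. It normalizes per token for simplicity, which is an assumption of this sketch; the harness's own normalization may differ.

```python
import math

# Illustrative sketch: perplexity from per-token log-probabilities.
# The numbers are made up for the example; they are not model outputs.

def perplexity(token_logprobs):
    """exp of the average negative log-likelihood per token."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# A model that assigns probability 1/4 to every token is as unsure as
# a uniform choice among 4 options.
uniform_guess = [math.log(0.25)] * 10
confident = [math.log(0.9)] * 10

print(perplexity(uniform_guess))
print(perplexity(confident))
```

A compressed model that has lost quality assigns lower probability to the reference text, so its perplexity rises relative to the uncompressed model.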

### 3\) Compress with a Calibration Dataset

Some more advanced compression techniques require a small calibration dataset: a random sample of text that is representative of what the model will see at inference.

This section extends the previous flow with a more advanced compression algorithm and a calibration dataset.

In [None]:
# We will use a new recipe running GPTQ (https://arxiv.org/abs/2210.17323)
# to reduce error caused by quantization. GPTQ requires a calibration dataset.
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
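
To show why a calibration dataset can reduce quantization error, here is a two-weight toy of the error-compensation idea behind GPTQ. It is a sketch under stated assumptions, not LLM Compressor's implementation: the weights, the integer grid, and the calibration inputs are made up, and real GPTQ processes whole layers column by column using the inverse of the Hessian.

```python
# Toy illustration of error-compensating quantization, the idea behind GPTQ.
# Two weights, an integer grid, and made-up calibration inputs: all are
# assumptions of this sketch, not LLM Compressor's implementation.

calibration_inputs = [[1.0, 1.0], [2.0, 2.0], [1.0, 0.5]]
weights = [0.4, 0.4]


def output_error(original, quantized, inputs):
    """Sum of squared differences between original and quantized outputs."""
    total = 0.0
    for x in inputs:
        y_ref = sum(xi * wi for xi, wi in zip(x, original))
        y_q = sum(xi * qi for xi, qi in zip(x, quantized))
        total += (y_ref - y_q) ** 2
    return total


# Baseline: round each weight to the nearest grid point independently.
rtn = [round(w) for w in weights]

# Compensated: quantize the first weight, then shift the second weight to
# absorb the error before rounding it. The shift uses second-order
# statistics of the calibration inputs (entries of H = X^T X).
h_12 = sum(x[0] * x[1] for x in calibration_inputs)
h_22 = sum(x[1] * x[1] for x in calibration_inputs)

q1 = round(weights[0])
error_1 = weights[0] - q1
adjusted_w2 = weights[1] + error_1 * h_12 / h_22
compensated = [q1, round(adjusted_w2)]

rtn_error = output_error(weights, rtn, calibration_inputs)
compensated_error = output_error(weights, compensated, calibration_inputs)

print("round-to-nearest:", rtn, "output error:", rtn_error)
print("compensated:     ", compensated, "output error:", compensated_error)
```

Because the two inputs are strongly correlated in the calibration data, rounding the second weight up cancels most of the error introduced by rounding the first weight down. Without calibration inputs there is no `H` to compute, which is why this recipe needs a dataset.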

In [None]:
from datasets import load_dataset

# Create the calibration dataset using the Hugging Face datasets API
dataset_id = "HuggingFaceH4/ultrachat_200k"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
num_calibration_samples = 512
max_sequence_length = 2048

# Load dataset
ds = load_dataset(dataset_id, split="train_sft")
# Shuffle and grab only the number of samples we need
ds = ds.shuffle(seed=42).select(range(num_calibration_samples))

# Preprocess and tokenize into the format the model uses
def preprocess(example):
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
    )
    return tokenizer(
        text,
        padding=False,
        max_length=max_sequence_length,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(preprocess, remove_columns=ds.column_names)

In [None]:
# oneshot modifies the model in place, so reload the original weights
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)
# run oneshot again, with dataset
model = oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_sequence_length,
    num_calibration_samples=num_calibration_samples,
)

In [None]:
# Save model and tokenizer
model_dir = "./" + model_id.split("/")[-1] + "-GPTQ-W4A16"
model.save_pretrained(model_dir)
tokenizer.save_pretrained(model_dir)