# `int4` Weight Quantization

`llm-compressor` supports quantizing weights to `int4` for memory savings and inference acceleration with `vLLM`.
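
As a rough illustration of the memory savings, the sketch below estimates weight storage for 16-bit weights versus 4-bit weights that carry one 16-bit scale per group of 128 weights. The parameter count is a hypothetical round number, not a measurement of any particular model, the 16-bit scale width is an assumption, and layers left unquantized are ignored.

```python
def weight_storage_bytes(num_params, bits_per_weight, group_size=None, scale_bits=16):
    """Estimate storage for a weight tensor, optionally with per-group scales."""
    total_bits = num_params * bits_per_weight
    if group_size is not None:
        # One scale per group of weights (ceiling division for a partial last group).
        num_groups = -(-num_params // group_size)
        total_bits += num_groups * scale_bits
    return total_bits / 8


NUM_PARAMS = 2_000_000_000  # hypothetical parameter count, for illustration only

fp16_bytes = weight_storage_bytes(NUM_PARAMS, 16)
int4_bytes = weight_storage_bytes(NUM_PARAMS, 4, group_size=128)

print(f"16-bit weights: {fp16_bytes / 1e9:.2f} GB")
print(f"int4 weights:   {int4_bytes / 1e9:.2f} GB")
print(f"ratio:          {fp16_bytes / int4_bytes:.2f}x")
```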

Note: `int4` mixed-precision computation is supported on NVIDIA GPUs with compute capability >= 8.0 (Ampere, Ada Lovelace, Hopper).
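
One way to check the GPU before starting is to read its compute capability through PyTorch. This is a minimal sketch: it treats 8.0 itself (the value an A100 reports) as supported, `supports_int4_mixed_precision` is a hypothetical helper name, and the check is skipped if PyTorch is not installed yet.

```python
def supports_int4_mixed_precision(capability):
    """Return True if a (major, minor) compute capability is 8.0 or newer."""
    return tuple(capability) >= (8, 0)


try:
    import torch
except ImportError:
    torch = None  # PyTorch is not installed yet

if torch is not None and torch.cuda.is_available():
    capability = torch.cuda.get_device_capability(0)
    print(f"GPU 0 compute capability: {capability[0]}.{capability[1]}")
    print("int4 mixed precision supported:", supports_int4_mixed_precision(capability))
else:
    print("No CUDA device detected (or PyTorch is not installed).")
```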

Note 🚨: The steps here take around 20 minutes, depending on network connectivity. The most time-consuming steps are installing `llmcompressor` (up to 5 minutes) and the quantization step (which can take more than 10 minutes).

## Installation

Installing `llmcompressor` may take up to 5 minutes, depending on the available bandwidth.

In [None]:
!pip install -q llmcompressor

In [None]:
!pip list | grep llmcompressor


### Other dependencies

The next command may print `ERROR` messages like the following, which can be safely ignored:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmcompressor 0.5.0 requires transformers<4.50,>4.0, but you have transformers 4.51.3 which is incompatible.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmcompressor 0.5.1 requires compressed-tensors==0.9.4, but you have compressed-tensors 0.9.3 which is incompatible.

In [None]:
!pip install -q -U accelerate vllm boto3
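
To see which versions actually ended up installed after these messages, a quick check with the standard library can help. This is a minimal sketch; the package list is only an example based on the messages above, and `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


packages = ["llmcompressor", "transformers", "compressed-tensors", "vllm"]
for name, installed in installed_versions(packages).items():
    print(f"{name}: {installed or 'not installed'}")
```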

## Quantize the model

There are 5 steps:
1. Load model
2. Prepare calibration data
3. Apply quantization
4. Evaluate accuracy in vLLM
5. Upload model to S3 (MinIO)

### Load model

Load the model using `AutoModelForCausalLM`, which handles saving and loading quantized models.

The model can be loaded from Hugging Face with code like the following:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "ibm-granite/granite-3.2-2b-instruct"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```

To save time in this lab, however, we have prepared a MinIO bucket that already contains the model.

Because this workbench was created with a data connection attached, the environment variables required to access the MinIO S3 bucket are already defined.
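
Before running the download cell, it may be worth confirming that those variables are really present. The sketch below checks the four names used in this notebook; `missing_env_vars` is a hypothetical helper.

```python
import os

REQUIRED_VARS = (
    "AWS_S3_ENDPOINT",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_S3_BUCKET",
)


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All S3 environment variables are set.")
```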

In [None]:
import os
from boto3 import client

MODEL_NAME = "base_model"
MODEL_DOWNLOAD_PATH = "/opt/app-root/src/base_model"

# Connection details come from the workbench's data connection
s3_endpoint_url = os.environ["AWS_S3_ENDPOINT"]
s3_access_key = os.environ["AWS_ACCESS_KEY_ID"]
s3_secret_key = os.environ["AWS_SECRET_ACCESS_KEY"]
s3_bucket_name = os.environ["AWS_S3_BUCKET"]

s3_client = client(
    's3', endpoint_url=s3_endpoint_url, aws_access_key_id=s3_access_key,
    aws_secret_access_key=s3_secret_key, verify=False)

# List all objects under the model prefix
objects = s3_client.list_objects(Bucket=s3_bucket_name, Prefix=MODEL_NAME)

# Download each object, recreating the folder layout locally
for obj in objects['Contents']:
    file_name = obj['Key']
    local_file_name = os.path.join(MODEL_DOWNLOAD_PATH, file_name.replace(MODEL_NAME, '')[1:])
    # exist_ok=True makes this a no-op when the directory already exists
    os.makedirs(os.path.dirname(local_file_name), exist_ok=True)
    s3_client.download_file(s3_bucket_name, file_name, local_file_name)

print('Model downloaded successfully from S3.')
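
A single `list_objects` call returns at most 1,000 keys per response. For prefixes that hold more objects, a paginator is one option. The sketch below assumes an S3 client like the one created in the cell above; `iter_object_keys` is a hypothetical helper name.

```python
def iter_object_keys(s3_client, bucket, prefix):
    """Yield every object key under `prefix`, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page holds no objects
        for obj in page.get("Contents", []):
            yield obj["Key"]


# Example usage with the client and variables from the cell above:
# for key in iter_object_keys(s3_client, s3_bucket_name, MODEL_NAME):
#     print(key)
```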

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "/opt/app-root/src/base_model"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

### Prepare calibration data

Prepare the calibration data. When quantizing the weights of a model to `int4` using GPTQ, we need sample data to run the GPTQ algorithm. For this reason, it is best to use calibration data that closely matches the type of data seen in deployment. If you have fine-tuned a model, a sample of your training data is a good choice.

In our case, we are quantizing a generic instruction-tuned model, so we will use the `neuralmagic/LLM_compression_calibration` dataset. Some best practices include:
- 512 samples is a good place to start (increase if accuracy drops). We are going to use 256 to speed up the process.
- 2048 sequence length is a good place to start
- Use the chat template or instruction template that the model was trained with
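
For datasets that store conversations as a list of role/content messages, the last two practices can be sketched as follows. This is an illustration only: the `messages` column name is an assumption (the dataset used below exposes a plain `text` column instead), and both helper names are hypothetical. `apply_chat_template` is the `transformers` tokenizer method that renders messages with the model's chat template.

```python
MAX_SEQUENCE_LENGTH = 2048


def render_with_chat_template(example, tokenizer):
    """Render a list of chat messages into a single text string."""
    # Assumption: the example has a "messages" column of {"role", "content"} dicts.
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return {"text": text}


def tokenize_truncated(sample, tokenizer, max_length=MAX_SEQUENCE_LENGTH):
    """Tokenize rendered text, truncating to the chosen sequence length."""
    return tokenizer(
        sample["text"],
        padding=False,
        truncation=True,
        max_length=max_length,
        # The chat template typically inserts the special tokens it needs.
        add_special_tokens=False,
    )
```

With a real tokenizer, these helpers could be applied through `ds.map(lambda ex: render_with_chat_template(ex, tokenizer))` followed by `ds.map(lambda ex: tokenize_truncated(ex, tokenizer))`.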


In [None]:
from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 256  # alternative: 1024
DATASET_ID = "neuralmagic/LLM_compression_calibration"
DATASET_SPLIT = "train"

# Load dataset.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

# Preprocess the data into the format the model is trained with.
def preprocess(example):
    #concat_txt = example["instruction"] + "\n" + example["output"]
    #return {"text": concat_txt}
    return {"text": example["text"]}
ds = ds.map(preprocess)

# Tokenize the data
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        truncation=False,
        add_special_tokens=True,
    )
ds = ds.map(tokenize, remove_columns=ds.column_names)
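
Because the tokenizer call above does not truncate, it can be useful to see how many calibration samples exceed a given sequence length. A minimal sketch, with `summarize_lengths` as a hypothetical helper and made-up lengths in the demo line:

```python
def summarize_lengths(lengths, max_seq_length):
    """Summarize token counts and count how many exceed max_seq_length."""
    lengths = list(lengths)
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
        "over_limit": sum(1 for n in lengths if n > max_seq_length),
    }


# Demo with made-up token counts
print(summarize_lengths([120, 950, 4100], max_seq_length=2048))

# Example usage with the tokenized dataset from the cell above:
# print(summarize_lengths((len(ids) for ids in ds["input_ids"]), max_seq_length=2048))
```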

### Apply quantization

With the dataset ready, we will now apply quantization.

We first select the quantization algorithm.

In our case, we will apply the default GPTQ recipe for int4 (which uses static group size 128 scales) to all linear layers.
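
To make the idea of per-group scales concrete, the sketch below shows plain round-to-nearest symmetric quantization with one scale per group of weights. It is not GPTQ itself, which additionally uses the calibration data to compensate for rounding error; it only illustrates how a scale maps floating-point weights onto the 16 values of a signed 4-bit integer. The function names are hypothetical, symmetric quantization is an assumption, and a tiny group size is used for readability.

```python
def quantize_groups_symmetric(weights, group_size=128, num_bits=4):
    """Round-to-nearest symmetric quantization with one scale per group."""
    qmax = 2 ** (num_bits - 1) - 1   # 7 for int4
    qmin = -(2 ** (num_bits - 1))    # -8 for int4
    quantized, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # The largest magnitude in the group maps to qmax
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            quantized.append(max(qmin, min(qmax, round(w / scale))))
    return quantized, scales


def dequantize_groups(quantized, scales, group_size=128):
    """Recover approximate weights from integers and per-group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]


weights = [3.5, -1.0, 0.5, 0.0, 7.0, 2.0, -4.0, 1.0]
q, s = quantize_groups_symmetric(weights, group_size=4)
print("int4 values:", q)
print("scales:     ", s)
print("restored:   ", dequantize_groups(q, s, group_size=4))
```

A smaller group size means more scales and usually a closer fit to the original weights, at the cost of extra storage for the scales.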

NOTE: 🚨 The quantization step takes a long time to complete due to the calibration requirements: around 10 minutes, depending on the GPU.

In [None]:
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

DAMPENING_FRAC = 0.1  # alternative: 0.01
OBSERVER = "mse"  # alternative: "minmax"
GROUP_SIZE = 128  # alternative: 64
# Configure the quantization algorithm to run.
recipe = [
    GPTQModifier(
        targets=["Linear"],
        ignore=["lm_head"],
        scheme="w4a16",
        dampening_frac=DAMPENING_FRAC,
        observer=OBSERVER,
        group_size=GROUP_SIZE
    )
]

# Apply quantization.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    max_seq_length=8196,
)

# Save to disk compressed.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-W4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
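
After saving, two quick sanity checks are the on-disk size and whether `config.json` records the quantization settings. The sketch below assumes the saved `config.json` contains a `quantization_config` entry; if the key is absent, the helper simply returns `None`. Both helper names are hypothetical.

```python
import json
import os


def directory_size_bytes(path):
    """Sum the sizes of all files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def read_quantization_config(model_dir):
    """Return the quantization_config entry of config.json, or None."""
    config_path = os.path.join(model_dir, "config.json")
    with open(config_path, encoding="utf-8") as handle:
        config = json.load(handle)
    return config.get("quantization_config")


# Example usage with the directories from this notebook:
# print(directory_size_bytes("/opt/app-root/src/base_model") / 1e9, "GB (original)")
# print(directory_size_bytes(SAVE_DIR) / 1e9, "GB (quantized)")
# print(read_quantization_config(SAVE_DIR))
```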

### Evaluate accuracy in vLLM

We can evaluate accuracy with `lm_eval`.

#### Check GPU memory leftovers

In [None]:
!nvidia-smi

IMPORTANT: 🚨 After quantizing the model, the GPU memory may not be freed (see the above output). You need to **restart the kernel** before evaluating the model to ensure you have enough GPU RAM available.
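
To read the memory figures programmatically rather than by eye, `nvidia-smi` can emit CSV through its `--query-gpu` and `--format` options. A minimal sketch, with `parse_gpu_memory` and `query_gpu_memory` as hypothetical helper names:

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB lines into a list of (used, total) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for used and total memory (MiB) of every GPU."""
    output = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_memory(output)


# Example usage on a machine with an NVIDIA driver installed:
# for index, (used, total) in enumerate(query_gpu_memory()):
#     print(f"GPU {index}: {used} MiB used of {total} MiB")
```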

#### Install lm_eval

In [None]:
!pip install -q lm_eval==v0.4.3

#### Run the evaluation

Run the following to test accuracy on GSM-8K:

In [None]:
MODEL_ID = "/opt/app-root/src/base_model"
SAVE_DIR = MODEL_ID.split("/")[-1] + "-W4A16"
!lm_eval --model vllm \
  --model_args pretrained=$SAVE_DIR,add_bos_token=true,max_model_len=6144 \
  --trust_remote_code \
  --tasks gsm8k \
  --num_fewshot 5 \
  --limit 250 \
  --batch_size 'auto'
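
A common way to judge the result is to run the same evaluation on the original model and compare the two scores. The sketch below expresses the quantized score as a percentage of the baseline; the numbers are hypothetical placeholders, not results from this notebook, and `accuracy_recovery` is a hypothetical helper.

```python
def accuracy_recovery(baseline_score, quantized_score):
    """Return the quantized score as a percentage of the baseline score."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return 100.0 * quantized_score / baseline_score


baseline = 0.50   # hypothetical score of the original model
quantized = 0.48  # hypothetical score of the quantized model
print(f"Recovery: {accuracy_recovery(baseline, quantized):.1f}%")
```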

If you have powerful GPU(s) available, you can also run the OpenLLM benchmark suite with the following:
```bash
!lm_eval \
  --model vllm \
  --model_args pretrained=$SAVE_DIR,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True \
  --trust_remote_code \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config
```

### Upload Optimized Model to MinIO

In [None]:
import os
from boto3 import client

MODEL_ID = "/opt/app-root/src/base_model"
OPTIMIZED_MODEL_DIR = MODEL_ID.split("/")[-1] + "-W4A16"
S3_PATH = "granite-int4"

print('Starting results upload.')
s3_endpoint_url = os.environ["AWS_S3_ENDPOINT"]
s3_access_key = os.environ["AWS_ACCESS_KEY_ID"]
s3_secret_key = os.environ["AWS_SECRET_ACCESS_KEY"]
s3_bucket_name = os.environ["AWS_S3_BUCKET"]

print(f'Uploading optimized model to bucket {s3_bucket_name} '
        f'to S3 storage at {s3_endpoint_url}')

s3_client = client(
    's3', endpoint_url=s3_endpoint_url, aws_access_key_id=s3_access_key,
    aws_secret_access_key=s3_secret_key, verify=False
)

# Walk through the local folder and upload files
for root, dirs, files in os.walk(OPTIMIZED_MODEL_DIR):
    for file in files:
        local_file_path = os.path.join(root, file)
        s3_file_path = os.path.join(S3_PATH, local_file_path[len(OPTIMIZED_MODEL_DIR)+1:])
        s3_client.upload_file(local_file_path, s3_bucket_name, s3_file_path)
        print(f'Uploaded {local_file_path}')

print('Finished uploading results.')
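
S3 object keys always use `/` as the separator, while `os.path.join` follows the local platform's convention. On Linux the two coincide, so the cell above works as written. A platform-independent variant of the key computation might look like this sketch (`s3_key_for` is a hypothetical helper):

```python
import os
import posixpath


def s3_key_for(local_file_path, local_root, s3_prefix):
    """Build an S3 key from a local file path relative to local_root."""
    relative = os.path.relpath(local_file_path, local_root)
    # Split on the local separator and rejoin with "/" for S3
    parts = relative.split(os.sep)
    return posixpath.join(s3_prefix, *parts)


local_file = os.path.join("base_model-W4A16", "config.json")
print(s3_key_for(local_file, "base_model-W4A16", "granite-int4"))
```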