# **Model Quantization for Pretrained Large Language Models (LLMs)**

## **1. Introduction to Model Quantization**

**Model quantization** is a technique used to reduce the memory footprint and computational cost of machine learning models by representing parameters with fewer bits. For example, instead of using 32-bit floating point numbers (FP32), weights can be stored as 8-bit integers (INT8) or even lower. This has proven essential for:

* **Deploying models to resource-constrained environments** (e.g., mobile, edge devices)
* **Speeding up inference** on CPUs or GPUs with lower precision support
* **Reducing energy usage and costs in data centers**
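To make "representing parameters with fewer bits" concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy. The function names and the random weight matrix are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization of a float array to INT8."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # one float scale per tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map INT8 codes back to approximate float values."""
    return q.astype(np.float32) * scale

# Hypothetical weight matrix standing in for one layer of a model
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)

q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = float(np.max(np.abs(weights - restored)))

print("bytes in FP32:", weights.nbytes)
print("bytes in INT8:", q_weights.nbytes)
print("max abs error:", max_error)
```

The INT8 copy takes a quarter of the FP32 storage, and the rounding error per weight is bounded by half the scale.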

### Key Quantization Types:

| Type                                   | Description                                                                                 |
| -------------------------------------- | ------------------------------------------------------------------------------------------- |
| **Post-Training Quantization (PTQ)**   | Quantize a trained model without retraining it. Fast and easy, but may degrade accuracy.    |
| **Quantization-Aware Training (QAT)**  | Simulates quantization during training, leading to better accuracy but requires retraining. |
| **Dynamic Quantization**               | Quantizes weights ahead of time and activations on the fly at inference; needs no calibration data. Suitable for RNNs and LLMs. |
| **Weight-only Quantization**           | Only the model weights are quantized; activations remain in higher precision.               |
| **Activation and Weight Quantization** | Quantizes both, requiring careful calibration or training.                                  |
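The difference between static and dynamic handling of activations is easier to see in code. The following is a simplified NumPy simulation of a dynamically quantized linear layer, where the class and helper names are hypothetical: weights are quantized once, while the activation scale is recomputed from each input at inference time.

```python
import numpy as np

def symmetric_quantize(x, num_bits=8):
    """Round x to signed integers and return the codes with their scale."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

class DynamicQuantLinear:
    """Illustrative linear layer with offline weight and runtime activation quantization."""

    def __init__(self, weight):
        # Weights are quantized once, ahead of time
        self.q_weight, self.w_scale = symmetric_quantize(weight)

    def __call__(self, x):
        # The activation scale is derived from the current input, so no calibration set is needed
        q_x, x_scale = symmetric_quantize(x)
        acc = q_x @ self.q_weight.T  # integer matmul with an int32 accumulator
        return acc.astype(np.float32) * (x_scale * self.w_scale)

rng = np.random.default_rng(1)
W = rng.normal(0, 0.05, size=(16, 32)).astype(np.float32)
x = rng.normal(0, 1.0, size=(4, 32)).astype(np.float32)

layer = DynamicQuantLinear(W)
y_ref = x @ W.T
y_q = layer(x)
rel_err = float(np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref))
print(f"relative error vs FP32: {rel_err:.4f}")
```

Production kernels use per-channel or per-token scales and fused integer GEMMs; this sketch only shows where each scale comes from.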

---

## **2. Quantization for Pretrained LLMs**

### Why Quantize LLMs?

Large Language Models such as GPT, LLaMA, Falcon, and Mistral typically require:

* **Billions of parameters**
* **Multiple GBs of memory** (e.g., a 30B-parameter model needs about 60 GB in FP16 for the weights alone)
* **Powerful accelerators (GPUs/TPUs)**

Quantization reduces this burden, enabling:

* Deployment on **consumer-grade GPUs** (e.g., 8–16 GB VRAM)
* Running models on **CPUs or edge devices**
* Faster inference with **minimal accuracy loss**
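A rough estimate of weight memory follows directly from parameter count and bit width. The helper below is a back-of-the-envelope sketch (the function name is hypothetical) that counts weights only and ignores activations, the KV cache, and the small overhead of quantization scales.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Weight footprint of a 30B-parameter model at several precisions
for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name:>9}: {weight_memory_gb(30e9, bits):6.1f} GB")
```

Actual memory use at inference time is higher than these figures, so leave headroom when matching a model to a GPU.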

---

## **3. Best Practices in Quantizing LLMs**

### ✅ Choose the Right Quantization Format:

* **INT8**: Best tradeoff between performance and accuracy (common in production).
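* **INT4 (e.g., GPTQ)**: Smaller size, still competitive accuracy; ideal for inference.
* **NF4 (4-bit NormalFloat)**: 4-bit data type designed for normally distributed weights; introduced with QLoRA for fine-tuning on top of a quantized base model.
* **FP8 or bfloat16**: Reduced-precision floating-point formats. bfloat16 is widely supported on NVIDIA and Google hardware; FP8 is newer and limited to recent accelerators.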
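The accuracy cost of each bit width can be estimated with a quick experiment. This sketch applies plain round-to-nearest quantization with one scale per group of 64 weights and compares the reconstruction error at 8, 4, and 3 bits. It is not GPTQ or NF4, both of which use more refined schemes, and the group size and random weights are assumptions chosen for illustration.

```python
import numpy as np

def quantize_groupwise(w, num_bits, group_size=64):
    """Symmetric round-to-nearest quantization with one scale per weight group."""
    qmax = 2 ** (num_bits - 1) - 1
    groups = w.reshape(-1, group_size)
    scales = np.max(np.abs(groups), axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)  # avoid division by zero
    q = np.clip(np.round(groups / scales), -qmax, qmax)
    return (q * scales).reshape(w.shape)  # dequantized approximation

rng = np.random.default_rng(2)
w = rng.normal(0, 0.02, size=(128, 128)).astype(np.float32)

errors = {}
for bits in (8, 4, 3):
    w_hat = quantize_groupwise(w, bits)
    errors[bits] = float(np.sqrt(np.mean((w - w_hat) ** 2)))
    print(f"INT{bits}: RMS error = {errors[bits]:.6f}")
```

Each bit removed roughly doubles the rounding error, which is why formats below 4 bits usually need error-compensating methods rather than naive rounding.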

### ✅ Use Purpose-Built Libraries:

| Tool / Library                         | Description                                                          |
| -------------------------------------- | -------------------------------------------------------------------- |
| **GPTQ**                               | Fast post-training quantization, optimized for LLMs (INT4/INT3).     |
| **bitsandbytes**                       | Offers 8-bit optimizers and quantized linear layers (used in QLoRA). |
| **LLM.int8()**                         | INT8 method implemented in bitsandbytes and integrated into HuggingFace Transformers. |
| **Intel Neural Compressor / OpenVINO** | INT8 quantization and optimization for CPUs.                         |
| **NVIDIA TensorRT-LLM**                | For highly optimized inference on NVIDIA GPUs.                       |

### ✅ Calibrate Properly:

* For PTQ, use a small representative **calibration dataset** (100–1,000 examples).
* Choose diverse prompts if you're deploying for general-purpose inference.
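Calibration boils down to observing activation ranges on sample data and choosing a scale from them. The sketch below, using synthetic activations and a hypothetical helper name, contrasts a min-max scale with a percentile-clipped scale, which is one common way to keep a single outlier from wasting most of the integer range.

```python
import numpy as np

def calibrate_scale(activation_batches, percentile=99.9, num_bits=8):
    """Derive a symmetric activation scale from calibration batches."""
    qmax = 2 ** (num_bits - 1) - 1
    magnitudes = np.abs(np.concatenate([b.ravel() for b in activation_batches]))
    clip_value = float(np.percentile(magnitudes, percentile))
    return clip_value / qmax

rng = np.random.default_rng(3)
# Synthetic stand-in for activations recorded while running calibration prompts
batches = [rng.normal(0, 1.0, size=(8, 128)) for _ in range(16)]
batches[0][0, 0] = 60.0  # a single large outlier

scale_minmax = calibrate_scale(batches, percentile=100.0)
scale_clipped = calibrate_scale(batches, percentile=99.9)
print(f"min-max scale:    {scale_minmax:.4f}")
print(f"percentile scale: {scale_clipped:.4f}")
```

A smaller scale gives finer resolution for typical values at the cost of clipping rare extremes; which trade-off is better depends on the model and should be checked on held-out data.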

### ✅ Quantize Layers Selectively:

* Some layers (e.g., attention heads, embeddings) are more sensitive.
* Use mixed precision: keep sensitive layers in FP16 or FP32.
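One simple way to pick which layers stay in higher precision is to rank them by how much quantization distorts their weights. The sketch below uses weight reconstruction error as a cheap proxy for sensitivity; in practice, sensitivity is better measured on task metrics. The layer names, shapes, and the `plan_mixed_precision` helper are all hypothetical.

```python
import numpy as np

def relative_quant_error(w, num_bits=4):
    """Relative error of per-tensor round-to-nearest quantization."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    w_hat = np.clip(np.round(w / scale), -qmax, qmax) * scale
    return float(np.linalg.norm(w - w_hat) / np.linalg.norm(w))

def plan_mixed_precision(layers, keep_fraction=0.25, num_bits=4):
    """Keep the most error-prone layers in FP16 and quantize the rest."""
    errors = {name: relative_quant_error(w, num_bits) for name, w in layers.items()}
    num_keep = max(1, int(round(keep_fraction * len(layers))))
    sensitive = sorted(errors, key=errors.get, reverse=True)[:num_keep]
    plan = {name: "fp16" if name in sensitive else f"int{num_bits}" for name in layers}
    return plan, errors

rng = np.random.default_rng(4)
layers = {
    "embedding": rng.normal(0, 0.02, size=(64, 64)),
    "attn.q_proj": rng.normal(0, 0.02, size=(64, 64)),
    "attn.k_proj": rng.normal(0, 0.02, size=(64, 64)),
    "mlp.up_proj": rng.normal(0, 0.02, size=(64, 64)),
}
layers["embedding"][0, 0] = 1.0  # an outlier makes this layer hard to quantize

plan, errors = plan_mixed_precision(layers)
for name in layers:
    print(f"{name:12s} error={errors[name]:.3f} -> {plan[name]}")
```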

### ✅ Evaluate Thoroughly:

* Run **perplexity**, **BLEU/ROUGE**, or **task-specific accuracy** before/after quantization.
* Evaluate latency, memory, and throughput.
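Perplexity is the exponential of the negative mean per-token log-probability, so comparing it before and after quantization needs only the log-probabilities each model assigns to the same text. A minimal sketch, with made-up numbers in place of real model outputs:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-mean log-probability), using natural logarithms."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical per-token log-probabilities for the same text under two models
logp_fp16 = [-2.1, -0.4, -1.3, -0.9, -3.0]
logp_int4 = [-2.2, -0.5, -1.3, -1.0, -3.2]

ppl_fp16 = perplexity(logp_fp16)
ppl_int4 = perplexity(logp_int4)
increase = (ppl_int4 - ppl_fp16) / ppl_fp16 * 100

print(f"FP16 perplexity: {ppl_fp16:.3f}")
print(f"INT4 perplexity: {ppl_int4:.3f}")
print(f"relative increase: {increase:.1f}%")
```

Use the same tokenizer and evaluation text for both models; otherwise the two perplexities are not comparable.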

---

## **4. Key Concerns and Challenges**

### ❗ Accuracy Degradation:

* Aggressive quantization (e.g., INT3 or below) may lead to performance drops, especially on reasoning tasks.

### ❗ Hardware Compatibility:

* Some quantized formats require special hardware (e.g., INT4 may not run efficiently on older GPUs or CPUs).

### ❗ Lack of Fine-Tuning:

* PTQ doesn't retrain the model. Some use cases may need QAT or fine-tuning with QLoRA.

### ❗ Precision Accumulation:

* Integer math can accumulate errors over layers; quantization-aware models account for this.
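How error builds up with depth can be explored with a toy experiment. The sketch below pushes the same input through a stack of random ReLU layers twice, once with full-precision weights and once with weights rounded to a given bit width, and records the relative difference after each layer. It simulates weight rounding only, not true integer arithmetic, and the network itself is a synthetic assumption.

```python
import numpy as np

def fake_quantize(w, num_bits):
    """Round weights to a signed integer grid and map them back to floats."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def relative_error_by_depth(num_bits, depth=8, width=64, seed=0):
    """Relative output error after each layer of a random ReLU network."""
    rng = np.random.default_rng(seed)
    x_ref = rng.normal(size=(32, width))
    x_q = x_ref.copy()
    errors = []
    for _ in range(depth):
        w = rng.normal(0, np.sqrt(2.0 / width), size=(width, width))
        x_ref = np.maximum(x_ref @ w, 0.0)                       # full-precision path
        x_q = np.maximum(x_q @ fake_quantize(w, num_bits), 0.0)  # quantized-weight path
        errors.append(float(np.linalg.norm(x_q - x_ref) / np.linalg.norm(x_ref)))
    return errors

err8 = relative_error_by_depth(8)
err4 = relative_error_by_depth(4)
for layer, (e8, e4) in enumerate(zip(err8, err4), start=1):
    print(f"layer {layer}: INT8 error={e8:.4f}  INT4 error={e4:.4f}")
```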

---

## **5. Success Stories**

### 🔹 **QLoRA (Quantized Low-Rank Adaptation)**

* Dettmers et al. (2023), integrated into the HuggingFace ecosystem
* Used NF4 quantization + LoRA for fine-tuning
* Enabled fine-tuning 65B models on a single 48GB GPU
* Comparable accuracy to full-precision fine-tuned models

### 🔹 **GPTQ**

* Open-source quantization tool
* Allows loading 30B-class models in 4-bit on a single 24 GB GPU, typically with only a small accuracy loss
* Hugely popular in the LLM community for deploying LLaMA, Falcon, etc.

### 🔹 **Intel's INT8 BERT Optimizations**

* Showed 2x–4x performance gain on CPUs using INT8 quantized BERT with <1% drop in accuracy
* Used in real-world document processing pipelines

---

## **6. How to Tell If a Model Is Quantized**

* **File size**: Quantized models are significantly smaller (e.g., 4-bit model is \~1/8 size of FP32).
* **Configuration files**: HuggingFace checkpoints record the scheme under `quantization_config` in `config.json` (GPTQ exports may also ship a `quantize_config.json`), or state the format in the model card.
* **Inference logs**: Quantized inference often uses custom kernels (e.g., `bitsandbytes`, `AutoGPTQ`, `TRT-LLM`) which appear in logs.
* **Model loading APIs**: Example:

```python
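from transformers import AutoModelForCausalLM

# GPTQ checkpoints also need `optimum` and a GPTQ kernel package such as `auto-gptq` installed
model = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ", device_map="auto")
# A quantized checkpoint exposes its scheme on the loaded config
print(getattr(model.config, "quantization_config", None))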
```
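The configuration check can also be done without loading any weights by reading the checkpoint's `config.json`. The sketch below assumes the file carries a `quantization_config` entry with a `quant_method` field, as HuggingFace Transformers writes for GPTQ and bitsandbytes checkpoints. The sample JSON is a hand-written stand-in, and other formats such as GGUF store this information elsewhere.

```python
import json

def describe_quantization(config: dict) -> str:
    """Summarize the quantization entry of a parsed config.json, if present."""
    qcfg = config.get("quantization_config")
    if not qcfg:
        return "no quantization_config entry"
    method = qcfg.get("quant_method", "unknown method")
    bits = qcfg.get("bits")
    if bits is None:  # bitsandbytes-style configs use boolean flags instead
        if qcfg.get("load_in_4bit"):
            bits = 4
        elif qcfg.get("load_in_8bit"):
            bits = 8
    return f"{method}, {bits}-bit" if bits else method

# Hand-written example resembling a GPTQ checkpoint's config.json
sample = json.loads(
    '{"model_type": "llama",'
    ' "quantization_config": {"quant_method": "gptq", "bits": 4, "group_size": 128}}'
)
print(describe_quantization(sample))
print(describe_quantization({"model_type": "llama"}))
```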

---

## **7. Do You Need a Dataset?**

* **Post-Training Quantization**: Needs only a small calibration set (100–1,000 diverse examples).
* **QAT or QLoRA**: Needs training data for fine-tuning; use domain-specific data if targeting a specific use case.
* **Inference-only**: Pre-quantized models can be used out-of-the-box.

### Dataset Tips:

* Include **representative prompts** for your use case
* Cover **edge cases**, if high reliability is needed
* Mix different **linguistic patterns** (instructions, QA, chat, documents)

---

## **8. Resource Requirements**

| Quantization Type                 | GPU RAM Needed                | CPU Friendly?             | Notes                             |
| --------------------------------- | ----------------------------- | ------------------------- | --------------------------------- |
| **FP32**                          | Very High (\~120 GB for 30B)  | No                        | Full accuracy                     |
| **FP16/BF16**                     | High (\~60 GB for 30B)        | No                        | Good for training                 |
| **INT8**                          | Moderate (\~30 GB for 30B)    | Yes (with AVX512 or VNNI) | Widely supported                  |
| **INT4 (GPTQ, QLoRA)**            | Low (\~15 GB for 30B)         | Partial                   | Best for low-end GPUs             |
| **QAT**                           | Moderate to high              | No                        | Requires retraining resources     |
| **bitsandbytes (8-bit training)** | 16 GB+                        | No                        | Efficient training with some loss |

---

## **9. Summary and Recommendations**

| Scenario                                 | Recommendation                                  |
| ---------------------------------------- | ----------------------------------------------- |
| Just want fast inference on LLMs         | Use GPTQ-quantized models (4-bit)               |
| Low-VRAM GPU but want to fine-tune       | Use QLoRA + NF4 (with bitsandbytes)             |
| Need CPU inference                       | Use INT8 models with OpenVINO or ONNX Runtime   |
| Max performance, no concern for accuracy | Use aggressive quantization (INT3/INT4)         |
| High accuracy required                   | Use mixed precision or QAT with selected layers |

---

## **10. Resources and Tools**

* 🛠 **Libraries**:

  * [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
  * [GPTQ](https://github.com/IST-DASLab/gptq)
  * [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)
  * [OpenVINO](https://github.com/openvinotoolkit/openvino)
  * [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)

* 📚 **Tutorials & Docs**:

  * HuggingFace: [Quantization Overview](https://huggingface.co/docs/transformers/perf_train_gpu_one#8-bit-quantization)
  * Intel: [Neural Compressor](https://github.com/intel/neural-compressor)
  * NVIDIA: [TensorRT-LLM Examples](https://github.com/NVIDIA/TensorRT-LLM)

* 💬 **Communities**:

  * [HuggingFace Forums](https://discuss.huggingface.co/)
  * [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/)
  * Discord servers for HuggingFace, BitsAndBytes, and GPTQ

---


# LAB 2: Quantizing a Pre-trained LLM using TensorFlow
### IMDB Sentiment Analysis

In [None]:
# !pip install tensorflow-datasets
# !pip install tensorflow-hub
# !pip install tensorflow-text

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import time
import os

# Ensure reproducibility
np.random.seed(42)
tf.random.set_seed(42)

print("TF version:", tf.__version__)

### === 1. Load and Preprocess IMDB Dataset ===

In [None]:
print("\nLoading IMDB dataset...")
(train_ds, test_ds), ds_info = tfds.load(
    'imdb_reviews',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True
)

##### Use a TextVectorization layer

In [None]:
VOCAB_SIZE = 10000
SEQUENCE_LENGTH = 256

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=SEQUENCE_LENGTH
)


##### Adapt vectorizer to training data

In [None]:
train_text = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(train_text)

def vectorize_text(text, label):
    return vectorize_layer(text), label

train_ds = train_ds.map(vectorize_text).cache().shuffle(10000).batch(32).prefetch(tf.data.AUTOTUNE)
test_ds = test_ds.map(vectorize_text).batch(32).cache().prefetch(tf.data.AUTOTUNE)


### === 2. Build a Small Transformer-Based Model ===

In [None]:
from tensorflow.keras import layers

print("\nBuilding model...")

embedding_dim = 64
num_heads = 2
dff = 64

inputs = layers.Input(shape=(SEQUENCE_LENGTH,))
x = layers.Embedding(VOCAB_SIZE, embedding_dim)(inputs)
x = layers.LayerNormalization()(x)
x = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dim)(x, x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dense(dff, activation='relu')(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.Model(inputs, outputs)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()


### === 3. Train Model ===

In [None]:
print("\nTraining model...")
model.fit(train_ds, validation_data=test_ds, epochs=3)

### === 4. Evaluate Original Model ===

In [None]:
print("\nEvaluating original model...")
loss, acc = model.evaluate(test_ds)
print(f"Original model accuracy: {acc:.4f}")


### === 5. Save Model ===

In [None]:
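saved_model_dir = "saved_model_imdb"
# On TF <= 2.15 this writes a SavedModel directory. With Keras 3 (TF >= 2.16),
# use model.export(saved_model_dir) instead, since save() requires a .keras or .h5 path.
model.save(saved_model_dir)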


### === 6. Convert to TFLite: Baseline (no quantization) ===

In [None]:
print("\nConverting to TFLite (no quantization)...")
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
tflite_model = converter.convert()

with open("model_fp32.tflite", "wb") as f:
    f.write(tflite_model)

print(f"Size of unquantized model: {os.path.getsize('model_fp32.tflite') / 1024:.2f} KB")

### === 7. Convert to TFLite: Dynamic Range Quantization ===

In [None]:
print("\nConverting to TFLite (dynamic range quantization)...")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_quant_model)

print(f"Size of quantized model: {os.path.getsize('model_quant.tflite') / 1024:.2f} KB")

### === 8. Evaluate TFLite Models ===

In [None]:
def evaluate_tflite_model(tflite_path):
    interpreter = tf.lite.Interpreter(model_path=tflite_path)
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Extract the index, dtype, and shape signature expected by the TFLite model
    input_index = input_details[0]['index']
    expected_dtype = input_details[0]['dtype']
    # 'shape_signature' marks dynamic dimensions with -1 ('shape' reports them as 1)
    shape_signature = input_details[0]['shape_signature']

    # Get a single batch of data (accuracy below is measured on this batch only)
    test_sample = next(iter(test_ds))
    x_sample, y_sample = test_sample
    x_sample = x_sample.numpy().astype(expected_dtype)
    y_sample = y_sample.numpy()

    # If the model has a dynamic dimension, fix the input to one sample up front
    if -1 in shape_signature:
        interpreter.resize_tensor_input(input_index, [1, x_sample.shape[1]])
        interpreter.allocate_tensors()

    correct = 0
    total = len(x_sample)

    start = time.time()
    for i in range(total):
        input_data = x_sample[i:i+1]

        interpreter.set_tensor(input_index, input_data)
        interpreter.invoke()
        output = interpreter.get_tensor(output_details[0]['index'])
        pred = (output[0][0] > 0.5).astype(int)
        correct += int(pred == y_sample[i])
    end = time.time()

    acc = correct / total
    latency = (end - start) / total * 1000  # ms/sample
    return acc, latency

# === Run evaluation ===
print("\nEvaluating TFLite models...")
acc_fp32, latency_fp32 = evaluate_tflite_model("model_fp32.tflite")
acc_quant, latency_quant = evaluate_tflite_model("model_quant.tflite")

print(f"\nFP32 Model - Accuracy: {acc_fp32:.4f}, Latency: {latency_fp32:.2f} ms/sample")
print(f"Quantized Model - Accuracy: {acc_quant:.4f}, Latency: {latency_quant:.2f} ms/sample")
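Single-pass timings like the ones above can be noisy, because the first few calls often include one-time setup costs. A more robust pattern is to run a few warm-up iterations and report the median over many runs. The helper below is a generic sketch with a hypothetical name; the stand-in workload would be replaced by a function that calls `interpreter.invoke()` on a prepared input.

```python
import statistics
import time

def benchmark(fn, warmup=5, runs=50):
    """Time a zero-argument callable and return latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "min_ms": min(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "max_ms": max(samples_ms),
    }

# Stand-in workload; in the lab this would wrap a single TFLite inference call
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print({name: round(value, 3) for name, value in stats.items()})
```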


In [None]:
# Quantizing a Pretrained BERT Model (Minimal Size, TensorFlow Only)

import tensorflow as tf
import numpy as np
import time
import os
from transformers import AutoTokenizer, TFAutoModel
import tensorflow_datasets as tfds


In [None]:
# === 1. Load Tokenizer and Smallest BERT ===
# We'll use Google's smallest BERT miniature (2 layers, hidden size 128, 2 attention heads)
MODEL_NAME = "google/bert_uncased_L-2_H-128_A-2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
bert_model = TFAutoModel.from_pretrained(MODEL_NAME, from_pt=False)

In [None]:
# === 2. Prepare IMDB Dataset (tokenized) ===
def encode(text, label):
    # Runs eagerly inside tf.py_function, so `text` is an EagerTensor with .numpy()
    tokens = tokenizer(
        text.numpy().decode('utf-8'),
        truncation=True,
        padding='max_length',
        max_length=128,
        return_tensors='np'  # Use NumPy output for better `tf.py_function` compatibility
    )
    return (
        tf.convert_to_tensor(tokens['input_ids'][0], dtype=tf.int32),
        tf.convert_to_tensor(tokens['attention_mask'][0], dtype=tf.int32),
        tf.cast(label, tf.float32)  # Cast label to float for binary cross-entropy
    )

def encode_map(text, label):
    # Wrap the Python tokenizer so it can be used inside Dataset.map
    input_ids, attention_mask, label = tf.py_function(
        encode, inp=[text, label], Tout=[tf.int32, tf.int32, tf.float32]
    )
    # py_function drops static shapes; restore them so Keras can build the model
    input_ids.set_shape([128])
    attention_mask.set_shape([128])
    label.set_shape([])
    return {"input_ids": input_ids, "attention_mask": attention_mask}, label


(train_raw, test_raw), ds_info = tfds.load(
    'imdb_reviews',
    split=['train', 'test'],
    as_supervised=True,
    with_info=True
)

train_ds = train_raw.map(encode_map).batch(32).prefetch(tf.data.AUTOTUNE)
test_ds = test_raw.map(encode_map).batch(32).prefetch(tf.data.AUTOTUNE)

In [None]:
# === 3. Build Classifier on Top of Frozen Tiny BERT ===
bert_model.trainable = False

input_ids = tf.keras.Input(shape=(128,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(128,), dtype=tf.int32, name="attention_mask")

outputs = bert_model(input_ids=input_ids, attention_mask=attention_mask)[1]  # pooled output
x = tf.keras.layers.Dropout(0.2)(outputs)
x = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.Model(inputs={"input_ids": input_ids, "attention_mask": attention_mask}, outputs=x)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

In [None]:
# === 4. Train Classifier Head Briefly ===
model.fit(train_ds, validation_data=test_ds, epochs=2)

In [None]:
# === 5. Save as SavedModel ===
model_dir = "bert_tiny_classifier"
# With Keras 3 (TF >= 2.16), use model.export(model_dir) to write a SavedModel instead
model.save(model_dir)

In [None]:
# === 6. Convert to TFLite with Dynamic Range Quantization ===
converter = tf.lite.TFLiteConverter.from_saved_model(model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("bert_tiny_quant.tflite", "wb") as f:
    f.write(tflite_model)

print(f"Quantized model size: {os.path.getsize('bert_tiny_quant.tflite') / 1024:.2f} KB")

In [None]:
# === 7. Evaluate Quantized Model (CPU) ===
def evaluate_tflite_model(tflite_path):
    interpreter = tf.lite.Interpreter(model_path=tflite_path)
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    input_idx = {d['name']: d['index'] for d in input_details}
    output_idx = output_details[0]['index']

    correct = 0
    total = 0
    elapsed = 0.0  # time spent in the interpreter only, excluding tokenization

    for batch in test_ds.take(10):
        x_batch, y_batch = batch
        for i in range(len(y_batch)):
            # Tensor names depend on the export; print input_details if these keys are missing
            input_data = {
                input_idx['serving_default_input_ids:0']: x_batch['input_ids'][i:i+1].numpy(),
                input_idx['serving_default_attention_mask:0']: x_batch['attention_mask'][i:i+1].numpy()
            }
            start = time.time()
            for key, val in input_data.items():
                interpreter.set_tensor(key, val)
            interpreter.invoke()
            output = interpreter.get_tensor(output_idx)
            elapsed += time.time() - start
            pred = (output[0][0] > 0.5).astype(int)
            correct += int(pred == y_batch[i].numpy())
            total += 1

    acc = correct / total
    latency = elapsed / total * 1000  # ms/sample
    print(f"\nQuantized BERT-Tiny Accuracy: {acc:.4f}, Latency: {latency:.2f} ms/sample")

evaluate_tflite_model("bert_tiny_quant.tflite")
