# **Model Quantization for Pretrained Large Language Models (LLMs)**

## **1. Introduction to Model Quantization**

**Model quantization** is a technique used to reduce the memory footprint and computational cost of machine learning models by representing parameters with fewer bits. For example, instead of using 32-bit floating point numbers (FP32), weights can be stored as 8-bit integers (INT8) or even lower. This has proven essential for:

* **Deploying models to resource-constrained environments** (e.g., mobile, edge devices)
* **Speeding up inference** on CPUs or GPUs with lower precision support
* **Reducing energy usage and costs in data centers**

### Key Quantization Types:

| Type                                   | Description                                                                                 |
| -------------------------------------- | ------------------------------------------------------------------------------------------- |
| **Post-Training Quantization (PTQ)**   | Quantize a trained model without retraining it. Fast and easy, but may degrade accuracy.    |
| **Quantization-Aware Training (QAT)**  | Simulates quantization during training; usually more accurate than PTQ, but requires retraining. |
| **Dynamic Quantization**               | Weights are quantized ahead of time; activations are quantized on the fly at inference. Often used for RNNs and Transformer models. |
| **Weight-only Quantization**           | Only the model weights are quantized; activations remain in higher precision.               |
| **Activation and Weight Quantization** | Quantizes both, requiring careful calibration or training.                                  |
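To make "representing parameters with fewer bits" concrete, here is a minimal sketch of affine (asymmetric) INT8 quantization in plain Python. The helper names `quantize_int8` and `dequantize` are illustrative, not part of any library, and real toolkits typically work per channel or per group rather than on one flat list.

```python
def quantize_int8(values):
    """Affine (asymmetric) quantization of a list of floats to signed 8-bit integers."""
    qmin, qmax = -128, 127
    # Extend the range to include zero so that 0.0 maps exactly onto an integer code.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero input
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map integer codes back to approximate float values."""
    return [(qi - zero_point) * scale for qi in q]


weights = [-0.62, -0.10, 0.0, 0.27, 0.91]
q, scale, zp = quantize_int8(weights)
restored = dequantize(q, scale, zp)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("codes:", q)
print("scale:", scale, "zero point:", zp, "max abs error:", max_error)
```

Each value is stored as one byte plus a shared `scale` and `zero_point`; the rounding error per value stays within about one quantization step.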

---

## **2. Quantization for Pretrained LLMs**

### Why Quantize LLMs?

Large Language Models such as GPT, LLaMA, Falcon, and Mistral typically require:

* **Billions of parameters**
* **Tens of GBs of memory** (e.g., a 30B model needs about 60 GB for its weights alone in FP16)
* **Powerful accelerators (GPUs/TPUs)**

Quantization reduces this burden, enabling:

* Deployment on **consumer-grade GPUs** (e.g., 8–16 GB VRAM)
* Running models on **CPUs or edge devices**
* Faster inference with **minimal accuracy loss**

---

## **3. Best Practices in Quantizing LLMs**

### ✅ Choose the Right Quantization Format:

* **INT8**: Best tradeoff between performance and accuracy (common in production).
* **INT4 (e.g., GPTQ)**: Smaller size, still competitive accuracy; ideal for inference.
* **NF4 (4-bit NormalFloat)**: 4-bit data type introduced with QLoRA, used to store the frozen base weights during fine-tuning.
* **FP8 or bfloat16**: FP8 is an emerging 8-bit floating-point format on recent NVIDIA GPUs; bfloat16 is a 16-bit format (NVIDIA GPUs, Google TPUs) commonly used as the higher-precision compute type alongside quantized weights.

### ✅ Use Purpose-Built Libraries:

| Tool / Library                         | Description                                                          |
| -------------------------------------- | -------------------------------------------------------------------- |
| **GPTQ**                               | Fast post-training quantization, optimized for LLMs (INT4/INT3).     |
| **bitsandbytes**                       | Offers 8-bit optimizers and quantized linear layers (used in QLoRA). |
| **LLM.int8()**                         | 8-bit inference method implemented in bitsandbytes and integrated into HuggingFace Transformers. |
| **Intel Neural Compressor / OpenVINO** | INT8 quantization and optimization for CPUs.                         |
| **NVIDIA TensorRT-LLM**                | For highly optimized inference on NVIDIA GPUs.                       |

### ✅ Calibrate Properly:

* For PTQ, use a small representative **calibration dataset** (100–1,000 examples).
* Choose diverse prompts if you’re deploying for general-purpose inference.
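The role of the calibration set can be sketched as follows: activations recorded on calibration prompts determine the range that gets mapped onto the integer grid. This toy example, which uses synthetic Gaussian values as a stand-in for real activations, compares a plain min/max range with a percentile-clipped range. The function names are hypothetical and the 99.9% threshold is an arbitrary choice for illustration.

```python
import random


def minmax_range(samples):
    """Use the full observed range (sensitive to outliers)."""
    return min(samples), max(samples)


def percentile_range(samples, pct=99.9):
    """Clip the range to the central `pct` percent of observed values."""
    ordered = sorted(samples)
    tail = (100.0 - pct) / 200.0  # fraction trimmed from each side
    k = int(len(ordered) * tail)
    return ordered[k], ordered[len(ordered) - 1 - k]


random.seed(0)
# Stand-in for activations recorded while running the calibration prompts.
activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]
activations.append(40.0)  # a single extreme outlier

lo_mm, hi_mm = minmax_range(activations)
lo_p, hi_p = percentile_range(activations)
scale_minmax = (hi_mm - lo_mm) / 255
scale_pct = (hi_p - lo_p) / 255
print(f"min/max scale:    {scale_minmax:.4f}")
print(f"percentile scale: {scale_pct:.4f}")
```

A smaller scale means finer resolution for typical values, at the cost of clipping rare outliers; which trade-off is better depends on the model and should be checked empirically.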

### ✅ Quantize Layers Selectively:

* Some layers (e.g., attention heads, embeddings) are more sensitive.
* Use mixed precision: keep sensitive layers in FP16 or FP32.
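One simple way to decide which layers to leave in higher precision is to rank them by how much error quantization introduces. The sketch below uses weight reconstruction error as a rough proxy for sensitivity; the layer names and weights are made up, and in practice sensitivity is better measured on a task metric or on layer outputs.

```python
import random


def fake_quantize(weights, bits):
    """Symmetric per-tensor quantize-then-dequantize (simulated quantization)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def relative_error(weights, bits):
    """Root of (squared reconstruction error / squared weight norm)."""
    approx = fake_quantize(weights, bits)
    num = sum((w - a) ** 2 for w, a in zip(weights, approx))
    den = sum(w ** 2 for w in weights)
    return (num / den) ** 0.5


random.seed(0)
# Hypothetical layers: "embed" contains two large outliers, which stretches
# the per-tensor scale and wipes out most of its small weights at 4 bits.
layers = {
    "embed": [random.gauss(0, 0.02) for _ in range(2000)] + [1.5, -1.2],
    "attn.q_proj": [random.gauss(0, 0.05) for _ in range(2000)],
    "mlp.up_proj": [random.gauss(0, 0.05) for _ in range(2000)],
}

errors = {name: relative_error(w, bits=4) for name, w in layers.items()}
keep_fp16 = max(errors, key=errors.get)  # most sensitive layer under this proxy
plan = {name: ("fp16" if name == keep_fp16 else "int4") for name in layers}
print(errors)
print(plan)
```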

### ✅ Evaluate Thoroughly:

* Run **perplexity**, **BLEU/ROUGE**, or **task-specific accuracy** before/after quantization.
* Evaluate latency, memory, and throughput.
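Perplexity is straightforward to compare before and after quantization once per-token log-probabilities are available. The sketch below assumes natural-log probabilities of the reference tokens on the same evaluation text; the numbers are invented for illustration and are not measurements.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-mean log-probability of the reference tokens)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical per-token log-probabilities from the two models.
baseline_log_probs = [-1.9, -0.4, -2.3, -0.8, -1.1]
quantized_log_probs = [-2.0, -0.5, -2.3, -0.9, -1.2]

ppl_base = perplexity(baseline_log_probs)
ppl_quant = perplexity(quantized_log_probs)
relative_increase = (ppl_quant - ppl_base) / ppl_base
print(f"baseline: {ppl_base:.3f}, quantized: {ppl_quant:.3f}, "
      f"increase: {relative_increase:.1%}")
```

Lower is better; a small relative increase suggests the quantized model assigns nearly the same probabilities as the original.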

---

## **4. Key Concerns and Challenges**

### ❗ Accuracy Degradation:

* Aggressive quantization (e.g., INT3 or below) may lead to performance drops, especially on reasoning tasks.

### ❗ Hardware Compatibility:

* Some quantized formats require special hardware (e.g., INT4 may not run efficiently on older GPUs or CPUs).

### ❗ Lack of Fine-Tuning:

* PTQ doesn't retrain the model. Some use cases may need QAT or fine-tuning with QLoRA.

### ❗ Precision Accumulation:

* Rounding error introduced in one layer propagates to the next and can compound with depth; quantization-aware training lets the model adapt to this error.

---

## **5. Success Stories**

### 🔹 **QLoRA (Quantized Low-Rank Adaptation)**

* Introduced by Tim Dettmers et al. (2023); available through HuggingFace Transformers and PEFT with bitsandbytes
* Used NF4 quantization + LoRA for fine-tuning
* Enabled fine-tuning 65B models on a single 48GB GPU
* Comparable accuracy to full-precision fine-tuned models

### 🔹 **GPTQ**

* Open-source quantization tool
* At 4 bits, a 30B model needs roughly 15 GB for its weights (activations and the KV cache need additional memory), typically with only a small accuracy loss
* Hugely popular in the LLM community for deploying LLaMA, Falcon, etc.

### 🔹 **Intel’s INT8 BERT Optimizations**

* Showed 2x–4x performance gain on CPUs using INT8 quantized BERT with <1% drop in accuracy
* Used in real-world document processing pipelines

---

## **6. How to Tell If a Model Is Quantized**

* **File size**: Quantized models are significantly smaller (e.g., 4-bit model is \~1/8 size of FP32).
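* **Configuration files**: HuggingFace checkpoints typically record a `quantization_config` entry in `config.json` (GPTQ repositories may also ship a `quantize_config.json`), or state the quantization format in the model card.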
* **Inference logs**: Quantized inference often uses custom kernels (e.g., `bitsandbytes`, `AutoGPTQ`, `TRT-LLM`) which appear in logs.
* **Model loading APIs**: Example:

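```python
from transformers import AutoModelForCausalLM

# Loading a GPTQ checkpoint requires a GPTQ backend to be installed alongside transformers
model = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ", device_map="auto")

# A quantized checkpoint exposes its settings on the model config; prints None otherwise
print(getattr(model.config, "quantization_config", None))
```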
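The configuration check can also be done without loading any weights. The following sketch assumes the settings live under a `quantization_config` key in the checkpoint's `config.json` and that the entry carries `quant_method` and, for some methods, `bits`. The helper names and the demo config values are illustrative only.

```python
import json
import tempfile
from pathlib import Path


def describe_quantization(model_dir):
    """Return a short description of the quantization settings in config.json, or None."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    qcfg = config.get("quantization_config")
    if qcfg is None:
        return None
    method = qcfg.get("quant_method", "unknown")
    bits = qcfg.get("bits")
    return f"{method}, {bits}-bit" if bits is not None else method


def demo(config):
    """Write a hand-made config.json to a temporary folder and inspect it."""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "config.json").write_text(json.dumps(config))
        return describe_quantization(tmp)


quantized = demo({"model_type": "llama",
                  "quantization_config": {"quant_method": "gptq", "bits": 4}})
plain = demo({"model_type": "llama"})
print(quantized)
print(plain)
```

For a locally downloaded checkpoint, `describe_quantization` would be pointed at the folder that contains `config.json`.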

---

## **7. Do You Need a Dataset?**

* **Post-Training Quantization**: Needs only a small calibration set (100–1000 diverse examples).
* **QAT or QLoRA**: Needs training data for fine-tuning; use domain-specific data if targeting a specific use case.
* **Inference-only**: Pre-quantized models can be used out-of-the-box.

### Dataset Tips:

* Include **representative prompts** for your use case
* Cover **edge cases**, if high reliability is needed
* Mix different **linguistic patterns** (instructions, QA, chat, documents)

---

## **8. Resource Requirements**
The figures below use a 30B-parameter model as the example.

| Quantization Type                 | GPU RAM Needed (weights only)  | CPU Friendly?             | Notes                                              |
| --------------------------------- | ------------------------------ | ------------------------- | -------------------------------------------------- |
| **FP32**                          | Very high (\~120 GB for 30B)   | No                        | Full accuracy                                      |
| **FP16/BF16**                     | High (\~60 GB for 30B)         | No                        | Good for training                                  |
| **INT8**                          | Moderate (\~30 GB for 30B)     | Yes (with AVX512 or VNNI) | Widely supported                                   |
| **INT4 (GPTQ, QLoRA/NF4)**        | Low (\~15 GB for 30B)          | Partial                   | Best for low-memory GPUs                           |
| **QAT**                           | Moderate to high               | No                        | Requires retraining resources                      |
| **bitsandbytes (8-bit training)** | Varies with model size         | No                        | 8-bit optimizers shrink optimizer-state memory     |
---

## **9. Summary and Recommendations**

| Scenario                                 | Recommendation                                  |
| ---------------------------------------- | ----------------------------------------------- |
| Just want fast inference on LLMs         | Use GPTQ-quantized models (4-bit)               |
| Low-VRAM GPU but want to fine-tune       | Use QLoRA + NF4 (with bitsandbytes)             |
| Need CPU inference                       | Use INT8 models with OpenVINO or ONNX Runtime   |
| Max performance, no concern for accuracy | Use aggressive quantization (INT3/INT4)         |
| High accuracy required                   | Use mixed precision or QAT with selected layers |

---

## **10. Resources and Tools**

* **Libraries**:

  * [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
  * [GPTQ](https://github.com/IST-DASLab/gptq)
  * [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)
  * [OpenVINO](https://github.com/openvinotoolkit/openvino)
  * [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)

* **Tutorials & Docs**:

  * HuggingFace: [Quantization Overview](https://huggingface.co/docs/transformers/perf_train_gpu_one#8-bit-quantization)
  * Intel: [Neural Compressor](https://github.com/intel/neural-compressor)
  * NVIDIA: [TensorRT-LLM Examples](https://github.com/NVIDIA/TensorRT-LLM)

* **Communities**:

  * [HuggingFace Forums](https://discuss.huggingface.co/)
  * [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/)
  * Discord servers for HuggingFace, BitsAndBytes, and GPTQ

---


# LAB 2: Quantizing a Pre-trained LLM using TensorFlow

### IMDB Sentiment Analysis

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import time
import os

# Ensure reproducibility
np.random.seed(42)
tf.random.set_seed(42)

print("TF version:", tf.__version__)

### Load and Preprocess IMDB Dataset

> - The IMDB dataset (Internet Movie Database) is a popular benchmark dataset for binary sentiment classification.
> - Contains 50,000 movie reviews from IMDb.
> - Each review is a raw English sentence or paragraph.
> - Each review is labeled: 0 = negative, 1 = positive
> - The IMDB dataset is commonly used for training and testing text classification models, especially for sentiment analysis tasks.
> - Dataset Breakdown: train: 25,000 labeled reviews, test: 25,000 labeled reviews


#### In this section:
- Load and tokenize the dataset
- Vectorize the data by converting it to a numerical array
- Use vectorized text as feature set for training a model
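As a rough mental model of the vectorization step, the plain-Python sketch below builds a frequency-ranked vocabulary and maps text to fixed-length integer sequences, with id 0 for padding and id 1 for out-of-vocabulary tokens. It is a simplification: the Keras layer used in the lab also strips punctuation by default, and the helper names here are hypothetical.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Assign ids to the most frequent tokens; 0 = padding, 1 = out-of-vocabulary."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}


def vectorize(text, vocab, sequence_length):
    """Map tokens to ids, truncate to sequence_length, and pad with zeros."""
    ids = [vocab.get(tok, 1) for tok in text.lower().split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


corpus = ["a great movie", "a terrible movie", "great acting"]
vocab = build_vocab(corpus, max_tokens=6)
encoded = vectorize("a great film", vocab, sequence_length=5)
print(vocab)
print(encoded)  # "film" is not in the vocabulary, so it maps to id 1
```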



In [None]:
print("\nLoading IMDB dataset...")
(train_ds, test_ds), ds_info = tfds.load(
    'imdb_reviews',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True
)

##### Use a TextVectorization layer

In [None]:
VOCAB_SIZE = 10000
SEQUENCE_LENGTH = 256

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=SEQUENCE_LENGTH
)


##### Adapt vectorizer to training data

In [None]:
train_text = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(train_text)

def vectorize_text(text, label):
    return vectorize_layer(text), label

train_ds = train_ds.map(vectorize_text).cache().shuffle(10000).batch(32).prefetch(tf.data.AUTOTUNE)
test_ds = test_ds.map(vectorize_text).batch(32).cache().prefetch(tf.data.AUTOTUNE)


### Build a Small Transformer-Based Model

- We will build a custom simple Transformer-based binary text classifier
- Model Overview:
    - Input: A sequence of token integers (e.g., tokenized text, length = SEQUENCE_LENGTH).
    - Embedding Layer: Maps each token to a 64-dimensional vector.
    - Layer Normalization: Normalizes the embeddings to stabilize training.
    - Multi-Head Attention: Lets the model focus on different parts of the sequence when processing each word (2 attention heads).
    - Global Average Pooling: Collapses the sequence into a single vector by averaging token representations.
    - Dense Layer: Applies a fully connected layer with ReLU to capture nonlinear patterns.
    - Dropout: Randomly zeroes some values to prevent overfitting.
    - Output Layer: A single neuron with sigmoid activation, producing a probability for binary classification (e.g., positive vs negative sentiment).
- Where this can be used:
    - Text classification tasks like sentiment analysis, spam detection, or toxic comment classification, using a lightweight Transformer-style architecture.
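The attention and pooling steps in the overview can be illustrated with a tiny plain-Python sketch. It uses a single head and sets queries, keys, and values equal to the input, so it omits the learned projections and multiple heads of the Keras `MultiHeadAttention` layer; it is meant only to show the data flow.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product attention with Q = K = V = x."""
    d = len(x[0])
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x] for q in x]
    weights = [softmax(row) for row in scores]
    # Each output token is a weighted average of all input tokens.
    return [[sum(w * v[j] for w, v in zip(row, x)) for j in range(d)] for row in weights]


def global_average_pool(x):
    """Average over the sequence axis to get one vector per example."""
    return [sum(col) / len(x) for col in zip(*x)]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens, 2-dimensional embeddings
attended = self_attention(tokens)
pooled = global_average_pool(attended)
print(attended)
print(pooled)
```

The pooled vector is what the dense and output layers of the classifier operate on.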



In [None]:
from tensorflow.keras import layers

print("\nBuilding model...")

embedding_dim = 64
num_heads = 2
dff = 64

inputs = layers.Input(shape=(SEQUENCE_LENGTH,))
x = layers.Embedding(VOCAB_SIZE, embedding_dim)(inputs)
x = layers.LayerNormalization()(x)
x = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dim)(x, x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dense(dff, activation='relu')(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.Model(inputs, outputs)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()


### Train Model

In [None]:
print("\nTraining model...")
model.fit(train_ds, validation_data=test_ds, epochs=3)

### Evaluate Original Model

In [None]:
print("\nEvaluating original model...")
loss, acc = model.evaluate(test_ds)
print(f"Original model accuracy: {acc:.4f}")


### Save Model

In [None]:
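saved_model_dir = "saved_model_imdb"
# With tf.keras 2.x this writes a SavedModel directory. With Keras 3 (TF >= 2.16),
# model.save() expects a .keras or .h5 file name; use model.export(saved_model_dir) there.
model.save(saved_model_dir)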


### Convert to TFLite: Baseline (no quantization)

In [None]:
print("\nConverting to TFLite (no quantization)...")
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
tflite_model = converter.convert()

with open("model_fp32.tflite", "wb") as f:
    f.write(tflite_model)

print(f"Size of unquantized model: {os.path.getsize('model_fp32.tflite') / 1024:.2f} KB")

### Convert to TFLite: Dynamic Range Quantization

- Dynamic range quantization is a post-training quantization technique that reduces model size and can speed up inference, particularly on CPUs.
- Weights (the learned parameters) are converted from 32-bit floating point (FP32) to 8-bit integers (INT8) at conversion time.
- Activations (intermediate computations during inference) are stored in floating point. For operators with quantized kernels, they are quantized to 8 bits on the fly at inference time, based on the range of each input; no activation ranges are recorded during conversion.
- This method does not require any training or calibration data; it’s quick and easy to apply.

In [None]:
print("\nConverting to TFLite (dynamic range quantization)...")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_quant_model)

print(f"Size of quantized model: {os.path.getsize('model_quant.tflite') / 1024:.2f} KB")
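The idea behind dynamic range quantization can be sketched in plain Python: weights are quantized once ahead of time, while each activation tensor gets its own scale computed at run time. This is a simplified per-tensor illustration with hypothetical helper names, not a description of how the TFLite kernels are implemented.

```python
def symmetric_quantize(values):
    """Per-tensor symmetric INT8 quantization; returns integer codes and the scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) for v in values], scale


# Conversion time: weights are quantized once and stored as 8-bit codes.
weights = [0.12, -0.53, 0.91, -0.07]
w_q, w_scale = symmetric_quantize(weights)


def dynamic_quantized_dot(activations):
    """Run time: quantize activations using the range of the values actually seen."""
    a_q, a_scale = symmetric_quantize(activations)
    acc = sum(a * w for a, w in zip(a_q, w_q))  # integer multiply-accumulate
    return acc * a_scale * w_scale              # rescale the result to float


x = [1.5, -0.2, 0.8, 3.1]
exact = sum(a * w for a, w in zip(x, weights))
approx = dynamic_quantized_dot(x)
print(f"float result: {exact:.4f}, quantized result: {approx:.4f}")
```

Because the activation scale is derived from each input, no calibration data is needed, which matches the description above.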

### Evaluate TFLite Models

- Load the TensorFlow Lite (TFLite) model.
- Read the model’s input and output details, including expected data types and shapes.
- Take one batch of test data (`test_ds`) and convert it to the data type the model expects.
- If the model has a dynamic input shape, resize the input tensor to hold a single sample.
- For each sample in the batch:
    - Feed the sample into the TFLite model and run inference.
    - Get the model’s prediction and compare it to the true label to count correct predictions.
- Measure the total inference time and compute the average latency per sample in milliseconds.
- Compute accuracy (the fraction of correct predictions). Because only one batch of 32 reviews is evaluated, this is a rough estimate.
- Compare the full-precision (FP32) and quantized TFLite models.

In [None]:
def evaluate_tflite_model(tflite_path):
    interpreter = tf.lite.Interpreter(model_path=tflite_path)
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Extract the tensor index, dtype, and shape signature expected by the TFLite model
    input_index = input_details[0]['index']
    expected_dtype = input_details[0]['dtype']
    # 'shape_signature' marks dynamic dimensions with -1 ('shape' reports them as 1)
    shape_signature = input_details[0]['shape_signature']

    # Get a single batch of data
    x_sample, y_sample = next(iter(test_ds))
    x_sample = x_sample.numpy().astype(expected_dtype)
    y_sample = y_sample.numpy()

    # If the model has a dynamic dimension, fix it once to a single-sample input
    if -1 in shape_signature:
        interpreter.resize_tensor_input(input_index, x_sample[0:1].shape)
        interpreter.allocate_tensors()

    correct = 0
    total = len(x_sample)

    # perf_counter is a monotonic, high-resolution clock suited to timing
    start = time.perf_counter()
    for i in range(total):
        interpreter.set_tensor(input_index, x_sample[i:i+1])
        interpreter.invoke()
        output = interpreter.get_tensor(output_details[0]['index'])
        pred = int(output[0][0] > 0.5)
        correct += int(pred == y_sample[i])
    end = time.perf_counter()

    acc = correct / total
    latency = (end - start) / total * 1000  # ms/sample
    return acc, latency

# === Run evaluation ===
print("\nEvaluating TFLite models...")
acc_fp32, latency_fp32 = evaluate_tflite_model("model_fp32.tflite")
acc_quant, latency_quant = evaluate_tflite_model("model_quant.tflite")

print(f"\nFP32 Model - Accuracy: {acc_fp32:.4f}, Latency: {latency_fp32:.2f} ms/sample")
print(f"Quantized Model - Accuracy: {acc_quant:.4f}, Latency: {latency_quant:.2f} ms/sample")
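Average latency over a single pass can be noisy. A slightly more careful measurement, sketched below, discards a few warm-up calls and reports the median and an approximate 95th percentile. The `benchmark` helper and its defaults are illustrative, and the workload here is a stand-in computation rather than a TFLite call.

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=20):
    """Call `fn` repeatedly and return (median_ms, p95_ms)."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000)
    timings.sort()
    p95 = timings[min(len(timings) - 1, int(0.95 * len(timings)))]
    return statistics.median(timings), p95


# Stand-in workload; in the lab this would wrap a single interpreter invocation.
median_ms, p95_ms = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"median: {median_ms:.3f} ms, p95: {p95_ms:.3f} ms")
```

In the lab setting, `fn` could be a small function that sets the input tensor and calls `interpreter.invoke()`.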


# Bonus Lab: Quantizing a Pretrained BERT Model

In [None]:
import tensorflow as tf
import numpy as np
import time
import os
from transformers import AutoTokenizer, TFAutoModel
import tensorflow_datasets as tfds


#### Load Tokenizer and Smallest BERT

In [None]:

# We'll use Google's smallest BERT miniature (2 layers, hidden size 128, 2 attention heads)
MODEL_NAME = "google/bert_uncased_L-2_H-128_A-2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
bert_model = TFAutoModel.from_pretrained(MODEL_NAME, from_pt=False)

#### Prepare IMDB Dataset (tokenized)

In [None]:
def encode(text, label):
    tokens = tokenizer(
        text.numpy().decode('utf-8'),
        truncation=True,
        padding='max_length',
        max_length=128,
        return_tensors='np'
    )
    input_ids = tf.convert_to_tensor(tokens['input_ids'][0], dtype=tf.int32)
    attention_mask = tf.convert_to_tensor(tokens['attention_mask'][0], dtype=tf.int32)
    label = tf.cast(label, tf.float32)
    # Return as tuple, NOT dict
    return input_ids, attention_mask, label

def encode_map(text, label):
    # Specify output types as a tuple
    input_ids, attention_mask, label = tf.py_function(
        encode,
        inp=[text, label],
        Tout=(tf.int32, tf.int32, tf.float32)
    )
    # Set shapes (optional but recommended)
    input_ids.set_shape([128])
    attention_mask.set_shape([128])
    label.set_shape([])

    # Rebuild dict for model input
    return {"input_ids": input_ids, "attention_mask": attention_mask}, label

# Load IMDB dataset splits
(train_raw, test_raw), ds_info = tfds.load(
    'imdb_reviews',
    split=['train', 'test'],
    as_supervised=True,
    with_info=True
)


# Now use encode_map
train_ds = train_raw.map(encode_map).batch(32).prefetch(tf.data.AUTOTUNE)
test_ds = test_raw.map(encode_map).batch(32).prefetch(tf.data.AUTOTUNE)


#### Build Classifier on Top of Frozen Tiny BERT

In [None]:
# Freeze BERT weights
bert_model.trainable = False

# Define input layers
input_ids = tf.keras.Input(shape=(128,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(128,), dtype=tf.int32, name="attention_mask")

# Run inputs through BERT
outputs = bert_model(input_ids=input_ids, attention_mask=attention_mask)[1]  # pooled output

# Add dropout:
x = tf.keras.layers.Dropout(0.2)(outputs)

# Add final classification layer
x = tf.keras.layers.Dense(1, activation='sigmoid')(x)

# Build and compile Keras model:
model = tf.keras.Model(inputs={"input_ids": input_ids, "attention_mask": attention_mask}, outputs=x)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

#### Train Classifier Head Briefly

In [None]:
model.fit(train_ds, validation_data=test_ds, epochs=2)

##### Save as SavedModel

In [None]:
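model_dir = "bert_tiny_classifier"
# With tf.keras 2.x this writes a SavedModel directory. With Keras 3 (TF >= 2.16),
# model.save() expects a .keras or .h5 file name; use model.export(model_dir) there.
model.save(model_dir)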

####  Convert to TFLite with Dynamic Range Quantization

In [None]:
converter = tf.lite.TFLiteConverter.from_saved_model(model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("bert_tiny_quant.tflite", "wb") as f:
    f.write(tflite_model)

print(f"Quantized model size: {os.path.getsize('bert_tiny_quant.tflite') / 1024:.2f} KB")

#### Evaluate Quantized Model

In [None]:
def evaluate_tflite_model(tflite_path):
    interpreter = tf.lite.Interpreter(model_path=tflite_path)
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Look up input tensors by substring match, since the exact tensor names
    # (e.g. 'serving_default_input_ids:0') depend on the exported signature
    def find_input(name):
        return next(d['index'] for d in input_details if name in d['name'])

    ids_idx = find_input('input_ids')
    mask_idx = find_input('attention_mask')
    output_idx = output_details[0]['index']

    correct = 0
    total = 0
    start = time.perf_counter()

    # Evaluate on the first 10 batches, one sample at a time
    for x_batch, y_batch in test_ds.take(10):
        ids = x_batch['input_ids'].numpy()
        mask = x_batch['attention_mask'].numpy()
        labels = y_batch.numpy()
        for i in range(len(labels)):
            interpreter.set_tensor(ids_idx, ids[i:i+1])
            interpreter.set_tensor(mask_idx, mask[i:i+1])
            interpreter.invoke()
            output = interpreter.get_tensor(output_idx)
            pred = int(output[0][0] > 0.5)
            correct += int(pred == labels[i])
            total += 1

    end = time.perf_counter()
    acc = correct / total
    latency = (end - start) / total * 1000  # ms/sample (includes tokenization in the input pipeline)
    print(f"\nQuantized BERT-Tiny Accuracy: {acc:.4f}, Latency: {latency:.2f} ms/sample")

evaluate_tflite_model("bert_tiny_quant.tflite")
