```{contents}
```
## Quantization

Quantization is a core **model compression and acceleration technique** that converts high-precision numerical representations (e.g., FP32) into lower-precision formats (e.g., INT8, INT4) to make LLMs **faster, smaller, cheaper, and easier to deploy**, usually without major accuracy loss.

---

### 1. Why Quantization Matters for LLMs

LLMs are dominated by **matrix multiplications** and **memory movement**.

| Bottleneck       | Effect                     |
| ---------------- | -------------------------- |
| Model size       | Does not fit on single GPU |
| Memory bandwidth | Slows inference            |
| Compute cost     | Expensive deployment       |
| Energy           | High power consumption     |

**Quantization reduces:**

* Model size (≈ 2–8× smaller than FP32, depending on the target precision)
* Memory bandwidth
* Latency
* Power consumption

---

### 2. Core Idea

Instead of storing weights as 32-bit floats:

$$
W_{fp32} \rightarrow W_{int8/int4}
$$

We store and compute with **lower precision**, using scale factors to preserve numerical meaning:

$$
x \approx s \cdot q
$$

where

* $q$ = integer value
* $s$ = scale factor
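
The relation $x \approx s \cdot q$ can be illustrated with a minimal NumPy sketch. It assumes symmetric "absmax" scaling (the scale is chosen so the largest magnitude maps to the largest integer), which is one common choice and is not prescribed by the text above; the function names are illustrative.

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, num_bits: int = 8):
    """Map floats to signed integers using a single absmax-derived scale."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for INT8
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original floats: x ~ s * q."""
    return scale * q.astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4, 8)).astype(np.float32)  # stand-in weight matrix

q, s = quantize_symmetric(w)
w_hat = dequantize(q, s)
max_err = float(np.max(np.abs(w - w_hat)))

print("integer dtype:", q.dtype)
print("scale:", s)
print("max abs error:", max_err, "(at most about scale / 2)")
```

Because rounding moves each value by at most half a quantization step, the reconstruction error is bounded by roughly $s/2$ per element.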

---

### 3. Quantization Workflow

```
FP32 Model
   ↓
Calibration (collect statistics)
   ↓
Choose precision (INT8 / INT4)
   ↓
Compute scale & zero-point
   ↓
Quantize weights & activations
   ↓
Optimized Inference Kernel
```
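
The workflow above can be sketched end to end in NumPy. This is a simplified illustration under stated assumptions: calibration is a running min/max over a few random batches (real pipelines use representative data and often percentile or histogram observers), the target is unsigned 8-bit, and `MinMaxObserver`, `affine_params`, `quantize`, and `dequantize` are hypothetical helper names.

```python
import numpy as np

class MinMaxObserver:
    """Tracks the running min/max of tensors seen during calibration."""
    def __init__(self):
        self.lo, self.hi = np.inf, -np.inf

    def observe(self, x: np.ndarray) -> None:
        self.lo = min(self.lo, float(x.min()))
        self.hi = max(self.hi, float(x.max()))

def affine_params(lo: float, hi: float, qmin: int = 0, qmax: int = 255):
    """Derive scale and zero-point so that [lo, hi] maps onto [qmin, qmax]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)        # keep 0.0 inside the range
    scale = (hi - lo) / (qmax - qmin) or 1.0   # guard against a zero-width range
    zero_point = int(round(qmin - lo / scale))
    return scale, int(np.clip(zero_point, qmin, qmax))

def quantize(x, scale, zero_point, qmin=0, qmax=255):
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

rng = np.random.default_rng(0)

# Step 1: calibration, collecting statistics over sample batches.
observer = MinMaxObserver()
for _ in range(8):
    observer.observe(rng.normal(0.5, 1.0, size=(16, 32)).astype(np.float32))

# Steps 2-3: choose precision (8-bit here) and compute scale and zero-point.
scale, zero_point = affine_params(observer.lo, observer.hi)

# Step 4: quantize new activations with the fixed, calibrated parameters.
x = rng.normal(0.5, 1.0, size=(4, 32)).astype(np.float32)
x_q = quantize(x, scale, zero_point)
x_hat = dequantize(x_q, scale, zero_point)

# Values inside the calibrated range are off by at most about scale / 2;
# values outside it are clipped, which is where calibration quality matters.
in_range = (x >= observer.lo) & (x <= observer.hi)
max_err = float(np.max(np.abs(x - x_hat)[in_range]))
print(f"scale={scale:.5f} zero_point={zero_point} max in-range error={max_err:.5f}")
```

The final step, an optimized inference kernel, is hardware- and library-specific and is not modeled here.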

---

### 4. Types of Quantization in LLMs

| Category                           | Description                                                   |
| ---------------------------------- | ------------------------------------------------------------- |
| Post-Training Quantization (PTQ)   | Quantize a pretrained model without retraining                |
| Quantization-Aware Training (QAT)  | Train while simulating quantization                           |
| Static Quantization                | Activation ranges fixed ahead of time from calibration data   |
| Dynamic Quantization               | Activation ranges computed at runtime                         |
| Weight-only Quantization           | Only weights are quantized; activations stay in FP16/BF16     |
| Weight + Activation Quantization   | Both weights and activations are quantized                    |
| Per-Tensor Quantization            | One scale per tensor                                          |
| Per-Channel Quantization           | One scale per channel (usually more accurate)                 |
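
To make the per-tensor versus per-channel distinction concrete, the sketch below quantizes a synthetic weight matrix both ways and compares the reconstruction error. The matrix, the artificially enlarged channel, and the helper name `absmax_roundtrip` are assumptions made for illustration.

```python
import numpy as np

def absmax_roundtrip(w: np.ndarray, axis=None, qmax: int = 127) -> np.ndarray:
    """Symmetric INT8 quantize-then-dequantize.

    axis=None -> one scale for the whole tensor (per-tensor)
    axis=1    -> one scale per row, i.e. per output channel (per-channel)
    """
    max_abs = np.max(np.abs(w), axis=axis, keepdims=axis is not None)
    scale = np.maximum(max_abs, 1e-12) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(8, 64))
w[0] *= 50.0  # one channel with a much larger dynamic range than the rest

err_tensor = float(np.mean((w - absmax_roundtrip(w)) ** 2))
err_channel = float(np.mean((w - absmax_roundtrip(w, axis=1)) ** 2))

print(f"per-tensor  MSE: {err_tensor:.3e}")
print(f"per-channel MSE: {err_channel:.3e}")
```

With a single shared scale, the large channel dictates the step size for every other channel, so small weights collapse onto a handful of integer levels. Per-channel scales avoid that at the cost of storing one scale per row.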

---

### 5. Common LLM Quantization Formats

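| Format / Method | Bits | Memory Reduction (vs. FP32) | Accuracy      |
| --------------- | ---- | --------------------------- | ------------- |
| FP16            | 16   | 2×                          | ~Baseline     |
| INT8            | 8    | 4×                          | Very high     |
| INT4            | 4    | 8×                          | Moderate–High |
| NF4             | 4    | 8×                          | High          |
| GPTQ            | 4–8  | 4–8×                        | High          |
| AWQ             | 4    | 8×                          | Very high     |

GPTQ and AWQ are quantization methods rather than number formats; they are listed here by the bit widths they typically target.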

---

### 6. Mathematical View

Uniform quantization:

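$$
q = \text{round}\left(\frac{x}{s}\right)
\quad,\quad
x \approx s \cdot q
$$

With zero-point $z$ (asymmetric quantization), the integer grid is shifted so that the real value 0 maps to $q = z$:

$$
q = \text{round}\left(\frac{x}{s}\right) + z
\quad,\quad
x \approx s \cdot (q - z)
$$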
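
A short worked example may help. It assumes the common min/max convention for choosing $s$ and $z$, which the text above does not specify, with a real range $[x_{\min}, x_{\max}] = [-1, 3]$ mapped onto unsigned 8-bit integers $[q_{\min}, q_{\max}] = [0, 255]$:

$$
s = \frac{x_{\max} - x_{\min}}{q_{\max} - q_{\min}} = \frac{3 - (-1)}{255} \approx 0.01569
\quad,\quad
z = \text{round}\left(q_{\min} - \frac{x_{\min}}{s}\right) = \text{round}(63.75) = 64
$$

Quantizing $x = 0.5$ and mapping it back:

$$
q = \text{round}\left(\frac{0.5}{s}\right) + z = \text{round}(31.875) + 64 = 96
\quad,\quad
\hat{x} = s \cdot (q - z) = \frac{4}{255} \cdot 32 \approx 0.50196
$$

The round-trip error is about $0.002$, within the expected bound of $s/2 \approx 0.0078$.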

---

### 7. Quantization for Transformer Layers

LLM weight matrices:

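* Attention projections $W_Q, W_K, W_V$
* FFN matrices
* Output projection

The dominant cost is the product of activations $X$ with a weight matrix $W$:

$$
XW
$$

After quantization, the product runs on integer kernels and the result is rescaled back to floating point:

$$
X_{int} W_{int} \Rightarrow \text{INT kernels} \Rightarrow \text{dequantize}
$$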
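
The quantize, integer-matmul, dequantize path can be emulated in NumPy. This is only a sketch of the arithmetic: real INT8 kernels run on dedicated hardware instructions, and the choice of one scale per token for $X$ and one per output column for $W$ is an assumption made here for illustration.

```python
import numpy as np

def quantize_absmax(x: np.ndarray, axis: int, qmax: int = 127):
    """Symmetric INT8 quantization with one scale per slice along `axis`."""
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = np.maximum(max_abs, 1e-12) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 64)).astype(np.float32)               # (tokens, d_in)
W = rng.normal(scale=0.05, size=(64, 16)).astype(np.float32)  # (d_in, d_out)

X_q, s_x = quantize_absmax(X, axis=1)  # per-token scales, shape (4, 1)
W_q, s_w = quantize_absmax(W, axis=0)  # per-output-column scales, shape (1, 16)

# Integer matmul with INT32 accumulation so that sums of INT8 products
# cannot overflow.
acc = X_q.astype(np.int32) @ W_q.astype(np.int32)

# Dequantize: each output element is rescaled by its row and column scale.
Y_hat = acc.astype(np.float32) * (s_x * s_w)

Y = X @ W
rel_err = float(np.linalg.norm(Y - Y_hat) / np.linalg.norm(Y))
print(f"relative error vs. FP32 matmul: {rel_err:.4f}")
```

Because the scales factor out of the sum over the inner dimension, a single multiply per output element is enough to return to floating point.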

---

### 8. Demonstration (PyTorch Dynamic INT8)

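```python
import torch
from transformers import AutoModelForCausalLM

# OPT builds its projections from torch.nn.Linear, which quantize_dynamic can swap.
# (GPT-2 uses a custom Conv1D layer for most projections, so only its lm_head,
# the one nn.Linear in the model, would be quantized.)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.eval()  # dynamic quantization targets inference

# Weights are converted to INT8 ahead of time; activation ranges are computed
# on the fly at runtime. The resulting model runs on CPU.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8,
)

print(model_int8)  # Linear layers now appear as DynamicQuantizedLinear
```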

---

### 9. 4-bit Quantization with BitsAndBytes

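```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Requires the bitsandbytes package and a CUDA GPU.
# The Llama 2 checkpoint is gated and needs approved access on the Hugging Face Hub.
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4; the default is "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls after dequantization
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=config,
    device_map="auto",
)
```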

---

### 10. Accuracy vs Performance Tradeoff

| Precision | Speed     | Memory     | Accuracy    |
| --------- | --------- | ---------- | ----------- |
| FP32      | Slow      | Huge       | Baseline    |
| FP16      | Fast      | Large      | ~Baseline   |
| INT8      | Faster    | Small      | ~Baseline   |
| INT4      | Very Fast | Very Small | Slight drop |

---

### 11. When to Use Which

| Scenario        | Recommendation                              |
| --------------- | ------------------------------------------- |
| Training        | FP16 / BF16                                 |
| Fine-tuning     | FP16 / BF16, or QLoRA (4-bit base + LoRA)   |
| Inference (GPU) | INT8 / AWQ                                  |
| Edge deployment | INT4 / GPTQ                                 |

---

### 12. Modern LLM Quantization Techniques

| Method      | Key Idea                              |
| ----------- | ------------------------------------- |
| GPTQ        | Layer-wise Hessian-aware quantization |
| AWQ         | Activation-aware weight scaling       |
| QLoRA       | 4-bit base model + low-rank adapters  |
| SmoothQuant | Balance weight & activation ranges    |
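
As one concrete illustration of these ideas, the sketch below shows the smoothing step behind SmoothQuant: a per-input-channel factor moves activation outliers into the weights while leaving the product $XW$ unchanged. It is a minimal sketch and not the reference implementation; the function name `smooth`, the migration strength `alpha = 0.5`, and the synthetic data are assumptions.

```python
import numpy as np

def smooth(X: np.ndarray, W: np.ndarray, alpha: float = 0.5):
    """Rescale X and W per input channel so that X @ W is preserved.

    X: (tokens, d_in) activations, W: (d_in, d_out) weights.
    s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)
    """
    act_max = np.max(np.abs(X), axis=0)  # per input channel of X
    w_max = np.max(np.abs(W), axis=1)    # matching row of W
    s = np.maximum(act_max ** alpha / w_max ** (1 - alpha), 1e-5)
    return X / s, W * s[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 16))
X[:, 3] *= 40.0                          # one outlier activation channel
W = rng.normal(scale=0.05, size=(16, 8))

X_s, W_s = smooth(X, W)

print("output preserved:", np.allclose(X @ W, X_s @ W_s))
print("activation absmax before:", float(np.abs(X).max()))
print("activation absmax after: ", float(np.abs(X_s).max()))
```

After smoothing, the activation range is far narrower, so quantizing activations with a single scale wastes fewer integer levels on the outlier channel; the weights absorb the difference and remain comparatively easy to quantize.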

---

### 13. Practical Impact (LLaMA-7B Example)

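| Precision | VRAM (weights only) |
| --------- | ------------------- |
| FP16      | ~14 GB              |
| INT8      | ~7 GB               |
| INT4      | ~3.5 GB             |

These figures cover the weights alone (parameter count × bytes per weight). The KV cache, activations, and framework overhead add to the total at runtime.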
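
The numbers in this table follow from simple arithmetic, sketched below. The sketch assumes a nominal 7 × 10⁹ parameters and 1 GB = 10⁹ bytes, and it ignores the small overhead of stored scales and zero-points; `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Weights-only footprint in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 7e9  # nominal parameter count, assumed for illustration

estimates = {
    name: weight_memory_gb(NUM_PARAMS, bits)
    for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]
}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.1f} GB")
```

Halving the bits per weight halves the footprint, which is why the FP16, INT8, and INT4 rows step down from ~14 GB to ~7 GB to ~3.5 GB.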

---

### 14. Summary

Quantization enables:

* **Massive model compression**
* **Production-grade inference**
* **Edge and consumer deployment**
* **Minimal accuracy degradation**