## Consider this distribution

$FP32: [33.623563422, 12.646104098, 0, -51.5839]$

The tensor above is stored in `fp32` format. Other numeric formats exist alongside `fp32`:

![data types](./img/data_types.jpg)

### General formula to represent values is: $(-1)^{s}\times 2^{E-127}\times(1+\sum_{i=1}^{23}b_{23-i}2^{-i})$ where $s$ is the sign bit, $E$ is the 8-bit biased exponent, and $b$ are the 23 mantissa bits [(source)](https://en.wikipedia.org/wiki/Single-precision_floating-point_format)
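
As a rough check of the formula, the sketch below unpacks the raw bits of an `fp32` value with Python's standard `struct` module and rebuilds the number from the sign, exponent and mantissa bits. The helper name `decode_fp32` is made up for this illustration, and it only handles normal numbers (zero, subnormals, infinities and NaN use special exponent codes that the formula does not cover).

```python
import struct

def decode_fp32(x: float):
    """Split x (rounded to fp32) into sign, exponent and fraction, then rebuild it."""
    # Reinterpret the 4 bytes of the fp32 value as an unsigned 32-bit integer
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31            # 1 sign bit
    E = (bits >> 23) & 0xFF   # 8 exponent bits, biased by 127
    frac = bits & 0x7FFFFF    # 23 mantissa bits
    if E in (0, 255):
        raise ValueError("zero, subnormal, inf and NaN are not covered by this formula")
    # The sum of b_{23-i} * 2^{-i} over the mantissa bits equals frac / 2^23
    value = (-1) ** s * 2.0 ** (E - 127) * (1 + frac / 2 ** 23)
    return s, E, frac, value

# Values taken from the tensor at the top of this notebook
for x in [33.623563422, 12.646104098, -51.5839]:
    s, E, frac, value = decode_fp32(x)
    print(f"{x!r}: s={s} E={E} mantissa={frac:023b} -> {value!r}")
```

The rebuilt value differs slightly from the decimal literal you typed in, because `fp32` can only store the nearest representable number.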


### `Int8` contains 8 bits, represented as integers in memory. The range is $[-128,127]$ for signed values, or $[0,255]$ for the unsigned variant (`uint8`)

Converting a 32-bit number to an 8-bit number reduces the space requirement by $4\times$. However, that's not the only benefit: integer arithmetic is normally much faster and more energy efficient.

### Q: How do you efficiently squeeze all the dynamic range (difference between the highest and the lowest value that can be represented) of `fp32` into `8` bits of `Int8`? 

### Consider this formula

### $\text{New Value, } v = \frac{c - l}{h-l}$
Where $c$ is any given value in the old range, $l$ is the lowest value possible in the old range, $h$ is the highest value possible in the old range. 

## Task 1

1. Write Python code that implements this formula. Use the function given below to generate floating-point numbers within a given range
2. Use this formula to squeeze certain values in `fp32` range
3. What do you observe? What is the new range of the values?

In [1]:
import numpy as np
def generate_random_float(min_val: float, max_val: float, size: int = 1, seed: int = None) -> np.ndarray:
    """
    Generate random floating point numbers between min_val and max_val
    
    Args:
        min_val (float): Minimum value of range
        max_val (float): Maximum value of range
        size (int): Number of random values to generate
        seed (int): Random seed for reproducibility
        
    Returns:
        np.ndarray: Array of random floats
    """
    if seed is not None:
        np.random.seed(seed)
    
    random_nums = np.random.uniform(low=min_val, high=max_val, size=size)
    return random_nums if size > 1 else random_nums[0]

## Task 2

1. Suggest edits to this formula so that we can scale the values in any arbitrary range
2. Use it to change the range of certain values in `fp32` range
3. What is one problem with this approach (pertaining to the numerical representation of the numbers)?
4. (Bonus) List more problems with this approach

### There are some obvious benefits of running an LLM at lower precision.

1. Less inference time
2. Less energy consumption
3. Less storage requirement
4. Ability to run on mobile devices

This comes at the cost of reduced precision and lower output quality.

### Weights vs Activation quantization

Weight quantization:

1. Store the weights in `int8` and dequantize them to `fp32` when running the model
2. Inference is not faster, but storage space is saved

Activation quantization:

1. Convert all inputs and outputs into `int8` and do computations in `int8`
2. Need calibration to determine scale factors for data at each layer 
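
To make the weight-quantization case concrete, here is a minimal NumPy sketch: the weights are stored as `int8` plus one scale, and are dequantized to `fp32` just before the matrix multiply. It assumes a simple symmetric "absmax" scheme with a single per-tensor scale, which is a simplification chosen for brevity and not the zero-point scheme covered later. The function names are hypothetical.

```python
import numpy as np

def quantize_weights_int8(w: np.ndarray):
    """Symmetric absmax quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def linear_weight_only(x: np.ndarray, q: np.ndarray, scale) -> np.ndarray:
    """Dequantize the stored int8 weights to fp32, then compute in fp32."""
    w = q.astype(np.float32) * scale
    return x @ w.T

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)   # fp32 weights of a small layer
x = rng.normal(size=(2, 8)).astype(np.float32)   # a batch of two inputs

q, scale = quantize_weights_int8(w)
y_ref = x @ w.T                         # output with the original weights
y_q = linear_weight_only(x, q, scale)   # output with the stored int8 weights

print("bytes fp32:", w.nbytes, "bytes int8:", q.nbytes)
print("max output difference:", np.abs(y_ref - y_q).max())
```

The arithmetic still runs in `fp32`, which is why this variant saves space without speeding up inference.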


# Zero point quantization

**Zero-point quantization** is a key concept in **quantization**, a technique used to reduce the size and computational requirements of machine learning models by converting higher-precision data (e.g., 32-bit floating-point) into lower-precision formats (e.g., 8-bit integers). Zero-point quantization specifically deals with ensuring that the range of the quantized data matches the original range of the floating-point data.

---

### **What is Zero-Point Quantization?**
Quantization typically involves mapping floating-point values to integers. For example, you may map a range of floating-point values (e.g., $[-1.0, 1.0]$) to an integer range (e.g., $[0, 255]$ for 8-bit unsigned integers). However, in many cases, the floating-point range may not start at zero, and we need to account for this offset to avoid information loss during quantization. This is where the **zero point** comes in.

The **zero point** is the integer value that corresponds to the real value of **0** in the original floating-point range.

---

### **Mathematical Representation**

Given:
1. A floating-point value $x$,
2. The quantization scale $s$ (i.e., the step size for mapping floating-point values to integers),
3. The zero-point $z$ (the offset),

The quantized integer value $q$ is calculated as:
$$
q = \text{round} \left( \frac{x}{s} \right) + z
$$

The dequantized value (to recover the floating-point value) is:
$$
x \approx s \cdot (q - z)
$$

---

### **How Zero-Point Works**

1. **Scale and Zero-Point Computation:**
   - The scale ($s$) determines how much precision is preserved during quantization.
   - The zero point ($z$) ensures that the range of the quantized values aligns with the original data.

2. **Mapping Floating-Point Range to Integer Range:**
   - For example, if the floating-point range is $[-1.0, 1.0]$ and the integer range is $[0, 255]$, the scale would be:
     $$
     s = \frac{\text{max} - \text{min}}{\text{int\_max} - \text{int\_min}} = \frac{1.0 - (-1.0)}{255 - 0} = 0.007843
     $$
   - The zero-point $z$ ensures that $x = 0.0$ maps to an integer within the range. For unsigned integers, $z$ would be:
     $$
     z = \text{round} \left( \frac{0.0 - \text{min}}{s} \right) = \text{round} \left( \frac{0.0 - (-1.0)}{0.007843} \right) = 128
     $$

3. **Key Observation:**
   - If the range of the floating-point data is symmetric around 0 (e.g., $[-1.0, 1.0]$), the zero point will often fall in the middle of the integer range.
   - If the range is not symmetric (e.g., $[0.0, 1.0]$), the zero point will adjust accordingly.

---

### **Why is Zero-Point Important?**
- **Alignment of Ranges:** Ensures that quantized values accurately represent the original floating-point values, especially for zero.
- **Avoid Information Loss:** Without a zero point, quantization could introduce a significant bias by shifting the representation of zero, leading to errors during dequantization.
- **Improved Model Performance:** Proper zero-point computation minimizes the error introduced by quantization, preserving model accuracy while reducing memory and computation requirements.

---

### **Example: Quantization with Zero Point**

#### Floating-Point Range: $[-6, 6]$, Integer Range: $[0, 255]$
1. Compute the scale:
   $$
   s = \frac{\text{max} - \text{min}}{\text{int\_max} - \text{int\_min}} = \frac{6 - (-6)}{255 - 0} = \frac{12}{255} \approx 0.047
   $$

2. Compute the zero point:
   $$
   z = \text{round} \left( \frac{0 - \text{min}}{s} \right) = \text{round} \left( \frac{0 - (-6)}{0.047} \right) = \text{round}(127.66) = 128
   $$

3. Quantize a floating-point value $x = 3.5$:
   $$
   q = \text{round} \left( \frac{x}{s} \right) + z = \text{round} \left( \frac{3.5}{0.047} \right) + 128 = \text{round}(74.47) + 128 = 202
   $$

4. Dequantize back to floating-point:
   $$
   x \approx s \cdot (q - z) = 0.047 \cdot (202 - 128) = 0.047 \cdot 74 \approx 3.48
   $$
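
The arithmetic in this example can be replayed in a few lines of Python. This is only a check of the numbers above: it uses the unsigned range $[0, 255]$ and rounds $s$ to three decimals exactly as the example does, and the helper names `quantize` and `dequantize` are illustrative.

```python
x_min, x_max = -6.0, 6.0
int_min, int_max = 0, 255

# Scale, rounded to three decimals as in the worked example (0.047)
s = round((x_max - x_min) / (int_max - int_min), 3)

# Zero point: the integer that the real value 0.0 maps to
z = round((0.0 - x_min) / s)

def quantize(x: float) -> int:
    return round(x / s) + z

def dequantize(q: int) -> float:
    return s * (q - z)

q = quantize(3.5)
x_hat = dequantize(q)
print(f"s={s}, z={z}, q={q}, x_hat={x_hat:.3f}, error={abs(3.5 - x_hat):.3f}")
```

The gap between $3.5$ and the recovered value is the quantization error, and it is at most about half a step ($s/2$) for values inside the range.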

---

- Zero-point quantization adjusts the mapping of floating-point numbers to integers to ensure that $0.0$ is accurately represented.
- It prevents bias and improves accuracy when converting between floating-point and quantized representations.
- Common in 8-bit quantization for neural networks to optimize memory and computation.


### Task 3
1. Work through zero-point quantization for the floating-point range $[-10, 10]$ and the integer range $[-127, 127]$, and use one example value to quantize and dequantize. What's the issue?
2. Write python code for calculating 0 point quantization

# GPTQ quantization for 4 bit integers

## Hessian matrix

The Hessian matrix is a square matrix of second-order partial derivatives of a scalar-valued function. Mathematically, for a function $ f(x_1, x_2, \ldots, x_n) $, the Hessian is defined as:

$$
H(f) = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}
$$

#### **Key Properties:**
- It is symmetric if $ f $ is twice continuously differentiable.
- The eigenvalues of the Hessian give insight into the local curvature of $ f $:
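  - All eigenvalues positive: $ f $ is locally convex (a local minimum if the gradient is zero there).
  - All eigenvalues negative: $ f $ is locally concave (a local maximum if the gradient is zero there).
  - Mixed signs: $ f $ has a saddle point if the gradient is zero there.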
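
The eigenvalue test can be tried on small functions. The sketch below approximates the Hessian with central finite differences (one of several possible ways to obtain it) and classifies a point where the gradient is zero. The helper names `numerical_hessian` and `classify` are invented for this example.

```python
import numpy as np

def numerical_hessian(f, x, eps=1e-4):
    """Approximate the Hessian of f at x with central finite differences."""
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

def classify(H, tol=1e-6):
    """Classify a critical point from the signs of the Hessian eigenvalues."""
    eig = np.linalg.eigvalsh((H + H.T) / 2)  # symmetrize against rounding noise
    if np.all(eig > tol):
        return "local minimum"
    if np.all(eig < -tol):
        return "local maximum"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"
    return "inconclusive"

origin = [0.0, 0.0]  # the gradient of all three functions is zero here
bowl = lambda p: p[0] ** 2 + p[1] ** 2
dome = lambda p: -(p[0] ** 2) - p[1] ** 2
saddle = lambda p: p[0] ** 2 - p[1] ** 2

for name, f in [("bowl", bowl), ("dome", dome), ("saddle", saddle)]:
    print(name, "->", classify(numerical_hessian(f, origin)))
```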

In our setting $ f $ is the loss and the $ x_i $ are the weights, so the Hessian tells us how the loss is affected as the weights are updated: it captures the curvature of the loss around the current weights.

### **The Hessian matrix contains the second order derivative of the loss function with respect to the weights. The inverse of this matrix quantifies how changes in the parameters affect the loss function**

The magnitude of the diagonal elements of the inverse Hessian tells us which weights are more sensitive to quantization and which are less sensitive.
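
A small numerical illustration, under one assumption: the loss is the layer-wise squared error $\|WX - \hat{W}X\|^2$ between the original and quantized layer outputs, for which the Hessian with respect to one row of weights is $H = 2XX^\top$, where $X$ holds the layer inputs. The three input features below are given very different magnitudes on purpose; all names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000

# Three input features with small, medium and large typical magnitude
feature_std = np.array([0.1, 1.0, 10.0])
X = rng.normal(size=(3, n_samples)) * feature_std[:, None]

# Hessian of the layer-wise squared error with respect to one row of weights
H = 2.0 * X @ X.T
H_inv = np.linalg.inv(H)
inv_diag = np.diag(H_inv)

# Smaller diagonal entry of the inverse = more sensitive weight
order = np.argsort(inv_diag)
print("diag(H^-1):", inv_diag)
print("weights from most to least sensitive:", order)
```

The weight attached to the largest-magnitude input gets the smallest diagonal entry in $H^{-1}$, so an error in that weight is the most costly for the layer output.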

### Goal:

The primary objective is to convert the network's parameters (weights) into 4-bit integers (values between -8 and 7, or 0 and 15 depending on the quantization scheme), except for "emergent features" which refer to specific parts of the network that are deemed crucial and kept in higher precision. The process aims to minimize the loss introduced by this quantization and find the best possible quantized values for the parameters.

1. We need to minimize the loss function from the *quantization*
2. We need to know how to compute the best parameter values for this quantization

### Algorithm:

The process is performed layer by layer:

1. Layer-wise Quantization: The weights of each layer are processed independently.

2. Find the Hessian Matrix: Compute the Hessian matrix for the current layer.

3. Apply a Damping Factor: A small constant is added to the diagonal of the Hessian matrix. This keeps the matrix well-conditioned so that it can be inverted reliably, and it also acts as a regularizer so that the Hessian does not overfit to the calibration data, which could lead to poor generalization after quantization.

4. Find the Inverse Hessian: Calculate the inverse of the damped Hessian matrix. Its diagonal shows how *sensitive* the loss is to a change in each weight.
    - Larger values: less sensitive weights
    - Smaller values: more sensitive weights

5. Process the Weights: For each weight in the layer:
    - Perform Zero-Point Quantization: Quantize the weight to a 4-bit integer using a zero-point quantization scheme.
    - Calculate the Quantization Error: Determine the difference between the original floating-point weight and its quantized value.
    - Normalize the Quantization Error: Normalize the quantization error by the corresponding diagonal element of the inverse Hessian. This scales the error based on the weight's sensitivity to changes in the loss. Weights with higher sensitivity will have their quantization errors adjusted more significantly.
    - Update Remaining Weight Vectors: Update the other weights in the layer based on the normalized quantization error and the inverse Hessian. This step aims to compensate for the quantization error of one weight by making small adjustments to other weights, minimizing the overall impact on the loss function.
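
The steps above can be sketched in NumPy for a single layer. This is a simplified illustration, not the optimized procedure used by real GPTQ implementations: it assumes the layer-wise Hessian $H = 2XX^\top$ built from calibration inputs $X$, one scale and zero point per weight row, 4-bit unsigned integers in $[0, 15]$, and a plain matrix inverse that is updated column by column. All function names and the `damp` default are illustrative.

```python
import numpy as np

def row_params(W, qmin=0, qmax=15):
    """One scale and zero point per row; the range is widened to contain 0."""
    w_min = np.minimum(W.min(axis=1), 0.0)
    w_max = np.maximum(W.max(axis=1), 0.0)
    scale = (w_max - w_min) / (qmax - qmin)
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    zero = np.round(-w_min / scale)
    return scale, zero

def quantize_to_grid(w, scale, zero, qmin=0, qmax=15):
    """Zero-point quantize, then dequantize so the result stays in float."""
    q = np.clip(np.round(w / scale) + zero, qmin, qmax)
    return scale * (q - zero)

def gptq_sketch(W, X, damp=0.01):
    W = np.array(W, dtype=np.float64)  # working copy, updated in place
    cols = W.shape[1]
    scale, zero = row_params(W)
    H = 2.0 * X @ X.T                               # step 2: Hessian
    H += damp * np.mean(np.diag(H)) * np.eye(cols)  # step 3: damping
    H_inv = np.linalg.inv(H)                        # step 4: inverse
    Q = np.zeros_like(W)
    for i in range(cols):                           # step 5: one column at a time
        Q[:, i] = quantize_to_grid(W[:, i], scale, zero)
        # Quantization error, normalized by the inverse-Hessian diagonal
        err = (W[:, i] - Q[:, i]) / H_inv[i, i]
        # Compensate on the not-yet-quantized columns
        W[:, i + 1:] -= np.outer(err, H_inv[i, i + 1:])
        # Remove column i from the inverse Hessian of the remaining weights
        H_inv[i + 1:, i + 1:] -= (
            np.outer(H_inv[i + 1:, i], H_inv[i, i + 1:]) / H_inv[i, i]
        )
    return Q

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))                                # layer weights
X = rng.normal(size=(16, 16)) @ rng.normal(size=(16, 256))  # correlated inputs

scale, zero = row_params(W)
Q_rtn = quantize_to_grid(W, scale[:, None], zero[:, None])  # plain rounding
Q_gptq = gptq_sketch(W, X)

print("round-to-nearest output error:", np.sum((W @ X - Q_rtn @ X) ** 2))
print("sketch with compensation error:", np.sum((W @ X - Q_gptq @ X) ** 2))
```

Comparing the two printed errors shows how much the compensation step helps on this toy layer. The benefit depends on how correlated the inputs are: with uncorrelated inputs the Hessian is close to diagonal and the update has little to redistribute.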

## Python code to implement GPTQ quantization for a model

In [None]:
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import os

def quantize_and_save_model():
    # Model ID for Qwen 0.5B
    model_id = "Qwen/Qwen2-0.5B-Instruct"
    
    # Output directory for the quantized model
    output_dir = "quantized_qwen"
    os.makedirs(output_dir, exist_ok=True)

    # Load tokenizer
    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

    # Prepare quantization config
    quantization_config = BaseQuantizeConfig(
        bits=4,  # Quantize to 4 bits
        group_size=128,  # One scale / zero point per group of 128 weights
        desc_act=True,  # Quantize columns in order of decreasing activation size ("act-order")
    )

    # Load the unquantized model in FP16 through AutoGPTQ.
    # from_pretrained expects the model ID (or a local path), not an
    # already-loaded transformers model object.
    print("Loading model in FP16...")
    gptq_model = AutoGPTQForCausalLM.from_pretrained(
        model_id,
        quantization_config,
        torch_dtype=torch.float16,
        trust_remote_code=True
    )

    # Prepare calibration dataset
    # Using a small sample of text for calibration; a real run would use many more samples
    examples = [
        "The quick brown fox jumps over the lazy dog.",
        "Machine learning is a subset of artificial intelligence.",
        "Python is a versatile programming language.",
    ]
    
    # Tokenize calibration data
    # quantize() expects a list of dicts with "input_ids" and "attention_mask" keys
    calibration_data = [
        tokenizer(text, truncation=True, max_length=512)
        for text in examples
    ]

    # Run quantization
    print("Starting quantization...")
    gptq_model.quantize(calibration_data)

    # Save quantized model
    print("Saving quantized model...")
    gptq_model.save_quantized(output_dir)
    
    # Save tokenizer
    tokenizer.save_pretrained(output_dir)
    
    print(f"Quantization complete! Model saved to: {output_dir}")
    return output_dir
 
output_dir = quantize_and_save_model()