# Choosing Quantization Range `[α, β]`

If the vector `V` represents the tensor to be quantized, the range `[α, β]` can be chosen with one of the following strategies:

## 1. Min-Max Quantization

- Set `α = min(V)` and `β = max(V)`
- Covers the full range of values in `V`

### Pros
- Simple to implement

### Cons
- **Sensitive to outliers**
- A single large outlier can stretch the range and reduce precision for the rest of the values

### Example
```
Original Tensor:        43.31 -44.93 0 … 38.48 -20.49 1000.00 -28.02
Dequantized (Min-Max):  45.08 -45.08 0 24.59 -45.08 -12.29 36.88 -20.49 999.85 -28.68
```

Note: The outlier (`1000`) is preserved, but most other values are quantized poorly.
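
To make the effect of the outlier concrete, the sketch below compares the width of one quantization step with and without the outlier. It assumes 8-bit quantization and uses only the values visible in the example above (the elided ones are left out). `step_size` is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np

def step_size(alpha: float, beta: float, bits: int = 8) -> float:
    """Width of one quantization step for the range [alpha, beta]."""
    return (beta - alpha) / (2**bits - 1)

# Only the values visible in the example above; the elided ones are omitted
v = np.array([43.31, -44.93, 0.0, 38.48, -20.49, 1000.0, -28.02])
inliers = v[v < 1000]  # the same values without the outlier

with_outlier = step_size(v.min(), v.max())
without_outlier = step_size(inliers.min(), inliers.max())

print(f"step with outlier:    {with_outlier:.3f}")
print(f"step without outlier: {without_outlier:.3f}")
```

With these values the step is roughly 4.10 with the outlier and roughly 0.35 without it, so every non-outlier value is rounded on a grid about twelve times coarser.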

Choosing `[α, β]` from the minimum and maximum of a tensor (such as a weight matrix) means every value falls inside the range, including rare extreme outliers. Those outliers stretch the range, so each quantization step becomes wider and the remaining values are represented less precisely.

Percentile-based quantization addresses this problem.

---

## 2. Percentile-Based Quantization

- Instead of using the min and max, choose a lower percentile (e.g., 0.1%) and upper percentile (e.g., 99.9%) of the data
- Helps to ignore outliers while preserving meaningful value distribution

For example, with the 1st and 99th percentiles, instead of
```math
\alpha = \min(V), \qquad \beta = \max(V)
```

we set
```math
\alpha = P_{1}(V), \qquad \beta = P_{99}(V)
```

where `P_k(V)` denotes the value at the k-th percentile of `V`.

This means:

- The lowest 1% of the values are treated as outliers and clipped to `α`
- The highest 1% of the values are treated as outliers and clipped to `β`
- The middle 98% is quantized over a tighter range, which gives better precision for most of the values
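
The range selection described above can be sketched with `np.percentile`. This is a minimal illustration, assuming a uniformly distributed tensor with a single outlier; the helper name `percentile_range`, the seed, and the data are hypothetical choices made for the example.

```python
import numpy as np

def percentile_range(v: np.ndarray, lower_pct: float = 1.0, upper_pct: float = 99.0):
    """Return (alpha, beta) as the lower and upper percentiles of v."""
    alpha = float(np.percentile(v, lower_pct))
    beta = float(np.percentile(v, upper_pct))
    return alpha, beta

rng = np.random.default_rng(0)           # arbitrary fixed seed
v = rng.uniform(-50, 150, size=10_000)   # illustrative data
v[-1] = 1000.0                           # a single outlier

alpha, beta = percentile_range(v)
clipped = np.clip(v, alpha, beta)        # values outside [alpha, beta] saturate
frac_clipped = float(np.mean((v < alpha) | (v > beta)))

print(f"alpha={alpha:.2f}, beta={beta:.2f}, clipped fraction={frac_clipped:.3f}")
```

About 2% of the values end up outside `[alpha, beta]`, and the outlier no longer influences the range.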


### Pros
- Less sensitive to extreme values
- Prevents outliers from dominating the quantization range
- Reduces quantization error for the majority of the values
- Useful when exact reconstruction of rare or extreme values matters less than accuracy for the common values

### Cons
- Clips extreme values, so outliers are not reconstructed accurately

### Example
```
Original Tensor:        43.31 -44.93 0 … 38.48 -20.49 1000.00 -28.02
Dequantized (Percentile): 43.38 -44.52 0 … 38.48 -20.49 50.00 -28.01
```

Note: The outlier (`1000`) is now clipped, but the remaining values are preserved more accurately.

---

## 3. Mean Squared Error (MSE) Optimized Range

- Choose `α, β` that **minimize the mean squared error** between original values and dequantized values

```math
\operatorname*{arg\,min}_{\alpha, \beta} \sum_i (V_i - \hat{V}_i)^2
```

- Typically solved using **grid search** over possible `[α, β]` pairs
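
A grid search of this kind might look like the sketch below. It is one possible formulation, not a reference implementation: the candidate grid (shrinking `α` and `β` from the min and max toward the median), the helper names, the 4-bit setting, and the normally distributed test data are all assumptions made for illustration. It also assumes the tensor is not constant, so that `α < β`.

```python
import numpy as np

def quantize_dequantize(v: np.ndarray, alpha: float, beta: float, bits: int) -> np.ndarray:
    """Asymmetric quantization over [alpha, beta], followed by dequantization."""
    levels = 2**bits - 1
    scale = (beta - alpha) / levels
    zero = -np.round(alpha / scale)
    q = np.clip(np.round(v / scale + zero), 0, levels)
    return (q - zero) * scale

def mse(v: np.ndarray, alpha: float, beta: float, bits: int) -> float:
    return float(np.mean((v - quantize_dequantize(v, alpha, beta, bits)) ** 2))

def mse_range_search(v: np.ndarray, bits: int = 8, steps: int = 20):
    """Try a grid of [alpha, beta] pairs and keep the one with the lowest MSE."""
    v_min, v_max, v_mid = float(v.min()), float(v.max()), float(np.median(v))
    best = (v_min, v_max, mse(v, v_min, v_max, bits))  # start from min-max
    # Candidates move inward from the extremes but never reach the median
    for alpha in np.linspace(v_min, v_mid, steps, endpoint=False):
        for beta in np.linspace(v_max, v_mid, steps, endpoint=False):
            err = mse(v, alpha, beta, bits)
            if err < best[2]:
                best = (float(alpha), float(beta), err)
    return best

rng = np.random.default_rng(0)          # arbitrary fixed seed
v = rng.normal(0.0, 1.0, size=10_000)   # illustrative data

alpha, beta, best_mse = mse_range_search(v, bits=4)
minmax_mse = mse(v, float(v.min()), float(v.max()), 4)
print(f"min-max MSE: {minmax_mse:.5f}")
print(f"best range:  [{alpha:.3f}, {beta:.3f}] with MSE {best_mse:.5f}")
```

Because the min-max pair is part of the grid, the result is never worse than min-max in terms of MSE. The cost grows with the square of `steps`, which is where the computational expense comes from.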

### Pros
- Optimizes for the most accurate reconstruction of values

### Cons
- Computationally expensive

---

## 4. Cross-Entropy Optimized Range (Used in LLMs)

- Used when the values in the tensor are not equally important, for example the logits passed through the Softmax layer in Large Language Models
- Most decoding strategies (greedy, Top-k, Top-p, beam search) depend on the relative order of the largest values, so the goal is to preserve the **ordering** and the resulting probability distribution after quantization, rather than every individual value

```math
\operatorname*{arg\,min}_{\alpha, \beta} \text{CrossEntropy}(\text{softmax}(V), \text{softmax}(\hat{V}))
```
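
The objective above can be sketched as a small search that replaces the squared error with the cross-entropy between the two softmax distributions. This is an illustrative sketch under stated assumptions, not the procedure of any particular framework: the candidate grid, the helper names, the `eps` guard inside the logarithm, the 4-bit setting, and the random logits are all hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def quantize_dequantize(v: np.ndarray, alpha: float, beta: float, bits: int) -> np.ndarray:
    """Asymmetric quantization over [alpha, beta], followed by dequantization."""
    levels = 2**bits - 1
    scale = (beta - alpha) / levels
    zero = -np.round(alpha / scale)
    q = np.clip(np.round(v / scale + zero), 0, levels)
    return (q - zero) * scale

def cross_entropy_loss(v, alpha, beta, bits, eps=1e-12) -> float:
    """Cross-entropy between softmax(v) and softmax of the dequantized v."""
    p = softmax(v)
    q = softmax(quantize_dequantize(v, alpha, beta, bits))
    return float(-np.sum(p * np.log(q + eps)))  # eps guards against log(0)

def cross_entropy_range_search(v: np.ndarray, bits: int = 8, steps: int = 20):
    v_min, v_max, v_mid = float(v.min()), float(v.max()), float(np.median(v))
    best = (v_min, v_max, cross_entropy_loss(v, v_min, v_max, bits))  # min-max baseline
    for alpha in np.linspace(v_min, v_mid, steps, endpoint=False):
        for beta in np.linspace(v_max, v_mid, steps, endpoint=False):
            loss = cross_entropy_loss(v, alpha, beta, bits)
            if loss < best[2]:
                best = (float(alpha), float(beta), loss)
    return best

rng = np.random.default_rng(0)               # arbitrary fixed seed
logits = rng.normal(0.0, 3.0, size=1_000)    # illustrative logits

alpha, beta, best_ce = cross_entropy_range_search(logits, bits=4)
minmax_ce = cross_entropy_loss(logits, float(logits.min()), float(logits.max()), 4)
print(f"min-max cross-entropy: {minmax_ce:.5f}")
print(f"best range: [{alpha:.3f}, {beta:.3f}] with cross-entropy {best_ce:.5f}")
```

Since softmax gives most of the probability mass to the largest logits, this loss penalizes errors on those values far more than errors on small ones, which matches the motivation given above.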

### Use Case
- Softmax layers in LLMs where decoding relies on relative magnitude (Top-k, Top-p, Beam search)

---

## Summary of Strategies

| Strategy        | Goal                             | Strength                          | Weakness                      |
|----------------|----------------------------------|-----------------------------------|-------------------------------|
| Min-Max         | Full coverage of values          | Simple, covers entire range       | Sensitive to outliers         |
| Percentile      | Focus on central distribution    | Robust to outliers                | Clips extreme values          |
| MSE Optimized   | Best numerical reconstruction    | High accuracy                     | Requires search               |
| Cross-Entropy   | Best ranking preservation        | Good for LLM inference            | Complex, needs task knowledge |

---


# Create a simple tensor with random items

In [6]:
import numpy as np

# Suppress scientific notation
np.set_printoptions(suppress=True)

# Generate randomly distributed parameters
params = np.random.uniform(low=-50, high=150, size=10000)

# Introduce an outlier
params[-1] = 1000

# Round each number to the second decimal place
params = np.round(params, 2)

# Print the parameters
print(params)

[  96.79  -30.04  144.33 ...   24.16   12.02 1000.  ]


# Define the quantization methods and quantize

## Compare min-max and percentile range selection strategies

In [7]:
def clamp(params_q: np.ndarray, lower_bound: int, upper_bound: int) -> np.ndarray:
    # Clip the values in place to the representable integer range
    params_q[params_q < lower_bound] = lower_bound
    params_q[params_q > upper_bound] = upper_bound
    return params_q

def asymmetric_quantization(params: np.ndarray, bits: int) -> tuple[np.ndarray, float, float]:
    # Min-max range: alpha is the smallest value, beta the largest
    alpha = np.min(params)
    beta = np.max(params)
    scale = (beta - alpha) / (2**bits-1)
    zero = -1*np.round(alpha / scale)
    lower_bound, upper_bound = 0, 2**bits-1
    quantized = clamp(np.round(params / scale + zero), lower_bound, upper_bound).astype(np.int32)
    return quantized, scale, zero

def asymmetric_quantization_percentile(params: np.ndarray, bits: int, percentile: float = 99.99) -> tuple[np.ndarray, float, float]:
    # Percentile range: alpha is the lower percentile, beta the upper percentile
    alpha = np.percentile(params, 100-percentile)
    beta = np.percentile(params, percentile)
    scale = (beta - alpha) / (2**bits-1)
    zero = -1*np.round(alpha / scale)
    lower_bound, upper_bound = 0, 2**bits-1
    # Values outside [alpha, beta] saturate at the bounds
    quantized = clamp(np.round(params / scale + zero), lower_bound, upper_bound).astype(np.int32)
    return quantized, scale, zero


def asymmetric_dequantize(params_q: np.ndarray, scale: float, zero: float) -> np.ndarray:
    return (params_q - zero) * scale

def quantization_error(params: np.ndarray, params_deq: np.ndarray) -> float:
    # MSE between the original and the dequantized values
    return np.mean((params - params_deq)**2)

(asymmetric_q, asymmetric_scale, asymmetric_zero) = asymmetric_quantization(params, 8)
(asymmetric_q_percentile, asymmetric_scale_percentile, asymmetric_zero_percentile) = asymmetric_quantization_percentile(params, 8)

print(f'Original:')
print(np.round(params, 2))
print('')
print(f'Asymmetric (min-max) scale: {asymmetric_scale}, zero: {asymmetric_zero}')
print(asymmetric_q)
print(f'')
print(f'Asymmetric (percentile) scale: {asymmetric_scale_percentile}, zero: {asymmetric_zero_percentile}')
print(asymmetric_q_percentile)

Original:
[  96.79  -30.04  144.33 ...   24.16   12.02 1000.  ]

Asymmetric (min-max) scale: 4.117529411764706, zero: 12.0
[ 36   5  47 ...  18  15 255]

Asymmetric (percentile) scale: 0.7844509882329367, zero: 64.0
[187  26 248 ...  95  79 255]


In [8]:
# Dequantize the parameters back to floating point
params_deq_asymmetric = asymmetric_dequantize(asymmetric_q, asymmetric_scale, asymmetric_zero)
params_deq_asymmetric_percentile = asymmetric_dequantize(asymmetric_q_percentile, asymmetric_scale_percentile, asymmetric_zero_percentile)

print(f'Original:')
print(np.round(params, 2))
print('')
print(f'Dequantized (min-max):')
print(np.round(params_deq_asymmetric,2))
print('')
print(f'Dequantized (percentile):')
print(np.round(params_deq_asymmetric_percentile,2))

Original:
[  96.79  -30.04  144.33 ...   24.16   12.02 1000.  ]

Dequantized (min-max):
[  98.82  -28.82  144.11 ...   24.71   12.35 1000.56]

Dequantized (percentile):
[ 96.49 -29.81 144.34 ...  24.32  11.77 149.83]


# Evaluate the quantization error (excluding the outlier)

In [9]:
# Calculate the quantization error
print(f'{"Error (min-max) excluding outlier: ":>40}{np.round(quantization_error(params[:-1], params_deq_asymmetric[:-1]),2)}')
print(f'{"Error (percentile) excluding outlier: ":>40}{np.round(quantization_error(params[:-1], params_deq_asymmetric_percentile[:-1]),2)}')

     Error (min-max) excluding outlier: 1.39
  Error (percentile) excluding outlier: 0.05
