In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Quantization
Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).

Reducing the number of bits means the resulting model requires less memory storage, consumes less energy (in theory), and can perform operations like matrix multiplication much faster with integer arithmetic. It also allows models to run on embedded devices, which sometimes only support integer data types.

![](https://miro.medium.com/v2/resize:fit:1400/1*hWIaIAQ7GWbrjfbaoUoYxw.jpeg)

**The two most common quantization cases are float32 -> float16 and float32 -> int8**

## Quantization to float16
Performing quantization to go from float32 to float16 is quite straightforward since both data types follow the same representation scheme. The questions to ask yourself when quantizing an operation to float16 are:

- Does my operation have a float16 implementation?
- Does my hardware support float16?
- Is my operation sensitive to lower precision? For instance, the value of epsilon in LayerNorm is usually very small (~1e-12), but the smallest positive normal value in float16 is ~6e-5, so epsilon can underflow to zero and cause NaN issues. The same applies to large values, which can overflow (the largest finite float16 value is 65504).
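
The precision issue in the last question can be checked directly. The following is a minimal sketch using NumPy's float16 type; the zero-variance input is an assumption chosen only to trigger the failure, not something taken from a real model.

```python
import numpy as np

fp16 = np.finfo(np.float16)
print("smallest positive normal float16:", fp16.tiny)
print("largest finite float16:", fp16.max)

# A LayerNorm-style epsilon of 1e-12 survives in float32
# but underflows to exactly zero in float16.
eps32 = np.float32(1e-12)
eps16 = np.float16(1e-12)
print("eps32 > 0:", eps32 > 0, "| eps16 == 0:", eps16 == 0)

# Assumed worst case: a zero-variance input. With epsilon gone,
# the normalization computes 0 / sqrt(0), which is NaN.
var = np.float16(0.0)
with np.errstate(divide="ignore", invalid="ignore", over="ignore"):
    normalized = np.float16(0.0) / np.sqrt(var + eps16)
    too_big = np.float16(70000.0)  # above the float16 maximum, becomes inf

print("normalized:", normalized, "| too_big:", too_big)
```

A common workaround is to keep such sensitive operations in float32 while the rest of the model runs in float16.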

## Quantization to int8
Performing quantization to go from float32 to int8 is trickier. Only 256 values can be represented in int8, while float32 can represent a very wide range of values. The idea is to find the best way to project our range [a, b] of float32 values to the int8 space.

Let’s consider a float x in [a, b], then we can write the following quantization scheme, also called the affine quantization scheme:

`x_q = round(x/S + Z)`

- x_q is the quantized int8 value associated to x
- S is the scale, and is a positive float32
- Z is called the zero-point. It is the int8 value corresponding to the value 0 in the float32 realm. Being able to represent 0 exactly is important because it is used everywhere throughout machine learning models.

And float32 values outside of the [a, b] range are clipped to the closest representable value, so for any floating-point number x:

`x_q = clip(round(x/S + Z), round(a/S + Z), round(b/S + Z))`
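
As a rough illustration, the clipped scheme above can be written in a few lines of NumPy. The range [a, b] = [-2, 6] and the way S and Z are derived from it are assumptions made for this sketch (the text above does not fix them); the function name `quantize_affine` is also hypothetical.

```python
import numpy as np

def quantize_affine(x, S, Z, a, b):
    """x_q = clip(round(x/S + Z), round(a/S + Z), round(b/S + Z))"""
    lo = np.round(a / S + Z)
    hi = np.round(b / S + Z)
    return np.clip(np.round(np.asarray(x) / S + Z), lo, hi).astype(np.int8)

# Assumed calibration range and parameter choice (not from the text above).
a, b = -2.0, 6.0
S = (b - a) / 255.0              # float32 units covered by one int8 step
Z = int(np.round(-128 - a / S))  # chosen so that a lands on -128

x = np.array([-5.0, -2.0, 0.0, 3.0, 6.0, 100.0])
x_q = quantize_affine(x, S, Z, a, b)

print("S =", S, "Z =", Z)
print("x   =", x)
print("x_q =", x_q)
```

Note that the float value 0.0 maps exactly to Z, and that -5.0 and 100.0 fall outside [a, b], so they are clipped to the int8 values of a and b.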

So, quantization in Machine Learning (ML) is the process of converting data in FP32 (32-bit floating point) to a lower precision like INT8 (8-bit integer), performing all critical operations like convolution in INT8, and at the end converting the lower-precision output back to higher precision in FP32.

![](https://miro.medium.com/v2/resize:fit:1024/1*UC2UOr0_TYFKJegNPtSEcQ.png)

## Questions are:
- How is accuracy recovered in quantization?
- How can doing the calculation in INT8 give the same result as doing it in FP32?

## Problem with Low Precision Formats?

As per the current state of research, we are struggling to maintain accuracy with INT4 and INT1, and the performance improvement with INT32 or FP16 is not significant.

- INT1 is an integer data type with only 1 bit, meaning it can represent only the values 0 and 1.
- INT4 is an integer data type with only 4 bits, meaning it can represent only 16 values (0 to 15 when unsigned).
- INT8 is an integer data type with 8 bits, meaning it can represent 256 values (-128 to 127 when signed).

INT1 and INT4 lose a lot of information during quantization. In the unsigned form described above they hold only non-negative values, so they also cannot capture direction (sign), and clipping loses still more information. That is why INT8 is preferred.


### **The most popular choice is: INT8**

When we do calculations in a particular data type (say INT8), we need another, wider data type to hold the results so that overflow is handled. This is known as the accumulation data type. For FP32, the accumulation type is FP32, but for INT8, the accumulation type is INT32.
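
To see why a wider accumulator is needed, here is a small sketch of an INT8 dot product computed two ways in NumPy. The vector length and values are arbitrary and chosen only to force overflow.

```python
import numpy as np

a = np.full(64, 100, dtype=np.int8)
b = np.full(64, 100, dtype=np.int8)

# Accumulating in int8: every product (100 * 100) and the running sum
# wrap around, so the result is meaningless.
wrapped = np.sum(a * b, dtype=np.int8)

# Accumulating in int32: widen the operands first, then multiply and add.
correct = np.dot(a.astype(np.int32), b.astype(np.int32))

print("int8 accumulation: ", int(wrapped))
print("int32 accumulation:", int(correct))
```

The inputs still occupy only 8 bits each; only the running sum is kept in 32 bits.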

This table gives an idea of the reduction in data size and the increase in math throughput for each data type, relative to FP32:

| Data Type | Accumulation | Math Power | Data Size Reduced |
|-----------|--------------|------------|-------------------|
| FP32      | FP32         | 1X         | 1X                |
| FP16      | FP16         | 8X         | 2X                |
| INT8      | INT32        | 16X        | 4X                |
| INT4      | INT32        | 32X        | 8X                |
| INT1      | INT32        | 128X       | 32X               |
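
The data size column is simple arithmetic: a value stored in N bits is 32/N times smaller than in FP32. A quick sketch, using a hypothetical model with 100 million parameters (the parameter count and the helper `size_mb` are assumptions for illustration only):

```python
# Hypothetical parameter count, chosen only for illustration.
NUM_PARAMS = 100_000_000

BITS_PER_VALUE = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4, "INT1": 1}

def size_mb(num_params, bits):
    """Storage needed for num_params values of the given bit width, in MB."""
    return num_params * bits / 8 / 1e6

for name, bits in BITS_PER_VALUE.items():
    reduction = 32 / bits
    print(f"{name}: {size_mb(NUM_PARAMS, bits):7.1f} MB ({reduction:g}X smaller than FP32)")
```

This only counts the stored values; scales and zero points add a small overhead that is ignored here.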


### There are two basic operations in Quantization:

- #### **Quantize**: Convert data to lower precision like INT8
- #### **Dequantize**: Convert data to higher precision like FP32

## Quantize

We have to convert data in the range [A1, A2] to the range of a B-bit integer (INT8 in our case).

Hence, the problem is to map all elements in the range [A1, A2] to the range [-2^(B-1), 2^(B-1) - 1], which is [-128, 127] for INT8. Elements outside the range [A1, A2] will be clipped to the nearest bound.

There are two main types of Range mapping in Quantization:

- Affine quantization
- Scale quantization

![](https://miro.medium.com/v2/resize:fit:1400/1*c18I2HGMvv6ijsE_dwGkDg.png)

### Affine Quantization
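In Affine Quantization, the mapping is `f(x) = s * x + z`. Here the scale factor s multiplies x, so it is the reciprocal of the scale S in the `x/S + Z` form of the scheme. The parameters s and z are as follows:

`s = (2^B - 1)/(A2 - A1)`

`z = -(ROUND(A1 * s)) - 2^(B-1)`

For INT8, s and z are as follows:

`s = (255)/(A2 - A1)`

`z = -(ROUND(A1 * s)) - 128`

With these values, A1 maps to -128 and A2 maps to 127.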

Once you convert all the input data using the above equation, you get the quantized data. Some values in this data may be out of range. To bring them into range, we need another operation, "Clip", which maps every value outside the range to the nearest limit.

The Clip operation is as follows:

`clip(x, l, u) = x   ... if x is within [l, u]`

`clip(x, l, u) = l   ... if x < l`

`clip(x, l, u) = u   ... if x > u`

In the above equation, l is the lower limit in the quantization range while u is the upper limit in the quantization range.

So, the overall equation for Quantization in Affine Quantization is:


`x_quantize = quantize(x, B, s, z)`

`x_quantize = clip(round(s * x + z), −2^(B−1), 2^(B−1) − 1)`

For dequantization, the equation in Affine Quantization is:

`x_dequantize = dequantize(x_quantize, s, z) = (x_quantize − z) / s`
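
The quantize and dequantize equations can be tried end to end with NumPy. This is a minimal sketch, assuming s = (2^B - 1)/(A2 - A1) and z = -round(A1 * s) - 2^(B-1), so that A1 maps to -2^(B-1) and A2 maps to 2^(B-1) - 1. The range [-2, 6] and the function names are hypothetical.

```python
import numpy as np

B = 8
Q_MIN, Q_MAX = -2 ** (B - 1), 2 ** (B - 1) - 1  # int8 limits

def affine_params(a1, a2):
    """Scale and zero point that map [a1, a2] onto [Q_MIN, Q_MAX]."""
    s = (2 ** B - 1) / (a2 - a1)
    z = -int(np.round(a1 * s)) - 2 ** (B - 1)
    return s, z

def quantize(x, s, z):
    """clip(round(s * x + z), Q_MIN, Q_MAX)"""
    return np.clip(np.round(s * np.asarray(x) + z), Q_MIN, Q_MAX).astype(np.int8)

def dequantize(x_q, s, z):
    """(x_q - z) / s, computed in floating point."""
    return (x_q.astype(np.float64) - z) / s

a1, a2 = -2.0, 6.0  # assumed range of the input data
s, z = affine_params(a1, a2)

x = np.linspace(a1, a2, 9)
x_q = quantize(x, s, z)
x_hat = dequantize(x_q, s, z)
max_error = np.max(np.abs(x - x_hat))

print("s =", s, "z =", z)
print("x_q   =", x_q)
print("x_hat =", np.round(x_hat, 4))
print("max round-trip error:", max_error, "| half a step:", 0.5 / s)
```

For values inside [A1, A2], the round-trip error is bounded by half a quantization step, 0.5 / s. Values outside the range are clipped and can be arbitrarily far off.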

![](https://apple.github.io/coremltools/docs-guides/_images/quantization-technique.png)

### Scale quantization
The difference in Scale Quantization (in comparison to Affine Quantization) is that in this case, the zero point (z) is set to 0 and does not play a role in the equations. We use the scale factor (s) in the calculations of Scale Quantization.

We use the following equation:

`F(x) = s.x`

There are many variants of Scale Quantization and the simplest is Symmetric Quantization. In this variant, the resulting range is symmetric. For INT8, the range will be [-127, 127]. Note that we are not considering -128 in the calculations.

Hence, we will quantize data from the range [-A1, A1] to [-(2^(B-1) - 1), 2^(B-1) - 1]. The equation for the scale is:

`s = (2^(B - 1) − 1) / A1`

Note, s is the scale factor.

The overall equation is:

`x_quantize = quantize(x, B, s)` 

`x_quantize = clip(round(s * x), −2^(B - 1) + 1, 2^(B - 1) − 1)`

The equation for dequantization will be:

`x_dequantize = dequantize(x_quantize, s) = x_quantize / s`
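
A minimal NumPy sketch of symmetric quantization follows. The bound A1 = 1.0 is an assumption for this example; in practice it is often taken from the largest absolute value seen in the data.

```python
import numpy as np

B = 8
Q_MAX = 2 ** (B - 1) - 1  # 127; -128 is deliberately left unused

def symmetric_scale(a1):
    """s = (2^(B-1) - 1) / A1"""
    return Q_MAX / a1

def quantize(x, s):
    """clip(round(s * x), -Q_MAX, Q_MAX)"""
    return np.clip(np.round(s * np.asarray(x)), -Q_MAX, Q_MAX).astype(np.int8)

def dequantize(x_q, s):
    """x_q / s"""
    return x_q.astype(np.float64) / s

a1 = 1.0  # assumed bound: the data is expected to lie in [-1, 1]
s = symmetric_scale(a1)

x = np.array([-1.0, -0.25, 0.0, 0.25, 1.0, 3.0])
x_q = quantize(x, s)
x_hat = dequantize(x_q, s)

print("s =", s)
print("x_q   =", x_q)
print("x_hat =", np.round(x_hat, 4))
```

Because there is no zero point, 0.0 always maps to 0, and the last value (3.0) is outside [-A1, A1], so it is clipped to 127.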

The input data is multi-dimensional. To quantize it, we can use the same scale value and zero point for the entire tensor, or use different parameters for each dimension or for each 1D slice (such as a row, column, or channel) along a dimension.

![](https://developer-blogs.nvidia.com/wp-content/uploads/2021/07/8-bit-signed-integer-quantization.png)

This process of grouping the data for the quantization and dequantization process is known as Quantization Granularity.

Common choices of Quantization Granularity are:

- Per channel for 3D input
- Per row or Per column for 2D input
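
The effect of granularity can be seen by comparing one scale for a whole 2D matrix against one scale per row. This is a sketch under assumptions: the matrix is random, its rows are given very different magnitudes on purpose, and symmetric INT8 quantization with a max-absolute-value bound is used. The helper name `fake_quantize` is hypothetical.

```python
import numpy as np

Q_MAX = 127  # symmetric INT8 range [-127, 127]

def fake_quantize(w, amax):
    """Quantize symmetrically with scale Q_MAX / amax, then dequantize."""
    s = Q_MAX / amax
    w_q = np.clip(np.round(w * s), -Q_MAX, Q_MAX)
    return w_q / s

# Assumed data: 4 rows whose typical magnitudes differ by factors of 10.
rng = np.random.default_rng(0)
row_magnitudes = np.array([[0.01], [0.1], [1.0], [10.0]])
W = rng.normal(size=(4, 16)) * row_magnitudes

# Per tensor: one scale for the whole matrix.
per_tensor = fake_quantize(W, np.abs(W).max())

# Per row: one scale per row (shape (4, 1), broadcast across columns).
per_row = fake_quantize(W, np.abs(W).max(axis=1, keepdims=True))

err_tensor = np.abs(W - per_tensor).mean(axis=1)
err_row = np.abs(W - per_row).mean(axis=1)

for i in range(W.shape[0]):
    print(f"row {i}: per-tensor error {err_tensor[i]:.6f}, per-row error {err_row[i]:.6f}")
```

With a single scale, the rows with small values are dominated by the largest row and mostly collapse to zero. Per-row scales avoid this at the cost of storing one scale per row.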