<a href="https://colab.research.google.com/github/swha815/Paper_List-GPGPU/blob/master/quant_01_general.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<H1>Neural Network Quantization</H1>

Porting floating-point (single-precision) numbers to integers (16 bits or less)


## I. Characteristics

### Advantages

- Smaller memory footprint
- Faster computation (roughly 5X)

### Disadvantages

- Loss of critical information when not done properly
- Extra burden of value conversion
  - Weights
    - Inference: can be pre-processed and used without modification during run-time
    - Training: must be quantized with every update (not true for Kahan summation-based methods)
  - Activations: the input to the NN must be quantized and the output de-quantized

## II. Taxonomy of Methods

### Symmetry

#### Symmetric

The distribution of the target values is assumed to be centered around zero, with $-m \le x \le m$.

![Symmetric](https://raw.githubusercontent.com/swha815/colab/main/img/symmetric.jpg)
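
The symmetric scheme can be sketched in a few lines of plain Python. The function names `quantize_symmetric` and `dequantize`, the 8-bit default, and the choice of $m$ as the largest absolute value are assumptions made for illustration only.

```python
def quantize_symmetric(values, bits=8):
    """Symmetric linear quantization to signed integers (illustrative sketch)."""
    qmax = 2 ** (bits - 1) - 1             # 127 for 8 bits
    m = max(abs(v) for v in values)        # assumed clipping threshold m
    scale = m / qmax if m > 0 else 1.0     # real value of one integer step
    # Python's round() rounds exact ties to the nearest even integer.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integers back to real values by multiplying by the scale factor."""
    return [qi * scale for qi in q]


weights = [-0.5, -0.1, 0.0, 0.2, 0.5]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
print(q)   # [-127, -25, 0, 51, 127]
```

Zero maps exactly to integer 0, and every in-range value is reproduced to within half a step (`scale / 2`).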

#### Unsigned Symmetric

Activations after ReLU are always non-negative ($0 \le x$). The bit otherwise spent on the sign can therefore be used for magnitude, so the positive range is covered twice as densely by the discrete (quantized) integer representations.

![Symmetric-unsigned](https://raw.githubusercontent.com/swha815/colab/main/img/symmetric-unsigned.jpg)
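
A small sketch of the unsigned variant, assuming 8 bits and a threshold $m$ taken from the largest activation; `quantize_unsigned` is a hypothetical helper name. It also compares the step size with the signed case for the same $m$.

```python
def quantize_unsigned(values, bits=8):
    """Unsigned linear quantization for non-negative inputs (illustrative sketch)."""
    qmax = 2 ** bits - 1                   # 255 for 8 bits: no sign bit needed
    m = max(values)
    scale = m / qmax if m > 0 else 1.0
    q = [max(0, min(qmax, round(v / scale))) for v in values]
    return q, scale


activations = [0.0, 0.2, 0.6, 1.0]         # e.g. outputs of a ReLU
q_u, scale_u = quantize_unsigned(activations)
print(q_u)                                  # [0, 51, 153, 255]

# Step size of a signed 8-bit code covering the same m = 1.0
scale_signed = max(activations) / (2 ** 7 - 1)
print(scale_signed / scale_u)               # close to 2: about twice as dense
```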

#### Asymmetric

Histograms of layer parameters almost always show that the kernel values are not centered around zero. Hence, asymmetric quantization represents these values better. However, an asymmetric weight quantization scheme requires partial sums of the feature maps to be calculated at run-time, requiring extra HW or increasing latency.

![Asymmetric](https://raw.githubusercontent.com/swha815/colab/main/img/asymmetric.jpg)
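
The run-time cost mentioned above can be made concrete with a sketch. It assumes the common zero-point formulation of asymmetric quantization, $x \approx \Delta (q - z)$; the helper names and the sample values are hypothetical. Expanding the integer dot product shows the extra term that depends on the sum of the input feature map.

```python
def quantize_asymmetric(values, bits=8):
    """Asymmetric linear quantization with a zero-point (illustrative sketch)."""
    qmin, qmax = 0, 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = round(qmin - lo / scale)  # integer that represents real 0.0
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def int_dot_asymmetric(q_w, zp_w, q_x):
    """sum((q_w - zp_w) * q_x) == sum(q_w * q_x) - zp_w * sum(q_x)."""
    main = sum(w * x for w, x in zip(q_w, q_x))
    correction = zp_w * sum(q_x)           # partial sum of the input, only known at run-time
    return main - correction


kernel = [-1.0, 0.0, 0.5, 1.55]            # not centered around zero
q_w, scale_w, zp_w = quantize_asymmetric(kernel)
q_x = [3, 1, 4, 1]                         # already-quantized activations (assumed)
print(q_w, zp_w)                           # [0, 100, 150, 255] 100
print(int_dot_asymmetric(q_w, zp_w, q_x))  # 55
```

With a symmetric weight scheme the zero-point is 0 and the correction term vanishes, which is why the symmetric case needs no extra partial sum.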

### Linearity

#### Linear

In this most common scheme, equally spaced discrete values are assigned to the numbers to be represented. Quantization and de-quantization can be achieved by simply multiplying by a scale factor. 

#### Non-Linear, but Continuous

Schemes such as exponential/log quantization assign discrete values whose spacing gradually increases or decreases. As most of the values to be represented are concentrated near zero, assigning more quantization levels to that region is a straightforward way to minimize quantization noise. However, quantization/de-quantization requires evaluating a power function.
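
As one possible instance of this family, the sketch below rounds in the base-2 log domain so that every non-zero value becomes a signed power of two. The 3-bit exponent field, the assumption that inputs satisfy $|x| \le 1$, and the use of sign 0 to encode an exact zero are choices made for illustration only.

```python
import math


def quantize_log2(values, exp_bits=3):
    """Encode each value as (sign, exponent) with value ~ sign * 2**exponent."""
    e_min = -(2 ** exp_bits - 1)           # smallest exponent, -7 for 3 bits
    codes = []
    for v in values:
        if v == 0:
            codes.append((0, e_min))       # sign 0 marks an exact zero
            continue
        e = round(math.log2(abs(v)))       # rounding happens in the log domain
        e = max(e_min, min(0, e))          # clamp to the representable exponents
        codes.append((1 if v > 0 else -1, e))
    return codes


def dequantize_log2(codes):
    """De-quantization needs a power function (a shift in integer HW)."""
    return [s * 2.0 ** e for s, e in codes]


values = [0.5, -0.3, 0.02, 0.0, 0.9]
codes = quantize_log2(values)
print(dequantize_log2(codes))              # [0.5, -0.25, 0.015625, 0.0, 1.0]
```

The representable levels (1, 1/2, 1/4, ...) crowd toward zero, which is the non-uniform spacing described above.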

#### Look-up Table

This method has the greatest potential to minimize quantization noise, since the representable values can be placed arbitrarily. However, a scatter/gather operation is required prior to the dot-product calculation.
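
A minimal sketch of table-based quantization, assuming a codebook is already available (for example from clustering; the values here are made up). Quantization stores the index of the nearest entry, and de-quantization is a gather from the table.

```python
def quantize_lut(values, codebook):
    """Replace each value by the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda i: abs(v - codebook[i]))
            for v in values]


def dequantize_lut(indices, codebook):
    """Gather: look every index up in the table before the dot product."""
    return [codebook[i] for i in indices]


codebook = [-0.4, -0.05, 0.0, 0.05, 0.4]   # hypothetical 5-entry table
values = [-0.38, 0.01, 0.07, 0.35, -0.04]
indices = quantize_lut(values, codebook)
print(indices)                             # [0, 2, 3, 4, 1]
print(dequantize_lut(indices, codebook))   # [-0.4, 0.0, 0.05, 0.4, -0.05]
```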

### Granularity

#### Fixed-Point

A single shared exponent is used across the whole network. Calculations across channels and layers are therefore consistent, which makes this scheme the simplest to implement in HW.

#### Dynamic Fixed-Point

Each channel or layer has its own shared exponent, which is meaningful only within that channel/layer. In channel-wise quantization, the dot-product adds products with different exponents and therefore requires special consideration (the simplest solution is to place shifters immediately before the addition).
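
The shifter idea can be sketched as follows. The helpers `shared_exponent`, `to_fixed`, and `add_aligned` are hypothetical, exponents are assumed to be powers of two, and the activations are assumed to be plain integers so that only the weight exponents differ.

```python
import math


def shared_exponent(values, bits=8):
    """Smallest e such that max|v| / 2**e still fits in a signed `bits`-bit integer."""
    qmax = 2 ** (bits - 1) - 1
    m = max(abs(v) for v in values)
    return math.ceil(math.log2(m / qmax)) if m > 0 else 0


def to_fixed(values, exponent):
    """Integer mantissas for a given shared exponent."""
    return [round(v / 2.0 ** exponent) for v in values]


def add_aligned(acc_a, exp_a, acc_b, exp_b):
    """Shift both partial sums to the smaller exponent, then add."""
    e = min(exp_a, exp_b)
    return (acc_a << (exp_a - e)) + (acc_b << (exp_b - e)), e


ch_a, ch_b = [0.5, -0.25], [4.0, 1.0]      # two channels with different ranges
exp_a, exp_b = shared_exponent(ch_a), shared_exponent(ch_b)   # -7 and -4
q_a, q_b = to_fixed(ch_a, exp_a), to_fixed(ch_b, exp_b)       # [64, -32], [64, 16]

acc_a = sum(w * x for w, x in zip(q_a, [3, 2]))   # 128, i.e. 1.0 at exponent -7
acc_b = sum(w * x for w, x in zip(q_b, [1, 2]))   # 96, i.e. 6.0 at exponent -4
total, exp_total = add_aligned(acc_a, exp_a, acc_b, exp_b)
print(total, exp_total, total * 2.0 ** exp_total)  # 896 -7 7.0
```

Adding `acc_a` and `acc_b` without the shift would mix two different scales and give a meaningless result.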

#### Block Fixed-Point

A pre-defined number of values (a block) shares one exponent; the block size is usually tied to the HW's computation unit. Appending shifters to the inputs of the multiplier or accumulator is usually required.
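
A rough sketch of block-wise exponent sharing, assuming a block size of 4 and power-of-two exponents; `block_fixed_point` is a hypothetical helper, and a real block size would follow the width of the HW computation unit.

```python
import math


def block_fixed_point(values, block_size=4, bits=8):
    """Split values into blocks; each block gets one exponent plus integer mantissas."""
    qmax = 2 ** (bits - 1) - 1
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        m = max(abs(v) for v in block)
        # Smallest exponent for which the largest magnitude still fits in qmax.
        exponent = math.ceil(math.log2(m / qmax)) if m > 0 else 0
        mantissas = [round(v / 2.0 ** exponent) for v in block]
        blocks.append((exponent, mantissas))
    return blocks


values = [0.5, -0.25, 0.125, 0.0, 8.0, 4.0, -2.0, 1.0]
blocks = block_fixed_point(values)
print(blocks)   # [(-7, [64, -32, 16, 0]), (-3, [64, 32, -16, 8])]
```

With one exponent for all eight values (here -3), the first block would shrink to mantissas `[4, -2, 1, 0]`, leaving far fewer significant bits for the small values.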

#### Low-Precision Floating-Point

Each value carries its own dedicated exponent. Computation requires floating-point units with lower precision than FP32. However, the limited number of exponent bits can cause severe accuracy degradation from overflow/underflow.

## III. Quantifying the Effect of Quantization

### MSE

Mean-squared-error (MSE) is commonly used to measure the difference between two signals due to its simplicity.

\begin{equation}
MSE = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2
\tag{1}
\end{equation}

A major pitfall of using MSE to quantify quantization loss is that it ignores the overall magnitude of the values, which prevents direct comparison across layers/channels.
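
The pitfall can be illustrated with two made-up layers: one with small values that are all flattened to zero, and one with large values that carry an error of at most about 1%. The data is hypothetical and chosen only to make the effect visible.

```python
def mse(x, x_hat):
    """Mean squared error between original and quantized values, as in Eq. (1)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


small = [0.01, -0.02, 0.03]
small_hat = [0.0, 0.0, 0.0]        # every value collapsed to zero: all information lost

large = [10.0, -20.0, 30.0]
large_hat = [10.1, -20.1, 30.1]    # relative error of about 1% or less

print(mse(small, small_hat) < mse(large, large_hat))   # True
```

Judged by MSE alone, the layer that lost everything looks better than the layer that is almost intact.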

### PSNR
Peak-signal-to-noise-ratio (PSNR) is another popular metric.

\begin{equation}
PSNR = 20 \times \log_{10} (\hat{x}_{max}) - 10 \times \log_{10} (MSE)
\tag{2}
\end{equation}

Unfortunately, PSNR can be deceptively high when the quantization range is overestimated.
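
One way to see this effect is sketched below, under the assumption that the peak term in Eq. (2) is taken as the top of the quantization range $m$ (one possible reading of $\hat{x}_{max}$). The same signal in $[-1, 1]$ is quantized once with $m = 1$ and once with a range overestimated by a factor of 100.

```python
import math


def quantize_dequantize(values, m, bits=8):
    """Symmetric linear quantization to the range [-m, m], returned as floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = m / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mse(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def psnr_db(peak, mse_value):
    """Eq. (2) with the peak passed in explicitly (assumption: peak = m)."""
    return 20 * math.log10(peak) - 10 * math.log10(mse_value)


signal = [i / 100 for i in range(-100, 101)]           # values in [-1, 1]

mse_tight = mse(signal, quantize_dequantize(signal, m=1.0))
mse_wide = mse(signal, quantize_dequantize(signal, m=100.0))

print(psnr_db(1.0, mse_tight), psnr_db(100.0, mse_wide))   # within a few dB of each other
print(mse_wide / mse_tight)                                # several thousand times larger
```

The overestimated range leaves only three usable levels for the signal, yet the larger peak term cancels the larger MSE and the PSNR barely moves.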

### SQNR

The traditional signal-to-quantization-noise ratio (SQNR) measure is defined as follows, where $v$ is the bit-width and $x_{max}$ is the largest representable magnitude:

\begin{equation}
SQNR = \frac{P_{signal}}{P_{noise}} = \frac{E[x^2]}{E[\tilde{x}^2]} = \frac{\int x^2f(x)\mathrm{d}x}{\frac{x^2_{max}}{3\times4^v}} = \frac{3 \times 4^v \times {\int x^2f(x)\mathrm{d}x}}{x^2_{max}}
\tag{3}
\end{equation}

Assuming that quantization is symmetric and linear and that the quantization noise is additive and uniformly distributed, the noise power $P_{noise}$ can be approximated as follows:

\begin{equation}
\begin{aligned}
P_{noise} &= \int^{-m}_{-\infty} x^2 f(x) \mathrm{d}x + \int^m_{-m} (x-\hat{x})^2 f(x) \mathrm{d}x + \int^\infty_m x^2 f(x) \mathrm{d}x \\
&= 2 \int^\infty_m x^2 f(x) \mathrm{d}x + \int^m_{-m} (x-\hat{x})^2 f(x)\mathrm{d}x
\end{aligned}
\tag{4}
\end{equation}

where $m = \Delta \times 2^{b-1}$, with $\Delta$ the quantization step size and $b$ the bit-width.

Once an appropriate distribution $f(x)$ has been found, the minimum value of $P_{noise}$ can be determined for a given bit-width.

SQNR is a nice measurement in that, unlike MSE, it can be used directly amongst layers/channels in a neural network to compare the level of deterioration from quantization.
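
A small sketch of this property, using an empirical SQNR (signal power over noise power, in dB) on made-up data. Layer B is simply layer A scaled by 64; both are quantized symmetrically to 8 bits over their own range. The helper names are hypothetical, and the last function evaluates Eq. (3) with $v = 8$ for comparison.

```python
import math


def quantize_dequantize(values, bits=8):
    """Symmetric linear quantization over the layer's own range, returned as floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mse(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def sqnr_db(x, x_hat):
    """Empirical SQNR: mean signal power divided by mean noise power, in dB."""
    signal_power = sum(a * a for a in x) / len(x)
    return 10 * math.log10(signal_power / mse(x, x_hat))


def sqnr_theory_db(x, bits=8):
    """Eq. (3): 3 * 4**v * E[x^2] / x_max^2, in dB."""
    x_max = max(abs(a) for a in x)
    power = sum(a * a for a in x) / len(x)
    return 10 * math.log10(3 * 4 ** bits * power / x_max ** 2)


layer_a = [i / 100 for i in range(-100, 101)]
layer_b = [64 * v for v in layer_a]        # same shape, 64x larger magnitude
hat_a, hat_b = quantize_dequantize(layer_a), quantize_dequantize(layer_b)

print(mse(layer_b, hat_b) / mse(layer_a, hat_a))        # about 4096: MSE is not comparable
print(sqnr_db(layer_a, hat_a), sqnr_db(layer_b, hat_b)) # equal: SQNR is comparable
print(sqnr_theory_db(layer_a))                          # close to the empirical value
```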

## IV. Typical Process of NN Quantization

### Prerequisite

1. Review the target HW architecture
1. Identify where MSB/LSB truncation error (round-off and clamping) occurs
1. Discard negligible truncation error
1. Consider techniques to minimize truncation error (e.g. BatchNorm folding)
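
BatchNorm folding, mentioned in the last step, can be sketched for a single output channel as follows. It uses the standard inference-time BatchNorm formula $y = \gamma (z - \mu) / \sqrt{\sigma^2 + \epsilon} + \beta$; the function name and the sample numbers are hypothetical.

```python
import math


def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold one output channel's BatchNorm parameters into the preceding layer."""
    factor = gamma / math.sqrt(var + eps)
    folded_weights = [w * factor for w in weights]
    folded_bias = (bias - mean) * factor + beta
    return folded_weights, folded_bias


# Hypothetical layer and BatchNorm statistics for one output channel
weights, bias = [0.5, -0.25], 0.1
gamma, beta, mean, var = 2.0, 0.3, 0.05, 0.04
x = [1.0, 2.0]

# Reference: layer followed by BatchNorm
z = sum(w * xi for w, xi in zip(weights, x)) + bias
reference = gamma * (z - mean) / math.sqrt(var + 1e-5) + beta

# Folded: a single layer with rescaled weights and bias
fw, fb = fold_batchnorm(weights, bias, gamma, beta, mean, var)
folded = sum(w * xi for w, xi in zip(fw, x)) + fb
print(abs(reference - folded) < 1e-12)     # True
```

After folding, only the merged weights and bias need to be quantized, which removes one place where truncation error can occur.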

### Weights

1. **Profile** and collect statistics from kernel
1. **Analyze** gathered information
1. **Decide** quantization strategy
   - Granularity: layer-wise, output channel-wise, full channel-wise, etc.
   - Symmetry: symmetric or asymmetric
   - Step-Size: uniform or non-uniform
1. **Simulate** HW kernel store by replacing original weights with quantized weights
   - Inference-only: quantize during pre-process stage
   - Training: quantize after every update
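
The profile, analyze, decide, and simulate steps for weights can be sketched on a toy kernel with two output channels of very different magnitude. The channel names, values, and the use of max-abs as the only profiled statistic are assumptions; SQNR is used here to compare a layer-wise range against a channel-wise one.

```python
import math


def fake_quant(values, m, bits=8):
    """Replace weights by their quantized values for the range [-m, m]."""
    qmax = 2 ** (bits - 1) - 1
    scale = m / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def sqnr_db(x, x_hat):
    noise = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return 10 * math.log10(sum(a * a for a in x) / noise)


kernel = {
    "ch0": [0.9, -0.7, 0.5, -0.3],
    "ch1": [0.009, -0.007, 0.005, -0.003],
}

# Profile: collect a statistic per output channel (here only max |w|)
stats = {name: max(abs(w) for w in ws) for name, ws in kernel.items()}
layer_max = max(stats.values())

# Analyze and decide: compare both granularities per channel
report = {}
for name, ws in kernel.items():
    report[name] = {
        "layer_wise": sqnr_db(ws, fake_quant(ws, layer_max)),
        "channel_wise": sqnr_db(ws, fake_quant(ws, stats[name])),
    }

# Simulate: store the quantized weights chosen by the decision
quantized_kernel = {name: fake_quant(ws, stats[name]) for name, ws in kernel.items()}
print(report)
```

In this toy case the small channel loses tens of dB under a shared layer-wise range, while the large channel is unaffected by the choice.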

### Activations

1. **Profile** and collect statistics from where truncation error is likely to occur (e.g. the output feature map)
1. **Analyze** gathered information and try to find/fit an appropriate distribution
1. **Decide** quantization strategy (dependent on HW architecture)
   - Granularity: layer-wise, channel-wise, etc.
   - Symmetry: symmetric or asymmetric
   - Step-Size: uniform (linear) or non-uniform (quadratic or LUT-based)
1. **Simulate** integer-based HW with _fake_ quantization layer
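
A framework-independent sketch of a fake quantization layer: data stays in floating point but is snapped to the grid an integer datapath could represent. The class name `FakeQuant`, the max-abs calibration, and the symmetric 8-bit layer-wise setting are assumptions, not the API of any particular library.

```python
class FakeQuant:
    """Quantize-dequantize layer: float in, float out, values restricted to the integer grid."""

    def __init__(self, bits=8):
        self.qmax = 2 ** (bits - 1) - 1
        self.max_abs = 0.0

    def observe(self, values):
        # Profile step: track the largest magnitude seen during calibration.
        self.max_abs = max(self.max_abs, max(abs(v) for v in values))

    def __call__(self, values):
        scale = self.max_abs / self.qmax if self.max_abs > 0 else 1.0
        out = []
        for v in values:
            q = max(-self.qmax, min(self.qmax, round(v / scale)))  # round-off and clamping
            out.append(q * scale)                                  # back to float
        return out


fq = FakeQuant(bits=8)
for batch in ([0.1, -0.8], [1.27, 0.3]):   # hypothetical calibration feature maps
    fq.observe(batch)

print(fq([0.004, 0.126, 2.0]))             # roughly [0.0, 0.13, 1.27]
```

The first value shows LSB round-off (it falls to zero), and the last shows MSB clamping at the calibrated maximum.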