# Basics of Reduced Precision and Quantization


## Summary

This notebook demonstrates the development of quantized models for 4-bit inferencing using our [model optimization library](https://pypi.org/project/fms-model-optimizer/). [FMS Model Optimizer](https://github.com/foundation-model-stack/fms-model-optimizer/) is a Python framework for developing reduced-precision neural network models. It provides state-of-the-art quantization techniques, together with automated tools for applying them in PyTorch environments for Quantization-Aware Training (QAT) of popular deep learning workloads. The resulting low-precision models can be deployed on GPUs or other accelerators.

We will demonstrate the following:
- How input data can be quantized
- How quantization can be applied to a convolution layer
- How to automate the quantization process
- How a quantized convolution layer performs on a typical image

FMS Model Optimizer can be applied across a variety of computer vision and natural language processing tasks to speed up inferencing, reduce power requirements, and reduce model size, while maintaining comparable model accuracy.

## Table of Contents

* <a href="#fms_mo_quantizer">Step 1. Quantize a normal data distribution</a>
    * <a href="#fms_mo_import">Import code libraries</a>
    * <a href="#geninput">Generate input data</a>
    * <a href="#clip">Clip input data </a>
    * <a href="#quant">Scale, shift, and quantize data</a>
    * <a href="#dequant">Dequantize data </a>
* <a href="#conv">Step 2. Quantize a convolution layer</a>
    * <a href="#3p2">Generate input data</a>
    * <a href="#3p3">Quantize input data</a>
    * <a href="#3p4">Create a single layer convolution network</a>
    * <a href="#3p5">Generate weights and bias</a>
    * <a href="#3p6">Quantize weights</a>
    * <a href="#3p7">Feed quantized data, weights, and bias into convolution layer</a>
* <a href="#fms_mo">Step 3. Use FMS Model Optimizer to automate quantization</a>
* <a href="#fms_mo_visual">Step 4. Try a convolution layer on a quantized image</a>
* <a href="#fms_mo_conclusion">Conclusion</a>
* <a href="#fms_mo_learn">Learn more</a>

<a id="fms_mo_quantizer"></a>
    
## Step 1. Quantize a normal data distribution

In this section we show how quantization works using a randomly generated normal distribution of input data. We will feed the input data to a quantizer and show the quantized output.

The quantizer can be summarized by the following equations: <br><br>
<font size=4>
$y_{int} = \lfloor \frac{clamp(y, \alpha_l, \alpha_u) - zp}{\Delta} \rceil$<br>
$y_q = y_{int} \times \Delta + zp$
</font>

where:
<br>
$y$ = input data<br>
$y_{int}$ = $y$ transformed and scaled into integer space, e.g. [0, 1, 2, ...]<br>
$y_q$ = quantized output from the quantizer<br>
$\alpha_l$, $\alpha_u$ = lower and upper clip bound for $y$, $\in \mathbb{R}$, e.g. [-1.5, 1.5]. Often referred to as clip_vals<br> 
$\Delta$ = stepsize between each quantized bin<br>
$zp$ = zero point: consider it as clip_min ($\alpha_l$) in this scenario <br><br>


These two equations are commonly known as quantization and dequantization, respectively. 
We will walk through these equations step-by-step in the cells below.
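
As a quick sanity check of the two equations, here is a minimal sketch that applies them to four hand-picked values. The 2-bit width and the clip bounds of [-1.5, 1.5] are assumptions chosen only to keep the numbers small; they are not the settings used in the rest of this notebook.

```python
import numpy as np

# Illustrative values only: 2-bit quantization with clip bounds [-1.5, 1.5]
y = np.array([-3.0, -0.8, 0.2, 2.7])
alpha_l, alpha_u = -1.5, 1.5
n_bit = 2

zp = alpha_l                                    # zero point = lower clip bound
delta = (alpha_u - alpha_l) / (2**n_bit - 1)    # step size = 3.0 / 3 = 1.0

# Quantize: clamp, shift, scale, then round to the nearest integer level
y_int = np.round((np.clip(y, alpha_l, alpha_u) - zp) / delta)
# Dequantize: map the integer levels back to real values
y_q = y_int * delta + zp

print(y_int.tolist())  # [0.0, 1.0, 2.0, 3.0]
print(y_q.tolist())    # [-1.5, -0.5, 0.5, 1.5]
```

The two out-of-range inputs (-3.0 and 2.7) end up on the clip bounds, while the in-range inputs snap to the nearest of the four available levels.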

<a id="fms_mo_import"></a>

### Import code libraries

In [None]:
! pip install fms-model-optimizer
! pip install wget

In [None]:
import numpy as np
import matplotlib.pyplot as plt

import torch
from torch import nn
from fms_mo import qconfig_init, qmodel_prep

import argparse
from torchvision import transforms
from PIL import Image
import pandas as pd
import os
from os import path


<a id="geninput"></a>

### Generate input data

For simplicity, we generate a normal distribution of input data, with the mean set to 0 and standard deviation set to 1. A sample size of 1 million is chosen.

The histogram (distribution plot) is shown below, with the y-axis set as a distribution density.

For the purpose of this tutorial, we will focus on only the forward pass as shown by the equation. The backward pass indicated by $$ \frac{dL}{dy_q}, \frac{dy_q}{dy}, \frac{dy_q}{d\alpha_l}, \frac{dy_q}{d\alpha_u} $$ will not be covered.

In [None]:
# Generate normal distribution of data with mean=0, std=1, sample size 1e6
# Using numpy (much faster than torch on CPU)
raw_data = np.random.normal(0, 1, int(1e6))

# Plotting the histogram.
plt.figure(figsize=(16, 10))
plt.hist(raw_data, density=True, bins=128, alpha=0.8, label='y')
#plt.legend(loc='upper right')
plt.xlabel("Data")
plt.ylabel("density")
plt.title("Normal distribution of data with mean=0, std=1, sample size 1e6")
plt.show()

<a id="clip"></a>

### Clip input data

Quantization of a tensor means that we can only use a limited number of distinct values (16 in the case of 4-bit precision) to represent all the numbers in the tensor. For 4-bit precision, we will need to:
- determine the range we want to represent, i.e. $(-\infty, \infty) \rightarrow [\alpha_l, \alpha_u]$, which means anything above $\alpha_u$ or below $\alpha_l$ will become $\alpha_u$ or $\alpha_l$, respectively.
- uniformly divide $[\alpha_l , \alpha_u]$ into 16 evenly spaced levels ("bins"), i.e. 15 equal steps, and round each number to the nearest level.
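
To make the second bullet concrete, the sketch below lists the 16 representable values for an assumed clip range of [-2.5, 2.5]. Note that 16 values are separated by 15 equal steps.

```python
import numpy as np

n_bit = 4
alpha_l, alpha_u = -2.5, 2.5                      # assumed clip range for illustration

n_levels = 2**n_bit                               # 16 representable values
stepsize = (alpha_u - alpha_l) / (n_levels - 1)   # 15 steps between 16 values
levels = alpha_l + stepsize * np.arange(n_levels)

print(n_levels, round(stepsize, 4))               # 16 0.3333
print(np.round(levels, 3))
```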

Ideally, we want to represent the entire range of the original tensor. But because we have limited bins available, we must clip off extreme values. 

$[\alpha_l , \alpha_u]$ can be determined in many ways, including gradual learning through training, or minimizing the quantization error of the tensor. Quantization error can be defined as $$MeanSquaredError(y, y_q) = \frac{1}{N} \sum_{i=1}^{N} (y_i - y_{q,i})^2$$ where $N$ is the number of elements in the tensor, although there could be other definitions.
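
One way to read "minimizing the quantization error" is a brute-force sweep: try several symmetric clip values, quantize, and keep the one with the lowest MSE. The sketch below does this for 4-bit quantization of standard normal samples. The candidate grid, the random seed, and the helper name `quantize` are assumptions made for illustration; this is not necessarily how FMS Model Optimizer selects clip values.

```python
import numpy as np

def quantize(y, n_bit, clip_min, clip_max):
    """Clamp, quantize to 2**n_bit levels, and dequantize (hypothetical helper)."""
    stepsize = (clip_max - clip_min) / (2**n_bit - 1)
    y_int = np.round((np.clip(y, clip_min, clip_max) - clip_min) / stepsize)
    return y_int * stepsize + clip_min

rng = np.random.default_rng(0)            # fixed seed so the sweep is repeatable
y = rng.normal(0.0, 1.0, 100_000)

candidates = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]   # assumed search grid
mse = {c: float(np.mean((y - quantize(y, 4, -c, c)) ** 2)) for c in candidates}

best_clip = min(mse, key=mse.get)
for c in candidates:
    print(f"clip=±{c:.1f}  MSE={mse[c]:.5f}")
print("clip value with the lowest MSE:", best_clip)
```

A narrow range clips too many samples, while a wide range spends the 16 levels on rarely used tails, so the MSE is typically lowest somewhere in between.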

Here we arbitrarily select a clip min and clip max value of -2.5 and 2.5 and perform the following: <br>
$$ clamp(y,\alpha_l,\alpha_u) $$

In [None]:
# define clipped values for upper and lower bound.
clip_min, clip_max = -2.5, 2.5
clipped_data = np.clip(raw_data, clip_min, clip_max)

print( "min/max of the original tensor", np.min(raw_data), np.max(raw_data) )
print( "min/max of the clipped tensor", np.min(clipped_data), np.max(clipped_data) )

print("MSE(raw_data, clipped_data)", np.mean( (raw_data-clipped_data)**2 ))

In [None]:
# show the first 5 clipped numbers and their original values.
isClipped=np.logical_or(raw_data>clip_max, raw_data<clip_min)
idx_clipped_elements=np.where( isClipped )[0]
pd.DataFrame( 
                {'idx':idx_clipped_elements[:5], 
                'raw': raw_data[ idx_clipped_elements[:5] ],
                'clipped': clipped_data[idx_clipped_elements[:5]] }
            )

In [None]:
# Plot the distribution and the clipped data to visualize

plt.figure(figsize=(16, 10))
plt.hist(raw_data,     density=True, bins=64, label="y (raw values)", histtype='step', linewidth=3.5)
plt.hist(clipped_data, density=True, bins=64, color=['#33b1ff'], alpha=0.8,label="y_clamp (clipped edges)")
plt.legend(fancybox=True, ncol=2)
plt.xlabel("Data")
plt.ylabel("density")
plt.title("Raw Data and Clipped Data")
plt.show()

From the results above, we can see that we've successfully clamped the data. 
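
How much data does clipping at ±2.5 affect? For a standard normal distribution the clipped fraction can be computed in closed form with the complementary error function, so the sketch below uses only the Python standard library. The result should be close to the fraction of clipped samples observed in the cells above.

```python
import math

clip_max = 2.5   # same symmetric clip bound as above

# For y ~ N(0, 1): P(|y| > c) = erfc(c / sqrt(2))
clipped_fraction = math.erfc(clip_max / math.sqrt(2))

print(f"Expected fraction of clipped samples: {clipped_fraction:.4%}")
```

This works out to roughly 1.24%, so with 1 million samples about 12,400 values are expected to be clipped.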



<a id="quant"></a>

### Scale, shift, and quantize data

Here we use 4-bit integers for this quantization, with zp = clip_min.

Our next step is to transform the data from the range [-2.5, 2.5] to the range [0, 15] and round the values to the nearest integer. Apply: <br>
$$y_{int} = \lfloor \frac{clamp(y, \alpha_l, \alpha_u) - zp}{\Delta} \rceil$$


In [None]:
# set bit size for quantization
n_bit = 4
zp = clip_min
stepsize = (clip_max - zp) / (2 ** n_bit -1)
y_scaled = (clipped_data - clip_min) / stepsize
y_int    = np.round(y_scaled)


In [None]:
plt.figure(figsize=(16, 10))
plt.hist(raw_data, density=True, bins=64, alpha=0.8,label="y (raw values)", histtype='step', linewidth=3.5)
plt.hist(y_scaled, density=True,  bins=64, color=['#33b1ff'], alpha=0.6,label="scale+shift")
plt.hist(y_int,    density=True,  bins=64, color=['#007d79'],alpha=0.8,label="quantize")
plt.legend(loc='upper left', fancybox=True, ncol=3)
plt.xlabel("Data")
plt.ylabel("density")
#plt.yscale('log')
plt.title("Raw Data and Shifted Data")
plt.show()


The plot above shows that we can represent the data as integers by clipping, shifting, scaling, and quantizing the data.
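
As a quick check of the integer representation: after rounding, every value should be one of the 16 integers 0 through 15. The sketch below regenerates a smaller sample with a fixed seed (an assumption made so the block runs on its own) and counts how many samples land on each level.

```python
import numpy as np

rng = np.random.default_rng(0)                  # assumed seed, smaller sample
y = rng.normal(0.0, 1.0, 100_000)

n_bit, clip_min, clip_max = 4, -2.5, 2.5
stepsize = (clip_max - clip_min) / (2**n_bit - 1)

# clamp, shift, scale, round, then store as integers
y_int = np.round((np.clip(y, clip_min, clip_max) - clip_min) / stepsize).astype(np.int64)

levels, counts = np.unique(y_int, return_counts=True)
print("levels used:", levels.tolist())
print("samples per level:", counts.tolist())
```

The two end levels (0 and 15) also absorb every clipped sample, so their counts include the tails of the distribution.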

<a id="dequant"></a>

### Dequantize data

The last step is to dequantize the quantized data $y_{int}$ back to the range [-2.5, 2.5] so that it overlays the original distribution. <br>
<font size=4>
$$y_q = y_{int} \times \Delta + zp$$
</font>
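
Rounding to the nearest level means that any value inside the clip range moves by at most half a step during the quantize/dequantize round trip. The sketch below checks this bound on fixed-seed sample data (the seed and sample size are assumptions for illustration). Values outside the clip range can have a larger error, since they are clamped first.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, 100_000)

n_bit, clip_min, clip_max = 4, -2.5, 2.5
zp = clip_min
stepsize = (clip_max - zp) / (2**n_bit - 1)

# quantize, then dequantize
y_int = np.round((np.clip(y, clip_min, clip_max) - zp) / stepsize)
y_q = y_int * stepsize + zp

inside = (y >= clip_min) & (y <= clip_max)       # samples that were not clamped
max_err_inside = float(np.max(np.abs(y[inside] - y_q[inside])))

print(f"stepsize / 2         = {stepsize / 2:.4f}")
print(f"max |y - y_q| inside = {max_err_inside:.4f}")
```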

In [None]:
yq = y_int * stepsize + zp

plt.figure(figsize=(16, 10))
plt.hist(raw_data, density=True, bins=64, label="original y", histtype='step', linewidth=2.5)#alpha=0.8,
plt.hist(yq,       density=True, color=['#33b1ff'], bins=64, label="quantized y")#alpha=0.7,
plt.legend(fancybox=True, ncol=2)
plt.xlabel("Data")
plt.ylabel("density")
plt.title("Raw Data and Quantized Data")
plt.show()

### An example of symmetric vs asymmetric quantization 

In [None]:
plt.subplots(3,1, figsize=(16, 12), sharex=True)

arstyle=dict(facecolor='C1',alpha=0.5, shrink=0.05)

n_bit = 4
clip_min, clip_max = -2.5, 2.5
asym_raw_data = np.abs(raw_data)
for i, (raw_i, lbl_i) in enumerate([(raw_data, 'Case 1: sym data, sym Q'), 
                                    (asym_raw_data, 'Case 2: asym data, asym Q'),
                                    (asym_raw_data, 'Case 3: asym data sym Q') ]):
    if 'asym Q' in lbl_i:
        # asym quantization: the range starts at the data minimum, i.e. roughly [0, clip_max]
        clip_min_i = np.min(raw_i)
        nbins = 2**n_bit -1
        scale = (clip_max - clip_min_i)/nbins
        # zero point as a real-valued offset, consistent with formula 1 below
        zp = clip_min_i
    else:
        # sym quantization: range is symmetric around 0, using 2**n_bit - 1 levels
        clip_min_i = -max(clip_max, np.abs(clip_min))
        nbins = 2**n_bit -2
        scale = (clip_max - clip_min_i)/nbins
        zp = 0

    # here we could use one of the 2 commonly used formulas
    # 1. y_q = round( (clamp(x) - zp)/scale )*scale + zp   (zp is a real-valued offset)
    # 2. y_q = (clamp(round(x/scale + zero_point), quant_min, quant_max) - zero_point) * scale   (zero_point is an integer)
    # We use formula 1 and clamp to this case's own range [clip_min_i, clip_max].
    y_int_i = np.round( (np.clip(raw_i, clip_min_i, clip_max) - zp)/scale )
    yq_i = y_int_i*scale + zp
    max_bin_i = np.round( (clip_max-zp)/scale)*scale + zp

    plt.subplot(311+i)
    plt.hist(raw_i, density=False, bins=64, label="original y", histtype='step', linewidth=2.5)
    plt.hist(yq_i,  density=False, color=['#33b1ff'], bins=64, label='y_q')
    plt.legend(fancybox=True, ncol=2, fontsize=14)

    plt.ylabel("Count")
    plt.annotate('upper clip bound', xy=(max_bin_i, 0), xytext=(max_bin_i, 1e5), arrowprops=arstyle)    
    plt.annotate('lower clip bound', xy=(clip_min_i, 0), xytext=(clip_min_i, 1e5), arrowprops=arstyle)    
    plt.title(lbl_i)

plt.tight_layout()
plt.show()

Now we will wrap the above steps into a "simple quantizer" so that we can easily reuse it later (using torch instead of numpy).

In [None]:
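def simpleQuantizer(input, n_bit, clip_min, clip_max):
    """Clamp `input` to [clip_min, clip_max], quantize to 2**n_bit levels, and dequantize."""
    zp = clip_min
    stepsize = (clip_max - zp) / (2 ** n_bit -1)

    # scale and shift into [0, 2**n_bit - 1], then round to the nearest integer level
    y_scaled = (torch.clamp(input, clip_min, clip_max) - zp) / stepsize
    y_int    = torch.round(y_scaled)
    # map the integer levels back to the original range
    return y_int * stepsize + zp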


<a id="conv"></a>

## Step 2. Quantize a convolution layer

In this section, we show how to manually quantize a Convolution layer, i.e. quantizing the input data and weights, and then feed them into a convolution computation. 

**Note:**
1. The input and weight quantizers can use different clip values, and can even be different types of quantizers if needed.
2. In practice, the bias of a convolution layer usually isn't quantized: adding the bias costs far less computation than the matmul, while quantizing it carries a high risk of losing accuracy.

<a id="3p2"></a>

### Generate input data

Similar to Step 1, the input data is a randomly generated normal distribution. We generate 1 input sample with 3 channels, 32 pixel width, and 32 pixel height.

In [None]:
# Channel, Height, Width
C, H, W = 3, 32, 32
N = 1

# Generate 1 sample
input = torch.randn(N,C,H,W)

print('Input Shape: ', input.shape)
print('Number of unique input values: ', input.detach().unique().size()[0])
print(f'Expected: {N * C * H * W} (Based on randomly generated values for shape {N} x {C} x {H} x {W})')

<a id="3p3"></a>

### Quantize input data


In [None]:
# Set the max, min clip values and number of bits
clip_min, clip_max = -2.5, 2.5
n_bit = 4

# Quantize the input data
input_quant = simpleQuantizer(input, n_bit, clip_min, clip_max)

print('Quantized input Shape: ', input_quant.shape)
print('Number of unique quantized input values: ', input_quant.detach().unique().size()[0])
print(f'Expected: {2 ** n_bit} (Based on 2 ^ {n_bit})')

<a id="3p4"></a>

### Create a single layer convolution network

In [None]:
# Create Network of 1 Convolution Layer
class NeuralNet(nn.Module):
    def __init__(self):
        super(NeuralNet, self).__init__()
        # (32 (width, height) - 3 (filter size)) / 1 (stride) + 1 = 30 (new width, height)
        self.conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3, stride=1)

    def forward(self, input):
        out = self.conv(input)
        return out

net = NeuralNet()

net(input).shape

<a id="3p5"></a>

### Generate weights and bias

To simulate the quantization of a pretrained model we set the weights manually to a normal distribution of values. Bias will be set to zeros because we don't plan on using bias.

In [None]:
# Generate the weights for the convolution filter so we know what the values are
weight = torch.randn(net.conv.weight.shape)
bias = torch.zeros(net.conv.bias.shape)

# Replace current conv2d weight with this randomly generated weight (so we know the values)
net.conv.weight = torch.nn.Parameter(weight)

# ignore bias for now 
net.conv.bias = torch.nn.Parameter(bias)

print('Weight Shape: ', weight.shape)
print('Number of unique weight values: ', weight.detach().unique().size()[0])
print(f'Expected: {weight.numel()} (Based on randomly generated values for shape {weight.shape[0]} x {weight.shape[1]} x {weight.shape[2]} x {weight.shape[3]})')

<a id="3p6"></a>

### Quantize weights


In [None]:
# Reuse n_bit, clip_min, and clip_max from the input quantization above

# Quantize the weights (similar to input)
weight_quant = simpleQuantizer(weight, n_bit, clip_min, clip_max)

print('Quantized weight Shape: ', weight_quant.shape)
print('Number of unique quantized weight values: ', weight_quant.detach().unique().size()[0])
# With only a few weights, not every one of the 2 ^ n_bit levels is necessarily used
print(f'Expected: at most {min(2 ** n_bit, weight.numel())} (limited by 2 ^ {n_bit} levels and {weight.numel()} weights)')
print('First output filter of Quantized Weight', weight_quant[0])


<a id="3p7"></a>

### Feed quantized data, weights, and bias into convolution layer


In [None]:
# Generate output y
y = net(input)

# Generate quantized output y, NOTE, this net is currently using non-quantized weight 
y_quant = net(input_quant)

print('Number of unique output values: ', y.detach().unique().size()[0])
print('Expected maximum unique output values: ', y.flatten().size()[0])
print('Number of unique quantized output values: ', y_quant.detach().unique().size()[0])


**Now we plot four cases to determine how well quantization works with convolution:**

1. both input and weights are not quantized
2. quantized weights with raw input
3. raw weights with quantized input
4. both weights and input are quantized 


In [None]:
def PlotAndCompare(d1, d2, labels, title):
    mse = nn.functional.mse_loss(d1, d2, reduction='mean' )
    plt.hist( d1.flatten().detach().numpy(), bins=64, alpha = 0.7, density=True, label=labels[0])
    plt.hist( d2.flatten().detach().numpy(), bins=64, color=['#33b1ff'], alpha = 0.8, density=True, label=labels[1], histtype='step', linewidth=3.5)
    plt.yscale('log')
    plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), fancybox=True, ncol=2)
    plt.title(f"{title}, MSE={mse:.3f}")



titles=['inputs', 'weights', 'outputs']
isQ = ['not quantized', 'quantized']
for i, inp in enumerate([input, input_quant]):
    for j, W in enumerate([weight, weight_quant]):
        plt.subplots(1,3,figsize=(18,5))
        plt.suptitle(f'Case {i*2+j+1}: Input {isQ[i]}, Weight {isQ[j]}', fontsize=20, ha='center', va='bottom')
        plt.subplot(131); PlotAndCompare(input,     inp,        ['raw', isQ[i]],  f"input, {isQ[i]}")
        plt.subplot(132); PlotAndCompare(weight,    W,          ['raw', isQ[j]],  f"weight, {isQ[j]}")
        net.conv.weight = torch.nn.Parameter(W)
        y_quant = net(inp)
        plt.subplot(133); PlotAndCompare(y,         y_quant,   ['raw', f'A={isQ[i]}, W={isQ[j]}'], "conv output")
        plt.show()




We see different levels of MSE when we quantize different components of the convolution layer. Note that for a given set of input and weight tensors, MSE depends on the `number of bits`, `clip_min`, and `clip_max` chosen.
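
To illustrate the dependence on the number of bits, the sketch below quantizes a fixed-seed normal sample at several bit widths with the clip range held at [-2.5, 2.5]. It uses NumPy on a flat array rather than the convolution layer above, so the MSE values describe the tensor itself, not the conv output. The helper name `quantize` and the seed are assumptions for illustration.

```python
import numpy as np

def quantize(y, n_bit, clip_min, clip_max):
    """Clamp, quantize to 2**n_bit levels, and dequantize (hypothetical helper)."""
    stepsize = (clip_max - clip_min) / (2**n_bit - 1)
    y_int = np.round((np.clip(y, clip_min, clip_max) - clip_min) / stepsize)
    return y_int * stepsize + clip_min

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, 100_000)
clip_min, clip_max = -2.5, 2.5

mse_by_bits = {}
for n_bit in (2, 4, 8):
    y_q = quantize(y, n_bit, clip_min, clip_max)
    mse_by_bits[n_bit] = float(np.mean((y - y_q) ** 2))
    print(f"n_bit={n_bit}  MSE={mse_by_bits[n_bit]:.5f}")
```

At higher bit widths the rounding error shrinks and the error from clamping the tails tends to dominate, which is why the clip range matters as well as the bit width.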

<a id="fms_mo"></a>

## Step 3. Use FMS Model Optimizer to automate quantization

In this section we show how to reduce manual effort in the quantization process by using our model optimization library to automate the process.

For simplicity we will use a 1-layer toy network as an example, but FMS Model Optimizer can handle more complicated networks. 

As in Step 2, to simulate the quantization of a pretrained model we set the weights manually to a normal distribution of values. Bias will be set to zeros because we don't plan on using bias.

We initialize the configuration (dictionary), manually modify the parameters of interest, then run a "model prep" to quantize the network automatically. The results will be identical to the output `y_quant` shown in Step 2.

The parameters `nbits_w` and `nbits_a` will be used to control the precision of (most of) the modules identified by FMS Model Optimizer that can be quantized.

In [None]:
# Create a neural network with single convolution for fms_mo demo purposes
net_fms_mo = NeuralNet()

# Set weights and bias in convolution network (optional)
# This keeps the weights and bias consistent with the previous steps.
net_fms_mo.conv.weight = torch.nn.Parameter(weight)
net_fms_mo.conv.bias = torch.nn.Parameter(bias)


# Step 1: initialize configuration dict
qcfg = qconfig_init()

# set bits for quantization (nbits_a needs to be set to quantize input regardless of bias)
qcfg['nbits_w'] = 4
qcfg['nbits_a'] = 4

# just to be consistent with our "simple Quantizer" (normally align_zero is True)
qcfg['align_zero'] = False

# Quantization mode selects which quantizers we would like to use.
# There are many quantizers available in fms_mo, such as PArameterized Clipping acTivation (PACT)
# and Statistics-Aware Weight Binning (SAWB).
qcfg['qw_mode'] = 'pact'
qcfg['qa_mode'] = 'pact'

# Set weight and input (activation) clip vals
qcfg['w_clip_init_valn'], qcfg['w_clip_init_val'] = -2.5, 2.5
qcfg['act_clip_init_valn'], qcfg['act_clip_init_val'] = -2.5, 2.5


# This parameter is usually False, but for Demo purposes we quantize the first/only layer
qcfg['q1stlastconv'] = True


if path.exists("results"):
    print("results folder exists!")
else:
    os.makedirs('results')
    
# Step 2: Prepare the model (convert layers and add quantizers)
qmodel_prep(net_fms_mo, input, qcfg, save_fname='./results/temp.pt')



In [None]:
# Step 3: Run network as usual
y_quant_fms_mo = net_fms_mo(input)
y_quant      = net(input_quant) 

plt.figure(figsize=(16, 10))
PlotAndCompare(y_quant_fms_mo, y_quant, ['fms_mo','manual'],'quantized Conv output by different methods')
plt.show()


<a id="fms_mo_visual"></a>

## Step 4. Try a convolution layer on a quantized image

In this section we pass an image of a lion through a quantizer and convolution layer to observe the performance of the quantizer with convolution.

In [None]:
import os, wget
IMG_FILE_NAME = 'lion.png'
url = 'https://raw.githubusercontent.com/foundation-model-stack/fms-model-optimizer/main/tutorials/images/' + IMG_FILE_NAME

if not os.path.isfile(IMG_FILE_NAME):
  wget.download(url, out=IMG_FILE_NAME)

img = Image.open(IMG_FILE_NAME)
img

In [None]:
convert_tensor = transforms.ToTensor()
input_img_tensor = convert_tensor(img)
input_img_tensor = input_img_tensor.unsqueeze(0)

tensor_img_transform = transforms.ToPILImage()

# we used unsqueeze to create batch dimension, i.e. from [C,H,W] to [N,C,H,W]
print(input_img_tensor.shape)


In [None]:

net_img_non_quantized = NeuralNet()

# Generate weights
weight = torch.randn(net_img_non_quantized.conv.weight.shape)

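# Replace current conv2d weight (so we know the values)
net_img_non_quantized.conv.weight = torch.nn.Parameter(weight)
# Zero the bias too, since net_fms_mo was given a zero bias in Step 3;
# this way the two outputs differ only because of quantization
net_img_non_quantized.conv.bias = torch.nn.Parameter(torch.zeros(net_img_non_quantized.conv.bias.shape))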

# Since we are recycling the net_fms_mo from previous section, the weight needs to be replaced
weight_quant = simpleQuantizer(weight, n_bit, clip_min, clip_max)
net_fms_mo.conv.weight = torch.nn.Parameter(weight_quant)

# Generate normal output from filter
y_img_tensor = net_img_non_quantized(input_img_tensor)
y_img_quant  = net_fms_mo(input_img_tensor)

# Transform output to image
feature_map       = tensor_img_transform(y_img_tensor[0])
feature_map_quant = tensor_img_transform(y_img_quant[0])


plt.subplots(3,1,figsize=(16,25))
plt.subplot(311)
plt.title('Output from non-quantized model', fontsize=20)
plt.imshow(feature_map, cmap='RdBu')
plt.clim(0,255)
plt.colorbar()

plt.subplot(312)
plt.title('Output from quantized model', fontsize=20)
plt.imshow(feature_map_quant, cmap='RdBu')
plt.clim(0,255)
plt.colorbar()

plt.subplot(313)
PlotAndCompare(y_img_tensor, y_img_quant, ['raw','quantized'],'Conv output')

plt.tight_layout()
plt.show()


<mark style="background-color: lightyellow">
We can see that many details in the second image, from the quantized model, are saturated and lost, but the shape of the lion can still be seen clearly. This implies that if the quantized model is properly trained, the most critical information can be preserved. For example, if we want to perform classification or object detection, we may be able to reach the same answer, "It's a lion!", from both images.
</mark>

<a id="fms_mo_conclusion"></a>

## Conclusion

This notebook provided the following demonstrations:

- In Step 1, we showed how quantization can be applied manually to a randomly generated normal distribution of input data.
- In Step 2, we showed how to apply quantization to a convolution layer.
- In Step 3, we showed how to automate the quantization process using FMS Model Optimizer.
- In Step 4, we observed the performance of a quantized convolution layer on an image of a lion.

<a id="fms_mo_learn"></a>

## Learn more 

Please see [example scripts](https://github.com/foundation-model-stack/fms-model-optimizer/tree/main/examples) for more practical use of FMS Model Optimizer.
