# Post-Training Quantization on Needle

Fan Mo @cokespace2, Krish Parmar @parmar, Yue Zhuang @zysophia

git: https://github.com/zysophia/DLsys_Quantization_Model

In [None]:
# Code to set up the assignment
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/
# !mkdir -p 10714
%cd /content/drive/MyDrive/10714
# !git clone https://github.com/zysophia/DLsys_Quantization_Model
%cd /content/drive/MyDrive/10714/DLsys_Quantization_Model

!pip3 install pybind11

In [None]:
import sys
sys.path.append('./python')
sys.path.append('./apps')

In [None]:
!export CXX=/usr/bin/clang++ && make

## Overview of Quantization for Deep Learning


In deep learning, quantization refers to reducing the numerical precision of the weights and activations of a neural network model. Quantization shrinks a large deep learning model so that it can be stored and run for inference more efficiently on edge devices (e.g., mobile phones) that have limited memory and computation resources. Typically, quantization uses a fixed number of bits (usually 8 or 16) to represent the weights and activations of a model instead of full-precision floating-point numbers.

Quantization is often combined with other techniques such as pruning to further reduce the size and computational cost of a model. There are two main types of quantization in deep learning.

1. Post-Training Quantization: Quantization is applied to a pre-trained model by converting its tensors to a lower bit-width. There are several ways to map floating-point values to integers, including range mapping and scaling. However, because lowering the precision discards information, there is no guarantee that the quantized model will match the performance of the original.

2. Quantization-Aware Training: The model is trained to be aware of quantization. The weights and activations are quantized during training and the gradients are computed using the quantized representation, so that the model becomes more robust to quantization and can achieve higher performance once it is quantized.
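
To make the "range mapping" idea above concrete, here is a minimal NumPy sketch of affine quantization to 8-bit unsigned integers. It is an illustration only, not the code used in this project: the function names `quantize_affine` and `dequantize_affine` are hypothetical, and the choice of an unsigned range with a zero point is an assumption.

```python
import numpy as np

def quantize_affine(x, num_bits=8):
    """Map a float array onto unsigned integers using a scale and a zero point.

    Hypothetical helper for illustration; not part of needle.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the observed range to include 0.0 so that zero maps to an exact integer.
    x_min = min(float(x.min()), 0.0)
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_affine(q, scale, zero_point):
    """Recover an approximation of the original floats."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4)).astype(np.float32)

q, scale, zero_point = quantize_affine(weights)
recovered = dequantize_affine(q, scale, zero_point)
max_error = float(np.abs(weights - recovered).max())
print("scale:", scale, "zero point:", zero_point, "max abs error:", max_error)
```

The round trip loses at most about one quantization step per value, which is the precision cost that post-training quantization accepts in exchange for storing one byte per value instead of four.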


## Summary of What We Have Achieved

Given the limited time, this project focuses on implementing post-training quantization and leaves quantization-aware training for future work. Our method is built on top of the needle library.

* We implemented two methods, `save_model()` and `load_model()`, which dump a trained model and read a saved model back into memory.

* We also implemented a method, `quantize_model()`, which takes a model as input and applies post-training quantization to its weights and activations. We compared the sizes of the original and quantized models and found that the quantized model is significantly smaller.

* Finally, we evaluated the original and quantized models on the test data and observed comparable performance.


## Code, Algorithms and Evaluation

### Save and Load Models

We implemented two methods, `save_model()` and `load_model()`, in tools/model_save_and_load.py. `load_model()` calls `load_named_params()`, which recursively updates the model's parameters with the given values. We use pickle to dump and load the value dicts.
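
The general pattern described above (collect parameters into a dict, pickle it, then recursively write the values back) can be sketched as follows. This is a simplified illustration, not the contents of tools/model_save_and_load.py: `Param`, `Module`, `named_params`, `assign_named_params`, `save_params`, and `load_params` are hypothetical stand-ins, and the real needle classes and signatures may differ.

```python
import os
import pickle
import tempfile

import numpy as np

class Param:
    """Hypothetical stand-in for a trainable parameter (not needle's class)."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float32)

class Module:
    """Hypothetical stand-in for a module whose attributes hold parameters."""

def named_params(module, prefix=""):
    """Recursively collect {dotted_name: ndarray} from a module tree."""
    out = {}
    for name, value in vars(module).items():
        key = prefix + name
        if isinstance(value, Param):
            out[key] = value.data
        elif isinstance(value, Module):
            out.update(named_params(value, key + "."))
    return out

def assign_named_params(module, values, prefix=""):
    """Recursively overwrite each parameter with the matching entry in `values`."""
    for name, value in vars(module).items():
        key = prefix + name
        if isinstance(value, Param):
            value.data = values[key]
        elif isinstance(value, Module):
            assign_named_params(value, values, key + ".")

def save_params(module, path):
    """Pickle the dict of parameter values."""
    with open(path, "wb") as f:
        pickle.dump(named_params(module), f)

def load_params(module, path):
    """Unpickle the dict and write the values into an existing module."""
    with open(path, "rb") as f:
        assign_named_params(module, pickle.load(f))

class Linear(Module):
    def __init__(self, fill):
        self.weight = Param(np.full((2, 2), fill))
        self.bias = Param(np.full(2, fill))

class Net(Module):
    def __init__(self, fill):
        self.fc1 = Linear(fill)
        self.fc2 = Linear(fill + 1.0)

trained = Net(1.5)   # pretend these are trained values
fresh = Net(0.0)     # same architecture, different values
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
save_params(trained, path)
load_params(fresh, path)
```

Keying the dict by dotted attribute path means the loader only needs the same module structure, not the same object, which is why the recursion mirrors the collection step.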

Run the following block to test model saving and loading.

In [None]:
!python3 tests/test_quantize.py

### Fuse Convolution BatchNorm and ReLU
We implemented the `fuse_conv_bn_relu()` method in tools/fuse_ops.py, which fuses Convolution, BatchNorm, and ReLU into a single layer.
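
As background on why this fusion is possible, the sketch below shows the standard way to fold inference-time BatchNorm into the preceding convolution's weights and bias, with ReLU applied to the fused output. It is not the code in tools/fuse_ops.py: the function `fuse_conv_bn`, the `(C_out, C_in, K, K)` weight layout, and the 1x1 convolution used for checking are all assumptions made for illustration.

```python
import numpy as np

def fuse_conv_bn(weight, bias, gamma, beta, running_mean, running_var, eps=1e-5):
    """Fold inference-time BatchNorm into convolution weights and bias.

    Assumed shapes: weight (C_out, C_in, K, K); everything else (C_out,).
    """
    factor = gamma / np.sqrt(running_var + eps)
    fused_weight = weight * factor[:, None, None, None]
    fused_bias = (bias - running_mean) * factor + beta
    return fused_weight, fused_bias

def conv1x1(x, weight, bias):
    """1x1 convolution on an (N, C_in, H, W) input; enough to check the fusion."""
    return np.einsum("nchw,oc->nohw", x, weight[:, :, 0, 0]) + bias[None, :, None, None]

def per_channel(v):
    """Reshape a (C,) vector so it broadcasts over (N, C, H, W)."""
    return v[None, :, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4, 4))
weight = rng.normal(size=(5, 3, 1, 1))
bias = rng.normal(size=5)
gamma, beta = rng.normal(size=5), rng.normal(size=5)
mean, var = rng.normal(size=5), rng.uniform(0.5, 2.0, size=5)
eps = 1e-5

# Reference path: convolution, then BatchNorm with running statistics, then ReLU.
y = conv1x1(x, weight, bias)
y = per_channel(gamma) * (y - per_channel(mean)) / np.sqrt(per_channel(var) + eps) + per_channel(beta)
reference = np.maximum(y, 0.0)

# Fused path: one convolution with folded parameters, then ReLU.
fused_weight, fused_bias = fuse_conv_bn(weight, bias, gamma, beta, mean, var, eps)
fused = np.maximum(conv1x1(x, fused_weight, fused_bias), 0.0)

print("outputs match:", np.allclose(reference, fused))
```

Because BatchNorm at inference time is a per-channel affine transform, it composes with the convolution into a single convolution. Fusing before quantization means only one set of weights per layer needs to be quantized.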

### Quantization

We implemented the `quantize_conv()` method in tools/quantize_weights.py. The method takes an nn.Conv layer as input and applies per-channel quantization. It outputs the quantized data as well as the scales.
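
To illustrate what quantization by channel means, here is a minimal NumPy sketch of symmetric per-channel int8 quantization. It is not the implementation in tools/quantize_weights.py: the function `quantize_per_channel`, the symmetric int8 scheme, and the assumption that output channels lie on axis 0 are all illustrative choices.

```python
import numpy as np

def quantize_per_channel(weight, num_bits=8, channel_axis=0):
    """Symmetric quantization with one scale per output channel.

    Hypothetical helper; returns int8 data and scales shaped for broadcasting.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    reduce_axes = tuple(i for i in range(weight.ndim) if i != channel_axis)
    max_abs = np.abs(weight).max(axis=reduce_axes, keepdims=True)
    # Guard against all-zero channels, where any scale works.
    scales = np.where(max_abs > 0, max_abs / qmax, 1.0)
    q = np.clip(np.round(weight / scales), -qmax, qmax).astype(np.int8)
    return q, scales

rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 3, 3, 3)).astype(np.float32)
weight[1] *= 100.0  # one channel with a much larger range than the others

q, scales = quantize_per_channel(weight)
dequantized = q.astype(np.float32) * scales
per_channel_error = np.abs(weight - dequantized).reshape(4, -1).max(axis=1)

print("scales:", scales.reshape(-1))
print("max abs error per channel:", per_channel_error)
```

With a single scale for the whole tensor, the large-range channel would dictate the step size for every channel. Per-channel scales keep the rounding error of each channel proportional to that channel's own range.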

Run the following block to test quantization.

In [None]:
!python3 tools/quantize_test.py

## References

* https://iq.opengenus.org/basics-of-quantization-in-ml/
* https://medium.com/@joel_34050/quantization-in-deep-learning-478417eab72b