## Quantization in PyTorch | Mixed Precision Training

### [Link to my Youtube Video Explaining this whole Notebook](https://www.youtube.com/watch?v=RPvx3yZ2fc8&list=PLxqBkZuBynVQqJTE9nRM2p7Tb12TDPlnq&index=10)

[![Imgur](https://imgur.com/MO5BPwm.png)](https://www.youtube.com/watch?v=RPvx3yZ2fc8&list=PLxqBkZuBynVQqJTE9nRM2p7Tb12TDPlnq&index=10)



Neural networks are implemented as computational graphs, and their computations often use 32-bit (or in some cases, 64-bit) floating-point numbers. However, we can enable our computations to use lower-precision numbers and still achieve comparable results by applying quantization.

Quantization refers to techniques for computing and accessing memory with lower-precision data. These techniques can decrease model size, reduce memory bandwidth requirements, and speed up inference, thanks to the reduced memory traffic and faster int8 arithmetic.

A quick quantization method is to cut the precision of all computations in half, from float32 to float16.
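As a rough illustration of what "lower-precision data" means for int8, the sketch below maps a few float32 values to int8 with an affine scheme, `q = round(x / scale) + zero_point`, and maps them back. The input values, `scale`, and `zero_point` are hypothetical and chosen only for illustration.

```python
import torch

# Hypothetical values and quantization parameters, for illustration only
x = torch.tensor([-1.0, -0.5, 0.0, 0.5, 1.0])
scale, zero_point = 0.01, 0

# Affine quantization: q = round(x / scale) + zero_point, stored as int8
q = torch.quantize_per_tensor(x, scale=scale, zero_point=zero_point,
                              dtype=torch.qint8)
ints = q.int_repr()        # the underlying int8 values
restored = q.dequantize()  # back to float32: (q - zero_point) * scale

print(ints)
print(restored)
print(x.element_size(), "bytes per float32 value vs",
      ints.element_size(), "byte per int8 value")
```

Each stored value takes one byte instead of four, at the cost of a rounding error of at most half a `scale` step for values inside the representable range.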


In [1]:
import torch 
from torch import nn
import torch.nn.functional as F
import warnings
warnings.filterwarnings('ignore')

#### First, a simple implementation of the LeNet5 architecture, to build a working neural network on which we will see the effects of quantization.

In [2]:
class LeNet5(nn.Module):
    def __init__(self):
        super(LeNet5, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = F.max_pool2d(
            F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(
            F.relu(self.conv2(x)), 2)
        x = x.view(-1, 
                   int(x.nelement() / x.shape[0]))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

fp32_model = LeNet5()
fp32_model

LeNet5(
  (conv1): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)

By default, all computations and memory are implemented as float32. We can inspect the data types of our model’s parameters using the following code:


In [3]:
for n, p in fp32_model.named_parameters():
  print(n, ": ", p.dtype)

conv1.weight :  torch.float32
conv1.bias :  torch.float32
conv2.weight :  torch.float32
conv2.bias :  torch.float32
fc1.weight :  torch.float32
fc1.bias :  torch.float32
fc2.weight :  torch.float32
fc2.bias :  torch.float32
fc3.weight :  torch.float32
fc3.bias :  torch.float32


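And now we reduce the model to half precision in one line of code using the `half()` method. Note that `half()` converts the module in place and returns the same object, so `fp32_model` itself holds float16 parameters after the call.

Using `half()` is often a quick and easy way to quantize your models. It’s worth a try to see if the performance is adequate for your use case.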



In [4]:
fp16_model = fp32_model.half()

for n, p in fp16_model.named_parameters():
  print(n, ": ", p.dtype)

conv1.weight :  torch.float16
conv1.bias :  torch.float16
conv2.weight :  torch.float16
conv2.bias :  torch.float16
fc1.weight :  torch.float16
fc1.bias :  torch.float16
fc2.weight :  torch.float16
fc2.bias :  torch.float16
fc3.weight :  torch.float16
fc3.bias :  torch.float16
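Because `half()` works in place, a float32 copy has to be kept explicitly if it is still needed. The sketch below uses a small stand-in model (hypothetical, not the LeNet5 above) to show one way of doing that with `copy.deepcopy`, and to show the two effects of the conversion: parameter storage halves, and each value is rounded to roughly three significant decimal digits.

```python
import copy
import torch
from torch import nn

# Small stand-in model (hypothetical); any nn.Module behaves the same way
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))

# Copy first, then convert, so that `model` stays in float32
half_model = copy.deepcopy(model).half()

def param_bytes(m):
    """Total bytes used by the parameters of module m."""
    return sum(p.numel() * p.element_size() for p in m.parameters())

fp32_bytes = param_bytes(model)
fp16_bytes = param_bytes(half_model)
print(fp32_bytes, fp16_bytes)

# Rounding error introduced by storing a single value in float16
value = torch.tensor(0.1234567)
rounding_error = (value - value.half().float()).abs().item()
print(rounding_error)
```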


However, in many cases, we don’t want to quantize every computation in the same way, and we may need to quantize beyond float16 values. For these other cases, PyTorch provides three additional modes of quantization:

- dynamic quantization,
- post-training static quantization, and
- quantization-aware training (QAT).

Dynamic quantization is used when throughput is limited by compute or memory bandwidth for weights. This is often true for LSTM, RNN, or Transformer networks.

Small portions of the network – in particular, portions of the softmax operation – must remain in float32. This is because the sum of a large number of small values (our logits) can be a source of accumulated error.
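To illustrate why sums of many small values are kept in float32, the sketch below adds a small step ten thousand times in float16 and in float32. The step size and count are hypothetical, and the loop is only a stand-in for an accumulation such as a softmax denominator. The exact total would be 1.0, but the float16 accumulator stops growing once the step becomes smaller than half the spacing between neighbouring float16 values.

```python
import torch

step = 1e-4
n_steps = 10_000  # the exact sum would be 1.0

acc16 = torch.tensor(0.0, dtype=torch.float16)
acc32 = torch.tensor(0.0, dtype=torch.float32)
step16 = torch.tensor(step, dtype=torch.float16)
step32 = torch.tensor(step, dtype=torch.float32)

# Accumulate the same value in both precisions; every float16 addition
# is rounded back to float16, so small increments can be lost entirely
for _ in range(n_steps):
    acc16 = acc16 + step16
    acc32 = acc32 + step32

print("float16 sum:", acc16.item())
print("float32 sum:", acc32.item())
```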

Static quantization is used when throughput is limited by memory bandwidth for activations and often applies to CNNs. QAT is used when accuracy requirements cannot be achieved by static quantization.

Dynamic quantization is the easiest type. The weights are converted to int8 ahead of time, and the activations are converted to int8 on the fly, just before each computation. Computations use efficient int8 values, but the activations are read and written to memory in floating-point format.

------------------------

## Dynamic Quantization

The following code shows how to quantize a model with dynamic quantization. All we need to do is pass in our model and specify the layer types to quantize and the quantized data type:


In [5]:
import torch.quantization

quantized_model = torch.quantization.quantize_dynamic(
      fp32_model, 
      {torch.nn.Linear},
      dtype=torch.qint8)

- `fp32_model` is the PyTorch module targeted by the optimization.

- `{torch.nn.Linear}` is the set of layer classes within the model we want to quantize.

- `dtype` is the quantized tensor type that will be used for the weights (here `torch.qint8`).

In [6]:
quantized_model

LeNet5(
  (conv1): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): DynamicQuantizedLinear(in_features=400, out_features=120, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
  (fc2): DynamicQuantizedLinear(in_features=120, out_features=84, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
  (fc3): DynamicQuantizedLinear(in_features=84, out_features=10, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
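A quick sanity check after dynamic quantization is to compare the outputs of the float and quantized models on the same input. The sketch below does this for a small stand-in network (hypothetical, made only of `Linear` layers); it assumes a CPU build of PyTorch with a quantized backend such as fbgemm available.

```python
import torch
import torch.quantization
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for the fully connected part of a network
float_model = nn.Sequential(
    nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 4)).eval()

# quantize_dynamic returns a converted copy; float_model is left untouched
quantized = torch.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(32, 16)
with torch.no_grad():
    float_out = float_model(x)
    quant_out = quantized(x)

# Inputs and outputs stay float32; only the Linear internals use int8
max_abs_diff = (float_out - quant_out).abs().max().item()
print(quant_out.dtype, max_abs_diff)
```

The difference is usually small relative to the output values, but how much error is acceptable depends on the use case.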

## Compare Model size between fp32 and int8

In [7]:
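import os

def print_size_of_model(model, label=""):
    # Save the state_dict to a temporary file and report its size on disk
    torch.save(model.state_dict(), "temp.p")
    size = os.path.getsize("temp.p")
    print("model: ",label,' \t','Size (KB):', size/1e3)
    os.remove('temp.p')
    return size

# compare the sizes
# Note: fp32_model was cast in place by .half() in cell In [4], so the
# entry labelled "fp32" measures float16 parameters.
f = print_size_of_model(fp32_model, "fp32")
q = print_size_of_model(quantized_model, "int8")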

model:  fp32  	 Size (KB): 126.919
model:  int8  	 Size (KB): 70.005
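The same measurement can be made without touching the disk. The sketch below is one possible variant: the helper name `serialized_size_bytes` and the stand-in network (three `Linear` layers with the same shapes as the fully connected part of LeNet5) are assumptions for illustration. It also compares the serialized size with the raw parameter bytes, which shows how much of the file is overhead.

```python
import io
import torch
import torch.quantization
from torch import nn

def serialized_size_bytes(model):
    """Size of the saved state_dict, measured in an in-memory buffer."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes

# Stand-in with the same shapes as the three Linear layers of LeNet5
fc_fp32 = nn.Sequential(
    nn.Linear(400, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10))
fc_int8 = torch.quantization.quantize_dynamic(
    fc_fp32, {nn.Linear}, dtype=torch.qint8)

# Raw parameter storage: number of values times bytes per value
raw_fp32_bytes = sum(p.numel() * p.element_size()
                     for p in fc_fp32.parameters())
size_fp32 = serialized_size_bytes(fc_fp32)
size_int8 = serialized_size_bytes(fc_int8)
print(raw_fp32_bytes, size_fp32, size_int8)
```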


## Static quantization 


Post-training static quantization can be used to further reduce latency by observing the distributions of different activations during a calibration pass over representative data and by deciding ahead of time how those activations should be quantized at inference. This type of quantization allows us to pass quantized values between operations without converting back and forth between floats and ints in memory:



In [8]:
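static_quant_model = LeNet5()
# 'fbgemm' is the backend configuration for x86 server CPUs
static_quant_model.qconfig = torch.quantization.get_default_qconfig('fbgemm')

# prepare() attaches observers. Normally we would now run representative
# data through the model to calibrate them before calling convert().
torch.quantization.prepare(
    static_quant_model, inplace=True)
# No calibration data is run here, so the converted layers keep the
# default scale=1.0 and zero_point=0.
torch.quantization.convert(
    static_quant_model, inplace=True)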

LeNet5(
  (conv1): QuantizedConv2d(3, 6, kernel_size=(5, 5), stride=(1, 1), scale=1.0, zero_point=0)
  (conv2): QuantizedConv2d(6, 16, kernel_size=(5, 5), stride=(1, 1), scale=1.0, zero_point=0)
  (fc1): QuantizedLinear(in_features=400, out_features=120, scale=1.0, zero_point=0, qscheme=torch.per_channel_affine)
  (fc2): QuantizedLinear(in_features=120, out_features=84, scale=1.0, zero_point=0, qscheme=torch.per_channel_affine)
  (fc3): QuantizedLinear(in_features=84, out_features=10, scale=1.0, zero_point=0, qscheme=torch.per_channel_affine)
)

Quantization might result in reduced accuracy. In such cases, we can often improve the accuracy significantly simply by using a different quantization configuration.


### This ‘fbgemm’ configuration does the following:

* Quantizes weights on a per-channel basis.
* Uses a histogram observer that collects a histogram of activations and then picks quantization parameters in an optimal manner.


### Prepare model for quantization

`torch.quantization.prepare` attaches observers to the model. During calibration, we run representative data (for example, a sample of the training set) through the prepared model, and the observers record the distribution of each activation. These distributions are then used to determine how activations should be quantized at inference time. Importantly, this additional step allows us to pass quantized values between operations instead of converting these values to floats, and then back to ints, between every operation, resulting in a significant speed-up.
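What an observer does during calibration can be sketched in isolation. The example below uses `MinMaxObserver`, a simpler observer than the histogram-based one mentioned above, and a hypothetical set of activations spanning the range [-1, 3]. It then recomputes the resulting scale and zero point by hand for the uint8 range [0, 255].

```python
import torch
from torch.quantization import MinMaxObserver

# A simple observer that only tracks the running min and max
observer = MinMaxObserver(dtype=torch.quint8,
                          qscheme=torch.per_tensor_affine)

# "Calibration": pass representative activations through the observer
activations = torch.linspace(-1.0, 3.0, steps=100)  # hypothetical data
observer(activations)

# Quantization parameters derived from the observed range
scale, zero_point = observer.calculate_qparams()

# The same affine mapping computed by hand for the uint8 range [0, 255]
manual_scale = (3.0 - (-1.0)) / 255
manual_zero_point = round(1.0 / manual_scale)  # where 0.0 lands in uint8

print(float(scale), int(zero_point))
print(manual_scale, manual_zero_point)
```

In the prepared model, one such observer sits after each quantizable operation, and `convert` reads its parameters to fix the scale and zero point used at inference time.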

## Quantization-aware training

Quantization-aware training typically results in the best accuracy. During training, float values are rounded to mimic their int8 equivalents, but the computations are still done in floating point. That is, the weight adjustments made during training are “aware” that the model will eventually be quantized. The following code shows how to quantize a model with QAT:


In [9]:
qat_model = LeNet5()

# Attach the QAT qconfig to the model. Without it, prepare_qat() and
# convert() leave the model unchanged (plain Conv2d and Linear layers).
qat_model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')

# prepare_qat() expects a model in training mode and inserts
# fake-quantization modules
qat_model.train()
torch.quantization.prepare_qat(
    qat_model, inplace=True)

# ... train or fine-tune qat_model here with the usual training loop ...

# convert() swaps the fake-quantized modules for int8 ones
qat_model.eval()
torch.quantization.convert(
    qat_model, inplace=True)
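The rounding that QAT applies during training can be sketched with `torch.fake_quantize_per_tensor_affine`, which snaps float values onto the int8 grid but returns floats. The input values and quantization parameters below are hypothetical. The backward pass shows the straight-through behaviour that makes training possible: gradients pass through unchanged for values inside the representable range and are zero for values that were clamped.

```python
import torch

scale, zero_point = 0.1, 0        # hypothetical quantization parameters
quant_min, quant_max = -128, 127  # int8 range

# The last value lies outside the representable range and gets clamped
x = torch.tensor([0.013, 0.26, -0.5, 20.0], requires_grad=True)

# Quantize and immediately dequantize: the result is still float32,
# but every value now sits on the int8 grid defined by scale/zero_point
y = torch.fake_quantize_per_tensor_affine(
    x, scale, zero_point, quant_min, quant_max)
print(y)

# Straight-through gradient: 1 inside the range, 0 where clamped
y.sum().backward()
print(x.grad)
```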

## Example to compare sizes of a model with and without Quantization


In [10]:
import os
import torchvision

# Note: newer torchvision releases prefer the `weights=` argument
# over `pretrained=True`
model_quant = torchvision.models.quantization.mobilenet_v2(pretrained=True, quantize=True)
model_no_quant = torchvision.models.mobilenet_v2(pretrained=True)


def get_model_size(modl):
    # Save the state_dict to disk, report its size in MB, then clean up
    torch.save(modl.state_dict(), "demo.pt")
    print("%.2f MB" %(os.path.getsize("demo.pt")/1e6))
    os.remove('demo.pt')

get_model_size(model_quant)

get_model_size(model_no_quant)



3.63 MB
14.26 MB


## Full Example

In [None]:
# define a floating point model where some layers could be statically quantized
class M(torch.nn.Module):
    def __init__(self):
        super(M, self).__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        # manually specify where tensors will be converted from floating
        # point to quantized in the quantized model
        x = self.quant(x)
        x = self.conv(x)
        x = self.relu(x)
        # manually specify where tensors will be converted from quantized
        # to floating point in the quantized model
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval mode for static quantization logic to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'fbgemm' for server inference and
# 'qnnpack' for mobile inference. Other quantization configurations such
# as selecting symmetric or asymmetric quantization and MinMax or L2Norm
# calibration techniques can be specified here.
model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')

# Fuse the activations to preceding layers, where applicable.
# This needs to be done manually depending on the model architecture.
# Common fusions include `conv + relu` and `conv + batchnorm + relu`
model_fp32_fused = torch.quantization.fuse_modules(model_fp32, [['conv', 'relu']])

# Prepare the model for static quantization. This inserts observers in
# the model that will observe activation tensors during calibration.
model_fp32_prepared = torch.quantization.prepare(model_fp32_fused)

# calibrate the prepared model to determine quantization parameters for activations
# in a real world setting, the calibration would be done with a representative dataset
input_fp32 = torch.randn(4, 1, 4, 4)
model_fp32_prepared(input_fp32)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and zero-point values
# to be used with each activation tensor, and replaces key operators with
# quantized implementations.
model_int8 = torch.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)
res