# Quantization Recipe
source: https://pytorch.org/tutorials/recipes/quantization.html#workflows

* Quantization reduces model size and speeds up inference while keeping accuracy about the same as the original model.
* Quantization can be applied to both server and mobile deployment, but it can be especially important, or even critical, on mobile. A non-quantized model’s size may exceed the limit that an iOS or Android app allows, make the deployment or OTA update take too long, and make inference too slow for a good user experience.

## Introduction
* Quantization is a technique that converts the 32-bit floating-point numbers in the model parameters to 8-bit integers.
* With quantization, the model size and memory footprint can be reduced to about 1/4 of their original size, and inference can be made about 2-4 times faster, while the accuracy stays about the same.
* There are three main approaches, or workflows, for quantizing a model:
    * Post training dynamic quantization
    * Post training static quantization
    * Quantization aware training
* If the model you want to use already has a quantized version, you can use it directly without going through any of the three workflows above. For example, the torchvision library already includes quantized versions of MobileNet v2, ResNet 18, ResNet 50, Inception v3, and GoogleNet, among others. We treat using a pretrained quantized model as a fourth workflow, albeit a simple one.
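To make the float-to-integer conversion described above more concrete, here is a minimal pure-Python sketch of affine 8-bit quantization. It is an illustration only, not PyTorch's implementation; the helper names `choose_qparams`, `quantize`, and `dequantize` are hypothetical, and the sample weights are made up.

```python
from array import array


def choose_qparams(values, qmin=-128, qmax=127):
    """Pick a scale and zero point that map the value range onto [qmin, qmax]."""
    lo = min(min(values), 0.0)  # keep 0.0 inside the range
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = max(qmin, min(qmax, round(qmin - lo / scale)))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integers, clamping to the representable range."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]  # made-up example values
scale, zero_point = choose_qparams(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)

fp32_bytes = len(array("f", weights).tobytes())  # 4 bytes per value
int8_bytes = len(array("b", q).tobytes())        # 1 byte per value
print(q)
print(restored)
print(fp32_bytes, int8_bytes)
```

Each restored value differs from the original by at most about one quantization step (`scale`), and the integer storage takes one quarter of the bytes, which is where the "1/4 of the original size" figure comes from.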



### Workflows
Use one of the four workflows below to quantize a model.


#### 1. Use Pretrained Quantized MobileNet v2

To get the quantized MobileNet v2 model:


In [4]:
import torchvision
model_quantized = torchvision.models.quantization.mobilenet_v2(pretrained=True, quantize=True)




Compare the size of an unquantized MobileNet v2 model with that of its quantized version:

In [6]:
import os
import torch

# Load the unquantized (float32) MobileNet v2 for comparison.
model = torchvision.models.mobilenet_v2(pretrained=True)

def print_model_size(mdl):
    # Save the state dict to a temporary file and report its size on disk.
    torch.save(mdl.state_dict(), "tmp.pt")
    print("%.2f MB" % (os.path.getsize("tmp.pt") / 1e6))
    os.remove("tmp.pt")

print_model_size(model)
print_model_size(model_quantized)

14.24 MB
3.62 MB
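As a quick check of the two sizes printed above against the "about 1/4" claim, the ratio can be computed directly. The numbers are copied from the output above; the variable names are only for illustration.

```python
float_mb = 14.24      # unquantized MobileNet v2, from the output above
quantized_mb = 3.62   # quantized MobileNet v2, from the output above

ratio = float_mb / quantized_mb
print(f"{ratio:.2f}x smaller")
```

The ratio comes out slightly below 4. One plausible reason is that the saved file also holds data that is not stored as 8-bit integers, such as quantization parameters and serialization overhead.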


#### 2. Post Training Dynamic Quantization

* This method converts all the weights in a model from 32-bit floating-point numbers to 8-bit integers.
* It does not convert the activations to int8 until just before the computation on them is performed.
* To apply dynamic quantization, simply call `torch.quantization.quantize_dynamic`:

In [7]:
model_dynamic_quantized = torch.quantization.quantize_dynamic(
    model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)

where `qconfig_spec` specifies the set of submodule types (or submodule names) in `model` to apply quantization to. Here, that is every `torch.nn.Linear` layer.
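The split between weights quantized ahead of time and activations quantized just before the computation can be sketched in plain Python. This is a conceptual illustration under simplifying assumptions (symmetric per-tensor scaling, a single linear layer), not what `quantize_dynamic` does internally; `DynamicQuantLinear`, `symmetric_scale`, and `to_int8` are hypothetical names.

```python
QMAX = 127  # largest magnitude used for signed 8-bit values in this sketch


def symmetric_scale(values):
    """Scale that maps the largest magnitude in `values` onto QMAX."""
    return (max(abs(v) for v in values) or 1.0) / QMAX


def to_int8(values, scale):
    """Round to integers and clamp to [-QMAX, QMAX]."""
    return [max(-QMAX, min(QMAX, round(v / scale))) for v in values]


class DynamicQuantLinear:
    """Hypothetical stand-in for a dynamically quantized linear layer."""

    def __init__(self, weight, bias):
        # Weights are quantized once, ahead of time.
        self.w_scale = symmetric_scale([w for row in weight for w in row])
        self.q_weight = [to_int8(row, self.w_scale) for row in weight]
        self.bias = bias

    def __call__(self, x):
        # Activations are quantized on the fly, using the range of this input.
        x_scale = symmetric_scale(x)
        q_x = to_int8(x, x_scale)
        out = []
        for q_row, b in zip(self.q_weight, self.bias):
            acc = sum(qw * qx for qw, qx in zip(q_row, q_x))  # integer arithmetic
            out.append(acc * self.w_scale * x_scale + b)      # rescale to float
        return out


# Made-up weights and input for illustration.
weight = [[0.5, -0.25, 0.1], [-0.75, 0.3, 0.9]]
bias = [0.05, -0.1]
x = [1.0, -2.0, 0.5]

layer = DynamicQuantLinear(weight, bias)
y_quant = layer(x)
y_float = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]
print(y_quant)
print(y_float)
```

The quantized output stays close to the float reference, while the stored weights are small integers. The activation scale is recomputed on every call, which is the "dynamic" part.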



The full documentation of the `quantize_dynamic` API call is [here](https://pytorch.org/docs/stable/quantization.html#torch.quantization.quantize_dynamic). Three other examples of post training dynamic quantization are [the BERT example](https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html), an [LSTM model example](https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html#test-dynamic-quantization), and another [demo LSTM example](https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html#do-the-quantization).



#### 3. Post Training Static Quantization
* This method converts both the weights and the activations to 8-bit integers beforehand, so there is no on-the-fly conversion of the activations during inference, as there is in dynamic quantization. This improves performance significantly.
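One way to picture "beforehand" is a calibration pass: representative inputs are observed once, the resulting quantization parameters are frozen, and inference then reuses them. The sketch below is a pure-Python illustration of that idea under simple assumptions (a min/max range, unsigned 8-bit output); `MinMaxObserver` here is a hypothetical class written for this example, and the calibration data is made up.

```python
class MinMaxObserver:
    """Hypothetical observer that tracks the running min and max it has seen."""

    def __init__(self):
        self.lo = 0.0
        self.hi = 0.0

    def observe(self, values):
        self.lo = min(self.lo, min(values))
        self.hi = max(self.hi, max(values))

    def qparams(self, qmin=0, qmax=255):
        scale = (self.hi - self.lo) / (qmax - qmin) or 1.0
        zero_point = max(qmin, min(qmax, round(qmin - self.lo / scale)))
        return scale, zero_point


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map floats to integers with fixed parameters, clamping out-of-range values."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


# Calibration: run representative inputs past the observer once, before deployment.
observer = MinMaxObserver()
calibration_batches = [[0.1, 0.9, -0.3], [1.5, -0.6, 0.2], [0.0, 2.0, -1.0]]
for batch in calibration_batches:
    observer.observe(batch)
scale, zero_point = observer.qparams()  # frozen from here on

# Inference: every input reuses the frozen parameters; no range is computed on the fly.
q_a = quantize([0.5, -0.5], scale, zero_point)
q_b = quantize([5.0, -5.0], scale, zero_point)  # outside the calibrated range, so clamped
print(scale, zero_point)
print(q_a, q_b)
```

The trade-off this illustrates is that static quantization skips the per-input range computation, but inputs outside the calibrated range are clamped, so the calibration data should resemble real inputs.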


In [8]:
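backend = "qnnpack"
model.qconfig = torch.quantization.get_default_qconfig(backend)
torch.backends.quantized.engine = backend
# prepare() inserts observers that record activation ranges.
model_static_quantized = torch.quantization.prepare(model, inplace=False)
# In a real workflow, run representative data through the prepared model here to calibrate the observers.
model_static_quantized = torch.quantization.convert(model_static_quantized, inplace=False)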



In [10]:
print_model_size(model_static_quantized)

3.97 MB


A complete model definition and static quantization example is [here](https://pytorch.org/docs/stable/quantization.html#quantization-api-summary). A dedicated static quantization tutorial is [here](https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html).



#### 4. Quantization Aware Training
