# Quantizing a Sparse Model (PTQ)

Quantization can be combined with the sparsity toolkit (`torch.ao.sparsity`) to achieve even better computational performance. There are two main ways of quantizing a sparse model:

1. Post-Training Quantization: Quantizes a model that was already trained and sparsified
2. Quantization-Aware Training: Trains the model with quantization and sparsity in mind

In this notebook we will focus on post-training quantization.

## Post-Training Quantization

This is the simplest way of quantizing a sparse model, as the quantization and sparsification are independent of each other. The general workflow is:

1. Train the sparse model / Sparsify an existing model
2. Squash the sparsity masks
3. Quantize the model as if no sparsity was present
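To make "squashing" concrete, here is a minimal sketch in plain PyTorch. It assumes the sparsity mask is attached to the weight as a parametrization, which is an assumption about the toolkit's internals rather than something this notebook shows. `MaskSketch` is a hypothetical stand-in and not part of `torch.ao.sparsity`.

```python
import torch
from torch import nn
from torch.nn.utils import parametrize


class MaskSketch(nn.Module):
    """Hypothetical stand-in for a sparsity mask: multiplies the weight by a fixed 0/1 mask."""

    def __init__(self, mask):
        super().__init__()
        self.register_buffer("mask", mask)

    def forward(self, weight):
        return weight * self.mask


torch.manual_seed(0)
linear = nn.Linear(8, 4)

# Zero every other input column, keep the rest
mask = torch.ones_like(linear.weight)
mask[:, ::2] = 0

# While the mask is attached, `linear.weight` is recomputed as original * mask
parametrize.register_parametrization(linear, "weight", MaskSketch(mask))
masked_weight = linear.weight.detach().clone()
was_parametrized = parametrize.is_parametrized(linear, "weight")

# "Squash": bake the masked values into the weight and drop the parametrization
parametrize.remove_parametrizations(linear, "weight", leave_parametrized=True)
still_parametrized = parametrize.is_parametrized(linear, "weight")

print(was_parametrized, still_parametrized)
print(torch.equal(linear.weight, masked_weight))
```

After squashing, the layer is an ordinary `nn.Linear` whose weight happens to contain zeros, which is why the quantization step does not need to know about the mask.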

Below we follow a more detailed flow step by step.

### Step 1: Create a model

In [1]:
import torch
from torch import nn
import torch.quantization as tq

in_features = 7
num_classes = 10

def make_model():
    model = nn.Sequential(
        tq.QuantStub(),
        nn.Linear(in_features, 32),
        nn.ReLU(),
        nn.Linear(32, 256),
        nn.ReLU(),
        nn.Linear(256, 32),
        nn.ReLU(),
        nn.Linear(32, num_classes),
        tq.DeQuantStub()
    )
    return model

model = make_model()
print(model)

Sequential(
  (0): QuantStub()
  (1): Linear(in_features=7, out_features=32, bias=True)
  (2): ReLU()
  (3): Linear(in_features=32, out_features=256, bias=True)
  (4): ReLU()
  (5): Linear(in_features=256, out_features=32, bias=True)
  (6): ReLU()
  (7): Linear(in_features=32, out_features=10, bias=True)
  (8): DeQuantStub()
)
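As a quick sanity check, the sketch below illustrates that `QuantStub` and `DeQuantStub` should behave as pass-through layers until the model is prepared and converted, so they do not affect float training or sparsification. The small `core` network is a hypothetical example, not the model defined above.

```python
import torch
from torch import nn
import torch.quantization as tq

torch.manual_seed(0)

# A small float network, and the same network wrapped in quantization stubs
core = nn.Sequential(nn.Linear(7, 32), nn.ReLU(), nn.Linear(32, 10))
wrapped = nn.Sequential(tq.QuantStub(), core, tq.DeQuantStub())

x = torch.randn(4, 7)

# Before prepare/convert, the stubs return their input unchanged
same_output = torch.equal(wrapped(x), core(x))
print(same_output)
```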


### Step 2: Attach the model to a sparsifier and step

*Note: At this step you can follow the "sparse training" flow without thinking about the quantization just yet.*

In [2]:
import copy
from torch.ao import sparsity

# Per-layer sparsity configuration

sparse_config = [
    {'module': model[1], 'sparsity_level': 0.7, 'sparse_block_shape': (1, 4), 'zeros_per_block': 4},
    {'module': model[3], 'sparsity_level': 0.9, 'sparse_block_shape': (1, 4), 'zeros_per_block': 4},
    # The following layers will take default parameters
    model[5],
]

sparse_defaults = {
    'sparsity_level': 0.8,
    'sparse_block_shape': (1, 4),
    'zeros_per_block': 4
}

# Create a sparsifier and attach a model to it
sparsifier = sparsity.WeightNormSparsifier(**sparse_defaults)
sparsifier.prepare(model, config=sparse_config)
sparsifier.step()  # Sparsify the model
sparsifier.squash_mask()

# Save the model for future benchmarking
model_fp = copy.deepcopy(model)



In [3]:
# Show the sparsities achieved
for name, m in model.named_modules():
    if isinstance(m, nn.Linear):
        sparsity_level = (m.weight == 0).float().mean()
        print(f'Sparsity in model[{name}]: {sparsity_level:.2%}')

Sparsity in model[1]: 70.54%
Sparsity in model[3]: 89.99%
Sparsity in model[5]: 79.98%
Sparsity in model[7]: 0.00%
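The idea behind block-wise norm pruning can be sketched in plain PyTorch. This is only an illustration under simplifying assumptions: `block_sparsify` is a hypothetical helper, it requires `in_features` to be divisible by the block width, and it zeroes whole `1 x block` blocks (the case where `zeros_per_block` equals the block size). It is not the actual `WeightNormSparsifier` implementation.

```python
import torch


def block_sparsify(weight, sparsity_level, block=4):
    """Zero the (1 x block) blocks with the smallest L2 norm (hypothetical helper)."""
    out_f, in_f = weight.shape
    assert in_f % block == 0, "sketch assumes in_features is divisible by the block width"

    # View each row as a sequence of blocks and score every block by its norm
    blocks = weight.reshape(out_f, in_f // block, block)
    norms = blocks.norm(dim=-1)

    # Pick the lowest-norm blocks until the target sparsity level is reached
    num_zero = int(round(sparsity_level * norms.numel()))
    keep = torch.ones(norms.numel())
    if num_zero > 0:
        idx = torch.topk(norms.flatten(), num_zero, largest=False).indices
        keep[idx] = 0
    keep = keep.reshape(norms.shape).unsqueeze(-1)

    return (blocks * keep).reshape(out_f, in_f)


torch.manual_seed(0)
w = torch.randn(8, 16)
w_sparse = block_sparsify(w, sparsity_level=0.75)
achieved = (w_sparse == 0).float().mean().item()
print(f"Sparsity: {achieved:.2%}")
```

Because pruning happens in whole blocks, the achievable sparsity is quantized to multiples of one block, which is one reason a measured level can differ slightly from the requested one.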


### Step 3: Quantize the model

Now that the sparse model is created, we can run post-training quantization on it. Optimized sparse quantized kernels are available under `torch.ao.nn.sparse`.

The quantization sub-flow is as follows:

1. Prepare and calibrate the model
2. Create a custom mapping for the quantized kernels
    - The mapping should be from `nn.Linear` to `torch.ao.nn.sparse.quantized.Linear`.
    - This step makes sure that we use the accelerated sparse quantized kernels instead of the regular quantized kernels.
3. Call the existing `torch.quantization.convert` with the `mapping` argument to quantize the model
    - **Note:** As a temporary measure, the shape of the zero blocks is currently communicated through a context manager (`torch.ao.nn.sparse.quantized.utils.LinearBlockSparsePattern`). This will be removed in the near future.
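The custom mapping is an ordinary dictionary from float module classes to their quantized replacements. The sketch below shows how overriding one entry works without touching the library defaults. `SparseLinearStandIn` is a hypothetical placeholder used only so the example runs without the sparse kernels; in the real flow the value would be the sparse quantized linear class.

```python
from torch import nn
import torch.quantization as tq


class SparseLinearStandIn(nn.Module):
    """Hypothetical placeholder for a sparse quantized linear module."""


# Each call returns a fresh copy of the default float -> quantized mapping
default_mapping = tq.get_default_static_quant_module_mappings()
custom_mapping = tq.get_default_static_quant_module_mappings()

# Override only the entry for nn.Linear; all other entries keep their defaults
custom_mapping[nn.Linear] = SparseLinearStandIn

print(default_mapping[nn.Linear])
print(custom_mapping[nn.Linear])
```

Since only the `nn.Linear` entry changes, any other module types in the model are still converted with their default quantized counterparts.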

In [4]:
import torch.quantization as tq
import torch.ao.nn.sparse.quantized as ao_qnn
from torch.ao.nn.sparse.quantized.utils import LinearBlockSparsePattern

model_qsp = copy.deepcopy(model)

# Step 1: Prepare and calibrate
model_qsp.qconfig = tq.get_default_qconfig()
tq.prepare(model_qsp, inplace=True)
model_qsp(torch.randn(128, in_features));

# Step 2: Create a custom mapping
#         A dynamic mapping to `torch.ao.nn.sparse.quantized.dynamic.Linear` can be used here instead
sparse_mapping = tq.get_default_static_quant_module_mappings()
sparse_mapping[nn.Linear] = ao_qnn.Linear

# Step 3: Convert the model
with LinearBlockSparsePattern(1, 4):
    tq.convert(model_qsp, inplace=True, mapping=sparse_mapping)



The model is now quantized and uses the sparse quantized kernels.

In [15]:
print(model_qsp)

Sequential(
  (0): Quantize(scale=tensor([0.0458]), zero_point=tensor([64]), dtype=torch.quint8)
  (1): SparseQuantizedLinear(in_features=7, out_features=32, scale=0.031617671251297, zero_point=62, qscheme=torch.per_channel_affine)
  (2): ReLU()
  (3): SparseQuantizedLinear(in_features=32, out_features=256, scale=0.01034998707473278, zero_point=74, qscheme=torch.per_channel_affine)
  (4): ReLU()
  (5): SparseQuantizedLinear(in_features=256, out_features=32, scale=0.0019278707914054394, zero_point=63, qscheme=torch.per_channel_affine)
  (6): ReLU()
  (7): SparseQuantizedLinear(in_features=32, out_features=10, scale=0.0022666696459054947, zero_point=77, qscheme=torch.per_channel_affine)
  (8): DeQuantize()
)
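One reason sparsification and quantization compose cleanly is that affine quantization represents zero exactly: a pruned weight maps to `zero_point` and dequantizes back to `0.0`. The sketch below checks this with plain per-tensor quantization. The `scale` and `zero_point` values are arbitrary assumptions chosen for illustration, and the sparse kernels are not involved.

```python
import torch

torch.manual_seed(0)
w = torch.randn(4, 8)
w[:, :4] = 0  # pretend the first half of every row was pruned

# Arbitrary quantization parameters, chosen only for this illustration
scale, zero_point = 0.05, 10
q = torch.quantize_per_tensor(w, scale, zero_point, torch.qint8)
w_roundtrip = q.dequantize()

# Pruned entries survive the round trip exactly; the others carry rounding error
zeros_preserved = bool((w_roundtrip[:, :4] == 0).all())
max_error = (w_roundtrip - w).abs().max().item()
print(zeros_preserved, max_error)
```

The same property holds for per-channel affine quantization, since each channel still has an integer zero point.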
