# Mixed-Precision Post-Training Quantization in PyTorch using the Model Compression Toolkit (MCT)
[Run this tutorial in Google Colab](https://colab.research.google.com/github/sony/model_optimization/blob/main/tutorials/notebooks/mct_features_notebooks/pytorch/example_pytorch_mixed_precision_ptq.ipynb)

## Overview
This quick-start guide explains how to use the **Model Compression Toolkit (MCT)** to quantize a PyTorch model with post-training mixed-precision quantization. Mixed-precision quantization assigns different bit-widths to different layers based on their impact on the model's output. We will load a pre-trained model and quantize it using MCT. Finally, we will evaluate the quantized model and export it to an ONNX file.

## Summary
In this tutorial, we will cover:

1. Loading and preprocessing ImageNet’s validation dataset.
2. Constructing an unlabeled representative dataset.
3. Applying mixed-precision post-training quantization to the model's weights using MCT.
4. Evaluating the accuracy of the floating-point and the quantized models.
5. Exporting the quantized model to ONNX.

## Setup
Install the relevant packages:

In [None]:
!pip install -q torch torchvision

In [None]:
import importlib.util
if not importlib.util.find_spec('model_compression_toolkit'):
    !pip install model_compression_toolkit

In [None]:
import torch
from torch.utils.data import DataLoader
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
from torchvision.datasets import ImageNet
import numpy as np
import random

Load a pre-trained MobileNetV2 model from torchvision in 32-bit floating-point precision.

In [None]:
weights = MobileNet_V2_Weights.IMAGENET1K_V2

float_model = mobilenet_v2(weights=weights)

## Dataset preparation
### Download ImageNet validation set
Download only the validation split of the ImageNet dataset.

**Note** that for demonstration purposes we use the validation set for the model quantization routines. Usually, a subset of the training dataset is used, but loading it is a heavy procedure that is unnecessary for the sake of this demonstration.

This step may take several minutes...

In [None]:
import os

if not os.path.isdir('imagenet'):
    !mkdir imagenet
    !wget -P imagenet https://image-net.org/data/ILSVRC/2012/ILSVRC2012_devkit_t12.tar.gz
    !wget -P imagenet https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar

Extract the ImageNet validation dataset using the torchvision `datasets` module.

In [None]:
dataset = ImageNet(root='./imagenet', split='val', transform=weights.transforms())

## Representative Dataset
For quantization with MCT, we need to define a representative dataset, which is required by the post-training quantization (PTQ) algorithm. It is a generator function that, on each iteration, yields a list containing a batch of images:

In [None]:
batch_size = 50
n_iter = 10

dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

def representative_dataset_gen():
    dataloader_iter = iter(dataloader)
    for _ in range(n_iter):
        yield [next(dataloader_iter)[0]]
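
The generator pattern above can also be written with `itertools.islice`. The sketch below is a minimal variant, not part of MCT: the helper name `make_representative_dataset_gen` and the toy loader are made up for illustration, and a plain list of `(images, labels)` pairs stands in for the `DataLoader`. One difference worth noting is that `islice` simply stops if the loader holds fewer than `n_iter` batches, whereas calling `next()` on an exhausted iterator inside a generator raises a `RuntimeError`.

```python
from itertools import islice


def make_representative_dataset_gen(loader, n_iter):
    """Return a zero-argument generator function over the first n_iter batches."""
    def representative_dataset_gen():
        # A fresh iterator is created on every call, so the generator
        # function can be consumed more than once.
        for images, _labels in islice(iter(loader), n_iter):
            yield [images]  # each batch is wrapped in a list
    return representative_dataset_gen


# Stand-in for a DataLoader: any iterable of (images, labels) batches (made-up data).
toy_loader = [([float(i)] * 4, [i]) for i in range(5)]
toy_gen = make_representative_dataset_gen(toy_loader, n_iter=3)

first_pass = list(toy_gen())
second_pass = list(toy_gen())
print(len(first_pass), len(second_pass))
```

Both passes produce the same number of batches because each call to `toy_gen()` restarts from a new iterator.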


## Target Platform Capabilities (TPC)
In addition, MCT optimizes models for dedicated hardware platforms using Target Platform Capabilities (TPC).

**Note:** To apply mixed-precision quantization to specific layers, the TPC must define different bit-width options for those layers. For more details, please refer to our [documentation](https://sonysemiconductorsolutions.github.io/mct-model-optimization/api/api_docs/modules/target_platform_capabilities.html). In this example, we use the default PyTorch TPC, which supports 2-, 4-, and 8-bit options for convolution and linear layers.

In [None]:
import model_compression_toolkit as mct

# Get a FrameworkQuantizationCapabilities object that models the hardware platform for quantized-model inference.
# Here, for example, we use the default platform attached to the PyTorch layers representation.
target_platform_cap = mct.get_target_platform_capabilities('pytorch', 'default')
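
To give a feel for what the 2-, 4-, and 8-bit options trade off, here is a minimal sketch of generic uniform symmetric quantization applied to a small list of made-up weights. It is an illustration only and does not reproduce MCT's quantizers; the helpers `quantize_symmetric` and `mse` and the values in `toy_weights` are assumptions introduced for this example.

```python
def quantize_symmetric(values, n_bits):
    """Uniform symmetric quantization of a list of floats (illustrative only)."""
    max_abs = max(abs(v) for v in values)
    q_max = 2 ** (n_bits - 1) - 1  # e.g. 127 for 8 bits, 1 for 2 bits
    scale = max_abs / q_max if max_abs > 0 else 1.0
    # Round to the nearest integer level, clip to the valid range, map back to float.
    return [max(-q_max, min(q_max, round(v / scale))) * scale for v in values]


def mse(a, b):
    """Mean squared error between two equally sized lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


toy_weights = [-0.9, -0.35, -0.1, 0.0, 0.05, 0.2, 0.5, 0.8]  # made-up values
errors = {bits: mse(toy_weights, quantize_symmetric(toy_weights, bits))
          for bits in (2, 4, 8)}

for bits, err in errors.items():
    print(f"{bits}-bit quantization error (MSE): {err:.6f}")
```

On this toy input the error shrinks as the bit-width grows, which is why a mixed-precision search tends to keep higher bit-widths for layers whose quantization error hurts the output most.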

## Mixed Precision Configurations
We will create a `MixedPrecisionQuantizationConfig` that defines the search options for mixed-precision:
1. **Number of images** - Determines how many images from the representative dataset are used to search for the bit-width configuration. Using more images generally makes the search result more reliable but increases search time.
2. **Hessian-based scores** (`use_hessian_based_scores`) - Can improve the quality of the bit-width configuration at the cost of longer search time. This option is not used in this example.

MCT will determine a bit-width for each layer and quantize the model based on this configuration. The candidate bit-widths for quantization must be defined in the TPC.

In [None]:
configuration = mct.core.CoreConfig(
    mixed_precision_config=mct.core.MixedPrecisionQuantizationConfig(
        num_of_images=32,
        use_hessian_based_scores=False))

To enable mixed-precision quantization, we define the desired compression ratio. In this example, we will configure the model to compress the weights to **75% of the size of the 8-bit model's weights**. To achieve this, we first retrieve the model's resource utilization information, `resource_utilization_data`, and read its weights memory. Then, we create a `ResourceUtilization` object that enforces the size constraint on the weights' memory. The constraint applies only to the quantized layers and attributes (e.g., Conv2D kernels, but not biases).

In [None]:
# Get resource utilization information to constrain the model's memory size.
resource_utilization_data = mct.core.pytorch_resource_utilization_data(
    float_model,
    representative_dataset_gen,
    configuration,
    target_platform_capabilities=target_platform_cap)

weights_compression_ratio = 0.75  # Target: 75% of the weights memory size when quantized with 8 bits.
# Create a ResourceUtilization object that constrains the weights memory.
resource_utilization = mct.core.ResourceUtilization(resource_utilization_data.weights_memory * weights_compression_ratio)
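
The arithmetic behind this constraint can be sketched with plain Python. The example assumes that weights memory is measured in bytes and uses a hypothetical weight count rather than MobileNetV2's real one; the helpers `target_weights_bytes` and `average_bits` are illustrative and are not MCT functions.

```python
def target_weights_bytes(num_quantized_weights, ratio, baseline_bits=8):
    """Size budget in bytes: `ratio` times the weights size at `baseline_bits`."""
    baseline_bytes = num_quantized_weights * baseline_bits / 8
    return baseline_bytes * ratio


def average_bits(budget_bytes, num_quantized_weights):
    """Average bit-width per weight that exactly fills the budget."""
    return budget_bytes * 8 / num_quantized_weights


toy_num_weights = 1_000_000  # hypothetical count, not MobileNetV2's
budget = target_weights_bytes(toy_num_weights, ratio=0.75)
avg_bits = average_bits(budget, toy_num_weights)
print(f"budget: {budget:.0f} bytes, average bit-width: {avg_bits:.1f}")
```

Under these assumptions, a 0.75 ratio corresponds to an average of 6 bits per quantized weight, so some layers have to drop below 8 bits for the constraint to be met.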

## Run Post-Training Quantization with Mixed Precision
Now, we are ready to use MCT to quantize the model.

In [None]:
quantized_model, quantization_info = mct.ptq.pytorch_post_training_quantization(
    in_module=float_model,
    representative_data_gen=representative_dataset_gen,
    target_resource_utilization=resource_utilization,
    core_config=configuration,
    target_platform_capabilities=target_platform_cap)
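
Conceptually, the mixed-precision step has to pick one bit-width per layer so that the total weights size fits the target. The toy greedy search below illustrates that idea only; it is not MCT's search algorithm, which is more elaborate. The layer names, weight counts, and sensitivity scores are made up, and `greedy_bit_allocation` is a hypothetical helper.

```python
def greedy_bit_allocation(layers, budget_bytes, candidates=(8, 4, 2)):
    """Toy search: start every layer at the highest bit-width, then lower the
    least sensitive layers until the total weights size fits the budget."""
    bits = {name: candidates[0] for name in layers}

    def total_bytes():
        return sum(layers[n]["num_weights"] * bits[n] / 8 for n in layers)

    # Visit layers from least to most sensitive (hypothetical, precomputed scores).
    for name in sorted(layers, key=lambda n: layers[n]["sensitivity"]):
        for lower in candidates[1:]:
            if total_bytes() <= budget_bytes:
                return bits
            bits[name] = lower
    if total_bytes() > budget_bytes:
        raise ValueError("budget cannot be met with the given candidates")
    return bits


toy_layers = {
    "conv1": {"num_weights": 1_000, "sensitivity": 0.9},
    "conv2": {"num_weights": 8_000, "sensitivity": 0.1},
    "fc": {"num_weights": 4_000, "sensitivity": 0.4},
}
size_8bit = sum(layer["num_weights"] for layer in toy_layers.values())  # 1 byte per weight
allocation = greedy_bit_allocation(toy_layers, budget_bytes=0.75 * size_8bit)
print(allocation)
```

In this toy setup, only the least sensitive layer is lowered to 4 bits, which is already enough to meet the 75% budget while the more sensitive layers stay at 8 bits.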

## Model Evaluation
To evaluate our models, we first create a data loader for the validation dataset.

In [None]:
val_dataloader = DataLoader(dataset, batch_size=50, shuffle=False, num_workers=16, pin_memory=True)

Now, we will create a function for evaluating a model.

In [None]:
from tqdm import tqdm


def evaluate(model, testloader):
    """
    Evaluate a model using a test loader.
    """
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()  # Set the model to evaluation mode
    correct = 0
    total = 0
    with torch.no_grad():
        for data in tqdm(testloader):
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
    val_acc = (100 * correct / total)
    print('Accuracy: %.2f%%' % val_acc)
    return val_acc

Let's start with the floating-point model evaluation.

In [None]:
evaluate(float_model, val_dataloader)

Next, let's evaluate the quantized model:

In [None]:
evaluate(quantized_model, val_dataloader)

Now, we can export the quantized model to ONNX:

In [None]:
mct.exporter.pytorch_export_model(quantized_model, save_model_path='qmodel.onnx', repr_dataset=representative_dataset_gen)

## Conclusion

In this tutorial, we demonstrated how to quantize a classification model using MCT's mixed-precision feature.
MCT can deliver competitive results across a wide range of tasks and network architectures. For more details, see the [paper](https://arxiv.org/abs/2109.09113).

## Copyrights
Copyright 2024 Sony Semiconductor Israel, Inc. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
