# Post Training Mixed Precision Quantization using the Model Compression Toolkit - A Quick-Start Guide

[Run this tutorial in Google Colab](https://colab.research.google.com/github/sony/model_optimization/blob/main/tutorials/notebooks/mct_features_notebooks/keras/example_keras_mobilenet_mixed_precision.ipynb)

## Overview


This tutorial demonstrates how to quantize a pre-trained model using the **Model Compression Toolkit (MCT)** with **Mixed Precision**.

Mixed Precision quantizes different layers with different bit-widths, so that the model fits a given set of hardware constraints.

As we will see, mixed-precision quantization is a simple yet effective quantization scheme for compressing a model to a desired model size.
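
To make the idea concrete, here is a minimal sketch of how per-layer bit-widths translate into weights memory. The layer names and parameter counts are hypothetical and are not taken from MobileNetV2; the helper `weights_memory_bytes` is ours, not part of MCT.

```python
# Hypothetical layers: (name, number of weight parameters). Illustrative values only.
layers = [("conv1", 864), ("block1_dw", 288), ("block1_pw", 512), ("classifier", 1_280_000)]

def weights_memory_bytes(layers, bits_per_layer):
    """Total weights memory in bytes when each layer uses its own bit-width."""
    total_bits = sum(n_params * bits_per_layer[name] for name, n_params in layers)
    return total_bits / 8

# Baseline: every layer at 8 bits.
uniform_8bit = {name: 8 for name, _ in layers}
# Mixed precision: only the largest layer drops to 4 bits.
mixed = dict(uniform_8bit, classifier=4)

size_8bit = weights_memory_bytes(layers, uniform_8bit)
size_mixed = weights_memory_bytes(layers, mixed)
print(f"8-bit: {size_8bit:.0f} B, mixed: {size_mixed:.0f} B, ratio: {size_mixed / size_8bit:.2f}")
```

Lowering the bit-width of a single large layer already removes most of the memory, which is why a per-layer assignment can meet a size target without reducing precision everywhere.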

## Summary

In this tutorial we will cover:

1. Post-Training Mixed-Precision Quantization using MCT.
2. Loading and preprocessing ImageNet's validation dataset.
3. Constructing an unlabeled representative dataset.
4. Accuracy evaluation of the floating-point and the quantized models.

## Setup

Install and import the relevant packages:

In [None]:
TF_VER = '2.14.0'

!pip install -q tensorflow=={TF_VER}
!pip install -q mct-nightly

In [None]:
import tensorflow as tf
import keras
import model_compression_toolkit as mct
import os

## Dataset preparation

Download only the validation split of the ImageNet dataset.

**Note** that for demonstration purposes we use the validation set for the model quantization and mixed precision routines. Usually, a subset of the training dataset is used, but loading it is a heavy procedure that is unnecessary for the sake of this demonstration.

This step may take several minutes...

In [None]:
if not os.path.isdir('imagenet'):
    !mkdir imagenet
    !wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_devkit_t12.tar.gz
    !mv ILSVRC2012_devkit_t12.tar.gz imagenet/
    !wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar
    !mv ILSVRC2012_img_val.tar imagenet/

Extract the ImageNet validation dataset using the torchvision `datasets` module.

In [None]:
import torchvision
if not os.path.isdir('imagenet/val'):
    torchvision.datasets.ImageNet(root='./imagenet', split='val')

Define the required preprocessing method for the pretrained model,
and create a generator for the representative dataset, which is required for mixed precision quantization.

The representative dataset is used for collecting statistics on the inference outputs of all layers in the model.
 
In order to decide on the size of the representative dataset, we configure the batch size and the number of calibration iterations.
This gives us the total number of samples that will be used during PTQ (batch_size x n_iter).
In this example we set `batch_size = 50` and `n_iter = 10`, resulting in a total of 500 representative images.
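
The sample count can be checked with a small stand-alone sketch that uses random arrays in place of real images. The function name `make_representative_dataset`, the `image_shape` and `seed` parameters, and the reduced image size are assumptions made only for this illustration.

```python
import numpy as np

def make_representative_dataset(n_iter=10, batch_size=50, image_shape=(224, 224, 3), seed=0):
    """Return a generator function that yields n_iter batches.

    Each batch is wrapped in a list holding one array per model input.
    """
    rng = np.random.default_rng(seed)

    def representative_dataset():
        for _ in range(n_iter):
            # Random values stand in for preprocessed images.
            batch = rng.uniform(-1.0, 1.0, size=(batch_size, *image_shape))
            yield [batch.astype(np.float32)]

    return representative_dataset

# A small image size is used here only to keep the sketch light.
rep_data = make_representative_dataset(n_iter=10, batch_size=50, image_shape=(8, 8, 3))
total_samples = sum(batch[0].shape[0] for batch in rep_data())
print(total_samples)  # batch_size * n_iter = 500
```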

Please ensure that the dataset path has been set correctly.

In [None]:
def imagenet_preprocess_input(images, labels):
    """
    Use the keras applications preprocess function.
    Args:
        images: input image batch.
        labels: input label batch.
    Returns:
        preprocessed images & labels
    """
    return tf.keras.applications.mobilenet_v2.preprocess_input(images), labels
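
For intuition, the MobileNetV2 preprocessing in Keras rescales pixel values from [0, 255] to [-1, 1]. The NumPy sketch below reproduces that scaling under this assumption; `scale_to_minus_one_one` is a hypothetical helper and is not a replacement for the Keras function.

```python
import numpy as np

def scale_to_minus_one_one(images):
    """Map pixel values from [0, 255] to [-1, 1]."""
    return images.astype(np.float32) / 127.5 - 1.0

# The darkest, middle, and brightest pixel values.
pixels = np.array([0.0, 127.5, 255.0])
scaled = scale_to_minus_one_one(pixels)
print(scaled)
```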

In [None]:
def get_representative_dataset(n_iter=10, batch_size=50):
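    """
    Load the local ImageNet validation set and create the representative dataset generator.
    Args:
        n_iter: number of calibration iterations (batches to yield).
        batch_size: number of images per batch.
    Returns:
        representative dataset generator for calibration
    """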
    print('loading dataset, this may take a few minutes ...')
    dataset = tf.keras.utils.image_dataset_from_directory(
        directory='./imagenet/val',
        batch_size=batch_size,
        image_size=[224, 224],
        shuffle=True,
        crop_to_aspect_ratio=True,
        interpolation='bilinear')
    dataset = dataset.map(lambda x, y: (imagenet_preprocess_input(x, y)))

    def representative_dataset():
        for _ in range(n_iter):
            yield [dataset.take(1).get_single_element()[0].numpy()]

    return representative_dataset

representative_dataset_gen = get_representative_dataset()

## Model Post-Training Mixed Precision quantization using MCT

This is the main part in which we quantize our model.

First, we load a pre-trained MobileNetV2 model from Keras, in 32-bit floating-point precision.

In [None]:
from keras.applications.mobilenet_v2 import MobileNetV2
float_model = MobileNetV2()

Next, we need to define a **mixed precision quantization configuration** with possible mixed precision search options.
MCT will search for a mixed precision solution (namely, a bit-width assignment for each layer)
and quantize the model according to this configuration.
**Note** that you can skip this part if you prefer to use the default quantization settings.
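
To illustrate what a bit-width assignment search looks like, here is a toy brute-force sketch. It is not MCT's search procedure, which is more sophisticated and is not reproduced here. The layer sizes, the sensitivity values, and the error proxy are all hypothetical.

```python
from itertools import product

# Hypothetical layers: name -> (number of weights, sensitivity to quantization error).
layers = {"conv_a": (1_000, 5.0), "conv_b": (4_000, 1.0), "dense": (10_000, 0.5)}
candidate_bits = (8, 4, 2)

def memory_bytes(assignment):
    """Weights memory in bytes for a given per-layer bit-width assignment."""
    return sum(layers[name][0] * bits for name, bits in assignment.items()) / 8

def estimated_error(assignment):
    """Toy proxy: error grows as bits shrink, weighted by each layer's sensitivity."""
    return sum(layers[name][1] * 2.0 ** (-bits) for name, bits in assignment.items())

def search(budget_bytes):
    """Try every assignment and keep the lowest-error one that fits the budget."""
    best = None
    for bits in product(candidate_bits, repeat=len(layers)):
        assignment = dict(zip(layers, bits))
        if memory_bytes(assignment) > budget_bytes:
            continue
        if best is None or estimated_error(assignment) < estimated_error(best):
            best = assignment
    return best

# Budget: 75% of the memory used when every layer is quantized to 8 bits.
budget = 0.75 * memory_bytes({name: 8 for name in layers})
best_assignment = search(budget)
print(best_assignment, memory_bytes(best_assignment), budget)
```

In this toy setup the search keeps the small, sensitive layers at 8 bits and lowers only the large, less sensitive layer. Exhaustive enumeration is feasible only for a handful of layers, so it serves here purely as an illustration of the objective and the constraint.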

In addition, we need to define a `TargetPlatformCapability` object, representing the HW specifications on which we wish to eventually deploy our quantized model.
The candidate bit-widths for quantization are defined in the target platform model.

Finally, we need to set the **hardware constraints** which we want our quantized model to fit into.
These are defined using a `ResourceUtilization` object.
In this example, we set a **weights memory** constraint: we compute the memory size of the model's weights when quantized to fixed 8-bit precision, and set the target to 75% of that size.

In [None]:
# Enable Mixed-Precision config. For the sake of running faster, the hessian-based scores are disabled in this tutorial
mp_config = mct.core.MixedPrecisionQuantizationConfig(
    num_of_images=32,
    use_hessian_based_scores=False)
core_config = mct.core.CoreConfig(mixed_precision_config=mp_config)
# Specify the target platform capability (TPC)
tpc = mct.get_target_platform_capabilities("tensorflow", 'imx500', target_platform_version='v1')

# Get Resource Utilization information to constrain the model's memory size.
# This retrieves a ResourceUtilization object describing each resource metric, which we use to derive the target memory size.
resource_utilization_data = mct.core.keras_resource_utilization_data(float_model,
                                   representative_dataset_gen,
                                   core_config=core_config,
                                   target_platform_capabilities=tpc)

# Set a constraint for the Resource Utilization metrics.
# Create a ResourceUtilization object to limit the returned model's size. Note that this value affects only layers and attributes
# that should be quantized (for example, the kernel of Conv2D in Keras will be affected by this value,
# while the bias will not).
weights_compression_ratio = 0.75  # About 0.75 of the model's weights memory size when quantized with 8 bits.
resource_utilization = mct.core.ResourceUtilization(resource_utilization_data.weights_memory * weights_compression_ratio)
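
The arithmetic behind this target can be sketched in plain Python. The weight count below is a rough, illustrative figure and is not read from the model; in the notebook the 8-bit size comes from `resource_utilization_data.weights_memory`.

```python
# Assumption: roughly 3.5 million quantizable weights, an illustrative figure only.
num_weights = 3_500_000

float32_bytes = num_weights * 4  # 32-bit floating point: 4 bytes per weight
int8_bytes = num_weights * 1     # fixed 8-bit quantization: 1 byte per weight

weights_compression_ratio = 0.75
target_bytes = int8_bytes * weights_compression_ratio

# Overall compression of the weights relative to the floating-point model.
compression_vs_float = float32_bytes / target_bytes

print(f"float32: {float32_bytes / 1e6:.2f} MB")
print(f"8-bit:   {int8_bytes / 1e6:.2f} MB")
print(f"target:  {target_bytes / 1e6:.2f} MB ({compression_vs_float:.2f}x smaller than float32)")
```

A target of 75% of the 8-bit size corresponds to an average of 6 bits per weight, so the weights end up more than 4x smaller than in the 32-bit model.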

### Run model Post-Training Quantization
Finally, we quantize our model using MCT's post-training quantization API.

In [None]:
quantized_model, quantization_info = mct.ptq.keras_post_training_quantization(
    float_model,
    representative_dataset_gen,
    target_resource_utilization=resource_utilization,
    core_config=core_config,
    target_platform_capabilities=tpc)

That's it! Our model is now quantized.

## Models evaluation

In order to evaluate our models, we first need to load the validation dataset. As before, let's assume we downloaded the ImageNet validation dataset to a folder with the path below:

In [None]:
def get_validation_dataset():
    """
    Generate validation dataset
    Returns:
         the validation dataset
    """
    dataset = tf.keras.utils.image_dataset_from_directory(
        directory='./imagenet/val',
        batch_size=50,
        image_size=[224, 224],
        shuffle=False,
        crop_to_aspect_ratio=True,
        interpolation='bilinear')
    dataset = dataset.map(lambda x, y: (imagenet_preprocess_input(x, y)))
    return dataset

evaluation_dataset = get_validation_dataset()

Let's start with the floating-point model evaluation.

We need to compile the model before evaluation and set the loss and the evaluation metric:

In [None]:
float_model.compile(loss=keras.losses.SparseCategoricalCrossentropy(), metrics=["accuracy"])
results = float_model.evaluate(evaluation_dataset)

Finally, let's evaluate the quantized model:

In [None]:
quantized_model.compile(loss=keras.losses.SparseCategoricalCrossentropy(), metrics=["accuracy"])
results = quantized_model.evaluate(evaluation_dataset)

You can see that the accuracy degradation is very small, while the weights are compressed by more than 4x relative to the floating-point model.
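
The comparison between the two models boils down to a difference in top-1 accuracy. The sketch below computes it with NumPy on made-up scores; the arrays are not real model outputs and `top1_accuracy` is a hypothetical helper.

```python
import numpy as np

def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    return float(np.mean(np.argmax(scores, axis=1) == labels))

# Made-up scores for 4 samples and 3 classes.
labels = np.array([0, 2, 1, 1])
float_scores = np.array([[0.9, 0.05, 0.05], [0.1, 0.2, 0.7], [0.2, 0.6, 0.2], [0.3, 0.4, 0.3]])
quant_scores = np.array([[0.8, 0.1, 0.1], [0.2, 0.2, 0.6], [0.2, 0.5, 0.3], [0.4, 0.3, 0.3]])

float_acc = top1_accuracy(float_scores, labels)
quant_acc = top1_accuracy(quant_scores, labels)
degradation = float_acc - quant_acc
print(f"float: {float_acc:.2f}, quantized: {quant_acc:.2f}, degradation: {degradation:.2f}")
```

In the notebook, each `evaluate` call returns the loss followed by the accuracy, so storing the two results under different variable names allows the same subtraction on the real numbers.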

Now, we can export the quantized model in both Keras and TFLite formats:

In [None]:
mct.exporter.keras_export_model(model=quantized_model, save_model_path='qmodel.tflite',
                                serialization_format=mct.exporter.KerasExportSerializationFormat.TFLITE,
                                quantization_format=mct.exporter.QuantizationFormat.FAKELY_QUANT)

mct.exporter.keras_export_model(model=quantized_model, save_model_path='qmodel.keras')
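
As we understand it, a fakely-quantized export stores values as floats that have already been rounded to the quantization grid. The NumPy sketch below shows that quantize-dequantize idea for a symmetric uniform quantizer; it is an assumption-based illustration, not MCT's exporter code, and `fake_quantize` is a hypothetical helper.

```python
import numpy as np

def fake_quantize(weights, n_bits=8):
    """Symmetric uniform quantize-dequantize.

    The output stays in floating point but takes at most 2**n_bits distinct values.
    """
    max_abs = float(np.max(np.abs(weights)))
    if max_abs == 0.0:
        return weights.copy()
    q_max = 2 ** (n_bits - 1) - 1
    scale = max_abs / q_max
    # Round to the nearest integer level, then map back to float.
    q = np.clip(np.round(weights / scale), -q_max - 1, q_max)
    return (q * scale).astype(weights.dtype)

weights = np.linspace(-1.0, 1.0, 1001).astype(np.float32)
fq_weights = fake_quantize(weights, n_bits=4)
max_error = float(np.max(np.abs(fq_weights - weights)))
print(fq_weights.dtype, len(np.unique(fq_weights)), max_error)
```

The dtype is unchanged, but the number of distinct values is bounded by the bit-width, and the rounding error stays within half a quantization step.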

## Conclusion

In this tutorial, we demonstrated how to quantize a pre-trained model using MCT's mixed-precision quantization in a few lines of code. We saw that we can achieve a compression ratio of more than 4x with minimal accuracy degradation.





Copyright 2023 Sony Semiconductor Israel, Inc. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
