# Export a Quantized PyTorch Model with the Model Compression Toolkit (MCT)

[Run this tutorial in Google Colab](https://colab.research.google.com/github/sony/model_optimization/blob/main/tutorials/notebooks/mct_features_notebooks/pytorch/example_pytorch_export.ipynb)

## Overview
This tutorial demonstrates how to export a quantized PyTorch model to the ONNX and TorchScript formats using the Model Compression Toolkit (MCT). It covers creating a simple PyTorch model, applying post-training quantization (PTQ) with MCT, and exporting the quantized model to ONNX and TorchScript. It also shows how to use the exported model for inference.

## Summary:
In this tutorial, we will cover:

1. Constructing a simple PyTorch model for demonstration purposes.
2. Applying post-training quantization to the model using the Model Compression Toolkit.
3. Exporting the quantized model to the ONNX and TorchScript formats.
4. Ensuring compatibility between PyTorch and ONNX during the export process.
5. Using the exported model for inference.

## Setup
To export your quantized model to ONNX format and use it for inference, you will need to install some additional packages. Note that these packages are only required if you plan to export the model to ONNX. If ONNX export is not needed, you can skip this step.

In [None]:
! pip install -q onnx onnxruntime "onnxruntime-extensions<0.14"

Install the Model Compression Toolkit:

In [None]:
import importlib.util  # import the submodule explicitly; `import importlib` alone may not expose `importlib.util`
if not importlib.util.find_spec('model_compression_toolkit'):
    !pip install model_compression_toolkit

In [None]:
import numpy as np
from torchvision.models.mobilenetv2 import mobilenet_v2
import model_compression_toolkit as mct

## Quantize the Model with the Model Compression Toolkit (MCT)
Let's begin the export demonstration by loading a model and quantizing it with MCT. The resulting quantized model is the input to every export step that follows, for both ONNX and TorchScript.

In [None]:
# Create a model
float_model = mobilenet_v2()

# Notice that here the representative dataset is random for demonstration only.
def representative_data_gen():
    yield [np.random.random((1, 3, 224, 224))]


quantized_exportable_model, _ = mct.ptq.pytorch_post_training_quantization(float_model, representative_data_gen=representative_data_gen)
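
The generator above yields a single random batch, which is enough to exercise the API but not to calibrate a real model. A minimal sketch of a multi-batch generator follows. The helper name `make_representative_data_gen`, the batch count, and the random stand-in data are assumptions for illustration; in practice the arrays would come from preprocessed samples of your own dataset.

```python
import numpy as np

def make_representative_data_gen(n_iter=10, batch_size=4, input_shape=(3, 224, 224), seed=0):
    """Return a generator function that yields `n_iter` batches, each wrapped in a list."""
    rng = np.random.default_rng(seed)

    def representative_data_gen():
        for _ in range(n_iter):
            # Random stand-in data; replace with real, preprocessed images.
            batch = rng.random((batch_size, *input_shape), dtype=np.float32)
            yield [batch]

    return representative_data_gen

# Hypothetical usage: three batches of two images each.
representative_data_gen_multi = make_representative_data_gen(n_iter=3, batch_size=2)
batches = list(representative_data_gen_multi())
print(len(batches), batches[0][0].shape, batches[0][0].dtype)
```

Each yielded item is a list with one array per model input, matching the structure of the single-batch generator used in this tutorial.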




### ONNX
The model will be exported in ONNX format, where both weights and activations are represented as floats. Make sure that `onnx` is installed to enable exporting.

Two quantization formats are available for export: MCTQ and FAKELY_QUANT.

#### MCTQ Quantization Format
By default, `mct.exporter.pytorch_export_model` exports the quantized PyTorch model to ONNX using custom quantizers from the `mct_quantizers` module.

In [None]:
# Path of exported model
onnx_file_path = 'model_format_onnx_mctq.onnx'

# Export ONNX model with mctq quantizers.
mct.exporter.pytorch_export_model(
    model=quantized_exportable_model,
    save_model_path=onnx_file_path,
    repr_dataset=representative_data_gen)

Note that the model's size remains unchanged compared to the quantized exportable model, as the weight data types are still represented as floats.
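
One way to check this is to compare file sizes on disk. The sketch below uses only the standard library. `file_size_mb` is a hypothetical helper, and the temporary file stands in for an exported model so that the snippet runs on its own.

```python
import os
import tempfile

def file_size_mb(path):
    """Return the size of the file at `path` in megabytes (10**6 bytes)."""
    return os.path.getsize(path) / 1e6

# Stand-in file of 2,000,000 bytes; with a real export, pass `onnx_file_path` instead.
with tempfile.NamedTemporaryFile(suffix='.onnx', delete=False) as f:
    f.write(b'\0' * 2_000_000)
    demo_path = f.name

demo_size_mb = file_size_mb(demo_path)
print(f'{demo_size_mb:.2f} MB')

# Clean up the stand-in file.
os.remove(demo_path)
```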

#### ONNX Opset Version
By default, the ONNX opset version used is 15. However, this can be adjusted by specifying the `onnx_opset_version` parameter during export.

In [None]:
# Export ONNX model with MCTQ quantizers, overriding the default opset version.
mct.exporter.pytorch_export_model(
    model=quantized_exportable_model,
    save_model_path=onnx_file_path,
    repr_dataset=representative_data_gen,
    onnx_opset_version=16)

#### Using the Exported Model for Inference
To load and perform inference with the ONNX model exported in MCTQ format, pass the session options returned by the `mct_quantizers` function `get_ort_session_options` when creating the ONNX Runtime session.

**Note:** Inference on models exported in this format tends to be slower, with higher latency. Inference on hardware such as the IMX500 does not suffer from this issue.

In [None]:
import mct_quantizers as mctq
import onnxruntime as ort

sess = ort.InferenceSession(onnx_file_path,
                            mctq.get_ort_session_options(),
                            providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])

_input_data = next(representative_data_gen())[0].astype(np.float32)
_model_output_name = sess.get_outputs()[0].name
_model_input_name = sess.get_inputs()[0].name

# Run inference
predictions = sess.run([_model_output_name], {_model_input_name: _input_data})
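
`sess.run` returns a list with one NumPy array per requested output name. For a classifier such as MobileNetV2, that array holds one row of class scores per input image. A minimal sketch of extracting the top-scoring classes follows. The helper name `top_k_classes` is hypothetical, and random scores stand in for `predictions[0]` so that the snippet runs without a model.

```python
import numpy as np

def top_k_classes(logits, k=5):
    """Return the indices of the k highest scores in each row, best first."""
    # argsort is ascending, so reverse the last axis and keep the first k columns.
    return np.argsort(logits, axis=-1)[:, ::-1][:, :k]

rng = np.random.default_rng(0)
# Stand-in for `predictions[0]`: one image, 1000 class scores.
demo_logits = rng.standard_normal((1, 1000)).astype(np.float32)
demo_logits[0, 42] = 100.0  # force a known winner for the demo

top5 = top_k_classes(demo_logits, k=5)
print(top5.shape, top5[0, 0])
```

Because the model in this tutorial has randomly initialized weights and random calibration data, the class indices it produces carry no meaning; the pattern only becomes useful with a trained model.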

#### Fakely-Quantized Format
To export a fakely-quantized model, use the `QuantizationFormat.FAKELY_QUANT` option. In this format, quantization is simulated: weights and activations take quantized values but keep their float data types in the exported model.

In [None]:
import tempfile

# Path of exported model
_, onnx_file_path = tempfile.mkstemp('.onnx')

# Use QuantizationFormat.FAKELY_QUANT for fakely-quantized weights and activations.
mct.exporter.pytorch_export_model(model=quantized_exportable_model,
                                  save_model_path=onnx_file_path,
                                  repr_dataset=representative_data_gen,
                                  quantization_format=mct.exporter.QuantizationFormat.FAKELY_QUANT)

Note that the fakely-quantized model has the same size as the quantized exportable model, as the weights are still represented as floats.
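
To make "fakely quantized" concrete: values are rounded onto a low-bit integer grid and then immediately mapped back to floats, so a tensor keeps its float data type but takes only a small number of distinct values. The NumPy sketch below illustrates the idea with a symmetric per-tensor scheme. It is an illustration under that assumption, not MCT's quantizer implementation, and `fake_quantize` is a hypothetical helper.

```python
import numpy as np

def fake_quantize(x, n_bits=8):
    """Symmetric per-tensor quantize-dequantize; the result stays float32."""
    q_max = 2 ** (n_bits - 1) - 1                    # 127 for 8 bits
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / q_max if max_abs > 0 else 1.0  # guard against all-zero input
    q = np.clip(np.round(x / scale), -q_max, q_max)  # integer grid, still stored as floats
    return (q * scale).astype(np.float32)            # map back to the original value range

rng = np.random.default_rng(0)
weights = rng.standard_normal(1000).astype(np.float32)
fq_weights = fake_quantize(weights, n_bits=8)

print(fq_weights.dtype)                       # data type after fake quantization
print(np.unique(fq_weights).size)             # number of distinct values that remain
print(np.max(np.abs(weights - fq_weights)))   # largest rounding error
```

Since every value is still stored in a 32-bit float, a file holding such weights is no smaller than one holding the original weights, which matches the size note above.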

### TorchScript Format

The model can also be exported in TorchScript format, where weights and activations are quantized but represented as floats (fakely quantized).

In [None]:
# Path of exported model
_, torchscript_file_path = tempfile.mkstemp('.pt')


# Use PytorchExportSerializationFormat.TORCHSCRIPT to export a TorchScript model
# and QuantizationFormat.FAKELY_QUANT for fakely-quantized weights and activations.
mct.exporter.pytorch_export_model(model=quantized_exportable_model,
                                  save_model_path=torchscript_file_path,
                                  repr_dataset=representative_data_gen,
                                  serialization_format=mct.exporter.PytorchExportSerializationFormat.TORCHSCRIPT,
                                  quantization_format=mct.exporter.QuantizationFormat.FAKELY_QUANT)

Note that the fakely-quantized model retains the same size as the quantized exportable model, as the weight data types remain in float format.

## Copyrights:
Copyright 2024 Sony Semiconductor Solutions, Inc. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
