Added capability to quantize a model while exporting through ONNX. (huggingface#6089)

* Added capability to quantize a model while exporting through ONNX.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

We do not support multiple extensions

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Reformat files

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* More quality

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Ensure test_generate_identified_name compares the same object types

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added documentation everywhere on ONNX exporter

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Use pathlib.Path instead of plain-old string

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Use f-string everywhere

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Use the correct parameters for black formatting

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Use Python 3 super() style.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Use packaging.version to ensure installed onnxruntime version match requirements

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fixing imports sorting order.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Missing raise(s)

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added quantization documentation

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix some spelling.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix bad list header format

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
mfuntowicz committed Jul 29, 2020
1 parent 25de74c commit 6c00285
Showing 3 changed files with 288 additions and 46 deletions.
61 changes: 55 additions & 6 deletions docs/source/serialization.rst
@@ -21,9 +21,10 @@ The following command shows how easy it is to export a BERT model from the library
.. code-block:: bash

    python convert_graph_to_onnx.py --framework <pt, tf> --model bert-base-cased bert-base-cased.onnx
The conversion tool works for both PyTorch and TensorFlow models and ensures:

* The model and its weights are correctly initialized from the Hugging Face model hub or a local checkpoint.
* The inputs and outputs are correctly generated to their ONNX counterparts.
* The generated model can be correctly loaded through onnxruntime.

.. note::
    Currently, inputs and outputs are always exported with dynamic sequence axes, preventing some optimizations.
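As a quick sanity check, the exported graph can be loaded back through `onnxruntime`. Below is a minimal
sketch (the file name and the input names are assumptions matching the export command above):

.. code-block:: python

    import onnxruntime as ort
    from transformers import BertTokenizerFast

    # Load the graph produced by the export command above.
    session = ort.InferenceSession("bert-base-cased.onnx")

    # Tokenize a sample sentence; return_tensors="np" yields numpy arrays directly.
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
    encoded = tokenizer("Hello, ONNX!", return_tensors="np")

    # Only feed the inputs the graph actually declares.
    graph_inputs = {i.name for i in session.get_inputs()}
    outputs = session.run(None, {k: v for k, v in encoded.items() if k in graph_inputs})
    print(outputs[0].shape)  # e.g. (1, sequence_length, hidden_size)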


Also, the conversion tool supports several options that let you tune the behavior of the generated model:

* Change the target opset version of the generated model: a more recent opset generally supports more operators and enables faster inference.
* Export pipeline-specific prediction heads: allows exporting the model along with its task-specific prediction head(s).
* Use the external data format (PyTorch only): lets you export models whose size is above 2GB (`More info <https://github.com/pytorch/pytorch/pull/33062>`_).
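For instance, here is a sketch of an export combining these options (the flag names are assumptions based on the
script's argument parser; run ``python convert_graph_to_onnx.py --help`` for the authoritative list):

.. code-block:: bash

    python convert_graph_to_onnx.py --framework pt \
                                    --model bert-base-cased \
                                    --opset 12 \
                                    --pipeline feature-extraction \
                                    --use-external-format \
                                    bert-base-cased.onnx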

Quantization
------------------------------------------------

The ONNX exporter supports generating a quantized version of the model to allow efficient inference.

Quantization works by converting the memory representation of the parameters in the neural network
to a compact integer format. By default, the weights of a neural network are stored as single-precision floats (`float32`),
which can express a wide range of floating-point numbers with decent precision.
These properties are especially interesting during training, where you want a fine-grained representation.

On the other hand, after the training phase, it has been shown that one can greatly reduce the range and precision of `float32` numbers
without hurting the performance of the neural network.

More technically, `float32` parameters are converted to a type requiring fewer bits to represent each number, thus reducing
the overall size of the model. Here, we are mapping `float32` values to `int8` values (a non-floating-point, single-byte number representation)
according to the following formula:

.. math::

    y_{float32} = scale * x_{int8} - zero\_point

.. note::
    The quantization process will infer the parameters `scale` and `zero_point` from the neural network's parameters.
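As a toy illustration of this affine mapping (written here in the common ``scale * (x - zero_point)`` form;
the min/max calibration is an assumption for illustration, the actual quantizer in onnxruntime picks `scale`
and `zero_point` per tensor/operator):

.. code-block:: python

    import numpy as np

    def quantize_tensor(x: np.ndarray):
        """Map float32 values onto int8 with an affine transform (illustrative only)."""
        scale = (x.max() - x.min()) / 255.0            # spread the observed range over the 256 int8 values
        zero_point = np.round(-128 - x.min() / scale)  # chosen so that x.min() maps to -128
        q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)
        return q, scale, zero_point

    def dequantize_tensor(q, scale, zero_point):
        """Recover approximate float32 values: y = scale * (q - zero_point)."""
        return scale * (q.astype(np.float32) - zero_point)

    x = np.random.randn(4).astype(np.float32)
    q, scale, zp = quantize_tensor(x)
    print(x)
    print(dequantize_tensor(q, scale, zp))  # close, but not bit-exact: quantization is lossy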

Leveraging tiny integers has numerous advantages when it comes to inference:

* Storing 8 bits per parameter instead of the 32 bits needed for `float32` reduces the size of the model and makes it load faster.
* Integer operations execute an order of magnitude faster on modern hardware.
* Integer operations require less power to do the computations.

To convert a transformers model to the ONNX IR with quantized weights, you only need to specify ``--quantize``
when using ``convert_graph_to_onnx.py``. You can also have a look at the ``quantize()`` utility method in the
same script file.

Example of quantized BERT model export:

.. code-block:: bash

    python convert_graph_to_onnx.py --framework <pt, tf> --model bert-base-cased --quantize bert-base-cased.onnx

.. note::
    Quantization support requires ONNX Runtime >= 1.4.0.

.. note::
    When exporting a quantized model you will end up with two different ONNX files. The one specified at the end of
    the above command will contain the original ONNX model storing `float32` weights.
    The second one, with the ``-quantized`` suffix, will hold the quantized parameters.
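The same two files can also be produced programmatically through the ``convert()`` and ``quantize()`` utilities of
the script. A sketch, assuming ``quantize()`` takes the path of an already-exported model and returns the path of
its ``-quantized`` sibling (check the script itself for the authoritative signatures):

.. code-block:: python

    from pathlib import Path
    from transformers.convert_graph_to_onnx import convert, quantize

    onnx_path = Path("bert-base-cased.onnx")

    # Export the float32 graph first, then derive the int8 version from it.
    convert(framework="pt", model="bert-base-cased", output=onnx_path, opset=11)
    quantized_path = quantize(onnx_path)  # e.g. bert-base-cased-quantized.onnx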


TorchScript
