

TensorRT Engine Refitting of ONNX models.


Description

This sample shows how to refit an engine built from an ONNX model via parsers. A modified version of the ONNX BiDAF model is used as the sample model, which implements the Bi-Directional Attention Flow (BiDAF) network described in the paper Bidirectional Attention Flow for Machine Comprehension.

How does this sample work?

This sample replaces unsupported nodes (HardMax / Compress) in the original ONNX model via ONNX-GraphSurgeon (in prepare_model.py) and builds a refittable TensorRT engine. In build_and_refit_engine.py, the engine is then refitted first with fake weights and then with the correct weights, with inference run on sample context and query sentences after each refit.
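For orientation, here is a minimal sketch of what the refittable-engine build step looks like with the standard TensorRT Python API. The optimization-profile shapes are illustrative assumptions, not the sample's actual values; see build_and_refit_engine.py for the real logic.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_refittable_engine(onnx_path="bidaf-modified.onnx"):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))
    config = builder.create_builder_config()
    # The REFIT flag is what allows the weights to be swapped later.
    config.set_flag(trt.BuilderFlag.REFIT)
    # The network inputs have a dynamic leading dimension, so an
    # optimization profile is required; these shapes are illustrative.
    profile = builder.create_optimization_profile()
    for name, dims in [("CategoryMapper_4", (1, 1)),
                       ("CategoryMapper_5", (1, 1, 1, 16)),
                       ("CategoryMapper_6", (1, 1)),
                       ("CategoryMapper_7", (1, 1, 1, 16))]:
        profile.set_shape(name, dims, dims, dims)  # min = opt = max
    config.add_optimization_profile(profile)
    plan = builder.build_serialized_network(network, config)
    return trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(plan)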

Prerequisites

Dependencies required for this sample

  1. Install the dependencies for Python:
     pip3 install -r requirements.txt

  2. TensorRT

  3. ONNX-GraphSurgeon

  4. Download sample data. See the "Download Sample Data" section of the general setup guide.

Running the sample

The data directory must be specified when running these scripts, either via -d /path/to/data or the environment variable TRT_DATA_DIR; an error is thrown otherwise. The following examples use the TRT_DATA_DIR approach.
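For example:

export TRT_DATA_DIR=/path/to/data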

  • Prepare the ONNX model:
    python3 prepare_model.py

The output should look similar to the following:

Modifying the ONNX model ...
Modified ONNX model saved as bidaf-modified.onnx
Done.

The script will modify the original model from onnx/models and save an ONNX model that can be parsed and run by TensorRT.

The original ONNX model contains four CategoryMapper nodes that map the four input string arrays to int arrays. Since TensorRT supports neither the string data type nor CategoryMapper nodes, we dump the four maps out as JSON files (model/CategoryMapper_{4-7}.json) and use them to preprocess input data. The four network inputs then correspond to the outputs of the original CategoryMapper nodes.
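As an illustration, a dumped map can be applied along the following lines. The JSON layout and the out-of-vocabulary id used here are assumptions, so consult the sample's data-loading code for the real preprocessing.

import json
import numpy as np

def encode_tokens(tokens, map_path, oov_id=1):
    # Load a string-to-int map dumped from one CategoryMapper node.
    # Assumes a {"token": id, ...} layout and an out-of-vocabulary id
    # of 1; both are illustrative assumptions.
    with open(map_path) as f:
        mapping = json.load(f)
    ids = [mapping.get(t, oov_id) for t in tokens]
    # Produce the int64 array the engine expects in place of strings.
    return np.asarray(ids, dtype=np.int64).reshape(-1, 1)

# Hypothetical usage with one of the dumped maps:
# context_ids = encode_tokens("a quick brown fox".split(),
#                             "model/CategoryMapper_4.json")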

The unsupported HardMax and Compress nodes are replaced by ArgMax and Gather nodes, respectively.
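The node surgery itself is done with ONNX-GraphSurgeon. The following sketch shows the general shape of such a replacement; it is simplified (the real prepare_model.py also rewires the Compress consumers into Gather nodes and adjusts dtypes), so treat it as illustrative rather than a copy of the script.

import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("bidaf-original.onnx"))  # hypothetical input path

for node in graph.nodes:
    if node.op == "Hardmax":
        # Swap the one-hot Hardmax for an index-producing ArgMax; the
        # downstream Compress is then rewritten as a Gather over these
        # indices (omitted here for brevity).
        node.op = "ArgMax"
        node.attrs = {"axis": 1, "keepdims": 1}

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "bidaf-modified.onnx")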

  • Build a TensorRT engine, refit the engine and run inference:
    python3 build_and_refit_engine.py --weights-location GPU

The script builds a TensorRT engine from the modified ONNX model, then refits the engine from GPU weights and runs inference on sample context and query sentences.
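Under the hood, refitting goes through TensorRT's Refitter API. Below is a minimal sketch of the CPU-weights path; the new_weights mapping is hypothetical, and the GPU-weights path differs mainly in passing device-resident weights (plus trt.TensorLocation.DEVICE) to set_named_weights, available in TensorRT 9.0 and later.

import numpy as np
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def refit_engine(engine, new_weights):
    # new_weights: hypothetical {weight_name: numpy array} mapping.
    refitter = trt.Refitter(engine, TRT_LOGGER)
    for name in refitter.get_all_weights():
        if name in new_weights:
            # Host numpy arrays are accepted directly; for GPU weights,
            # pass a trt.Weights built from a device pointer together
            # with trt.TensorLocation.DEVICE (TensorRT 9.0+).
            refitter.set_named_weights(name, new_weights[name])
    if not refitter.refit_cuda_engine():
        raise RuntimeError("Refit failed; some weights may be missing")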

When running the above command for the first time, the output should look similar to the following:

Loading ONNX file from path bidaf-modified.onnx...
Beginning ONNX file parsing
[09/25/2023-08:48:16] [TRT] [W] ModelImporter.cpp:407: Make sure input CategoryMapper_4 has Int64 binding.
[09/25/2023-08:48:16] [TRT] [W] ModelImporter.cpp:407: Make sure input CategoryMapper_5 has Int64 binding.
[09/25/2023-08:48:16] [TRT] [W] ModelImporter.cpp:407: Make sure input CategoryMapper_6 has Int64 binding.
[09/25/2023-08:48:16] [TRT] [W] ModelImporter.cpp:407: Make sure input CategoryMapper_7 has Int64 binding.
Completed parsing of ONNX file
Network inputs:
CategoryMapper_4 <class 'numpy.int64'> (-1, 1)
CategoryMapper_5 <class 'numpy.int64'> (-1, 1, 1, 16)
CategoryMapper_6 <class 'numpy.int64'> (-1, 1)
CategoryMapper_7 <class 'numpy.int64'> (-1, 1, 1, 16)
Building an engine from file bidaf-modified.onnx; this may take a while...
Completed creating Engine
Refitting engine from GPU weights...
Engine refitted in 39.88 ms.
Doing inference...
Doing inference...
Refitting engine from GPU weights...
Engine refitted in 0.27 ms.
Doing inference...
Doing inference...
Passed

Note that the second refit is much faster than the first. When running the above command again, the engine will be deserialized from the plan file, and the output should look similar to the following:

Reading engine from file bidaf.trt...
Refitting engine from GPU weights...
Engine refitted in 32.64 ms.
Doing inference...
Doing inference...
Refitting engine from GPU weights...
Engine refitted in 0.41 ms.
Doing inference...
Doing inference...
Passed
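The "Reading engine from file bidaf.trt..." line above corresponds to TensorRT's standard plan-file deserialization path. A minimal sketch, assuming the bidaf.trt name from the log:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(plan_path="bidaf.trt"):
    # Deserialize a previously built engine instead of rebuilding it.
    runtime = trt.Runtime(TRT_LOGGER)
    with open(plan_path, "rb") as f:
        return runtime.deserialize_cuda_engine(f.read())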

To refit the engine from CPU weights, run python3 build_and_refit_engine.py --weights-location CPU instead. The output should look similar to the following:

Reading engine from file bidaf.trt...
Refitting engine from CPU weights...
Engine refitted in 45.18 ms.
Doing inference...
Doing inference...
Refitting engine from CPU weights...
Engine refitted in 1.20 ms.
Doing inference...
Doing inference...
Passed

There is also an option --version-compatible to enable engine version compatibility. If installed, the tensorrt_dispatch package will be used instead of the tensorrt package for refitting and running version compatible engines. To build and refit a version compatible engine, run python3 build_and_refit_engine.py --version-compatible; the output should look similar to the above cases.
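In TensorRT terms, version compatibility is requested at build time with a builder flag, and the resulting engine can then be handled by the lighter dispatch runtime. A sketch of the build-side flags, assuming TensorRT 8.6 or newer:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
config = builder.create_builder_config()
# Request an engine that later TensorRT versions can still run.
config.set_flag(trt.BuilderFlag.VERSION_COMPATIBLE)
# Keep the engine refittable as in the rest of this sample.
config.set_flag(trt.BuilderFlag.REFIT)
# At run time, such engines can be deserialized with the dispatch
# runtime instead of the full package:
#   import tensorrt_dispatch as trt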

Additional resources

The following resources provide a deeper understanding of the model used in this sample: the BiDAF model from the ONNX model zoo (onnx/models) and the paper Bidirectional Attention Flow for Machine Comprehension.
License

For terms and conditions for use, reproduction, and distribution, see the TensorRT Software License Agreement documentation.

Changelog

October 2020: This sample was recreated, updated and reviewed.

August 2023:

  • Added support for refitting engines from GPU weights.
  • Removed support for Python versions < 3.8.

January 2024:

  • Added support for refitting version compatible engines.

Known issues

There are no known issues in this sample.