How to specify the TensorRT version in Triton Server for inference? #7188

Gcstk · 2024-05-07T09:28:48Z

Description:

I am currently facing an issue with specifying the TensorRT version in Triton Server. I have exported my models as .plan files using TensorRT 10.0 because using version 8.6.1 resulted in unsupported INT64 operations, which led to significant precision loss. Additionally, there were errors related to batch processing when exporting the bce-rerank model using TensorRT 8.6.1. After consulting the documentation and doing some research, it seems that TensorRT 10.0 resolves these issues.

I0507 09:25:56.395027 146 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f15d8000000' with size 268435456
I0507 09:25:56.395363 146 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0507 09:25:56.397606 146 model_lifecycle.cc:469] loading: bce-emb:1
I0507 09:25:56.555863 146 tensorrt.cc:65] TRITONBACKEND_Initialize: tensorrt
I0507 09:25:56.555894 146 tensorrt.cc:75] Triton TRITONBACKEND API version: 1.19
I0507 09:25:56.555900 146 tensorrt.cc:81] 'tensorrt' TRITONBACKEND API version: 1.19
I0507 09:25:56.555909 146 tensorrt.cc:105] backend configuration:
{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
I0507 09:25:56.556383 146 tensorrt.cc:231] TRITONBACKEND_ModelInitialize: bce-emb (version 1)
I0507 09:25:57.179209 146 logging.cc:46] Loaded engine size: 532 MiB
E0507 09:25:57.208928 146 logging.cc:40] 1: [stdArchiveReader.cpp::StdArchiveReaderInitCommon::46] Error Code 1: Serialization (Serialization assertion stdVersionRead == serializationVersion failed.Version tag does not match. Note: Current Version: 236, Serialized Engine Version: 237)
I0507 09:25:57.259794 146 tensorrt.cc:274] TRITONBACKEND_ModelFinalize: delete model state
E0507 09:25:57.259839 146 model_lifecycle.cc:638] failed to load 'bce-emb' version 1: Internal: unable to load plan file to auto complete config: /workspace/model_repository/bce-emb/1/model.plan
I0507 09:25:57.259856 146 model_lifecycle.cc:773] failed to load 'bce-emb'
I0507 09:25:57.259973 146 server.cc:607]

However, the latest NGC container for Triton Server only includes TensorRT 8.6.3, which fails to load my model. I attempted the following methods to upgrade the TensorRT version:

1、Pulled the full Triton Server 24.03 container and upgraded to TensorRT 10 within the container, but it still attempts to load version 8.6.3, which led me to believe that a backend change is necessary. Hence, I tried the next step.

2、Pulled the TensorRT backend from this GitHub repository and attempted to compile it with TensorRT 10, but encountered errors that seem to also indicate a version mismatch.

Question:
How can I resolve this issue to use TensorRT 10 for inference in Triton Server? Any advice or insights on how to successfully deploy and run inference with the latest version of TensorRT in Triton Server would be greatly appreciated!

Triton Information
Triton Server 24.03 container

Expected behavior
A clear and concise description of what you expected to happen.

Gcstk · 2024-05-07T09:32:02Z

I've read through some issues where adjustments were made to the Triton Server containers by selecting appropriate versions. I am wondering if it is possible to upgrade only the TensorRT version within the current container, or should I wait for an official NGC container that includes TensorRT 10?

krishung5 · 2024-05-09T22:30:16Z

Hi @Gcstk, thanks for bringing this up. There will be some API changes and fixes needed if you'd like to compile the TRT backend with TRT 10. I'd recommend waiting until we officially support TRT 10, which will happen with Triton 24.05. Note that the integration is still in progress and not all features would be supported as of 24.05.

Prots · 2024-05-22T09:53:33Z

Hello, could you please tell us when we should expect 24.05 version with tensorrt 10 support?

krishung5 · 2024-05-22T19:24:35Z

@Prots 24.05 will be released at the end of month, which will be later this week or early next week I believe.

tanmayv25 · 2024-05-24T22:37:29Z

24.05 containers has been released. These containers support TensorRT 10.
https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags

Prots · 2024-06-12T13:48:34Z

@tanmayv25 I see in the image layers LABEL TRT_VERSION=9.3.0.1 https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/layers
So 24.05 still use an old version of trt engine.

tanmayv25 · 2024-06-12T19:25:59Z

@Prots There was a delay in releasing TRT-LLM Triton container because of some major issues being reported for TRT-LLM rel-0.10.0 which support TRT 10. See the issue here.
Because of these bugs, we had to release 24.05 Triton TRT-LLM container with TRT-LLM rel-0.9.0 with TRT 9.3.0.1.

That being said... TRT-LLM container is a special Triton container comprising of only python and TRT-LLM backends.

The regular release of Triton container of 24.05 with rest of the backends is released with TRT 10 library only.
nvcr.io/nvidia/tritonserver:24.05-py3

Prots · 2024-06-13T13:25:22Z

So @tanmayv25 when we should expect triton container based on trtllm v0.10.0 with TRT 10.x.x. ? It's a bit messy that versions are different and requires some time to understand what you should use.
i.e. I took latest trtllm version v0.10.0 which supports llama3 quantized model and build engine but I can't run it within latest triton server image because of different engines versions.

krishung5 added the question Further information is requested label May 9, 2024

tanmayv25 closed this as completed May 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to specify the TensorRT version in Triton Server for inference? #7188

How to specify the TensorRT version in Triton Server for inference? #7188

Gcstk commented May 7, 2024

Gcstk commented May 7, 2024

krishung5 commented May 9, 2024

Prots commented May 22, 2024

krishung5 commented May 22, 2024

tanmayv25 commented May 24, 2024

Prots commented Jun 12, 2024

tanmayv25 commented Jun 12, 2024 •

edited

Loading

Prots commented Jun 13, 2024

How to specify the TensorRT version in Triton Server for inference? #7188

How to specify the TensorRT version in Triton Server for inference? #7188

Comments

Gcstk commented May 7, 2024

Gcstk commented May 7, 2024

krishung5 commented May 9, 2024

Prots commented May 22, 2024

krishung5 commented May 22, 2024

tanmayv25 commented May 24, 2024

Prots commented Jun 12, 2024

tanmayv25 commented Jun 12, 2024 • edited Loading

Prots commented Jun 13, 2024

tanmayv25 commented Jun 12, 2024 •

edited

Loading