
How to specify the TensorRT version in Triton Server for inference? #7188

Closed
Gcstk opened this issue May 7, 2024 · 8 comments
Labels
question Further information is requested

Comments


Gcstk commented May 7, 2024

Description:

I am currently facing an issue with specifying the TensorRT version in Triton Server. I exported my models as .plan files using TensorRT 10.0, because version 8.6.1 does not support the INT64 operations my models use, which led to significant precision loss, and it also produced batch-processing errors when exporting the bce-rerank model. After consulting the documentation and doing some research, it seems that TensorRT 10.0 resolves these issues.
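For reference, the engines were built roughly like this (a minimal sketch, not my exact command; the ONNX file name, output path, and shape profile are placeholders for the real model):

# Build a serialized TensorRT 10 engine (.plan) from the ONNX export with trtexec.
# Tensor names and the min/opt/max batch profile below are illustrative only.
trtexec --onnx=bce-emb.onnx \
        --saveEngine=model_repository/bce-emb/1/model.plan \
        --minShapes=input_ids:1x128 --optShapes=input_ids:8x128 --maxShapes=input_ids:32x512

Loading an engine built this way in the 24.03 container then produces the version-mismatch error shown in the log below.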

I0507 09:25:56.395027 146 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f15d8000000' with size 268435456
I0507 09:25:56.395363 146 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0507 09:25:56.397606 146 model_lifecycle.cc:469] loading: bce-emb:1
I0507 09:25:56.555863 146 tensorrt.cc:65] TRITONBACKEND_Initialize: tensorrt
I0507 09:25:56.555894 146 tensorrt.cc:75] Triton TRITONBACKEND API version: 1.19
I0507 09:25:56.555900 146 tensorrt.cc:81] 'tensorrt' TRITONBACKEND API version: 1.19
I0507 09:25:56.555909 146 tensorrt.cc:105] backend configuration:
{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
I0507 09:25:56.556383 146 tensorrt.cc:231] TRITONBACKEND_ModelInitialize: bce-emb (version 1)
I0507 09:25:57.179209 146 logging.cc:46] Loaded engine size: 532 MiB
E0507 09:25:57.208928 146 logging.cc:40] 1: [stdArchiveReader.cpp::StdArchiveReaderInitCommon::46] Error Code 1: Serialization (Serialization assertion stdVersionRead == serializationVersion failed.Version tag does not match. Note: Current Version: 236, Serialized Engine Version: 237)
I0507 09:25:57.259794 146 tensorrt.cc:274] TRITONBACKEND_ModelFinalize: delete model state
E0507 09:25:57.259839 146 model_lifecycle.cc:638] failed to load 'bce-emb' version 1: Internal: unable to load plan file to auto complete config: /workspace/model_repository/bce-emb/1/model.plan
I0507 09:25:57.259856 146 model_lifecycle.cc:773] failed to load 'bce-emb'
I0507 09:25:57.259973 146 server.cc:607]

However, the latest NGC container for Triton Server only includes TensorRT 8.6.3, which fails to load my model. I attempted the following methods to upgrade the TensorRT version:

1. Pulled the full Triton Server 24.03 container and upgraded to TensorRT 10 inside the container, but the server still attempts to use TensorRT 8.6.3 (see the check sketched after this list), which led me to believe that a backend change is necessary. Hence, I tried the next step.

2. Pulled the TensorRT backend from this GitHub repository and attempted to compile it against TensorRT 10, but encountered build errors that also seem to indicate a version mismatch.
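A quick way to see which TensorRT library the backend actually resolves (a debugging sketch; the backend directory comes from the startup log above, while the shared-library name and system library path are assumptions that may differ per container):

# Inside the 24.03 container: show which libnvinfer the TensorRT backend is linked against.
ldd /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so | grep nvinfer
# And list the TensorRT runtime libraries installed in the image.
ls /usr/lib/x86_64-linux-gnu/ | grep libnvinfer

If the backend still resolves the 8.6.3 libraries it was built against, it cannot deserialize a TensorRT 10 plan, regardless of what else is installed alongside it in the container.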

Question:
How can I resolve this issue to use TensorRT 10 for inference in Triton Server? Any advice or insights on how to successfully deploy and run inference with the latest version of TensorRT in Triton Server would be greatly appreciated!

Triton Information
Triton Server 24.03 container



Gcstk commented May 7, 2024

I've read through some issues where the Triton Server containers were adjusted by selecting appropriate component versions. Is it possible to upgrade only the TensorRT version within the current container, or should I wait for an official NGC container that includes TensorRT 10?

@krishung5
Contributor

Hi @Gcstk, thanks for bringing this up. There will be some API changes and fixes needed if you'd like to compile the TRT backend with TRT 10. I'd recommend waiting until we officially support TRT 10, which will happen with Triton 24.05. Note that the integration is still in progress, and not all features will be supported as of 24.05.

krishung5 added the question (Further information is requested) label on May 9, 2024

Prots commented May 22, 2024

Hello, could you please tell us when we should expect the 24.05 release with TensorRT 10 support?

@krishung5
Contributor

@Prots 24.05 will be released at the end of the month, which I believe will be later this week or early next week.

@tanmayv25
Contributor

The 24.05 containers have been released. These containers support TensorRT 10.
https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags
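For reference, a minimal sketch of serving a model repository with the new image (the repository path is a placeholder):

# Pull the regular 24.05 Triton image and point it at a local model repository.
docker pull nvcr.io/nvidia/tritonserver:24.05-py3
docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v /path/to/model_repository:/models \
    nvcr.io/nvidia/tritonserver:24.05-py3 \
    tritonserver --model-repository=/models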


Prots commented Jun 12, 2024

@tanmayv25 I see LABEL TRT_VERSION=9.3.0.1 in the image layers: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/layers
So 24.05 still uses an old version of the TRT engine.
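You can check this locally without the NGC layer viewer (a sketch; I am assuming the 24.05-trtllm-python-py3 tag is the image that carries this label, and that the regular 24.05-py3 image can be inspected the same way):

# Read the TRT_VERSION label straight from the image metadata.
docker inspect --format '{{ index .Config.Labels "TRT_VERSION" }}' nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
docker inspect --format '{{ index .Config.Labels "TRT_VERSION" }}' nvcr.io/nvidia/tritonserver:24.05-py3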


tanmayv25 commented Jun 12, 2024

@Prots There was a delay in releasing the TRT-LLM Triton container because of some major issues reported against TRT-LLM rel-0.10.0, which is the release that supports TRT 10. See the issue here.
Because of these bugs, we had to release the 24.05 Triton TRT-LLM container with TRT-LLM rel-0.9.0, which uses TRT 9.3.0.1.

That being said... the TRT-LLM container is a special Triton container comprising only the Python and TRT-LLM backends.

The regular 24.05 Triton container, with the rest of the backends, is released with the TRT 10 library only:
nvcr.io/nvidia/tritonserver:24.05-py3


Prots commented Jun 13, 2024

So @tanmayv25, when should we expect a Triton container based on TRT-LLM v0.10.0 with TRT 10.x.x? It's a bit messy that the versions differ, and it takes some time to understand what you should use.
For example, I took the latest TRT-LLM version, v0.10.0, which supports quantized Llama 3 models, and built an engine, but I can't run it within the latest Triton Server image because of the mismatched engine versions.
