Description
System Info
- GPU: NVIDIA H100
- TensorRT-LLM v0.10.0
- tensorrtllm_backend v0.10.0
- tritonserver 24.03
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Follow this article on TensorRT-LLM performance using v0.10.0 to build a Llama3-8B-Instruct engine. Then try to start a Triton Inference Server in the Docker container nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3; I chose the 24.03 version as suggested here.
Expected behavior
The Triton Inference Server runs.
Actual behavior
When the Triton Inference Server starts and tries to deserialize the engine, I get: `Serialization assertion stdVersionRead == kSERIALIZATION_VERSION failed.Version tag does not match. Note: Current Version: 228, Serialized Engine Version: 237.`
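For anyone hitting the same assertion, the two numbers in the log tell you which side is out of date. This is a small sketch (not part of the reproduction) that parses the message above and shows that the serialized engine's version tag (237) is higher than the runtime's (228), i.e. the engine was built with a newer TensorRT than the one bundled in the Triton container:

```python
import re

# The exact error text from the Triton log quoted above.
msg = ("Serialization assertion stdVersionRead == kSERIALIZATION_VERSION "
       "failed.Version tag does not match. Note: Current Version: 228, "
       "Serialized Engine Version: 237.")

# Pull the two serialization-version tags out of the message.
m = re.search(r"Current Version: (\d+), Serialized Engine Version: (\d+)", msg)
runtime_version, engine_version = (int(g) for g in m.groups())

# engine_version > runtime_version means the engine was serialized by a
# NEWER TensorRT than the server runtime, so deserialization is refused.
if engine_version > runtime_version:
    print("Engine built with a newer TensorRT than the server runtime")
elif engine_version < runtime_version:
    print("Engine built with an older TensorRT than the server runtime")
else:
    print("Versions match")
```

Since the engine side is newer here, either the runtime container needs a newer TensorRT-LLM, or the engine needs to be rebuilt in the same environment the server runs in.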
Additional notes
I know this is the result of a version mismatch, but how do I resolve it? Based on the release notes for TensorRT-LLM v0.10.0, I thought it was matched with the 24.03 tritonserver containers, but it's not working.