Description
System Info
- GPU: NVIDIA H100
- TensorRT-LLM v0.10.0
- tensorrtllm_backend v0.10.0
- tritonserver 24.03
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Follow this article on TensorRT-LLM performance using v0.10.0 to build a Llama3-8B-Instruct engine. Then try to start a Triton Inference Server in the Docker container nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3; I chose the 24.03 version as suggested here.
Expected behavior
The Triton Inference Server runs.
Actual behavior
When the Triton Inference Server starts and tries to deserialize the engine, I get: `Serialization assertion stdVersionRead == kSERIALIZATION_VERSION failed.Version tag does not match. Note: Current Version: 228, Serialized Engine Version: 237.`
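For anyone hitting the same assertion, the two numbers in the log tell you which side is out of date. This is a small sketch (not part of the reproduction) that parses the message above and shows that the serialized engine's version tag (237) is higher than the runtime's (228), i.e. the engine was built with a newer TensorRT than the one bundled in the Triton container:

```python
import re

# The exact error text from the Triton log quoted above.
msg = ("Serialization assertion stdVersionRead == kSERIALIZATION_VERSION "
       "failed.Version tag does not match. Note: Current Version: 228, "
       "Serialized Engine Version: 237.")

# Pull the two serialization-version tags out of the message.
m = re.search(r"Current Version: (\d+), Serialized Engine Version: (\d+)", msg)
runtime_version, engine_version = (int(g) for g in m.groups())

# engine_version > runtime_version means the engine was serialized by a
# NEWER TensorRT than the server runtime, so deserialization is refused.
if engine_version > runtime_version:
    print("Engine built with a newer TensorRT than the server runtime")
elif engine_version < runtime_version:
    print("Engine built with an older TensorRT than the server runtime")
else:
    print("Versions match")
```

Since the engine side is newer here, either the runtime container needs a newer TensorRT-LLM, or the engine needs to be rebuilt in the same environment the server runs in.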
Additional notes
I know this is the result of a version mismatch, but how do I resolve it? Based on the release notes for TensorRT-LLM v0.10.0, I thought it was matched with the 24.03 tritonserver containers, but it's not working.