Description
System Info
- 1x H100
- Llama3 8B Instruct
- TensorRT-LLM v0.10.0
- tensorrtllm_backend v0.10.0
- tritonserver 24.06
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Follow this benchmarking article to create the Llama3 8B Instruct model engine (but also using a converted checkpoint for trtllm-build). Then, finish the setup similar to this blog.
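For reference, the engine-build step I ran looks roughly like the following. This is a sketch, not the exact commands from the article; the paths are placeholders for my local directories:

```shell
# Sketch of the build steps (placeholder paths, defaults elsewhere).
# 1) Convert the Hugging Face checkpoint to TensorRT-LLM format
python examples/llama/convert_checkpoint.py \
    --model_dir ./Meta-Llama-3-8B-Instruct \
    --output_dir ./ckpt_llama3_8b \
    --dtype float16

# 2) Build the engine from the converted checkpoint
trtllm-build \
    --checkpoint_dir ./ckpt_llama3_8b \
    --output_dir ./engine_llama3_8b \
    --gemm_plugin float16
```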
Expected behavior
After following the benchmarking article, I measure a throughput of ~10,000 tokens per second. I hope to achieve the same on a Triton Inference Server.
actual behavior
Instead, a single request has a throughput of around 80 tokens per second, and I'm struggling to figure out how to efficiently manage concurrent requests. I'm hoping to achieve the 10,000 tokens per second that the GPU/engine has shown it is capable of reaching, as per the benchmark.
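For context, this is roughly how I'm estimating aggregate throughput under concurrency. `send_request` below is a hypothetical stand-in for the real tritonclient infer call; only the measurement logic matters here:

```python
# Minimal sketch: measure aggregate token throughput across concurrent
# requests. send_request() is a placeholder for a tritonclient
# (HTTP/gRPC) call and just reports how many tokens it produced.
import time
from concurrent.futures import ThreadPoolExecutor


def send_request(prompt: str) -> int:
    # Placeholder for the real inference call; returns the number of
    # generated tokens (fixed here purely for illustration).
    time.sleep(0.01)  # simulate server latency
    return 128


def measure_throughput(num_requests: int, concurrency: int) -> float:
    """Return tokens/sec over num_requests sent with the given concurrency."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(send_request, ["hi"] * num_requests))
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed


if __name__ == "__main__":
    print(f"{measure_throughput(32, 8):.0f} tokens/sec")
```

With the real client plugged in, raising `concurrency` is how I'd expect to approach the benchmark's aggregate numbers, but I don't see that happening in practice.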
additional notes
Ultimately, I think this comes down to needing a better understanding of Triton and TensorRT-LLM (e.g., parameters, model configs, etc.), which I admittedly don't have. Suggestions for how to create an environment similar to the benchmark would be greatly appreciated.
I am aware of Triton's instance_group model parameter that would allow me to spawn multiple instances of the same model, but it seems like I'm limited to only three on one GPU due to memory constraints. This would only achieve 3 * 80 = 240 tokens per second, which is, again, far from what I'm hoping for.
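Concretely, this is the instance_group setting in the model's config.pbtxt that I experimented with (the count of 3 is the most that fits in memory on my H100):

```
instance_group [
  {
    count: 3
    kind: KIND_GPU
  }
]
```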