Description
We're seeing significant latency, on the order of 300-600 milliseconds, between COMPUTE_END and REQUEST_END on a TensorRT-LLM model. See the OTEL trace image below.
Triton Information
What version of Triton are you using? 2.42.0
Are you using the Triton container or did you build it yourself? NGC Container 24.01
To Reproduce
Mistral-7B compiled to a TensorRT-LLM engine on an A100 with 2x tensor parallel.
TensorRT-LLM backend version 0.7.1 (as shipped in NGC Container)
Deployed to a 2 GPU A100-80 server.
Dynamic Batching
Redis Response cache
We use the gRPC API
See the config.pbtxt below.
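For context, a minimal sketch of what a TensorRT-LLM model config with dynamic batching and the response cache enabled can look like (field names follow Triton's model-config schema; the model name and all values here are illustrative, not the reporter's actual settings):

```
name: "mistral_7b_trt_llm"      # illustrative name
backend: "tensorrtllm"
max_batch_size: 64              # illustrative value

dynamic_batching {
  max_queue_delay_microseconds: 100  # illustrative value
}

response_cache {
  enable: true  # the cache implementation (e.g. Redis) is selected server-side
}
```

The cache backend itself (local vs. Redis) is chosen when launching tritonserver, not in config.pbtxt.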
Expected behavior
REQUEST_END should occur very shortly after COMPUTE_END, perhaps on the order of tens of ms.
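One way to quantify the gap without eyeballing the OTEL spans is to parse Triton's JSON trace output and diff the timestamps directly. This is a minimal sketch assuming the trace-file layout where each entry carries an "id" and a list of named nanosecond timestamps (entries for the same request id may be split across the file); the synthetic data at the bottom is illustrative, not the reporter's trace:

```python
import json
from collections import defaultdict

def compute_end_to_request_end_ms(trace_entries):
    """Return per-request COMPUTE_END -> REQUEST_END gaps in milliseconds.

    Assumes each entry looks like
        {"id": <int>, "timestamps": [{"name": ..., "ns": ...}, ...]}
    and merges timestamps that are split across entries with the same id.
    """
    stamps = defaultdict(dict)
    for entry in trace_entries:
        for ts in entry.get("timestamps", []):
            stamps[entry["id"]][ts["name"]] = ts["ns"]

    gaps = {}
    for req_id, names in stamps.items():
        if "COMPUTE_END" in names and "REQUEST_END" in names:
            gaps[req_id] = (names["REQUEST_END"] - names["COMPUTE_END"]) / 1e6
    return gaps

# Synthetic example: a 450 ms gap like the one reported.
entries = [
    {"id": 1, "timestamps": [{"name": "COMPUTE_END", "ns": 1_000_000_000}]},
    {"id": 1, "timestamps": [{"name": "REQUEST_END", "ns": 1_450_000_000}]},
]
print(compute_end_to_request_end_ms(entries))  # {1: 450.0}
```

Running this over the real trace file (e.g. `json.load(open("trace.json"))`) would show whether the 300-600 ms gap is uniform across requests or concentrated in a few outliers.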
Is this unusual compared to other models you have tested? Is the latency consistently present across all test runs, both with this model and with other TRT-LLM models?