Description
System Info
GPU: A10
trt-llm-backend: Latest commit
Please note that I built the Triton TensorRT-LLM backend from source using the build script.
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I'm trying to get the Whisper example working in streaming mode using the official examples, but the Triton server seems to get stuck at some point. Please note that streaming sometimes works for the first request after server startup, but subsequent requests don't work as expected.
Also note that non-decoupled mode works perfectly.
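For context, this is a minimal sketch of how I understand the decoupled streaming request is issued via tritonclient.grpc. The model name (whisper_bls), tensor names (WAV, TEXT_PREFIX), decoding prefix string, and port are assumptions here; the real values come from the example's config.pbtxt and tools/whisper/client.py:

```python
import queue
import numpy as np
import soundfile as sf  # assumption: the wav is loaded as mono float32 PCM
import tritonclient.grpc as grpcclient

results = queue.Queue()

def callback(result, error):
    # In decoupled mode, every streamed response (or error) lands here.
    results.put(error if error is not None else result)

waveform, _sr = sf.read("1221-135766-0002.wav", dtype="float32")

client = grpcclient.InferenceServerClient("localhost:8001")
client.start_stream(callback=callback)

# Assumed input tensors; check the model's config.pbtxt for the real names.
wav = grpcclient.InferInput("WAV", [1, len(waveform)], "FP32")
wav.set_data_from_numpy(waveform.reshape(1, -1).astype(np.float32))
prefix = grpcclient.InferInput("TEXT_PREFIX", [1, 1], "BYTES")
prefix.set_data_from_numpy(np.array(
    [[b"<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"]],
    dtype=np.object_))

client.async_stream_infer(model_name="whisper_bls",
                          inputs=[wav, prefix],
                          request_id="task-0")
```

The responses are then drained from the queue, as sketched under "Expected behavior" below.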
Expected behavior
The client should receive the generated tokens one by one, as they are produced.
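Concretely, continuing the sketch above, I would expect the drain loop below to print one response per generated token (the output tensor name and the timeout are assumptions):

```python
# Drain the stream: in a working run, each get() yields one more token.
while True:
    try:
        result = results.get(timeout=10)  # assumption: 10 s per token is ample
    except queue.Empty:
        break  # in the failing runs we end up here after the first token
    if isinstance(result, Exception):
        raise result
    print(result.as_numpy("OUTPUT_IDS"))  # output tensor name is an assumption

client.stop_stream()
```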
Actual behavior
The stream gets stuck: after the first token, no further tokens are delivered to the Triton client.
Additional notes
root@ip-172-31-13-186:/workspace/tensorrtllm_backend# python3 tools/whisper/client.py --audio-path 1221-135766-0002.wav
task-0: 0/1
ok, we are sending the request
[TensorRT-LLM][WARNING] Default padding attention mask will be used as not all requests have cross attention mask.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the request. Default padding attention mask will be created.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
ok, we got the response
<c_python_backend_utils.InferenceResponse object at 0x7deeb4f8b770> # this is the first token; the rest never arrive
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.