Triton streaming is not working as expected #651

@robosina

Description

System Info

GPU: A10
trt-llm-backend: Latest commit

Please note that I built the Triton TensorRT-LLM backend from source using the build script.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I'm trying to get the Whisper example working in streaming mode using the official examples, but the Triton server gets stuck at some point. Streaming sometimes works for the first request after server startup, but subsequent requests don't work as expected.

Also note that non-decoupled mode works perfectly.
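
For reference, the streaming request is issued roughly like the sketch below (a minimal sketch of the usual tritonclient gRPC streaming pattern; the model name "whisper" and the tensor names "WAV" and "TRANSCRIPTS" are placeholders, the real ones come from tools/whisper/client.py):

import numpy as np
import tritonclient.grpc as grpcclient

def callback(result, error):
    # Invoked once per response; in decoupled mode each response
    # should carry the next token(s).
    if error is not None:
        print("error:", error)
    else:
        print("partial result:", result.as_numpy("TRANSCRIPTS"))

client = grpcclient.InferenceServerClient(url="localhost:8001")
client.start_stream(callback=callback)

audio = np.zeros((1, 16000), dtype=np.float32)  # stand-in for the decoded wav
inp = grpcclient.InferInput("WAV", audio.shape, "FP32")
inp.set_data_from_numpy(audio)

client.async_stream_infer(model_name="whisper", inputs=[inp])
client.stop_stream()  # returns once the server closes the stream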

Expected behavior

The client should receive the generated tokens one by one as they are produced.

Actual behavior

The server gets stuck, and tokens are not delivered to the Triton client.
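
In case it helps narrow this down: my understanding is that a decoupled Python backend model pushes each token through a response sender and then has to close the stream with the final flag, and a missing final flag produces exactly this kind of client-side hang. A minimal sketch of that pattern (this is not the actual whisper model.py; generate_tokens is a hypothetical stand-in for the generation loop):

import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            sender = request.get_response_sender()
            # generate_tokens is hypothetical; it stands in for whatever
            # produces one output tensor per decoded token.
            for token_tensor in self.generate_tokens(request):
                response = pb_utils.InferenceResponse(output_tensors=[token_tensor])
                sender.send(response)  # one response per token
            # Without this final flag the client-side stream never completes.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        # Decoupled models return None; responses go through the senders.
        return None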

Additional notes

root@ip-172-31-13-186:/workspace/tensorrtllm_backend# python3 tools/whisper/client.py --audio-path 1221-135766-0002.wav
task-0: 0/1
ok, we are sending the request
[TensorRT-LLM][WARNING] Default padding attention mask will be used as not all requests have cross attention mask.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the request. Default padding attention mask will be created.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
ok, we got the response
<c_python_backend_utils.InferenceResponse object at 0x7deeb4f8b770> # this is the first token; the rest never arrive
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
(the warning above repeats indefinitely while no further tokens reach the client)


Labels

bug (Something isn't working)
