Triton streaming is not working as expected #651

@robosina

Description

System Info

GPU: A10
trt-llm-backend: Latest commit

Please note that I built the Triton TensorRT-LLM backend from source using the build script.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I'm trying to get the Whisper example working in streaming mode using the official examples, but the Triton server gets stuck at some point. Streaming sometimes works for the first request after server startup, but subsequent requests don't work as expected.

Also note that non-decoupled mode works perfectly.
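
For reference, the streaming request is issued roughly like the sketch below (a minimal sketch of the usual tritonclient gRPC streaming pattern; the model name "whisper" and the tensor names "WAV" and "TRANSCRIPTS" are placeholders, the real ones come from tools/whisper/client.py):

import numpy as np
import tritonclient.grpc as grpcclient

def callback(result, error):
    # Invoked once per response; in decoupled mode each response
    # should carry the next token(s).
    if error is not None:
        print("error:", error)
    else:
        print("partial result:", result.as_numpy("TRANSCRIPTS"))

client = grpcclient.InferenceServerClient(url="localhost:8001")
client.start_stream(callback=callback)

audio = np.zeros((1, 16000), dtype=np.float32)  # stand-in for the decoded wav
inp = grpcclient.InferInput("WAV", audio.shape, "FP32")
inp.set_data_from_numpy(audio)

client.async_stream_infer(model_name="whisper", inputs=[inp])
client.stop_stream()  # returns once the server closes the stream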

Expected behavior

The client should receive the generated tokens one by one as they are produced.

Actual behavior

The server gets stuck, and tokens are not delivered to the Triton client.
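
In case it helps narrow this down: my understanding is that a decoupled Python backend model pushes each token through a response sender and then has to close the stream with the final flag, and a missing final flag produces exactly this kind of client-side hang. A minimal sketch of that pattern (this is not the actual whisper model.py; generate_tokens is a hypothetical stand-in for the generation loop):

import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            sender = request.get_response_sender()
            # generate_tokens is hypothetical; it stands in for whatever
            # produces one output tensor per decoded token.
            for token_tensor in self.generate_tokens(request):
                response = pb_utils.InferenceResponse(output_tensors=[token_tensor])
                sender.send(response)  # one response per token
            # Without this final flag the client-side stream never completes.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        # Decoupled models return None; responses go through the senders.
        return None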

Additional notes

root@ip-172-31-13-186:/workspace/tensorrtllm_backend# python3 tools/whisper/client.py --audio-path 1221-135766-0002.wav
task-0: 0/1
ok, we are sending the request
[TensorRT-LLM][WARNING] Default padding attention mask will be used as not all requests have cross attention mask.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the request. Default padding attention mask will be created.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
ok, we got the response
<c_python_backend_utils.InferenceResponse object at 0x7deeb4f8b770> # this is the first token; the rest never arrive
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
(the warning above repeats indefinitely while no further tokens reach the client)


Labels

bug (Something isn't working)
