Closed
Labels: bug
Description
System Info
A100
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I run this request:
curl -X POST my_ip:8000/v2/models/ensemble/generate_stream -d '{"text_input": "hello", "max_tokens":250, "temperature":0.00001, "top_p":0.95, "top_k":1, "repetition_penalty":1.2, "stream":true, "end_id":128009, "random_seed":1}'
But the stream is not received smoothly: instead of arriving token by token, roughly 100 tokens arrive at once every 3 seconds.
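One thing worth ruling out first is client-side buffering: curl buffers its own output unless invoked with `-N`/`--no-buffer`, which can make a perfectly smooth stream look bursty. To measure where the gaps actually occur, a minimal client-side sketch like the one below can print the inter-chunk arrival times. It assumes the endpoint emits Server-Sent-Events-style `data:` lines carrying JSON with a `text_output` field; adjust the field name to whatever your ensemble actually returns.

```python
import json
import time


def parse_sse_data(line: str):
    """Extract the JSON payload from one Server-Sent-Events line.

    Returns the decoded object, or None for non-data lines
    (blank keep-alives, comments, etc.).
    """
    prefix = "data: "
    if not line.startswith(prefix):
        return None
    return json.loads(line[len(prefix):])


def time_stream(url: str, payload: dict):
    """POST a streaming request and print the gap before each chunk,
    so server-side bursts are visible independent of curl's buffering."""
    import requests  # third-party: pip install requests

    last = time.monotonic()
    with requests.post(url, json=payload, stream=True) as resp:
        for raw in resp.iter_lines(decode_unicode=True):
            event = parse_sse_data(raw) if raw else None
            if event is not None:
                now = time.monotonic()
                # A smooth stream shows small, even gaps; bursts show
                # one large gap followed by many near-zero gaps.
                print(f"+{now - last:6.3f}s  {event.get('text_output', '')!r}")
                last = now


# Usage (hypothetical endpoint, same body as the curl call above):
# time_stream("http://my_ip:8000/v2/models/ensemble/generate_stream",
#             {"text_input": "hello", "max_tokens": 250, "stream": True})
```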
Expected behavior
The stream is received smoothly, token by token.
Actual behavior
The stream arrives in large bursts rather than smoothly.
Additional notes
If I remove dynamic_batching in:
https://github.com/triton-inference-server/tensorrtllm_backend/blob/v0.14.0/all_models/inflight_batcher_llm/postprocessing/config.pbtxt
the bursting problem is solved, but generation is still slow.
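For reference, "removing dynamic_batching" here means deleting (or tuning) the `dynamic_batching` block in the postprocessing model's config.pbtxt. A sketch of the relevant block, with an illustrative (not authoritative) queue-delay value; the actual contents in the v0.14.0 file may differ:

```
# Triton model config (config.pbtxt) fragment.
# With this block present, the postprocessing model waits to batch
# requests, which can delay individual stream chunks; deleting it
# makes each response pass through immediately.
dynamic_batching {
  max_queue_delay_microseconds: 100  # example value, check your config
}
```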