
Triton server: dynamic_batching does not work with multiple concurrent requests #661

@kazyun

Description


System Info

  • GPU: A800 80G × 2
  • Container: nvcr.io/nvidia/tritonserver:24.11-trtllm-python-py3
  • Model: Qwen2.5-14B-Instruct

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Add dynamic_batching to tensorrt_llm/config.pbtxt:
    dynamic_batching {
      preferred_batch_size: [ 32 ]
      max_queue_delay_microseconds: 10000
      default_queue_policy: { max_queue_size: 32 }
    }

  2. Limit the model to a single instance in instance_group:
    instance_group [
      {
        count: 1
        kind: KIND_GPU
        gpus: [ 0 ]
      }
    ]

  3. Simulate 10 concurrent requests.
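For reference, the two snippets from steps 1 and 2 would sit together in the model's config.pbtxt roughly as below. This is a sketch, not the reporter's full file: the model name and the max_batch_size value are assumptions (max_batch_size must be at least as large as preferred_batch_size for the preferred size to ever form).

```
name: "tensorrt_llm"
max_batch_size: 32   # assumed value; must be >= preferred_batch_size

dynamic_batching {
  preferred_batch_size: [ 32 ]
  max_queue_delay_microseconds: 10000
  default_queue_policy: { max_queue_size: 32 }
}

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```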

Expected behavior

Expect the 10 requests to be batched together, processed simultaneously, and their results returned.

Actual behavior

With the model limited to a single instance, concurrent requests are processed sequentially, one after another, instead of being batched. For example, if processing and generating the full output for one request takes 10 seconds, the second request only starts after those 10 seconds, so two requests take about 20 seconds in total.
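A minimal sketch of how the concurrent-request timing above can be reproduced. The body of send_request is a stand-in (a fixed time.sleep) for the real HTTP round trip; in practice it would POST the prompt to the server's inference endpoint. If the server batches the 10 requests, total wall time stays close to a single request's latency; sequential processing would take roughly 10× longer.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Models a fixed per-request server latency (seconds).
PER_REQUEST_S = 0.1

def send_request(i: int) -> float:
    """Placeholder for one inference call; returns its elapsed time."""
    start = time.perf_counter()
    time.sleep(PER_REQUEST_S)  # stand-in for the actual HTTP round trip
    return time.perf_counter() - start

def run_concurrent(n: int = 10) -> float:
    """Fire n requests at once and return total wall-clock time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(send_request, range(n)))
    return time.perf_counter() - start

total = run_concurrent(10)
# Batched/concurrent handling keeps this near one request's latency;
# sequential handling would push it toward n * PER_REQUEST_S.
print(f"total wall time for 10 requests: {total:.2f}s")
```

Comparing the measured total against n × single-request latency is how the sequential behavior in this report shows up: 10 seconds per request but roughly 20 seconds for two concurrent requests.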

additional notes

If you need the complete config.pbtxt file, feel free to ask.

Labels: bug (Something isn't working)
