
Triton server: dynamic_batching does not work with multiple concurrent requests #661

@kazyun

Description


System Info

  • GPU: A800 80G × 2
  • Container: nvcr.io/nvidia/tritonserver:24.11-trtllm-python-py3
  • Model: Qwen2.5-14B-Instruct

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Add dynamic_batching to tensorrt_llm/config.pbtxt:
    dynamic_batching {
      preferred_batch_size: [ 32 ]
      max_queue_delay_microseconds: 10000
      default_queue_policy: { max_queue_size: 32 }
    }

  2. Limit the model to a single instance in instance_group:
    instance_group [
      {
        count: 1
        kind: KIND_GPU
        gpus: [ 0 ]
      }
    ]

  3. Simulate 10 concurrent requests.
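For reference, the two snippets from steps 1 and 2 would sit together in the model's config.pbtxt roughly as below. This is a sketch, not the reporter's full file: the model name and the max_batch_size value are assumptions (max_batch_size must be at least as large as preferred_batch_size for the preferred size to ever form).

```
name: "tensorrt_llm"
max_batch_size: 32   # assumed value; must be >= preferred_batch_size

dynamic_batching {
  preferred_batch_size: [ 32 ]
  max_queue_delay_microseconds: 10000
  default_queue_policy: { max_queue_size: 32 }
}

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```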

Expected behavior

Expect the 10 requests to be batched together, processed simultaneously, and their results returned.

Actual behavior

With the model limited to a single instance, concurrent requests are processed sequentially, one after another, instead of being batched. For example, if processing and generating the full output for one request takes 10 seconds, the second request only starts after those 10 seconds, so two requests take about 20 seconds in total.
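A minimal sketch of how the concurrent-request timing above can be reproduced. The body of send_request is a stand-in (a fixed time.sleep) for the real HTTP round trip; in practice it would POST the prompt to the server's inference endpoint. If the server batches the 10 requests, total wall time stays close to a single request's latency; sequential processing would take roughly 10× longer.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Models a fixed per-request server latency (seconds).
PER_REQUEST_S = 0.1

def send_request(i: int) -> float:
    """Placeholder for one inference call; returns its elapsed time."""
    start = time.perf_counter()
    time.sleep(PER_REQUEST_S)  # stand-in for the actual HTTP round trip
    return time.perf_counter() - start

def run_concurrent(n: int = 10) -> float:
    """Fire n requests at once and return total wall-clock time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(send_request, range(n)))
    return time.perf_counter() - start

total = run_concurrent(10)
# Batched/concurrent handling keeps this near one request's latency;
# sequential handling would push it toward n * PER_REQUEST_S.
print(f"total wall time for 10 requests: {total:.2f}s")
```

Comparing the measured total against n × single-request latency is how the sequential behavior in this report shows up: 10 seconds per request but roughly 20 seconds for two concurrent requests.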

additional notes

If you need the complete config.pbtxt file, feel free to ask.

Labels: bug (Something isn't working)
