
inflight_batcher_llm example batching #558

@PKaralupov


System Info

Using nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3

Who can help?

@juney-nvidi

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I created my own config.pbtxt by following the official example. Everything works perfectly when I use batch size 1.

My config.pbtxt:

name: "flan_t5_tensorrt_11"
backend: "tensorrtllm"
max_batch_size: 64

dynamic_batching {
  max_queue_delay_microseconds: 10000
}

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    allow_ragged_batch: true
  },
  {
    name: "request_output_len"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_width"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "end_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "pad_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "sequence_length"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "batch_index"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "context_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "generation_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
parameters: {
  key: "max_beam_width"
  value: {
    string_value: "5"
  }
}
parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {
    string_value: "no"
  }
}
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/model/flan_t5_tensorrt_11/1/decoder"
  }
}
parameters: {
  key: "encoder_model_path"
  value: {
    string_value: "/model/flan_t5_tensorrt_11/1/encoder"
  }
}
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "8192"
  }
}
parameters: {
  key: "max_attention_window_size"
  value: {
    string_value: "8192"
  }
}
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "guaranteed_no_evict"
  }
}
parameters: {
  key: "decoding_mode"
  value: {
    string_value: "beam_search"
  }
}
parameters: {
  key: "executor_worker_path"
  value: {
    string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker"
  }
}
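For context on the reshape: { shape: [ ] } entries above: because max_batch_size is greater than zero, Triton treats the configured dims as the per-request shape and prepends a batch dimension, so a dims: [ 1 ] input arrives at the backend as shape [batch, 1], and the reshape collapses the trailing 1 to give [batch]. A minimal sketch of that shape arithmetic (plain Python, no Triton required; the helper names are mine, not Triton APIs):

```python
def batched_shape(config_dims, batch_size):
    # With max_batch_size > 0, Triton prepends the batch dimension
    # to the per-request dims declared in config.pbtxt.
    return [batch_size] + list(config_dims)

def apply_reshape(batched, reshape_dims):
    # reshape: { shape: [ ] } keeps the batch dim and replaces the
    # per-request dims with the (here empty) reshape dims.
    return [batched[0]] + list(reshape_dims)

# request_output_len declared as dims: [ 1 ], sent with batch size 2:
full = batched_shape([1], 2)
assert full == [2, 1]          # shape on the wire
assert apply_reshape(full, []) == [2]  # shape the backend sees
```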

Expected behavior

curl -X POST localhost:8000/v2/models/flan_t5_tensorrt_11/infer -d '{"inputs": [{"name" : "input_ids", "shape" : [ 2, 3 ], "datatype" : "INT32", "data" : [[1,2,3], [4,5,6]] }, {"name" : "beam_width", "shape" : [2, 1], "datatype" : "INT32", "data" : [[5], [5]] }, {"name" : "request_output_len", "shape" : [2, 1], "datatype" : "INT32", "data" : [[256], [256]] }]}'

This request should not fail.

Actual behavior

When I run inference with batch size > 1:

curl -X POST localhost:8000/v2/models/flan_t5_tensorrt_11/infer -d '{"inputs": [{"name" : "input_ids", "shape" : [ 2, 3 ], "datatype" : "INT32", "data" : [[1,2,3], [4,5,6]] }, {"name" : "beam_width", "shape" : [2, 1], "datatype" : "INT32", "data" : [[5], [5]] }, {"name" : "request_output_len", "shape" : [2, 1], "datatype" : "INT32", "data" : [[256], [256]] }]}'

[TensorRT-LLM][ERROR] Assertion failed: Invalid size for tensor request_output_len (/tmp/tritonbuild/tensorrtllm/inflight_batcher_llm/src/utils.h)

This assertion fires; however, I am not sure how I can pass a tensor of size 1 when dynamic batching is enabled:
https://github.com/triton-inference-server/tensorrtllm_backend/blob/6791733e64fe7edd9eda7e0d804d0d04d30fb057/inflight_batcher_llm/src/utils.h#L161C5-L161C25

If I change the shape of request_output_len and beam_width to [1]:
curl -X POST localhost:8000/v2/models/flan_t5_tensorrt_11/infer -d '{"inputs": [{"name" : "input_ids", "shape" : [ 2, 3 ], "datatype" : "INT32", "data" : [[1,2,3], [4,5,6]] }, {"name" : "beam_width", "shape" : [1], "datatype" : "INT32", "data" : [5] }, {"name" : "request_output_len", "shape" : [1], "datatype" : "INT32", "data" : [256] }]}'

{"error":"[request id: <id_unknown>] input 'input_ids' batch size does not match other inputs for 'flan_t5_tensorrt_11'"}
This makes sense, as batching is enabled.

What is the correct way to use the inflight_batcher_llm/tensorrt_llm example with a request containing more than one sample?

Additional notes

I do not think this is a bug; however, I am not sure how to use dynamic batching correctly in this scenario.
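The workaround I am currently trying is to split the batch into independent batch-1 requests and let the in-flight batcher combine them server-side. A sketch of the payload construction (standard KServe v2 JSON matching my config.pbtxt; the helper name is mine, and no server is needed to run this):

```python
import json

def build_single_request(input_ids, output_len=256, beam_width=5):
    # KServe v2 infer payload for ONE sample (batch dim 1),
    # mirroring the inputs declared in config.pbtxt above.
    return json.dumps({
        "inputs": [
            {"name": "input_ids", "shape": [1, len(input_ids)],
             "datatype": "INT32", "data": [input_ids]},
            {"name": "beam_width", "shape": [1, 1],
             "datatype": "INT32", "data": [[beam_width]]},
            {"name": "request_output_len", "shape": [1, 1],
             "datatype": "INT32", "data": [[output_len]]},
        ]
    })

# One payload per sample; POST each to
# localhost:8000/v2/models/flan_t5_tensorrt_11/infer concurrently
# and let in-flight batching fuse them server-side.
payloads = [build_single_request(s) for s in ([1, 2, 3], [4, 5, 6])]
assert all(json.loads(p)["inputs"][0]["shape"][0] == 1 for p in payloads)
```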
