Description
System Info
Using nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
Who can help?
@juney-nvidi
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I created my own config.pbtxt by following the official example. Everything works perfectly when I use batch size 1.
My config.pbtxt:
name: "flan_t5_tensorrt_11"
backend: "tensorrtllm"
max_batch_size: 64
dynamic_batching {
max_queue_delay_microseconds: 10000
}
input [
{
name: "input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
allow_ragged_batch: true
},
{
name: "request_output_len"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "return_log_probs"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_width"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "end_id"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "pad_id"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
}
]
output [
{
name: "output_ids"
data_type: TYPE_INT32
dims: [ -1, -1 ]
},
{
name: "sequence_length"
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "output_log_probs"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "batch_index"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "context_logits"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "generation_logits"
data_type: TYPE_FP32
dims: [ -1, -1, -1 ]
}
]
instance_group [
{
count: 1
kind: KIND_CPU
}
]
parameters: {
key: "max_beam_width"
value: {
string_value: "5"
}
}
parameters: {
key: "FORCE_CPU_ONLY_INPUT_TENSORS"
value: {
string_value: "no"
}
}
parameters: {
key: "gpt_model_type"
value: {
string_value: "inflight_fused_batching"
}
}
parameters: {
key: "gpt_model_path"
value: {
string_value: "/model/flan_t5_tensorrt_11/1/decoder"
}
}
parameters: {
key: "encoder_model_path"
value: {
string_value: "/model/flan_t5_tensorrt_11/1/encoder"
}
}
parameters: {
key: "max_tokens_in_paged_kv_cache"
value: {
string_value: "8192"
}
}
parameters: {
key: "max_attention_window_size"
value: {
string_value: "8192"
}
}
parameters: {
key: "batch_scheduler_policy"
value: {
string_value: "guaranteed_no_evict"
}
}
parameters: {
key: "decoding_mode"
value: {
string_value: "beam_search"
}
}
parameters: {
key: "executor_worker_path"
value: {
string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker"
}
}
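For reference, here is a minimal sketch of how I build the KServe-v2-style JSON body for a single-sample request against this config (the helper name is my own; tensor names and dtypes are taken from the config above, with the per-request scalars carrying an explicit batch dimension of 1):

```python
import json


def build_infer_payload(input_ids, output_len, beam_width):
    """Build a KServe v2 infer body for ONE sample (batch dim = 1).

    Tensor names match the config.pbtxt above; request_output_len and
    beam_width are declared with dims [1] plus reshape to scalar, so
    each carries shape [1, 1] on the wire for a batch of one.
    """
    return {
        "inputs": [
            {"name": "input_ids", "shape": [1, len(input_ids)],
             "datatype": "INT32", "data": [input_ids]},
            {"name": "request_output_len", "shape": [1, 1],
             "datatype": "INT32", "data": [[output_len]]},
            {"name": "beam_width", "shape": [1, 1],
             "datatype": "INT32", "data": [[beam_width]]},
        ]
    }


# The batch-size-1 request that works for me:
payload = build_infer_payload([1, 2, 3], 256, 5)
body = json.dumps(payload)
```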
Expected behavior
curl -X POST localhost:8000/v2/models/flan_t5_tensorrt_11/infer -d '{"inputs": [{"name" : "input_ids", "shape" : [ 2, 3 ], "datatype" : "INT32", "data" : [[1,2,3], [4,5,6]] }, {"name" : "beam_width", "shape" : [2, 1], "datatype" : "INT32", "data" : [[5], [5]] }, {"name" : "request_output_len", "shape" : [2, 1], "datatype" : "INT32", "data" : [[256], [256]] }]}'
The request above should not fail.
Actual behavior
When I run inference with batch size > 1:
curl -X POST localhost:8000/v2/models/flan_t5_tensorrt_11/infer -d '{"inputs": [{"name" : "input_ids", "shape" : [ 2, 3 ], "datatype" : "INT32", "data" : [[1,2,3], [4,5,6]] }, {"name" : "beam_width", "shape" : [2, 1], "datatype" : "INT32", "data" : [[5], [5]] }, {"name" : "request_output_len", "shape" : [2, 1], "datatype" : "INT32", "data" : [[256], [256]] }]}'
[TensorRT-LLM][ERROR] Assertion failed: Invalid size for tensor request_output_len (/tmp/tritonbuild/tensorrtllm/inflight_batcher_llm/src/utils.h)
This assertion fires, but I am not sure how I can pass a tensor with size 1 while using dynamic batching:
https://github.com/triton-inference-server/tensorrtllm_backend/blob/6791733e64fe7edd9eda7e0d804d0d04d30fb057/inflight_batcher_llm/src/utils.h#L161C5-L161C25
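As a rough illustration of my reading of that check (not confirmed against the backend source): each request appears to be expected to carry exactly one element for request_output_len, so a client-side batch of 2 delivers 2 elements and trips the assertion.

```python
import numpy as np

# The client-side batched tensor as sent in the failing request above.
request_output_len = np.array([[256], [256]], dtype=np.int32)  # shape [2, 1]

# Hypothetical reading of the backend check: one element per request.
expected_elements = 1
actual_elements = request_output_len.size  # 2 for a batch of two samples

mismatch = actual_elements != expected_elements  # True -> assertion fires
```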
If I change the shape of request_output_len and beam_width to [1]:
curl -X POST localhost:8000/v2/models/flan_t5_tensorrt_11/infer -d '{"inputs": [{"name" : "input_ids", "shape" : [ 2, 3 ], "datatype" : "INT32", "data" : [[1,2,3], [4,5,6]] }, {"name" : "beam_width", "shape" : [1], "datatype" : "INT32", "data" : [5] }, {"name" : "request_output_len", "shape" : [1], "datatype" : "INT32", "data" : [256] }]}'
{"error":"[request id: <id_unknown>] input 'input_ids' batch size does not match other inputs for 'flan_t5_tensorrt_11'"}
This makes sense, since batching is enabled.
What is the correct way to use the inflight_batcher_llm/tensorrt_llm example with a request containing more than one sample?
Additional notes
I do not think this is a bug; however, I am not sure how to use dynamic batching correctly in this scenario.
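One workaround I am considering (a sketch only, not verified against this backend): since gpt_model_type is inflight_fused_batching, the server batches across requests itself, so a client-side batch could be split into independent single-sample requests sent concurrently, letting the in-flight batcher fuse them server-side. The helper below is hypothetical and only restructures the JSON body:

```python
import copy


def split_batched_payload(payload):
    """Split a client-side batched infer body into per-sample bodies
    (batch dim 1 each), so they can be POSTed concurrently and the
    server's in-flight batcher can combine them."""
    batch = payload["inputs"][0]["shape"][0]
    singles = []
    for i in range(batch):
        single = copy.deepcopy(payload)
        for tensor in single["inputs"]:
            tensor["shape"][0] = 1          # batch dim becomes 1
            tensor["data"] = [tensor["data"][i]]  # keep only sample i
        singles.append(single)
    return singles


# The batched request from the repro, restructured client-side:
batched = {"inputs": [
    {"name": "input_ids", "shape": [2, 3], "datatype": "INT32",
     "data": [[1, 2, 3], [4, 5, 6]]},
    {"name": "request_output_len", "shape": [2, 1], "datatype": "INT32",
     "data": [[256], [256]]},
]}
singles = split_batched_payload(batched)
```

Each resulting body could then be POSTed to /v2/models/flan_t5_tensorrt_11/infer from its own thread or async task.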