Description
System Info
Using nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
Who can help?
@juney-nvidi
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I created my own config.pbtxt by following the official example. Everything works perfectly when I use batch size 1.
My config.pbtxt:
name: "flan_t5_tensorrt_11"
backend: "tensorrtllm"
max_batch_size: 64
dynamic_batching {
max_queue_delay_microseconds: 10000
}
input [
{
name: "input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
allow_ragged_batch: true
},
{
name: "request_output_len"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "return_log_probs"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_width"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "end_id"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "pad_id"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
}
]
output [
{
name: "output_ids"
data_type: TYPE_INT32
dims: [ -1, -1 ]
},
{
name: "sequence_length"
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "output_log_probs"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "batch_index"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "context_logits"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "generation_logits"
data_type: TYPE_FP32
dims: [ -1, -1, -1 ]
}
]
instance_group [
{
count: 1
kind: KIND_CPU
}
]
parameters: {
key: "max_beam_width"
value: {
string_value: "5"
}
}
parameters: {
key: "FORCE_CPU_ONLY_INPUT_TENSORS"
value: {
string_value: "no"
}
}
parameters: {
key: "gpt_model_type"
value: {
string_value: "inflight_fused_batching"
}
}
parameters: {
key: "gpt_model_path"
value: {
string_value: "/model/flan_t5_tensorrt_11/1/decoder"
}
}
parameters: {
key: "encoder_model_path"
value: {
string_value: "/model/flan_t5_tensorrt_11/1/encoder"
}
}
parameters: {
key: "max_tokens_in_paged_kv_cache"
value: {
string_value: "8192"
}
}
parameters: {
key: "max_attention_window_size"
value: {
string_value: "8192"
}
}
parameters: {
key: "batch_scheduler_policy"
value: {
string_value: "guaranteed_no_evict"
}
}
parameters: {
key: "decoding_mode"
value: {
string_value: "beam_search"
}
}
parameters: {
key: "executor_worker_path"
value: {
string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker"
}
}
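For reference, here is a minimal sketch of how I build the KServe-v2-style JSON body for a single-sample request against this config (the helper name is my own; tensor names and dtypes are taken from the config above, with the per-request scalars carrying an explicit batch dimension of 1):

```python
import json


def build_infer_payload(input_ids, output_len, beam_width):
    """Build a KServe v2 infer body for ONE sample (batch dim = 1).

    Tensor names match the config.pbtxt above; request_output_len and
    beam_width are declared with dims [1] plus reshape to scalar, so
    each carries shape [1, 1] on the wire for a batch of one.
    """
    return {
        "inputs": [
            {"name": "input_ids", "shape": [1, len(input_ids)],
             "datatype": "INT32", "data": [input_ids]},
            {"name": "request_output_len", "shape": [1, 1],
             "datatype": "INT32", "data": [[output_len]]},
            {"name": "beam_width", "shape": [1, 1],
             "datatype": "INT32", "data": [[beam_width]]},
        ]
    }


# The batch-size-1 request that works for me:
payload = build_infer_payload([1, 2, 3], 256, 5)
body = json.dumps(payload)
```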
Expected behavior
curl -X POST localhost:8000/v2/models/flan_t5_tensorrt_11/infer -d '{"inputs": [{"name" : "input_ids", "shape" : [ 2, 3 ], "datatype" : "INT32", "data" : [[1,2,3], [4,5,6]] }, {"name" : "beam_width", "shape" : [2, 1], "datatype" : "INT32", "data" : [[5], [5]] }, {"name" : "request_output_len", "shape" : [2, 1], "datatype" : "INT32", "data" : [[256], [256]] }]}'
The request above should not fail.
Actual behavior
When I run inference with batch size > 1:
curl -X POST localhost:8000/v2/models/flan_t5_tensorrt_11/infer -d '{"inputs": [{"name" : "input_ids", "shape" : [ 2, 3 ], "datatype" : "INT32", "data" : [[1,2,3], [4,5,6]] }, {"name" : "beam_width", "shape" : [2, 1], "datatype" : "INT32", "data" : [[5], [5]] }, {"name" : "request_output_len", "shape" : [2, 1], "datatype" : "INT32", "data" : [[256], [256]] }]}'
[TensorRT-LLM][ERROR] Assertion failed: Invalid size for tensor request_output_len (/tmp/tritonbuild/tensorrtllm/inflight_batcher_llm/src/utils.h)
This assertion fires, but I am not sure how I can pass a tensor with size 1 while using dynamic batching:
https://github.com/triton-inference-server/tensorrtllm_backend/blob/6791733e64fe7edd9eda7e0d804d0d04d30fb057/inflight_batcher_llm/src/utils.h#L161C5-L161C25
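As a rough illustration of my reading of that check (not confirmed against the backend source): each request appears to be expected to carry exactly one element for request_output_len, so a client-side batch of 2 delivers 2 elements and trips the assertion.

```python
import numpy as np

# The client-side batched tensor as sent in the failing request above.
request_output_len = np.array([[256], [256]], dtype=np.int32)  # shape [2, 1]

# Hypothetical reading of the backend check: one element per request.
expected_elements = 1
actual_elements = request_output_len.size  # 2 for a batch of two samples

mismatch = actual_elements != expected_elements  # True -> assertion fires
```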
If I change the shape of request_output_len and beam_width to [1]:
curl -X POST localhost:8000/v2/models/flan_t5_tensorrt_11/infer -d '{"inputs": [{"name" : "input_ids", "shape" : [ 2, 3 ], "datatype" : "INT32", "data" : [[1,2,3], [4,5,6]] }, {"name" : "beam_width", "shape" : [1], "datatype" : "INT32", "data" : [5] }, {"name" : "request_output_len", "shape" : [1], "datatype" : "INT32", "data" : [256] }]}'
{"error":"[request id: <id_unknown>] input 'input_ids' batch size does not match other inputs for 'flan_t5_tensorrt_11'"}
This makes sense, since batching is enabled.
What is the correct way to use the inflight_batcher_llm/tensorrt_llm example with a request containing more than one sample?
Additional notes
I do not think this is a bug; however, I am not sure how to use dynamic batching correctly in this scenario.
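One workaround I am considering (a sketch only, not verified against this backend): since gpt_model_type is inflight_fused_batching, the server batches across requests itself, so a client-side batch could be split into independent single-sample requests sent concurrently, letting the in-flight batcher fuse them server-side. The helper below is hypothetical and only restructures the JSON body:

```python
import copy


def split_batched_payload(payload):
    """Split a client-side batched infer body into per-sample bodies
    (batch dim 1 each), so they can be POSTed concurrently and the
    server's in-flight batcher can combine them."""
    batch = payload["inputs"][0]["shape"][0]
    singles = []
    for i in range(batch):
        single = copy.deepcopy(payload)
        for tensor in single["inputs"]:
            tensor["shape"][0] = 1          # batch dim becomes 1
            tensor["data"] = [tensor["data"][i]]  # keep only sample i
        singles.append(single)
    return singles


# The batched request from the repro, restructured client-side:
batched = {"inputs": [
    {"name": "input_ids", "shape": [2, 3], "datatype": "INT32",
     "data": [[1, 2, 3], [4, 5, 6]]},
    {"name": "request_output_len", "shape": [2, 1], "datatype": "INT32",
     "data": [[256], [256]]},
]}
singles = split_batched_payload(batched)
```

Each resulting body could then be POSTed to /v2/models/flan_t5_tensorrt_11/infer from its own thread or async task.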