An error that `Shape does not match true shape of 'data' field` occurs when using tensorrt_llm model alone in inflight_batcher_llm

### System Info

This issue is not related to the system.

### Who can help?

@byshiue @schetlur-nv 

### Information

- [ ] The official example scripts
- [X] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

1. launch the triton inference server (tensorrt_llm alone)
2. Post via `curl` command like this:
    ```
    $ curl -X POST localhost:8000/v2/models/tensorrt_llm/generate_stream -H "Content-Type: application/json" -d '{"input_ids": [1,2,3], "input_lengths": [3], "request_output_len": [3], "streaming": true}'
    ```

### Expected behavior

return outputs with `output_ids`, ...

### actual behavior

an error occurs:
```
{"error":"Unable to parse 'data': Shape does not match true shape of 'data' field"}
```

Here is a log:
```
I1010 08:39:21.059183 679062 http_server.cc:4590] "HTTP request: 2 /v2/models/tensorrt_llm/generate_stream"
I1010 08:39:21.059254 679062 model_lifecycle.cc:339] "GetModel() 'tensorrt_llm' version -1"
I1010 08:39:21.059269 679062 model_lifecycle.cc:297] "VersionStates() 'tensorrt_llm'"
I1010 08:39:21.059398 679062 model_lifecycle.cc:339] "GetModel() 'tensorrt_llm' version -1"
I1010 08:39:21.059422 679062 http_server.cc:3305] "[request id: <id_unknown>] Infer failed: Unable to parse 'data': Shape does not match true shape of 'data' field"
```

### additional notes

I am trying to use only `tensorrt_llm` model in inflight_batcher_llm.
Here is my status.
```
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1010 08:24:52.622037 679062 server.cc:631] 
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                                                 |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":" |
|             |                                                                 | 6.000000","default-max-batch-size":"4"}}                                                                               |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+

I1010 08:24:52.622073 679062 model_lifecycle.cc:276] "ModelStates()"
I1010 08:24:52.622092 679062 server.cc:674] 
+--------------+---------+--------+
| Model        | Version | Status |
+--------------+---------+--------+
| tensorrt_llm | 1       | READY  |
+--------------+---------+--------+

I1010 08:24:52.696724 679062 metrics.cc:877] "Collecting metrics for GPU 0: NVIDIA RTX A6000"
I1010 08:24:52.696755 679062 metrics.cc:877] "Collecting metrics for GPU 1: NVIDIA RTX A6000"
I1010 08:24:52.696760 679062 metrics.cc:877] "Collecting metrics for GPU 2: NVIDIA RTX A6000"
I1010 08:24:52.696765 679062 metrics.cc:877] "Collecting metrics for GPU 3: NVIDIA RTX A6000"
I1010 08:24:52.703944 679062 metrics.cc:770] "Collecting CPU metrics"
I1010 08:24:52.704059 679062 tritonserver.cc:2598] 
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                              |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                             |
| server_version                   | 2.50.0                                                                                                                                                             |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_te |
|                                  | nsor_data parameters statistics trace logging                                                                                                                      |
| model_repository_path[0]         | /data/triton-server-engines/triton_model_repo                                                                                                                      |
| model_control_mode               | MODE_NONE                                                                                                                                                          |
| strict_model_config              | 1                                                                                                                                                                  |
| model_config_name                |                                                                                                                                                                    |
| rate_limit                       | OFF                                                                                                                                                                |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                          |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                           |
| cuda_memory_pool_byte_size{1}    | 67108864                                                                                                                                                           |
| cuda_memory_pool_byte_size{2}    | 67108864                                                                                                                                                           |
| cuda_memory_pool_byte_size{3}    | 67108864                                                                                                                                                           |
| min_supported_compute_capability | 6.0                                                                                                                                                                |
| strict_readiness                 | 1                                                                                                                                                                  |
| exit_timeout                     | 30                                                                                                                                                                 |
| cache_enabled                    | 0                                                                                                                                                                  |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+

```

I set `dims` of `input_ids` to `-1` (I don't change from default value), and my `tensorrt_llm` model's config.pbtxt is here.
<details>
<summary>config.pbtxt</summary>

```
name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 1

model_transaction_policy {
  decoupled: true
}

dynamic_batching {
    preferred_batch_size: [ 1 ]
    max_queue_delay_microseconds: 0
    default_queue_policy: { max_queue_size: 0 }
}

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    allow_ragged_batch: true
  },
  {
    name: "input_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "draft_input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "decoder_input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "decoder_input_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
    reshape: { shape: [ ] }
  },
  {
    name: "draft_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "draft_acceptance_threshold"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "end_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "pad_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "stop_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "bad_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "embedding_bias"
    data_type: TYPE_FP32
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "beam_width"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_k"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p_min"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p_decay"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p_reset_ids"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "len_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "early_stopping"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "min_length"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_search_diversity_rate"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "presence_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "frequency_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_context_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_generation_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "stop"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "streaming"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "prompt_embedding_table"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "prompt_table_extra_ids"
    data_type: TYPE_UINT64
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "prompt_vocab_size"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  # the unique task ID for the given LoRA.
  # To perform inference with a specific LoRA for the first time `lora_task_id` `lora_weights` and `lora_config` must all be given.
  # The LoRA will be cached, so that subsequent requests for the same task only require `lora_task_id`.
  # If the cache is full the oldest LoRA will be evicted to make space for new ones.  An error is returned if `lora_task_id` is not cached.
  {
    name: "lora_task_id"
	data_type: TYPE_UINT64
	dims: [ 1 ]
    reshape: { shape: [ ] }
	optional: true
  },
  # weights for a lora adapter shape [ num_lora_modules_layers, D x Hi + Ho x D ]
  # where the last dimension holds the in / out adapter weights for the associated module (e.g. attn_qkv) and model layer
  # each of the in / out tensors are first flattened and then concatenated together in the format above.
  # D=adapter_size (R value), Hi=hidden_size_in, Ho=hidden_size_out.
  {
    name: "lora_weights"
	data_type: TYPE_FP16
	dims: [ -1, -1 ]
	optional: true
	allow_ragged_batch: true
  },
  # module identifier (same size a first dimension of lora_weights)
  # See LoraModule::ModuleType for model id mapping
  #
  # "attn_qkv": 0     # compbined qkv adapter
  # "attn_q": 1       # q adapter
  # "attn_k": 2       # k adapter
  # "attn_v": 3       # v adapter
  # "attn_dense": 4   # adapter for the dense layer in attention
  # "mlp_h_to_4h": 5  # for llama2 adapter for gated mlp layer after attention / RMSNorm: up projection
  # "mlp_4h_to_h": 6  # for llama2 adapter for gated mlp layer after attention / RMSNorm: down projection
  # "mlp_gate": 7     # for llama2 adapter for gated mlp later after attention / RMSNorm: gate
  #
  # last dim holds [ module_id, layer_idx, adapter_size (D aka R value) ]
  {
    name: "lora_config"
	data_type: TYPE_INT32
	dims: [ -1, 3 ]
	optional: true
	allow_ragged_batch: true
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "sequence_length"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "context_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "generation_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  },
  {
    name: "batch_index"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]
instance_group [
  {
    count: 1
    kind : KIND_CPU
  }
]
parameters: {
  key: "max_beam_width"
  value: {
    string_value: "${max_beam_width}"
  }
}
parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {
    string_value: "no"
  }
}
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/data/triton-server-engines/llama-3-8b-engine/"
  }
}
parameters: {
  key: "encoder_model_path"
  value: {
    string_value: "${encoder_engine_dir}"
  }
}
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "${max_tokens_in_paged_kv_cache}"
  }
}
parameters: {
  key: "max_attention_window_size"
  value: {
    string_value: "${max_attention_window_size}"
  }
}
parameters: {
  key: "sink_token_length"
  value: {
    string_value: "${sink_token_length}"
  }
}
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "${batch_scheduler_policy}"
  }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "${kv_cache_free_gpu_mem_fraction}"
  }
}
parameters: {
  key: "kv_cache_host_memory_bytes"
  value: {
    string_value: "${kv_cache_host_memory_bytes}"
  }
}
# kv_cache_onboard_blocks is for internal implementation.
parameters: {
  key: "kv_cache_onboard_blocks"
  value: {
    string_value: "${kv_cache_onboard_blocks}"
  }
}
# enable_trt_overlap is deprecated and doesn't have any effect on the runtime
# parameters: {
#   key: "enable_trt_overlap"
#   value: {
#     string_value: "${enable_trt_overlap}"
#   }
# }
parameters: {
  key: "exclude_input_in_output"
  value: {
    string_value: "true"
  }
}
parameters: {
  key: "cancellation_check_period_ms"
  value: {
    string_value: "${cancellation_check_period_ms}"
  }
}
parameters: {
  key: "stats_check_period_ms"
  value: {
    string_value: "${stats_check_period_ms}"
  }
}
parameters: {
  key: "iter_stats_max_iterations"
  value: {
    string_value: "${iter_stats_max_iterations}"
  }
}
parameters: {
  key: "request_stats_max_iterations"
  value: {
    string_value: "${request_stats_max_iterations}"
  }
}
parameters: {
  key: "enable_kv_cache_reuse"
  value: {
    string_value: "true"
  }
}
parameters: {
  key: "normalize_log_probs"
  value: {
    string_value: "${normalize_log_probs}"
  }
}
parameters: {
  key: "enable_chunked_context"
  value: {
    string_value: "false"
  }
}
parameters: {
  key: "gpu_device_ids"
  value: {
    string_value: "${gpu_device_ids}"
  }
}
parameters: {
  key: "participant_ids"
  value: {
    string_value: "${participant_ids}"
  }
}
parameters: {
  key: "lora_cache_optimal_adapter_size"
  value: {
    string_value: "${lora_cache_optimal_adapter_size}"
  }
}
parameters: {
  key: "lora_cache_max_adapter_size"
  value: {
    string_value: "${lora_cache_max_adapter_size}"
  }
}
parameters: {
  key: "lora_cache_gpu_memory_fraction"
  value: {
    string_value: "${lora_cache_gpu_memory_fraction}"
  }
}
parameters: {
  key: "lora_cache_host_memory_bytes"
  value: {
    string_value: "${lora_cache_host_memory_bytes}"
  }
}
parameters: {
  key: "decoding_mode"
  value: {
    string_value: "${decoding_mode}"
  }
}
parameters: {
  key: "executor_worker_path"
  value: {
    string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker"
  }
}
parameters: {
  key: "medusa_choices"
    value: {
      string_value: "${medusa_choices}"
  }
}
parameters: {
  key: "gpu_weights_percent"
    value: {
      string_value: "${gpu_weights_percent}"
  }
}
parameters: {
  key: "enable_context_fmha_fp32_acc"
  value: {
    string_value: "${enable_context_fmha_fp32_acc}"
  }
}
parameters: {
  key: "multi_block_mode"
  value: {
    string_value: "${multi_block_mode}"
  }
}
```

</details>

When I post via curl command, an error occurs:
```
$ curl -X POST localhost:8000/v2/models/tensorrt_llm/generate_stream -H "Content-Type: application/json" -d '{"input_ids": [1,2,3], "input_lengths": [3], "request_output_len": [3], "streaming": true}'
{"error":"Unable to parse 'data': Shape does not match true shape of 'data' field"}
```
This command also doesn't work:
```
$ curl -X POST localhost:8000/v2/models/tensorrt_llm/generate_stream -H "Content-Type: application/json" -d '{"input_ids": [[1],[2],[3]], "input_lengths": [3], "request_output_len": [3], "streaming": true}'
{"error":"Unable to parse 'data': Shape does not match true shape of 'data' field"}
```

However, when I set `input_ids` to `[1]`, it works:
```
$ curl -X POST localhost:8000/v2/models/tensorrt_llm/generate_stream -H "Content-Type: application/json" -d '{"input_ids": [1], "input_lengths": [1], "request_output_len": [3], "streaming": true}'
data: {"batch_index":0,"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"tensorrt_llm","model_version":"1","output_ids":320,"output_log_probs":0.0,"sequence_length":1}

data: {"batch_index":0,"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"tensorrt_llm","model_version":"1","output_ids":1049,"output_log_probs":0.0,"sequence_length":1}

data: {"batch_index":0,"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"tensorrt_llm","model_version":"1","output_ids":24,"output_log_probs":0.0,"sequence_length":1}
```

How can I pass multiple input token ids via HTTP request ?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

An error that `Shape does not match true shape of 'data' field` occurs when using tensorrt_llm model alone in inflight_batcher_llm #613

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

actual behavior

additional notes

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

An error that Shape does not match true shape of 'data' field occurs when using tensorrt_llm model alone in inflight_batcher_llm #613

Description

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

actual behavior

additional notes

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

An error that `Shape does not match true shape of 'data' field` occurs when using tensorrt_llm model alone in inflight_batcher_llm #613