Description
System Info
- 1x H100
- Llama3 8B Instruct
- TensorRT-LLM v0.10.0
- tensorrtllm_backend v0.10.0
- tritonserver 24.06
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Follow this benchmarking article to create the Llama3 8B Instruct model engine (but also using a converted checkpoint for trtllm-build). Then, finish the setup similar to this blog.
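For reference, the engine-build step I ran looks roughly like the following. This is a sketch, not the exact commands from the article; the paths are placeholders for my local directories:

```shell
# Sketch of the build steps (placeholder paths, defaults elsewhere).
# 1) Convert the Hugging Face checkpoint to TensorRT-LLM format
python examples/llama/convert_checkpoint.py \
    --model_dir ./Meta-Llama-3-8B-Instruct \
    --output_dir ./ckpt_llama3_8b \
    --dtype float16

# 2) Build the engine from the converted checkpoint
trtllm-build \
    --checkpoint_dir ./ckpt_llama3_8b \
    --output_dir ./engine_llama3_8b \
    --gemm_plugin float16
```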
Expected behavior
After following the benchmarking article, I measure a throughput of ~10,000 tokens per second. I hope to achieve the same on a Triton Inference Server.
actual behavior
Instead, a single request has a throughput of around 80 tokens per second, and I'm struggling to figure out how to efficiently manage concurrent requests. I'm hoping to achieve the 10,000 tokens per second that the GPU/engine has shown it is capable of reaching, as per the benchmark.
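For context, this is roughly how I'm estimating aggregate throughput under concurrency. `send_request` below is a hypothetical stand-in for the real tritonclient infer call; only the measurement logic matters here:

```python
# Minimal sketch: measure aggregate token throughput across concurrent
# requests. send_request() is a placeholder for a tritonclient
# (HTTP/gRPC) call and just reports how many tokens it produced.
import time
from concurrent.futures import ThreadPoolExecutor


def send_request(prompt: str) -> int:
    # Placeholder for the real inference call; returns the number of
    # generated tokens (fixed here purely for illustration).
    time.sleep(0.01)  # simulate server latency
    return 128


def measure_throughput(num_requests: int, concurrency: int) -> float:
    """Return tokens/sec over num_requests sent with the given concurrency."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(send_request, ["hi"] * num_requests))
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed


if __name__ == "__main__":
    print(f"{measure_throughput(32, 8):.0f} tokens/sec")
```

With the real client plugged in, raising `concurrency` is how I'd expect to approach the benchmark's aggregate numbers, but I don't see that happening in practice.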
additional notes
Ultimately, I think this comes down to needing a better understanding of Triton and TensorRT-LLM (e.g., parameters, model configs, etc.), which I admittedly don't have. Suggestions for how to create an environment similar to the benchmark would be greatly appreciated.
I am aware of Triton's instance_group model parameter that would allow me to spawn multiple instances of the same model, but it seems like I'm limited to only three on one GPU due to memory constraints. This would only achieve 3 * 80 = 240 tokens per second, which is, again, far from what I'm hoping for.
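Concretely, this is the instance_group setting in the model's config.pbtxt that I experimented with (the count of 3 is the most that fits in memory on my H100):

```
instance_group [
  {
    count: 3
    kind: KIND_GPU
  }
]
```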