API Usage Error: profileMinDims.d[i] <= dimensions.d[i] #532

@LanceB57

Description

System Info

  • 1x H100
  • Llama3 8B Instruct
  • TensorRT-LLM v0.10.0
  • tensorrtllm_backend v0.10.0
  • tritonserver 24.06

Who can help?

@kaiyux

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Follow this benchmarking article to create the Llama3 8B Instruct model engine (but also using a converted checkpoint for trtllm-build). Then, finish setup similarly to this blog.

Expected behavior

Following the benchmarking article, I measured throughput of ~10,000 tokens per second. I hope to achieve the same on a Triton Inference Server.

Actual behavior

Instead, a single request has throughput of around 80 tokens per second, and I'm struggling to figure out how to efficiently manage concurrent requests. I'm hoping to reach the 10,000 tokens per second that the GPU/engine has shown itself capable of in the benchmark.
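For context on how I'm measuring this: single-request throughput and aggregate throughput under concurrency are very different numbers, since the engine can batch concurrent requests. A minimal sketch of the measurement I have in mind (the `send_request` function is a hypothetical stand-in for a real `tritonclient` HTTP/gRPC call; it just simulates latency and a fixed token count):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def send_request(prompt):
    """Hypothetical stand-in for a Triton client call.

    A real version would use tritonclient (HTTP or gRPC) against the
    ensemble/tensorrt_llm model; here it only simulates latency and
    returns a pretend generated-token count.
    """
    time.sleep(0.01)  # simulate network + generation latency
    return 128  # pretend each request generated 128 tokens


def measure_throughput(num_requests=64, concurrency=16):
    """Issue num_requests in parallel and return aggregate tokens/sec."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(send_request, ["hello"] * num_requests))
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed


print(f"aggregate throughput: {measure_throughput():.0f} tokens/s")
```

With the simulated numbers above, aggregate throughput scales roughly with concurrency, which is the behavior I would expect the real server to show if batching were working.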

Additional notes

Ultimately, I think this comes down to needing a better understanding of Triton and TensorRT-LLM (e.g., parameters, model configs, etc.), which I admittedly don't have. Suggestions for how to create an environment similar to the benchmark would be greatly appreciated.

I am aware of Triton's instance_group model parameter that would allow me to spawn multiple instances of the same model, but it seems like I'm limited to only three on one GPU due to memory constraints. That would only achieve 3 * 80 = 240 tokens per second, which is, again, far from what I'm hoping for.
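My current understanding is that the gap shouldn't be closed by adding instances at all: TensorRT-LLM's in-flight batching should batch concurrent requests inside a single model instance, so one instance with batching enabled can serve many requests at once. A sketch of the relevant fields in the tensorrt_llm model's config.pbtxt, based on the tensorrtllm_backend templates (field names should be verified against the v0.10.0 repo):

```
# model_repository/tensorrt_llm/config.pbtxt (sketch; verify field
# names against the tensorrtllm_backend v0.10.0 templates)
backend: "tensorrtllm"
max_batch_size: 64

# One instance is enough when in-flight batching is on.
instance_group [
  { count: 1, kind: KIND_GPU }
]

parameters: {
  key: "gpt_model_type"
  value: { string_value: "inflight_fused_batching" }
}
parameters: {
  key: "batch_scheduler_policy"
  value: { string_value: "max_utilization" }
}
```

If this is the right direction, the engine would also need to be built with a matching max_batch_size for the batching to take effect.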
