
Feature request: support multiple model instances on the TensorRT-LLM Triton backend #105

@wengsnow

Description


I built a Baichuan2-13B engine with weight-only INT8 quantization and launched a Triton server on a single GPU. Now I have a node with 2 GPUs and want to run multiple model instances, one instance per GPU, to improve concurrent request throughput.

With the default config ("kind: KIND_CPU, count: 1"), GPU usage is as follows:

[screenshot: GPU usage with the default config]
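For reference, the instance_group section in the generated config.pbtxt looked roughly like this (a sketch; the exact template may differ across tensorrtllm_backend versions):

```
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
```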

Just like with the FasterTransformer backend, I set "kind: KIND_CPU, count: 2" (https://github.com/triton-inference-server/fastertransformer_backend/blob/c6936c39be6fc57c161db34cfe5c45d1bef81ee2/README.md?plain=1#L48), but it does not work and I get the following error:

[screenshot of the error]

When this error occurred, memory usage on the first GPU increased while memory on the second GPU did not change.
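The only change I made was the instance count, roughly as below (a sketch mirroring the FasterTransformer README linked above):

```
instance_group [
  {
    count: 2
    kind: KIND_CPU
  }
]
```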

When I follow the Triton model configuration guide (https://github.com/triton-inference-server/server/blob/0f1f9246ee8257db6196da65950205be79379401/docs/user_guide/model_configuration.md?plain=1#L600) and set "kind: KIND_GPU, count: 1", I get the same error.
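Roughly, that attempt was (a sketch based only on the description above):

```
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
```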

I want to know how to get multiple model instances working with the TensorRT-LLM backend, which is an important feature for Triton.

Labels

feature request, triaged
