
Feature request: support multiple model instances on the TensorRT-LLM Triton backend #105

@wengsnow

Description


I built a Baichuan2-13B engine with weight-only INT8 quantization and launched a Triton server on a single GPU. Now I have a node with 2 GPUs and want to run multiple model instances, one instance per GPU, to improve concurrent request throughput.

With the default config ("kind: KIND_CPU, count: 1"), GPU usage is as follows:

[screenshot: GPU usage with the default config]
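For reference, the instance_group section in the generated config.pbtxt looked roughly like this (a sketch; the exact template may differ across tensorrtllm_backend versions):

```
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
```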

Just like with the FasterTransformer backend, I set "kind: KIND_CPU, count: 2" (https://github.com/triton-inference-server/fastertransformer_backend/blob/c6936c39be6fc57c161db34cfe5c45d1bef81ee2/README.md?plain=1#L48), but it does not work and I get the following error:

[screenshot of the error]

When this error occurred, memory usage on the first GPU increased while memory on the second GPU did not change.
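The only change I made was the instance count, roughly as below (a sketch mirroring the FasterTransformer README linked above):

```
instance_group [
  {
    count: 2
    kind: KIND_CPU
  }
]
```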

When I follow the Triton model configuration guide (https://github.com/triton-inference-server/server/blob/0f1f9246ee8257db6196da65950205be79379401/docs/user_guide/model_configuration.md?plain=1#L600) and set "kind: KIND_GPU, count: 1", I get the same error.
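Roughly, that attempt was (a sketch based only on the description above):

```
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
```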

I want to know how to get multiple model instances working with the TensorRT-LLM backend, which is an important feature for Triton.

Labels

feature request, triaged
