Description
I used the Baichuan2 13B model with weight-only int8 quantization and launched a Triton server on a single GPU. Now I have a node with 2 GPUs and want to run multiple model instances, one instance per GPU, to improve concurrent request throughput.
With the default config ("kind: KIND_CPU", "count: 1"), the GPU usage was as follows:

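For reference, this is roughly what the default instance_group section of my config.pbtxt looks like (a minimal sketch; the model name, backend, and input/output fields are omitted):

```
# Sketch of the default instance_group in config.pbtxt
# (other fields such as name, backend, inputs, and outputs are omitted)
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
```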
Just like with fastertransformer, I set "kind: KIND_CPU", "count: 2" (see the sketch below), following https://github.com/triton-inference-server/fastertransformer_backend/blob/c6936c39be6fc57c161db34cfe5c45d1bef81ee2/README.md?plain=1#L48. But it does not work; I get the following error:

When this error occurred, memory usage on the first GPU increased, while memory usage on the second GPU did not change.
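This is the fastertransformer-style configuration I tried (again only a sketch of the instance_group section; the rest of config.pbtxt was unchanged):

```
# Attempted fastertransformer-style config: two model instances
instance_group [
  {
    count: 2
    kind: KIND_CPU
  }
]
```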
When I followed the Triton guide (https://github.com/triton-inference-server/server/blob/0f1f9246ee8257db6196da65950205be79379401/docs/user_guide/model_configuration.md?plain=1#L600) and set "kind: KIND_GPU", "count: 1", I got the same error.
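This is the variant I tried based on that guide (a sketch; my understanding is that without an explicit "gpus" field Triton may place instances on the available GPUs automatically, but that is an assumption on my side):

```
# Config following the Triton model_configuration guide
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
```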
I want to know how to make multiple model instances work with the tensorrtllm backend, since this is an important feature of Triton.
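For clarity, what I am ultimately trying to achieve is one model instance pinned to each GPU, along the lines of the following sketch (assuming the per-group "gpus" field from the Triton model configuration docs also applies to the tensorrtllm backend, which is exactly my question):

```
# Goal (sketch): one instance per GPU
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 1 ]
  }
]
```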