Open
Labels: triaged (Issue has been triaged by maintainers)
Description
Is multi-node supported in Triton Inference Server?
I built a LLaMA-7B engine for tensorrtllm_backend and started Triton Inference Server.
I have 4 GPUs, but Triton Inference Server only loads the model onto 1 GPU.
nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3
build (llama2)
python build.py --model_dir ${model_directory} \
--dtype float16 \
--use_gpt_attention_plugin bfloat16 \
--use_inflight_batching \
--paged_kv_cache \
--remove_input_padding \
--use_gemm_plugin float16 \
--output_dir engines/fp16/1-gpu/
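
Note: the build command above produces a single-GPU engine, so Triton can only place it on one GPU. To shard the model across all 4 GPUs, the engine has to be built with tensor parallelism. As a hedged sketch (assuming the llama example's build.py in this release accepts the --world_size and --tp_size flags used in the TensorRT-LLM examples):

python build.py --model_dir ${model_directory} \
                --dtype float16 \
                --use_gpt_attention_plugin bfloat16 \
                --use_inflight_batching \
                --paged_kv_cache \
                --remove_input_padding \
                --use_gemm_plugin float16 \
                --world_size 4 \
                --tp_size 4 \
                --output_dir engines/fp16/4-gpu/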
run
tritonserver --model-repo=/tensorrtllm_backend/triton_model_repo --disable-auto-complete-config
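
A single tritonserver process drives only one rank, so a tensor-parallel (e.g. 4-GPU) engine is normally served with one MPI rank per GPU. A minimal sketch, assuming the scripts/launch_triton_server.py helper shipped in the tensorrtllm_backend repository and the same model repository path as above:

python3 /tensorrtllm_backend/scripts/launch_triton_server.py \
        --world_size=4 \
        --model_repo=/tensorrtllm_backend/triton_model_repo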
