Description
Name and Version
./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from ...\ggml-cuda.dll
load_backend: loaded RPC backend from ...\ggml-rpc.dll
load_backend: loaded CPU backend from ...\ggml-cpu-skylakex.dll
version: 5305 (6bccecaf)
built with MSVC 19.29.30159.0 for Windows AMD64
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-cli
Command line
echo "<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant. You should think step-by-step.<|im_end|>\n<|im_start|>user\nWhat is the solution of x+5=-2??<|im_end|>\n<|im_start|>assistant\n<think>\n" | CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,2,1 ./llama-cli -m ../Qwen3-4B-128K-UD-Q4_K_XL.gguf --rope-scaling none --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 --rope_freq_base 1000000 --ctx-size 131072 --flash_attn --no-mmap --mlock --n-gpu-layers 37 --batch_size 4096 --threads 18 --simple-io --main-gpu 0
Problem description & steps to reproduce
When running llama-cli on a multi-GPU machine (here, three GPUs), --split-mode none (no layer splitting; everything runs on the main GPU) and explicitly pinning all layers to GPU0 via --tensor-split 100,0,0 produce markedly different inference speeds, even though both configurations end up placing the full model in GPU0's VRAM and leave the other two cards idle.
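A minimal A/B sketch of the comparison, assuming the remaining flags from the command above (rope/yarn settings, context size, batch size, threads) are not essential to reproducing the gap; only the split-related flags differ between the two runs:

./llama-cli -m ../Qwen3-4B-128K-UD-Q4_K_XL.gguf -ngl 37 --flash-attn --main-gpu 0 --split-mode none -p "test"
./llama-cli -m ../Qwen3-4B-128K-UD-Q4_K_XL.gguf -ngl 37 --flash-attn --main-gpu 0 --tensor-split 100,0,0 -p "test"

In the full runs reported below, the first form reaches ~235 tokens/s eval while the second reaches ~85 tokens/s.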
Expected behavior
--split-mode none and --tensor-split 100,0,0 (all layers on GPU0) should behave identically, giving the same inference throughput when only one device is doing work.
Impact
- Significant performance regression (~3× slowdown) when users attempt to force all layers onto a single GPU via --tensor-split.
- Breaks parity with "no split" mode and complicates multi-GPU deployments where the tensor split might be adjusted dynamically at runtime.
Investigation notes / hypotheses
- Kernel launch patterns: --split-mode none may use a single fused CUDA kernel per transformer block, whereas --tensor-split forces many small inter-GPU synchronization points (even if tensors are "100%" on GPU0).
- CUDA peer-to-peer checks: internal logic might still query or broadcast to other devices, adding overhead (see the standalone check after this list).
- Memory allocation paths: the allocator used for a tensor split may be different (e.g. pinned vs. device memory), leading to extra synchronization.
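To test the peer-to-peer hypothesis independently of llama.cpp, a minimal standalone check of which device pairs report P2P access can be built with the plain CUDA runtime API (this is not llama.cpp code):

// p2p_check.cu: print which CUDA device pairs report peer-to-peer access.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, i, j);    // 1 if device i can access device j's memory
            printf("P2P %d -> %d: %s\n", i, j, can ? "yes" : "no");
        }
    }
    return 0;
}

Build with nvcc p2p_check.cu -o p2p_check. If a 5090 <-> 3090 pair reports no P2P, any cross-device traffic in the tensor-split path would have to go through host memory, which could account for part of the overhead.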
Resolution suggestion
Unify the allocation / kernel-fusion logic so that a --tensor-split that places 100% of the model on a single GPU takes the same code path as --split-mode none, if possible.
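A rough sketch of what that could look like, using hypothetical names (none of these symbols exist in llama.cpp as written); the idea is to normalize a degenerate tensor split to the no-split path at load time:

// Hypothetical illustration only; names and types are not real llama.cpp symbols.
#include <vector>

enum class split_mode { none, layer, row };

// If the requested tensor split puts 100% of the weights on a single device,
// treat it exactly like "no split" so both configurations use the same
// allocation and kernel-launch path.
static split_mode normalize_split(split_mode requested, const std::vector<float> & tensor_split) {
    int nonzero = 0;
    for (float f : tensor_split) {
        if (f > 0.0f) {
            ++nonzero;
        }
    }
    if (requested != split_mode::none && nonzero <= 1) {
        return split_mode::none;    // degenerate split -> single-GPU path
    }
    return requested;
}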
First Bad Commit
No response
Relevant log output
When using --split-mode none (235.21 tokens/s):
llama_perf_sampler_print: sampling time = 63.04 ms / 608 runs ( 0.10 ms per token, 9644.82 tokens per second)
llama_perf_context_print: load time = 1566.33 ms
llama_perf_context_print: prompt eval time = 71.50 ms / 57 tokens ( 1.25 ms per token, 797.15 tokens per second)
llama_perf_context_print: eval time = 2338.30 ms / 550 runs ( 4.25 ms per token, 235.21 tokens per second)
llama_perf_context_print: total time = 2652.69 ms / 607 tokens
llama_perf_sampler_print: sampling time = 63.04 ms / 608 runs ( 0.10 ms per token, 9644.82 tokens per second)
llama_perf_context_print: load time = 1566.33 ms
llama_perf_context_print: prompt eval time = 71.50 ms / 57 tokens ( 1.25 ms per token, 797.15 tokens per second)
llama_perf_context_print: eval time = 2338.30 ms / 550 runs ( 4.25 ms per token, 235.21 tokens per second)
llama_perf_context_print: total time = 2652.84 ms / 607 tokens
When using --tensor-split 100,0,0 (84.63 tokens/s):
llama_perf_sampler_print: sampling time = 83.26 ms / 717 runs ( 0.12 ms per token, 8611.47 tokens per second)
llama_perf_context_print: load time = 1685.70 ms
llama_perf_context_print: prompt eval time = 72.50 ms / 57 tokens ( 1.27 ms per token, 786.21 tokens per second)
llama_perf_context_print: eval time = 7786.48 ms / 659 runs ( 11.82 ms per token, 84.63 tokens per second)
llama_perf_context_print: total time = 8171.76 ms / 716 tokens
llama_perf_sampler_print: sampling time = 83.26 ms / 717 runs ( 0.12 ms per token, 8611.47 tokens per second)
llama_perf_context_print: load time = 1685.70 ms
llama_perf_context_print: prompt eval time = 72.50 ms / 57 tokens ( 1.27 ms per token, 786.21 tokens per second)
llama_perf_context_print: eval time = 7786.48 ms / 659 runs ( 11.82 ms per token, 84.63 tokens per second)
llama_perf_context_print: total time = 8171.89 ms / 716 tokens