
Misc. bug: --split-mode none ≠ --tensor-split 100,0,0 (all layers on GPU0) #13612

Closed

Description

@Thireus

Name and Version

./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from ...\ggml-cuda.dll
load_backend: loaded RPC backend from ...\ggml-rpc.dll
load_backend: loaded CPU backend from ...\ggml-cpu-skylakex.dll
version: 5305 (6bccecaf)
built with MSVC 19.29.30159.0 for Windows AMD64

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-cli

Command line

echo "<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant. You should think step-by-step.<|im_end|>\n<|im_start|>user\nWhat is the solution of x+5=-2??<|im_end|>\n<|im_start|>assistant\n<think>\n" | CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,2,1 ./llama-cli -m ../Qwen3-4B-128K-UD-Q4_K_XL.gguf --rope-scaling none --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 --rope_freq_base 1000000 --ctx-size 131072 --flash_attn --no-mmap --mlock --n-gpu-layers 37 --batch_size 4096 --threads 18 --simple-io --main-gpu 0

Problem description & steps to reproduce

When running llama-cli on a multi-GPU machine (here 3 GPUs), --split-mode none (no splitting across devices) and explicitly pinning all layers to GPU0 via --tensor-split 100,0,0 produce markedly different inference speeds, even though both configurations place the entire model in GPU0's VRAM and leave the other two cards idle. The two timed runs in the log output below differ only in which of these two options is added to the command above; everything else is identical.

Expected behavior

--split-mode none and --tensor-split 100,0,0 (all layers on GPU0) should behave identically, giving the same inference throughput when only one device is doing work.

Impact

  • Significant performance regression (~3× slowdown) when users attempt to force all layers onto a single GPU via --tensor-split.
  • Breaks parity with “no split” mode and complicates multi-GPU deployments where the tensor split may need to be adjusted at runtime.

Investigation notes / hypotheses

  • Kernel launch patterns

--split-mode none may use a single fused CUDA kernel per transformer block, whereas --tensor-split forces many small inter‑GPU synchronization points (even if tensors are “100%” on GPU0).

  • CUDA peer‑to‑peer checks

Internal logic might still query or broadcast to the other devices even though they hold no layers, adding overhead (a quick standalone peer-access check is sketched after this list).

  • Memory allocation paths

The allocator used under “tensor split” may be different (e.g. pinned vs. device memory), leading to extra synchronization.
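
A quick way to test the peer-to-peer hypothesis is a standalone check (not part of llama.cpp; shown here only as a diagnostic sketch) that reports whether the driver exposes P2P access between the devices at all. It uses only the CUDA runtime API and builds with: nvcc -o p2p_check p2p_check.cu

// p2p_check.cu - report peer-to-peer availability between all visible GPUs
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("device %d: %s\n", i, prop.name);
    }
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, i, j);
            printf("peer access %d -> %d: %s\n", i, j, can ? "yes" : "no");
        }
    }
    return 0;
}

If device 0 (the 5090) reports no peer access to devices 1 and 2, any P2P-related bookkeeping in the tensor-split path would have to stage through host memory, which could account for part of the gap.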

Resolution suggestion

If possible, unify the allocation and kernel-dispatch logic so that a --tensor-split that places 100% of the layers on a single GPU takes the same code path as --split-mode none.
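
As a rough illustration of that suggestion (hypothetical code; the helper name and integration point are made up and not part of llama.cpp): if the user-supplied split assigns a non-zero share to exactly one device, the loader could fall back to the same single-device path that --split-mode none takes.

#include <vector>

// Hypothetical helper: returns the index of the only device that receives a
// non-zero share of the tensor split, or -1 if the split spans several
// devices (or is empty / all-zero).
static int single_device_of_split(const std::vector<float> & tensor_split) {
    int device = -1;
    for (std::size_t i = 0; i < tensor_split.size(); ++i) {
        if (tensor_split[i] <= 0.0f) {
            continue;
        }
        if (device != -1) {
            return -1; // more than one device has a non-zero share
        }
        device = (int) i;
    }
    return device;
}

// Sketch of how a loader could use it (pseudocode, not the real call sites):
//   int dev = single_device_of_split(params.tensor_split);
//   if (dev >= 0) {
//       // treat this exactly like --split-mode none with --main-gpu <dev>:
//       // plain per-device buffers, no split-buffer or cross-GPU scheduling
//   }

This would restore parity between --tensor-split 100,0,0 and --split-mode none without changing behavior for splits that genuinely span multiple GPUs.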

First Bad Commit

No response

Relevant log output

When using --split-mode none (235.21 tokens/s):

llama_perf_sampler_print:    sampling time =      63.04 ms /   608 runs   (    0.10 ms per token,  9644.82 tokens per second)
llama_perf_context_print:        load time =    1566.33 ms
llama_perf_context_print: prompt eval time =      71.50 ms /    57 tokens (    1.25 ms per token,   797.15 tokens per second)
llama_perf_context_print:        eval time =    2338.30 ms /   550 runs   (    4.25 ms per token,   235.21 tokens per second)
llama_perf_context_print:       total time =    2652.69 ms /   607 tokens

llama_perf_sampler_print:    sampling time =      63.04 ms /   608 runs   (    0.10 ms per token,  9644.82 tokens per second)
llama_perf_context_print:        load time =    1566.33 ms
llama_perf_context_print: prompt eval time =      71.50 ms /    57 tokens (    1.25 ms per token,   797.15 tokens per second)
llama_perf_context_print:        eval time =    2338.30 ms /   550 runs   (    4.25 ms per token,   235.21 tokens per second)
llama_perf_context_print:       total time =    2652.84 ms /   607 tokens


When using --tensor-split 100,0,0 (84.63 tokens/s):

llama_perf_sampler_print:    sampling time =      83.26 ms /   717 runs   (    0.12 ms per token,  8611.47 tokens per second)
llama_perf_context_print:        load time =    1685.70 ms
llama_perf_context_print: prompt eval time =      72.50 ms /    57 tokens (    1.27 ms per token,   786.21 tokens per second)
llama_perf_context_print:        eval time =    7786.48 ms /   659 runs   (   11.82 ms per token,    84.63 tokens per second)
llama_perf_context_print:       total time =    8171.76 ms /   716 tokens

llama_perf_sampler_print:    sampling time =      83.26 ms /   717 runs   (    0.12 ms per token,  8611.47 tokens per second)
llama_perf_context_print:        load time =    1685.70 ms
llama_perf_context_print: prompt eval time =      72.50 ms /    57 tokens (    1.27 ms per token,   786.21 tokens per second)
llama_perf_context_print:        eval time =    7786.48 ms /   659 runs   (   11.82 ms per token,    84.63 tokens per second)
llama_perf_context_print:       total time =    8171.89 ms /   716 tokens
