Description
Name and Version
./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from ...\ggml-cuda.dll
load_backend: loaded RPC backend from ...\ggml-rpc.dll
load_backend: loaded CPU backend from ...\ggml-cpu-skylakex.dll
version: 5305 (6bccecaf)
built with MSVC 19.29.30159.0 for Windows AMD64
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-cli
Command line
echo "<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant. You should think step-by-step.<|im_end|>\n<|im_start|>user\nWhat is the solution of x+5=-2??<|im_end|>\n<|im_start|>assistant\n<think>\n" | CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,2,1 ./llama-cli -m ../Qwen3-4B-128K-UD-Q4_K_XL.gguf --rope-scaling none --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 --rope_freq_base 1000000 --ctx-size 131072 --flash_attn --no-mmap --mlock --n-gpu-layers 37 --batch_size 4096 --threads 18 --simple-io --main-gpu 0
Problem description & steps to reproduce
When running llama-cli on a multi-GPU machine (here, three GPUs), --split-mode none (no layer splitting; everything runs on the main GPU) and explicitly pinning all layers to GPU0 via --tensor-split 100,0,0 produce markedly different inference speeds, even though both configurations end up placing the full model in GPU0's VRAM and leave the other two cards idle.
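A minimal A/B sketch of the comparison, assuming the remaining flags from the command above (rope/yarn settings, context size, batch size, threads) are not essential to reproducing the gap; only the split-related flags differ between the two runs:

./llama-cli -m ../Qwen3-4B-128K-UD-Q4_K_XL.gguf -ngl 37 --flash-attn --main-gpu 0 --split-mode none -p "test"
./llama-cli -m ../Qwen3-4B-128K-UD-Q4_K_XL.gguf -ngl 37 --flash-attn --main-gpu 0 --tensor-split 100,0,0 -p "test"

In the full runs reported below, the first form reaches ~235 tokens/s eval while the second reaches ~85 tokens/s.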
Expected behavior
--split-mode none and --tensor-split 100,0,0 (all layers on GPU0) should behave identically, giving the same inference throughput when only one device is doing work.
Impact
- Significant performance regression (~3× slowdown) when users attempt to force all layers onto a single GPU via --tensor-split.
- Breaks parity with "no split" mode and complicates multi-GPU deployments where the tensor split might be adjusted dynamically at runtime.
Investigation notes / hypotheses
- Kernel launch patterns: --split-mode none may use a single fused CUDA kernel per transformer block, whereas --tensor-split forces many small inter-GPU synchronization points (even if tensors are "100%" on GPU0).
- CUDA peer-to-peer checks: internal logic might still query or broadcast to other devices, adding overhead (see the standalone check after this list).
- Memory allocation paths: the allocator used for a tensor split may be different (e.g. pinned vs. device memory), leading to extra synchronization.
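To test the peer-to-peer hypothesis independently of llama.cpp, a minimal standalone check of which device pairs report P2P access can be built with the plain CUDA runtime API (this is not llama.cpp code):

// p2p_check.cu: print which CUDA device pairs report peer-to-peer access.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, i, j);    // 1 if device i can access device j's memory
            printf("P2P %d -> %d: %s\n", i, j, can ? "yes" : "no");
        }
    }
    return 0;
}

Build with nvcc p2p_check.cu -o p2p_check. If a 5090 <-> 3090 pair reports no P2P, any cross-device traffic in the tensor-split path would have to go through host memory, which could account for part of the overhead.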
Resolution suggestion
Unify the allocation / kernel-fusion logic so that a --tensor-split that places 100% of the model on a single GPU takes the same code path as --split-mode none, if possible.
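A rough sketch of what that could look like, using hypothetical names (none of these symbols exist in llama.cpp as written); the idea is to normalize a degenerate tensor split to the no-split path at load time:

// Hypothetical illustration only; names and types are not real llama.cpp symbols.
#include <vector>

enum class split_mode { none, layer, row };

// If the requested tensor split puts 100% of the weights on a single device,
// treat it exactly like "no split" so both configurations use the same
// allocation and kernel-launch path.
static split_mode normalize_split(split_mode requested, const std::vector<float> & tensor_split) {
    int nonzero = 0;
    for (float f : tensor_split) {
        if (f > 0.0f) {
            ++nonzero;
        }
    }
    if (requested != split_mode::none && nonzero <= 1) {
        return split_mode::none;    // degenerate split -> single-GPU path
    }
    return requested;
}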
First Bad Commit
No response
Relevant log output
When using --split-mode none (235.21 tokens/s):
llama_perf_sampler_print: sampling time = 63.04 ms / 608 runs ( 0.10 ms per token, 9644.82 tokens per second)
llama_perf_context_print: load time = 1566.33 ms
llama_perf_context_print: prompt eval time = 71.50 ms / 57 tokens ( 1.25 ms per token, 797.15 tokens per second)
llama_perf_context_print: eval time = 2338.30 ms / 550 runs ( 4.25 ms per token, 235.21 tokens per second)
llama_perf_context_print: total time = 2652.69 ms / 607 tokens
llama_perf_sampler_print: sampling time = 63.04 ms / 608 runs ( 0.10 ms per token, 9644.82 tokens per second)
llama_perf_context_print: load time = 1566.33 ms
llama_perf_context_print: prompt eval time = 71.50 ms / 57 tokens ( 1.25 ms per token, 797.15 tokens per second)
llama_perf_context_print: eval time = 2338.30 ms / 550 runs ( 4.25 ms per token, 235.21 tokens per second)
llama_perf_context_print: total time = 2652.84 ms / 607 tokens
When using --tensor-split 100,0,0 (84.63 tokens/s):
llama_perf_sampler_print: sampling time = 83.26 ms / 717 runs ( 0.12 ms per token, 8611.47 tokens per second)
llama_perf_context_print: load time = 1685.70 ms
llama_perf_context_print: prompt eval time = 72.50 ms / 57 tokens ( 1.27 ms per token, 786.21 tokens per second)
llama_perf_context_print: eval time = 7786.48 ms / 659 runs ( 11.82 ms per token, 84.63 tokens per second)
llama_perf_context_print: total time = 8171.76 ms / 716 tokens
llama_perf_sampler_print: sampling time = 83.26 ms / 717 runs ( 0.12 ms per token, 8611.47 tokens per second)
llama_perf_context_print: load time = 1685.70 ms
llama_perf_context_print: prompt eval time = 72.50 ms / 57 tokens ( 1.27 ms per token, 786.21 tokens per second)
llama_perf_context_print: eval time = 7786.48 ms / 659 runs ( 11.82 ms per token, 84.63 tokens per second)
llama_perf_context_print: total time = 8171.89 ms / 716 tokens