Metal backend is currently slower than CPU-only on all tested models #16

@unamedkr

Description
Summary

While exploring P3 (Metal compute graph for KV attention), we discovered that the existing Metal backend (`TQ_BUILD_METAL=ON`) makes inference 13–40% slower than the CPU-only build on every model size we tested. This applies to both `fp32` and all `turbo_kv_*` paths.

Measurements (3 runs each, Llama 3.2 3B Instruct, PPL eval):

| Build     | KV type       | tok/s |
|-----------|---------------|-------|
| Metal ON  | `fp32`        | 15.07 |
| Metal OFF | `fp32`        | 17.87 |
| Metal ON  | `turbo_kv_4b` | 14.17 |
| Metal OFF | `turbo_kv_4b` | 16.53 |
| Metal ON  | `turbo_kv_5b` | 13.43 |
| Metal OFF | `turbo_kv_5b` | 15.33 |

Across model sizes:

| Model         | Metal-OFF win         |
|---------------|-----------------------|
| SmolLM2 135M  | neutral (within noise)|
| Llama 3.2 1B  | +13–17%               |
| Llama 3.2 3B  | +14–22%               |
| Gemma 4 26B   | +40%                  |

Even on the largest model we tested (Gemma 4 26B at 1.0 tok/s with Metal vs 1.4 tok/s without), Metal is net negative.

Why?

The current Metal path uses per-matmul dispatch with command buffer commit + waitUntilCompleted at flush points. At batch-1 inference, the per-op dispatch overhead exceeds the GPU compute benefit. This is the same dispatch-overhead issue documented in our earlier failed compute-graph experiments.

What's surprising is that even on the very large Gemma 4 26B, Metal still loses. The matmul ops are large enough that GPU compute should win, but the dispatch + sync still dominates.
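A back-of-the-envelope cost model makes the mechanism concrete. All numbers below (op count per token, per-dispatch overhead, per-op compute times) are illustrative assumptions, not measurements from this repo; the point is only that a fixed commit/wait cost per op can flip the winner even when the GPU is faster at the raw compute:

```python
# Toy batch-1 decoding model: each token executes N matmul ops.
# With per-matmul dispatch, every op pays a fixed commit +
# waitUntilCompleted overhead on top of its GPU compute time.
# All constants are assumptions for illustration.

N_OPS = 200               # matmul dispatches per token (assumed)
DISPATCH_OVERHEAD = 2e-4  # seconds of commit/wait cost per op (assumed)
GPU_OP_TIME = 1e-4        # GPU compute time per op (assumed)
CPU_OP_TIME = 2.5e-4      # CPU compute time per op (assumed)

metal_per_token = N_OPS * (DISPATCH_OVERHEAD + GPU_OP_TIME)
cpu_per_token = N_OPS * CPU_OP_TIME

print(f"Metal: {1 / metal_per_token:.1f} tok/s")  # 16.7 tok/s
print(f"CPU:   {1 / cpu_per_token:.1f} tok/s")    # 20.0 tok/s

# The GPU wins per op on compute alone, yet loses per token overall.
assert GPU_OP_TIME < CPU_OP_TIME
assert metal_per_token > cpu_per_token
```

Under these (assumed) numbers, larger matmuls grow `GPU_OP_TIME` and `CPU_OP_TIME` but not `DISPATCH_OVERHEAD`, which is why one would naively expect Metal to win on big models; the Gemma 4 26B result suggests the real overhead term scales worse than this simple model assumes.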

Impact on past benchmarks

All quant.cpp benchmarks published before commit `` (2026-04-08) used `-DTQ_BUILD_METAL=ON` and therefore reported numbers 14–22% lower than what users actually get with the default CMake build. README and CHANGELOG numbers have been updated to reflect the honest CPU-only baseline.

The CMake default is and has been `TQ_BUILD_METAL=OFF`, so end users were always getting the fast path. Only our internal benchmarks were misled.

Action items

  • Document the finding in README + CHANGELOG (this commit)
  • Re-baseline all benchmarks with `TQ_BUILD_METAL=OFF`
  • Investigate the dispatch overhead source — is it the gather/scatter, the wait sync, or the per-encoder begin/end cost?
  • Either fix the Metal path (likely requires fewer dispatches per token, e.g., a single command buffer per layer instead of per matmul) or remove it
  • If fixed, find the model size threshold above which Metal wins; auto-enable based on model size
  • Profile with Xcode Metal Frame Debugger / Instruments
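To see why the "single command buffer per layer" item is attractive, compare dispatch counts per token: batching at layer granularity cuts the number of commit/wait points by the matmuls-per-layer factor. The layer and op counts below are rough assumptions for a ~3B decoder, not values measured in this repo:

```python
# Rough dispatch-count comparison for per-matmul vs per-layer
# command buffers. Counts are illustrative assumptions.

LAYERS = 28             # decoder layers (assumed for a ~3B model)
MATMULS_PER_LAYER = 7   # q/k/v/o projections + MLP matmuls (assumed)

per_matmul_commits = LAYERS * MATMULS_PER_LAYER  # one commit per op
per_layer_commits = LAYERS                       # one commit per layer

print(per_matmul_commits, per_layer_commits)  # 196 28
assert per_matmul_commits // per_layer_commits == MATMULS_PER_LAYER
```

If the fixed overhead is per commit rather than per encoded op, this alone would shrink the overhead term by ~7x under these assumptions; profiling (the Instruments item above) is what would confirm which cost actually dominates.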

Out of scope (won't fix here)

  • Adding new Metal kernels (e.g., for turbo_kv attention) — would compound the problem until the existing dispatch path is fixed
  • Full GPU compute graph (already failed in previous attempts)

How to reproduce

```bash
# CPU-only (fast, default)
cmake -B build_cpu -DTQ_BUILD_METAL=OFF
cmake --build build_cpu -j

# Metal (currently slower)
cmake -B build_metal -DTQ_BUILD_METAL=ON
cmake --build build_metal -j

# Compare
for k in fp32 turbo_kv_4b; do
  for build in build_cpu build_metal; do
    "$build"/quant models/Llama-3.2-3B-Instruct-Q8_0.gguf --ppl bench/data/ppl_1k.txt -j 8 -k "$k"
  done
done
```
