Metal backend is currently slower than CPU-only on all tested models #16
Summary
While exploring P3 (Metal compute graph for KV attention), we discovered that the existing Metal backend (`TQ_BUILD_METAL=ON`) makes inference 13–40% slower than the CPU-only build on every model size we tested. This applies to both `fp32` and all `turbo_kv_*` paths.
Measurements (3 runs each, Llama 3.2 3B Instruct, PPL eval)
| Build | KV type | tok/s |
|---|---|---|
| Metal ON | fp32 | 15.07 |
| Metal OFF | fp32 | 17.87 |
| Metal ON | turbo_kv_4b | 14.17 |
| Metal OFF | turbo_kv_4b | 16.53 |
| Metal ON | turbo_kv_5b | 13.43 |
| Metal OFF | turbo_kv_5b | 15.33 |
Across model sizes:
| Model | Speedup from Metal OFF |
|---|---|
| SmolLM2 135M | neutral (within noise) |
| Llama 3.2 1B | +13–17% |
| Llama 3.2 3B | +14–22% |
| Gemma 4 26B | +40% |
Even on the largest model we tested (Gemma 4 26B at 1.0 tok/s with Metal vs 1.4 tok/s without), Metal is net negative.
Why?
The current Metal path uses per-matmul dispatch with command buffer commit + waitUntilCompleted at flush points. At batch-1 inference, the per-op dispatch overhead exceeds the GPU compute benefit. This is the same dispatch-overhead issue documented in our earlier failed compute-graph experiments.
What's surprising is that even on the very large Gemma 4 26B, Metal still loses. The matmul ops are large enough that GPU compute should win, but the dispatch + sync still dominates.
Impact on past benchmarks
All quant.cpp benchmarks published before commit `` (2026-04-08) were run with `-DTQ_BUILD_METAL=ON` and therefore report numbers 14–22% lower than what users actually get with the default CMake build. README and CHANGELOG numbers have been updated to reflect the honest CPU-only baseline.
The CMake default is, and always has been, `TQ_BUILD_METAL=OFF`, so end users were always getting the fast path; only our published internal benchmark numbers were affected.
Action items
- Document the finding in README + CHANGELOG (this commit)
- Re-baseline all benchmarks with `TQ_BUILD_METAL=OFF`
- Investigate the dispatch overhead source — is it the gather/scatter, the wait sync, or the per-encoder begin/end cost?
- Either fix the Metal path (likely requires fewer dispatches per token, e.g., a single command buffer per layer instead of per matmul) or remove it
- If fixed, find the model size threshold above which Metal wins; auto-enable based on model size
- Profile with Xcode Metal Frame Debugger / Instruments
Out of scope (won't fix here)
- Adding new Metal kernels (e.g., for turbo_kv attention) — would compound the problem until the existing dispatch path is fixed
- Full GPU compute graph (already failed in previous attempts)
How to reproduce
```bash
# CPU-only (fast, default)
cmake -B build_cpu -DTQ_BUILD_METAL=OFF
cmake --build build_cpu -j

# Metal (currently slower)
cmake -B build_metal -DTQ_BUILD_METAL=ON
cmake --build build_metal -j

# Compare
for k in fp32 turbo_kv_4b; do
  for build in build_cpu build_metal; do
    $build/quant models/Llama-3.2-3B-Instruct-Q8_0.gguf --ppl bench/data/ppl_1k.txt -j 8 -k $k
  done
done
```