Metal backend is currently slower than CPU-only on all tested models #16

@unamedkr

Description
Summary

While exploring P3 (Metal compute graph for KV attention), we discovered that the existing Metal backend (`TQ_BUILD_METAL=ON`) makes inference 13–40% slower than the CPU-only build on every model size we tested. This applies to both `fp32` and all `turbo_kv_*` paths.

Measurements (3 runs each, Llama 3.2 3B Instruct, PPL eval):

| Build     | KV type       | tok/s |
|-----------|---------------|-------|
| Metal ON  | `fp32`        | 15.07 |
| Metal OFF | `fp32`        | 17.87 |
| Metal ON  | `turbo_kv_4b` | 14.17 |
| Metal OFF | `turbo_kv_4b` | 16.53 |
| Metal ON  | `turbo_kv_5b` | 13.43 |
| Metal OFF | `turbo_kv_5b` | 15.33 |

Across model sizes:

| Model         | Metal-OFF win         |
|---------------|-----------------------|
| SmolLM2 135M  | neutral (within noise)|
| Llama 3.2 1B  | +13–17%               |
| Llama 3.2 3B  | +14–22%               |
| Gemma 4 26B   | +40%                  |

Even on the largest model we tested (Gemma 4 26B at 1.0 tok/s with Metal vs 1.4 tok/s without), Metal is net negative.

Why?

The current Metal path uses per-matmul dispatch with command buffer commit + waitUntilCompleted at flush points. At batch-1 inference, the per-op dispatch overhead exceeds the GPU compute benefit. This is the same dispatch-overhead issue documented in our earlier failed compute-graph experiments.

What's surprising is that even on the very large Gemma 4 26B, Metal still loses. The matmul ops are large enough that GPU compute should win, but the dispatch + sync still dominates.
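A back-of-the-envelope cost model makes the mechanism concrete. All numbers below (op count per token, per-dispatch overhead, per-op compute times) are illustrative assumptions, not measurements from this repo; the point is only that a fixed commit/wait cost per op can flip the winner even when the GPU is faster at the raw compute:

```python
# Toy batch-1 decoding model: each token executes N matmul ops.
# With per-matmul dispatch, every op pays a fixed commit +
# waitUntilCompleted overhead on top of its GPU compute time.
# All constants are assumptions for illustration.

N_OPS = 200               # matmul dispatches per token (assumed)
DISPATCH_OVERHEAD = 2e-4  # seconds of commit/wait cost per op (assumed)
GPU_OP_TIME = 1e-4        # GPU compute time per op (assumed)
CPU_OP_TIME = 2.5e-4      # CPU compute time per op (assumed)

metal_per_token = N_OPS * (DISPATCH_OVERHEAD + GPU_OP_TIME)
cpu_per_token = N_OPS * CPU_OP_TIME

print(f"Metal: {1 / metal_per_token:.1f} tok/s")  # 16.7 tok/s
print(f"CPU:   {1 / cpu_per_token:.1f} tok/s")    # 20.0 tok/s

# The GPU wins per op on compute alone, yet loses per token overall.
assert GPU_OP_TIME < CPU_OP_TIME
assert metal_per_token > cpu_per_token
```

Under these (assumed) numbers, larger matmuls grow `GPU_OP_TIME` and `CPU_OP_TIME` but not `DISPATCH_OVERHEAD`, which is why one would naively expect Metal to win on big models; the Gemma 4 26B result suggests the real overhead term scales worse than this simple model assumes.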

Impact on past benchmarks

All quant.cpp benchmarks published before commit `` (2026-04-08) used `-DTQ_BUILD_METAL=ON` and therefore reported numbers 14–22% lower than what users actually get with the default CMake build. README and CHANGELOG numbers have been updated to reflect the honest CPU-only baseline.

The CMake default is and has been `TQ_BUILD_METAL=OFF`, so end users were always getting the fast path. Only our internal benchmarks were misled.

Action items

  • Document the finding in README + CHANGELOG (this commit)
  • Re-baseline all benchmarks with `TQ_BUILD_METAL=OFF`
  • Investigate the dispatch overhead source — is it the gather/scatter, the wait sync, or the per-encoder begin/end cost?
  • Either fix the Metal path (likely requires fewer dispatches per token, e.g., a single command buffer per layer instead of per matmul) or remove it
  • If fixed, find the model size threshold above which Metal wins; auto-enable based on model size
  • Profile with Xcode Metal Frame Debugger / Instruments
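To see why the "single command buffer per layer" item is attractive, compare dispatch counts per token: batching at layer granularity cuts the number of commit/wait points by the matmuls-per-layer factor. The layer and op counts below are rough assumptions for a ~3B decoder, not values measured in this repo:

```python
# Rough dispatch-count comparison for per-matmul vs per-layer
# command buffers. Counts are illustrative assumptions.

LAYERS = 28             # decoder layers (assumed for a ~3B model)
MATMULS_PER_LAYER = 7   # q/k/v/o projections + MLP matmuls (assumed)

per_matmul_commits = LAYERS * MATMULS_PER_LAYER  # one commit per op
per_layer_commits = LAYERS                       # one commit per layer

print(per_matmul_commits, per_layer_commits)  # 196 28
assert per_matmul_commits // per_layer_commits == MATMULS_PER_LAYER
```

If the fixed overhead is per commit rather than per encoded op, this alone would shrink the overhead term by ~7x under these assumptions; profiling (the Instruments item above) is what would confirm which cost actually dominates.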

Out of scope (won't fix here)

  • Adding new Metal kernels (e.g., for turbo_kv attention) — would compound the problem until the existing dispatch path is fixed
  • Full GPU compute graph (already failed in previous attempts)

How to reproduce

```bash
# CPU-only (fast, default)
cmake -B build_cpu -DTQ_BUILD_METAL=OFF
cmake --build build_cpu -j

# Metal (currently slower)
cmake -B build_metal -DTQ_BUILD_METAL=ON
cmake --build build_metal -j

# Compare
for k in fp32 turbo_kv_4b; do
  for build in build_cpu build_metal; do
    "$build"/quant models/Llama-3.2-3B-Instruct-Q8_0.gguf --ppl bench/data/ppl_1k.txt -j 8 -k "$k"
  done
done
```
