RFC / help-wanted: Apple M5 GPU Neural Accelerator support for tract-metal (re-vendor MLX's Metal 4 TensorOps GEMM) #2275

czoli1976 · 2026-05-23T17:59:25Z


Summary

On Apple M5, tract-metal's GEMM runs entirely on the legacy simdgroup_matrix path and never touches the per-core Neural Accelerator (the dedicated matrix unit in each M5 GPU core), so it misses the large matmul speedup that hardware provides. I'm raising this as a scoped, help-wanted opportunity rather than a patch: I don't have M5 hardware to implement or benchmark it, and I gather the maintainers may not either. This post flags the gap, sketches the approach, and asks whether a community PR would be welcome from someone running an M5 + macOS 26.2.

This is the GPU Neural Accelerator, distinct from the CPU-side ARMv9.2 SME work in #2230 / #2273.

The gap

All tract-metal GEMM backends (mlx_gemm, ggml_gemm, mfa, basic) use #include <metal_simdgroup_matrix> / simdgroup_matrix<…,8,8> — the Metal 2.3 path (macOS 11+).
The vendored MLX steel GEMM is a frozen shader-source copy, pinned to MLX 02efb310 (Sep 2024) — i.e. pre-Neural-Accelerator.
Each M5 GPU core embeds a matrix unit ("Neural Accelerator"). On M5 the current path leaves it idle; Apple reports up to ~4× faster time-to-first-token (TTFT) for LLM inference when MLX uses it, compared with an M4 baseline.

Why not a hand-written MSL kernel

Apple does not expose the M5 matrix unit to Metal Shading Language. As of Xcode 26.1 the only supported access is the Metal 4 Tensor APIs + Metal Performance Primitives (MPP) framework — so a native .metal kernel can't reach the hardware. The realistic route is a library that already wraps the supported API.

Proposed approach (for a contributor with M5)

MLX added Neural-Accelerator support in 0.30.0 (ml-explore/mlx#2772, macOS 26.2+) via the Metal 4 / MPP path. Since tract-metal already vendors MLX's steel GEMM as shader source, the natural step is to re-vendor the newer MLX Metal-4 TensorOps GEMM as an additional backend selected when compute_capability == 5, keeping the existing path for everything else. This matches the current vendoring pattern rather than introducing a new dependency model.
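To make the gating concrete, here is a minimal Rust sketch of the capability check described above. Everything in it is hypothetical and for illustration only: `GemmBackend`, `DeviceInfo`, `select_gemm_backend`, and the `metal4_shaders_built` flag are names I made up, not tract-metal's actual types, and the real dispatch code may be organised quite differently.

```rust
/// Hypothetical backend identifiers. `MlxGemm` stands in for today's default
/// path; `MlxTensorOps` is the proposed Metal 4 TensorOps backend.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GemmBackend {
    MlxGemm,
    MlxTensorOps,
}

/// Hypothetical device description (not a tract-metal type).
#[derive(Debug, Clone, Copy)]
pub struct DeviceInfo {
    /// Assumed encoding: 5 means an M5-class GPU.
    pub compute_capability: u32,
    /// (major, minor) macOS version of the running system.
    pub macos_version: (u32, u32),
    /// Assumed build-time fact: was the Metal 4 shader set compiled in?
    pub metal4_shaders_built: bool,
}

/// Pick the TensorOps backend only when all three gates pass;
/// otherwise return the existing default untouched.
pub fn select_gemm_backend(dev: &DeviceInfo, default: GemmBackend) -> GemmBackend {
    let is_m5 = dev.compute_capability == 5;
    // Tuples compare lexicographically, so (26, 2) <= (26, 3) and (26, 2) <= (27, 0).
    let os_ok = dev.macos_version >= (26, 2);
    if is_m5 && os_ok && dev.metal4_shaders_built {
        GemmBackend::MlxTensorOps
    } else {
        default
    }
}
```

The point of the shape is that the fallback is the caller-supplied default, so M1–M4 devices, older macOS versions, and builds without the Metal 4 toolchain all keep exactly the behaviour they have today.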

Scope & caveats

M5 + macOS 26.2 only. M1–M4 unaffected; their path is unchanged (and re-vendoring does not help M1–M4 — post-2024 MLX GEMM work is almost all M5/NAX).
Requires a Metal 4 toolchain for that backend, gated so older-Xcode / non-M5 builds and targets are unaffected.
The win is concentrated in large-matmul / prefill (TTFT); decode/GEMV-heavy and sync/cast-bound workloads benefit less.
Adds build/maintenance surface (a Metal-4 shader set + capability-gated dispatch).
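As a rough illustration of why the win is expected in prefill rather than decode, the sketch below computes a first-order arithmetic intensity (FLOPs per byte moved) for a GEMM. It assumes each operand is read or written exactly once and ignores caching and tiling; the function name and the example shapes are mine, not taken from tract or MLX.

```rust
/// First-order arithmetic intensity of C[m,n] = A[m,k] * B[k,n].
/// Assumes A and B are each read once and C is written once.
pub fn gemm_arithmetic_intensity(m: u64, k: u64, n: u64, bytes_per_elem: u64) -> f64 {
    // One multiply and one add per (m, k, n) triple.
    let flops = 2.0 * (m as f64) * (k as f64) * (n as f64);
    // Total elements touched across A, B and C, scaled by element size.
    let bytes = ((m * k + k * n + m * n) * bytes_per_elem) as f64;
    flops / bytes
}
```

With f16 operands (2 bytes) and a hypothetical 4096×4096 weight matrix, a 512-token prefill (`m = 512`) comes out at about 410 FLOPs per byte, while single-token decode (`m = 1`) is just under 1 FLOP per byte. The decode case is dominated by reading the weights, so a faster matrix unit has little to accelerate there.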

Validation (requires M5 hardware)

A contributor on M5 + macOS 26.2 would run cargo test -p tract-metal + test-metal conformance, plus a gpu_gemm_bakeoff-style sustained-GFLOPS comparison (current path vs NAX path) across prefill / decode / lm_head shapes to quantify the win.
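For the sustained-GFLOPS comparison, the arithmetic is simple enough to pin down up front. The helpers below are a hypothetical sketch (the names `sustained_gflops` and `speedup` are mine and do not refer to any existing bake-off code); they only convert a measured wall-clock time into a throughput figure and a ratio, and leave the actual kernel launch and timing to the harness.

```rust
use std::time::Duration;

/// Sustained throughput in GFLOPS for `iters` back-to-back GEMMs of shape
/// (m, k, n), given the total wall-clock time for the whole run.
pub fn sustained_gflops(m: u64, k: u64, n: u64, iters: u32, elapsed: Duration) -> f64 {
    // 2 * m * k * n floating-point operations per GEMM (multiply + add).
    let flops_per_gemm = 2.0 * (m as f64) * (k as f64) * (n as f64);
    flops_per_gemm * f64::from(iters) / elapsed.as_secs_f64() / 1e9
}

/// Ratio of candidate (for example NAX path) to baseline (current path) throughput.
pub fn speedup(baseline_gflops: f64, candidate_gflops: f64) -> f64 {
    candidate_gflops / baseline_gflops
}
```

Reporting the ratio per shape class (prefill, decode, lm_head) rather than as one aggregate number would make it easier to see where the new backend helps and where it does not.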

Open questions

Would tract accept M5 GPU Neural-Accelerator support as a community contribution, gated to M5 + macOS 26.2 (Metal 4)?
If so, preferred approach: re-vendor MLX's Metal-4 TensorOps GEMM (matches today's shader-vendoring pattern) vs. an optional link against MPP?
Anyone on M5 interested in picking up the implementation + benchmarking? I can share the bake-off shape list and dispatch-wiring pointers to lower the activation energy.
