On Apple M5, tract-metal's GEMM runs entirely on the legacy simdgroup_matrix path and never touches the M5 GPU's per-core Neural Accelerator (its dedicated matrix unit), missing the large matmul speedup that hardware provides. I'm raising this as a scoped, help-wanted opportunity rather than a patch: I don't have M5 hardware to implement or benchmark it, and I gather the maintainers may not either — so this flags the gap, sketches the approach, and asks whether a community PR would be welcome from someone running an M5 + macOS 26.2.
This is the GPU Neural Accelerator, distinct from the CPU-side ARMv9.2 SME work in #2230 / #2273.
The gap
All tract-metal GEMM backends (mlx_gemm, ggml_gemm, mfa, basic) use #include <metal_simdgroup_matrix> / simdgroup_matrix<…,8,8> — the Metal 2.3 path (macOS 11+).
The vendored MLX steel GEMM is a frozen shader-source copy, pinned to MLX 02efb310 (Sep 2024) — i.e. pre-Neural-Accelerator.
Each M5 GPU core embeds a matrix unit ("Neural Accelerator"). On M5 the current path leaves it idle; Apple reports up to ~4× faster time-to-first-token (TTFT) for LLM inference when MLX uses it, measured against an M4 baseline.
Why not a hand-written MSL kernel
Apple does not expose the M5 matrix unit to Metal Shading Language. As of Xcode 26.1 the only supported access is the Metal 4 Tensor APIs + Metal Performance Primitives (MPP) framework — so a native .metal kernel can't reach the hardware. The realistic route is a library that already wraps the supported API.
Proposed approach (for a contributor with M5)
MLX added Neural-Accelerator support in 0.30.0 (ml-explore/mlx#2772, macOS 26.2+) via the Metal 4 / MPP path. Since tract-metal already vendors MLX's steel GEMM as shader source, the natural step is to re-vendor the newer MLX Metal-4 TensorOps GEMM as an additional backend selected when compute_capability == 5, keeping the existing path for everything else. This matches the current vendoring pattern rather than introducing a new dependency model.
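The capability-gated selection described above could look roughly like the following. This is a minimal sketch, not tract-metal's actual dispatch code: `GemmBackend`, `DeviceInfo`, `select_gemm_backend` and the `metal4_shaders_built` flag are hypothetical names introduced for illustration, and the assumption that `compute_capability == 5` identifies an M5 is taken from the proposal itself.

```rust
/// Hypothetical backend list. The first four names mirror the backends named
/// in this issue; `MlxNax` stands for the proposed re-vendored Metal-4 GEMM.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GemmBackend {
    Basic,
    Mfa,
    GgmlGemm,
    MlxGemm,
    MlxNax,
}

/// Hypothetical device description; not tract's real API.
#[derive(Debug, Clone, Copy)]
pub struct DeviceInfo {
    /// Assumed to be 5 on M5, as in the proposal.
    pub compute_capability: u32,
    /// (major, minor) of the running macOS.
    pub macos_version: (u32, u32),
    /// Whether the Metal-4 shader set was compiled into this build.
    pub metal4_shaders_built: bool,
}

/// Pick the Neural-Accelerator backend only when every gate passes;
/// otherwise keep whatever backend would have been used before.
pub fn select_gemm_backend(dev: &DeviceInfo, default: GemmBackend) -> GemmBackend {
    // Tuples compare lexicographically, so (26, 2) <= (26, 3) <= (27, 0).
    let os_ok = dev.macos_version >= (26, 2);
    if dev.compute_capability == 5 && os_ok && dev.metal4_shaders_built {
        GemmBackend::MlxNax
    } else {
        default
    }
}
```

Keeping the fallback as an explicit `default` argument means M1–M4, older macOS versions, and builds without the Metal-4 shaders all follow exactly the path they follow today.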
Scope & caveats
M5 + macOS 26.2 only. M1–M4 are unaffected and their path is unchanged. Re-vendoring would not help M1–M4 either, since MLX GEMM work after the pinned 2024 snapshot is almost all specific to the M5 Neural Accelerator (abbreviated NAX below).
Requires a Metal 4 toolchain for that backend, gated so older-Xcode / non-M5 builds and targets are unaffected.
The win is concentrated in large-matmul / prefill (TTFT); decode/GEMV-heavy and sync/cast-bound workloads benefit less.
Adds build/maintenance surface (a Metal-4 shader set + capability-gated dispatch).
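One way to see why the win concentrates in prefill is arithmetic intensity (FLOPs per byte moved). The sketch below assumes a plain M×K by K×N GEMM where each operand and the output cross memory exactly once, with 2-byte (f16) elements; the function names and the shapes in the usage note are illustrative, not taken from tract or from any benchmark.

```rust
/// Floating-point operations for C[M,N] = A[M,K] * B[K,N]
/// (one multiply and one add per inner-product term).
pub fn gemm_flops(m: u64, n: u64, k: u64) -> u64 {
    2 * m * n * k
}

/// Bytes moved, assuming A, B and C each cross memory exactly once.
/// Real kernels may move more; this is a lower bound used for illustration.
pub fn gemm_bytes(m: u64, n: u64, k: u64, elem_size: u64) -> u64 {
    elem_size * (m * k + k * n + m * n)
}

/// Arithmetic intensity in FLOPs per byte under the assumptions above.
pub fn arithmetic_intensity(m: u64, n: u64, k: u64, elem_size: u64) -> f64 {
    gemm_flops(m, n, k) as f64 / gemm_bytes(m, n, k, elem_size) as f64
}
```

For a hypothetical 4096×4096 f16 weight matrix, `arithmetic_intensity(512, 4096, 4096, 2)` (a prefill-like shape with 512 tokens) is in the hundreds of FLOPs per byte, while `arithmetic_intensity(1, 4096, 4096, 2)` (a decode-like GEMV) is about 1. A faster matrix unit mainly helps the former; the latter tends to be limited by memory bandwidth.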
Validation (requires M5 hardware)
A contributor on M5 + macOS 26.2 would run cargo test -p tract-metal + test-metal conformance, plus a gpu_gemm_bakeoff-style sustained-GFLOPS comparison (current path vs NAX path) across prefill / decode / lm_head shapes to quantify the win.
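A sustained-GFLOPS comparison of that kind reduces to timing repeated runs of each path on the same shape and converting to throughput. The helper below is a sketch under stated assumptions: `time_iters`, `gflops` and `speedup` are hypothetical names, the closure stands in for whatever actually dispatches and synchronizes the GEMM, and nothing here reflects how `gpu_gemm_bakeoff` is implemented.

```rust
use std::time::{Duration, Instant};

/// Average wall-clock time per call over `iters` calls of `run`.
/// `run` is expected to block until the GPU work has completed.
pub fn time_iters<F: FnMut()>(iters: u32, mut run: F) -> Duration {
    assert!(iters > 0, "need at least one iteration");
    let start = Instant::now();
    for _ in 0..iters {
        run();
    }
    start.elapsed() / iters
}

/// Throughput in GFLOPS for one M x K by K x N GEMM taking `elapsed`.
/// A zero `elapsed` yields infinity, so callers should time real work.
pub fn gflops(m: u64, n: u64, k: u64, elapsed: Duration) -> f64 {
    let flops = 2.0 * m as f64 * n as f64 * k as f64;
    flops / elapsed.as_secs_f64() / 1e9
}

/// Ratio of candidate throughput to baseline throughput.
pub fn speedup(baseline_gflops: f64, candidate_gflops: f64) -> f64 {
    candidate_gflops / baseline_gflops
}
```

Running this once per shape (prefill, decode, lm_head) for the current path and once for the NAX path, then reporting `speedup` per shape, would give the per-workload picture the validation step asks for. A few warm-up calls before timing are advisable so that pipeline compilation is not counted.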
Open questions
Would tract accept M5 GPU Neural-Accelerator support as a community contribution, gated to M5 + macOS 26.2 (Metal 4)?
If so, preferred approach: re-vendor MLX's Metal-4 TensorOps GEMM (matches today's shader-vendoring pattern) vs. an optional link against MPP?
Anyone on M5 interested in picking up the implementation + benchmarking? I can share the bake-off shape list and dispatch-wiring pointers to lower the activation energy.