arm64 SDOT int8 matmul kernel (FEAT_DotProd) + PackedI8K4 packing by czoli1976 · Pull Request #2278 · sonos/tract

czoli1976 · 2026-05-24T20:03:28Z

Adds an int8 → i32 matmul kernel using SDOT (FEAT_DotProd, ARMv8.2), roughly 4× as fast as the SMLAL 8x8 kernel at the matmul level, plus the PackedI8K4 (K=4-inner) packing it consumes and the matmul/conv lowering that routes to it.

Stacked on #2277. The first commit here is #2277 (the int8 matmul dispatch fix). Without it, 2D int8 matmuls select the 64x1 GEMV kernel and this SDOT kernel is never chosen. Please review/merge #2277 first; this rebases cleanly once it lands.

What's here

arm64simd_mmm_i32_8x8_dot — SDOT i8→i32 8×8 kernel, gated on has_dotprod() (TRACT_DOTPROD_DISABLE=1 forces the SMLAL fallback for A/B). Same v16..v31 tile layout as the SMLAL 8x8, so it reuses all the i32 fuse/store/q_scale machinery.
PackedI8K4 — K=4-inner packing (out[(k/4)·r·4 + m·4 + (k%4)], k_alignment=4); one-pass KOut4Writer + a cache-friendly pack_view.
Lowering — OptMatMulPack / einsum / conv-im2col generalized from PackedFormat to Box<dyn MMMInputFormat> so PackedI8K4 flows through matmul and conv; QSumB reads the K=4 layout. PackedFormat paths stay byte-identical (monomorphic, no dyn on the hot path).
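
To make the K=4-inner index formula above concrete, here is a minimal sketch of the layout for a single panel of `r` rows. The function name `pack_k4` and its signature are hypothetical and are not tract's API; the sketch also assumes K is zero-padded up to the next multiple of 4, which is one plausible reading of `k_alignment=4`.

```rust
/// Pack a row-major `r x k` i8 matrix into a K=4-inner layout:
/// out[(k/4)*r*4 + m*4 + (k%4)] = src[m][k].
/// Hypothetical helper for illustration only (not tract's API).
/// Assumption: one panel of exactly `r` rows, K zero-padded to a multiple of 4.
fn pack_k4(src: &[i8], r: usize, k: usize) -> Vec<i8> {
    assert_eq!(src.len(), r * k, "src must hold r*k elements");
    let k_padded = (k + 3) / 4 * 4; // round K up to the 4-element alignment
    let mut out = vec![0i8; r * k_padded];
    for m in 0..r {
        for kk in 0..k {
            // Four consecutive K values of the same row end up adjacent,
            // which is the operand shape a 4-way dot product consumes.
            out[(kk / 4) * r * 4 + m * 4 + (kk % 4)] = src[m * k + kk];
        }
    }
    out
}
```

With `r = 2` and `k = 5`, the rows `[1, 2, 3, 4, 5]` and `[6, 7, 8, 9, 10]` pack to `[1, 2, 3, 4, 6, 7, 8, 9, 5, 0, 0, 0, 10, 0, 0, 0]`: each group of four K values is stored row by row, and the tail group is padded with zeros.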

Validation

Bit-exact vs the SMLAL kernel (concrete + dynamic-shape int8 models, M1 + M4).
cargo test -p tract-core: 244/244 · cargo test -p tract-linalg: 3817/3817 · cargo fmt --check clean · no new clippy warnings.
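
The bit-exactness claim is what integer arithmetic predicts: i32 accumulation is associative, so summing four products per step gives the same result as summing one product per step. The scalar model below is a sketch of that argument, not the kernel's code. `sdot_lane`, `dot_sdot` and `dot_ref` are hypothetical names, and `dot_ref` only stands in for a one-product-per-step widening accumulate, not for the actual SMLAL kernel.

```rust
/// Scalar model of one SDOT lane: four i8*i8 products, widened to i32,
/// added to the accumulator. Hypothetical name, for illustration only.
fn sdot_lane(acc: i32, a: [i8; 4], b: [i8; 4]) -> i32 {
    let mut sum = acc;
    for i in 0..4 {
        sum = sum.wrapping_add(a[i] as i32 * b[i] as i32);
    }
    sum
}

/// Dot product consuming K in groups of 4, as a K=4-inner layout allows.
fn dot_sdot(a: &[i8], b: &[i8]) -> i32 {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len() % 4, 0, "K must be a multiple of 4");
    let mut acc = 0i32;
    for (ca, cb) in a.chunks_exact(4).zip(b.chunks_exact(4)) {
        acc = sdot_lane(acc, [ca[0], ca[1], ca[2], ca[3]], [cb[0], cb[1], cb[2], cb[3]]);
    }
    acc
}

/// Reference: one widening multiply-accumulate per K step.
fn dot_ref(a: &[i8], b: &[i8]) -> i32 {
    a.iter()
        .zip(b)
        .fold(0i32, |acc, (&x, &y)| acc.wrapping_add(x as i32 * y as i32))
}
```

For any pair of equal-length inputs whose length is a multiple of 4, `dot_sdot` and `dot_ref` return the same i32, including at the extremes of the i8 range.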

Performance (Apple M4, e2e)

On top of #2277 (which selects the SMLAL 8x8), SDOT adds:

model	SMLAL `8x8`	SDOT `8x8_dot`	speedup
all-MiniLM-L6-v2 (int8, seq=128)	44.4 ms	24.8 ms	1.79×
InceptionV1 (int8)	51.6 ms	28.4 ms	1.82×

Combined with #2277 (vs the pre-fix 64x1 GEMV): MiniLM 50.5→24.8 ms (2.04×), InceptionV1 53.6→28.4 ms (1.89×).

(UDOT skipped: tract normalises activations to i8, so the u8 path is redundant.)

kali · 2026-05-26T06:29:22Z

Converting to draft for housekeeping while we work on #2277 first. Just trying to limit confusion in triage, no bearing on the content.

kali · 2026-05-26T12:12:39Z

Mind giving me a clean rebase now that #2277 is merged?

czoli1976 · 2026-05-26T12:19:34Z

ok ... wait for it

kali · 2026-05-26T12:28:28Z

FYI, GHA is broken. It's gonna be a complicated afternoon.

arm64simd_mmm_i32_8x8_dot: an int8->i32 8x8 matmul kernel using SDOT (FEAT_DotProd, ARMv8.2), ~4x the SMLAL 8x8 at the matmul level. Same v16..v31 tile layout as the SMLAL 8x8, so it reuses the existing i32 fuse/store/q_scale machinery, and consumes the K=4-inner PackedI8K4 packing now upstream (sonos#2281).

- Gated on has_dotprod() (Apple M1+/A11+; Linux HWCAP_ASIMDDP). TRACT_DOTPROD_DISABLE=1 forces the SMLAL 8x8 fallback so callers can A/B on one binary.
- Wired into qmmm_i32: int8 matmul/conv pick SDOT when FEAT_DotProd is present, SMLAL 8x8 otherwise. Relies on the merged dispatch fix (sonos#2277) to route 2D int8 matmuls to a matrix kernel instead of the 64x1 GEMV.
- Adds linalg/benches/qmmm_i8.rs (SDOT vs SMLAL microbench).

Bit-exact vs the SMLAL kernel: linalg 114/114 (i8i8 + i32i32 fuse/frame + q_scale), core int8 matmul 25/25.

Apple M4 e2e (kernel unchanged from the original PR): MiniLM 44.4->24.8 ms (1.79x), InceptionV1 51.6->28.4 ms (1.82x).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

czoli1976 · 2026-05-26T12:32:44Z

Done — rebased onto current main (post-#2277).

Heads-up on why the diff shrank: PackedI8K4, the K-outer writer, the dyn-packing lowering (OptMatMulPack/einsum/conv-im2col + QSumB) and the packed_packed k_alignment fix all landed via #2281, so this PR now reduces to just the arm64 SDOT kernel:

arm64simd_mmm_i32_8x8_dot (asm + registration), gated on has_dotprod() and wired into qmmm_i32 (SMLAL 8x8 fallback when FEAT_DotProd is absent / TRACT_DOTPROD_DISABLE=1)
linalg/benches/qmmm_i8.rs microbench
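
The gating described in the list above could be sketched as follows. This is an illustration, not the code in the PR: `I8Kernel`, `select_kernel`, `detect_dotprod` and `select_kernel_from_env` are hypothetical names, and treating only the exact value `1` as "disabled" is an assumption about how `TRACT_DOTPROD_DISABLE` is parsed. Feature detection uses the standard library's `is_aarch64_feature_detected!("dotprod")`, which may differ from how `has_dotprod()` is implemented.

```rust
/// Which int8 kernel to run. Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Clone, Copy)]
enum I8Kernel {
    Sdot8x8,
    Smlal8x8,
}

/// Pure selection logic, kept separate from detection so it can be
/// exercised on any host. Assumption: only the value "1" disables SDOT.
fn select_kernel(has_dotprod: bool, disable_env: Option<&str>) -> I8Kernel {
    let disabled = disable_env == Some("1");
    if has_dotprod && !disabled {
        I8Kernel::Sdot8x8
    } else {
        I8Kernel::Smlal8x8
    }
}

/// Runtime detection of FEAT_DotProd via the standard library.
#[cfg(target_arch = "aarch64")]
fn detect_dotprod() -> bool {
    std::arch::is_aarch64_feature_detected!("dotprod")
}

/// On other architectures the SDOT kernel is never available.
#[cfg(not(target_arch = "aarch64"))]
fn detect_dotprod() -> bool {
    false
}

/// Combine detection with the environment override.
fn select_kernel_from_env() -> I8Kernel {
    let env = std::env::var("TRACT_DOTPROD_DISABLE").ok();
    select_kernel(detect_dotprod(), env.as_deref())
}
```

Keeping the decision in a pure function means the fallback path can be tested without SDOT hardware, which matches the A/B use of the environment variable on a single binary.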

One commit, +346/−1; it consumes the now-upstream PackedI8K4.

Revalidated on the new base (M1): linalg 114/114 (i8i8 + i32i32 fuse/frame + q_scale), core int8 matmul 25/25, fmt + clippy clean, bit-exact vs the SMLAL 8x8. M4 e2e is unchanged (kernel identical): MiniLM 44.4→24.8 ms (1.79×), InceptionV1 51.6→28.4 ms (1.82×).

czoli1976 · 2026-05-26T12:37:32Z

Well, go out for an ice cream :)

This was referenced May 24, 2026

Fix int8 matmul kernel selection picking the 64x1 GEMV for 2D matmuls #2277

Merged

arm64 SME2 SMOPA int8 matmul kernel (sme_qmmm_i32_32x32) #2279

Draft

kali marked this pull request as draft May 26, 2026 06:28

kali marked this pull request as ready for review May 26, 2026 12:12

czoli1976 force-pushed the feat/int8-sdot-kernel branch from ca21372 to 632cf39 Compare May 26, 2026 12:30

czoli1976 wants to merge 1 commit into sonos:main from czoli1976:feat/int8-sdot-kernel

2 participants
