arm64 SDOT int8 matmul kernel (FEAT_DotProd) + PackedI8K4 packing #2278
czoli1976 wants to merge 1 commit into
Conversation
|
Converting to draft for housekeeping while we work on #2777 first. Just trying to limit confusion in triage, no bearing on the content. |
|
Mind giving me a clean rebase now that #2777 is merged? |
|
ok ... wait for it |
|
FYI, GHA is broken. It's gonna be a complicated afternoon. |
arm64simd_mmm_i32_8x8_dot: an int8->i32 8x8 matmul kernel using SDOT (FEAT_DotProd, ARMv8.2), ~4x the SMLAL 8x8 at the matmul level. Same v16..v31 tile layout as the SMLAL 8x8, so it reuses the existing i32 fuse/store/q_scale machinery, and consumes the K=4-inner PackedI8K4 packing now upstream (sonos#2281).

- Gated on has_dotprod() (Apple M1+/A11+; Linux HWCAP_ASIMDDP). TRACT_DOTPROD_DISABLE=1 forces the SMLAL 8x8 fallback so callers can A/B on one binary.
- Wired into qmmm_i32: int8 matmul/conv pick SDOT when FEAT_DotProd is present, SMLAL 8x8 otherwise. Relies on the merged dispatch fix (sonos#2277) to route 2D int8 matmuls to a matrix kernel instead of the 64x1 GEMV.
- Adds linalg/benches/qmmm_i8.rs (SDOT vs SMLAL microbench).

Bit-exact vs the SMLAL kernel: linalg 114/114 (i8i8 + i32i32 fuse/frame + q_scale), core int8 matmul 25/25.

Apple M4 e2e (kernel unchanged from the original PR): MiniLM 44.4->24.8 ms (1.79x), InceptionV1 51.6->28.4 ms (1.82x).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
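For readers following the gating described in the commit message, here is a minimal sketch of how such a selection could look. It is not tract's code: `I8Kernel`, `select_kernel`, `disable_requested`, `override_set`, and `detect_dotprod` are hypothetical names, and treating only the exact value `1` as "disable" is an assumption. The only real API used is `std::arch::is_aarch64_feature_detected!`.

```rust
use std::env;

/// Illustrative kernel choice; these are not tract's types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum I8Kernel {
    Smlal8x8,
    Sdot8x8,
}

/// Pure selection logic: SDOT only when the CPU reports FEAT_DotProd
/// and the override has not been requested.
fn select_kernel(cpu_has_dotprod: bool, disabled: bool) -> I8Kernel {
    if cpu_has_dotprod && !disabled {
        I8Kernel::Sdot8x8
    } else {
        I8Kernel::Smlal8x8
    }
}

/// Assumption: only the exact value "1" counts as a disable request.
fn disable_requested(value: Option<&str>) -> bool {
    value == Some("1")
}

/// Reads the environment variable named in the commit message.
fn override_set() -> bool {
    disable_requested(env::var("TRACT_DOTPROD_DISABLE").ok().as_deref())
}

/// Runtime feature detection on aarch64.
#[cfg(target_arch = "aarch64")]
fn detect_dotprod() -> bool {
    std::arch::is_aarch64_feature_detected!("dotprod")
}

/// Other architectures never have FEAT_DotProd.
#[cfg(not(target_arch = "aarch64"))]
fn detect_dotprod() -> bool {
    false
}

/// Convenience wrapper combining detection and the override.
fn pick_kernel() -> I8Kernel {
    select_kernel(detect_dotprod(), override_set())
}
```

Keeping `select_kernel` free of environment and CPU queries makes both branches testable on any host, which is the same property the override gives callers who want to A/B on one binary.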
ca21372 to 632cf39
|
Done — rebased onto current main (post-#2277). Heads-up on why the diff shrank: PackedI8K4, the K-outer writer, the dyn-packing lowering (
One commit, +346/−1; it consumes the now-upstream
Revalidated on the new base (M1): linalg 114/114 (i8i8 + i32i32 fuse/frame + q_scale), core int8 matmul 25/25, fmt + clippy clean, bit-exact vs the SMLAL 8x8. M4 e2e is unchanged (kernel identical): MiniLM 44.4→24.8 ms (1.79×), InceptionV1 51.6→28.4 ms (1.82×). |
|
Well, go out for an ice cream :) |
Adds an int8 → i32 matmul kernel using SDOT (FEAT_DotProd, ARMv8.2), ~4× the SMLAL 8x8 kernel at the matmul level, plus the PackedI8K4 (K=4-inner) packing it consumes and the matmul/conv lowering to route it.

What's here
- arm64simd_mmm_i32_8x8_dot: SDOT i8→i32 8×8 kernel, gated on has_dotprod() (TRACT_DOTPROD_DISABLE=1 forces the SMLAL fallback for A/B). Same v16..v31 tile layout as the SMLAL 8x8, so it reuses all the i32 fuse/store/q_scale machinery.
- PackedI8K4: K=4-inner packing (out[(k/4)·r·4 + m·4 + (k%4)], k_alignment=4); one-pass KOut4Writer + a cache-friendly pack_view.
- OptMatMulPack / einsum / conv-im2col generalized from PackedFormat to Box<dyn MMMInputFormat> so PackedI8K4 flows through matmul and conv; QSumB reads the K=4 layout.
- PackedFormat paths stay byte-identical (monomorphic, no dyn on the hot path).

Validation
cargo test -p tract-core: 244/244 · cargo test -p tract-linalg: 3817/3817 · cargo fmt --check clean · no new clippy warnings.

Performance (Apple M4, e2e)
On top of #2277 (which selects the SMLAL 8x8), SDOT adds:

| Model | 8x8 | 8x8_dot |
|---|---|---|
| MiniLM | 44.4 ms | 24.8 ms (1.79×) |
| InceptionV1 | 51.6 ms | 28.4 ms (1.82×) |

Combined with #2277 (vs the pre-fix 64x1 GEMV): MiniLM 50.5→24.8 ms (2.04×), InceptionV1 53.6→28.4 ms (1.89×).

(UDOT skipped: tract normalises activations to i8, so the u8 path is redundant.)
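To make the K=4-inner index formula from the description concrete, here is a scalar sketch. It is not the PR's `KOut4Writer` or the assembly kernel: `pack_k4`, `sdot_lane`, and `matmul_packed` are hypothetical helpers, the input is assumed row-major, and zero-filling the padded K tail is an assumption (the description only states `k_alignment=4`). `sdot_lane` models what one SDOT lane computes: four i8×i8 products added into an i32 accumulator.

```rust
/// Pack an r x k row-major i8 panel into the K=4-inner layout:
/// out[(k/4)*r*4 + m*4 + (k%4)], with K rounded up to a multiple of 4.
/// Assumption: the padded tail is zero-filled.
fn pack_k4(a: &[i8], r: usize, k: usize) -> Vec<i8> {
    assert_eq!(a.len(), r * k);
    let k_padded = (k + 3) / 4 * 4;
    let mut out = vec![0i8; r * k_padded];
    for m in 0..r {
        for kk in 0..k {
            out[(kk / 4) * r * 4 + m * 4 + (kk % 4)] = a[m * k + kk];
        }
    }
    out
}

/// Scalar model of one SDOT lane: acc += sum of four i8*i8 products.
fn sdot_lane(acc: i32, a: &[i8], b: &[i8]) -> i32 {
    a.iter()
        .zip(b)
        .take(4)
        .fold(acc, |s, (&x, &y)| s + x as i32 * y as i32)
}

/// Reference r x n product of two K4-packed panels.
/// `pa` holds r rows of A, `pb` holds n columns of B, both packed by pack_k4.
fn matmul_packed(pa: &[i8], pb: &[i8], r: usize, n: usize, k_padded: usize) -> Vec<i32> {
    let mut c = vec![0i32; r * n];
    for kb in 0..k_padded / 4 {
        for m in 0..r {
            // The four K-consecutive values of row m are contiguous.
            let a4 = &pa[kb * r * 4 + m * 4..][..4];
            for j in 0..n {
                let b4 = &pb[kb * n * 4 + j * 4..][..4];
                c[m * n + j] = sdot_lane(c[m * n + j], a4, b4);
            }
        }
    }
    c
}
```

The point of the layout is visible in `matmul_packed`: each group of four K-consecutive values for a given row or column sits in adjacent bytes, so one 4-byte group maps onto one SDOT lane without any shuffling.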