linalg/x86_64: add AVX-512 VNNI int8 GEMM kernel (avx512vnni_mmm_i32_8x8)#2303

Merged
kali merged 1 commit into sonos:main from czoli1976:feat/avx512-vnni-int8-gemm
Jun 1, 2026

@czoli1976
Contributor

@czoli1976 czoli1976 commented May 28, 2026

Summary

  • New avx512vnni_mmm_i32_8x8 routes qmmm_i32 through vpdpbusd when AVX-512 VNNI is available, replacing the AVX2 per-K widening-multiply inner loop.
  • Reuses the K=4-inner PackedI8K4 layout; A is offset by +128 for vpdpbusd's u8*s8 form, and the 128*sum_k(B) bias is subtracted per output column, so the i32 accumulators stay bit-identical to the AVX2 path and the whole quantization epilogue is reused unchanged.
  • Runtime-gated via where(AVX512VNNI); non-VNNI x86 keeps the AVX2 fallback.
  • Adds a vnni_i32 Criterion microbench comparing VNNI vs AVX2.

Tile geometry is unchanged from avx2_mmm_i32_8x8 (8×8 ymm accumulators); only the inner-K matmul changes. A wider 16×8 zmm tile is a follow-up.
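
The +128 offset and bias subtraction described above can be checked with a scalar model. The sketch below is not the kernel: `vpdpbusd_lane`, `dot_i8` and `dot_i8_via_u8` are hypothetical names, and K is assumed to be padded to a multiple of 4. It relies only on the identity sum((a+128)*b) = sum(a*b) + 128*sum(b).

```rust
/// Scalar model of one 32-bit lane of `vpdpbusd`:
/// acc += sum of four (unsigned byte * signed byte) products.
fn vpdpbusd_lane(acc: i32, a: [u8; 4], b: [i8; 4]) -> i32 {
    let mut sum = acc;
    for i in 0..4 {
        sum += a[i] as i32 * b[i] as i32;
    }
    sum
}

/// Reference signed dot product, as a widening-multiply path would compute it.
fn dot_i8(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

/// Same result through the u8*s8 form: shift A by +128, then subtract 128 * sum(B).
fn dot_i8_via_u8(a: &[i8], b: &[i8]) -> i32 {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len() % 4, 0, "K is assumed padded to a multiple of 4");
    let mut acc = 0i32;
    let mut b_sum = 0i32;
    for (a4, b4) in a.chunks_exact(4).zip(b.chunks_exact(4)) {
        let mut au = [0u8; 4];
        let mut bs = [0i8; 4];
        for i in 0..4 {
            // i8 in [-128, 127] maps to u8 in [0, 255]
            au[i] = (a4[i] as i16 + 128) as u8;
            bs[i] = b4[i];
            b_sum += b4[i] as i32;
        }
        acc = vpdpbusd_lane(acc, au, bs);
    }
    // Remove the bias introduced by the +128 shift.
    acc - 128 * b_sum
}

fn main() {
    let a: Vec<i8> = vec![-128, 127, -1, 0, 5, -77, 100, -100];
    let b: Vec<i8> = vec![127, -128, 3, -3, 64, -64, 1, -1];
    assert_eq!(dot_i8(&a, &b), dot_i8_via_u8(&a, &b));
    println!("both paths agree: {}", dot_i8(&a, &b));
}
```

Because the corrected accumulator equals the signed dot product exactly, anything downstream of the i32 accumulators does not need to know which inner loop produced them.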

The kernel lives in its own x86_64/avx512vnni/ subdirectory and is compiled in a separate cc::Build step gated on a build.rs assembler probe (assembler_supports_avx512vnni). Old assemblers such as binutils 2.28 on Debian stretch cannot encode vpdpbusd ymm (AVX-512 VNNI-VL added in binutils ~2.30); the probe fails, the tract_avx512vnni cfg is not set, and the kernel is omitted entirely — dispatch falls back to the AVX2 i32 path. Follows the same pattern as the existing assembler_supports_sme and compiler_supports_sve probes.
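
For readers unfamiliar with this kind of probe, here is a minimal sketch of what an assembler check in a build script could look like. It is not the code in this PR: `assembler_accepts`, the probe file names, and the use of the `CC` environment variable are assumptions for illustration. Only the `cargo:rustc-cfg=` directive and the `tract_avx512vnni` cfg name come from Cargo and from the description above.

```rust
use std::env;
use std::fs;
use std::path::{Path, PathBuf};
use std::process::{Command, Stdio};

/// One-instruction probe in AT&T syntax (the GNU as default).
const PROBE_ASM: &str = "vpdpbusd %ymm0, %ymm1, %ymm2\n";

/// Returns true if `compiler -c` assembles the probe. Any failure,
/// including the compiler not being found, counts as "not supported".
fn assembler_accepts(compiler: &str, out_dir: &Path) -> bool {
    let src = out_dir.join("probe_avx512vnni.S");
    let obj = out_dir.join("probe_avx512vnni.o");
    if fs::write(&src, PROBE_ASM).is_err() {
        return false;
    }
    Command::new(compiler)
        .arg("-c")
        .arg(&src)
        .arg("-o")
        .arg(&obj)
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .map(|s| s.success())
        .unwrap_or(false)
}

fn main() {
    // OUT_DIR is set by Cargo for build scripts; fall back to the temp dir
    // so the sketch also runs standalone.
    let out_dir = env::var("OUT_DIR")
        .map(PathBuf::from)
        .unwrap_or_else(|_| env::temp_dir());
    let cc = env::var("CC").unwrap_or_else(|_| "cc".to_string());
    if assembler_accepts(&cc, &out_dir) {
        // Only emitted when the probe passes; otherwise the cfg stays unset.
        println!("cargo:rustc-cfg=tract_avx512vnni");
    }
}
```

Treating every failure mode as "unsupported" keeps the build green on old toolchains, which is the behaviour the stretch job needs.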

Bench (single-thread, Cascade Lake, Criterion kernel-only, RAYON_NUM_THREADS=1, both kernels run i8i8 over identical PackedI8K4 inputs):

  • 64×256×64: 8.06 → 76.2 Gelem/s (9.4× AVX2)
  • 256×256×256: 8.11 → 74.6 Gelem/s (9.2× AVX2)
  • 512×512×512: 8.23 → 99.5 Gelem/s (12.1× AVX2)
  • 1024×1024×64: 8.32 → 112.6 Gelem/s (13.5× AVX2)

(512³ fits in L2: 256 KiB per int8 matrix ≤ 1 MiB/core L2, so no K-blocking is needed and this case gives the cleanest VNNI throughput.)
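
The per-matrix footprint quoted above is plain arithmetic. A small sketch, assuming a dense one-byte-per-element layout and ignoring any packing padding (`int8_matrix_bytes` is a hypothetical helper):

```rust
/// Bytes occupied by a dense rows x cols int8 matrix (one byte per element).
fn int8_matrix_bytes(rows: usize, cols: usize) -> usize {
    rows * cols * std::mem::size_of::<i8>()
}

fn main() {
    const KIB: usize = 1024;
    // 1 MiB/core L2, as listed in the validation environment.
    const L2_BYTES: usize = 1024 * KIB;
    let bytes = int8_matrix_bytes(512, 512);
    println!("512x512 int8 matrix: {} KiB", bytes / KIB);
    assert_eq!(bytes, 256 * KIB);
    assert!(bytes <= L2_BYTES);
}
```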

Test plan

  • cargo test --release -p tract-linalg — 2780 passed, 0 failed
  • cargo bench --bench vnni_i32 — bench numbers above
  • AVX2-only x86 hosts unchanged (fallback exercised via where(AVX512VNNI) gating)
  • Stretch (binutils 2.28): assembler probe fails → tract_avx512vnni not set → kernel omitted → AVX2 fallback, build succeeds

Validation environment

x86_64 KVM guest, Ubuntu 24.04.4 LTS (kernel 6.18.5), rustc 1.94.1.

  • CPU: Intel Xeon @ 2.80 GHz, family 6 / model 85 / stepping 7 (Cascade Lake-SP); 4 vCPU, 1 thread/core, 1 socket.
  • AVX-512 features (all confirmed by is_x86_feature_detected!): f, vnni, dq, bw, vl, cd.
  • Cache: L1d 32 KiB/core, L2 1 MiB/core, L3 33 MiB shared. 15 GiB RAM.
  • Bench discipline: RAYON_NUM_THREADS=1, taskset -c 0, Criterion warm-up 5 s / measure 15 s, sample size 100.
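
To repeat the feature check on another host, something like the sketch below should work. `pick_i32_kernel`, `detect` and the `"generic"` label are hypothetical; requiring avx512vl alongside avx512vnni is an assumption based on the kernel using ymm operands, not a statement about how the dispatch in this PR is written.

```rust
/// Pure selection logic. VNNI is only chosen when VL is also present,
/// on the assumption that the ymm form of vpdpbusd needs both.
fn pick_i32_kernel(has_vnni: bool, has_vl: bool, has_avx2: bool) -> &'static str {
    if has_vnni && has_vl {
        "avx512vnni_mmm_i32_8x8"
    } else if has_avx2 {
        "avx2_mmm_i32_8x8"
    } else {
        "generic" // hypothetical label for a non-SIMD fallback
    }
}

/// Runtime CPU feature detection on x86 hosts.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn detect() -> (bool, bool, bool) {
    (
        std::is_x86_feature_detected!("avx512vnni"),
        std::is_x86_feature_detected!("avx512vl"),
        std::is_x86_feature_detected!("avx2"),
    )
}

/// On other architectures none of these features exist.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn detect() -> (bool, bool, bool) {
    (false, false, false)
}

fn main() {
    let (vnni, vl, avx2) = detect();
    println!("avx512vnni={vnni} avx512vl={vl} avx2={avx2}");
    println!("would select: {}", pick_i32_kernel(vnni, vl, avx2));
}
```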

@czoli1976
Contributor Author

czoli1976 commented May 29, 2026

@kali this trips on the Embedded targets / linux (x86_64-unknown-linux-gnu-stretch) job.

The kernel uses the vpdpbusd ymm, ymm, ymm mnemonic, but Debian Stretch ships binutils 2.28 (2017), which predates AVX-512 VNNI, so that version of gas does not have the instruction in its encoding tables.

The instruction is correct for Intel Cascade Lake+ at runtime; gas just can't assemble the mnemonic on that toolchain.

Is this something that needs to be addressed?

Switching to the raw .byte encoding (with the mnemonic kept as a comment) would possibly do the trick.

@czoli1976
Contributor Author

BTW, binutils 2.30+ should handle it, if upgrading is an option for you.

@czoli1976
Contributor Author

Thinking about it more, adding a probe in build.rs that disables the kernel when the assembler can't encode vpdpbusd is another option.

WDYT ?

@kali
Collaborator

kali commented May 29, 2026

I can't / don't want to break the stretch build.

We already have a probe mechanism in build.rs, so I think we shouldn't invent something else. Plus, ".byte" will get out of hand if more than one mnemonic needs to be addressed; at least the build.rs probe is (mostly) contained in build.rs and where the kernel is registered.

@czoli1976
Contributor Author

Thanks for the quick answer, build.rs then ! :-)

@czoli1976 czoli1976 force-pushed the feat/avx512-vnni-int8-gemm branch from 2e88038 to 374870c Compare May 29, 2026 11:55
@czoli1976
Contributor Author

Done:

  • Modern binutils (≥2.30, Ubuntu 20.04+): probe passes → VNNI kernel compiled → tract_avx512vnni set → full 9–13× speedup at runtime on VNNI hosts
  • Stretch binutils 2.28: probe fails → VNNI files skipped → tract_avx512vnni not set → kernel extern symbols and dispatch call compiled away → AVX2 fallback → build succeeds

Let's see if it passes CI as expected
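
The "compiled away" part of the flow above can be illustrated with a cfg-gated selection function. This is a sketch only: `select_kernel` is a hypothetical name, the kernel is represented by its name string instead of an extern symbol, and the snippet takes the AVX2 path unless it is built with `--cfg tract_avx512vnni`.

```rust
/// Returns the name of the kernel that would be dispatched.
/// `tract_avx512vnni` is the custom cfg that build.rs sets when the probe passes.
#[allow(unexpected_cfgs)]
fn select_kernel(cpu_has_vnni: bool) -> &'static str {
    // This block does not exist in the compiled code when the cfg is unset,
    // so nothing in it needs to resolve or link on old toolchains.
    #[cfg(tract_avx512vnni)]
    {
        if cpu_has_vnni {
            return "avx512vnni_mmm_i32_8x8";
        }
    }
    // Keeps the parameter "used" when the block above is compiled out.
    let _ = cpu_has_vnni;
    "avx2_mmm_i32_8x8"
}

fn main() {
    // In real code the flag would come from runtime CPU detection.
    println!("selected kernel: {}", select_kernel(true));
}
```

Using the `#[cfg]` attribute instead of the `cfg!()` macro matters here: with `cfg!()` both branches are still type-checked and name-resolved, so a reference to a kernel symbol that was never built would fail to compile.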

@kali
Collaborator

kali commented May 29, 2026

Hmm... I can't see the probe. Did you push the right thing?

@czoli1976 czoli1976 force-pushed the feat/avx512-vnni-int8-gemm branch from 374870c to ba4c11f Compare May 29, 2026 12:06
@czoli1976
Contributor Author

Sorry, check again now.

kali previously approved these changes May 29, 2026
…8x8)

Route qmmm_i32 through VPDPBUSD when AVX-512 VNNI is available, replacing the
AVX2 per-K widening-multiply inner loop. Consumes the existing K=4-inner
PackedI8K4 layout; A is offset by +128 for VPDPBUSD's u8*s8 form and the
128*sum_k(B) bias is removed per output column, so the i32 accumulators stay
bit-identical to the AVX2 path and the whole quantization epilogue is reused.

Runtime-gated via where(AVX512VNNI); non-VNNI x86 keeps the AVX2 fallback.
Includes a vnni_i32 microbench (VNNI vs AVX2 int8).

The kernel lives in its own x86_64/avx512vnni/ subdirectory and is compiled
in a separate cc::Build step gated on a build.rs assembler probe
(assembler_supports_avx512vnni). Old assemblers such as binutils 2.28 on
Debian stretch cannot encode `vpdpbusd ymm` and will fail the probe; the
`tract_avx512vnni` cfg is then not set and the kernel is omitted entirely,
with dispatch falling back to the AVX2 i32 path. Follows the same pattern as
the existing SME (assembler_supports_sme) and SVE (compiler_supports_sve)
probes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kali kali merged commit 9fbbf31 into sonos:main Jun 1, 2026
55 checks passed