linalg/x86_64: add AVX-512 VNNI int8 GEMM kernel (avx512vnni_mmm_i32_8x8) #2303
189135c to 2e88038
|
@kali this trips on the Embedded targets / linux (x86_64-unknown-linux-gnu-stretch) job. The kernel uses the vpdpbusd ymm, ymm, ymm mnemonic, but Debian Stretch ships binutils 2.28 (2017), which predates AVX-512 VNNI, so that gas version does not have the instruction in its encoding tables. The instruction is correct for Intel Cascade Lake+ at runtime; gas just can't assemble the mnemonic on that toolchain. Does this need to be addressed? Emitting the raw .byte encoding (with the mnemonic as a comment) should do the trick.
|
BTW binutils 2.30+ should handle it, if upgrading is an option for you.
|
Thinking more about it: adding a probe in build.rs that disables the kernel when the assembler can't encode vpdpbusd is another option. WDYT?
|
I can't / don't want to break the stretch build. We already have a probe mechanism in build.rs, so I think we shouldn't invent something else. Plus, ".byte" will get out of hand if more than one mnemonic needs to be addressed; at least the build.rs probe is (mostly) contained in build.rs and where the kernel is registered.
|
Thanks for the quick answer, build.rs then! :-)
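For readers following along, here is a minimal std-only sketch of what such an assembler probe could look like. It is not the code in tract's build.rs (which goes through the cc crate); `assembler_accepts`, `cfg_directive`, and `VNNI_PROBE` are hypothetical names, and the only identifier taken from this PR is the `tract_avx512vnni` cfg.

```rust
use std::env;
use std::fs;
use std::process::{self, Command, Stdio};

/// One-instruction probe source in AT&T syntax (illustrative only).
const VNNI_PROBE: &str = ".text\nvpdpbusd %ymm2, %ymm1, %ymm0\n";

/// Hypothetical helper: returns true when `compiler` can assemble
/// `asm_source` into an object file.
fn assembler_accepts(compiler: &str, asm_source: &str) -> bool {
    let dir = env::temp_dir();
    let src = dir.join(format!("asm_probe_{}.S", process::id()));
    let obj = dir.join(format!("asm_probe_{}.o", process::id()));
    if fs::write(&src, asm_source).is_err() {
        return false;
    }
    // A missing compiler and a non-zero exit status both count as "unsupported".
    let ok = Command::new(compiler)
        .arg("-c")
        .arg(&src)
        .arg("-o")
        .arg(&obj)
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .map(|status| status.success())
        .unwrap_or(false);
    let _ = fs::remove_file(&src);
    let _ = fs::remove_file(&obj);
    ok
}

/// Cargo directive a build script would print when the probe succeeds.
fn cfg_directive(supported: bool) -> Option<&'static str> {
    if supported {
        Some("cargo:rustc-cfg=tract_avx512vnni")
    } else {
        None
    }
}
```

A build script would then print the directive only on success, e.g. `if let Some(line) = cfg_directive(assembler_accepts("cc", VNNI_PROBE)) { println!("{line}"); }`, so an old gas simply leaves the cfg unset instead of failing the build.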
2e88038 to 374870c
|
Done, let's see if it passes CI as expected.
|
Mmm... I can't see the probe, did you push the right thing?
374870c to ba4c11f
|
Sorry, check again now.
…8x8)

Route qmmm_i32 through VPDPBUSD when AVX-512 VNNI is available, replacing the AVX2 per-K widening-multiply inner loop. Consumes the existing K=4-inner PackedI8K4 layout; A is offset by +128 for VPDPBUSD's u8*s8 form and the 128*sum_k(B) bias is removed per output column, so the i32 accumulators stay bit-identical to the AVX2 path and the whole quantization epilogue is reused. Runtime-gated via where(AVX512VNNI); non-VNNI x86 keeps the AVX2 fallback. Includes a vnni_i32 microbench (VNNI vs AVX2 int8).

The kernel lives in its own x86_64/avx512vnni/ subdirectory and is compiled in a separate cc::Build step gated on a build.rs assembler probe (assembler_supports_avx512vnni). Old assemblers such as binutils 2.28 on Debian stretch cannot encode `vpdpbusd ymm` and will fail the probe; the `tract_avx512vnni` cfg is then not set and the kernel is omitted entirely, with dispatch falling back to the AVX2 i32 path. Follows the same pattern as the existing SME (assembler_supports_sme) and SVE (compiler_supports_sve) probes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
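The +128 offset and bias removal described in the commit message can be checked with a scalar model. The sketch below is illustrative only and is not the kernel's code: `dot4_u8s8` models a single 32-bit lane of VPDPBUSD (four u8×s8 products added to an i32 accumulator without saturation), and all function names are hypothetical.

```rust
/// Scalar model of one 32-bit VPDPBUSD lane: acc += sum over 4 of (u8 * s8).
fn dot4_u8s8(acc: i32, a: [u8; 4], b: [i8; 4]) -> i32 {
    let mut sum = acc;
    for k in 0..4 {
        sum = sum.wrapping_add(a[k] as i32 * b[k] as i32);
    }
    sum
}

/// Plain signed reference: sum_k a[k] * b[k].
fn dot_i8_reference(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

/// Same dot product computed through the u8*s8 form:
/// sum_k (a[k] + 128) * b[k] - 128 * sum_k b[k] == sum_k a[k] * b[k].
fn dot_i8_via_offset(a: &[i8], b: &[i8]) -> i32 {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len() % 4, 0, "K is assumed to be a multiple of 4");
    let mut acc = 0i32;
    let mut sum_b = 0i32;
    for (ca, cb) in a.chunks_exact(4).zip(b.chunks_exact(4)) {
        // Shift A from [-128, 127] into [0, 255] so it is a valid u8 operand.
        let au = [
            (ca[0] as i16 + 128) as u8,
            (ca[1] as i16 + 128) as u8,
            (ca[2] as i16 + 128) as u8,
            (ca[3] as i16 + 128) as u8,
        ];
        let bs = [cb[0], cb[1], cb[2], cb[3]];
        acc = dot4_u8s8(acc, au, bs);
        sum_b += bs.iter().map(|&v| v as i32).sum::<i32>();
    }
    // Remove the bias introduced by the +128 offset on A.
    acc - 128 * sum_b
}
```

Because the correction depends only on B, it can be applied once per output column, which is why the i32 accumulators can match the signed path exactly.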
ba4c11f to 8eb60e2
Summary
- `avx512vnni_mmm_i32_8x8` routes `qmmm_i32` through `vpdpbusd` when AVX-512 VNNI is available, replacing the AVX2 per-K widening-multiply inner loop.
- Consumes the existing K=4-inner `PackedI8K4` layout; A is offset by +128 for `vpdpbusd`'s `u8*s8` form, and the `128*sum_k(B)` bias is subtracted per output column, so the i32 accumulators stay bit-identical to the AVX2 path and the whole quantization epilogue is reused unchanged.
- Runtime-gated via `where(AVX512VNNI)`; non-VNNI x86 keeps the AVX2 fallback.
- Includes a `vnni_i32` Criterion microbench comparing VNNI vs AVX2.

Tile geometry is unchanged from `avx2_mmm_i32_8x8` (8×8 ymm accumulators); only the inner-K matmul changes. A wider 16×8 zmm tile is a follow-up.

The kernel lives in its own `x86_64/avx512vnni/` subdirectory and is compiled in a separate `cc::Build` step gated on a `build.rs` assembler probe (`assembler_supports_avx512vnni`). Old assemblers such as binutils 2.28 on Debian stretch cannot encode `vpdpbusd ymm` (AVX-512 VNNI-VL added in binutils ~2.30); the probe fails, the `tract_avx512vnni` cfg is not set, and the kernel is omitted entirely, so dispatch falls back to the AVX2 i32 path. Follows the same pattern as the existing `assembler_supports_sme` and `compiler_supports_sve` probes.

Bench (single-thread, Cascade Lake, Criterion kernel-only, `RAYON_NUM_THREADS=1`, both kernels run i8i8 over identical `PackedI8K4` inputs):
(512³ fits in L2: 256 KiB per int8 matrix ≤ 1 MiB/core L2, so no K-blocking is needed; this gives the cleanest VNNI throughput.)
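The footprint figure can be reproduced with a few lines of arithmetic. This is only a sketch: the helper name is hypothetical, and the 1 MiB per-core L2 size is the value quoted above, not something queried from the hardware.

```rust
/// Bytes occupied by a dense int8 matrix (one byte per element).
const fn int8_matrix_bytes(rows: usize, cols: usize) -> usize {
    rows * cols
}

/// Per-core L2 size assumed in the note above (1 MiB).
const L2_BYTES_PER_CORE: usize = 1024 * 1024;

/// True when a single rows x cols int8 matrix fits in the assumed L2.
fn int8_matrix_fits_l2(rows: usize, cols: usize) -> bool {
    int8_matrix_bytes(rows, cols) <= L2_BYTES_PER_CORE
}
```

For the 512×512 case this gives 262144 bytes, i.e. 256 KiB per int8 operand.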
Test plan
- `cargo test --release -p tract-linalg`: 2780 passed, 0 failed
- `cargo bench --bench vnni_i32`: bench numbers above
- … `where(AVX512VNNI)` gating)
- … `tract_avx512vnni` not set → kernel omitted → AVX2 fallback, build succeeds

Validation environment
- x86_64 KVM guest, Ubuntu 24.04.4 LTS (kernel 6.18.5), rustc 1.94.1.
- … `is_x86_feature_detected!`): f, vnni, dq, bw, vl, cd.
- `RAYON_NUM_THREADS=1`, `taskset -c 0`, Criterion warm-up 5 s / measure 15 s, sample size 100.
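A feature check like the one listed above can be sketched with the standard detection macro. This is an assumption-laden illustration, not tract's `where(AVX512VNNI)` dispatch: `vnni_ymm_available` and `pick_i32_kernel` are hypothetical helpers, and the sketch requires both `avx512vnni` and `avx512vl` because the kernel uses the ymm form of `vpdpbusd`.

```rust
/// Runtime check for the ymm form of vpdpbusd (needs VNNI and VL).
#[cfg(target_arch = "x86_64")]
fn vnni_ymm_available() -> bool {
    std::is_x86_feature_detected!("avx512vnni") && std::is_x86_feature_detected!("avx512vl")
}

/// On other architectures the kernel can never be selected.
#[cfg(not(target_arch = "x86_64"))]
fn vnni_ymm_available() -> bool {
    false
}

/// Hypothetical selector: returns the name of the kernel that would run.
fn pick_i32_kernel(vnni: bool) -> &'static str {
    if vnni {
        "avx512vnni_mmm_i32_8x8"
    } else {
        "avx2_mmm_i32_8x8"
    }
}
```

Calling `pick_i32_kernel(vnni_ymm_available())` on the validation machine would be expected to select the VNNI kernel, given the feature list above.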