
[MLAS] Add 8-bit weights ARM64 Gemm implementation #25110


Open: hariharans29 wants to merge 46 commits into main

Conversation

@hariharans29 (Member) commented Jun 18, 2025

Description

Enable 8-bit weights Gemm on ARM64 via MLAS

  1. Supports two flavors of the 8-bit Gemm kernel: one uses vdotq (U8U8) and the other uses vusdotq (U8S8) on platforms where I8MM is supported (see the intrinsics sketch after this list).

  2. Provides access to these new MLAS Gemm kernels via the MatmulNBits contrib operator

  3. Tests:

    MLAS: three new sets of tests

    • SQ8BitQuantA: Tests the dynamic activation quantization MLAS kernel (fp32 -> uint8_t, or fp32 -> int8_t on I8MM platforms)
    • SQ8BitPrepack: Tests the prepacking of the weights for the 8-bit Gemm kernels
    • SQ8BitGemm: Tests the 8-bit Gemm kernels

    MatmulNBits contrib tests

    • Enables the 8-bit Gemm tests on ARM64 (previously only enabled on x86)
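
For illustration only, a minimal sketch of the two dot-product flavors named in item 1, written directly against the NEON intrinsics. This is not the PR's kernel code; the operand roles follow the activation-quantization behavior described in the test list above (activations are quantized to int8_t only on I8MM platforms).

```cpp
#include <arm_neon.h>

// Minimal sketch, not the PR's actual kernel code. Each intrinsic
// accumulates four lanes of 4-element 8-bit dot products into a
// 32-bit accumulator.

// U8U8 flavor (baseline NEON dotprod): both operands unsigned.
uint32x4_t AccumulateU8U8(uint32x4_t acc, uint8x16_t a, uint8x16_t b) {
    return vdotq_u32(acc, a, b);
}

// U8S8 flavor (requires the I8MM extension): the first vector operand is
// unsigned, the second signed. On I8MM platforms the PR quantizes
// activations to int8_t, so one side of each product is signed.
int32x4_t AccumulateU8S8(int32x4_t acc, uint8x16_t u, int8x16_t s) {
    return vusdotq_s32(acc, u, s);
}
```

Building a file that uses vusdotq requires matching target flags, which is why the PR compiles the new i8mm kernel source with -march=armv8.2-a+i8mm (see the cmake change in the review summary below).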

Motivation and Context

Enable 8-bit weights Gemm on ARM64 via MLAS

Based on work and contribution by @fajin-corp

@github-actions bot (Contributor) left a comment:

You can commit the suggested changes from lintrunner.


hariharans29 and others added 4 commits June 25, 2025 12:34
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

hariharans29 and others added 7 commits June 26, 2025 12:11
@hariharans29 changed the title from "[DO NOT REVIEW] [MLAS] 8 bit weights ARM64 Matmul implementation" to "WIP: [MLAS] 8 bit weights ARM64 Matmul implementation" on Jun 27, 2025
@hariharans29 changed the title from "WIP: [MLAS] 8 bit weights ARM64 Matmul implementation" to "[MLAS] 8 bit weights ARM64 Matmul implementation" on Jun 28, 2025

@hariharans29 changed the title from "[MLAS] 8 bit weights ARM64 Matmul implementation" to "[MLAS] Add 8-bit weights ARM64 Gemm implementation" on Jun 28, 2025
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
```diff
@@ -24,7 +24,7 @@
 #include "core/session/ort_env.h"
 #include "core/util/qmath.h"

-#if (defined(MLAS_TARGET_AMD64_IX86) && !defined(USE_DML) && !defined(USE_WEBGPU) && !defined(USE_COREML)) || defined(USE_CUDA) || defined(USE_WEBGPU)
+#if ((defined(MLAS_TARGET_AMD64_IX86) || defined(MLAS_TARGET_ARM64)) && !defined(USE_DML) && !defined(USE_WEBGPU) && !defined(USE_COREML)) || defined(USE_CUDA) || defined(USE_WEBGPU)
```
@hariharans29 (Member, Author) commented:

Enables tests on ARM64.

@jywu-msft requested review from edgchen1 and Copilot on June 28, 2025 20:24
@Copilot (Copilot AI) left a comment:

Pull Request Overview

This PR adds support for 8-bit weights Gemm on ARM64 via a new MLAS implementation that leverages both vdotq and i8mm instructions. Key changes include updates and additions in test suites (sq8bitgemm, matmul_8bits_test, matmul_4bits_test), integration of a new source file (sqnbitgemm_kernel_neon_int8_i8mm.cpp) with corresponding build system adjustments, and modifications in various MLAS functions to propagate a BlkBitWidth parameter and handle additional block-sum data.

Reviewed Changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| onnxruntime/test/mlas/unittest/test_sq8bitgemm.cpp | Updates to function signatures (e.g. additional blkSum2 parameters) and test kernel evaluations for ARM64 paths. |
| onnxruntime/test/contrib_ops/matmul_8bits_test.cc and matmul_4bits_test.cc | Renaming test cases to include "4b" or "8b" for clarity and updating test configurations. |
| onnxruntime/core/mlas/lib/sqnbitgemm_kernel_neon_int8_i8mm.cpp and related MLAS files | New implementation file added with adjustments for i8mm instructions and extended packing routines including blkSum2. |
| cmake/onnxruntime_mlas.cmake | Build system changes to compile the new source file with proper ARM64 flags (-march=armv8.2-a+i8mm). |
| onnxruntime/core/mlas/lib/platform.cpp and related header files | Updates to dispatch functions and the introduction of the BlkBitWidth parameter with conditional selection for ARM64. |
Comments suppressed due to low confidence (3)

onnxruntime/test/mlas/unittest/test_sq8bitgemm.cpp:32

  • The addition of 'refBlkSum2_' in the buffer declarations and the subsequent changes in the PrepackB and CheckBlkSum functions increase complexity. Consider adding a brief comment describing how and why the additional block-sum accumulation is used, to improve code clarity.
  MatrixGuardBuffer<uint8_t> inputB_, inputZp_, refB_, packedBuffer_;

onnxruntime/core/mlas/lib/platform.cpp:588

  • The flag 'ArmNeonQuantAUnsigned' is deliberately overridden when I8MM support is detected; a comment explaining the rationale behind switching from unsigned to signed mode in this context would help maintainability and clarity for future maintainers (see the sketch after this list).
  this->ArmNeonQuantAUnsigned = false;

cmake/onnxruntime_mlas.cmake:441

  • Ensure that the compile flag '-march=armv8.2-a+i8mm' is used consistently across all ARM64 targets for the new file. Double-check that the flag matches the expected support level for i8mm instructions and that documentation in the build files reflects this requirement.
  set_source_files_properties(${MLAS_SRC_DIR}/sqnbitgemm_kernel_neon_int8_i8mm.cpp
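
For the second suppressed comment, a hedged reconstruction of the dispatch override it refers to; only the flag name, the member-style assignment, and the I8MM condition come from this page, and HasI8MMSupport() is a hypothetical stand-in for whatever capability check platform.cpp actually performs.

```cpp
// Hedged sketch of the dispatch override flagged above; HasI8MMSupport()
// is a hypothetical capability check, not an actual MLAS function.
if (HasI8MMSupport()) {
    // Prefer the U8S8 path: quantizing activations signed lets the kernel
    // use vusdotq, and (on the reading sketched further below) avoids the
    // extra block-sum correction the unsigned path needs.
    this->ArmNeonQuantAUnsigned = false;
}
```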


```cpp
if constexpr (QuantAUnsigned) {
    assert(QuantBBlkSum2 != nullptr);
```
A Contributor commented:

The assertion fails on Android [QuantAUnsigned = true]: assertion "QuantBBlkSum2 != nullptr" failed, when testing with phi-4-mini-instruct: cpu-int4-kquant-block-128-mixed-acc-level-4/v3
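
One plausible reading of why the unsigned path needs QuantBBlkSum2 at all (my assumption, not something stated in the PR; the 128 offset and the helper below are purely illustrative): if an unsigned activation is just the signed value shifted, a_u8 = a_s8 + 128, then the raw u8*u8 dot product overshoots the signed result by 128 * sum(b) per block, which is exactly what a per-block sum of B can cancel.

```cpp
#include <cstdint>
#include <cstddef>

// Illustrative only: the correction a per-block sum of B provides when
// activations are quantized unsigned with an assumed offset of 128
// (a_u8 = a_s8 + 128). Not the PR's code.
int32_t CorrectedBlockDot(const uint8_t* a_u8, const uint8_t* b_u8,
                          size_t blk_len, int32_t blk_sum_b) {
    int32_t acc = 0;
    for (size_t k = 0; k < blk_len; ++k) {
        acc += static_cast<int32_t>(a_u8[k]) * static_cast<int32_t>(b_u8[k]);
    }
    // sum((a_s8 + 128) * b) - 128 * sum(b) == sum(a_s8 * b)
    return acc - 128 * blk_sum_b;
}
```

If that reading is right, the Android failure above is the unsigned (QuantAUnsigned = true) path being selected on a device without I8MM while the corresponding block-sum buffer was never packed.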
