cpu: aarch64: brgemm: Add support for int8 in brgemm kernel #3414


Open · wants to merge 1 commit into main
Conversation

kasturedeeksha
Contributor

Description

This PR extends the BRGEMM (Batch-Reduce General Matrix Multiplication) kernel to support additional INT8 data types, enabling broader applicability for low-precision computations, particularly in deep learning workloads.

Supported Data Type Tags
The following source:weight:destination (src:wei:dst) combinations are now supported:

  • s8:s8:f32
  • u8:u8:f32
  • u8:s8:f32

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  1. make test output
98% tests passed, 4 tests failed out of 224
 
Total Test time (real) = 1058.22 sec
 
The following tests FAILED:
        172 - test_graph_unit_dnnl_large_partition_cpu (Failed)
        195 - test_benchdnn_modeC_binary_ci_cpu (Failed)
        196 - test_benchdnn_modeC_binary_different_dt_ci_cpu (Failed)
        204 - test_benchdnn_modeC_graph_ci_cpu (Failed)

The output is the same before and after the code changes.

  2. brgemm_test_all output
    Command used:

./benchdnn --brgemm --batch=inputs/brgemm/test_brgemm_all

Before

tests:660480 passed:18496 skipped:641984 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total: 24.50s; create_pd: 0.00s (0%); create_prim: 0.00s (0%); fill: 7.30s (30%); execute: 0.00s (0%); compute_ref: 4.30s (18%); compare: 5.34s (22%);

After

tests:660480 passed:20480 skipped:640000 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total: 23.11s; create_pd: 0.00s (0%); create_prim: 0.00s (0%); fill: 6.83s (30%); execute: 0.00s (0%); compute_ref: 3.82s (17%); compare: 4.40s (19%);
  • Have you formatted the code using clang-format?
    Yes

@kasturedeeksha kasturedeeksha requested a review from a team as a code owner June 11, 2025 08:50
@github-actions github-actions bot added the platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 label Jun 11, 2025
Contributor

@jondea jondea left a comment


This looks generally great, thank you for this contribution.

Do you have a rough idea of the performance? For example, compared to F32, is it ~4x faster?

Also, what's the general idea of this algorithm? SDOT is more complicated than FMLA in that it reduces the elements from int8 to s32. Do we handle this reduction using some kind of blocking? The other way would be to operate on the transpose and use FADDV, but I can't see that here. In short, it would be great to have a short overview of how this kernel works.

else if (brg.dt_a == data_type::u8 && brg.dt_b == data_type::s8)
    usdot(v1.s, v_a.b, v_b.b);
else if (brg.dt_a == data_type::s8 && brg.dt_b == data_type::u8)
    assert(!"unsupported\n");
Contributor


Can we not just swap v_a and v_b in this case?

@@ -1254,14 +1270,21 @@ void jit_brgemm_kernel_t::set_A_B_matrices() {
add(reg_aux_B, reg_aux_B, reg_b_offset);
}

-void jit_brgemm_kernel_t::dot_product(ZReg v1, ZReg v2, ZReg v3) {
+void jit_brgemm_kernel_t::dot_product(ZReg v1, ZReg v_b, ZReg v_a) {
Contributor


v1, v2 and v3 are unclear, although consistent. v1, v_b and v_a feels like the right direction, but is more confusing.

  • Why is b first?
  • What is v1 in this case? If a and b reference the A and B matrices, then it probably makes sense to give v1 a name with a similar theme, maybe v_acc?

@@ -1470,8 +1500,10 @@ void jit_brgemm_kernel_t::gemm_microkernel_sve512(int bd_block2,
const auto bd_by_load_bytes
        = (bd >= bd_e - rows_by_load_bytes
                || brg.brgattr.wary_A_k_tail_read);
broadcast(bcst(), A_offset(bd, rd),
        have_to_load_bytes && bd_by_load_bytes, brg.dt_a);
int should_broadcast = static_cast<int>(
Contributor


What does this mean? We have a variable called should_broadcast, but then we call broadcast even if it is not true. Are they two different kinds of broadcasting? If so, it would be great to make that clear in the variable/function names.
