
cpu: aarch64: Re-Enable JIT Depthwise Convolution for BF16 #3441

Merged
Sqvid merged 1 commit into uxlfoundation:main
Jun 19, 2025

renato-arantes
Contributor

Description

This PR re-lands PR 3308, which was reverted because of failing nightly tests. It enables JIT depthwise convolution for bf16, so bf16 problems no longer fall back to the reference (ref) implementation. Compared with the f32 JIT implementation, the bf16 speedup ranges from ~4% on one thread to ~2.5x on 32 threads. These numbers come from a benchmark run on an AWS c7g.16xlarge instance.

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?
  • Have you submitted performance data that demonstrates performance improvements?

Performance improvements

Below are small test case logs that demonstrate performance numbers before and after.

OMP_NUM_THREADS=16 ./tests/benchdnn/benchdnn --conv --dt=bf16 --mode=p --alg=convolution_direct g32mb64_ic32oc32_ih112oh112kh3sh1dh0ph1_iw112ow112kw3sw1dw0pw1

bf16: total perf: min(ms):0.815918 avg(ms):0.826171
f32: total perf: min(ms):1.02075 avg(ms):1.03415
ref: total perf: min(ms):225.645 avg(ms):225.808

@renato-arantes renato-arantes requested review from a team as code owners June 18, 2025 16:57
@github-actions github-actions bot added platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 component:common labels Jun 18, 2025
Contributor

@dzarukin dzarukin left a comment

Just curious: which change addressed the aforementioned failures?

if (jcp.dst_dt == data_type::f32) {
    ld1w(ZRegS(0), P_ALL_ONE, ptr(reg_tmp_addr));
    fadd(zregs_acc, zregs_acc, ZRegS(0));
} else if (jcp.dst_dt == data_type::bf16) {
Contributor Author

> Just curious: which change addressed the aforementioned failures?

This is the change that fixes the failing tests. The sum post-op had not been handled previously.

@Sqvid Sqvid merged commit ad46dbb into uxlfoundation:main Jun 19, 2025
24 of 28 checks passed