cpu: aarch64: Re-Enable JIT Depthwise Convolution for BF16 #3441
Description
This PR re-applies PR 3308, which was reverted due to failing nightly tests. It enables JIT depthwise convolution for bf16, preventing it from falling back to ref and significantly improving performance: from ~4% for one thread to ~2.5x for 32 threads, compared to the f32 JIT operation. These performance numbers were taken from a benchmark executed on an AWS c7g.16xlarge instance.

General
Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?

Performance improvements
Below are logs from a small test case demonstrating performance before and after the change.
OMP_NUM_THREADS=16 ./tests/benchdnn/benchdnn --conv --dt=bf16 --mode=p --alg=convolution_direct g32mb64_ic32oc32_ih112oh112kh3sh1dh0ph1_iw112ow112kw3sw1dw0pw1
bf16: total perf: min(ms):0.815918 avg(ms):0.826171
f32: total perf: min(ms):1.02075 avg(ms):1.03415
ref: total perf: min(ms):225.645 avg(ms):225.808
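As a quick sanity check on the speedup claims, the ratios implied by the `min(ms)` timings in the logs above can be computed directly. This is a small standalone sketch; the variable names are illustrative and not part of benchdnn or oneDNN:

```python
# Timings copied from the benchdnn logs above (min(ms) values).
bf16_min_ms = 0.815918  # JIT bf16 depthwise conv (this PR)
f32_min_ms = 1.02075    # JIT f32 baseline
ref_min_ms = 225.645    # reference implementation (the previous bf16 fallback)

# Speedup of the new JIT bf16 path relative to each baseline.
speedup_vs_f32 = f32_min_ms / bf16_min_ms
speedup_vs_ref = ref_min_ms / bf16_min_ms

print(f"bf16 vs f32 JIT: {speedup_vs_f32:.2f}x")
print(f"bf16 vs ref:     {speedup_vs_ref:.1f}x")
```

For this 16-thread case the bf16 JIT kernel is roughly 1.25x faster than the f32 JIT kernel, and orders of magnitude faster than the ref fallback it replaces.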