Skip to content

head size 512 for sdpa#2976

Merged
syurkevi merged 7 commits intomainfrom
syurkevi/sdpa_h512
Apr 3, 2025
Merged

head size 512 for sdpa#2976
syurkevi merged 7 commits intomainfrom
syurkevi/sdpa_h512

Conversation

@syurkevi
Copy link
Contributor

Description

This PR enables a head size<=512 for the SDPA microkernel. Good configurations were profiled over PVC, BMG, LNL and DG2.
Most shapes with head size 512 still give good speedups vs a primitives based implementation. Even larger head sizes may still be beneficial, more investigation to follow.

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

@syurkevi syurkevi requested review from a team as code owners March 28, 2025 21:47
@github-actions github-actions bot added platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel component:tests Codeowner: @oneapi-src/onednn-arch labels Mar 28, 2025
@syurkevi
Copy link
Contributor Author

make test
disable test_device_cpu
disable build_cpu_runtime_omp
disable build_cpu_runtime_sycl
disable build_cpu_runtime_tbb
disable benchdnn_all
enable benchdnn_graph
enable test_device_gpu
enable arch_gpu_xe-hpc
enable arch_gpu_xe-hpg-atsm
enable arch_gpu_xe-hpg-dg2
enable arch_gpu_xe-lp
enable arch_gpu_xe-lpg
enable arch_gpu_xe-lpg+
enable arch_gpu_xe2-hpg-bmg
enable arch_gpu_xe2-lpg

@TaoLv
Copy link
Contributor

TaoLv commented Apr 2, 2025

@petercad @syurkevi Is it possible to further extend this to 576 to support MLA? There is the same request for flash attention.

@syurkevi syurkevi force-pushed the syurkevi/sdpa_h512 branch from 48978d1 to 6778889 Compare April 2, 2025 03:18
@syurkevi syurkevi requested a review from a team as a code owner April 2, 2025 03:18
@syurkevi
Copy link
Contributor Author

syurkevi commented Apr 2, 2025

@petercad @syurkevi Is it possible to further extend this to 576 to support MLA? There is the same request for flash attention.

576 is close enough to 512 that the current configs may be applicable as is. I've pushed a test branch internally with this change if you'd like to try it out (I have some concerns wrt/correctness). If 576 is all that's needed, I'll perform some additional testing and bump the number in this PR.

If we want even greater head sizes, at some point there will be too much register demands to be performant (or run at all depending on the hardware). Something like head_size=1024 will require more tuning+testing.

@TaoLv
Copy link
Contributor

TaoLv commented Apr 2, 2025

@syurkevi Good to know it. Feel free to add it in another PR. I think 576 will be enough as for now.

@atkassen
Copy link
Contributor

atkassen commented Apr 2, 2025

In src/gpu/intel/ocl/micro_sdpa.cpp,

  • on (new) line 79:
            if (!q_config_str.empty() && quantized) config_str = std::move(q_config_str);
  • before (new) line 133:
                break;

to address some coverity issues. Thanks!

@syurkevi syurkevi force-pushed the syurkevi/sdpa_h512 branch from 6778889 to 5e59499 Compare April 3, 2025 02:41
@syurkevi
Copy link
Contributor Author

syurkevi commented Apr 3, 2025

@syurkevi Good to know it. Feel free to add it in another PR. I think 576 will be enough as for now.

Bad news here after some testing.
There are no good quantized configs for head size=576. There are a few ok configs with fp16 (except DG2) that I did enable.
We'd likely need to make some updates to the microkernel generation or change the kernel to iterate over the head size w/the existing configs, both options quite involved.

@syurkevi
Copy link
Contributor Author

syurkevi commented Apr 3, 2025

make test
disable test_device_cpu
disable build_cpu_runtime_omp
disable build_cpu_runtime_sycl
disable build_cpu_runtime_tbb
disable benchdnn_all
enable benchdnn_graph
enable test_device_gpu
enable arch_gpu_xe-hpc
enable arch_gpu_xe-hpg-atsm
enable arch_gpu_xe-hpg-dg2
enable arch_gpu_xe-lp
enable arch_gpu_xe-lpg
enable arch_gpu_xe-lpg+
enable arch_gpu_xe2-hpg-bmg
enable arch_gpu_xe2-lpg

@syurkevi syurkevi merged commit 814d0e9 into main Apr 3, 2025
22 of 23 checks passed
@syurkevi syurkevi deleted the syurkevi/sdpa_h512 branch April 3, 2025 17:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component:tests Codeowner: @oneapi-src/onednn-arch platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants