[webgpu] Use workgroup memory to reduce register pressure #24286
LGTM, thanks.
@guschmue Could you run the full perf test to make sure this doesn't introduce regressions on other GPUs? Thanks.
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline,web_Debug / build_onnxruntime_web,web_Release / build_onnxruntime_web
Azure Pipelines successfully started running 5 pipeline(s).
azp /run web_Debug / build_onnxruntime_web,web_Release / build_onnxruntime_web,webgpu_build_x64_RelWithDebInfo, webgpu_external_dawn_build_x64_RelWithDebInfo,webgpu_minimal_build_edge_build_x64_RelWithDebInfo,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64, QNN CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline,
/azp run Windows OpenVINO CI Pipeline
No pipelines are associated with this pull request.
/azp run web_Debug / build_onnxruntime_web,web_Release / build_onnxruntime_web,webgpu_build_x64_RelWithDebInfo, webgpu_external_dawn_build_x64_RelWithDebInfo,webgpu_minimal_build_edge_build_x64_RelWithDebInfo,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64, QNN CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline
Azure Pipelines successfully started running 2 pipeline(s).
/azp run build_x64_release
No pipelines are associated with this pull request.
Sorry, this needs to be merged with main.
emscripten build complains about:
|
On Qualcomm Adreno X1 GPUs, the previous implementation of the FlashAttentionProgram shader in the WebGPU backend caused high register pressure, which degraded performance. This PR uses workgroup memory to reduce register pressure and improve performance. Time to first token (TTFT) for phi4 with 1K inputs drops from 40s to 10s on a Qualcomm Adreno X1 GPU.
…24286) (cherry picked from commit 2d5316f)
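For readers unfamiliar with the technique, here is a minimal sketch of staging data in workgroup memory instead of per-invocation registers. It is not the FlashAttentionProgram shader from this PR. The helper names `buildTiledScoreShader` and `workgroupTileBytes`, the buffer layout, and the one-K-row tile are all assumptions made for illustration. The generated WGSL assumes the workgroup size equals the head dimension and that the number of Q rows is a multiple of it.

```typescript
// Hypothetical helper (not from the PR): emits WGSL for a Q*K^T score kernel
// in which each workgroup stages one K row in workgroup memory.
export function buildTiledScoreShader(headDim: number): string {
  if (!Number.isInteger(headDim) || headDim <= 0) {
    throw new Error("headDim must be a positive integer");
  }
  return `
@group(0) @binding(0) var<storage, read> q: array<f32>;
@group(0) @binding(1) var<storage, read> k: array<f32>;
@group(0) @binding(2) var<storage, read_write> scores: array<f32>;

// Shared by the whole workgroup, so no invocation has to keep its own
// private copy of the K row in registers.
var<workgroup> k_tile: array<f32, ${headDim}>;

@compute @workgroup_size(${headDim})
fn main(@builtin(local_invocation_index) lid: u32,
        @builtin(workgroup_id) wid: vec3<u32>) {
  // Assumption: one Q row per invocation, row-major [rows, headDim] layout.
  let q_row = wid.x * ${headDim}u + lid;
  let num_k_rows = arrayLength(&k) / ${headDim}u;
  for (var r = 0u; r < num_k_rows; r++) {
    // Cooperative load: each invocation copies exactly one element.
    k_tile[lid] = k[r * ${headDim}u + lid];
    workgroupBarrier();
    var acc = 0.0;
    for (var d = 0u; d < ${headDim}u; d++) {
      acc += q[q_row * ${headDim}u + d] * k_tile[d];
    }
    scores[q_row * num_k_rows + r] = acc;
    // Wait until every invocation is done reading before the next overwrite.
    workgroupBarrier();
  }
}
`;
}

// Workgroup memory consumed by the tile: one f32 (4 bytes) per element.
export function workgroupTileBytes(headDim: number): number {
  return headDim * 4;
}
```

In a browser, the returned string would typically be passed to `device.createShaderModule({ code })`. The trade-off is two barriers per tile in exchange for fewer live registers per invocation; whether that is a net win depends on the GPU, which is why the perf check on other GPUs requested above matters.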