[webgpu] Use workgroup memory to reduce register pressure #24286

Merged: 9 commits from opt_flash_attention into main on Apr 11, 2025

Conversation

@qjia7 (Contributor) commented Apr 3, 2025

On Qualcomm Adreno X1 GPUs, the previous implementation of the FlashAttentionProgram shader in the WebGPU backend was causing high register pressure, leading to performance degradation. This PR uses workgroup memory to reduce the register pressure and improve performance.

TTFT for phi4 with 1K inputs drops from 40s to 10s on the Qualcomm Adreno X1 GPU.
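
For readers unfamiliar with the technique, here is a minimal, hypothetical sketch of what "use workgroup memory to reduce register pressure" means for a WGSL shader emitted from C++, as the WebGPU EP does for FlashAttentionProgram: a tile of K is staged in var<workgroup> storage shared by the whole workgroup instead of each invocation holding its own copy in registers. Everything below (EmitKTileViaWorkgroupMemory, TILE_SIZE, HEAD_SIZE_VEC, WORKGROUP_SIZE, k_tile, key) is an illustrative assumption, not the actual code in flash_attention.cc.

```cpp
#include <string>

// Hypothetical sketch only: emit a WGSL fragment that cooperatively stages a
// K tile in workgroup memory, so the tile lives in shared memory rather than
// in per-invocation registers.
std::string EmitKTileViaWorkgroupMemory() {
  return R"WGSL(
const TILE_SIZE : u32 = 64u;      // illustrative tile height (rows of K)
const HEAD_SIZE_VEC : u32 = 16u;  // illustrative head_size / 4, held as vec4<f32>

// One K tile shared by the whole workgroup (workgroup memory, not registers).
var<workgroup> k_tile : array<array<vec4<f32>, HEAD_SIZE_VEC>, TILE_SIZE>;

fn load_k_tile(tile_start : u32, local_idx : u32) {
  // Each invocation loads a strided slice of the tile; 'key' and
  // WORKGROUP_SIZE are assumed to be declared elsewhere in the shader.
  for (var i = local_idx; i < TILE_SIZE * HEAD_SIZE_VEC; i += WORKGROUP_SIZE) {
    let row = i / HEAD_SIZE_VEC;
    let col = i % HEAD_SIZE_VEC;
    k_tile[row][col] = key[(tile_start + row) * HEAD_SIZE_VEC + col];
  }
  // Make the tile visible to every invocation before the Q*K^T step reads it.
  workgroupBarrier();
}
)WGSL";
}
```

The payoff is that each invocation's per-thread state shrinks to its own accumulators, which is what lowers register pressure on GPUs like the Adreno X1.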

@qjia7 marked this pull request as ready for review April 3, 2025 06:30
@guschmue added the ep:WebGPU (ort-web webgpu provider) label Apr 3, 2025
@sushraja-msft (Contributor):

LGTM, thanks.

@qjia7 requested review from sushraja-msft and guschmue April 7, 2025 03:20
sushraja-msft previously approved these changes Apr 9, 2025
@qjia7 (Contributor, Author) commented Apr 9, 2025

@guschmue Need your help running a full perf test to make sure this doesn't introduce regressions on other GPUs. Thanks.

@qjia7 requested a review from sushraja-msft April 10, 2025 03:28
sushraja-msft previously approved these changes Apr 10, 2025
guschmue previously approved these changes Apr 10, 2025
@guschmue (Contributor):

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline,web_Debug / build_onnxruntime_web,web_Release / build_onnxruntime_web


Azure Pipelines successfully started running 5 pipeline(s).

@guschmue (Contributor):

azp /run web_Debug / build_onnxruntime_web,web_Release / build_onnxruntime_web,webgpu_build_x64_RelWithDebInfo, webgpu_external_dawn_build_x64_RelWithDebInfo,webgpu_minimal_build_edge_build_x64_RelWithDebInfo,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64, QNN CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline, Windows OpenVINO CI Pipeline

@guschmue (Contributor):

/azp run Windows OpenVINO CI Pipeline


No pipelines are associated with this pull request.

@guschmue (Contributor):

/azp run web_Debug / build_onnxruntime_web,web_Release / build_onnxruntime_web,webgpu_build_x64_RelWithDebInfo, webgpu_external_dawn_build_x64_RelWithDebInfo,webgpu_minimal_build_edge_build_x64_RelWithDebInfo,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64, QNN CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline


Azure Pipelines successfully started running 2 pipeline(s).

@guschmue (Contributor):

/azp run build_x64_release


No pipelines are associated with this pull request.

@guschmue (Contributor):

Sorry, this needs to be merged with main.

@guschmue (Contributor):

The emscripten build complains about:

onnxruntime/onnxruntime/contrib_ops/webgpu/bert/flash_attention.cc:934:72: error: '&&' within '||' [-Werror,-Wlogical-op-parentheses]
  934 |          (context.AdapterInfo().vendor == std::string_view{"qualcomm"} && parameters.head_size_ % 8 == 0 || parameters.head_size_ % 4 == 0);
/mnt/vss/_work/onnxruntime/onnxruntime/onnxruntime/contrib_ops/webgpu/bert/flash_attention.cc:934:72: note: place parentheses around the '&&' expression to silence this warning
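
The compiler's note spells out the fix: put parentheses around the '&&' term. Because '&&' already binds tighter than '||', the grouping does not change the result; it only satisfies -Wlogical-op-parentheses, which -Werror turns into a hard error in the emscripten build. A minimal, self-contained illustration follows; the function and parameter names are hypothetical stand-ins for the expression at line 934, not the actual code.

```cpp
#include <string_view>

// Illustrative only (not the actual flash_attention.cc code): the same
// expression shape as line 934, with explicit parentheses around the '&&'
// term. The result is unchanged, since '&&' binds tighter than '||', but the
// grouping is now explicit and the warning goes away.
bool IsHeadSizeSupported(std::string_view vendor, int head_size) {
  return (vendor == std::string_view{"qualcomm"} && head_size % 8 == 0) ||
         head_size % 4 == 0;
}
```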

@qjia7 dismissed stale reviews from guschmue and sushraja-msft via a2dd598 April 11, 2025 04:49
@sushraja-msft merged commit 2d5316f into main Apr 11, 2025
87 of 89 checks passed
@sushraja-msft deleted the opt_flash_attention branch April 11, 2025 16:54
ashrit-ms pushed a commit that referenced this pull request Apr 24, 2025
ambroser53 pushed a commit to ambroser53/onnxruntime that referenced this pull request May 29, 2025

(cherry picked from commit 2d5316f)