[webgpu] Use workgroup memory to reduce register pressure #24286
base: main
Conversation
LGTM, thanks.
const min_value : q_element_t = q_element_t(-65504.0);

// Default SHM usage limit is 16KB in Dawn.
var<workgroup> k_tile : array<array<q_value_t, qkv_head_size_vec>, max_k_step>; // 96 * 2 * 16 = 3KB.
var<workgroup> v_tile : array<array<q_value_t, qkv_head_size_vec>, max_k_step>; // 96 * 2 * 16 = 3KB.

var<workgroup> o_tile_r : array<array<q_value_t, half_qkv_head_size_vec>, workgroup_size_x>; // 48 * 2 * 64 = 6KB.
Explain why we need to use workgroup memory apart from private memory.
Perhaps we should remove this `// 96 * 2 * 16 = 3KB.` comment; the head size is no longer 96 in Phi4. Also the math is confusing: it is actually (96/4) * 8 * 16. That is the head size vectorized (96/4), multiplied by the size of q_value_t, which is vec4<f16> (8 bytes), multiplied by max_k_step.
Was the comment useful for you? Otherwise we can remove it, or say the head size is for Phi3 and show the math: (96/4) * 8 * 16.
var<workgroup> v_tile : array<array<q_value_t, qkv_head_size_vec>, max_k_step>; // vec4<f16> * qkv_head_size_vec * max_k_step = 8 * (128/4) * 16 = 4KB. 128 is head_size for phi4.

// Move half of o_tile from private memory into workgroup memory to reduce register pressure. Note that register spill was observed on Qualcomm if the whole o_tile is in private memory.
var<workgroup> o_tile_r : array<array<q_value_t, half_qkv_head_size_vec>, workgroup_size_x>; // vec4<f16> * half_qkv_head_size_vec * workgroup_size_x = 8 * (128/4/2) * 64 = 8KB.
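For reference, the shared-memory budget implied by the comments above can be checked with a short script. This is a sketch: the Phi4 head size (128), `max_k_step` (16), `workgroup_size_x` (64), and the 8-byte size of `q_value_t` (`vec4<f16>`) are all taken from the code comments, and 16KB is the default Dawn SHM limit mentioned in the shader.

```python
# Sketch: verify the workgroup (shared) memory arithmetic from the review comments.
# Values assumed from the code comments: head_size = 128 (Phi4),
# q_value_t = vec4<f16> = 8 bytes, max_k_step = 16, workgroup_size_x = 64.

ELEM_BYTES = 8                       # sizeof(vec4<f16>)
head_size = 128                      # Phi4; Phi3 used 96
qkv_head_size_vec = head_size // 4   # head size in vec4 elements
half_qkv_head_size_vec = qkv_head_size_vec // 2
max_k_step = 16
workgroup_size_x = 64

k_tile = ELEM_BYTES * qkv_head_size_vec * max_k_step              # 8 * 32 * 16
v_tile = ELEM_BYTES * qkv_head_size_vec * max_k_step              # same shape as k_tile
o_tile_r = ELEM_BYTES * half_qkv_head_size_vec * workgroup_size_x # 8 * 16 * 64

total = k_tile + v_tile + o_tile_r
print(k_tile, v_tile, o_tile_r, total)  # 4096 4096 8192 16384
assert total <= 16 * 1024  # exactly at Dawn's default 16KB SHM limit
```

Note that with head_size = 128 the three tiles together consume the entire 16KB default budget, which is presumably why only half of o_tile is moved to workgroup memory.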
nit: does ORT have line wrapping rules? Chromium would want all lines to be no more than 80 columns. You can just move the comment above each variable declaration.
Done
@guschmue Need your help on a full perf test to ensure it won't bring regressions on other GPUs. Thanks.
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline,web_Debug / build_onnxruntime_web,web_Release / build_onnxruntime_web
Azure Pipelines successfully started running 5 pipeline(s).
/azp run build_x64_release,build_x64_release_ep_generic_interface,build_x64_release_vitisai,build_x64_release_xnnpack,build_x86_release,coreml / build-and-test (arm64, Debug),coreml / build-and-test (arm64, Release),coreml / build-and-test (x86_64, Release)
/azp run cpu / build-and-test (arm64, Debug),cpu / build-and-test (arm64, Release),iphone_simulator (arm64),iphone_simulator (x86_64),Linux QNN CI Pipeline,Python format,wasm_Debug / build-wasm,wasm_Release / build-wasm
/azp run web_Debug / build_onnxruntime_web,web_Release / build_onnxruntime_web,webgpu_build_x64_RelWithDebInfo,webgpu_external_dawn_build_x64_RelWithDebInfo,webgpu_minimal_build_edge_build_x64_RelWithDebInfo,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline
On Qualcomm Adreno X1 GPUs, the previous implementation of the FlashAttentionProgram shader in the WebGPU backend was causing high register pressure, leading to performance degradation. This PR uses workgroup memory to reduce the register pressure and improve performance.
TTFT for Phi4 with 1K inputs improves from 40s to 10s on the Qualcomm Adreno X1 GPU.