
[webgpu] support intel subgroup matrix on matmul_nbits #24898


Merged
merged 14 commits into microsoft:main from subgroup-matrix-on-intel on Jun 14, 2025

Conversation

xhcao
Contributor

@xhcao xhcao commented May 29, 2025

The patch enables the Intel subgroup matrix feature on the matmul_nbits operator. For now it is only supported on the Vulkan backend and the xe-2lpg architecture; we will extend it to more subgroup matrix configs and platforms.

Description

Motivation and Context

The patch enables the Intel subgroup matrix feature on the matmul_nbits operator.
For now it is only supported on the Vulkan backend and the xe-2lpg architecture;
we will extend it to more subgroup matrix configs and platforms.
@xhcao
Contributor Author

xhcao commented May 29, 2025

  1. The subgroup matrix feature is highly vendor- and architecture-specific: vendors and architectures support different subgroup matrix configs, and each has a different best-performing config. Optimizing the algorithm for one piece of hardware can easily hurt others, so in this early stage of development we generate code separately for each vendor.
  2. The PR currently only supports the Intel xe-2lpg architecture on Vulkan, with the subgroup matrix config f16(8×16) × f16(16×16) = f32(8×16); we will extend the feature as Dawn enables more configs.
  3. Current performance on the Intel xe-2lpg architecture is ~20% slower than the dp4a path and ~10% faster than the non-dp4a path.
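For intuition, the f16(8×16) × f16(16×16) = f32(8×16) config means each subgroup-level multiply consumes an 8×16 tile of A and a 16×16 tile of B and accumulates into an 8×16 tile of C. A plain-Python sketch of that tiling (illustrative only; the real kernel is a WGSL shader, and all names here are made up):

```python
# Tile shapes from the f16(8x16) x f16(16x16) = f32(8x16) subgroup-matrix config.
M_DIM, K_DIM, N_DIM = 8, 16, 16

def matmul_tiled(a, b, m, k, n):
    """Multiply an m x k matrix by a k x n matrix, walking the K dimension in
    K_DIM-wide steps the way a subgroup-matrix accumulate loop would."""
    assert m % M_DIM == 0 and k % K_DIM == 0 and n % N_DIM == 0
    c = [[0.0] * n for _ in range(m)]
    for mt in range(0, m, M_DIM):          # one 8x16 C tile per (mt, nt) pair
        for nt in range(0, n, N_DIM):
            for kt in range(0, k, K_DIM):  # accumulate across K tiles
                for i in range(M_DIM):
                    for j in range(N_DIM):
                        c[mt + i][nt + j] += sum(
                            a[mt + i][kt + p] * b[kt + p][nt + j]
                            for p in range(K_DIM))
    return c
```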

@jchen10 @daijh PTAL, thanks.

@xhcao
Contributor Author

xhcao commented May 29, 2025

[Screenshot attached: 2025-05-29 161718]
Currently, the subgroup matrix config UINT8(8×32) × UINT8(32×8) = UINT32(8×8) is being implemented in Dawn; that config is expected to perform better than dp4a.
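For context on the dp4a comparison: a dp4a-style instruction accumulates a dot product of four 8-bit lanes packed into 32-bit words, whereas the uint8 subgroup-matrix config above performs a whole 8×32 by 32×8 tile product per operation. A hedged Python model of the dp4a-style primitive (the function name is illustrative, not an actual API):

```python
def dp4a(packed_a: int, packed_b: int, acc: int) -> int:
    """Model of a dp4a-style instruction: dot product of four unsigned 8-bit
    lanes packed into two 32-bit words, added to an accumulator."""
    for shift in (0, 8, 16, 24):
        acc += ((packed_a >> shift) & 0xFF) * ((packed_b >> shift) & 0xFF)
    return acc
```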

@fs-eire
Contributor

fs-eire commented Jun 2, 2025

This needs a merge with the latest main branch to fix the CI pipeline issue.

@xhcao
Contributor Author

xhcao commented Jun 3, 2025

Please do NOT merge this upstream today; I have some optimizations to merge tomorrow. Thanks.

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Jun 3, 2025
@sushraja-msft
Contributor

xhcao, does this work on Windows? Would it be possible to share instructions on how to try the Vulkan backend with ORT WebGPU?

@xhcao
Contributor Author

xhcao commented Jun 5, 2025

> xhcao, does this work on Windows? Would it be possible to share instructions on how to try the Vulkan backend with ORT WebGPU?

@sushraja-msft The PR enables the feature on Intel "xe-2lpg" platforms. On Windows, you can build onnxruntime with the command below:

```
build.bat --config RelWithDebInfo --build_dir build/Rel --parallel --skip_submodule_sync --skip_tests --parallel --use_webgpu --build_shared_lib --enable_pybind --build_wheel --cmake_extra_defines onnxruntime_BUILD_UNIT_TESTS=OFF --cmake_extra_defines onnxruntime_ENABLE_DAWN_BACKEND_VULKAN=ON --cmake_extra_defines onnxruntime_ENABLE_DAWN_BACKEND_D3D12=OFF --cmake_generator "Visual Studio 17 2022"
```

But after upgrading Dawn yesterday, shader compilation fails.

Whether I write the subgroup matrix result to global memory directly or through shared memory, it fails either way. I am not sure whether your implementation on Metal still works correctly. The error logs are shown below:

```
Exception: WebGPU validation failed. Error while parsing WGSL: :131:38 error: 'subgroupMatrixStore' requires argument 1 to be uniform
subgroupMatrixStore(&output, matrix_c_offset + subtile_id * m_dim * uniforms.N + 3 * n_dim, matC03, false, uniforms.N);
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:89:31 note: builtin 'local_idx' of 'main' may be non-uniform
let subtile_id = u32(local_idx / sg_size);
                     ^^^^^^^^^

 - While calling [Device].CreateShaderModule([ShaderModuleDescriptor]).

Exception: WebGPU validation failed. Error while parsing WGSL: :163:29 error: 'subgroupMatrixStore' requires argument 0 to be uniform
subgroupMatrixStore(&scratch[subtile_id], 0, matC03, false, n_dim);
                    ^^^^^^^^^^^^^^^^^^^^
:101:31 note: builtin 'local_idx' of 'main' may be non-uniform
let subtile_id = u32(local_idx / sg_size);
                     ^^^^^^^^^

 - While calling [Device].CreateShaderModule([ShaderModuleDescriptor]).
```
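The root cause, as the log hints, is WGSL's static uniformity analysis: `subtile_id` is derived from the builtin `local_idx`, which the compiler must treat as non-uniform, even though under a contiguous subgroup layout `local_idx / sg_size` is in fact constant within each subgroup. A small Python check of that arithmetic (the contiguous layout is an assumption, not something the WGSL spec guarantees, which is exactly why the compiler rejects the code):

```python
def check_uniform_per_subgroup(workgroup_size: int, sg_size: int) -> bool:
    """Assuming subgroups cover contiguous ranges of local_idx, verify that
    subtile_id = local_idx // sg_size takes a single value within every
    subgroup, i.e. it is subgroup-uniform despite being derived from a
    non-uniform builtin."""
    for first in range(0, workgroup_size, sg_size):  # one contiguous subgroup
        lane_values = {local_idx // sg_size  # the expression WGSL flags
                       for local_idx in range(first, first + sg_size)}
        if len(lane_values) != 1:
            return False
    return True
```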

@xhcao
Contributor Author

xhcao commented Jun 6, 2025

After tuning the workgroup size (128 -> 256) and the tile size of A, and removing the shared-memory tile of C, performance is now ~20% better than the dp4a code path on LNL with Windows + Vulkan.
It is ready for review now. PTAL.

@xhcao
Contributor Author

xhcao commented Jun 9, 2025

@sushraja-msft Could you help review this PR? Thanks.

@xhcao xhcao force-pushed the subgroup-matrix-on-intel branch from 7a712bc to fec4b17 Compare June 10, 2025 08:12
@fs-eire
Contributor

fs-eire commented Jun 11, 2025

It looks like no code path will run subgroup matrix in wasm. (Correct me if I'm wrong.)

Maybe simply excluding subgroup_matrix_matmul_nbits.cc and subgroup_matrix_matmul_nbits.h from the WebAssembly build is an easier way to fix the wasm build?

@xhcao
Contributor Author

xhcao commented Jun 11, 2025

> It looks like no code path will run subgroup matrix in wasm. (Correct me if I'm wrong.)
>
> Maybe simply excluding subgroup_matrix_matmul_nbits.cc and subgroup_matrix_matmul_nbits.h from the WebAssembly build is an easier way to fix the wasm build?

You are right. Done. Thanks.

@xhcao
Contributor Author

xhcao commented Jun 12, 2025

Sorry, that was my fault: I excluded one file too many from the wasm target. I had built the wasm target on my local machine but did not hit the failure there.
Is the CUDA bot failure related to my PR?

@fs-eire
Contributor

fs-eire commented Jun 12, 2025

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 5 pipeline(s).

@fs-eire fs-eire merged commit 82c1bf9 into microsoft:main Jun 14, 2025
83 checks passed