
[webgpu] support intel subgroup matrix on matmul_nbits #24898


Merged
merged 14 commits into microsoft:main from subgroup-matrix-on-intel on Jun 14, 2025

Conversation

xhcao
Contributor

@xhcao xhcao commented May 29, 2025

The patch enables the Intel subgroup matrix feature on the matmul_nbits operator. For now it is only supported on the Vulkan backend and the xe-2lpg architecture; we will extend it to more subgroup matrix configs and platforms.

Description

Motivation and Context

The patch enables the Intel subgroup matrix feature on the matmul_nbits operator.
For now it is only supported on the Vulkan backend and the xe-2lpg architecture;
we will extend it to more subgroup matrix configs and platforms.
@xhcao
Contributor Author

xhcao commented May 29, 2025

  1. The subgroup matrix feature is highly vendor- and architecture-specific: vendors and architectures support different subgroup matrix configs, and each has a different best-performing config. Optimizing the algorithm for one piece of hardware can easily hurt others, so in this early stage of development we generate code separately for each vendor.
  2. The PR currently only supports the Intel xe-2lpg architecture on Vulkan, with the subgroup matrix config f16(8×16) × f16(16×16) = f32(8×16); we will extend the feature as Dawn enables more configs.
  3. Current performance on the Intel xe-2lpg architecture is ~20% slower than the dp4a path and ~10% faster than the non-dp4a path.
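For intuition, the f16(8×16) × f16(16×16) = f32(8×16) config means each subgroup-level multiply consumes an 8×16 tile of A and a 16×16 tile of B and accumulates into an 8×16 tile of C. A plain-Python sketch of that tiling (illustrative only; the real kernel is a WGSL shader, and all names here are made up):

```python
# Tile shapes from the f16(8x16) x f16(16x16) = f32(8x16) subgroup-matrix config.
M_DIM, K_DIM, N_DIM = 8, 16, 16

def matmul_tiled(a, b, m, k, n):
    """Multiply an m x k matrix by a k x n matrix, walking the K dimension in
    K_DIM-wide steps the way a subgroup-matrix accumulate loop would."""
    assert m % M_DIM == 0 and k % K_DIM == 0 and n % N_DIM == 0
    c = [[0.0] * n for _ in range(m)]
    for mt in range(0, m, M_DIM):          # one 8x16 C tile per (mt, nt) pair
        for nt in range(0, n, N_DIM):
            for kt in range(0, k, K_DIM):  # accumulate across K tiles
                for i in range(M_DIM):
                    for j in range(N_DIM):
                        c[mt + i][nt + j] += sum(
                            a[mt + i][kt + p] * b[kt + p][nt + j]
                            for p in range(K_DIM))
    return c
```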

@jchen10 @daijh PTAL, thanks.

@xhcao
Contributor Author

xhcao commented May 29, 2025

[Screenshot attached: 2025-05-29 161718]
Currently, the subgroup matrix config UINT8(8×32) × UINT8(32×8) = UINT32(8×8) is being implemented in Dawn; that config is expected to perform better than dp4a.
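For context on the dp4a comparison: a dp4a-style instruction accumulates a dot product of four 8-bit lanes packed into 32-bit words, whereas the uint8 subgroup-matrix config above performs a whole 8×32 by 32×8 tile product per operation. A hedged Python model of the dp4a-style primitive (the function name is illustrative, not an actual API):

```python
def dp4a(packed_a: int, packed_b: int, acc: int) -> int:
    """Model of a dp4a-style instruction: dot product of four unsigned 8-bit
    lanes packed into two 32-bit words, added to an accumulator."""
    for shift in (0, 8, 16, 24):
        acc += ((packed_a >> shift) & 0xFF) * ((packed_b >> shift) & 0xFF)
    return acc
```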

@fs-eire
Contributor

fs-eire commented Jun 2, 2025

This needs a merge with the latest main branch to fix the CI pipeline issue.

@xhcao
Contributor Author

xhcao commented Jun 3, 2025

Please do NOT merge this upstream today; I have some optimizations to merge tomorrow. Thanks.

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Jun 3, 2025
@sushraja-msft
Contributor

xhcao, does this work on Windows? Would it be possible to share instructions on how to try the Vulkan backend with ORT WebGPU?

@xhcao
Contributor Author

xhcao commented Jun 5, 2025

> xhcao, does this work on Windows? Would it be possible to share instructions on how to try the Vulkan backend with ORT WebGPU?

@sushraja-msft The PR enables the feature on Intel "xe-2lpg" platforms. On Windows, you can build onnxruntime with the command below:

```
build.bat --config RelWithDebInfo --build_dir build/Rel --parallel --skip_submodule_sync --skip_tests --parallel --use_webgpu --build_shared_lib --enable_pybind --build_wheel --cmake_extra_defines onnxruntime_BUILD_UNIT_TESTS=OFF --cmake_extra_defines onnxruntime_ENABLE_DAWN_BACKEND_VULKAN=ON --cmake_extra_defines onnxruntime_ENABLE_DAWN_BACKEND_D3D12=OFF --cmake_generator "Visual Studio 17 2022"
```

But after upgrading Dawn yesterday, shader compilation fails.

Whether I write the subgroup matrix result to global memory directly or through shared memory, it fails either way. I am not sure whether your implementation on Metal still works correctly. The error logs are shown below:

```
Exception: WebGPU validation failed. Error while parsing WGSL: :131:38 error: 'subgroupMatrixStore' requires argument 1 to be uniform
subgroupMatrixStore(&output, matrix_c_offset + subtile_id * m_dim * uniforms.N + 3 * n_dim, matC03, false, uniforms.N);
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:89:31 note: builtin 'local_idx' of 'main' may be non-uniform
let subtile_id = u32(local_idx / sg_size);
                     ^^^^^^^^^

 - While calling [Device].CreateShaderModule([ShaderModuleDescriptor]).

Exception: WebGPU validation failed. Error while parsing WGSL: :163:29 error: 'subgroupMatrixStore' requires argument 0 to be uniform
subgroupMatrixStore(&scratch[subtile_id], 0, matC03, false, n_dim);
                    ^^^^^^^^^^^^^^^^^^^^
:101:31 note: builtin 'local_idx' of 'main' may be non-uniform
let subtile_id = u32(local_idx / sg_size);
                     ^^^^^^^^^

 - While calling [Device].CreateShaderModule([ShaderModuleDescriptor]).
```
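The root cause, as the log hints, is WGSL's static uniformity analysis: `subtile_id` is derived from the builtin `local_idx`, which the compiler must treat as non-uniform, even though under a contiguous subgroup layout `local_idx / sg_size` is in fact constant within each subgroup. A small Python check of that arithmetic (the contiguous layout is an assumption, not something the WGSL spec guarantees, which is exactly why the compiler rejects the code):

```python
def check_uniform_per_subgroup(workgroup_size: int, sg_size: int) -> bool:
    """Assuming subgroups cover contiguous ranges of local_idx, verify that
    subtile_id = local_idx // sg_size takes a single value within every
    subgroup, i.e. it is subgroup-uniform despite being derived from a
    non-uniform builtin."""
    for first in range(0, workgroup_size, sg_size):  # one contiguous subgroup
        lane_values = {local_idx // sg_size  # the expression WGSL flags
                       for local_idx in range(first, first + sg_size)}
        if len(lane_values) != 1:
            return False
    return True
```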

@xhcao
Contributor Author

xhcao commented Jun 6, 2025

After tuning the workgroup size (128 -> 256) and the tile size of A, and removing the shared-memory tile of C, performance is now ~20% better than the dp4a code path on LNL with Windows + Vulkan.
It is ready for review now. PTAL.

@xhcao
Contributor Author

xhcao commented Jun 9, 2025

@sushraja-msft Could you help review this PR? Thanks.

@xhcao xhcao force-pushed the subgroup-matrix-on-intel branch from 7a712bc to fec4b17 Compare June 10, 2025 08:12
@fs-eire
Contributor

fs-eire commented Jun 11, 2025

It looks like no code path will run subgroup matrix in wasm. (Correct me if I'm wrong.)

Maybe simply excluding subgroup_matrix_matmul_nbits.cc and subgroup_matrix_matmul_nbits.h from the WebAssembly build is an easier way to fix the wasm build?

@xhcao
Contributor Author

xhcao commented Jun 11, 2025

> It looks like no code path will run subgroup matrix in wasm. (Correct me if I'm wrong.)
>
> Maybe simply excluding subgroup_matrix_matmul_nbits.cc and subgroup_matrix_matmul_nbits.h from the WebAssembly build is an easier way to fix the wasm build?

You are right. Done. Thanks.

@xhcao
Contributor Author

xhcao commented Jun 12, 2025

Sorry, that was my fault: I excluded one file too many from the wasm target. I had built the wasm target on my local machine but did not hit the failure there.
Is the CUDA bot failure related to my PR?

@fs-eire
Contributor

fs-eire commented Jun 12, 2025

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 5 pipeline(s).

@fs-eire fs-eire merged commit 82c1bf9 into microsoft:main Jun 14, 2025
83 checks passed