Improving quantized matmul performance by devectorizing shader. #15274

trivedivivek · 2025-10-20T15:46:03Z

Summary:
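This diff improves the performance of the quantized matrix multiplication shader by devectorizing it: the `VEC4_T` accumulator tile is replaced with a flat array of scalar `T`, and each vector operation becomes a loop over the four components.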

An example modification is shown below:

```glsl
// Before
VEC4_T sums[TILE_ROWS][TILE_TXCOLS];

// After
T sums[TILE_ROWS * TILE_TXCOLS * 4];

// Before
sums[r][${c}] = VEC4_T(0.0);

// After
for (int j = 0; j < 4; j++) {
    sums[r * TILE_TXCOLS * 4 + ${c} * 4 + j] = T(0.0);
}
```

Differential Revision: D85023829

pytorch-bot · 2025-10-20T15:46:06Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15274

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

AWS was down, GHA infrastructure affected / recovering

❌ 1 New Failure, 124 Pending

As of commit 81e68f6 with merge base 5d71c9b:

NEW FAILURE - The following job has failed:

Test Metal Backend / export-voxtral-metal-artifact / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 2

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2025-10-20T15:46:10Z

@trivedivivek has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85023829.

trivedivivek requested a review from SS-JIA as a code owner October 20, 2025 15:46

meta-cla bot added the CLA Signed label Oct 20, 2025

meta-codesync bot added the fb-exported and meta-exported labels Oct 20, 2025

trivedivivek added the release notes: vulkan label (Changes to the Vulkan backend delegate) Oct 20, 2025

trivedivivek force-pushed the export-D85023829 branch 2 times, most recently from 4913210 to b831b18 October 20, 2025 22:09

trivedivivek force-pushed the export-D85023829 branch from b831b18 to 88f7aa0 October 21, 2025 14:01

trivedivivek force-pushed the export-D85023829 branch from 88f7aa0 to 1a04777 October 21, 2025 14:40

SS-JIA approved these changes Oct 21, 2025

trivedivivek added 2 commits October 21, 2025 09:30

Minor perf improvements to quantized mat mul shader. (pytorch#15261)

c2efb80

Summary: The diff includes minor performance improvements to the quantized matrix multiplication shader. Reviewed By: SS-JIA Differential Revision: D84998542

trivedivivek force-pushed the export-D85023829 branch from 1a04777 to 81e68f6 October 21, 2025 16:31
