[WOQ] Add CUDA kernel for _weight_int8pack_mm #159325
Conversation
This appears to be a diff that was exported from Phabricator, but the PR author does not have sufficient permissions to run CI. @bbeckca, please complete step 2 of the internal wiki to get write access so you do not need CI approvals in the future. If you think this is a mistake, please contact the PyTorch Dev Infra team.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159325
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
✅ You can merge normally! (1 unrelated failure) As of commit a2a77d3 with merge base 74a754a, one job is marked as UNSTABLE, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D79042656
Attention! native_functions.yaml was changed. If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs: one that adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info. Caused by:
Summary:

**Summary**
This issue proposes implementing a CUDA kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently only supported on CPU. On CUDA, the fallback path uses an unfused .mul().sum() pattern in quantization.py, which is less efficient for inference. pytorch#158849

**Motivation**
A fused GPU kernel for aten._weight_int8pack_mm would:
- Eliminate reliance on the .mul().sum() fallback in quantization.py
- Improve performance for quantized inference on CUDA
- Extend Inductor’s GPU quantization support across more workloads

**Implementation**
- Implement a Triton kernel for:
```
out[b, n] = sum_k(x[b, k] * w[n, k]) * scale[n]

where:
  x:     [B, K] float32
  w:     [N, K] int8
  scale: [N]    float32
  out:   [B, N] float32
```
- Integrate the kernel with register_woq_mm_ops() in torch/_inductor/quantized_lowerings.py
- Route it conditionally in quantization.py where GPU currently falls back to .mul().sum()
- Add unit tests comparing results to the reference fallback path (a minimal sketch of that comparison follows this summary)

Test Plan:
```
buck2 run 'fbcode//mode/opt' :linalg test_linalg.TestLinalgCUDA.test__int8_mm_m_64_k_64_n_64_compile_True_slice_True_cuda
```
Log: P1882799769
```
buck2 test 'fbcode//mode/opt' caffe2/test:linalg
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/6755399722424741/

Rollback Plan:

Differential Revision: D79042656
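For context, the comparison described in the last Implementation bullet can be sketched directly in PyTorch. This is a hedged, minimal example rather than code from the PR: the shapes and tolerances are illustrative, and it assumes the op is exposed as `torch._weight_int8pack_mm` and accepts float32 activations and scales with a plain [N, K] int8 weight.

```python
import torch

def woq_int8_mm_reference(x, w_int8, scale):
    # Unfused fallback: out[b, n] = sum_k(x[b, k] * w[n, k]) * scale[n]
    # x: [B, K] float32, w_int8: [N, K] int8, scale: [N] float32
    return (x.unsqueeze(1) * w_int8.to(x.dtype)).sum(dim=-1) * scale

B, K, N = 8, 64, 32  # illustrative sizes, not the PR's test shapes
x = torch.randn(B, K)
w = torch.randint(-128, 128, (N, K), dtype=torch.int8)
s = torch.rand(N)

ref = woq_int8_mm_reference(x, w, s)
out = torch._weight_int8pack_mm(x, w, s)  # fused op (CPU here; CUDA with this PR)
torch.testing.assert_close(out, ref, rtol=1e-4, atol=1e-4)
```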
Force-pushed from d12ed5a to 9b54f27
Force-pushed from 9b54f27 to ee9d7f9
Force-pushed from ee9d7f9 to e364903
Thanks, lgtm!
Force-pushed from e364903 to 2d748c7
Force-pushed from 2d748c7 to 2f2c09d
Force-pushed from 6c90cba to 2f821d4
Force-pushed from 2f821d4 to e38f060
Force-pushed from e38f060 to 78124c6
Force-pushed from 78124c6 to a2a77d3
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started
Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
**Summary**
This issue proposes implementing a CUDA kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently only supported on CPU. On CUDA, the fallback path uses an unfused .mul().sum() pattern in quantization.py, which is less efficient for inference. pytorch#158849

**Motivation**
A fused GPU kernel for aten._weight_int8pack_mm would:
- Eliminate reliance on the .mul().sum() fallback in quantization.py
- Improve performance for quantized inference on CUDA
- Extend Inductor’s GPU quantization support across more workloads

**Implementation**
- Implement a Triton kernel for:
```
out[b, n] = sum_k(x[b, k] * w[n, k]) * scale[n]

where:
  x:     [B, K] float32
  w:     [N, K] int8
  scale: [N]    float32
  out:   [B, N] float32
```
- Integrate the kernel with register_woq_mm_ops() in torch/_inductor/quantized_lowerings.py
- Route it conditionally in quantization.py where GPU currently falls back to .mul().sum()
- Add unit tests comparing results to the reference fallback path

Test Plan:
```
buck2 run 'fbcode//mode/opt' :linalg test_linalg.TestLinalgCUDA.test__int8_mm_m_64_k_64_n_64_compile_True_slice_True_cuda
```
Log: P1882799769
```
buck2 test 'fbcode//mode/opt' caffe2/test:linalg
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/6755399722424741/

Benchmark Results:
```
[Shape B=256, K=1024, N=512]
CPU and CUDA outputs match
Max abs diff: 2.59e-04, max rel diff: 0.75
CPU: 144.14 ms, CUDA: 303.67 µs
Speedup: ×474.6

[Shape B=512, K=2048, N=1024]
CPU and CUDA outputs match
Max abs diff: 5.49e-04, max rel diff: 0.15
CPU: 1173.27 ms, CUDA: 2.40 ms
Speedup: ×488.5
```

Rollback Plan:

Differential Revision: D79042656
Pull Request resolved: pytorch#159325
Approved by: https://github.com/danielvegamyhre, https://github.com/jerryzh168
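The benchmark comparison above can be approximated with a small, self-contained script along the following lines. This is a hedged sketch, not the PR's benchmark harness: the timing helper, iteration count, and tolerances are assumptions, and it presumes `torch._weight_int8pack_mm` is dispatchable on both CPU and CUDA in your build.

```python
import time
import torch

def bench(fn, iters=20):
    # Crude wall-clock timing; synchronize CUDA so async kernel time is counted.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

B, K, N = 256, 1024, 512  # first shape from the benchmark table above
x = torch.randn(B, K)
w = torch.randint(-128, 128, (N, K), dtype=torch.int8)
s = torch.rand(N)
x_cu, w_cu, s_cu = x.cuda(), w.cuda(), s.cuda()

# Check that the CUDA kernel matches the CPU result before timing.
out_cpu = torch._weight_int8pack_mm(x, w, s)
out_cu = torch._weight_int8pack_mm(x_cu, w_cu, s_cu)
torch.testing.assert_close(out_cu.cpu(), out_cpu, rtol=1e-2, atol=1e-3)

t_cpu = bench(lambda: torch._weight_int8pack_mm(x, w, s))
t_cu = bench(lambda: torch._weight_int8pack_mm(x_cu, w_cu, s_cu))
print(f"CPU: {t_cpu * 1e3:.2f} ms, CUDA: {t_cu * 1e3:.2f} ms, speedup: x{t_cpu / t_cu:.1f}")
```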
Summary: This issue proposes implementing an XPU kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently only supported on CPU and CUDA. Motivation: Same as #159325. Pull Request resolved: #160938 Approved by: https://github.com/EikanWang, https://github.com/ZhiweiYan-96, https://github.com/liangan1, https://github.com/jerryzh168
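With the op now covered on CPU, CUDA, and (per this follow-up) XPU, callers that also need to run on backends without a registered kernel can keep the unfused pattern as a guard. The wrapper below is a hypothetical convenience, not part of either PR:

```python
import torch

def weight_int8pack_mm_or_fallback(x, w_int8, scale):
    # Prefer the fused kernel; fall back to the unfused mul/sum pattern if the
    # current device/build does not register _weight_int8pack_mm.
    try:
        return torch._weight_int8pack_mm(x, w_int8, scale)
    except (RuntimeError, NotImplementedError):
        return (x.unsqueeze(1) * w_int8.to(x.dtype)).sum(dim=-1) * scale
```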
Summary:
What: Unskip the CUDA path for test_int8_weight_only_quant in test_torchinductor.py, as the kernel was added by #159325.
Why: Confirm the CUDA backend for _weight_int8pack_mm is registered.
Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:test_inductor_cuda
```
https://www.internalfb.com/intern/testinfra/testrun/2533275104869494
Differential Revision: D82926440
Pull Request resolved: #163461
Approved by: https://github.com/jerryzh168
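A standalone eager-mode check in the same spirit as the unskipped Inductor test might look like the following. It is a hedged sketch: the test class, shapes, and tolerances are assumptions, and it only verifies that the CUDA kernel is registered and agrees with the reference fallback.

```python
import unittest
import torch

class TestWeightInt8PackMMCuda(unittest.TestCase):
    @unittest.skipUnless(torch.cuda.is_available(), "requires CUDA")
    def test_matches_reference_fallback(self):
        B, K, N = 64, 64, 64  # mirrors the m_64_k_64_n_64 case named in the test plan
        x = torch.randn(B, K, device="cuda")
        w = torch.randint(-128, 128, (N, K), dtype=torch.int8, device="cuda")
        s = torch.rand(N, device="cuda")
        out = torch._weight_int8pack_mm(x, w, s)  # should dispatch to the CUDA kernel
        ref = (x.unsqueeze(1) * w.to(x.dtype)).sum(dim=-1) * s
        torch.testing.assert_close(out, ref, rtol=1e-3, atol=1e-3)

if __name__ == "__main__":
    unittest.main()
```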