
Conversation

@desertfire (Contributor) commented Oct 10, 2025

Stack from ghstack (oldest at bottom):

Summary: When quantizing a model with 4w_hqq (huggingface/optimum-executorch#164), AOTI-generated code calls aoti_torch_cuda__weight_int4pack_mm as a fallback op. This PR borrows the CUDA implementation of _weight_int4pack_mm_cuda from libtorch, replacing at::Tensor and the relevant utility functions with their ExecuTorch (ET) equivalents.
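
For orientation, the ported fallback entry point roughly takes the shape sketched below. This is an illustrative sketch rather than the exact code in this PR: Tensor and AOTITorchError are the ExecuTorch AOTI shim types, and the int64_t type for qGroupSize is an assumption.
```cpp
// Rough shape of the ported fallback entry point (illustrative only; the
// exact parameter types in the PR may differ). The shim is extern "C" so
// AOTI-generated code can resolve it via dlsym(), the output is returned
// through a pointer-to-pointer instead of by value, and errors are reported
// via ET error codes instead of exceptions.
#ifdef __cplusplus
extern "C" {
#endif

AOTITorchError aoti_torch_cuda__weight_int4pack_mm(
    Tensor* self,            // [m, k] BFloat16 activations on CUDA
    Tensor* mat2,            // packed Int32 tensor-core-layout INT4 weights
    int64_t qGroupSize,      // quantization group size: 32/64/128/256 (type assumed)
    Tensor* qScaleAndZeros,  // [k/qGroupSize][n][2] BFloat16 scales and zero points
    Tensor** ret0);          // output [m, n] BFloat16, allocated by the callee

#ifdef __cplusplus
}
#endif
```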

Using the Voxtral runner as an example:

With the bfloat16 format, here is the generated ptd file size and latency.

aoti_cuda_blob.ptd: 9.0 GB

Program load latency (ms): 0.054
Method load latency (ms):
  audio_encoder: 1492.989
  token_embedding: 803.561
  text_decoder: 6556.770
Run latency (ms):
  audio_encoder: 76.848
  token_embedding: 6.479
  text_decoder: 149.128

With `--qlinear 4w_hqq --qlinear_encoder 4w_hqq`, the ptd file size is cut by more than half, with slowdowns in the encoder and decoder.

aoti_cuda_blob.ptd: 3.7 GB

Program load latency (ms): 0.051
Method load latency (ms):
  audio_encoder: 716.667
  token_embedding: 633.476
  text_decoder: 1840.760
Run latency (ms):
  audio_encoder: 329.274
  token_embedding: 4.285
  text_decoder: 335.590

Differential Revision: D84395275

desertfire added a commit that referenced this pull request Oct 10, 2025
ghstack-source-id: a543a05
Pull Request resolved: #15030
@meta-cla bot added the CLA Signed label Oct 10, 2025
pytorch-bot commented Oct 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15030

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 3 New Failures, 3 Cancelled Jobs

As of commit 892aa33 with merge base 9b03c13:

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


This PR needs a `release notes:` label

If your change should be included in the release notes (i.e., would users of this library care about this change?), please use a label starting with `release notes:`. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example:
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

cuda_shim_cpp_unittest("aoti_torch__reinterpret_tensor")
cuda_shim_cpp_unittest("aoti_torch_copy_")
cuda_shim_cpp_unittest("aoti_torch_cuda_guard")
cuda_shim_cpp_unittest("aoti_torch_cuda__weight_int4pack_mm")
@desertfire (Contributor, Author) commented:

@larryliu0820, I didn't find a CMakeLists.txt for all these unit tests. I suppose we can only test them in fbcode?

@mergennachin (Contributor) left a comment:

See inline for additional documentation (I used Claude Code to generate the docs).

This is great, thank you!

// This is a clone of aten/src/ATen/native/cuda/int4mm.cu from PyTorch,
// with at::Tensor replaced with ETensor and aten utility functions/macros
// replaced with their executorch equivalents.
//

Let's add these lines for future reference:

  // This file is a port of PyTorch's int4mm.cu kernel implementation
  // (aten/src/ATen/native/cuda/int4mm.cu) adapted for the ExecuTorch runtime.
  //
  // PORTING NOTES:
  // --------------
  // 1. KERNEL CODE (lines 36-1067): Identical to PyTorch - preserved 100%
  //    - All utility templates, vector types, and conversion logic unchanged
  //    - Tensor core kernels (tinygemm_m16n8k16_chunk_kernel) byte-for-byte identical
  //    - Same inline PTX assembly for mma.sync.aligned instructions
  //    - Identical performance characteristics and register allocation
  //
  // 2. API ADAPTATIONS:
  //    - Replaced at::Tensor with executorch::backends::aoti::Tensor
  //    - Changed from C++ API to extern "C" for AOTI dlsym() compatibility
  //    - Output returned via pointer-to-pointer instead of by-value
  //    - Error handling uses ET return codes instead of exceptions
  //
  // 3. REMOVED FEATURES:
  //    - _convert_weight_to_int4pack_cuda(): Weight conversion happens offline
  //      during model export via optimum-executorch. Runtime only consumes
  //      pre-packed weights.
  //    - isCDNA2orLater() runtime check: Removed dependency on ATen GPU detection
  //      hooks. ROCm support relies on compile-time guards only.
  //
  // 4. INFRASTRUCTURE CHANGES:
  //    - Removed c10::cuda::CUDAGuard: Device management handled by AOTI backend
  //    - Removed at::cuda::getCurrentCUDAStream(): Stream passed explicitly
  //    - Input validation simplified: AOTI pre-validates tensors during export

ret0 != nullptr,
InvalidArgument,
"aoti_torch_cuda__weight_int4pack_mm failed: ret0 is null");


ET_CHECK_OR_RETURN_ERROR(
        qGroupSize == 32 || qGroupSize == 64 || qGroupSize == 128 || qGroupSize == 256,
        InvalidArgument,
        "aoti_torch_cuda__weight_int4pack_mm: qGroupSize must be 32/64/128/256, got %lld",
        static_cast<long long>(qGroupSize));

#endif

AOTITorchError aoti_torch_cuda__weight_int4pack_mm(
Tensor* self,

Should we check whether self is bfloat16?


AOTITorchError aoti_torch_cuda__weight_int4pack_mm(
Tensor* self,
Tensor* mat2,

Check whether mat2 is int32.
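
A minimal sketch of what this check and the bfloat16 check suggested above could look like, following the ET_CHECK_OR_RETURN_ERROR pattern proposed earlier for qGroupSize. It assumes the ET Tensor exposes scalar_type() and that executorch::aten::ScalarType provides BFloat16 and Int codes; neither detail is confirmed by this PR.
```cpp
// Hypothetical validation sketch (not taken from this PR): reject wrong input
// dtypes early, using the same error-return convention as the qGroupSize
// check suggested above. Assumes Tensor::scalar_type() and the ScalarType
// enum values below exist in the ET runtime.
ET_CHECK_OR_RETURN_ERROR(
    self->scalar_type() == executorch::aten::ScalarType::BFloat16,
    InvalidArgument,
    "aoti_torch_cuda__weight_int4pack_mm: self must be BFloat16");
ET_CHECK_OR_RETURN_ERROR(
    mat2->scalar_type() == executorch::aten::ScalarType::Int,
    InvalidArgument,
    "aoti_torch_cuda__weight_int4pack_mm: mat2 must be Int32");
```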

#ifdef __cplusplus
extern "C" {
#endif


  /**
   * Performs quantized INT4 matrix multiplication.
   *
   * INT4 weights are stored in a packed tensor core layout optimized for
   * NVIDIA Ampere+ GPUs (sm_80+) using m16n8k16 tensor core tiles.
   *
   * HARDWARE REQUIREMENTS:
   * - CUDA Compute Capability >= 8.0 (Ampere or later)
   * - BFloat16 support (native on sm_80+)
   *
   * TENSOR REQUIREMENTS:
   * @param self Input activation matrix [m, k]
   *   - Must be BFloat16 dtype
   *   - Must be 2D
   *   - Must be on CUDA device
   *   - Row-major layout (contiguous)
   *
   * @param mat2 Quantized weight matrix in packed tensor core layout
   *   - Must be Int32 dtype (contains packed INT4 values)
   *   - Must be 4D: [n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]
   *     where InnerKTiles = 2, 4, or 8
   *   - Each Int32 contains 8 packed INT4 values
   *   - Layout optimized for tensor core access patterns
   *   - Must be on CUDA device
   *
   * @param qGroupSize Quantization group size (number of values sharing scale/zero)
   *   - Must be one of: 32, 64, 128, or 256
   *   - Smaller groups = higher accuracy but more metadata
   *   - Must evenly divide k dimension
   *
   * @param qScaleAndZeros Dequantization parameters [k/qGroupSize][n][2]
   *   - Must be BFloat16 dtype
   *   - Must be 3D
   *   - [:, :, 0] contains scales
   *   - [:, :, 1] contains zero points
   *   - Must be on CUDA device
   *
   * @param ret0 Output parameter for result matrix [m, n]
   *   - Allocated by this function as BFloat16
   *   - Must not be null
   *   - Caller is responsible for freeing via aoti_torch_delete_tensor_object()
   *
   * @return AOTITorchError error code:
   *   - Error::Ok: Success
   *   - Error::InvalidArgument: Null pointer, wrong dtype, wrong dimensions,
   *     or invalid qGroupSize
   *   - Error::Internal: CUDA kernel launch failure
   */
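
Given that contract, a call site would look roughly like the sketch below. The tensor handles are hypothetical names, and the comparison against Error::Ok assumes AOTITorchError aliases the ET Error type; only aoti_torch_cuda__weight_int4pack_mm and aoti_torch_delete_tensor_object come from the documentation above.
```cpp
// Illustrative call-site sketch based on the documented contract above.
// activations_bf16, packed_int4_weights, and scales_and_zeros are assumed to
// be valid CUDA tensors prepared elsewhere (hypothetical names).
Tensor* out = nullptr;
AOTITorchError err = aoti_torch_cuda__weight_int4pack_mm(
    /*self=*/activations_bf16,            // [m, k], BFloat16
    /*mat2=*/packed_int4_weights,         // packed Int32 INT4 weights
    /*qGroupSize=*/128,                   // one of 32/64/128/256
    /*qScaleAndZeros=*/scales_and_zeros,  // [k/qGroupSize][n][2], BFloat16
    /*ret0=*/&out);
if (err == Error::Ok) {
  // ... consume the [m, n] BFloat16 result ...
  aoti_torch_delete_tensor_object(out);  // caller frees the output tensor
}
```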


// Enum for supported data types in et-cuda backend
enum class SupportedDTypes : int32_t {
INT32 = 3, // PyTorch's int64 dtype code

The comment should say "PyTorch's int32 dtype code", not int64.

@mergennachin (Contributor) commented Oct 12, 2025

Wait, why is the "Run latency" slower with int4 than with bfloat16? cc @swolchok


Labels: CLA Signed, topic: not user facing
