
Conversation

@desertfire (Contributor) commented Oct 10, 2025

Stack from ghstack (oldest at bottom):

Summary: When quantizing a model with 4w_hqq (huggingface/optimum-executorch#164), AOTI-generated code calls aoti_torch_cuda__weight_int4pack_mm as a fallback op. This PR borrows the CUDA implementation of _weight_int4pack_mm_cuda from libtorch, replacing at::Tensor and the relevant utility functions with their ExecuTorch (ET) equivalents.
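
For orientation, the ported fallback entry point roughly takes the shape sketched below. This is an illustrative sketch rather than the exact code in this PR: Tensor and AOTITorchError are the ExecuTorch AOTI shim types, and the int64_t type for qGroupSize is an assumption.
```cpp
// Rough shape of the ported fallback entry point (illustrative only; the
// exact parameter types in the PR may differ). The shim is extern "C" so
// AOTI-generated code can resolve it via dlsym(), the output is returned
// through a pointer-to-pointer instead of by value, and errors are reported
// via ET error codes instead of exceptions.
#ifdef __cplusplus
extern "C" {
#endif

AOTITorchError aoti_torch_cuda__weight_int4pack_mm(
    Tensor* self,            // [m, k] BFloat16 activations on CUDA
    Tensor* mat2,            // packed Int32 tensor-core-layout INT4 weights
    int64_t qGroupSize,      // quantization group size: 32/64/128/256 (type assumed)
    Tensor* qScaleAndZeros,  // [k/qGroupSize][n][2] BFloat16 scales and zero points
    Tensor** ret0);          // output [m, n] BFloat16, allocated by the callee

#ifdef __cplusplus
}
#endif
```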

Using the Voxtral runner as an example:

With the bfloat16 format, here is the generated ptd file size and latency.

aoti_cuda_blob.ptd: 9.0 GB

Program load latency (ms): 0.054
Method load latency (ms):
  audio_encoder: 1492.989
  token_embedding: 803.561
  text_decoder: 6556.770
Run latency (ms):
  audio_encoder: 76.848
  token_embedding: 6.479
  text_decoder: 149.128

With `--qlinear 4w_hqq --qlinear_encoder 4w_hqq`, the ptd file size is cut by more than half, with slowdowns in the encoder and decoder.

aoti_cuda_blob.ptd: 3.7 GB

Program load latency (ms): 0.051
Method load latency (ms):
  audio_encoder: 716.667
  token_embedding: 633.476
  text_decoder: 1840.760
Run latency (ms):
  audio_encoder: 329.274
  token_embedding: 4.285
  text_decoder: 335.590

Differential Revision: D84395275

desertfire added a commit that referenced this pull request Oct 10, 2025
ghstack-source-id: a543a05
Pull Request resolved: #15030
@meta-cla bot added the CLA Signed label Oct 10, 2025
pytorch-bot commented Oct 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15030

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 3 New Failures, 3 Cancelled Jobs

As of commit 892aa33 with merge base 9b03c13:

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


This PR needs a `release notes:` label

If your change should be included in the release notes (i.e., would users of this library care about this change?), please use a label starting with `release notes:`. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example:
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

cuda_shim_cpp_unittest("aoti_torch__reinterpret_tensor")
cuda_shim_cpp_unittest("aoti_torch_copy_")
cuda_shim_cpp_unittest("aoti_torch_cuda_guard")
cuda_shim_cpp_unittest("aoti_torch_cuda__weight_int4pack_mm")
@desertfire (Contributor, Author) commented:

@larryliu0820, I didn't find a CMakeLists.txt for all these unit tests. I suppose we can only test them in fbcode?

@mergennachin (Contributor) left a comment:

See inline for additional documentation (I used Claude Code to generate the docs).

This is great, thank you!

// This is a clone of aten/src/ATen/native/cuda/int4mm.cu from PyTorch,
// with at::Tensor replaced with ETensor and aten utility functions/macros
// replaced with their executorch equivalents.
//

Let's add these lines for future reference:

  // This file is a port of PyTorch's int4mm.cu kernel implementation
  // (aten/src/ATen/native/cuda/int4mm.cu) adapted for the ExecuTorch runtime.
  //
  // PORTING NOTES:
  // --------------
  // 1. KERNEL CODE (lines 36-1067): Identical to PyTorch - preserved 100%
  //    - All utility templates, vector types, and conversion logic unchanged
  //    - Tensor core kernels (tinygemm_m16n8k16_chunk_kernel) byte-for-byte identical
  //    - Same inline PTX assembly for mma.sync.aligned instructions
  //    - Identical performance characteristics and register allocation
  //
  // 2. API ADAPTATIONS:
  //    - Replaced at::Tensor with executorch::backends::aoti::Tensor
  //    - Changed from C++ API to extern "C" for AOTI dlsym() compatibility
  //    - Output returned via pointer-to-pointer instead of by-value
  //    - Error handling uses ET return codes instead of exceptions
  //
  // 3. REMOVED FEATURES:
  //    - _convert_weight_to_int4pack_cuda(): Weight conversion happens offline
  //      during model export via optimum-executorch. Runtime only consumes
  //      pre-packed weights.
  //    - isCDNA2orLater() runtime check: Removed dependency on ATen GPU detection
  //      hooks. ROCm support relies on compile-time guards only.
  //
  // 4. INFRASTRUCTURE CHANGES:
  //    - Removed c10::cuda::CUDAGuard: Device management handled by AOTI backend
  //    - Removed at::cuda::getCurrentCUDAStream(): Stream passed explicitly
  //    - Input validation simplified: AOTI pre-validates tensors during export

ret0 != nullptr,
InvalidArgument,
"aoti_torch_cuda__weight_int4pack_mm failed: ret0 is null");


ET_CHECK_OR_RETURN_ERROR(
        qGroupSize == 32 || qGroupSize == 64 || qGroupSize == 128 || qGroupSize == 256,
        InvalidArgument,
        "aoti_torch_cuda__weight_int4pack_mm: qGroupSize must be 32/64/128/256, got %lld",
        static_cast<long long>(qGroupSize));

#endif

AOTITorchError aoti_torch_cuda__weight_int4pack_mm(
Tensor* self,

Should we check whether self is bfloat16?


AOTITorchError aoti_torch_cuda__weight_int4pack_mm(
Tensor* self,
Tensor* mat2,

Check whether mat2 is int32.
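
A minimal sketch of what this check and the bfloat16 check suggested above could look like, following the ET_CHECK_OR_RETURN_ERROR pattern proposed earlier for qGroupSize. It assumes the ET Tensor exposes scalar_type() and that executorch::aten::ScalarType provides BFloat16 and Int codes; neither detail is confirmed by this PR.
```cpp
// Hypothetical validation sketch (not taken from this PR): reject wrong input
// dtypes early, using the same error-return convention as the qGroupSize
// check suggested above. Assumes Tensor::scalar_type() and the ScalarType
// enum values below exist in the ET runtime.
ET_CHECK_OR_RETURN_ERROR(
    self->scalar_type() == executorch::aten::ScalarType::BFloat16,
    InvalidArgument,
    "aoti_torch_cuda__weight_int4pack_mm: self must be BFloat16");
ET_CHECK_OR_RETURN_ERROR(
    mat2->scalar_type() == executorch::aten::ScalarType::Int,
    InvalidArgument,
    "aoti_torch_cuda__weight_int4pack_mm: mat2 must be Int32");
```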

#ifdef __cplusplus
extern "C" {
#endif


  /**
   * Performs quantized INT4 matrix multiplication.
   *
   * INT4 weights are stored in a packed tensor core layout optimized for
   * NVIDIA Ampere+ GPUs (sm_80+) using m16n8k16 tensor core tiles.
   *
   * HARDWARE REQUIREMENTS:
   * - CUDA Compute Capability >= 8.0 (Ampere or later)
   * - BFloat16 support (native on sm_80+)
   *
   * TENSOR REQUIREMENTS:
   * @param self Input activation matrix [m, k]
   *   - Must be BFloat16 dtype
   *   - Must be 2D
   *   - Must be on CUDA device
   *   - Row-major layout (contiguous)
   *
   * @param mat2 Quantized weight matrix in packed tensor core layout
   *   - Must be Int32 dtype (contains packed INT4 values)
   *   - Must be 4D: [n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]
   *     where InnerKTiles = 2, 4, or 8
   *   - Each Int32 contains 8 packed INT4 values
   *   - Layout optimized for tensor core access patterns
   *   - Must be on CUDA device
   *
   * @param qGroupSize Quantization group size (number of values sharing scale/zero)
   *   - Must be one of: 32, 64, 128, or 256
   *   - Smaller groups = higher accuracy but more metadata
   *   - Must evenly divide k dimension
   *
   * @param qScaleAndZeros Dequantization parameters [k/qGroupSize][n][2]
   *   - Must be BFloat16 dtype
   *   - Must be 3D
   *   - [:, :, 0] contains scales
   *   - [:, :, 1] contains zero points
   *   - Must be on CUDA device
   *
   * @param ret0 Output parameter for result matrix [m, n]
   *   - Allocated by this function as BFloat16
   *   - Must not be null
   *   - Caller is responsible for freeing via aoti_torch_delete_tensor_object()
   *
   * @return AOTITorchError error code:
   *   - Error::Ok: Success
   *   - Error::InvalidArgument: Null pointer, wrong dtype, wrong dimensions,
   *     or invalid qGroupSize
   *   - Error::Internal: CUDA kernel launch failure
   */
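
Given that contract, a call site would look roughly like the sketch below. The tensor handles are hypothetical names, and the comparison against Error::Ok assumes AOTITorchError aliases the ET Error type; only aoti_torch_cuda__weight_int4pack_mm and aoti_torch_delete_tensor_object come from the documentation above.
```cpp
// Illustrative call-site sketch based on the documented contract above.
// activations_bf16, packed_int4_weights, and scales_and_zeros are assumed to
// be valid CUDA tensors prepared elsewhere (hypothetical names).
Tensor* out = nullptr;
AOTITorchError err = aoti_torch_cuda__weight_int4pack_mm(
    /*self=*/activations_bf16,            // [m, k], BFloat16
    /*mat2=*/packed_int4_weights,         // packed Int32 INT4 weights
    /*qGroupSize=*/128,                   // one of 32/64/128/256
    /*qScaleAndZeros=*/scales_and_zeros,  // [k/qGroupSize][n][2], BFloat16
    /*ret0=*/&out);
if (err == Error::Ok) {
  // ... consume the [m, n] BFloat16 result ...
  aoti_torch_delete_tensor_object(out);  // caller frees the output tensor
}
```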


// Enum for supported data types in et-cuda backend
enum class SupportedDTypes : int32_t {
INT32 = 3, // PyTorch's int64 dtype code

The comment should say "PyTorch's int32 dtype code", not int64.

@mergennachin (Contributor) commented Oct 12, 2025

Wait, why is the "Run latency" slower with int4 than with bfloat16? cc @swolchok


Labels: CLA Signed, topic: not user facing
