[New Feature] CUTLASS kernels for w4a8 quantization #64

Open
supriyar opened this issue Mar 18, 2024 · 4 comments
Labels: enhancement (New feature or request)

Comments

supriyar commented Mar 18, 2024

We plan to add QAT (quantization-aware training) for LLMs to torchao, as mentioned in the original RFC (#47).

For this to run efficiently on the GPU, we'd need kernel support for W4A8 quantization (int4 weights, int8 activations).
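
For reference, here is a minimal sketch of the arithmetic such a kernel has to implement: int8 activations against int4 weights packed two per byte, accumulated in int32. The kernel name, the K-major nibble packing, and the one-thread-per-output layout are hypothetical illustrations, not the CUTLASS design discussed in this issue.

```cuda
#include <cstdint>

// Extract the low (hi == 0) or high (hi == 1) nibble of a byte as a
// signed int4 value in [-8, 7], via shift-left + arithmetic shift-right.
__device__ __forceinline__ int unpack_int4(uint8_t b, int hi) {
    int8_t shifted = hi ? (int8_t)b : (int8_t)(b << 4);
    return (int)(shifted >> 4);
}

// Naive reference: C[m][n] = sum_k A[m][k] * W[k][n], with A int8 and W
// int4 packed two values per byte along K. One thread per output element.
__global__ void w4a8_gemm_ref(const int8_t* A, const uint8_t* W_packed,
                              int32_t* C, int M, int N, int K) {
    int m = blockIdx.y * blockDim.y + threadIdx.y;
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= M || n >= N) return;
    int32_t acc = 0;
    for (int k = 0; k < K; ++k) {
        uint8_t byte = W_packed[(k / 2) * N + n];
        acc += (int32_t)A[m * K + k] * unpack_int4(byte, k & 1);
    }
    // int32 accumulator; dequantization scales would be applied downstream.
    C[m * N + n] = acc;
}
```

A real CUTLASS kernel would instead stage tiles through shared memory and feed tensor cores, but the quantization semantics are the same.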

Other places where this has been raised before:

- NVIDIA/cutlass#1316
- NVIDIA/cutlass#1370

cc @andrewor14

supriyar commented:

cc @alexsamardzic @cpuhrsch

alexsamardzic commented:

Working on this: NVIDIA/cutlass#1413.

alexsamardzic removed their assignment Mar 21, 2024
jeromeku commented:

@alexsamardzic

Great work so far on integrating w4a8 GEMM in Cutlass!

Do you have plans to re-implement this functionality for pre-Hopper architectures using CUTLASS 3.x / CuTe rather than the CUTLASS 2.x APIs, which seem to be deprecated?

The 3.x interface has some convenient sub-byte primitives for slicing 4-bit tensors, but warp-level shuffling would still be needed for efficient tensor-core loading and MMA (sketched below).
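
To make that concrete, here is a rough sketch of the two register-level ingredients: sub-byte extraction and warp-level data exchange. The helper names and the lane mapping are made up for illustration; the fragment layout a real mma.sync expects is architecture-specific and not shown.

```cuda
#include <cstdint>

// Each lane holds eight int4 values packed into one 32-bit register.
// Shift nibble i to the top bits, then arithmetic-shift down to sign-extend.
__device__ __forceinline__ int get_int4(uint32_t packed, int i) {
    return ((int32_t)(packed << (28 - 4 * i))) >> 28;
}

// Exchange packed fragments across the warp so each lane ends up holding
// the values its tensor-core fragment expects (the src_lane mapping here
// is a placeholder for the real, layout-dependent permutation).
__device__ __forceinline__ uint32_t exchange_fragment(uint32_t packed,
                                                      int src_lane) {
    return __shfl_sync(0xFFFFFFFFu, packed, src_lane);
}
```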

I'd be happy to help adapt the 4-bit mixed-type GEMM to CuTe for Ampere.

alexsamardzic commented:

> Do you have plans to re-implement this functionality for pre-Hopper architectures using CUTLASS 3.x / CuTe rather than the CUTLASS 2.x APIs, which seem to be deprecated?

(Please send further comments to the PR mentioned above; I think it makes the most sense to discuss CUTLASS features on the CUTLASS GitHub pages.)

As can be seen from my PR, this feature is implemented the same way as F16/S8 and similar mixed-type GEMMs. For my purpose, which is adding support for this operation to PyTorch, for the Ampere architecture and for both eager and compiled mode, this is good enough. I'm not sure how my changes could be made more 3.x-style, as the functionality is implemented at the warp level, but if you have any suggestions, please post them either in that PR or in a separate one.

msaroufim added the enhancement (New feature or request) label May 7, 2024