W4A8 based on CUTLASS #880
base: main
Conversation
The kernel implements W4A8 GEMM, with float16 scaling factors. Zero point support is to be added later; for now, several hacks (to be removed) are put in the code that force ….

There are several points to discuss:

CUTLASS would have to be made a dependency. IMO, the best approach to satisfy the dependency would be to install ….

The group quantization may be a problem. Let's say …. The sum in the last expression could be efficiently calculated as a mixed integer data types GEMM on tensor cores, and the result could then be updated by multiplying the scale factors in. However, if the group size parameter is less than …, the only approach possible in CUTLASS to do this calculation in mixed integer data types on tensor cores would be to split it into ….

Another related issue is zero point handling. Let's say …. Only the first expression within parentheses could be calculated on tensor cores as a mixed integer data types GEMM, while the sums in the next two expressions are best pre-calculated in the case of weight values, or calculated on the fly during input quantization. So it seems to me these are also calling for a specialized type of quantization. (Note also that if group quantization is used, the above mentioned complications for … apply as well.)

All comments/suggestions welcome; in particular, I'm pretty much new to quantization specifics, so please let me know if I'm missing something obvious.
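To make the group-quantization concern above concrete, here is a toy pure-Python sketch (not the PR's CUTLASS kernel; names and shapes are illustrative). With per-channel scales, the whole contraction reduces to one integer GEMM followed by a scale update, while per-group weight scales force one partial integer GEMM per group:

```python
# Toy model of W4A8 scaling:
#   y[m][n] = sum_k a_q[m][k] * a_s[m] * w_q[k][n] * w_s[n]   (per-channel)
#           = a_s[m] * w_s[n] * sum_k a_q[m][k] * w_q[k][n]
# so a single integer GEMM plus a scale update suffices.

def int_gemm(a, b):
    """Plain integer matrix multiply (stand-in for the tensor-core GEMM)."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(n)]
            for i in range(m)]

def w4a8_per_channel(a_q, a_s, w_q, w_s):
    """One integer GEMM, then multiply the per-row/per-column scales in."""
    c = int_gemm(a_q, w_q)
    return [[c[i][j] * a_s[i] * w_s[j] for j in range(len(w_s))]
            for i in range(len(a_s))]

def w4a8_grouped(a_q, a_s, w_q, w_sg, group_size):
    """With group scales w_sg[g][n] (group_size < k), the contraction must
    be split into one partial integer GEMM per group of k."""
    k, n = len(w_q), len(w_q[0])
    m = len(a_q)
    out = [[0.0] * n for _ in range(m)]
    for g in range(k // group_size):
        lo, hi = g * group_size, (g + 1) * group_size
        part = int_gemm([row[lo:hi] for row in a_q], w_q[lo:hi])
        for i in range(m):
            for j in range(n):
                out[i][j] += part[i][j] * a_s[i] * w_sg[g][j]
    return out
```

With a single group covering all of k, the grouped path degenerates to the per-channel one, which is why forcing the group size to equal k keeps the kernel a single GEMM.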
I'm on PTO today and tomorrow so will review asap, apologies for the delay |
@alexsamardzic - Can we use the CUTLASS that ships with PyTorch? As in, should we change PyTorch to ship the headers used to build its CUTLASS kernels / does the PyTorch nightly already ship those? I see the test is using group size 128. I think it's ok if we don't necessarily support all group sizes or shapes right away. We have some int4 support via the pattern matched in https://github.com/pytorch/pytorch/blob/main/torch/_inductor/fx_passes/post_grad.py#L345-L403 which dispatches to https://github.com/pytorch/pytorch/blob/dab7d646d55a2b6696d51dee4816a6743ec1ae5a/torch/_inductor/kernel/unpack_mixed_mm.py#L76 - would an extension for int4x2 X int8 of this be interesting here?
Thanks Mark - it's really just a draft, so not yet ready for review, but it would be useful to discuss points that I mentioned in my comment above. |
This CUTLASS version is also lagging behind. My CUTLASS PR with mixed int4/int8 GEMM was merged after the latest (3.5.1) CUTLASS release; hopefully there will be a new release soon. But in any case, this is the kind of problem we'll have if we use more CUTLASS from torchao - much of the time, the torchao build will have to point to a bleeding-edge CUTLASS checkout.

It uses group size 128 in order to force the weight scale to be a vector, and not a matrix. I tried to explain the issue in my comment above; if group quantization is obligatory here, it's going to be rather complicated to make this work.

I'm just looking into the quantization code, to see whether it is possible to do it there - it's not hard to make this change, but CUTLASS in general doesn't support doing things before the GEMM (while fusing operations after the GEMM is calculated is reasonably well supported), so it would be best if the quantization code actually put the weight values in int4x2 format.
Updated so that there is a new ….
torchao/quantization/quant_api.py (outdated):

@@ -506,6 +508,41 @@ def int8_dynamic_activation_int4_weight(group_size=32, mapping_type=MappingType.
    return _get_linear_subclass_inserter(apply_int8_dynamic_activation_int4_weight_quant, group_size=group_size, mapping_type=mapping_type)

def apply_int8_dynamic_activation_int4_weight_quant_cutlass(weight):
Can this be represented as a different Layout for int8 dynamic activation/int4 weight quantization? Docs for Packing/Layout can be found in #391 "Layout and Packing", and a simplified example is in https://github.com/pytorch/ao/blob/main/tutorials/developer_api_guide/my_dtype_tensor_subclass.py
Thanks for the pointer! Yes, this will need refinement in this and several other places, as I learn about doing things the "torchao way"; but my main goal initially is to connect the dots, so that some benchmarks can be run and we can verify that CUTLASS provides some value here.
Made some minor updates, including added support for bfloat16. Micro-benchmarking script: …

For the particular shapes given in the script above, on A100 the micro-benchmark shows around 2x speedup over the case when float16 MM is used, and around 1.8x speedup over the case when bfloat16 MM is used. (Note that this is for eager-mode execution, as compilation to the corresponding CUTLASS kernel is not yet supported by PyTorch.) Patch to run torchao/_models/llama/generate.py: …

With the patch above, I was able to run Llama …

and the output is as follows (again, this is run on A100): …

while the reference output, for the case when no arguments are supplied to …:

So tokens/sec is more than 3x slower, but this is not even that bad, considering that batch size is 1 here, and that the CUTLASS code has it hard-coded for a block of threads to handle an input tile size of 128 for the same dimension, so most of the work is wasted. So there is room for improvement regarding speed. The text generated is garbage, however. Even for the micro-benchmark above, the output values visibly deviate from the values produced when native precision is used (but at least they resemble each other).
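The micro-benchmarking script itself is not included in this thread; a minimal eager-mode timing helper of the kind typically used for such comparisons might look like this (a generic sketch under stated assumptions - the actual script, shapes, and torch ops are not shown here):

```python
import time

def bench(fn, warmup=3, iters=10):
    """Average wall-clock seconds per call of fn, after a few warmup runs.

    A generic eager-mode micro-benchmark loop; for CUDA kernels, the real
    script would additionally need device synchronization around the timer.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters
```

Usage would be along the lines of `bench(lambda: model(input))` for each variant being compared.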
Made an update - it turns out that CUTLASS actually needs a fix (posted below for now): …

On the other side, I tried adapting the tile sizes processed by a block/warp of threads of the corresponding CUTLASS kernel, in order to adapt to the fact that batch size is 1 here. Here is an example of such a change: …

However, tokens/sec is not much improved this way. Thus, the performance of this kernel for the Llama model will require more work. Edit: CUTLASS fix posted upstream here.
will make a second pass for the kernel code
setup.py
Outdated
@@ -65,6 +65,12 @@ def get_extensions(): | |||
extension = CUDAExtension if use_cuda else CppExtension | |||
|
|||
if not IS_WINDOWS: | |||
import cutlass_library |
Interesting: not too familiar with cutlass packaging, but what is cutlass_library exactly? The only reference I found is https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library
It's a recent addition to CUTLASS: a Python library that is able to generate C++ code for CUTLASS GEMM template instantiation (which is nice to have, as these templates have a dozen or more arguments, and it's oftentimes hard to get them right). It's used in the CUTLASS codegen for TorchInductor, like here. However, recently CUTLASS itself also added functionality to generate and compile C++ code for GEMM kernels from a high-level specification in Python - this is part of the cutlass Python package, see here. Both cutlass and cutlass_library are available through the nvidia-cutlass pip package. It's important to note that this package also contains all of the CUTLASS C++ header files, in order to make it possible to compile the generated C++ kernels.
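A sketch of how a build script might locate those bundled headers (hedged: the `source/include` layout is what the nvidia-cutlass wheel shipped at the time of this discussion and may change; the function name is invented for illustration):

```python
import importlib.util
import os

def find_cutlass_include_dir():
    """Locate the CUTLASS C++ headers shipped inside the nvidia-cutlass
    wheel, via the cutlass_library package it installs.

    Returns None when cutlass_library is not installed or the expected
    'source/include' directory is absent.
    """
    spec = importlib.util.find_spec("cutlass_library")
    if spec is None or spec.origin is None:
        return None
    pkg_dir = os.path.dirname(spec.origin)
    include_dir = os.path.join(pkg_dir, "source", "include")
    return include_dir if os.path.isdir(include_dir) else None
```

A setup.py could then append the result to the extension's include dirs when it is not None, and fall back to a submodule checkout otherwise.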
setup.py (outdated):

    cutlass_library_dir = os.path.dirname(cutlass_library.__file__)
    cutlass_include_dir = os.path.join(cutlass_library_dir, "source", "include")
    # FIXME: remove this once CUTLASS package updated to include int4/int8 MM
    cutlass_include_dir = "/data/quansight/scratch/cutlass/include"
n00b q: what is this exactly? Do you need any help packaging CUTLASS?
I discussed this a bit in my first comment on this PR. In order for ao to compile after this PR is eventually merged, the CUTLASS C++ header files have to be made available. There are at least two ways to do it:
- Make the CUTLASS repo a submodule of the ao repo, just like PyTorch did.
- Make the above mentioned nvidia-cutlass package a dependency of ao.

I'm leaning towards the latter, and this is what the above code, before the "FIXME", is expecting. However, in both of the above cases, we'll certainly face the issue of having to depend on stuff that is not yet merged into CUTLASS, but that we need. For example, at this very moment:
- My CUTLASS PR with int4/int8 GEMM support for CUTLASS is merged, but the CUTLASS team has not made a release in the meantime, so this functionality is only available in the CUTLASS main branch, and the above mentioned nvidia-cutlass package doesn't contain it yet.
- As mentioned in one of my comments above, while working on this PR, I found an omission in CUTLASS. I created a CUTLASS PR with a fix, but this one is not yet merged, so neither the CUTLASS main branch nor the nvidia-cutlass package contains the fix at the moment; it's only available in my branch. So the only way to proceed with the development of my PR was to create a local copy of this branch - I created it in the /data/quansight/scratch/cutlass directory on my machine; in order to try this PR, a local copy of this branch is to be created, and the last line in the snippet above is to be changed to that local directory.

From my experience with this stuff from PyTorch development based on CUTLASS, this is going to be a permanent issue - if we decide to use CUTLASS in ao, then for most of the time we'll need bleeding-edge features. So this is to be discussed further; IMO the best approach would be to build our own nvidia-cutlass package, from whatever CUTLASS branch we find most appropriate.
torchao/quantization/quant_api.py (outdated):

@@ -85,6 +85,7 @@
    "_get_subclass_inserter",
    "quantize_",
    "int8_dynamic_activation_int4_weight",
    "int8_dynamic_activation_int4_weight_cutlass",
Do you have some baseline numbers vs int8_dynamic_activation_int4_weight?
Now that I have the dots connected - in the sense that I can run a micro-benchmark, and also the Llama model, using this kernel - I'm working on more detailed profiling; part of this is also comparing the performance of this kernel with the int8_dynamic_activation_int4_weight kernel. I'll report all my findings here when I'm done with the profiling.
As a quick update here: using the micro-benchmarking script above, it seems this PR is just 3-5% faster than int8_dynamic_activation_int4_weight. However, on the Llama generator, it seems about 2x faster when tokens/sec numbers are compared. (Remember that all the caveats from my first comment above still apply, so let's not jump to any conclusions for now.)
test/test_s8s4_linear_cutlass.py (outdated):

@@ -0,0 +1,51 @@
# FIXME: move this test to the appropriate test file!!!
yeah maybe make yourself a cutlass folder to park all your work
Yes. Again, as mentioned in one of my comments above: at the moment, most of the "FIXME"s in the PR are there because I'm aware that I took shortcuts to make things work. If/when we're happy with the main stuff, I'll revisit all of these and redo them in the proper "ao way".
test/test_s8s4_linear_cutlass.py (outdated):

    output_ref = model(input)

    modelq = copy.deepcopy(model)
    quantize_(modelq, int8_dynamic_activation_int4_weight_cutlass())
Maybe another reference would be the non-CUTLASS variant.
torchao/quantization/quant_api.py (outdated):

    # then corresponding changes made in
    # _linear_int8_act_int4_weight_cutlass_check and for the check in
    # the CUTLASS kernel!!!
    weight.original_weight_tensor.layout_tensor.int_data = (
Maybe a comment like
# Combine pairs of 4-bit values into single bytes
weight.original_weight_tensor.layout_tensor.int_data = (
# Take odd-indexed columns, keep lower 4 bits, shift left by 4 bits
(weight.original_weight_tensor.layout_tensor.int_data[:, 1::2] & 0xF) << 4
) | (
# Take even-indexed columns, keep lower 4 bits
weight.original_weight_tensor.layout_tensor.int_data[:, 0::2] & 0xF
)
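A self-contained pure-Python model of the packing suggested above, plus its inverse, can help sanity-check the layout (a sketch mirroring the even-column-low-nibble / odd-column-high-nibble convention in the snippet; not the PR's actual tensor code):

```python
def pack_int4x2(row):
    """Pack pairs of int4 values (ints in [-8, 7]) into bytes: element 2i
    goes into the low nibble, element 2i+1 into the high nibble."""
    assert len(row) % 2 == 0
    return [((row[i + 1] & 0xF) << 4) | (row[i] & 0xF)
            for i in range(0, len(row), 2)]

def unpack_int4x2(packed):
    """Inverse of pack_int4x2: recover signed int4 values from packed bytes."""
    def sign4(v):
        # Reinterpret a 4-bit pattern as a signed int4.
        return v - 16 if v >= 8 else v
    out = []
    for b in packed:
        out.append(sign4(b & 0xF))
        out.append(sign4((b >> 4) & 0xF))
    return out
```

A round-trip `unpack_int4x2(pack_int4x2(row)) == row` check is a cheap way to verify the convention matches on both the quantization and the kernel side.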
torchao/quantization/quant_api.py
Outdated
""" | ||
return _get_linear_subclass_inserter(apply_int8_dynamic_activation_int4_weight_quant_cutlass) | ||
|
||
|
||
def int4_weight_only(group_size=128, layout_type=TensorCoreTiledLayoutType(inner_k_tiles=8), use_hqq=False): |
Unrelated comment: what is this use_hqq? @jerryzh168 do you know?
Yeah, this means: use the hqq algorithm to choose qparams and quantize the weight. Since it is reusing the tinygemm kernel, we just added this as a separate option here.
    const int n = tensor_b.size(0);
    const int k = tensor_a.size(1);

    constexpr int AlignmentA = 128 / cutlass::sizeof_bits<ElementA>::value;
Nit: mind adding a comment for why 128? Also, how do you think about padding vs erroring?
The 128 bits here is because of how tensor cores work (so it's not CUTLASS-specific), at least for SM 8.x. It's related to the layout of the tiles of matrix operands that a single warp of threads multiplies cooperatively. The best explanation that I've found so far is in the GTC 2020 talk by the CUTLASS team, around slide 15.

We can consider padding (maybe at a later stage?); I believe it would be best to incorporate padding together with the quantization.
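The alignment arithmetic behind that `AlignmentA` expression can be sketched in a few lines (a pure-Python illustration of the 128-bit rule, not CUTLASS code; the padding helper is a sketch of the padding-instead-of-erroring option discussed above):

```python
def required_alignment(element_bits):
    """Minimum alignment, in elements, for a 128-bit tensor-core access:
    128 / sizeof_bits(Element), as in the CUTLASS AlignmentA expression."""
    assert 128 % element_bits == 0
    return 128 // element_bits

def pad_row(row, element_bits, pad_value=0):
    """Pad a row so its length is a multiple of the required alignment."""
    align = required_alignment(element_bits)
    rem = len(row) % align
    return row + [pad_value] * ((align - rem) % align)
```

So int8 activations need 16-element alignment and int4 weights need 32-element alignment on the k dimension.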
    using SmArch = cutlass::arch::Sm80;
    using ThreadblockSwizzle =
        cutlass::gemm::threadblock::ThreadblockSwizzleStreamK;
    constexpr auto NumStages = 4;
cutlass n00b but how do you pick these hyperparams?
These, and others, are the CUTLASS GEMM C++ template arguments. As mentioned above, there are a dozen of these to set, but on the other side only a small number of combinations of these arguments actually works. The above mentioned cutlass_library package enumerates some of these working combinations. CUTLASS itself doesn't include any sort of heuristic for selecting these parameters, for example based on GEMM operand shapes. So I had to hard-code some values, at least for now. The values selected here are based on my previous experimentation with different combinations and different operand shapes - in the sense that these values should provide acceptable performance for a number of cases. But there are certainly cases where these values are not a good fit; Llama inference, having batch size 1, is one such example. So we may want to consider adding some heuristic here, but in the longer term we'd probably prefer to support some auto-tuning, just like what is possible with Triton kernels.
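A shape-based heuristic of the kind mentioned could look roughly like this (a hypothetical sketch: the function name and tile shapes are invented for illustration and are not the values the PR uses):

```python
def pick_threadblock_tile(m):
    """Choose a CUTLASS threadblock tile (M, N, K) from the number of input
    rows M, so that a batch-size-1 decode GEMM doesn't waste most of a
    128-row tile. Tile values are purely illustrative."""
    if m <= 16:
        return (16, 128, 128)   # thin tile for decode-style GEMMs
    if m <= 64:
        return (64, 128, 128)
    return (128, 128, 64)       # large default tile for prefill-style GEMMs
```

In C++ this would translate to dispatching between a few pre-instantiated kernel templates, since the tile shapes are compile-time arguments.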
(Pushed an update, where the branch is just rebased on the latest main.) I did lots of profiling in the meantime, focusing primarily on running the Llama generator (…

and the run for this PR was as follows (with the patch mentioned above applied beforehand): …

TLDR (note that each of these items could be verified by profiling W8A8DQ alone, without using this PR at all): …

As an example for item 1 above, here are the performance results, as printed by …

and when moved to the last place in the list: …

The generator runs are profiled using …

Here is the relevant part of the …. So, for the attention segment of the model, one can see that everything related to running the linear operator takes about 34s in total. Out of this time, 24s are spent in the dynamic quantization, while only about 9.4s are spent on the linear operator itself; and then, out of these 9.4s, only 2.4s are spent on the CUTLASS MM kernel execution, while the rest of the time is spent on checking which kernel to dispatch to (note that for this run, the check for applicability of the CUTLASS kernel is added last to the list) - these checks are not visible in this snippet, as …. Here is the ….

As mentioned above, the profiling results are verified using …

Here is a screenshot of the timeline as shown by …. Here, one can see that loading the model takes about 30s, then there is a short sequence of copying the model to the GPU and doing weight quantization, and then the rest of the timeline is the inference. The CUTLASS MM kernel, designated as ….
@alexsamardzic - Was the model torch.compile'd with mode 'max-autotune'? Also, you can use ….
Good question 😄 Apparently, there is …

and here is a table with results for several different quantization techniques (again, all results are for A100): …

The "tuned" in the last row of the table above means that the size of the tiles that each block of CUDA threads in the CUTLASS-based W4A8 kernel processes is adjusted to the fact that a single input is processed - more precisely, the following patch is applied on top of this PR: …

So there is still a large number of small kernels run between successive invocations of the W4A8 kernel that take considerable time. I'm looking further into this... Note: the PR is rebased on the latest main, and updated so that the only change needed to run the Llama generator is to point the code to the directory where the latest CUTLASS main is checked out. In the meantime, my CUTLASS PR mentioned above was merged, so when version 3.6.0 of CUTLASS is released, it will be enough to install ….
@alexsamardzic So there's a --compile and a --compile-prefill option. Maybe try applying both. One thing that should happen is the usage of CUDA graphs; those should remove the gaps between these small kernels. You can also generate a profile using ….
OK - apparently there is an issue between older versions of …. This is matched by what the Torch profiler shows (here is the full trace: …). The …. @cpuhrsch Thanks for your help in running the model properly!
Nice! So we have ….
Yes, there is ….
@alexsamardzic - Can you dig into what …?
Yes: for …

Thus, indeed, these two could be unified into a single function, with some additional function arguments added. Shall I do it this way for this PR?
@alexsamardzic - so it seems that in order to use CUTLASS here you need a) symmetric-only …. Now, it's useful to use CUTLASS on CUDA. So if someone uses …. For this approach we don't need new flags. We just need to carefully check the setting it's being used in. I'd prefer to try this first so we don't add more API surface.
Let me clarify what I'm proposing: at the moment, with this PR, we have two similar quantization methods: …

and: …

My idea was to unify these two into: …

This way, it works unchanged for existing users of ….

The AQT works unchanged: it gets created by some quantization method, possibly one of the above two. When it comes to evaluation of a model, a kernel to use for the given AQT is searched for. The W4A8 CUTLASS kernel is in the mix; it will get selected if the AQT is on a CUDA device and if the quantization method used was as in the last code snippet. So users of the existing API are not affected. And the API change is of the same type as when a new callable, having some arguments and implementing a new quantization method, is introduced - which I assume is allowed.
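The unified method's actual code snippets are elided from this thread; a rough hypothetical sketch of the shape of such an API follows. Everything here is an illustrative assumption (the parameter names, the MappingType stand-in, the return value, and the eligibility condition), not the PR's final API:

```python
from enum import Enum

class MappingType(Enum):
    # Stand-in for torchao's MappingType, for a self-contained sketch.
    SYMMETRIC = "symmetric"
    ASYMMETRIC = "asymmetric"

def int8_dynamic_activation_int4_weight(
    group_size=32,
    mapping_type=MappingType.SYMMETRIC,
    act_mapping_type=MappingType.ASYMMETRIC,
):
    """Hypothetical unified entry point: default arguments preserve the
    existing behavior, while group_size=None (per-channel weight scales)
    plus symmetric activation quantization would make the W4A8 CUTLASS
    kernel eligible at dispatch time."""
    cutlass_eligible = (group_size is None
                        and act_mapping_type is MappingType.SYMMETRIC)
    return {"group_size": group_size,
            "mapping_type": mapping_type,
            "act_mapping_type": act_mapping_type,
            "cutlass_eligible": cutlass_eligible}
```

The point of the sketch is only that the same callable can serve both paths, with the kernel choice deferred to dispatch-time checks rather than a new flag.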
Hm, interesting. The signature for …. In any case, I think what you suggest sounds good. I do think that ….
Yeah, I feel we should use layout and put … (ao/torchao/quantization/quant_api.py, line 556 in c87cc9b).
OK. Looking into this: so I'll implement a layout, let's name it ….
Changed the quantization name from ….
The kernel is now built only if a CUTLASS installation is found at build time; also implemented a simple heuristic to improve performance when the number of inputs is small.
Added a benchmark script.
@msaroufim @cpuhrsch