
[Kernel] Initial Activation Quantization Support #4525

Merged: 49 commits into vllm-project:main on May 23, 2024

Conversation

@dsikka (Contributor) commented May 1, 2024

Summary

  • Initial support for Activation Quantization (specifically static per-tensor for W8A8)
  • Adds CompressedTensorsConfig and CompressedTensorsLinearMethod to support models quantized through sparseml and saved through compressed-tensors
  • Adds a new optional layer_name parameter to create_weights. The layer_name can be used to match the appropriate quantization scheme from the CompressedTensorsConfig for a given layer
  • Adds a static per-tensor quant kernel (inspired by and refactored from Support W8A8 inference in vllm #1508)
  • Uses the nvidia-cutlass Python interface to invoke a fused GEMM+dequant kernel.
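For intuition, here is a rough pure-PyTorch sketch of what the static per-tensor W8A8 path computes (illustrative only; the actual implementation uses the custom int8 quant kernel and the CUTLASS fused GEMM+dequant rather than these torch ops):

```python
import torch

def static_per_tensor_w8a8(x: torch.Tensor, w_int8: torch.Tensor,
                           x_scale: float, w_scale: float) -> torch.Tensor:
    # Quantize activations with a precomputed (static) per-tensor scale.
    x_int8 = torch.clamp(torch.round(x / x_scale), -128, 127).to(torch.int8)
    # Reference GEMM + dequant; the fused kernel performs the int8 GEMM with an
    # int32 accumulator and rescales the output in a single pass.
    out = x_int8.float() @ w_int8.float().t()
    return out * (x_scale * w_scale)
```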

From Neural Magic, co-authored with @varun-sundar-rabindranath and @robertgshaw2-neuralmagic

dsikka and others added 8 commits April 30, 2024 18:50
…for static W8A8 per tensor (#195)

- Depending on how we end up parsing `ignore` and `targets` (layer_name
vs layer_type) we may not need layer_name to be added to the
linear_method. Will experiment using a compressed-tensors function in a
follow-up PR

- Initial implementation for Compressed Config support + Activation
Quantization for static per tensor w8a8
- Includes fused kernels added by @varun-sundar-rabindranath

```python
from vllm import LLM, SamplingParams
import torch

prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The US president is",
    "The future of AI is"
]
sampling_params = SamplingParams(temperature=0.80, top_p=0.95)

llm = LLM(model="nm-testing/tinyllama-one-shot-static-quant-test", enforce_eager=True, dtype=torch.float32, quantization="sparseml")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

- Verification of the different inputs expected for `targets` and
`ignore` --> use functions to parse the layer names which can be shared
by sparseml and vllm; would live in compressed tensors
(https://github.com/neuralmagic/compressed-tensors/blob/67005d76107d4659787f1efd53fe7e6b1d192818/src/compressed_tensors/quantization/lifecycle/apply.py#L86)
- Updates to further optimize fake quant

---------

Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
@dsikka marked this pull request as ready for review May 1, 2024 14:18
dsikka and others added 6 commits May 1, 2024 14:20
vllm CI fixes

---------

Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
lazy cutlass_gemm_dq import

---------

Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
@comaniac (Collaborator) left a comment:


IMHO the layer_name approach is simple and effective, but it also creates more complexity for model implementers. Ideally we should match the scheme automatically after the model is initialized (but before weight loading). In that case we need to make all parameters meta tensors (i.e. placeholders) until the weights are actually loaded. That way we can change data types without worrying about the memory footprint.
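For illustration, a minimal sketch of the meta-tensor idea in plain PyTorch (placeholder parameters with no storage, materialized only when weights are loaded; not the actual vLLM loading code):

```python
import torch
import torch.nn as nn

# Parameters created on the "meta" device carry shape/dtype but allocate no memory,
# so the dtype can still be changed freely before the checkpoint is read.
with torch.device("meta"):
    layer = nn.Linear(4096, 4096, bias=False)
print(layer.weight.device)  # meta

# At weight-loading time, materialize the parameter with the quantized dtype.
loaded = torch.zeros(4096, 4096, dtype=torch.int8)  # stand-in for a checkpoint tensor
layer.weight = nn.Parameter(loaded, requires_grad=False)
```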

@@ -167,6 +167,7 @@ set(VLLM_EXT_SRC
"csrc/layernorm_kernels.cu"
"csrc/quantization/squeezellm/quant_cuda_kernel.cu"
"csrc/quantization/gptq/q_gemm.cu"
"csrc/quantization/compressed_tensors/int8_quant_kernels.cu"
Collaborator: It's a bit unclear to me what the name compressed_tensors refers to. I suppose this is the official method name from SparseML? Then can we just use sparseml here?

Collaborator: compressed-tensors is the name of the package responsible for saving quantized and sparse models.

So the flow is:

  • use SparseML to apply quantization / sparsity
  • save model to safetensors with a compressed-tensors config
  • load + run in vllm

@@ -403,6 +440,13 @@ def weight_loader(self,
shard_size = loaded_weight.shape[0]
shard_offset = loaded_shard_id * shard_size
param_data = param_data.narrow(0, shard_offset, shard_size)

# If a param_shard_splitter is defined by the LinearMethod, use it.
Collaborator: This does the same thing as the scale_shard_splitter we had for fp8 ... we can rename it to match fp8. But yes, this will be addressed by the refactor.
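For context, a hypothetical sketch of the idea (the actual hook name and signature in this PR may differ):

```python
# Illustrative only: a shard splitter lets the quantization method tell the generic
# weight loader which slice of a fused parameter (e.g. the merged q/k/v projection)
# a loaded checkpoint tensor should be written into.
def split_for_shard(param_data, loaded_weight, loaded_shard_id):
    shard_size = loaded_weight.shape[0]
    offset = loaded_shard_id * shard_size
    return param_data.narrow(0, offset, shard_size), loaded_weight
```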

@robertgshaw2-neuralmagic (Collaborator) commented May 2, 2024

> IMHO the layer_name approach is simple and effective, but it also creates more complexity for model implementers. Ideally we should match the scheme automatically after the model is initialized (but before weight loading). In that case we need to make all parameters meta tensors (i.e. placeholders) until the weights are actually loaded. That way we can change data types without worrying about the memory footprint.

Per our Slack discussion:

The plan is to refactor the weight-loading logic generically (separately from this PR) with a flow that looks like this:

```python
model = init_model(...)  # parameters are in meta tensors (placeholders, no memory)
for key, val in scheme:
    mod = find_module_by_name(model, key)
    config_module(mod, val)  # configure dtype / quantization params per layer
...
weight_loading(model, ckpt)  # materialize real tensors from the checkpoint
```

This is similar to how we do things in SparseML / HF. It would also address the current lack of memory savings for fp8.

@robertgshaw2-neuralmagic (Collaborator) left a comment:

test

@Yard1 reopened this May 2, 2024
@bnellnm (Contributor) commented May 14, 2024

@dsikka can you add some tests for the new functionality? Can any of the tests from #1508 be reused/adapted?

import torch
from torch.nn import Parameter

# TODO (varun) : Unify ops and custom ops
Collaborator: This should not be left as a TODO and should instead be done before the PR is merged -- it is a very small amount of work.

Contributor: Yes, definitely. This slipped under my radar; it is fixed now. Thanks for catching it.

void static_scaled_int8_quant(torch::Tensor& out,    // [..., hidden_size]
                              torch::Tensor& input,  // [..., hidden_size]
                              float scale) {
  assert(input.is_contiguous());
Collaborator: Both of these asserts should be TORCH_CHECK so the interpreter doesn't crash if they get triggered.

Contributor: Done 👍

Varun Sundar Rabindranath added 2 commits May 22, 2024 20:21
static constexpr float dt_max =
    static_cast<float>(std::numeric_limits<int8_t>::max());
// round
float dst = round(x);
Contributor: Note - this rounding doesn't match compressed-tensors's / Torch's / NumPy's rounding method. To fix this I have a patch at neuralmagic#263; it will be merged before this lands.
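For reference, C's round() rounds halves away from zero, while torch.round and numpy.round (which compressed-tensors relies on) round halves to even; a quick check of the difference:

```python
import numpy as np
import torch

vals = [0.5, 1.5, 2.5]
# Round-half-to-even: 0.5 -> 0, 1.5 -> 2, 2.5 -> 2
print(torch.round(torch.tensor(vals)))  # tensor([0., 2., 2.])
print(np.round(np.array(vals)))         # [0. 2. 2.]
# C/C++ round() would give 1, 2, 3 (halves rounded away from zero).
```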

Contributor: Fixed!

@pcmoritz (Collaborator) commented:

Thanks for removing the layer name until the weight refactor is ready, @dsikka

@robertgshaw2-neuralmagic merged commit a124232 into vllm-project:main on May 23, 2024
63 checks passed
dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request May 31, 2024
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Jun 8, 2024
robertgshaw2-neuralmagic added a commit to neuralmagic/nm-vllm that referenced this pull request Jun 8, 2024
joerunde pushed a commit to joerunde/vllm that referenced this pull request Jun 17, 2024
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Jul 14, 2024
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024