AWQ: Up to 2.66x higher throughput #2566

Merged
merged 6 commits into vllm-project:main on Jan 27, 2024

Conversation

casper-hansen
Contributor

@casper-hansen casper-hansen commented Jan 23, 2024

The strategy is to dequantize the weights and run an FP16 matmul for longer sequences. This could probably be faster still if we called cuBLAS directly instead of torch.matmul.

EDIT: It seems the throughput gain can be over 2x in vLLM because context processing is such a crucial part of the framework.
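For illustration, a minimal sketch of the dispatch idea (the threshold value, the helper name, and the awq_gemm fallback call are assumptions for this example, not necessarily the exact code in the PR):

```python
import torch

# Assumed threshold: above this many tokens, dequantize + FP16 matmul wins
# over the fused AWQ GEMM kernel (compare the "Threshold" column below).
FP16_MATMUL_THRESHOLD = 1024

def awq_linear(x, qweight, scales, qzeros, ops):
    # x: [num_tokens, in_features]; qweight/qzeros: packed int4; scales: fp16
    num_tokens = x.shape[0]
    if num_tokens >= FP16_MATMUL_THRESHOLD:
        # Long sequences (context processing): dequantize once to fp16 and
        # run a dense fp16 matmul, which is faster in the compute-bound regime.
        weight_fp16 = ops.awq_dequantize(qweight, scales, qzeros, 0, 0, 0)
        return torch.matmul(x, weight_fp16)
    # Short sequences (decoding): keep the fused AWQ GEMM kernel, which avoids
    # materializing the fp16 weight (8 = assumed pack factor for 4-bit weights).
    return ops.awq_gemm(x, qweight, scales, qzeros, 8)
```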

python benchmarks/benchmark_throughput.py --input-len 1024 --output-len 64 --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ --quantization awq --num-prompts 100 --max-model-len 1070 --dtype half

Tested on 1x A100 (80GB).

| Input Len | Threshold | Main (requests/s, tokens/s) | PR (requests/s, tokens/s) | Speedup (requests/s) |
|-----------|-----------|-----------------------------|---------------------------|----------------------|
| 16384     | 256       | 0.19, 3138.15               | 0.54, 8879.04             | 2.84x                |
| 4096      | 1024      | 0.84, 3500.00               | 2.24, 9336.44             | 2.66x                |
| 1024      | 1024      | 3.28, 3570.90               | 6.70, 7286.77             | 2.04x                |
| 512       | 512       | 5.96, 3435.14               | 11.91, 6860.50            | 1.99x                |
| 256       | 256       | 10.39, 3323.90              | 18.49, 5916.05            | 1.78x                |

@casper-hansen casper-hansen marked this pull request as ready for review January 24, 2024 21:13
@casper-hansen casper-hansen changed the title from "[WIP] AWQ: Faster context processing" to "[WIP] AWQ: Up to 2.66x higher throughput" on Jan 24, 2024
@casper-hansen casper-hansen changed the title from "[WIP] AWQ: Up to 2.66x higher throughput" to "AWQ: Up to 2.66x higher throughput" on Jan 24, 2024
@MichaelJayW

Does it affect the accuracy?

@fxmarty

fxmarty commented Jan 25, 2024

@casper-hansen that's really cool, and in line with the bench here, where cuBLAS (which the exllama kernel uses for longer sequences) is simply better than the AWQ GEMM kernel.

I think it would make our life easier if we had the same kind of dispatch for Marlin.

@casper-hansen
Contributor Author

Does it affect the accuracy?

This should have no impact on accuracy. The dequantization kernel is strictly equivalent to the dequantization from the GEMM kernel.
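As a quick sanity check, here is a sketch of how the two paths could be compared numerically (assuming access to vLLM's ops.awq_dequantize and ops.awq_gemm with the signatures used in this PR; the pack factor of 8 is an assumption for 4-bit weights):

```python
import torch

def check_equivalence(x, qweight, scales, qzeros, ops, atol=1e-2):
    # Path 1: fused AWQ GEMM kernel (dequantization + matmul in one kernel).
    out_gemm = ops.awq_gemm(x, qweight, scales, qzeros, 8)
    # Path 2: explicit dequantization followed by an fp16 torch.matmul.
    weight_fp16 = ops.awq_dequantize(qweight, scales, qzeros, 0, 0, 0)
    out_matmul = torch.matmul(x, weight_fp16)
    # The dequantization math is the same in both paths, so any difference
    # should only come from fp16 accumulation order in the matmul.
    return torch.allclose(out_gemm, out_matmul, atol=atol)
```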

@casper-hansen that's really cool, and in line with the bench here, where cuBLAS (which the exllama kernel uses for longer sequences) is simply better than the AWQ GEMM kernel.

I think it would make our life easier if we had the same kind of dispatch for Marlin.

Yes, I agree that the Marlin kernels could achieve even higher throughput. The most crucial part is just missing, though: Marlin only supports symmetric quantization.

@fxmarty

fxmarty commented Jan 25, 2024

Is it that crucial, though? Many int4*fp16 models use symmetric weight quantization successfully.

@casper-hansen
Contributor Author

Is it that crucial, though? Many int4*fp16 models use symmetric weight quantization successfully.

It may turn out to just be an engineering problem, but from my limited experience, the most popular symmetric weight quantization methods suffer from a higher quantization error.

@WoosukKwon WoosukKwon self-requested a review January 25, 2024 19:41
Collaborator

@WoosukKwon WoosukKwon left a comment


Hi @casper-hansen, thanks for submitting the PR! Left some minor comments. Please take a look.

Review comments were left on setup.py (outdated), csrc/quantization/awq/gemm_kernels.cu, and vllm/model_executor/layers/quantization/awq.py (outdated); all were resolved.
@casper-hansen
Contributor Author

Hi @casper-hansen, thanks for submitting the PR! Left some minor comments. Please take a look.

@WoosukKwon Thanks for the review. I applied your suggested fixes and tested that throughput is as expected.

Collaborator

@WoosukKwon WoosukKwon left a comment


LGTM! thanks for the fix!

@WoosukKwon WoosukKwon merged commit beb89f6 into vllm-project:main Jan 27, 2024
16 checks passed
@sitabulaixizawaluduo

I tested Llama-13B on A30 with tensor parallel size 4, and I found AWQ throughput is lower than FP16.

@casper-hansen
Contributor Author

casper-hansen commented Jan 30, 2024

I tested Llama-13B on A30 with tensor parallel size 4, and I found AWQ throughput is lower than FP16.

This is as expected. You cannot exceed W16A16 performance with W4A16 when you test for throughput. You would need W4A4 (Atom, lower-quality model) or W8A8 (SmoothQuant, also lower-quality model).

This is because W4A16 methods require dequantization, so when you test throughput you become compute-bound, and that limits performance.

EDIT: Throughput can also be lower if the TP implementation is not optimized for quantized models. I am not sure whether it is in vLLM.
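To make the compute-bound argument concrete, here is a rough back-of-envelope estimate (illustrative layer sizes, not measurements from this thread):

```python
# One linear layer with M tokens, in_features K, out_features N
# (illustrative 7B-scale shapes).
M, K, N = 4096, 4096, 11008

flops = 2 * M * K * N              # matmul FLOPs, identical for fp16 and int4 weights
bytes_fp16_weight = K * N * 2      # W16A16: weight read as fp16
bytes_int4_weight = K * N // 2     # W4A16: weight read as packed int4

# At large M (prefill / throughput benchmarks) the FLOPs per weight byte are
# already huge, so the GEMM is compute-bound: the 4x smaller weight read no
# longer helps, and dequantization only adds work on top of the fp16 baseline.
print(flops / bytes_fp16_weight)   # 4096 FLOPs per weight byte
print(flops / bytes_int4_weight)   # 16384 FLOPs per weight byte
```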

Comment on lines +161 to +162
out = ops.awq_dequantize(qweight, scales, qzeros, 0, 0, 0)
out = torch.matmul(reshaped_x, out)
Contributor


Curious to learn: would this write the dequantized weights back to memory before doing torch.matmul? If so, could a potential optimization be a more efficient mixed-precision matmul that saves one data transfer to memory?

Contributor Author


You are probably right that there is potential to eliminate overhead. Exllama runs dequantization and then directly calls cublas for matmul inside the same CUDA kernel. Definitely something to explore!
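For a sense of scale, a rough estimate of the extra global-memory traffic from materializing the fp16 weight between the two calls (illustrative layer shape, not measured):

```python
# The two-step path writes the dequantized fp16 weight to HBM and the matmul
# reads it back at least once; a fused dequant+GEMM would skip this round trip.
K, N = 4096, 11008                  # illustrative layer shape
extra_bytes = 2 * (K * N * 2)       # one write plus one read of the fp16 weight
print(f"{extra_bytes / 1e6:.0f} MB of extra HBM traffic per layer call")  # ~180 MB
```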

NikolaBorisov pushed a commit to deepinfra/vllm that referenced this pull request Jan 31, 2024
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
alexm-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Feb 13, 2024