
[Core] Allow AQLM on Pascal #5058

Merged: 1 commit merged into vllm-project:main on May 27, 2024

Conversation

sasha0552 (Contributor)
AQLM works on Pascal.
AWQ generates gibberish (a repeated exclamation mark, !).
GPTQ is very slow: even just loading Llama 3 8B takes 30 minutes, versus 1 minute for FP16.
FP16 works fine.
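For anyone who wants to try the AQLM path themselves, a minimal sketch of loading an AQLM checkpoint in vLLM on a Pascal card; the model ID below is an example public AQLM repo, not necessarily the one benchmarked here:

```python
# Minimal sketch: loading an AQLM-quantized model in vLLM on a Pascal GPU.
# The model ID is an example AQLM checkpoint, not the one used for the
# benchmarks in this PR.
from vllm import LLM, SamplingParams

llm = LLM(
    model="ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf",  # example AQLM repo
    quantization="aqlm",
)
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["The Pascal architecture"], params)
print(outputs[0].outputs[0].text)
```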


AQLM
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  1042.37
Total input tokens:                      215196
Total generated tokens:                  182214
Request throughput (req/s):              0.96
Input token throughput (tok/s):          206.45
Output token throughput (tok/s):         174.81
---------------Time to First Token----------------
Mean TTFT (ms):                          295326.22
Median TTFT (ms):                        242159.76
P99 TTFT (ms):                           770025.67
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1470.49
Median TPOT (ms):                        1232.73
P99 TPOT (ms):                           7437.79
==================================================
AWQ
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  670.48
Total input tokens:                      215196
Total generated tokens:                  26482
Request throughput (req/s):              1.49
Input token throughput (tok/s):          320.96
Output token throughput (tok/s):         39.50
---------------Time to First Token----------------
Mean TTFT (ms):                          246708.10
Median TTFT (ms):                        228687.70
P99 TTFT (ms):                           577918.77
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          6704.52
Median TPOT (ms):                        6226.30
P99 TPOT (ms):                           29068.24
==================================================
FP16
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  756.78
Total input tokens:                      215196
Total generated tokens:                  187161
Request throughput (req/s):              1.32
Input token throughput (tok/s):          284.36
Output token throughput (tok/s):         247.31
---------------Time to First Token----------------
Mean TTFT (ms):                          236744.23
Median TTFT (ms):                        207316.46
P99 TTFT (ms):                           587200.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1161.41
Median TPOT (ms):                        877.68
P99 TPOT (ms):                           7518.22
==================================================
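As a sanity check, the aggregate throughput lines in each table follow directly from the reported totals; a small sketch recomputing them from the AQLM run's numbers (the TTFT/TPOT percentiles need the raw per-request data, which the summary does not include):

```python
# Recompute the aggregate throughput lines of the AQLM table from its totals.
requests = 1000
duration_s = 1042.37
input_tokens = 215_196
output_tokens = 182_214

print(f"Request throughput (req/s):       {requests / duration_s:.2f}")       # ~0.96
print(f"Input token throughput (tok/s):   {input_tokens / duration_s:.2f}")   # ~206.45
print(f"Output token throughput (tok/s):  {output_tokens / duration_s:.2f}")  # ~174.81
```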

robertgshaw2-neuralmagic merged commit fbdb7b3 into vllm-project:main on May 27, 2024 (63 checks passed)
sasha0552 deleted the pascal-quants branch on May 27, 2024 at 23:41
qiudywzh commented Sep 2, 2024

Hi sasha0552, thanks for your great work!
I find that on Pascal GPUs, AQLM inference throughput drops dramatically as the model gets bigger.
I can run Qwen1.5-32B-Chat-AQLM-1x16 on 2x 1080 Ti (which is amazing in itself), but it generates an average of 3.24 tokens/s with an average first-token latency of 9.04 s.
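For context, a sketch of how a two-GPU AQLM run like the one described above is typically launched in vLLM; the model path is an assumption based on the model named in the comment, while quantization and tensor_parallel_size are standard vLLM options:

```python
# Sketch of a two-GPU AQLM run as described above. tensor_parallel_size=2
# shards the model across both 1080 Tis.
from vllm import LLM

llm = LLM(
    model="Qwen1.5-32B-Chat-AQLM-1x16",  # assumed local path or HF repo for the model named above
    quantization="aqlm",
    tensor_parallel_size=2,  # split across the two 1080 Tis
)
```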
