
[Core] Allow AQLM on Pascal #5058

Merged: 1 commit merged into vllm-project:main on May 27, 2024

Conversation

sasha0552 (Contributor)
AQLM works on Pascal.
AWQ generates gibberish (a repeated exclamation mark, !).
GPTQ is very slow: even just loading Llama 3 8B takes 30 minutes, versus 1 minute for FP16.
FP16 works fine.
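For anyone who wants to try the AQLM path themselves, a minimal sketch of loading an AQLM checkpoint in vLLM on a Pascal card; the model ID below is an example public AQLM repo, not necessarily the one benchmarked here:

```python
# Minimal sketch: loading an AQLM-quantized model in vLLM on a Pascal GPU.
# The model ID is an example AQLM checkpoint, not the one used for the
# benchmarks in this PR.
from vllm import LLM, SamplingParams

llm = LLM(
    model="ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf",  # example AQLM repo
    quantization="aqlm",
)
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["The Pascal architecture"], params)
print(outputs[0].outputs[0].text)
```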


AQLM
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  1042.37
Total input tokens:                      215196
Total generated tokens:                  182214
Request throughput (req/s):              0.96
Input token throughput (tok/s):          206.45
Output token throughput (tok/s):         174.81
---------------Time to First Token----------------
Mean TTFT (ms):                          295326.22
Median TTFT (ms):                        242159.76
P99 TTFT (ms):                           770025.67
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1470.49
Median TPOT (ms):                        1232.73
P99 TPOT (ms):                           7437.79
==================================================
AWQ
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  670.48
Total input tokens:                      215196
Total generated tokens:                  26482
Request throughput (req/s):              1.49
Input token throughput (tok/s):          320.96
Output token throughput (tok/s):         39.50
---------------Time to First Token----------------
Mean TTFT (ms):                          246708.10
Median TTFT (ms):                        228687.70
P99 TTFT (ms):                           577918.77
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          6704.52
Median TPOT (ms):                        6226.30
P99 TPOT (ms):                           29068.24
==================================================
FP16
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  756.78
Total input tokens:                      215196
Total generated tokens:                  187161
Request throughput (req/s):              1.32
Input token throughput (tok/s):          284.36
Output token throughput (tok/s):         247.31
---------------Time to First Token----------------
Mean TTFT (ms):                          236744.23
Median TTFT (ms):                        207316.46
P99 TTFT (ms):                           587200.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1161.41
Median TPOT (ms):                        877.68
P99 TPOT (ms):                           7518.22
==================================================
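As a sanity check, the aggregate throughput lines in each table follow directly from the reported totals; a small sketch recomputing them from the AQLM run's numbers (the TTFT/TPOT percentiles need the raw per-request data, which the summary does not include):

```python
# Recompute the aggregate throughput lines of the AQLM table from its totals.
requests = 1000
duration_s = 1042.37
input_tokens = 215_196
output_tokens = 182_214

print(f"Request throughput (req/s):       {requests / duration_s:.2f}")       # ~0.96
print(f"Input token throughput (tok/s):   {input_tokens / duration_s:.2f}")   # ~206.45
print(f"Output token throughput (tok/s):  {output_tokens / duration_s:.2f}")  # ~174.81
```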

robertgshaw2-neuralmagic merged commit fbdb7b3 into vllm-project:main on May 27, 2024 (63 checks passed)
sasha0552 deleted the pascal-quants branch on May 27, 2024 at 23:41
qiudywzh commented Sep 2, 2024

Hi sasha0552, thanks for your great work!
I find that on Pascal GPUs, AQLM inference throughput drops dramatically as the model gets bigger.
I can run Qwen1.5-32B-Chat-AQLM-1x16 on 2x 1080 Ti (which is amazing in itself), but it generates an average of 3.24 tokens/s with an average first-token latency of 9.04 s.
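For context, a sketch of how a two-GPU AQLM run like the one described above is typically launched in vLLM; the model path is an assumption based on the model named in the comment, while quantization and tensor_parallel_size are standard vLLM options:

```python
# Sketch of a two-GPU AQLM run as described above. tensor_parallel_size=2
# shards the model across both 1080 Tis.
from vllm import LLM

llm = LLM(
    model="Qwen1.5-32B-Chat-AQLM-1x16",  # assumed local path or HF repo for the model named above
    quantization="aqlm",
    tensor_parallel_size=2,  # split across the two 1080 Tis
)
```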
