Why is online serving slower than offline serving? #2019
Comments
@irasin Hello, regarding #2257 (comment): in my latest test, when using AsyncLLMEngine I observed large fluctuations in GPU-Util (0-100%), but throughput was high. Previously, when using LLMEngine with bs=1, utilization was stable at 80-90%. What are your thoughts on this? I am running Llama 70B on 8*A800 80G, and in both scenarios memory usage is approximately 74.72 GB per GPU (gpu_memory_utilization=90%). I'm also curious about the reasons behind such high memory consumption.
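For context on the memory question: vLLM profiles the model at startup and pre-allocates paged KV-cache blocks up to the gpu_memory_utilization fraction, so roughly 90% of each card is claimed even at bs=1. A minimal sketch of lowering that fraction with the offline API, assuming a Llama-2-70B checkpoint name purely for illustration:

```python
# Minimal sketch (not from this thread): vLLM reserves KV-cache memory up to
# gpu_memory_utilization at startup, independent of the current batch size.
# Lowering the fraction reduces the reported memory usage at the cost of fewer
# cache blocks (and thus a smaller maximum concurrent batch).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # assumed checkpoint name for illustration
    tensor_parallel_size=8,             # 8 * A800 80G as in the report above
    gpu_memory_utilization=0.7,         # try a lower fraction than the default 0.9
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```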
Same issue here, online inference is almost half as fast as offline inference.
Hello @irasin, are there any new thoughts on this issue? I encountered the same thing: the online speed is ~0.49x of the offline batch in tokens/s. Much appreciated for any suggestions!
I have observed the same issue.
+1, have observed this also; currently just living with it.
I think it's slower due to internet latency.
Have you done any benchmark on this?
Confused +1
2 similar comments:
Confused +1
Confused +1
The performance should be fine now that the server is run in a separate process.
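As a reference point for comparing the two paths, here is a minimal sketch of timing a single request against vLLM's OpenAI-compatible server, assuming it was started with something like `python -m vllm.entrypoints.openai.api_server --model Open-Orca/Mistral-7B-OpenOrca` on the default port 8000; the prompt and parameters are placeholders, not from this thread:

```python
# Minimal sketch: time one request against the /v1/completions endpoint of
# vLLM's OpenAI-compatible server. The measured time includes full generation
# plus JSON serialization and the HTTP round trip.
import time
import requests

payload = {
    "model": "Open-Orca/Mistral-7B-OpenOrca",
    "prompt": "Explain KV-cache paging in one sentence.",  # placeholder prompt
    "max_tokens": 128,
    "temperature": 0.0,
}

start = time.perf_counter()
resp = requests.post("http://localhost:8000/v1/completions", json=payload, timeout=600)
elapsed = time.perf_counter() - start

resp.raise_for_status()
text = resp.json()["choices"][0]["text"]
print(f"{elapsed:.2f}s end-to-end, {len(text)} chars returned")
```

Running the same prompt with the same SamplingParams through the offline LLM API and comparing wall-clock time isolates how much of the gap is server/HTTP overhead versus engine behavior.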
Offline serving:
Online serving (FastAPI):
log: INFO 12-11 21:50:36 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%
INFO 12-11 21:50:41 async_llm_engine.py:111] Finished request 261ddff3312f44cd8ee1c52a6acd10e6.
Why is the response about 2 seconds slower when served through FastAPI?
The parameters and the prompt are the same.
"Open-Orca/Mistral-7B-OpenOrca" this model same issue
and any llama2 model same issue
python: 3.10.12
cuda_version: 12.0
gpu: A100 40G
library list attached (my library list.txt)
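Since the original "online serving (FastAPI)" code is not shown above, here is a minimal sketch of what that path typically looks like with AsyncLLMEngine, using the vLLM API from around the time of this issue; the endpoint name, request fields, and model are illustrative, not the reporter's actual code:

```python
# Minimal sketch of a FastAPI wrapper around vLLM's AsyncLLMEngine (non-streaming).
# Names below (/generate, GenRequest) are illustrative assumptions.
import uuid

from fastapi import FastAPI
from pydantic import BaseModel
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="Open-Orca/Mistral-7B-OpenOrca")
)

class GenRequest(BaseModel):
    prompt: str
    max_tokens: int = 128

@app.post("/generate")
async def generate(req: GenRequest):
    params = SamplingParams(max_tokens=req.max_tokens, temperature=0.0)
    request_id = str(uuid.uuid4())
    final = None
    # The async generator yields partial RequestOutputs; keep the last (finished) one.
    async for output in engine.generate(req.prompt, params, request_id):
        final = output
    return {"text": final.outputs[0].text}
```

Note that in this non-streaming setup the generator is consumed to completion before the response is sent, so the client-side latency includes full generation plus FastAPI/JSON serialization and the network round trip on top of what the offline path measures.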