
Why is online serving slower than offline serving? #2019

Closed

BangDaeng opened this issue Dec 11, 2023 · 11 comments

Comments

@BangDaeng

BangDaeng commented Dec 11, 2023

  1. Offline serving
     [screenshot]

  2. Online serving (FastAPI)
     [screenshot]
     [screenshot]
     log: INFO 12-11 21:50:36 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%
     INFO 12-11 21:50:41 async_llm_engine.py:111] Finished request 261ddff3312f44cd8ee1c52a6acd10e6.

Why is generation about 2 seconds slower when served through FastAPI? The parameters and the prompt are the same in both cases.

The same issue occurs with "Open-Orca/Mistral-7B-OpenOrca" and with every Llama 2 model I tried.

python: 3.10.12
cuda_version: 12.0
gpu: A100 40G
my library list.txt (attached)
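
For anyone trying to reproduce the comparison, here is a minimal sketch of the two paths being compared: offline generation through the synchronous `LLM` class versus a FastAPI endpoint built on `AsyncLLMEngine`. The sampling parameters and route name are placeholders, and the exact `AsyncLLMEngine.generate` signature can differ between vLLM versions:

```python
import time
import uuid

from fastapi import FastAPI
from vllm import LLM, SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

params = SamplingParams(temperature=0.8, max_tokens=256)

# --- Offline serving: one synchronous, batched generate() call ---
# (run the two halves in separate processes; each will claim most of the GPU)
llm = LLM(model="Open-Orca/Mistral-7B-OpenOrca")
start = time.perf_counter()
outputs = llm.generate(["Hello, my name is"], params)
print(f"offline latency: {time.perf_counter() - start:.2f}s")

# --- Online serving: FastAPI endpoint on top of AsyncLLMEngine ---
app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="Open-Orca/Mistral-7B-OpenOrca"))

@app.post("/generate")
async def generate(prompt: str):
    request_id = str(uuid.uuid4())
    final = None
    # engine.generate yields partial RequestOutputs; keep the last one
    async for output in engine.generate(prompt, params, request_id):
        final = output
    return {"text": [o.text for o in final.outputs]}
```

Note the structural difference: the offline path batches everything in a single `generate` call, while the endpoint only batches whichever requests happen to be in flight at the same time, and each request also pays HTTP and async-scheduling overhead.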

@Lvjinhong

Lvjinhong commented Dec 28, 2023

@irasin Hello, regarding #2257 (comment): in my latest test, when using AsyncLLMEngine I observed significant fluctuations in GPU-Util (0-100%), but the throughput was high. Previously, when using LLMEngine with bs=1, the utilization was stable at 80-90%. What are your thoughts on this?

I am running Llama 70B on 8×A800 80G, and in both scenarios the memory usage is approximately 74.72 GB (gpu_memory_utilization=90%). I'm also curious about the reasons behind such high memory consumption.
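
On the memory question: by default vLLM preallocates KV-cache blocks until GPU usage reaches the `gpu_memory_utilization` fraction, so with 80 GB cards and 0.9 you would expect roughly 72 GB per GPU to be reserved regardless of batch size, which is in the same range as the ~74.7 GB reported above. A hedged sketch (the model name is a placeholder; lower the fraction if you want to cap the reservation):

```python
from vllm import LLM

# vLLM reserves memory for weights plus preallocated KV-cache blocks up to
# this fraction of each GPU, so usage stays near the cap even at batch size 1.
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder for the 70B model used above
    tensor_parallel_size=8,
    gpu_memory_utilization=0.70,        # default is 0.90; reduce to reserve less
)
```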

@SardarArslan

Same issue here, online inference is almost half as fast as offline inference.

@iamhappytoo

Hello @irasin, are there any new thoughts on this issue? I'm encountering the same thing: online throughput is ~0.49× offline batch throughput (tokens/s). Any suggestions would be much appreciated!

@rbgo404

rbgo404 commented Apr 14, 2024

> Hello @irasin, are there any new thoughts on this issue? I'm encountering the same thing: online throughput is ~0.49× offline batch throughput (tokens/s). Any suggestions would be much appreciated!

I have observed the same issue.

@SamComber

+1 have observed this also, currently just living with it.

@SardarArslan

SardarArslan commented Apr 15, 2024 via email

I think it's slower due to internet latency.

@rbgo404

rbgo404 commented Apr 15, 2024

> I think it's slower due to internet latency.

Have you done any benchmark on this?
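
For anyone who wants numbers: the vLLM repo ships `benchmarks/benchmark_throughput.py` (offline) and `benchmarks/benchmark_serving.py` (online), but a rough apples-to-apples check can also be done by timing the same prompts against both paths. A sketch, assuming the FastAPI endpoint from earlier in the thread is running on localhost; run the offline half in a separate process so the two don't fight over the GPU:

```python
import time

import requests  # talks to the FastAPI endpoint sketched earlier (assumed URL)
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is"] * 32
params = SamplingParams(temperature=0.0, max_tokens=128)

# Offline path: one batched generate() call.
llm = LLM(model="Open-Orca/Mistral-7B-OpenOrca")
start = time.perf_counter()
llm.generate(prompts, params)
print(f"offline: {time.perf_counter() - start:.1f}s")

# Online path: sequential HTTP requests, so nothing gets batched together.
start = time.perf_counter()
for p in prompts:
    requests.post("http://localhost:8000/generate", params={"prompt": p})
print(f"online (sequential): {time.perf_counter() - start:.1f}s")
```

Sending requests one at a time is itself a large part of any gap: the offline path schedules all 32 prompts as one batch, while the server can only batch requests that arrive concurrently.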

@xiejibing

Confused +1

@SuperCB

SuperCB commented Aug 15, 2024

Confused +1

@AlexBlack2202

Confused +1

@DarkLight1337
Member

The performance should be fine, now that the server is run in a separate process.
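
If anyone still sees a gap on a recent release, a quick way to sanity-check the online path is the OpenAI-compatible server plus the `openai` Python client (≥1.0). The port and model name below are assumptions to adapt:

```python
# Assumes a recent vLLM OpenAI-compatible server is already running, e.g.
#   python -m vllm.entrypoints.openai.api_server --model Open-Orca/Mistral-7B-OpenOrca
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
resp = client.completions.create(
    model="Open-Orca/Mistral-7B-OpenOrca",
    prompt="Hello, my name is",
    max_tokens=128,
)
elapsed = time.perf_counter() - start
print(f"{resp.usage.completion_tokens / elapsed:.1f} generation tok/s")
```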
