Closed
Checklist
- 1. If the issue you raised is not a feature request but a question, please open a discussion at https://github.com/sgl-project/sglang/discussions/new/choose instead. Otherwise, it will be closed.
- 2. Please use English; otherwise the issue will be closed.
Motivation
TTFT is also an important online serving metric. In sglang, I found some bad cases: when there is not enough VRAM for an incoming request, the request must wait in the waiting_queue, so the TTFT observed by the user (which includes the time spent in the waiting queue) can be poor.
I would like to upstream some of my work on this. With a disable-req-waiting option, when VRAM is insufficient for an incoming request, the scheduler could return 403 to the server, and the user or the router could retry at the service level.
Which parts may be modified:
- In scheduler.py, add a free-VRAM check in "handle_generate_request"; if VRAM is insufficient, return an aborted status to the tokenizer.
- In tokenizer.py and open_ai/adapter.py, support returning this kind of error; for example, in my previous implementation, a 403 HTTP code is returned to the client.
- In schedule_batch.py, add a remain_vram property that reports the free VRAM and estimates the memory a new request would use, so we can judge whether the new request can be inserted into the waiting_queue.
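The admission check described above can be sketched with a toy scheduler. All names here (MiniScheduler, estimate_req_vram, the token-based accounting) are assumptions for illustration, not sglang's actual API; real KV-cache accounting is more involved.

```python
class MiniScheduler:
    """Toy model of a scheduler that rejects requests instead of queueing them
    when the estimated KV-cache footprint exceeds the free capacity."""

    def __init__(self, total_vram_tokens: int):
        self.total_vram_tokens = total_vram_tokens
        self.used_vram_tokens = 0
        self.waiting_queue = []

    @property
    def remain_vram(self) -> int:
        # Free KV-cache capacity, measured in tokens for simplicity.
        return self.total_vram_tokens - self.used_vram_tokens

    def estimate_req_vram(self, prompt_len: int, max_new_tokens: int) -> int:
        # Worst-case token footprint of the request (prefill + full decode).
        return prompt_len + max_new_tokens

    def handle_generate_request(self, prompt_len: int, max_new_tokens: int) -> dict:
        need = self.estimate_req_vram(prompt_len, max_new_tokens)
        if need > self.remain_vram:
            # Instead of letting the request sit in waiting_queue (hurting TTFT),
            # return an aborted status that the server can map to HTTP 403.
            return {"status": "aborted", "reason": "out_of_vram"}
        self.used_vram_tokens += need
        self.waiting_queue.append((prompt_len, max_new_tokens))
        return {"status": "queued"}
```

For example, with a 100-token budget, a request needing 80 tokens is queued, and a subsequent request needing 40 tokens is aborted immediately rather than waiting.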
What is expected:
- If a request is inserted into the waiting_queue, it can be served quickly (within roughly one forward-step of latency), so TTFT stays close to the time required for prefill.
- At the router/service level, we can achieve better load balancing.
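The retry-at-the-service-level behavior could look like the sketch below. The HTTP call is replaced by a stub, and the names (send_request, generate_with_retry) and the backoff policy are assumptions, not part of the proposal itself.

```python
import time

def send_request(server_load: list) -> int:
    """Stub for an HTTP call: returns 403 while the backend is out of VRAM,
    200 once capacity is available. server_load is a one-element list used
    as a mutable counter standing in for backend pressure."""
    return 403 if server_load[0] > 0 else 200

def generate_with_retry(server_load: list, max_retries: int = 5,
                        backoff_s: float = 0.01) -> int:
    """Retry on 403 with exponential backoff, as a router or client might."""
    for attempt in range(max_retries):
        status = send_request(server_load)
        if status != 403:
            return status
        server_load[0] -= 1  # pretend some load drains between attempts
        time.sleep(backoff_s * (2 ** attempt))
    return 403
```

A router could use the same 403 signal to route the retry to a less-loaded replica instead of sleeping.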
Timeline:
done before 5th May
Related resources
No response