[Feature] disable-req-waiting #5446

@voidxb

Motivation

TTFT is also an important online metric. In sglang, I have found some bad cases:
when VRAM is not enough for an incoming request, the request has to wait in waiting_queue, so the TTFT the user sees (which includes the time spent in the waiting queue) can be poor.
I would like to upstream some of my work on this. With disable-req-waiting enabled, when VRAM is not enough for an incoming request, the scheduler could return 403 to the server, and the user or router could retry at the service level.

Which parts may be modified:

  1. In scheduler.py, add a free-VRAM check in "handle_generate_request"; if VRAM is not enough, return an aborted status to the tokenizer right away.
  2. In tokenizer.py and open_ai/adapter.py, add support for returning this kind of error to the client. My previous implementation, for example, returned HTTP 403.
  3. In schedule_batch.py, add a remain_vram property that reports free VRAM, plus an estimate of the video memory a new request may use, to decide whether the new request can be inserted into waiting_queue.
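
A minimal sketch of the admission check described in the list above, assuming capacity is counted in KV-cache tokens rather than bytes. Apart from handle_generate_request and waiting_queue, every name here (Request, AdmissionScheduler, remain_tokens, estimate_tokens, the status strings) is hypothetical and is not taken from the sglang code base; the worst-case estimate is only one possible choice.

```python
from dataclasses import dataclass, field
from typing import List

QUEUED = "queued"
ABORTED = "aborted"


@dataclass
class Request:
    """Hypothetical stand-in for an incoming generate request."""
    rid: str
    num_input_tokens: int
    max_new_tokens: int


@dataclass
class AdmissionScheduler:
    """Toy scheduler that tracks capacity in tokens (an assumption)."""
    total_tokens: int
    disable_req_waiting: bool = True
    reserved_tokens: int = 0
    waiting_queue: List[Request] = field(default_factory=list)

    @property
    def remain_tokens(self) -> int:
        # Capacity not yet promised to an admitted request.
        return self.total_tokens - self.reserved_tokens

    def estimate_tokens(self, req: Request) -> int:
        # Worst-case estimate: the prompt plus every token it may generate.
        return req.num_input_tokens + req.max_new_tokens

    def handle_generate_request(self, req: Request) -> str:
        needed = self.estimate_tokens(req)
        if self.disable_req_waiting and needed > self.remain_tokens:
            # Reject immediately instead of letting the request wait.
            return ABORTED
        # With the flag off, requests queue regardless, so the
        # reservation counter may exceed total capacity.
        self.reserved_tokens += needed
        self.waiting_queue.append(req)
        return QUEUED


sched = AdmissionScheduler(total_tokens=100)
first = sched.handle_generate_request(Request("a", 40, 20))   # fits
second = sched.handle_generate_request(Request("b", 50, 10))  # does not fit
```

In this sketch the caller would map the aborted status to the HTTP error returned to the client; releasing reserved tokens when a request finishes is left out.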

What is expected:

  1. If a request is inserted into waiting_queue, it can start inference quickly (within about one forward-step latency), so TTFT should be close to the time required for prefill.
  2. At the router/service level, we can do better load balancing.
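
To illustrate the second point, here is a rough sketch of how a router might react to the rejection status, assuming each worker is modelled as a plain callable that returns a status code and a body. The function dispatch_with_retry and the worker callables are hypothetical; a real router would issue HTTP requests and probably add backoff.

```python
from typing import Callable, Sequence, Tuple

REJECTED = 403  # status code proposed in this issue for "not enough VRAM"

Worker = Callable[[str], Tuple[int, str]]


def dispatch_with_retry(
    workers: Sequence[Worker], prompt: str, max_attempts: int = 3
) -> Tuple[int, str, int]:
    """Try workers in order; return (status, body, worker_index)."""
    if not workers:
        raise ValueError("at least one worker is required")
    attempts = min(max_attempts, len(workers))
    status, body, index = REJECTED, "", -1
    for index in range(attempts):
        status, body = workers[index](prompt)
        if status != REJECTED:
            # This worker accepted the request, so stop retrying.
            break
    return status, body, index


def busy_worker(prompt: str) -> Tuple[int, str]:
    # Simulates a worker whose scheduler rejected the request.
    return REJECTED, ""


def idle_worker(prompt: str) -> Tuple[int, str]:
    # Simulates a worker with enough free VRAM.
    return 200, "ok:" + prompt


result = dispatch_with_retry([busy_worker, idle_worker], "hi")
```

Because a rejected request comes back immediately instead of sitting in a queue, the router can move on to the next worker without paying the waiting time.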

Timeline:
Expected to be done before 5th May.

Related resources

No response
