Dynamic scheduler delay to improve ITL performance #3279

Merged: 12 commits into vllm-project:main on Mar 22, 2024

Conversation

@tdoublep (Contributor) commented Mar 8, 2024

We have been benchmarking vLLM internally using a synthetic workload generator that has been fitted to mimic our production workloads. It stresses the inference server with a varying number of concurrent users; all users send requests drawn uniformly from a heterogeneous set of requests with different prompt lengths and numbers of generated tokens.

We have found that for these workloads, vLLM has extremely low TTFT (time to first token) but relatively high ITL (inter-token latency). An in-depth analysis suggests that vLLM tends to schedule prompts as soon as possible, resulting in very small prompt batches that are processed very quickly but end up starving the decoding phase.

This PR adds a new optional feature --scheduler-use-delay which, if enabled, creates an artificial delay before scheduling prompts. The delay is determined dynamically based on the time taken by the last prompt step, and it allows the waiting queue to fill up with more requests. This gives the opportunity to build larger prompt batches, but due to the heterogeneous nature of the workload we then hit issues related to padding overhead. It is therefore beneficial to combine this scheduler delay with the --scheduler-policy=reorder feature from #2357, which sorts the waiting queue by sequence length. This allows us to create much larger prompt batches whilst staying within the padding limits, and leads to significant improvements in ITL performance.

This ITL improvement comes at the expense of TTFT performance, since (a) we are applying an artificial delay before scheduling prompts and (b) we are now processing larger batches which take longer to process. Different use-cases may have a preference towards either metric, which is why we feel this makes sense as an optional feature for now.
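
To make the mechanism concrete, here is a minimal sketch of the idea (illustrative only, not the actual implementation; the class and method names are invented for exposition): the scheduler remembers how long the last prompt step took and holds back the next prefill until a fraction of that duration has elapsed.

```python
import time


class DelayedPrefillGate:
    """Illustrative sketch of the dynamic scheduler delay (not vLLM code).

    After each prompt (prefill) step we record how long it took. The next
    prefill batch is only scheduled once delay_factor * last_prompt_latency
    seconds have passed, giving the waiting queue time to fill up with more
    requests. No delay is applied when nothing is running, so an otherwise
    idle server stays responsive.
    """

    def __init__(self, delay_factor: float = 0.5):
        self.delay_factor = delay_factor
        self.last_prompt_latency = 0.0  # duration of the last prompt step (s)
        self.last_prompt_end = 0.0      # monotonic time when it finished

    def record_prompt_step(self, start: float, end: float) -> None:
        self.last_prompt_latency = end - start
        self.last_prompt_end = end

    def can_schedule_prompts(self, num_running_seqs: int) -> bool:
        if num_running_seqs == 0:
            return True  # nothing is decoding, so there is nothing to starve
        min_gap = self.delay_factor * self.last_prompt_latency
        return time.monotonic() - self.last_prompt_end >= min_gap
```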

Benchmarking results (the label on each point indicates the number of concurrent users):

[benchmark plots]

@robertgshaw2-neuralmagic (Collaborator) commented Mar 8, 2024

Take a look also at the chunked prefill efforts to address this

#3106

@tdoublep (Contributor, Author) commented Mar 8, 2024

@robertgshaw2-neuralmagic Thanks, and agreed: chunked prefill may eventually solve this problem in a different way.

We hope that this relatively simple, optional, change can be used to improve performance in the meantime.

@ywang96 (Collaborator) commented Mar 8, 2024

This delay allows the waiting queue to fill up with more requests.

This might affect #3168, and IMO it's worth thinking about how to integrate these control changes with each other.

@Yard1 (Collaborator) commented Mar 8, 2024

@tdoublep We were planning to upstream something similar, but instead of time we used number of decode iterations ("schedule prefill iteration only after N decode iterations have been completed or there are no running sequences"). We believe that this scheme is more generic and easier to implement. I'd be happy to make a PR early next week, if you are interested in trying that out.
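
For comparison, that counting scheme boils down to roughly the following sketch (names and structure are purely illustrative, not the actual PR we had in mind):

```python
class DecodeCountPrefillGate:
    """Illustrative sketch: allow a prefill step only after N decode steps
    have completed since the last prefill, or when nothing is running."""

    def __init__(self, decode_steps_per_prefill: int = 4):
        self.decode_steps_per_prefill = decode_steps_per_prefill
        self.decodes_since_prefill = 0

    def record_decode_step(self) -> None:
        self.decodes_since_prefill += 1

    def record_prefill_step(self) -> None:
        self.decodes_since_prefill = 0

    def can_schedule_prompts(self, num_running_seqs: int) -> bool:
        return (num_running_seqs == 0
                or self.decodes_since_prefill >= self.decode_steps_per_prefill)
```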

@njhill (Collaborator) commented Mar 8, 2024

@Yard1 could you elaborate on "more generic and easier to implement"? Isn't it completely generic and fairly trivial to implement in either case?

We found the adaptive time-based approach to work very well, and it makes more sense to me, intuitively at least. The goal is to prevent prefills from starving decode progress: the enforced delay is some fraction of the duration of the last prefill, which is equivalent to saying that no more than, say, 50% of the time can be spent in prefill. We chose the minimum delay to be half the last prefill time, which ensures at most 66% of the time is spent in prefill.
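
Spelling out the arithmetic (with T denoting the duration of the last prefill step and f the delay fraction, symbols introduced here only for illustration): in the worst case prefill steps recur every T + fT seconds, so

```latex
\[
  \text{prefill fraction} \;\le\; \frac{T}{T + fT} \;=\; \frac{1}{1+f},
  \qquad f = 1 \Rightarrow 50\%,
  \qquad f = \tfrac{1}{2} \Rightarrow \tfrac{2}{3} \approx 66\%.
\]
```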

Of course like in your case, the min delay only applies while there are still running sequences.

@Yard1 (Collaborator) commented Mar 8, 2024

Hmm, I now see the delay is dynamic. I think reasoning in terms of model iterations is simpler, but I suppose this approach should be just as good.

@tdoublep would it be possible for you to open source your benchmarking tool?

@tdoublep (Contributor, Author) commented

@Yard1 Yes - we do plan to open-source the benchmarking tool. We are working through that process internally at the moment.

@sh1ng (Contributor) commented Mar 11, 2024

@tdoublep Which value of --scheduler-use-delay combined with --scheduler_reorder_window do you use?
I believe the sum of them must be a constant.

@tdoublep (Contributor, Author) commented

@sh1ng --scheduler-use-delay is a boolean option. If set to true, we apply a delay equal to half of the time taken by the previous prompt step (i.e., the delay adapts to the workload). For --scheduler_reorder_window we used a very large value (1000) to ensure that all of the requests in the waiting queue are sorted.
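
For illustration, the effect of the reorder window is roughly the following (a sketch with assumed names, not the #2357 implementation; `prompt_len` stands in for however the request's prompt length is obtained):

```python
from collections import deque
from typing import Callable


def reorder_waiting_queue(waiting: deque,
                          reorder_window: int,
                          prompt_len: Callable) -> deque:
    """Sketch: sort the first `reorder_window` waiting requests by prompt
    length so that a prompt batch packs similarly sized sequences and wastes
    less compute on padding. A very large window (e.g. 1000) effectively
    sorts the whole queue."""
    head = sorted(list(waiting)[:reorder_window], key=prompt_len)
    tail = list(waiting)[reorder_window:]
    return deque(head + tail)
```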

@tdoublep (Contributor, Author) commented

Based on the discussion here, it sounds like sorting the requests in the waiting queue will no longer be necessary once we merge #3236, which effectively removes padding constraints via 1D query.

We have run additional experiments to compare the performance when using 1D query from #3236, as well as to evaluate the performance if we enable the dynamic delay (from this PR) in combination with 1D query:

[benchmark plots]

Conclusion: combining dynamic scheduler delay (#3279) with 1D query (#3236) is even more effective than combining it with sorting requests by length (#2357).

@tdoublep (Contributor, Author) commented Mar 20, 2024

Update: Added a test case in test_scheduler.py to cover the use_delay option.

@tdoublep (Contributor, Author) commented Mar 21, 2024

Now that 1D query has been merged, the changes from this PR can be effective when applied on top of the main branch. Here is the latest round of benchmarking results. I've also included performance data collected using TGIS (our fork of TGI) as an additional reference point:
[benchmark plots]

Some conclusions here:

  • We can see that introducing the scheduler delay dramatically improves the ITL when the inference server is under stress (>2x in some cases), and helps to close the performance gap to TGIS, which is better than vLLM in terms of ITL.
  • The delay has the effect of processing larger batches of prompts, which worsens the TTFT a bit. However, we can see that the TTFT from vLLM after this change is still significantly better than that of TGIS (>10x in some cases).

[resolved review thread on vllm/core/scheduler.py]
@Yard1 (Collaborator) commented Mar 21, 2024

Looks good. I think it would be even better if we didn't hardcode it to 0.5. I think we could make the argument a float, and if it is <=0, we don't apply the delay.

[resolved review thread on vllm/core/scheduler.py]
@tdoublep (Contributor, Author) commented Mar 21, 2024

Looks good. I think it would be even better if we didn't hardcode it to 0.5. I think we could make the argument a float, and if it is <=0, we don't apply the delay.

@Yard1 Good idea: there is no reason to assume that 0.5 is optimal for all scenarios. I've updated the code accordingly.
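
For reference, the configurable check now boils down to something like this (an illustrative sketch; the exact name and placement of the parameter may differ in the final code):

```python
def passed_scheduling_delay(now: float,
                            last_prompt_end: float,
                            last_prompt_latency: float,
                            delay_factor: float) -> bool:
    """Illustrative sketch: a non-positive delay_factor disables the delay;
    otherwise new prompts wait until delay_factor * last_prompt_latency
    seconds have elapsed since the previous prompt step finished."""
    if delay_factor <= 0:
        return True
    return (now - last_prompt_end) >= delay_factor * last_prompt_latency
```

With delay_factor=0.5 this matches the behaviour benchmarked above, while delay_factor=0 recovers the original scheduling.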

@richardliaw (Collaborator) commented

@Yard1 are you approving this PR?

@Yard1 merged commit cf2f084 into vllm-project:main on Mar 22, 2024
32 checks passed
@tdoublep deleted the scheduler-delay branch on March 22, 2024 at 20:10
@tdoublep (Contributor, Author) commented

@Yard1 thanks for the review and helpful discussion and suggestions.

@rkooo567 (Collaborator) commented

@tdoublep Does vLLM have documentation about configuration? It feels like it would be worth adding this there if it exists, i.e., documenting which config settings optimize throughput over latency, or TTFT over ITL and vice versa. At the moment it seems like these things are not that well documented.

@tdoublep (Contributor, Author) commented

@rkooo567 I agree it would be good to have documentation like that.

The closest thing I can find is the developer documentation, e.g.:
https://docs.vllm.ai/en/latest/dev/engine/llm_engine.html

Perhaps we should consider adding some more pages there to document the ModelConfig, SchedulerConfig, etc.

@rkooo567 (Collaborator) commented

I see. Yeah, +1, we need better documentation for the configs; it seems like there's no holistic page that explains this at the moment.
