Dynamic scheduler delay to improve ITL performance #3279

Merged: 12 commits into vllm-project:main on Mar 22, 2024

Conversation

@tdoublep (Contributor) commented Mar 8, 2024

We have been benchmarking vLLM internally using a synthetic workload generator that has been fitted to mimic our production workloads. It stresses the inference server with a varying number of concurrent users; all users send requests drawn uniformly from a heterogeneous set of requests with different prompt lengths and numbers of generated tokens.

We have found that for these workloads, vLLM has extremely low TTFT (time to first token) but relatively high ITL (inter-token latency). An in-depth analysis suggests that vLLM tends to schedule prompts as soon as possible, resulting in very small prompt batches that are processed very quickly but end up starving the decoding phase.

This PR adds a new optional feature --scheduler-use-delay which, if enabled, creates an artificial delay before scheduling prompts. The delay is determined dynamically based on the time taken by the last prompt step, and it allows the waiting queue to fill up with more requests. This gives the opportunity to build larger prompt batches, but due to the heterogeneous nature of the workload we then hit issues related to padding overhead. It is therefore beneficial to combine this scheduler delay with the --scheduler-policy=reorder feature from #2357, which sorts the waiting queue by sequence length. This allows us to create much larger prompt batches whilst staying within the padding limits, and leads to significant improvements in ITL performance.

This ITL improvement comes at the expense of TTFT performance, since (a) we are applying an artificial delay before scheduling prompts and (b) we are now processing larger batches which take longer to process. Different use-cases may have a preference towards either metric, which is why we feel this makes sense as an optional feature for now.
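
To make the mechanism concrete, here is a minimal sketch of the idea (illustrative only, not the actual implementation; the class and method names are invented for exposition): the scheduler remembers how long the last prompt step took and holds back the next prefill until a fraction of that duration has elapsed.

```python
import time


class DelayedPrefillGate:
    """Illustrative sketch of the dynamic scheduler delay (not vLLM code).

    After each prompt (prefill) step we record how long it took. The next
    prefill batch is only scheduled once delay_factor * last_prompt_latency
    seconds have passed, giving the waiting queue time to fill up with more
    requests. No delay is applied when nothing is running, so an otherwise
    idle server stays responsive.
    """

    def __init__(self, delay_factor: float = 0.5):
        self.delay_factor = delay_factor
        self.last_prompt_latency = 0.0  # duration of the last prompt step (s)
        self.last_prompt_end = 0.0      # monotonic time when it finished

    def record_prompt_step(self, start: float, end: float) -> None:
        self.last_prompt_latency = end - start
        self.last_prompt_end = end

    def can_schedule_prompts(self, num_running_seqs: int) -> bool:
        if num_running_seqs == 0:
            return True  # nothing is decoding, so there is nothing to starve
        min_gap = self.delay_factor * self.last_prompt_latency
        return time.monotonic() - self.last_prompt_end >= min_gap
```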

Benchmarking results (the label on each point indicates the number of concurrent users):

[benchmark plots]

@robertgshaw2-neuralmagic (Collaborator) commented Mar 8, 2024

Take a look also at the chunked prefill efforts to address this

#3106

@tdoublep (Contributor, Author) commented Mar 8, 2024

@robertgshaw2-neuralmagic Thanks, and agreed: chunked prefill may eventually solve this problem in a different way.

We hope that this relatively simple, optional, change can be used to improve performance in the meantime.

@ywang96 (Collaborator) commented Mar 8, 2024

This delay allows the waiting queue to fill up with more requests.

This might affect #3168, and IMO it's worth thinking about how to integrate these control changes with each other.

@Yard1 (Collaborator) commented Mar 8, 2024

@tdoublep We were planning to upstream something similar, but instead of time we used number of decode iterations ("schedule prefill iteration only after N decode iterations have been completed or there are no running sequences"). We believe that this scheme is more generic and easier to implement. I'd be happy to make a PR early next week, if you are interested in trying that out.
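
For comparison, that counting scheme boils down to roughly the following sketch (names and structure are purely illustrative, not the actual PR we had in mind):

```python
class DecodeCountPrefillGate:
    """Illustrative sketch: allow a prefill step only after N decode steps
    have completed since the last prefill, or when nothing is running."""

    def __init__(self, decode_steps_per_prefill: int = 4):
        self.decode_steps_per_prefill = decode_steps_per_prefill
        self.decodes_since_prefill = 0

    def record_decode_step(self) -> None:
        self.decodes_since_prefill += 1

    def record_prefill_step(self) -> None:
        self.decodes_since_prefill = 0

    def can_schedule_prompts(self, num_running_seqs: int) -> bool:
        return (num_running_seqs == 0
                or self.decodes_since_prefill >= self.decode_steps_per_prefill)
```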

@njhill (Collaborator) commented Mar 8, 2024

@Yard1 could you elaborate on "more generic and easier to implement"? Isn't it completely generic and fairly trivial to implement in either case?

We found the adaptive time-based approach to work very well, and it makes more sense to me, intuitively at least. The goal is to prevent prefills from starving decode progress: the enforced delay is some fraction of the duration of the last prefill, which is equivalent to saying that no more than, say, 50% of the time can be spent in prefill. We chose the minimum delay to be half the last prefill time, which ensures at most 66% of the time is spent in prefill.
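
Spelling out the arithmetic (with T denoting the duration of the last prefill step and f the delay fraction, symbols introduced here only for illustration): in the worst case prefill steps recur every T + fT seconds, so

```latex
\[
  \text{prefill fraction} \;\le\; \frac{T}{T + fT} \;=\; \frac{1}{1+f},
  \qquad f = 1 \Rightarrow 50\%,
  \qquad f = \tfrac{1}{2} \Rightarrow \tfrac{2}{3} \approx 66\%.
\]
```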

Of course like in your case, the min delay only applies while there are still running sequences.

@Yard1 (Collaborator) commented Mar 8, 2024

Hmm, I now see the delay is dynamic. I think reasoning in terms of model iterations is simpler, but I suppose this approach should be just as good.

@tdoublep would it be possible for you to open source your benchmarking tool?

@tdoublep (Contributor, Author) commented

@Yard1 Yes - we do plan to open-source the benchmarking tool. We are working through that process internally at the moment.

@sh1ng (Contributor) commented Mar 11, 2024

@tdoublep Which value of --scheduler-use-delay combined with --scheduler_reorder_window do you use?
I believe the sum of them must be a constant.

@tdoublep (Contributor, Author) commented

@sh1ng --scheduler-use-delay is a boolean option. If set to true, we apply a delay equal to half of the time taken by the previous prompt step (i.e., the delay adapts to the workload). For --scheduler_reorder_window we used a very large value (1000) to ensure that all of the requests in the waiting queue are sorted.
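
For illustration, the effect of the reorder window is roughly the following (a sketch with assumed names, not the #2357 implementation; `prompt_len` stands in for however the request's prompt length is obtained):

```python
from collections import deque
from typing import Callable


def reorder_waiting_queue(waiting: deque,
                          reorder_window: int,
                          prompt_len: Callable) -> deque:
    """Sketch: sort the first `reorder_window` waiting requests by prompt
    length so that a prompt batch packs similarly sized sequences and wastes
    less compute on padding. A very large window (e.g. 1000) effectively
    sorts the whole queue."""
    head = sorted(list(waiting)[:reorder_window], key=prompt_len)
    tail = list(waiting)[reorder_window:]
    return deque(head + tail)
```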

@tdoublep (Contributor, Author) commented

Based on the discussion here, it sounds like sorting the requests in the waiting queue will no longer be necessary once we merge #3236, which effectively removes padding constraints via 1D query.

We have run additional experiments to compare the performance when using 1D query from #3236, as well as to evaluate the performance if we enable the dynamic delay (from this PR) in combination with 1D query:

[benchmark plots]

Conclusion: combining dynamic scheduler delay (#3279) with 1D query (#3236) is even more effective than combining it with sorting requests by length (#2357).

@tdoublep (Contributor, Author) commented Mar 20, 2024

Update: Added a test case in test_scheduler.py to cover the use_delay option.

@tdoublep (Contributor, Author) commented Mar 21, 2024

Now that 1D query has been merged, the changes from this PR can be effective when applied on top of the main branch. Here is the latest round of benchmarking results. I've also included performance data collected using TGIS (our fork of TGI) as an additional reference point:
[benchmark plots]

Some conclusions here:

  • We can see that introducing the scheduler delay dramatically improves the ITL when the inference server is under stress (>2x in some cases), and helps to close the performance gap to TGIS, which is better than vLLM in terms of ITL.
  • The delay has the effect of processing larger batches of prompts, which worsens the TTFT a bit. However, we can see that the TTFT from vLLM after this change is still significantly better than that of TGIS (>10x in some cases).

[resolved review thread on vllm/core/scheduler.py]
@Yard1 (Collaborator) commented Mar 21, 2024

Looks good. I think it would be even better if we didn't hardcode it to 0.5. I think we could make the argument a float, and if it is <=0, we don't apply the delay.

[resolved review thread on vllm/core/scheduler.py]
@tdoublep (Contributor, Author) commented Mar 21, 2024

Looks good. I think it would be even better if we didn't hardcode it to 0.5. I think we could make the argument a float, and if it is <=0, we don't apply the delay.

@Yard1 Good idea: there is no reason to assume that 0.5 is optimal for all scenarios. I've updated the code accordingly.
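
For reference, the configurable check now boils down to something like this (an illustrative sketch; the exact name and placement of the parameter may differ in the final code):

```python
def passed_scheduling_delay(now: float,
                            last_prompt_end: float,
                            last_prompt_latency: float,
                            delay_factor: float) -> bool:
    """Illustrative sketch: a non-positive delay_factor disables the delay;
    otherwise new prompts wait until delay_factor * last_prompt_latency
    seconds have elapsed since the previous prompt step finished."""
    if delay_factor <= 0:
        return True
    return (now - last_prompt_end) >= delay_factor * last_prompt_latency
```

With delay_factor=0.5 this matches the behaviour benchmarked above, while delay_factor=0 recovers the original scheduling.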

@richardliaw (Collaborator) commented

@Yard1 are you approving this PR?

@Yard1 merged commit cf2f084 into vllm-project:main on Mar 22, 2024
32 checks passed
@tdoublep deleted the scheduler-delay branch on March 22, 2024 at 20:10
@tdoublep (Contributor, Author) commented

@Yard1 thanks for the review and helpful discussion and suggestions.

@rkooo567 (Collaborator) commented

@tdoublep Does vLLM have documentation about configuration? It feels like it would be worth adding this there if it exists, i.e., documenting which config settings optimize throughput over latency, or TTFT over ITL and vice versa. At the moment it seems like these things are not that well documented.

@tdoublep (Contributor, Author) commented

@rkooo567 I agree it would be good to have documentation like that.

The closest thing I can find is the developer documentation, e.g.:
https://docs.vllm.ai/en/latest/dev/engine/llm_engine.html

Perhaps we should consider adding some more pages there to document the ModelConfig, SchedulerConfig, etc.

@rkooo567 (Collaborator) commented

I see. Yeah, +1, we need better documentation for the configs; it seems like there's no holistic page that explains this at the moment.
