Skip to content

[Core] Adding Priority Scheduling#5958

Merged
youkaichao merged 14 commits into
vllm-project:mainfrom
apatke:main
Sep 25, 2024
Merged

[Core] Adding Priority Scheduling#5958
youkaichao merged 14 commits into
vllm-project:mainfrom
apatke:main

Conversation

@apatke
Copy link
Copy Markdown
Contributor

@apatke apatke commented Jun 28, 2024

FILL IN THE PR DESCRIPTION HERE

There are three major changes implemented:


  1. Addition of a new priority scheduling policy to the scheduler config. Also adds a user-defined priority variable to sequence.

  2. All requests in the running queue and the waiting queue are sorted first based on this priority. If there is a tie, it falls back to the FCFS policy.

  3. Force preemption of request from the running queue back into the waiting queue.
    If there are requests in the running queue whose priority is lower than the requests in the waiting queue, they are forcefully preempted out back into the waiting queue to allow immediate execution of the higher priority request.

@njhill @saurabhjha1 @youkaichao @simon-mo

FIX #6077 (link existing issues this PR will resolve)

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE


PR Checklist (Click to Expand)

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] For changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR need to meet the following code quality standards:

  • We adhere to Google Python style guide and Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code need to be well-documented to ensure future contributors can easily understand the code.
  • Include sufficient tests to ensure the project to stay correct and robust. This includes both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. It helps vLLM user understand and utilize the new features or changes.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not go through the PR.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feel confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!

@njhill
Copy link
Copy Markdown
Member

njhill commented Jun 28, 2024

@apatke you need to run ./format.sh on the code to fix the linter errors.

@njhill
Copy link
Copy Markdown
Member

njhill commented Jun 28, 2024

Something we had been discussing is whether it would make sense for the API to take some kind of scheduing_params dataclass containing the priority, to allow for fields related to future scheduling policy additions without having to add them all as separate top-level parameters.

@apatke apatke changed the title [RFC] [Core] Adding Strict Priority Scheduling [Core] Adding Strict Priority Scheduling Sep 5, 2024
Comment thread vllm/config.py Outdated
Comment thread vllm/core/scheduler.py Outdated
@apatke apatke changed the title [Core] Adding Strict Priority Scheduling [Core] Adding Priority Scheduling Sep 6, 2024
@apatke
Copy link
Copy Markdown
Contributor Author

apatke commented Sep 6, 2024

Performance slowdown from _schedule_priority_preemption is <4% with the priority policy for Llama 8B. No performance degradation when policy is not enabled.

priority: Throughput: 14.56 requests/s, 6052.59 tokens/s
fcfs: Throughput: 15.15 requests/s, 6299.22 tokens/s

@apatke
Copy link
Copy Markdown
Contributor Author

apatke commented Sep 9, 2024

@youkaichao Would you be able to take a look at the PR?

Copy link
Copy Markdown
Member

@njhill njhill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @apatke! I did a first pass.

As discussed, it's great that that this has no performance impact when the priority scheduling is disabled which is the most important thing.

Would be good to get @youkaichao's thoughts too.

Comment thread vllm/core/scheduler.py
Comment thread vllm/config.py Outdated
Comment thread vllm/core/scheduler.py Outdated
Comment thread vllm/sequence.py Outdated
Comment thread vllm/entrypoints/llm.py
Comment thread vllm/core/scheduler.py Outdated
Comment thread vllm/core/scheduler.py Outdated
Comment thread vllm/core/scheduler.py Outdated
Comment thread vllm/core/scheduler.py Outdated
Comment thread vllm/core/scheduler.py Outdated
Comment thread vllm/core/scheduler.py
@youkaichao
Copy link
Copy Markdown
Member

will take a look when i have time :)

Copy link
Copy Markdown
Member

@njhill njhill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @apatke, could you resolve the new conflicts?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this necessary to add in vllm's codebase? looks like the code duplication is large.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already a lot of duplication in our benchmarking and examples scripts ... could be a good separate task for someone to look at to streamline that. I will see if there's someone on my side who can look at it.

Copy link
Copy Markdown
Member

@youkaichao youkaichao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I give approval and hand it over to @njhill . I think it is fine since it does not affect default operation

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 24, 2024
@youkaichao youkaichao merged commit 6da1ab6 into vllm-project:main Sep 25, 2024
Manikandan-Thangaraj-ZS0321 added a commit to Manikandan-Thangaraj-ZS0321/vllm that referenced this pull request Sep 25, 2024
* [Kernel] Enable 8-bit weights in Fused Marlin MoE (vllm-project#8032)

Co-authored-by: Dipika <dipikasikka1@gmail.com>

* [Frontend] Expose revision arg in OpenAI server (vllm-project#8501)

* [BugFix] Fix clean shutdown issues (vllm-project#8492)

* [Bugfix][Kernel] Fix build for sm_60 in GGUF kernel (vllm-project#8506)

* [Kernel] AQ AZP 3/4: Asymmetric quantization kernels (vllm-project#7270)

* [doc] update doc on testing and debugging (vllm-project#8514)

* [Bugfix] Bind api server port before starting engine (vllm-project#8491)

* [perf bench] set timeout to debug hanging (vllm-project#8516)

* [misc] small qol fixes for release process (vllm-project#8517)

* [Bugfix] Fix 3.12 builds on main (vllm-project#8510)

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

* [refactor] remove triton based sampler (vllm-project#8524)

* [Frontend] Improve Nullable kv Arg Parsing (vllm-project#8525)

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* [Misc][Bugfix] Disable guided decoding for mistral tokenizer (vllm-project#8521)

* [torch.compile] register allreduce operations as custom ops (vllm-project#8526)

* [Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change (vllm-project#8509)

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>

* [Benchmark] Support sample from HF datasets and image input for benchmark_serving (vllm-project#8495)

* [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (vllm-project#7631)

* [Feature][kernel] tensor parallelism with bitsandbytes quantization (vllm-project#8434)

* [Model] Add mistral function calling format to all models loaded with "mistral" format (vllm-project#8515)

Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

* [Misc] Don't dump contents of kvcache tensors on errors (vllm-project#8527)

* [Bugfix] Fix TP > 1 for new granite (vllm-project#8544)

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

* [doc] improve installation doc (vllm-project#8550)

Co-authored-by: Andy Dai <76841985+Imss27@users.noreply.github.com>

* [CI/Build] Excluding kernels/test_gguf.py from ROCm (vllm-project#8520)

* [Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (vllm-project#8012)

* [CI/Build] fix Dockerfile.cpu on podman (vllm-project#8540)

* [Misc] Add argument to disable FastAPI docs (vllm-project#8554)

* [CI/Build] Avoid CUDA initialization (vllm-project#8534)

* [CI/Build] Update Ruff version (vllm-project#8469)

Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

* [Core][Bugfix][Perf] Introduce `MQLLMEngine` to avoid `asyncio` OH (vllm-project#8157)

Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>

* [Core] *Prompt* logprobs support in Multi-step (vllm-project#8199)

* [Core] zmq: bind only to 127.0.0.1 for local-only usage (vllm-project#8543)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* [Model] Support Solar Model (vllm-project#8386)

Co-authored-by: Michael Goin <michael@neuralmagic.com>

* [AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call (vllm-project#8380)

Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

* [Kernel] Change interface to Mamba selective_state_update for continuous batching (vllm-project#8039)

* [BugFix] Nonzero exit code if MQLLMEngine startup fails (vllm-project#8572)

* [Bugfix] add `dead_error` property to engine client (vllm-project#8574)

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

* [Kernel] Remove marlin moe templating on thread_m_blocks (vllm-project#8573)

Co-authored-by: lwilkinson@neuralmagic.com

* [Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models.  (vllm-project#8545)

* Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" (vllm-project#8593)

* [Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (vllm-project#8616)

* [MISC] remove engine_use_ray in benchmark_throughput.py (vllm-project#8615)

* [Frontend] Use MQLLMEngine for embeddings models too (vllm-project#8584)

* [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (vllm-project#8577)

* [Core] simplify logits resort in _apply_top_k_top_p (vllm-project#8619)

* [Doc] Add documentation for GGUF quantization (vllm-project#8618)

* Create SECURITY.md (vllm-project#8642)

* [CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail (vllm-project#8551)

* [Misc] guard against change in cuda library name (vllm-project#8609)

* [Bugfix] Fix Phi3.5 mini and MoE LoRA inference (vllm-project#8571)

* [bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata (vllm-project#8474)

* [Core] Support Lora lineage and base model metadata management (vllm-project#6315)

* [Model] Add OLMoE (vllm-project#7922)

* [CI/Build] Removing entrypoints/openai/test_embedding.py test from ROCm build (vllm-project#8670)

* [Bugfix] Validate SamplingParam n is an int (vllm-project#8548)

* [Misc] Show AMD GPU topology in `collect_env.py` (vllm-project#8649)

* [Bugfix] Config got an unexpected keyword argument 'engine' (vllm-project#8556)

* [Bugfix][Core] Fix tekken edge case for mistral tokenizer (vllm-project#8640)

* [Doc] neuron documentation update (vllm-project#8671)

Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>

* [Hardware][AWS] update neuron to 2.20 (vllm-project#8676)

Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>

* [Bugfix] Fix incorrect llava next feature size calculation (vllm-project#8496)

* [Core] Rename `PromptInputs` and `inputs`(vllm-project#8673)

* [MISC] add support custom_op check (vllm-project#8557)

Co-authored-by: youkaichao <youkaichao@126.com>

* [Core] Factor out common code in `SequenceData` and `Sequence` (vllm-project#8675)

* [beam search] add output for manually checking the correctness (vllm-project#8684)

* [Kernel] Build flash-attn from source (vllm-project#8245)

* [VLM] Use `SequenceData.from_token_counts` to create dummy data (vllm-project#8687)

* [Doc] Fix typo in AMD installation guide (vllm-project#8689)

* [Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 (vllm-project#8646)

* [dbrx] refactor dbrx experts to extend FusedMoe class (vllm-project#8518)

* [Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu (vllm-project#8643)

* [Bugfix] Refactor composite weight loading logic (vllm-project#8656)

* [ci][build] fix vllm-flash-attn (vllm-project#8699)

* [Model] Refactor BLIP/BLIP-2 to support composite model loading (vllm-project#8407)

* [Misc] Use NamedTuple in Multi-image example (vllm-project#8705)

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* [MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler (vllm-project#8703)

* [Model][VLM] Add LLaVA-Onevision model support (vllm-project#8486)

Co-authored-by: litianjian <litianjian@bytedance.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [SpecDec][Misc] Cleanup, remove bonus token logic. (vllm-project#8701)

* [build] enable existing pytorch (for GH200, aarch64, nightly) (vllm-project#8713)

* [misc] upgrade mistral-common (vllm-project#8715)

* [Bugfix] Avoid some bogus messages RE CUTLASS's revision when building (vllm-project#8702)

* [Bugfix] Fix CPU CMake build (vllm-project#8723)

Co-authored-by: Yuan <yuan.zhou@intel.com>

* [Bugfix] fix docker build for xpu (vllm-project#8652)

* [Core][Frontend] Support Passing Multimodal Processor Kwargs (vllm-project#8657)

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* [Hardware][CPU] Refactor CPU model runner (vllm-project#8729)

* [Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner (vllm-project#8733)

* [Model] Support pp for qwen2-vl (vllm-project#8696)

* [VLM] Fix paligemma, fuyu and persimmon with transformers 4.45 : use config.text_config.vocab_size (vllm-project#8707)

* [CI/Build] use setuptools-scm to set __version__ (vllm-project#4738)

Co-authored-by: youkaichao <youkaichao@126.com>

* [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (vllm-project#7701)

Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

* [Kernel][LoRA]  Add assertion for punica sgmv kernels (vllm-project#7585)

* [Core] Allow IPv6 in VLLM_HOST_IP with zmq (vllm-project#8575)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* Fix typical acceptance sampler with correct recovered token ids (vllm-project#8562)

* Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse (vllm-project#8335)

* [Hardware][AMD] ROCm6.2 upgrade (vllm-project#8674)

* Fix tests in test_scheduler.py that fail with BlockManager V2 (vllm-project#8728)

* re-implement beam search on top of vllm core (vllm-project#8726)

Co-authored-by: Brendan Wong <bjwpokemon@gmail.com>

* Revert "[Core] Rename `PromptInputs` to `PromptType`, and `inputs` to `prompt`" (vllm-project#8750)

* [MISC] Skip dumping inputs when unpicklable (vllm-project#8744)

* [Core][Model] Support loading weights by ID within models (vllm-project#7931)

* [Model] Expose Phi3v num_crops as a mm_processor_kwarg (vllm-project#8658)

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Bugfix] Fix potentially unsafe custom allreduce synchronization (vllm-project#8558)

* [Kernel] Split Marlin MoE kernels into multiple files (vllm-project#8661)

Co-authored-by: mgoin <michael@neuralmagic.com>

* [Frontend] Batch inference for llm.chat() API  (vllm-project#8648)

Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>

* [Bugfix] Fix torch dynamo fixes caused by `replace_parameters` (vllm-project#8748)

* [CI/Build] fix setuptools-scm usage (vllm-project#8771)

* [misc] soft drop beam search (vllm-project#8763)

* [[Misc]Upgrade bitsandbytes to the latest version 0.44.0 (vllm-project#8768)

* [Core][Bugfix] Support prompt_logprobs returned with speculative decoding (vllm-project#8047)

Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>

* [Core] Adding Priority Scheduling (vllm-project#5958)

* [Bugfix] Use heartbeats instead of health checks (vllm-project#8583)

* Fix test_schedule_swapped_simple in test_scheduler.py (vllm-project#8780)

* [Bugfix][Kernel] Implement acquire/release polyfill for Pascal (vllm-project#8776)

* Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 (vllm-project#8752)

* [BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv (vllm-project#8250)

* [Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend (vllm-project#8770)

* [Bugfix] load fc bias from config for eagle (vllm-project#8790)

---------

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Dipika <dipikasikka1@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: sasha0552 <admin@sasha0552.org>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Kevin Lin <42618777+kevin314@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Alex Brooks <alex.brooks@ibm.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: sroy745 <142070531+sroy745@users.noreply.github.com>
Co-authored-by: chenqianfzh <51831990+chenqianfzh@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Andy Dai <76841985+Imss27@users.noreply.github.com>
Co-authored-by: Alexey Kondratiev(AMD) <143633163+alexeykondrat@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Daniele <36171005+dtrifiro@users.noreply.github.com>
Co-authored-by: Jiaxin Shan <seedjeffwan@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Geun, Lim <shing100@Naver.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com>
Co-authored-by: Kuntai Du <kuntai@uchicago.edu>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Charlie Fu <charlifu@amd.com>
Co-authored-by: 盏一 <w@hidva.com>
Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com>
Co-authored-by: Amit Garg <mitgarg17495@gmail.com>
Co-authored-by: William Lin <SolitaryThinker@users.noreply.github.com>
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
Co-authored-by: saumya-saran <saumya.saran@c3.ai>
Co-authored-by: Pastel! <1627301104@qq.com>
Co-authored-by: omrishiv <327609+omrishiv@users.noreply.github.com>
Co-authored-by: zyddnys <zyddnys@outlook.com>
Co-authored-by: youkaichao <youkaichao@126.com>
Co-authored-by: rasmith <Randall.Smith@amd.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Huazhong Ji <hzji210@gmail.com>
Co-authored-by: litianjian <45817262+litianjian@users.noreply.github.com>
Co-authored-by: litianjian <litianjian@bytedance.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
Co-authored-by: Yuan <yuan.zhou@intel.com>
Co-authored-by: Yan Ma <yan.ma@intel.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
Co-authored-by: Yanyi Liu <wolfsonliu@163.com>
Co-authored-by: Jani Monoses <jani.monoses@gmail.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: jiqing-feng <107918818+jiqing-feng@users.noreply.github.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Peter Salas <peter@fixie.ai>
Co-authored-by: Hanzhi Zhou <hanzhi713@gmail.com>
Co-authored-by: Andy <37781802+aandyw@users.noreply.github.com>
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Archit Patke <apatke@illinois.edu>
Co-authored-by: zifeitong <zifeitong@gmail.com>
Co-authored-by: sohamparikh <sohamparikh47@gmail.com>
@ZJUFangzh
Copy link
Copy Markdown

hi, why this priority scheduling not support AsyncLLMEngine?

@schoennenbeck
Copy link
Copy Markdown
Contributor

@youkaichao @njhill Do you know if somebody is already working on supporting this in AsyncLLMEngine? If not I could go ahead open a PR.

@schoennenbeck
Copy link
Copy Markdown
Contributor

Port can be found here: #8850

@tonyaw
Copy link
Copy Markdown

tonyaw commented Oct 10, 2024

Thanks for your effort!
How can I use it with openai client?
The vllm version I'm using is vllm-0.6.3.dev152+gde895f16.d20241010.
Currently, I'm putting priority into 'extra_body' of client.chat.completions.create:

{'model': 'hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4', 'stream': False, 'max_tokens': 20480, 'temperature': 0, 'n': 1, 'seed': 42, 'extra_body': {'top_k': 1, 'priority': 10}}

Is it right?

@schoennenbeck
Copy link
Copy Markdown
Contributor

@tonyaw Yes, that should be correct. The support was added in another PR. Remember that a lower value for priority means earlier handling (this is in line with how python's queue.PriorityQueue works).

@tonyaw
Copy link
Copy Markdown

tonyaw commented Oct 10, 2024

@schoennenbeck Thanks for your prompt response!
Next question is: When the preemption can handle by priority?

        if len(prefills.seq_groups
               ) == 0 and self.scheduler_config.policy == "priority":
            self._schedule_priority_preemption(budget)

Does it mean as long as there is one request in prefill stage, the priority_preemption will never be triggered?

@tonyaw
Copy link
Copy Markdown

tonyaw commented Oct 10, 2024

Also, I realized if "--enable-chunked-prefill" is set, priority scheduling won't be triggered.
To get better performance, I need to enable chunked-prefill, but in this case, I can't use priority scheduling any more.
May I ask the reason?

@tonyaw
Copy link
Copy Markdown

tonyaw commented Oct 11, 2024

@apatke @schoennenbeck ,
I met some issue, and think current code has some issue:
#9272
Could you please help to check? Thanks in advance! :-)

@tonyaw
Copy link
Copy Markdown

tonyaw commented Oct 11, 2024

@apatke and @schoennenbeck , as I mentioned in #9272, even if the priority is propagated successfully, vllm always crashes as long as preemption happens.
I just tested with vllm-0.6.3.dev173+g36ea7907.d20241011. The only change I made is following fix and some logs:
#9277

Could you please help to guide me how to WR it?
Also, I realized --enable_chunked_prefill is default be True for Llama3.1 as it is a long context length model.
Why enable_chunked_prefill can't work with priority scheduling together? It will reduce the vllm performance a lot.

Reproduce procedure:

  1. Start vllm:
python3 -m vllm.entrypoints.openai.api_server --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
        --host 0.0.0.0 --port 8080  --seed 42 --trust-remote-code --scheduling-policy priority \
        --tensor-parallel-size 2 --max-num-seqs 10 --enable_chunked_prefill False
  1. Use openai client to make a 15 concurrent load.
  2. Use another openai client to send some requests with priority -100.
  3. As long as preemption is triggered, vllm crashes:
INFO 10-11 06:51:31 engine.py:292] Added request chat-29ccd21f37fb4a8ea96c1c8c189a6a49.
INFO 10-11 06:51:31 engine.py:294] tonyaw:Added request -100.
INFO 10-11 06:51:31 scheduler.py:1025] tonyaw: len(prefills.seq_groups) = 0
INFO 10-11 06:51:31 scheduler.py:1025] tonyaw: len(prefills.seq_groups) = 0
INFO 10-11 06:51:31 scheduler.py:807] tonyaw: _schedule_priority_preemption: waiting_queue is not None.
INFO 10-11 06:51:31 scheduler.py:808] tonyaw: seq_group chat-29ccd21f37fb4a8ea96c1c8c189a6a49 priority:-100
INFO 10-11 06:51:31 scheduler.py:837] tonyaw: _schedule_priority_preemption: vseq_group chat-55b8bdce4ae14eb8869839561fac50f9 is pop up, and will preempt.
WARNING 10-11 06:51:31 scheduler.py:1493] Sequence group chat-55b8bdce4ae14eb8869839561fac50f9 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
INFO 10-11 06:51:31 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241011-065131.pkl...
WARNING 10-11 06:51:31 model_runner_base.py:143] Failed to pickle inputs of failed execution: Can't get local object 'weak_bind.<locals>.weak_bound'
INFO:     10.254.17.246:54142 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
INFO:     10.254.17.246:54086 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
ERROR 10-11 06:51:31 engine.py:160] ValueError('Error in model execution: seq_group.get_last_latency() should not be called if the seq_group is in prefill phase.')
ERROR 10-11 06:51:31 engine.py:160] Traceback (most recent call last):
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 10-11 06:51:31 engine.py:160]     return func(*args, **kwargs)
ERROR 10-11 06:51:31 engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1698, in execute_model
ERROR 10-11 06:51:31 engine.py:160]     model_input.async_callback()
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 1122, in weak_bound
ERROR 10-11 06:51:31 engine.py:160]     unbound(inst, *args, **kwargs)
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1210, in _process_model_outputs
ERROR 10-11 06:51:31 engine.py:160]     self.do_log_stats(scheduler_outputs, outputs, finished_before,
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1543, in do_log_stats
ERROR 10-11 06:51:31 engine.py:160]     stats = self._get_stats(scheduler_outputs, model_output,
ERROR 10-11 06:51:31 engine.py:160]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1664, in _get_stats
ERROR 10-11 06:51:31 engine.py:160]     latency = seq_group.get_last_latency(now)
ERROR 10-11 06:51:31 engine.py:160]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/sequence.py", line 772, in get_last_latency
ERROR 10-11 06:51:31 engine.py:160]     raise ValueError(
ERROR 10-11 06:51:31 engine.py:160] ValueError: seq_group.get_last_latency() should not be called if the seq_group is in prefill phase.
ERROR 10-11 06:51:31 engine.py:160] 
ERROR 10-11 06:51:31 engine.py:160] The above exception was the direct cause of the following exception:
ERROR 10-11 06:51:31 engine.py:160] 
ERROR 10-11 06:51:31 engine.py:160] Traceback (most recent call last):
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 158, in start
ERROR 10-11 06:51:31 engine.py:160]     self.run_engine_loop()
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 221, in run_engine_loop
ERROR 10-11 06:51:31 engine.py:160]     request_outputs = self.engine_step()
ERROR 10-11 06:51:31 engine.py:160]                       ^^^^^^^^^^^^^^^^^^
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 239, in engine_step
ERROR 10-11 06:51:31 engine.py:160]     raise e
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 230, in engine_step
ERROR 10-11 06:51:31 engine.py:160]     return self.engine.step()
ERROR 10-11 06:51:31 engine.py:160]            ^^^^^^^^^^^^^^^^^^
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1386, in step
ERROR 10-11 06:51:31 engine.py:160]     outputs = self.model_executor.execute_model(
ERROR 10-11 06:51:31 engine.py:160]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 82, in execute_model
ERROR 10-11 06:51:31 engine.py:160]     driver_outputs = self._driver_execute_model(execute_model_req)
ERROR 10-11 06:51:31 engine.py:160]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 155, in _driver_execute_model
ERROR 10-11 06:51:31 engine.py:160]     return self.driver_worker.execute_model(execute_model_req)
ERROR 10-11 06:51:31 engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 10-11 06:51:31 engine.py:160]     output = self.model_runner.execute_model(
ERROR 10-11 06:51:31 engine.py:160]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 10-11 06:51:31 engine.py:160]     return func(*args, **kwargs)
ERROR 10-11 06:51:31 engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 146, in _wrapper
ERROR 10-11 06:51:31 engine.py:160]     raise type(err)(f"Error in model execution: "
ERROR 10-11 06:51:31 engine.py:160] ValueError: Error in model execution: seq_group.get_last_latency() should not be called if the seq_group is in prefill phase.

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
Signed-off-by: Alvant <alvasian@yandex.ru>
garg-amit pushed a commit to garg-amit/vllm that referenced this pull request Oct 28, 2024
Signed-off-by: Amit Garg <mitgarg17495@gmail.com>
kwang1012 pushed a commit to kwang1012/vllm that referenced this pull request Oct 28, 2024
sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024
Signed-off-by: Sumit Dubey <sumit.dubey2@ibm.com>
LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025
Signed-off-by: LeiWang1999 <leiwang1999@outlook.com>
@justadogistaken
Copy link
Copy Markdown

@tonyaw Hi, I'm curious that does "Why enable_chunked_prefill can't work with priority scheduling together" fixed?

@hxt365
Copy link
Copy Markdown

hxt365 commented Apr 2, 2025

Is this usable now? I don't find the usage guide anywhere in the doc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RFC]: Priority Scheduling

8 participants