
Add distributed model executor abstraction #3191

Merged: 19 commits, Mar 11, 2024
Conversation

zhuohan123 (Collaborator) commented Mar 5, 2024

This PR pulls the distributed worker management part of LLMEngine out into a new set of classes, namely ModelExecutors. This separates the code for different hardware backends and enables single-node distributed execution without Ray. Specifically, this PR implements 4 types of model executors:

  1. SingleGPUModelExecutor: The previous code path when not using Ray.
  2. SingleGPUModelExecutorAsync: SingleGPUModelExecutor plus several async function calls.
  3. RayDistributedModelExecutor: The previous distributed implementation with Ray.
  4. RayDistributedModelExecutorAsync: RayDistributedModelExecutor plus several async function calls.
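
The sync/async split described above can be sketched roughly as follows. The class names follow the PR description, but the method signatures and bodies are illustrative placeholders, not the actual vLLM code:

```python
import asyncio
from typing import Dict, List


class SingleGPUModelExecutor:
    """Runs the model in-process on a single GPU (no Ray)."""

    def execute_model(self, seq_group_metadata_list: List[Dict]) -> List[str]:
        # In vLLM this would invoke the worker's model runner; here we
        # return a placeholder so the sketch is runnable.
        return [f"output-for-{m['request_id']}" for m in seq_group_metadata_list]


class SingleGPUModelExecutorAsync(SingleGPUModelExecutor):
    """The same executor plus async entry points for the async engine."""

    async def execute_model_async(self, seq_group_metadata_list: List[Dict]) -> List[str]:
        # Offload the blocking call so the event loop stays responsive.
        return await asyncio.to_thread(self.execute_model, seq_group_metadata_list)


if __name__ == "__main__":
    executor = SingleGPUModelExecutorAsync()
    print(asyncio.run(executor.execute_model_async([{"request_id": "r1"}])))
```

The Ray-backed pair follows the same pattern, with the sync class owning the Ray workers and the async subclass wrapping its calls.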

TODOs after this PR

njhill (Collaborator) commented Mar 5, 2024

Thanks @zhuohan123, this is the kind of thing I had in mind in this comment #2898 (comment)! If you like I can rework #2898 to plug into your abstraction.

zhuohan123 (Collaborator, Author) replied:

> Thanks @zhuohan123, this is the kind of thing I had in mind in this comment #2898 (comment)! If you like I can rework #2898 to plug into your abstraction.

Yes, #2898 is the exact next PR I'm thinking about after this PR. This PR is still WIP and I might change things here and there. Let me ping you once this PR is finalized :)

njhill (Collaborator) commented Mar 6, 2024

@zhuohan123 sounds great... yeah I meant once you were finished with this, no rush at all!

@zhuohan123 zhuohan123 changed the title [WIP] Add distributed model executor abstraction Add distributed model executor abstraction Mar 6, 2024
USE_RAY_COMPILED_DAG = bool(os.getenv("VLLM_USE_RAY_COMPILED_DAG", 0))


class RayDistributedModelExecutor:
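
One thing worth noting about the line above: `bool()` over the raw environment string is truthy for any non-empty value, so even `VLLM_USE_RAY_COMPILED_DAG=0` would enable the flag. A small runnable illustration, with a stricter `env_flag` helper that is purely hypothetical (not part of this PR):

```python
import os

# bool() on the raw string is truthy for ANY non-empty value, including "0":
os.environ["VLLM_USE_RAY_COMPILED_DAG"] = "0"
assert bool(os.getenv("VLLM_USE_RAY_COMPILED_DAG", 0)) is True


def env_flag(name: str, default: bool = False) -> bool:
    """Illustrative stricter parse: only common truthy spellings count."""
    value = os.getenv(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes", "on")


assert env_flag("VLLM_USE_RAY_COMPILED_DAG") is False  # "0" now reads as off
```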
Yard1 (Collaborator) commented Mar 6, 2024
I suggest adding a common abstract class for the different model executors, so that they all implement the same public API.
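
A minimal sketch of what such a common abstract class could look like; `ExecutorBase` and its method names are assumptions for illustration, not the exact API that ended up in vLLM:

```python
from abc import ABC, abstractmethod
from typing import List


class ExecutorBase(ABC):
    """Shared public API that every model executor must implement."""

    @abstractmethod
    def execute_model(self, requests: List[str]) -> List[str]:
        ...

    @abstractmethod
    def check_health(self) -> bool:
        ...


class RayDistributedModelExecutor(ExecutorBase):
    def execute_model(self, requests: List[str]) -> List[str]:
        return [f"ray:{r}" for r in requests]

    def check_health(self) -> bool:
        return True


# A subclass that forgets part of the API fails at construction time:
class Incomplete(ExecutorBase):
    def execute_model(self, requests: List[str]) -> List[str]:
        return requests


try:
    Incomplete()
except TypeError as err:
    print("rejected:", err)
```

The benefit is that the engine can be written against `ExecutorBase` alone, and missing methods surface immediately rather than at call time.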

zhuohan123 (Collaborator, Author) replied:

@Yard1 Just FYI, one change that I made in this PR is that I moved PlacementGroup into the ParallelConfig. Please let me know if this is not a good idea.

Yard1 (Collaborator) commented Mar 7, 2024

I think it should be fine

WoosukKwon (Collaborator) left a comment

@zhuohan123 Awesome! Thanks for the great work! This refactor substantially cleans up the current system architecture while providing better extensibility 😄.

Overall, I'm happy with the PR and only have small concerns:

  1. I don't feel RayDistributedExecutor and SingleGPUModelRunner are good names. I'd propose RayGPURunner (or RayGPUExecutor) and GPURunner instead. WDYT?
  2. As a result of the refactoring, there is some duplicated code between RayDistributedExecutor and SingleGPUModelRunner. Can we reduce the duplication?

Please check out my review for more details.

Note: GPTQ and Marlin do not have bitwise correctness.
As a result, in this test, we just confirm that the top selected tokens of the
A collaborator commented:

Just wondering: Is our formatter not able to catch this kind of trailing whitespaces?

zhuohan123 (Collaborator, Author) replied:

Yeah seems like this is the case. cc @simon-mo
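
For reference, ruff does ship a trailing-whitespace rule (the pycodestyle warning W291, plus W293 for whitespace-only lines), but these warnings have to be selected explicitly. A possible `pyproject.toml` fragment, assuming a reasonably recent ruff (older versions put the rule selection directly under `[tool.ruff]`):

```toml
[tool.ruff.lint]
# W291: trailing whitespace; W293: whitespace on otherwise-blank lines
extend-select = ["W291", "W293"]
```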

Resolved review threads (outdated) on: vllm/engine/llm_engine.py, vllm/executor/single_gpu_executor_async.py, vllm/executor/utils.py, vllm/executor/ray_distributed_executor.py
njhill (Collaborator) commented Mar 9, 2024

@zhuohan123 this looks great thanks! Related to @WoosukKwon's comment above though, it feels like there's a fair amount of duplication of logic between the implementations which would need to be updated in multiple places any time it changes.

Especially when also thinking about how to rework the multiprocessing abstraction from #2898.

I think that could be addressed with another abstraction layer beneath your ModelExecutor one covering a subset of the implementations - specifically all of the current GPU-process based ones, but not neuron. WDYT? I'd be happy to show what I mean in another branch.

I agree with @Yard1 that it would be good to include an abstract base class.
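
The intermediate layer described here might look something like the sketch below. All names (`GPUExecutorBase`, `_create_worker`, the subclasses) are hypothetical, and closures stand in for real worker processes:

```python
from abc import ABC, abstractmethod
from typing import Callable, List


class GPUExecutorBase(ABC):
    """Shared logic for executors that drive GPU worker processes."""

    def __init__(self, num_workers: int):
        self.workers: List[Callable[[str], str]] = [
            self._create_worker(rank) for rank in range(num_workers)
        ]

    @abstractmethod
    def _create_worker(self, rank: int) -> Callable[[str], str]:
        """Only the worker-launch mechanism differs per backend."""

    def execute_model(self, batch: str) -> List[str]:
        # The common scatter/gather logic lives here exactly once.
        return [worker(batch) for worker in self.workers]


class InProcessGPUExecutor(GPUExecutorBase):
    def _create_worker(self, rank: int) -> Callable[[str], str]:
        return lambda batch: f"rank{rank}:{batch}"


class MultiprocGPUExecutor(GPUExecutorBase):
    def _create_worker(self, rank: int) -> Callable[[str], str]:
        # A real implementation would spawn a process (cf. #2898);
        # a closure stands in for it in this sketch.
        return lambda batch: f"proc{rank}:{batch}"


print(InProcessGPUExecutor(num_workers=2).execute_model("b"))
# ['rank0:b', 'rank1:b']
```

With this shape, a new launch mechanism only overrides `_create_worker`, so the duplicated scatter/gather logic never forks across implementations.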

zhuohan123 (Collaborator, Author) commented:
@njhill @WoosukKwon Regarding the duplicated code: I think I have tried my best to pull shared code out of the two executors. How about we merge this PR first, and then see whether we can further reduce the duplicated logic?

zhuohan123 (Collaborator, Author) commented:
@WoosukKwon This PR is ready for review.

WoosukKwon (Collaborator) left a comment
LGTM! Thanks for addressing my review!

@zhuohan123 zhuohan123 merged commit 4c92270 into main Mar 11, 2024
24 checks passed
starmpcc pushed a commit to starmpcc/vllm that referenced this pull request Mar 14, 2024
binarycrayon commented:
Hi, I don't have the context of this PR, but how can we enable the SingleGPUModelRunner path?

robertgshaw2-neuralmagic added a commit to neuralmagic/nm-vllm that referenced this pull request Mar 15, 2024
SUMMARY:
* upstream merge (sync) up to `54be8a0`

## NOTES

- The updated ruff config has line-length limits, so I had to clean up a lot of files manually. I think `./format.sh` runs yapf and ruff only on the `nm-vllm/vllm` directory, whereas our automation runs on everything in `nm-vllm`, so it was a bit tricky to catch why the automation was failing. cc @varun-sundar-rabindranath, please review the benchmark directory in detail.

### Primary upstream changes:

#### Kernels
- [`batched_rotary_embedding`
](vllm-project@7e9bd08)
- [`gelu_tanh_and_mul`]()

#### Core
- [`LLMEngine` refactor](vllm-project#3191)
<<< adds new layer of abstraction to vLLM. **All should look at this**

TEST PLAN:
- nightly automation

zhuohan123 (Collaborator, Author) replied:

> Hi I don't have the context of this pr, but how can we enable SingleGPUModelRunner path?

By default, if you use a single GPU you will go down that path.
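
The dispatch implied by this answer can be sketched as follows; the config fields and selection logic here are illustrative assumptions, not the exact vLLM code:

```python
from dataclasses import dataclass


@dataclass
class ParallelConfig:
    """Hypothetical stand-in for vLLM's parallel configuration."""
    world_size: int = 1
    worker_use_ray: bool = False


def select_executor(config: ParallelConfig) -> str:
    # Multiple GPUs (or an explicit Ray request) route to the Ray executor;
    # a single GPU without Ray takes the in-process path.
    if config.worker_use_ray or config.world_size > 1:
        return "RayDistributedModelExecutor"
    return "SingleGPUModelExecutor"


assert select_executor(ParallelConfig()) == "SingleGPUModelExecutor"
assert select_executor(ParallelConfig(world_size=4, worker_use_ray=True)) == "RayDistributedModelExecutor"
```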

dtransposed pushed a commit to afeldman-nm/vllm that referenced this pull request Mar 26, 2024
@zhuohan123 zhuohan123 deleted the add-executor-abstraction branch April 26, 2024 00:27
5 participants