
[Core] remove cupy dependency #3625

Merged: 29 commits, Mar 27, 2024
Conversation

youkaichao (Member)

Separated out from #3442; this PR only removes the cupy dependency.

youkaichao (Member, Author)

Supporting evidence from #3617: this PR solved the cupy initialization issue 👍

youkaichao changed the title from [WIP][Core] remove cupy dependency to [Core] remove cupy dependency on Mar 26, 2024
WoosukKwon (Collaborator)

Can we get confirmation from the AMD people that we can do the same thing to replace PyTorch RCCL? While I believe we don't have to do this in the current PR, I'd like to get confirmation and find someone who can take it on.

simon-mo (Collaborator) left a comment


I made an initial pass and it looks good overall. I think we may be lacking a few more tests in test_pynccl.py?
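
To illustrate the kind of extra coverage being requested, here is a hypothetical all-reduce test in the style of this PR's test_pynccl.py. The NCCLCommunicator import path, its env-var handshake (RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT), and its rank/world_size/all_reduce API are assumptions based on this PR, not confirmed:

```python
# Hypothetical extra test for tests/distributed/test_pynccl.py. Assumes the
# NCCLCommunicator added in this PR reads RANK/WORLD_SIZE/MASTER_ADDR/
# MASTER_PORT from the environment and exposes rank, world_size, and
# all_reduce(tensor) -- these are assumptions, not confirmed API.
import multiprocessing
import os

import torch


def worker_fn(env: dict):
    os.environ.update(env)
    # Import inside the subprocess so CUDA is initialized per worker.
    from vllm.model_executor.parallel_utils.pynccl import NCCLCommunicator
    comm = NCCLCommunicator()
    tensor = torch.ones(1024, dtype=torch.float32).cuda(comm.rank)
    comm.all_reduce(tensor)
    torch.cuda.synchronize()
    # Summing ones across all ranks should yield world_size everywhere.
    assert tensor.mean().cpu().item() == comm.world_size


def test_pynccl_all_reduce(world_size: int = 2):
    # Spawn (not fork) so each worker gets a clean CUDA context.
    ctx = multiprocessing.get_context("spawn")
    procs = []
    for rank in range(world_size):
        env = {
            "RANK": str(rank),
            "WORLD_SIZE": str(world_size),
            "MASTER_ADDR": "localhost",
            "MASTER_PORT": "12345",
        }
        procs.append(ctx.Process(target=worker_fn, args=(env,)))
    for p in procs:
        p.start()
    for p in procs:
        p.join()
        assert p.exitcode == 0
```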

Review threads (resolved):
- CMakeLists.txt (outdated)
- .buildkite/test-pipeline.yaml (outdated)
- tests/distributed/test_comm_ops.py (2 threads, 1 outdated)
- tests/distributed/test_pynccl.py
- vllm/model_executor/parallel_utils/pynccl.py (3 threads, outdated)
youkaichao (Member, Author)

@WoosukKwon @simon-mo thanks for the review! I asked the AMD folks for a review. Let's see if we can get feedback today. If not, we can merge first and let them send fix PRs later.

simon-mo added the release-blocker label (this PR/issue blocks the next release, therefore deserves highest priority) on Mar 27, 2024
WoosukKwon (Collaborator) left a comment


Thanks for the great work! Left some questions.

Review threads (resolved): vllm/model_executor/parallel_utils/pynccl.py (2 threads)
Comment on lines 849 to 854:

    -  # Delete the CUDA graphs before deleting the CuPy NCCL communicator.
    +  # Delete the CUDA graphs before deleting the pynccl communicator.
       # NOTE(woosuk): This is necessary because otherwise deadlocks can
       # happen.
       # FIXME(woosuk): This is a bit hacky. Find a more robust solution.
       self.graph_runners.clear()
       self.cupy_nccl_backend = None
Collaborator

I think this part can be deleted now that we remove CuPy?

youkaichao (Member, Author)

I don't know why this code was needed before, so I don't want to change it 🤣 If you have more context, we can discuss whether it can be deleted.

Collaborator

When using CuPy + CUDA graphs, deadlocks happen when the CuPy backend is deleted before the CUDA graphs using it are deleted. I actually don't know the reason for this, but it doesn't happen when using NCCL through PyTorch, probably because the NCCL communicator managed by PyTorch is deleted at the very end.
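
To make the ordering constraint concrete, here is a minimal sketch of the teardown order being described; the class and attribute names are invented for illustration, and this is not vLLM's actual code:

```python
# Minimal sketch of the teardown ordering under discussion; the class and
# attribute names are invented and this is not vLLM's actual code.
class WorkerLike:
    def shutdown(self):
        # Free the captured CUDA graphs first: the graphs have the
        # communicator's collective kernels baked in, and deleting the
        # backend while graphs still referenced it was observed to
        # deadlock with CuPy.
        self.graph_runners.clear()
        # Only then drop the communicator backend itself.
        self.comm_backend = None
```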

Collaborator

@youkaichao could you check whether we can delete this? You can simply run `python examples/llm_engine_example.py -tp 2` and see if the process hangs.

youkaichao (Member, Author)

Okay, I'll give it a try.

Collaborator

BTW, I do agree with you that we can remove this code some time later. If you'd like to do so, could you please add a TODO in the code?

youkaichao (Member, Author)

I tested the code (with the above two lines removed) 100 times using the shell loop below, and I didn't see any deadlocks. This gives us confidence to remove the code later. I left a comment there to remove it in v0.4.1.
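
The stress loop from the comment, reformatted for readability (same command, unchanged):

```bash
# Run the 2-GPU example 100 times; stop at the first hang or failure.
for i in {1..100}; do
  python3 examples/llm_engine_example.py -tp=2 \
    && echo "Test passed $i times" \
    || break
done
```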

Review threads (resolved): vllm/worker/model_runner.py, vllm/model_executor/parallel_utils/pynccl.py
youkaichao (Member, Author)

Overall I think we should take a small step in each release. Distributed bugs like hangs and deadlocks are highly nondeterministic and difficult to test.

My plan is to use pynccl as a drop-in replacement for cupy in v0.4.0, and collect user feedback to decide the next move for v0.4.1 (e.g. removing tricky code written for cupy).

cc @WoosukKwon

WoosukKwon (Collaborator) left a comment

Thanks again for the great work!
