
[Core][Distributed] fix pynccl del error #4508

Merged 1 commit into vllm-project:main on May 1, 2024

Conversation

youkaichao (Member)

It is observed in #4488 that CI actually reports errors here, although they are ignored. Essentially, the ncclCommDestroy in NCCLCommunicator.__del__ is never successfully called. We can remove the code so the CI errors stop bothering users.
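A minimal sketch of the fix, assuming the pre-PR shape of the class (NCCLCommunicator and ncclCommDestroy are named in this PR; everything else is illustrative):

```python
# Sketch only: NCCLCommunicator / ncclCommDestroy come from this PR,
# the rest is an assumption about the surrounding structure.
class NCCLCommunicator:
    def __init__(self, comm):
        self.comm = comm  # opaque handle from communicator initialization

    def __del__(self):
        # Before this PR: a collective call that never completed cleanly in CI.
        # ncclCommDestroy(self.comm)
        # After this PR: skip explicit destruction; the OS reclaims the
        # NCCL/GPU resources when the process exits.
        pass
```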

youkaichao (Member, Author) commented May 1, 2024

TL;DR: destroying an NCCL communicator is a collective call, like a broadcast or an allreduce, and is therefore blocking. However, the processes are not guaranteed to call it in the same order, because Python's garbage collector destructs objects in arbitrary order.

We cannot rely on Python's garbage collection system to work here, because modules, and the objects inside modules, are destroyed in arbitrary order.
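For illustration (plain Python, not vLLM code), finalization order at interpreter shutdown is an implementation detail that user code cannot control:

```python
# Plain-Python illustration: finalizer order at interpreter shutdown is
# not something destructors can rely on, so they cannot enforce the
# cross-process, same-order discipline that a collective destroy requires.
class Noisy:
    def __init__(self, name: str):
        self.name = name

    def __del__(self):
        print(f"finalizing {self.name}")

comm = Noisy("nccl_communicator")
actor = Noisy("ray_actor_handle")
# At shutdown, CPython may finalize `comm`, `actor`, and even their modules
# in an order unrelated to the order a collective teardown would need.
```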

The driver process holds a communicator and a handle to the Ray actor; the worker process holds a communicator.

If the driver process calls del on the communicator first, it gets stuck: destruction is a collective call, so the driver must wait for the other process to delete its communicator as well. Meanwhile, the worker process is still running its main loop, waiting for a command from the driver. Thus, we see a deadlock.

If the driver process calls del on the Ray actor first, the worker process hangs in the collective call until the driver process also enters the collective call to destroy its communicator. Both failure modes are sketched below.
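The deadlock pattern can be modeled with a toy example, where threads and a Barrier stand in for processes and a collective call (none of these names are vLLM or NCCL APIs):

```python
import queue
import threading

# Barrier.wait() returns only once *every* participant has entered it,
# just like a collective destroy returns only once every rank calls it.
collective_destroy = threading.Barrier(2)
commands: "queue.Queue[str]" = queue.Queue()

def worker_main_loop() -> None:
    while True:
        cmd = commands.get()            # worker blocks, waiting for the driver
        if cmd == "destroy":
            collective_destroy.wait()   # worker's side of the collective
            return

threading.Thread(target=worker_main_loop, daemon=True).start()

# Driver tears down its communicator first, *without* telling the worker:
try:
    collective_destroy.wait(timeout=2)  # driver's side of the collective
except threading.BrokenBarrierError:
    print("deadlock: driver is inside the collective, worker is in get()")
```

Without the timeout, both sides would block forever: the driver inside the collective, the worker waiting for a command that never arrives.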

Things get even messier with multiple communicators (e.g., the PyTorch NCCL backend holds its own NCCL communicators), because correct teardown would require every process to destroy these communicators in the same order, which is essentially impossible to guarantee from Python.

One possible solution is to add cleanup logic in ray_gpu_executor's __del__ function. That is also questionable, because the ray module itself might be destroyed before ray_gpu_executor.
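A hedged sketch of why that is fragile (RayGPUExecutorSketch is a hypothetical stand-in, not vLLM's executor; only ray.shutdown() is a real Ray API):

```python
import ray  # real dependency; may itself be finalized before our object

class RayGPUExecutorSketch:  # hypothetical stand-in, not vLLM's class
    def __del__(self):
        try:
            # During interpreter shutdown the `ray` module (or its internals)
            # may already be torn down, so even this guarded call can fail
            # or silently do nothing.
            ray.shutdown()
        except Exception:
            pass  # nothing sensible left to do at this point
```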

The ultimate solution might be to provide a context manager, e.g. with vllm.context():, and to ask users to place any vLLM-specific code inside that context. This way, we can run cleanup in __exit__, before Python's garbage collection system starts to shut down the interpreter.
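A hedged sketch of what such an API could look like (vllm.context() does not exist at the time of this PR; destroy_all_communicators is a hypothetical hook):

```python
import contextlib

@contextlib.contextmanager
def vllm_context():
    """Hypothetical sketch of the proposed `with vllm.context()` API."""
    try:
        yield
    finally:
        # Runs in __exit__, while the interpreter is still fully alive, so
        # every rank can tear down its communicators in one deterministic
        # order. `destroy_all_communicators` is hypothetical.
        pass  # destroy_all_communicators()

# Usage: all vLLM-specific work lives inside the context, so cleanup
# happens before interpreter shutdown begins.
with vllm_context():
    ...  # create engines / communicators, run inference
```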

youkaichao (Member, Author)

Per our offline discussion with @zhuohan123 @WoosukKwon @simon-mo @LiuXiaoxuanPKU, we can just skip the destruction to avoid deadlocks.

youkaichao merged commit 6ef09b0 into vllm-project:main on May 1, 2024
48 checks passed
youkaichao deleted the fix_pynccl_del branch on May 1, 2024 at 22:23
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request May 6, 2024
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request May 7, 2024
dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request May 7, 2024
mawong-amd pushed a commit to ROCm/vllm that referenced this pull request Jun 3, 2024