
[Core][Distributed] fix pynccl del error #4508

Merged 1 commit into vllm-project:main on May 1, 2024

Conversation

youkaichao (Member)

It is observed in #4488 that CI actually reports errors here, although they are ignored. Essentially, the ncclCommDestroy in NCCLCommunicator.__del__ is never successfully called. We can remove the code so the CI errors stop bothering users.
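A minimal sketch of the fix, assuming the pre-PR shape of the class (NCCLCommunicator and ncclCommDestroy are named in this PR; everything else is illustrative):

```python
# Sketch only: NCCLCommunicator / ncclCommDestroy come from this PR,
# the rest is an assumption about the surrounding structure.
class NCCLCommunicator:
    def __init__(self, comm):
        self.comm = comm  # opaque handle from communicator initialization

    def __del__(self):
        # Before this PR: a collective call that never completed cleanly in CI.
        # ncclCommDestroy(self.comm)
        # After this PR: skip explicit destruction; the OS reclaims the
        # NCCL/GPU resources when the process exits.
        pass
```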

youkaichao (Member, Author) commented May 1, 2024

TL;DR: destroying an NCCL communicator is a collective call, like a broadcast or an allreduce, and is therefore blocking. However, the processes are not guaranteed to call it in the same order, because Python's garbage collector destructs objects in arbitrary order.

We cannot rely on Python's garbage collection system to work here, because modules, and the objects inside modules, are destroyed in arbitrary order.
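For illustration (plain Python, not vLLM code), finalization order at interpreter shutdown is an implementation detail that user code cannot control:

```python
# Plain-Python illustration: finalizer order at interpreter shutdown is
# not something destructors can rely on, so they cannot enforce the
# cross-process, same-order discipline that a collective destroy requires.
class Noisy:
    def __init__(self, name: str):
        self.name = name

    def __del__(self):
        print(f"finalizing {self.name}")

comm = Noisy("nccl_communicator")
actor = Noisy("ray_actor_handle")
# At shutdown, CPython may finalize `comm`, `actor`, and even their modules
# in an order unrelated to the order a collective teardown would need.
```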

The driver process holds a communicator and a handle to the Ray actor; the worker process holds a communicator.

If the driver process calls del on the communicator first, it gets stuck: destruction is a collective call, so the driver must wait for the other process to delete its communicator as well. Meanwhile, the worker process is still running its main loop, waiting for a command from the driver. Thus, we see a deadlock.

If the driver process calls del on the Ray actor first, the worker process hangs in the collective call until the driver process also enters the collective call to destroy its communicator. Both failure modes are sketched below.
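The deadlock pattern can be modeled with a toy example, where threads and a Barrier stand in for processes and a collective call (none of these names are vLLM or NCCL APIs):

```python
import queue
import threading

# Barrier.wait() returns only once *every* participant has entered it,
# just like a collective destroy returns only once every rank calls it.
collective_destroy = threading.Barrier(2)
commands: "queue.Queue[str]" = queue.Queue()

def worker_main_loop() -> None:
    while True:
        cmd = commands.get()            # worker blocks, waiting for the driver
        if cmd == "destroy":
            collective_destroy.wait()   # worker's side of the collective
            return

threading.Thread(target=worker_main_loop, daemon=True).start()

# Driver tears down its communicator first, *without* telling the worker:
try:
    collective_destroy.wait(timeout=2)  # driver's side of the collective
except threading.BrokenBarrierError:
    print("deadlock: driver is inside the collective, worker is in get()")
```

Without the timeout, both sides would block forever: the driver inside the collective, the worker waiting for a command that never arrives.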

Things get even messier with multiple communicators (e.g., the PyTorch NCCL backend holds its own NCCL communicators), because correct teardown would require every process to destroy these communicators in the same order, which is essentially impossible to guarantee from Python.

One possible solution is to add cleanup logic in ray_gpu_executor's __del__ function. That is also questionable, because the ray module itself might be destroyed before ray_gpu_executor.
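A hedged sketch of why that is fragile (RayGPUExecutorSketch is a hypothetical stand-in, not vLLM's executor; only ray.shutdown() is a real Ray API):

```python
import ray  # real dependency; may itself be finalized before our object

class RayGPUExecutorSketch:  # hypothetical stand-in, not vLLM's class
    def __del__(self):
        try:
            # During interpreter shutdown the `ray` module (or its internals)
            # may already be torn down, so even this guarded call can fail
            # or silently do nothing.
            ray.shutdown()
        except Exception:
            pass  # nothing sensible left to do at this point
```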

The ultimate solution might be to provide a context manager, e.g. with vllm.context():, and to ask users to place any vLLM-specific code inside that context. This way, we can run cleanup in __exit__, before Python's garbage collection system starts to shut down the interpreter.
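A hedged sketch of what such an API could look like (vllm.context() does not exist at the time of this PR; destroy_all_communicators is a hypothetical hook):

```python
import contextlib

@contextlib.contextmanager
def vllm_context():
    """Hypothetical sketch of the proposed `with vllm.context()` API."""
    try:
        yield
    finally:
        # Runs in __exit__, while the interpreter is still fully alive, so
        # every rank can tear down its communicators in one deterministic
        # order. `destroy_all_communicators` is hypothetical.
        pass  # destroy_all_communicators()

# Usage: all vLLM-specific work lives inside the context, so cleanup
# happens before interpreter shutdown begins.
with vllm_context():
    ...  # create engines / communicators, run inference
```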

youkaichao (Member, Author)

Per our offline discussion with @zhuohan123 @WoosukKwon @simon-mo @LiuXiaoxuanPKU, we can just skip the destruction to avoid deadlocks.

youkaichao merged commit 6ef09b0 into vllm-project:main on May 1, 2024
48 checks passed
youkaichao deleted the fix_pynccl_del branch on May 1, 2024 at 22:23
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request May 6, 2024
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request May 7, 2024
dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request May 7, 2024
mawong-amd pushed a commit to ROCm/vllm that referenced this pull request Jun 3, 2024