-
-
Notifications
You must be signed in to change notification settings - Fork 10.5k
[Bugfix][ROCm] Fixing trying to import non-existent symbols from libnccl.so #25605
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request aims to fix a crash on ROCm when certain NCCL symbols are not present by silently ignoring them during library initialization. However, the current implementation appears to be flawed. It checks if the function pointer is None
, whereas ctypes
is expected to raise an AttributeError
for missing symbols. This suggests the fix may not work as intended. I have provided a suggestion to correctly handle missing symbols using a try...except
block, which should make the fix more robust.
I'm not sure if it is better to exclude nccl specific functions like what you did here or include them like: #25608 |
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
That's also possible, only the condition should not be CUDA/ROCm, but ideally, NCCL version. |
"ncclCommWindowDeregister" | ||
]: | ||
# These symbols require NCCL >= 2.27.03, and having | ||
# an exception here on ROCm platform is not allowed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a logger.warn_once here that mentions that these symbols failed to import and that users should update their nccl/rccl libraries if they want to use them?
…nabled Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
I confirm that it is necessary to be able to run mGPU models in ROCM, it solves the following error:
|
…ccl.so (vllm-project#25605) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
…ccl.so (#25605) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>
Fixing an issue from #24532
ncclCommWindowRegister and ncclCommWindowDeregister only exist on nccl >= 2.27.03
On ROCm having an unhandled exception (trying to import symbols in the above part of the code crashes graph capturing due to the
HIP error: operation not permitted when stream is capturing
error.I suppose the least invasive solution would be to swallow the import error on ROCm for these two symbols silently.
cc @Amir-19