Use CuPy for CUDA graphs #2811
Conversation
```diff
@@ -71,7 +73,7 @@ def init_process_group(world_size: int, rank: int, host: str,
     if isinstance(cupy, Exception):
         raise ImportError(
-            "NCCLBackend is not available. Please install cupy.") from cupy
+            "NCCLBackend is not available. Please install cupy==13.0.0.") from cupy
```
Suggested change:

```diff
-            "NCCLBackend is not available. Please install cupy==13.0.0.") from cupy
+            "NCCLBackend is not available. Please install cupy-cuda12x==13.0.0.") from cupy
```
Thanks for pointing it out. Actually, there are two issues:
- I changed the PR to use cupy 12.3 instead of 13.0, because cupy 13.0 does not support Python 3.8 (I wasn't able to find a wheel on PyPI).
- Users need to install different versions of cupy depending on their environment. For example, CUDA 11.8 users should install cupy-cuda11x, and ROCm users should install cupy-rocm.
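The per-environment packaging issue above could be handled by suggesting the right wheel for the detected platform. Below is a minimal, hypothetical sketch (not part of the PR; the helper name and the exact version pins are illustrative), following CuPy's published wheel naming (cupy-cuda11x, cupy-cuda12x, cupy-rocm-5-0, ...):

```python
# Hypothetical helper: pick the CuPy wheel to recommend for a given platform,
# since a single "cupy" pin cannot cover CUDA 11.x, CUDA 12.x, and ROCm users.
def cupy_package_for(platform: str, major_version: int) -> str:
    """Return an install spec to suggest in the ImportError message.

    platform: "cuda" or "rocm"; major_version: e.g. 11 or 12 for CUDA.
    The ==12.3.0 pin mirrors the cupy 12.3 version chosen in this PR.
    """
    if platform == "cuda":
        # CuPy ships one wheel per CUDA major version, e.g. cupy-cuda12x.
        return f"cupy-cuda{major_version}x==12.3.0"
    if platform == "rocm":
        # ROCm wheels are versioned per ROCm release, e.g. cupy-rocm-5-0.
        return f"cupy-rocm-{major_version}-0==12.3.0"
    raise ValueError(f"unsupported platform: {platform}")
```

The error message in the diff could then interpolate this suggestion instead of hard-coding one wheel name.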
@WoosukKwon FYI: custom allreduce doesn't work in all cases (e.g., 8 PCIe GPUs), so this fix might be needed anyway.
I think #2731 might be related.
LGTM
This is a temporary fix for the memory leak issue when using CUDA graphs without the custom all-reduce kernel. The PR uses CuPy's NCCL bindings instead of PyTorch's NCCL.
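The `isinstance(cupy, Exception)` check in the diff relies on a deferred-import pattern: the import failure is captured at module load time and only raised when the backend is actually requested. A minimal, self-contained sketch of that pattern (the `load_optional` and `get_backend` names are illustrative, not vLLM's API):

```python
import importlib


def load_optional(module_name: str):
    """Import a module, returning the exception instead of raising on failure.

    This lets module import succeed even when an optional dependency is
    missing; the stored exception is re-raised later with a helpful message.
    """
    try:
        return importlib.import_module(module_name)
    except Exception as exc:  # ImportError, or a driver error at import time
        return exc


def get_backend(module_name: str = "cupy"):
    """Return the optional module, or raise a descriptive ImportError."""
    mod = load_optional(module_name)
    if isinstance(mod, Exception):
        raise ImportError(
            "NCCLBackend is not available. Please install cupy.") from mod
    return mod
```

Chaining with `from mod` preserves the original import failure in the traceback, so users see both the actionable message and the underlying cause.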