[core] add nccl symmetric memory for all reduce #24532
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from cb370ec to 7a57a5c.
Code Review
This pull request introduces NCCL symmetric memory for all-reduce operations to improve performance. It achieves this by adding a custom PyTorch operation that uses a pluggable allocator based on ncclMemAlloc. The changes are extensive, touching CUDA graph capturing, device communicators, and adding a new environment variable to control the feature. While the implementation is clever, I have identified a few critical issues. The feature has a hard dependency on a pre-release version of PyTorch, which is a significant risk. There is also a potential resource leak related to NCCL communication windows, and verbose logging is enabled by default during JIT compilation. These issues should be addressed to ensure the stability and usability of this new feature.
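For readers unfamiliar with the mechanism the review refers to, below is a minimal sketch of how a pluggable allocator backed by ncclMemAlloc can be wired into PyTorch. The shared-library path and exported symbol names are illustrative only, and torch.cuda.MemPool / torch.cuda.use_mem_pool are only available in recent PyTorch builds (the pre-release dependency the review flags).

```python
import torch

# Hypothetical shared library: a C++ extension whose exported functions call
# ncclMemAlloc/ncclMemFree (plus window registration) under the hood.
pluggable = torch.cuda.memory.CUDAPluggableAllocator(
    "./nccl_allocator.so",  # illustrative path to the compiled allocator
    "nccl_alloc_plug",      # illustrative name of the alloc entry point
    "nccl_free_plug",       # illustrative name of the free entry point
)

# A memory pool routed through the pluggable allocator (recent PyTorch only).
nccl_pool = torch.cuda.MemPool(pluggable.allocator())

# Tensors created inside this context are backed by ncclMemAlloc'd memory,
# which NCCL can register for the symmetric-memory all-reduce path.
with torch.cuda.use_mem_pool(nccl_pool):
    symm_buf = torch.empty(8 << 20, dtype=torch.bfloat16, device="cuda")
```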
Force-pushed from 7a57a5c to 9431a53.
Thank you for making things work! I left some comments.
Some benchmarking comparing allreduce implementations would be helpful to figure out the input size bounds of the nccl symm mem allreduce.
Force-pushed from 9431a53 to db043bc.
Force-pushed from db043bc to 8609bad.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 8609bad to 6a1c68c.
Force-pushed from ad0b592 to 040903f.
Thank you for the update!
- We need to decide on allreduce algo dispatching. If we leave it for a follow-up PR, then let's keep the constants in one place (see the sketch below).
- The PR needs to have some tests: the allreduce primitive, get_nccl_mem_pool, etc.
- Compilation of the NCCL allocator needs to be improved for better UX in cases where there is no system NCCL installed.
- The PR needs a bit of cleaning (commented code, prints).
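Purely as an illustration of the first point (not code from this PR), size-based dispatching with the constants kept in one place could look like the following; the byte bounds are made-up placeholders pending the benchmarking mentioned earlier.

```python
import torch

# Keep all dispatch constants together so follow-up tuning touches one place.
# These bounds are placeholders, not measured values.
NCCL_SYMM_MEM_MIN_BYTES = 1 << 20    # 1 MiB
NCCL_SYMM_MEM_MAX_BYTES = 512 << 20  # 512 MiB

def dispatch_all_reduce(x: torch.Tensor, symm_all_reduce, default_all_reduce):
    """Use the symmetric-memory all-reduce only inside its profitable size window."""
    nbytes = x.numel() * x.element_size()
    if NCCL_SYMM_MEM_MIN_BYTES <= nbytes <= NCCL_SYMM_MEM_MAX_BYTES:
        return symm_all_reduce(x)
    return default_all_reduce(x)
```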
"Failed to compile NCCL memory allocator. " | ||
"Symmetric memory will be disabled. " | ||
"This is expected if NCCL headers are not available. " | ||
"optionally set VLLM_NCCL_INCLUDE_PATH to point to NCCL header." |
nit: ... point to a directory with NCCL header
It has to be a list of paths, i.e. [envs.VLLM_NCCL_INCLUDE_PATH]; otherwise the string itself will be interpreted as a list (of single characters).
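To make the point concrete, here is a small, self-contained probe (a sketch, not the PR's actual allocator source) showing the list form that torch.utils.cpp_extension.load_inline expects for extra_include_paths:

```python
import os
from torch.utils.cpp_extension import load_inline

# Trivial C++ source that only needs nccl.h and libnccl at build time.
cpp_source = """
#include <nccl.h>
int nccl_version() { int v = 0; ncclGetVersion(&v); return v; }
"""

include_dir = os.environ.get("VLLM_NCCL_INCLUDE_PATH", "/usr/include")

probe = load_inline(
    name="nccl_probe",
    cpp_sources=cpp_source,
    functions=["nccl_version"],
    extra_include_paths=[include_dir],  # must be a list; a bare string would be
                                        # iterated character by character
    extra_ldflags=["-lnccl"],
)
print(probe.nccl_version())
```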
Compilation also fails for me at the linking step. Usually torch comes with NCCL installed; I'd suggest using that one by default, replacing the env var with VLLM_NCCL_PATH (for me it worked with some hacking, e.g. adding symlinks).
Please try it on a system without pre-installed NCCL in order to debug and improve the UX.
load_inline's log also shows:
[1/2] c++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=nccl_allocator -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /usr/local/lib/python3.12/dist-packages/torch/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /usr/include/python3.12 -fPIC -std=c++17 -c /tmp/main.cpp -o main.o
[2/2] c++ main.o -shared -lnccl -L/usr/local/lib/python3.12/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o nccl_allocator.so
/usr/local/lib/python3.12/dist-packages/torch/include is the location where the compiler looks for the headers, and the NCCL header file is at '/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/cuda/nccl.h'.
Systems without the header, or with a missing/incorrect location provided by the user, will fail to compile the allocator, which disables NCCL symm memory allocation and registration by setting _nccl_allocator_failed_to_compile=True; this is the expected behavior.
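The fallback described above could look roughly like this (a sketch under the assumption that the allocator source is built with load_inline; the module-level flag name mirrors the one mentioned in the comment):

```python
from torch.utils.cpp_extension import load_inline

_nccl_allocator_failed_to_compile = False
_nccl_allocator_module = None

def compile_nccl_allocator(cpp_source: str, include_paths: list[str]):
    """Try to build the allocator extension; on failure, disable the feature
    instead of crashing, since missing NCCL headers are a supported setup."""
    global _nccl_allocator_failed_to_compile, _nccl_allocator_module
    try:
        _nccl_allocator_module = load_inline(
            name="nccl_allocator",
            cpp_sources=cpp_source,
            extra_include_paths=include_paths,
            extra_ldflags=["-lnccl"],
            verbose=False,  # keep JIT compilation logs quiet by default
        )
    except Exception:
        _nccl_allocator_failed_to_compile = True
    return _nccl_allocator_module
```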
I understand that we don't crash when it fails to compile. NCCL usually comes with the nvidia-nccl-cu12 package, which is installed along with PyTorch; the path to the NCCL home in my case is ENV_PATH/lib/python3.12/site-packages/nvidia/nccl/. So users are actually able to use the NCCL symm mem even if they don't have a system NCCL. Moreover, we already use a "non-system" NCCL when we bind to it; check out the find_nccl_library function. I would suggest doing something similar.
Added find_nccl_include_paths(), which looks for nccl.h under the nvidia-nccl-cuXX package or in a directory specified with the env var VLLM_NCCL_INCLUDE_PATH. Even without installing nvidia-nccl-cuXX, a proper PyTorch setup for multi-GPU includes nccl.h, which load_inline picks up by default via torch.utils.cpp_extension.include_paths.
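For context, a rough sketch of what such a header-discovery helper might look like is below; this is illustrative, not the actual vLLM implementation, and it assumes the nvidia-nccl-cuXX wheel ships nccl.h under nvidia/nccl/include:

```python
import os
import torch.utils.cpp_extension as cpp_ext

def find_nccl_include_paths() -> list[str]:
    """Collect candidate directories containing nccl.h, in priority order."""
    paths: list[str] = []

    # 1. Directory explicitly provided by the user.
    env_dir = os.environ.get("VLLM_NCCL_INCLUDE_PATH")
    if env_dir:
        paths.append(env_dir)

    # 2. The nvidia-nccl-cuXX wheel installed alongside PyTorch.
    try:
        import nvidia.nccl  # type: ignore
        pkg_include = os.path.join(os.path.dirname(nvidia.nccl.__file__), "include")
        if os.path.isfile(os.path.join(pkg_include, "nccl.h")):
            paths.append(pkg_include)
    except ImportError:
        pass

    # 3. torch's default include paths, which load_inline uses anyway.
    paths.extend(cpp_ext.include_paths())

    # De-duplicate while preserving order, keeping only existing directories.
    return [p for p in dict.fromkeys(paths) if os.path.isdir(p)]
```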
Force-pushed from 34312a3 to cb36b90.
Signed-off-by: Amir Samani <asamani@nvidia.com>
Force-pushed from 41d0e5c to 0bdb252.
Signed-off-by: Michael Goin <mgoin64@gmail.com>
With this PR on ROCm there are now crashes on, among others, …
Could you share the full log? This feature is behind an env var, so it shouldn't touch the default path.
log.txt
Yeah, it's from exposing ncclCommWindowRegister.
I can't say I fully understand how it affects graph capturing, but removing these functions (assuming nobody tries to use them later on), or wrapping the for loop over the NCCLLibrary.exported_functions contents in a try-except, eliminates the crash.
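A sketch of the suggested workaround, assuming a ctypes-based wrapper along the lines of vLLM's NCCLLibrary where exported_functions is a list of (name, restype, argtypes) records (the helper name here is illustrative):

```python
import ctypes

def bind_exported_functions(lib: ctypes.CDLL, exported_functions) -> dict:
    """Bind each listed symbol, skipping any the loaded library does not export
    (e.g. ncclCommWindowRegister on RCCL builds) instead of crashing."""
    funcs = {}
    for name, restype, argtypes in exported_functions:
        try:
            f = getattr(lib, name)  # raises AttributeError if the symbol is absent
        except AttributeError:
            continue  # leave the symbol unbound; callers can check availability
        f.restype = restype
        f.argtypes = argtypes
        funcs[name] = f
    return funcs
```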
Signed-off-by: Amir Samani <asamani@nvidia.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Amir Samani <asamani@nvidia.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>
Purpose
Add NCCL symmetric memory for all-reduce. Currently, because of the limitations of torch.compile, we have to create a custom op and copy the input into a tensor allocated and registered with NCCL.
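As a rough illustration of that approach (a sketch, not the PR's actual op; the names are made up and the buffer below is a plain CUDA tensor standing in for one allocated from the ncclMemAlloc-backed pool):

```python
import torch
import torch.distributed as dist

_symm_buffers: dict[tuple[torch.Size, torch.dtype], torch.Tensor] = {}

def _get_symm_buffer(like: torch.Tensor) -> torch.Tensor:
    """Return a cached staging buffer. In the real feature this would come from
    the NCCL-registered memory pool; torch.empty_like is only a placeholder."""
    key = (like.shape, like.dtype)
    if key not in _symm_buffers:
        _symm_buffers[key] = torch.empty_like(like)
    return _symm_buffers[key]

@torch.library.custom_op("sketch::nccl_symm_all_reduce", mutates_args=())
def nccl_symm_all_reduce(x: torch.Tensor) -> torch.Tensor:
    buf = _get_symm_buffer(x)
    buf.copy_(x)           # copy the input into the (registered) buffer
    dist.all_reduce(buf)   # NCCL can then take the symmetric-memory path
    return buf.clone()

@nccl_symm_all_reduce.register_fake
def _(x: torch.Tensor) -> torch.Tensor:
    # Shape/dtype propagation so torch.compile can trace through the op.
    return torch.empty_like(x)
```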
Test Plan
benchmark_throughput.py
Test Result
before:
after, with the addition of VLLM_USE_NCCL_SYMM_MEM=1 NCCL_NVLS_ENABLE=1 NCCL_CUMEM_ENABLE=1:
To ensure the memory allocation and the collective use symmetric memory, you can set NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=TUNING,REG and should see "NCCL INFO AllReduce [Symmetric]" in the logs.