"distributed" NCCL tests fail when having more than 3 GPUs #46248
Comments
Thanks for reporting! I can confirm I can repro this on a 4-GPU machine, and it indeed looks like we should have WORLD_SIZE = no. of GPUs used in the test. I'm not sure why we hardcode that value to '3' in run_test.py.
We recently checked in a batch of NCCL changes. Not sure if this failure is related. @osalpekar @mingzhe09088
Imo the usage of only 3 ranks makes sense and PyTorch should handle this. Given that the multi-GPU-per-process mode seems to be deprecated in DDP, a valid fix would include limiting the test to 1 GPU per process. But that should still work even when more GPUs are available. If you can point me at the commits, I can try whether they help.
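For context, a minimal sketch of the one-GPU-per-process DDP pattern suggested here; the toy model and setup are illustrative, not taken from the test suite:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # Assumes MASTER_ADDR/MASTER_PORT are set in the environment (env:// init).
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # One process per GPU: pin this process to exactly one device.
    torch.cuda.set_device(rank)
    model = torch.nn.Linear(10, 10).cuda(rank)
    # Passing a single device id is the non-deprecated single-GPU DDP mode.
    ddp_model = DDP(model, device_ids=[rank])
    return ddp_model
```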
I don't think we are hardcoding the number of GPUs to 2 or 3. For the failed test, we do check all available GPUs and assign them evenly among processes here: https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/distributed/distributed_test.py#L2655.
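For readers following along, the even assignment works out to slicing the visible GPUs into contiguous per-rank chunks; a rough Python sketch of that logic (not the actual helper from distributed_test.py):

```python
import torch

def gpus_for_rank(world_size, rank):
    # Split all visible GPUs into world_size contiguous chunks, e.g. with
    # 6 GPUs and 3 ranks: rank 0 -> [0, 1], rank 1 -> [2, 3], rank 2 -> [4, 5].
    visible = list(range(torch.cuda.device_count()))
    per_rank = len(visible) // world_size
    return visible[rank * per_rank : (rank + 1) * per_rank]
```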
We do assign GPUs evenly to processes, but isn't the issue that only 3 processes are started instead of the expected 6? It seems that issue comes from here: https://github.com/pytorch/pytorch/blob/v1.7.0-rc1/test/run_test.py#L181, which sets WORLD_SIZE to 3.
See my comment above (#46248 (comment)): using 3 processes on a system with 6 GPUs should be fully valid. It can be discussed whether that should use only 3 GPUs (1 per process) or all 6. It would be good to know where/why NCCL tries to use 6 ranks when only 3 are participating.
OK, some more (printf) debugging: I recorded the relevant invocations. This still happens on master, so it is not fixed, and I am fairly sure it is a bug in the framework code. Continuing to investigate...
OK, I likely found the cause after a lot of tedious debugging. The TL;DR is that the communicator created for the implicit barrier during process group creation gets reused where it should not be. What happens is the following:
So at some point a wrong communicator seems to be used. Going a bit back, I see that prior to the creation of the DDP an allreduce happens among all 3 processes (see torch/testing/_internal/distributed/distributed_test.py, lines 371 to 376 at commit 1f791c0).
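The embedded snippet did not survive the copy, but the pattern being referenced is an allreduce on a small CUDA tensor before DDP is constructed; a hedged stand-in (not the actual test code):

```python
import torch
import torch.distributed as dist

def pre_ddp_allreduce():
    # Stand-in for the pre-DDP collective referenced above: every rank
    # allreduces a small tensor, so ProcessGroupNCCL creates (and caches)
    # a communicator for whichever device that tensor happens to live on.
    tensor = torch.ones(2, 2).cuda()  # lands on the process's current GPU
    dist.all_reduce(tensor)
    return tensor
```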
When broadcasting a single tensor, only 1 GPU, namely the first, is used. So rank 0 finds an already existing communicator but the others don't, which explains the blocking. See torch/lib/c10d/ProcessGroupNCCL.cpp, lines 1070 to 1072 at commit 1f791c0.
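In Python terms, the caching cited above behaves roughly like a dict keyed by the device list of the collective; the hang then looks like this (a simplified analogy of the C++ code, not its actual implementation; create_nccl_comm is a stub):

```python
def create_nccl_comm(devices):
    # Stub standing in for the real (blocking, collective) NCCL comm setup.
    return ("comm", tuple(devices))

comm_cache = {}  # maps a device key such as "0" or "2" to a communicator

def get_communicator(devices):
    key = ",".join(str(d) for d in devices)
    if key not in comm_cache:
        # In the real code this step is collective: it only completes once
        # every rank creates a communicator for the same group, which is
        # where the hang comes from when ranks disagree about the devices.
        comm_cache[key] = create_nccl_comm(devices)
    return comm_cache[key]

# rank 0: barrier used GPU 0, broadcast also uses GPU 0 -> cache hit, reuse.
# rank 1: barrier used GPU 1, broadcast uses GPU 2      -> cache miss, tries
#         to create a new communicator and waits for rank 0, which never joins.
```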
My recommendation would be to completely disallow using DDP with more than 1 GPU (raise an exception) and change the test(s) to use only 1 GPU per rank. Edit: the first communicator is created by the implicit barrier call when the process group is created.
@rohan-varma @mrshenli How should we continue here? I'm specifically concerned there are deeper issues. For example, I traced the broadcasts to …
The ProcessGroupNCCL::barrier implementation assumes that when 1 GPU per rank is used, the GPU index equals the rank. Due to NCCL communicator reuse, this leads to rank 0 using the (somewhat) temporary communicator, while the other processes might use other GPUs, leading to them trying to create a new communicator and waiting for rank 0 until it creates a new (potentially unrelated) one. See pytorch#46248 for details.
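A Python-level sketch of the assumption described in that commit message (the exact fallback in ProcessGroupNCCL.cpp may differ slightly):

```python
import torch

def barrier_device_guess(rank, used_device_idxs):
    # If this process has already used GPUs, the barrier runs on those.
    if used_device_idxs:
        return sorted(used_device_idxs)
    # Otherwise it guesses "GPU index == rank (mod #GPUs)". With 6 GPUs split
    # as rank 0 -> [0, 1], rank 1 -> [2, 3], rank 2 -> [4, 5], that guess is
    # only consistent with the later collectives for rank 0.
    return [rank % torch.cuda.device_count()]
```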
Hey @Flamefire, sorry for dropping the ball on this.
This is a great discovery. I wonder whether, for this use case, the application should create two different process groups.
We totally want to go there, but there are legacy applications that are preventing us from retiring that DDP mode. Hopefully, we can retire it in the next few releases.
IIUC, users can create three process groups and call allreduce on GPUs [0, 1, 2] and then [0, 5, 6] respectively, which will hit the error you mentioned above. It seems hard to tell process 0 not to use the cached communicator and to create a new one instead for the second call without hurting perf. But since the second call will time out (please correct me if I am wrong), would it be sufficient to provide a better error message (maybe by checking the store)? cc @osalpekar: does this belong to the NCCL reliability project?
Thanks for the investigation! We've been working on a number of fixes for NCCL reliability in general, so we should take this case into account as well. Let me dig into this and try to fix the incorrect communicator usage.
Not sure how that would help. The underlying problem is an implicit (from the user's view) barrier on creating the PG, which uses a heuristic to decide which GPUs to use. And that heuristic fails because other GPUs are later used by the second and subsequent ranks. So one solution would be to specify the GPUs to use when creating the PG, or to not cache the comm for the implicit barrier, which is easier.
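Continuing the cache analogy from the earlier sketch, the "do not cache the comm for the implicit barrier" idea amounts to dropping the entry again after use; comm_cache and create_comm below are hypothetical stand-ins, not PyTorch APIs:

```python
def barrier_then_drop_comm(rank, num_gpus, comm_cache, create_comm):
    # Sketch of the proposal: build a communicator for the guessed barrier
    # device as usual, but remove it afterwards so a later collective on a
    # different device cannot half-match it across ranks.
    key = str(rank % num_gpus)      # the heuristic device guess
    if key not in comm_cache:
        comm_cache[key] = create_comm(key)
    comm = comm_cache[key]
    # ... run the barrier allreduce with comm ...
    comm_cache.pop(key, None)       # drop it again so it is never reused
    return comm
```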
As the usage is perfectly valid from a user's perspective, having to create different PGs is not great. However, storing more info, e.g. the expected number of ranks, in the store would allow detecting and reporting the later issue, which is better than the entirely cryptic error. It could even suggest a different GPU distribution (i.e. round-robin), but I guess not caching (or removing afterwards) the communicator created based on that heuristic should work around the problem.
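A hedged sketch of the "store more info in the store" idea using the public c10d store interface; the key names and layout are made up for illustration:

```python
def publish_and_report_devices(store, rank, world_size, my_gpus):
    # Each rank publishes which GPUs it intends to use (illustrative keys).
    store.set(f"devices/{rank}", ",".join(str(g) for g in my_gpus))
    if rank == 0:
        for r in range(world_size):
            devs = store.get(f"devices/{r}").decode()
            # With this information, a mismatch between the implicit-barrier
            # device guess and the devices a rank actually uses could be
            # reported as a clear error instead of a hang or cryptic failure.
            print(f"rank {r} will use GPUs {devs}")
```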
🐛 Bug
When running the distributed NCCL tests on a system with more than 3 GPUs available, the tests fail with a connection-refused error.
Tracing that down via NCCL_DEBUG=INFO reveals the warning "bootstrap.cc:190 NCCL WARN Bootstrap Root : mismatch in rank count from procs 6 : 3", which hints that 6 GPUs are available and hence 6 ranks are expected, but only 3 processes are started, which is indeed what happens; see https://github.com/pytorch/pytorch/blob/v1.7.0-rc1/test/run_test.py#L181. As a test I increased the WORLD_SIZE to 6 and the tests succeeded.
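For illustration, deriving the world size from the visible GPUs rather than hardcoding it would look roughly like this (a sketch, not the actual run_test.py change):

```python
import os
import torch

# Sketch: size the test job from the hardware instead of a fixed "3", so the
# rank count NCCL expects matches the number of processes that get started.
world_size = torch.cuda.device_count() if torch.cuda.is_available() else 3
os.environ["WORLD_SIZE"] = str(world_size)
```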
To Reproduce
Steps to reproduce the behavior:
NCCL_DEBUG="INFO" TEMP_DIR=/tmp/tmp74prol6l BACKEND=nccl INIT_METHOD=env:// WORLD_SIZE=3 TEST_REPORT_SOURCE_OVERRIDE=dist-nccl python distributed/test_distributed_spawn.py --verbose TestDistBackendWithFork.test_DistributedDataParallel
(extracted from run_test.py)
Relevant part of the output:
Environment
cc @mruberry @VitalyFedyunin @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski