
NCCL_BLOCKING_WAIT=1 makes training extremely slow (but if not set then OOM on one device will hang training) #50820

Open
netw0rkf10w opened this issue Jan 20, 2021 · 7 comments
Labels
oncall: distributed, triaged

Comments

@netw0rkf10w

netw0rkf10w commented Jan 20, 2021

🐛 Bug

This issue is related to #42107 (torch.distributed.launch: despite errors, training continues on some GPUs without printing any logs), which is quite critical: in multi-GPU training with DDP, if one GPU runs out of memory, the GPU utilization of the others is stuck at 100% forever without training anything. (Imagine burning your allocated GPU resources without knowing it, e.g., while you are asleep.)

In #42107 @mrshenli suggested setting NCCL_BLOCKING_WAIT=1 so that the NCCL timeout is taken into account. I did, and it worked. However, only a few days ago did I realize that doing this makes training much slower. I ran some benchmarks and found that training time on 4 or 8 GPUs is about the same as on 1 GPU (sometimes even slower).

To Reproduce

Steps to reproduce the behavior:

  1. In your .bashrc, add export NCCL_BLOCKING_WAIT=1.
  2. Start your training on multiple GPUs using DDP (a minimal sketch is given after this list).
  3. It should be as slow as on a single GPU.
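
For concreteness, here is a minimal sketch of the kind of setup described in the steps above (not the reporter's actual code; the model, sizes, timeout, and launch command are illustrative assumptions):

```python
# repro.py -- illustrative DDP loop (a sketch, not the reporter's script).
# Assumed launch: python -m torch.distributed.launch --nproc_per_node=4 repro.py
# with `export NCCL_BLOCKING_WAIT=1` already in the environment.
import argparse
from datetime import timedelta

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to each worker process.
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)
    # With NCCL_BLOCKING_WAIT=1, every collective blocks the main thread until it
    # completes or hits this timeout, which is where the reported slowdown comes from.
    dist.init_process_group(backend="nccl", init_method="env://",
                            timeout=timedelta(minutes=30))

    model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[args.local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(100):
        x = torch.randn(64, 1024).cuda()
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()  # gradient allreduce runs here
        optimizer.step()


if __name__ == "__main__":
    main()
```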

Expected behavior

  • By default, training should stop whenever there is an issue.
  • The above, without sacrificing performance.

Environment

  • PyTorch Version (e.g., 1.0): 1.7.1
  • OS (e.g., Linux): Linux
  • Python version: 3.7.9
  • CUDA/cuDNN version: (CUDA 10.1.2, cuDNN 7.6.5, NCCL 2.6.4) or (CUDA 10.2, cuDNN 8.0.4, NCCL 2.7.8).

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd

@ngimel added the oncall: distributed label Jan 20, 2021
@osalpekar added the triaged label Jan 20, 2021
@osalpekar
Member

@netw0rkf10w An alternative to NCCL_BLOCKING_WAIT is NCCL_ASYNC_ERROR_HANDLING. It is expected that NCCL_BLOCKING_WAIT results in anywhere from a 5-60% performance regression depending on your model and environment, since the main thread is blocked until the allreduce in each backward pass completes. On the other hand, NCCL_ASYNC_ERROR_HANDLING scans for errors/collective timeouts asynchronously, so there should be little to no overhead for this option compared to regular training. However, NCCL_ASYNC_ERROR_HANDLING crashes the training process upon detecting an error or timeout (whereas blocking wait throws an exception that users can elect to handle).

Here are the docs for more information: https://pytorch.org/docs/stable/distributed.html?highlight=init_process_group#torch.distributed.init_process_group

Using NCCL_ASYNC_ERROR_HANDLING along with TorchElastic is a good way of automatically detecting errors and restarting the training processes, if that is desired.
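
For concreteness, here is a hedged sketch of what switching to async error handling could look like in each worker (the helper name, timeout value, and env:// rendezvous are assumptions on my part, not something prescribed above):

```python
# Sketch: enable NCCL_ASYNC_ERROR_HANDLING instead of NCCL_BLOCKING_WAIT.
# `init_distributed` is a hypothetical helper; call it once per worker before training.
import os
from datetime import timedelta

import torch
import torch.distributed as dist


def init_distributed(local_rank: int) -> None:
    # Must be set before init_process_group is called in this process.
    os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
    os.environ.pop("NCCL_BLOCKING_WAIT", None)  # drop the earlier workaround

    torch.cuda.set_device(local_rank)
    # If a collective fails or exceeds the timeout, the process is torn down
    # instead of an exception being raised in the training loop (as with blocking wait).
    dist.init_process_group(backend="nccl", init_method="env://",
                            timeout=timedelta(minutes=30))
```

Paired with a TorchElastic launcher, the crashed worker group can then be restarted automatically, as described above.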

@netw0rkf10w
Author

@osalpekar Thanks for your reply. The last time I checked the documentation, there was no NCCL_ASYNC_ERROR_HANDLING and no information on the performance regression caused by NCCL_BLOCKING_WAIT. I have checked again and found that both are now well documented, so I guess this feature is quite recent. Could you tell me whether it works with PyTorch 1.6.0 as well, or only with >= 1.7.0?
Thanks again.

@osalpekar
Member

@netw0rkf10w Yes, this is a new feature that was introduced in PyTorch 1.7; it will not work with earlier PyTorch versions.

@ruotianluo
Contributor

I encountered the same error. I used to get an OOM and the program would just crash automatically. But for some reason, I now see the same behavior (one GPU at 0% utilization, the others at full). Is there any reason why OOM can no longer be caught?

@zzj403

zzj403 commented Jul 17, 2022

> I encountered the same error. I used to get an OOM and the program would just crash automatically. But for some reason, I now see the same behavior (one GPU at 0% utilization, the others at full). Is there any reason why OOM can no longer be caught?

I also hit the same problem in my training: one GPU is at 0%, the others are at 100%, and the training process is stuck.

@PangziZhang523

> I encountered the same error. I used to get an OOM and the program would just crash automatically. But for some reason, I now see the same behavior (one GPU at 0% utilization, the others at full). Is there any reason why OOM can no longer be caught?

> I also hit the same problem in my training: one GPU is at 0%, the others are at 100%, and the training process is stuck.

I also hit the same problem. Did you solve it?

@sankethvedula

@zzj403 I'm facing the same problem: one or two GPUs have 0% utilization and the others are at 100%. Did you find a solution to this problem?
