
NCCL_BLOCKING_WAIT=1 makes training extremely slow (but if not set then OOM on one device will hang training) #50820

Open
netw0rkf10w opened this issue Jan 20, 2021 · 7 comments
Labels
oncall: distributed, triaged

Comments

@netw0rkf10w

netw0rkf10w commented Jan 20, 2021

🐛 Bug

This issue is related to #42107 (torch.distributed.launch: despite errors, training continues on some GPUs without printing any logs), which is quite critical: in multi-GPU training with DDP, if one GPU runs out of memory, the GPU utilization of the others is stuck at 100% forever without training anything. (Imagine burning your allocated GPU resources without knowing it, e.g., while you are asleep.)

In #42107 @mrshenli suggested setting NCCL_BLOCKING_WAIT=1 so that the NCCL timeout is taken into account. I did, and it worked. However, only a few days ago did I realize that doing this makes training much slower. I ran some benchmarks and found that training time on 4 or 8 GPUs is about the same as on 1 GPU (sometimes even slower).

To Reproduce

Steps to reproduce the behavior:

  1. In your .bashrc, add export NCCL_BLOCKING_WAIT=1.
  2. Start your training on multiple GPUs using DDP (a minimal sketch is given after this list).
  3. It should be as slow as on a single GPU.
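
For concreteness, here is a minimal sketch of the kind of setup described in the steps above (not the reporter's actual code; the model, sizes, timeout, and launch command are illustrative assumptions):

```python
# repro.py -- illustrative DDP loop (a sketch, not the reporter's script).
# Assumed launch: python -m torch.distributed.launch --nproc_per_node=4 repro.py
# with `export NCCL_BLOCKING_WAIT=1` already in the environment.
import argparse
from datetime import timedelta

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to each worker process.
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)
    # With NCCL_BLOCKING_WAIT=1, every collective blocks the main thread until it
    # completes or hits this timeout, which is where the reported slowdown comes from.
    dist.init_process_group(backend="nccl", init_method="env://",
                            timeout=timedelta(minutes=30))

    model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[args.local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(100):
        x = torch.randn(64, 1024).cuda()
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()  # gradient allreduce runs here
        optimizer.step()


if __name__ == "__main__":
    main()
```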

Expected behavior

  • By default, training should stop whenever there is an issue.
  • The above, without sacrificing performance.

Environment

  • PyTorch Version (e.g., 1.0): 1.7.1
  • OS (e.g., Linux): Linux
  • Python version: 3.7.9
  • CUDA/cuDNN version: (CUDA 10.1.2, cuDNN 7.6.5, NCCL 2.6.4) or (CUDA 10.2, cuDNN 8.0.4, NCCL 2.7.8).

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd

@ngimel added the oncall: distributed label Jan 20, 2021
@osalpekar added the triaged label Jan 20, 2021
@osalpekar
Member

@netw0rkf10w An alternative to NCCL_BLOCKING_WAIT is NCCL_ASYNC_ERROR_HANDLING. It is expected that NCCL_BLOCKING_WAIT results in anywhere from a 5-60% performance regression depending on your model and environment, since the main thread is blocked until the allreduce in each backward pass completes. On the other hand, NCCL_ASYNC_ERROR_HANDLING scans for errors/collective timeouts asynchronously, so there should be little to no overhead for this option compared to regular training. However, NCCL_ASYNC_ERROR_HANDLING crashes the training process upon detecting an error or timeout (whereas blocking wait throws an exception that users can elect to handle).

Here are the docs for more information: https://pytorch.org/docs/stable/distributed.html?highlight=init_process_group#torch.distributed.init_process_group

Using NCCL_ASYNC_ERROR_HANDLING along with TorchElastic is a good way of automatically detecting errors and restarting the training processes, if that is desired.
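
For concreteness, here is a hedged sketch of what switching to async error handling could look like in each worker (the helper name, timeout value, and env:// rendezvous are assumptions on my part, not something prescribed above):

```python
# Sketch: enable NCCL_ASYNC_ERROR_HANDLING instead of NCCL_BLOCKING_WAIT.
# `init_distributed` is a hypothetical helper; call it once per worker before training.
import os
from datetime import timedelta

import torch
import torch.distributed as dist


def init_distributed(local_rank: int) -> None:
    # Must be set before init_process_group is called in this process.
    os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
    os.environ.pop("NCCL_BLOCKING_WAIT", None)  # drop the earlier workaround

    torch.cuda.set_device(local_rank)
    # If a collective fails or exceeds the timeout, the process is torn down
    # instead of an exception being raised in the training loop (as with blocking wait).
    dist.init_process_group(backend="nccl", init_method="env://",
                            timeout=timedelta(minutes=30))
```

Paired with a TorchElastic launcher, the crashed worker group can then be restarted automatically, as described above.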

@netw0rkf10w
Author

@osalpekar Thanks for your reply. The last time I checked the documentation, there was no NCCL_ASYNC_ERROR_HANDLING and no information on the performance regression caused by NCCL_BLOCKING_WAIT. I have checked again and found that both are now well documented, so I guess this feature is quite recent. Could you tell me whether it works with PyTorch 1.6.0 as well, or only with >= 1.7.0?
Thanks again.

@osalpekar
Member

@netw0rkf10w Yes, this is a new feature that was introduced in PyTorch 1.7; it will not work with earlier PyTorch versions.

@ruotianluo
Contributor

I encountered the same error. I used to get an OOM and the program would just crash automatically. But for some reason, I now see the same behavior (one GPU at 0% utilization, the others at full). Is there any reason why OOM can no longer be caught?

@zzj403

zzj403 commented Jul 17, 2022

> I encountered the same error. I used to get an OOM and the program would just crash automatically. But for some reason, I now see the same behavior (one GPU at 0% utilization, the others at full). Is there any reason why OOM can no longer be caught?

I also hit the same problem in my training: one GPU is at 0%, the others are at 100%, and the training process is stuck.

@PangziZhang523

> I encountered the same error. I used to get an OOM and the program would just crash automatically. But for some reason, I now see the same behavior (one GPU at 0% utilization, the others at full). Is there any reason why OOM can no longer be caught?

> I also hit the same problem in my training: one GPU is at 0%, the others are at 100%, and the training process is stuck.

I also hit the same problem. Did you solve it?

@sankethvedula

@zzj403 I'm facing the same problem: one or two GPUs have 0% utilization and the others are at 100%. Did you find a solution to this problem?
