NCCL_BLOCKING_WAIT=1 makes training extremely slow (but if not set then OOM on one device will hang training) #50820
Comments
@netw0rkf10w An alternative to … Here are the docs for more information: https://pytorch.org/docs/stable/distributed.html?highlight=init_process_group#torch.distributed.init_process_group Using …
@osalpekar Thanks for your reply. The last time I checked the documentation, there was no …
@netw0rkf10w Yes, this is a new feature that was introduced in PyTorch 1.7. It will not work with earlier PyTorch versions.
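The truncated replies above appear to refer to the timeout argument of torch.distributed.init_process_group (the linked docs) together with NCCL_ASYNC_ERROR_HANDLING, the asynchronous NCCL error handling added in PyTorch 1.7. A minimal sketch, assuming that is the feature being discussed:

import datetime
import os

import torch.distributed as dist

# Assumption: NCCL_ASYNC_ERROR_HANDLING (PyTorch 1.7+) is the alternative the
# truncated reply refers to. It crashes a rank instead of hanging when a
# collective exceeds the process-group timeout, without the per-call
# synchronization cost of NCCL_BLOCKING_WAIT.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"  # must be set before init_process_group

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=30),  # collectives running past this abort the rank
)

With this, a rank that dies (e.g., from OOM) should cause the surviving ranks to abort once the timeout expires, instead of spinning at 100% utilization indefinitely.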
I encountered the same error. I used to get OOM and the program would just crash automatically, but now, for some reason, I also see the same behavior (one GPU at 0% utilization, the others at full). Is there any reason why OOM can no longer be caught?
I also hit the same problem in my training: one GPU at 0%, the others at 100%, and the training process is stuck.
I also hit the same problem, did you solve it?
@zzj403 I'm facing the same problem: one or two GPUs have 0% utilization and the others have 100%. Did you find a solution?
🐛 Bug
This issue is related to #42107 (torch.distributed.launch: despite errors, training continues on some GPUs without printing any logs), which is quite critical: in multi-GPU training with DDP, if one GPU runs out of memory, the GPU utilization of the others gets stuck at 100% forever without training anything. (Imagine burning your allocated GPU resources without knowing it, e.g., while sleeping.)
In #42107 @mrshenli suggested setting
NCCL_BLOCKING_WAIT=1
so that the NCCL timeout is taken into account. I did, and it worked. However, it was only a few days ago that I realized that doing this makes training much slower. I ran some benchmarks and found that training time on 4 or 8 GPUs is about the same as on 1 GPU (sometimes even slower).

To Reproduce
Steps to reproduce the behavior:
In bashrc, add export NCCL_BLOCKING_WAIT=1, then launch a multi-GPU DDP training run.
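For concreteness, a minimal reproduction sketch (the script name, tensor sizes, and the deliberate over-allocation are hypothetical, not taken from the report): it assumes NCCL_BLOCKING_WAIT is exported as above and forces an OOM on rank 0 so that the remaining ranks end up waiting in the next collective.

# repro.py -- hypothetical reproduction script, launched e.g. with
#   python -m torch.distributed.launch --nproc_per_node=4 --use_env repro.py
import datetime
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(
        backend="nccl",
        # Only enforced as a hard limit when NCCL_BLOCKING_WAIT (or
        # NCCL_ASYNC_ERROR_HANDLING) is set in the environment.
        timeout=datetime.timedelta(minutes=5),
    )
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(1024, 1024).cuda(local_rank), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(1000):
        if dist.get_rank() == 0 and step == 10:
            # Deliberately over-allocate on a single rank to trigger OOM there;
            # the other ranks then block in the allreduce of the next backward().
            _ = torch.empty(1 << 38, device="cuda")
        x = torch.randn(64, 1024, device="cuda")
        loss = model(x).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()


if __name__ == "__main__":
    main()

Without NCCL_BLOCKING_WAIT (or the async error handling sketched earlier), the surviving ranks in this sketch sit at 100% GPU utilization indefinitely; with it, they raise after the timeout, but every collective becomes blocking, which matches the slowdown this issue reports.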
Expected behavior
Environment
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd