NCCL error using DDP and PyTorch 1.7 #4420
Comments
Hi, thanks for reporting. |
Yeah, I'm using 1.0.4. Here's the full source for my .py file:
|
ok, I can confirm this is only happening on pytorch 1.7 |
I have the same issue on 1080ti, with V100 GPUs everything works fine. |
@maxjeblick sounds like a driver issue? Edit: |
I tested the following with our examples: so far I was not able to reproduce it with the PyTorch examples :( need to dig deeper |
I can confirm the same error using the latest Lightning and PyTorch using Tesla V100s. Does not happen on a single node with 2 GPUs, but once I go to multiple nodes the error happens. |
same error with A100 gpus. RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8 |
Have the same issue with 2x2080ti on ubuntu 20.04 using pytorch 1.7 and cuda 11. |
Could it be this fix in pytorch? |
pytorch closed their issue because this issue exists and you close this issue because their issue exists... |
@julian3xl are you referring to the one I posted? I was under the impression that the fix was merged into pytorch master. |
Have the same issue with single node 2x rtx 3090 on ubuntu 18.04 using pytorch 1.7, Driver Version: 455.45.01 CUDA Version: 11.1 , pytorch-lightning 1.0.8 |
I am also hitting this and I am not even using lightning. :-( |
@min-xu-ai And can you confirm that in pytorch 1.8 nightly it is fixed? |
Great suggestion! I installed the latest 1.8.0dev version and it still fails, but the error message seems to be more helpful than before. @awaelchli Do you think the underlying error should have been fixed in 1.8.0dev?
My versions:
|
Actually, I found out the reason. It seems that my unit test is trying to start world_size=3 on 2 GPUs. The error message is definitely hard to parse. It would be nice if dist.init_process_group just checked the world_size. FWIW, the gloo backend works fine in this case. |
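For reference, a minimal sketch of such a check before init_process_group; the single-node assumption and the rendezvous address/port are placeholders, not something from this issue:

```python
# Minimal sketch: fail fast when more NCCL ranks are requested than GPUs are visible.
# With the nccl backend each rank normally needs its own GPU; gloo has no such requirement.
import torch
import torch.distributed as dist

def init_distributed(rank: int, world_size: int) -> None:
    if torch.cuda.is_available() and world_size > torch.cuda.device_count():
        raise RuntimeError(
            f"world_size={world_size} but only {torch.cuda.device_count()} GPU(s) visible; "
            "NCCL would fail here with a hard-to-read error."
        )
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(
        backend=backend,
        init_method="tcp://127.0.0.1:29500",  # placeholder address/port
        rank=rank,
        world_size=world_size,
    )
```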
For others who might run into this: In previous PyTorch Lightning versions, the Trainer received an argument |
Hi @NoahDrisort, I fixed it by using raw PyTorch. Then I did not have this error using DataParallel, but the error occurred again using DistributedDataParallel. Then I fixed it by using the Docker image 'nvcr.io/nvidia/pytorch:21.05-py3'. But I haven't checked this image together with PyTorch Lightning. |
If you land here on this thread because you got an NCCL error and it looks like this (not exactly what OP posted):
It may be because you have too little shared memory. The solution is to increase the shared memory (google it for your operating system) or, if you use Docker, to give the container more shared memory. General advice for NCCL errors: run your command with the environment variable NCCL_DEBUG=INFO set, so NCCL prints more detailed diagnostics. |
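For reference, a quick way to see whether /dev/shm is the culprit from inside the training environment; the 1 GiB threshold below is an arbitrary example value, not an NCCL requirement:

```python
# Minimal sketch: check how much shared memory (/dev/shm) the job can use.
# A tiny /dev/shm inside a container is a common cause of NCCL shared-memory errors
# (and of DataLoader worker crashes).
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**30:.2f} GiB, free={free / 2**30:.2f} GiB")
if total < 1 * 2**30:  # 1 GiB threshold is just an example value
    print("Warning: /dev/shm looks small; enlarge it (e.g. Docker's --shm-size) before using NCCL.")
```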
The package versions: …

I changed

```python
if not args.use_env:
    cmd.append("--local_rank={}".format(local_rank))
cmd.extend(args.training_script_args)
```

to

```python
cmd.extend(args.training_script_args)
if not args.use_env:
    cmd.append("--local_rank={}".format(local_rank))
```

In my case, this error is caused by the local_rank parameter not being passed in. It's a bug. |
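For context, torch.distributed.launch in PyTorch 1.x passes --local_rank to every worker unless --use_env is given, so the training script has to accept it. A minimal sketch of that pattern (the GPU binding and backend choice are the usual defaults, not code from this thread):

```python
# Minimal sketch of a worker script started by torch.distributed.launch (PyTorch 1.x).
# The launcher appends --local_rank to each worker's command line unless --use_env is set,
# so the script must parse it; rank and world size come from env vars via the default env:// init.
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)      # bind this process to its GPU
dist.init_process_group(backend="nccl")     # MASTER_ADDR/PORT, RANK, WORLD_SIZE come from the launcher
```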
Would it slow down the training process? |
For those with A100s, |
This disables some important NCCL features, and you could simply use the gloo backend instead (which works fine by default). Disabling these features can potentially decrease performance (especially if you use NVLink). However, I just tried your suggestion and my StyleGAN2-ADA training speed on 4x A6000s not only didn't decrease, it even slightly improved (by 2%). But note that I do not have NVLink. |
Realizing this bug is very misleading, as it seems to be the landing point of every NCCL error for PyTorch and DDP. NCCL errors are varied, since NCCL encompasses CUDA, NVLink, networking (sockets and InfiniBand/RoCE), and other mechanisms like shared memory, and it also performs topology detection to optimize communication between GPUs. So different users will have very different problems, which need to be solved in different ways.

The first thing to do whenever an NCCL error happens, as suggested by the NCCL troubleshooting page, is to run again with NCCL_DEBUG=WARN and look at the warning NCCL prints.

Now, rewinding the bug to try to categorize the different issues... In the first part of the issue (@min-xu-ai and @ohmeow), the error reported by NCCL is the "invalid usage" error quoted earlier in the thread; in @min-xu-ai's case it came from requesting world_size=3 with only 2 GPUs available. Then, @mhpfuchs probably got a …. After that, @brando90 got a …. Setting …. Finally, @awaelchli got a ncclSystemError due to too little shared memory being available; setting NCCL_DEBUG=WARN would have probably printed something like: … and would have helped fix the problem as well. |
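For reference, the debug variable can also be exported from a Python entry point before any distributed setup so that spawned workers inherit it; a minimal sketch (NCCL_DEBUG_SUBSYS is optional and only shown as an example):

```python
# Minimal sketch: enable NCCL's own diagnostics before any distributed setup.
# NCCL_DEBUG=WARN prints only the warning explaining a failure; INFO is more verbose
# and is what maintainers usually ask for in bug reports.
import os

os.environ.setdefault("NCCL_DEBUG", "WARN")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "ALL")  # optional: log every NCCL subsystem

# ...spawn workers / call torch.distributed.init_process_group afterwards, as usual.
```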
Same bug on V100 32GB |
As suggested above,
This works for me on 2 V100 servers. |
export NCCL_SHM_DISABLE=1 |
Fixed for me with PyTorch 1.9, CUDA 11.1, NCCL 2.7.8, V100. |
If you're fine leaving performance on the table, that's OK, but performance using RDMA is much higher than using TCP/IP, plus it puts much less load on the CPU. The most common issue when using RDMA is the memlock limit.
This is usually due to the container not setting enough space for /dev/shm. It can be fixed by launching the container with a larger shared memory size (e.g. Docker's --shm-size option). Also, as a reminder, make sure you run with NCCL_DEBUG=WARN so that NCCL reports what went wrong. |
Could you point me to the location of the NCCL logs? FWIW, I'm working on a single node running CentOS with 2 GPUs and |
NCCL logs will be printed to the standard output if NCCL_DEBUG is set. |
@sjeaugey I'm experiencing the same issue. With 2 nodes, using NCCL_IB_DISABLE makes training extremely slow, but without the flag, NCCL reports an error. I'm not using Docker; the error message looks like: |
@Hanpx20 this is a very different problem. See how to increase your memory limits for RDMA operations here: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#infiniband Another reason for that failure can be a bad mismatch between the container IB stack and the host. |
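For reference, a minimal sketch for checking the memlock limit from Python; the limits.conf and --ulimit hints in the comments are general advice, not specific to this cluster:

```python
# Minimal sketch: inspect the memlock limit that RDMA/InfiniBand memory registration relies on.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

def fmt(value: int) -> str:
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value / 2**20:.1f} MiB"

print(f"RLIMIT_MEMLOCK: soft={fmt(soft)}, hard={fmt(hard)}")
# A small finite limit here is a common reason for InfiniBand registration failures;
# raise it via /etc/security/limits.conf or the container runtime's --ulimit settings.
```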
Fixed for me with NCCL_IB_DISABLE=1 lightning run model ... |
Issue solved after adding
|
If you have an IB fabric, you probably don't want to disable IB, as it would use TCP instead and that may affect performance by an order of magnitude. So, disabling IB is not a solution for everyone, only for those who actually don't need IB but happen to have a misconfigured active IB NIC on their system (or a NIC they don't want to use for various reasons). The real solution is usually to fix the IB configuration and use it. |
🐛 Bug
Getting this error when attempting to use ddp with the "getting started" autoencoder example:
Stack Trace:
To Reproduce
Follow the code in the getting-started example with these parameters to Trainer:
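For reference, a minimal sketch of that reproduction based on the Lightning getting-started autoencoder; gpus=2 and accelerator="ddp" are assumptions standing in for the Trainer parameters that were lost from this report:

```python
# Minimal sketch of the getting-started autoencoder with a multi-GPU DDP Trainer.
# gpus=2 and accelerator="ddp" are assumed values (Lightning 1.0-era API); the exact
# arguments from the original report were not preserved in this thread.
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST
import pytorch_lightning as pl

class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

    def training_step(self, batch, batch_idx):
        x, _ = batch
        x = x.view(x.size(0), -1)
        x_hat = self.decoder(self.encoder(x))
        return F.mse_loss(x_hat, x)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

if __name__ == "__main__":
    dataset = MNIST(".", train=True, download=True, transform=transforms.ToTensor())
    train_loader = DataLoader(dataset, batch_size=32)
    trainer = pl.Trainer(gpus=2, accelerator="ddp", max_epochs=1)
    trainer.fit(LitAutoEncoder(), train_loader)
```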
Expected behavior
For it to train on multiple GPUs :)
Environment
How you installed PyTorch (conda, pip, source): pip