Release NCCL distributed backend from experimental #4921
Conversation
torch/distributed/__init__.py
Outdated
contain correctly-sized tensors on each GPU to be used for output of
the collective.
e.g. output_tensor_lists[i] contains the all_gather
Hi,
The NCCL library provided in the repo is version 1. Version 2 is closed source and you have to download it from NVIDIA and use WITH_SYSTEM_NCCL=1.
Hi Adam,
I also downloaded NCCL2 from the NVIDIA website, tried WITH_SYSTEM_NCCL=1, and specified NCCL_INCLUDE_DIR, NCCL_LIB_DIR, NCCL_ROOT_DIR to install PyTorch. The installed PyTorch version is 0.4.0a0+7703670. When I run the following simple test example (toy.py), this error is thrown:
before init
after init
begin rank 1
Traceback (most recent call last):
File "toy.py", line 32, in <module>
init_processes(args.rank, size, run, 'nccl')
File "toy.py", line 23, in init_processes
fn(rank, size)
File "toy.py", line 11, in run
dist.all_reduce(tensor, op=dist.reduce_op.SUM, group=group)
File "/home/liy/programs/pytorch/torch/distributed/__init__.py", line 326, in all_reduce
return torch._C._dist_all_reduce(tensor, op, group)
RuntimeError: NCCL error in: /home/liy/programs/pytorch/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:324, unhandled system error
Am I doing something wrong?
Best Regards,
Lissa
cat toy.py:
import torch
import torch.distributed as dist
import argparse
def run(rank, size):
""" Simple point-to-point communication. """
print('begin rank', rank)
group = dist.new_group([0, 1])
tensor = torch.ones(1).cuda()
dist.all_reduce(tensor, op=dist.reduce_op.SUM, group=group)
print('Rank ', rank, ' has data ', tensor[0])
def init_processes(rank, size, fn, backend):
""" Initialize the distributed environment. """
print('before init')
init_method="tcp://10.6.48.150:13530"
dist.init_process_group(backend,rank=rank,world_size=size,init_method=init_method)
print('after init')
fn(rank, size)
if __name__ == "__main__":
size = 2
parser = argparse.ArgumentParser()
parser.add_argument('--rank', default=-1, type=int,
help='rank')
args = parser.parse_args()
init_processes(args.rank, size, run, 'nccl')
Hi Adam,
My system info is as follows:
CentOS Linux release 7.3.1611 (Core)
kernel version: 3.10.0-693.11.6.el7.x86_64
CUDA 8 and cuDNN 6
gcc 4.9.2
No IB devices, so I set WITH_GLOO_IBVERBS=0
Do you think CUDA 8 causes the error?
Best Regards,
Lissa
Hmm, I don't know. It looks good at first glance, and the error is coming from somewhere inside NCCL where we can't easily tell what's wrong 😕
I am using CUDA 8 and the NVIDIA driver version is 375.51. Could this cause the NCCL error?
Best Regards,
Lissa
Sorry, I don't know that.
@Yi-Li this could possibly mean that NCCL is picking up the wrong interface. What does your ifconfig show?
@Yi-Li I would first try to set NCCL_SOCKET_IFNAME to the interface you would like NCCL to communicate over. Also, setting NCCL_DEBUG=INFO and running your program will give more info on why it is failing in NCCL.
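For reference, a minimal sketch of setting those two variables from inside the script rather than the shell (the interface name eth0 and the rank are placeholders; the init_method is the same one used in toy.py above):

import os
import torch.distributed as dist

# Placeholder interface name: pick the one from `ifconfig` that actually
# routes between the two nodes.
os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'
# Ask NCCL to print its own diagnostics so the "unhandled system error"
# comes with some context.
os.environ['NCCL_DEBUG'] = 'INFO'

# Set the variables before the NCCL communicator is created, i.e. before
# init_process_group and the first collective run.
dist.init_process_group('nccl', rank=0, world_size=2,
                        init_method='tcp://10.6.48.150:13530')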
Hi Teng Li,
I don't know how to set it. Which interfaces can I choose from?
NCCL2 seems to need glibc 2.19 or higher, but mine is glibc 2.17, which may be causing the error.
Best Regards,
Lissa
Yeah, I have seen similar issues with a mismatched glibc version.
Hi Teng Li,
My system administrator seems to have gotten glibc 2.19 ready. I want to test the following small code (try.py), which you posted on the forum. As you suggested, I will run "python try.py 1" on one terminal and "python try.py 2" on another terminal, and should get the output of "2".
I want to check whether this code can be run on two GPU nodes. In init_processes, I set init_method="tcp://10.6.48.150:13530". My understanding is that 10.6.48.150 is the IP address of the master GPU node. Is it correct that we don't need to set the IP address of any slave GPU node? Thank you for your help!
cat try.py
import torch
import torch.distributed as dist
import argparse
def run(rank, size):
""" Simple point-to-point communication. """
print('begin rank', rank)
group = dist.new_group([0, 1])
tensor = torch.ones(1).cuda()
dist.all_reduce(tensor, op=dist.reduce_op.SUM, group=group)
print('Rank ', rank, ' has data ', tensor[0])
def init_processes(rank, size, fn, backend):
""" Initialize the distributed environment. """
print('before init')
init_method="tcp://10.6.48.150:13530"
dist.init_process_group(backend,rank=rank,world_size=size,init_method=init_method)
print('after init')
fn(rank, size)
if __name__ == "__main__":
size = 2
parser = argparse.ArgumentParser()
parser.add_argument('--rank', default=-1, type=int,help='rank')
args = parser.parse_args()
init_processes(args.rank, size, run, 'nccl')
Exactly. That address should be the IP and port that you gave the first process to listen at.
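In other words, both processes pass the identical init_method string; only the rank differs. A minimal sketch under that assumption (address and port taken from the example above; run once with --rank 0 on the master node and once with --rank 1 on the other node):

import argparse
import torch.distributed as dist

# The same URL is used on every node: it always names the address that
# rank 0 listens on, no matter which machine runs this process.
MASTER_URL = 'tcp://10.6.48.150:13530'

parser = argparse.ArgumentParser()
parser.add_argument('--rank', type=int, required=True)
args = parser.parse_args()

dist.init_process_group('nccl', rank=args.rank, world_size=2,
                        init_method=MASTER_URL)
print('rank', args.rank, 'initialized')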
I'm having the same issue as @Yi-Li with the imagenet example. I tried the toy.py she posted and found I couldn't reproduce it. I realized that the difference is that I set CUDA_VISIBLE_DEVICES to a different device in each terminal. The modified toy.py I used:
import torch
import torch.distributed as dist
import argparse
def run(rank, size):
""" Simple point-to-point communication. """
print('begin rank', rank)
group = dist.new_group([0, 1])
tensor = torch.ones(1).cuda()
dist.all_reduce(tensor, op=dist.reduce_op.SUM, group=group)
print('Rank ', rank, ' has data ', tensor[0])
def init_processes(rank, size, fn, backend):
""" Initialize the distributed environment. """
print('before init')
init_method="file://./sync"
#dist.init_process_group(backend,rank=rank,world_size=size,init_method=init_method)
dist.init_process_group(backend,world_size=size,init_method=init_method)
print('after init')
fn(rank, size)
if __name__ == "__main__":
size = 2
parser = argparse.ArgumentParser()
parser.add_argument('--rank', default=-1, type=int,
help='rank')
args = parser.parse_args()
init_processes(args.rank, size, run, 'nccl')
The commands I used on different terminals of the same machine to trigger the error:
I'm not exactly sure this is the same problem as @Yi-Li's, but could anyone help with using NCCL on different CUDA devices?
@cyang49 Yeah, try calling torch.cuda.set_device() in each process.
@apaszke @teng-li Using torch.cuda.device() worked. Thanks!
The above error is reproducible on my system and happens at the same 1259th operation on rank 1. And I also got this other kind of error:
The code is here:
import torch
import torch.distributed as dist
import argparse
def run(rank, size):
""" Simple point-to-point communication. """
print('begin rank', rank)
group = dist.new_group([0, 1])
tensor = torch.FloatTensor(torch.ones(1)).cuda()
dist.all_reduce(tensor, op=dist.reduce_op.SUM, group=group)
print('Rank ', rank, ' has data ', tensor[0])
def init_processes(rank, size, fn, backend):
""" Initialize the distributed environment. """
print('before init')
torch.cuda.device(rank)
init_method="tcp://127.0.0.1:16543"
dist.init_process_group(backend,rank=rank,world_size=size,init_method=init_method)
print('after init')
for i in range(100000):
print('{}th operation'.format(i))
fn(rank, size)
if __name__ == "__main__":
size = 2
parser = argparse.ArgumentParser()
parser.add_argument('--rank', default=-1, type=int,
help='rank')
args = parser.parse_args()
init_processes(args.rank, size, run, 'nccl')
This looks like an NCCL issue to me. Could you get the NCCL logs?
Adding NCCL_DEBUG=INFO gives:
Hi Cyang,
Where did you put torch.cuda.set_device(rank)? Inside if __name__ == "__main__":?
Best Regards,
Yi
def init_processes(rank, size, fn, backend):
""" Initialize the distributed environment. """
print('before init')
torch.cuda.set_device(rank)
init_method="tcp://127.0.0.1:16543"
dist.init_process_group(backend,rank=rank,world_size=size,init_method=init_method)
print('after init')
for i in range(100000):
print('{}th operation'.format(i))
fn(rank, size)
Edit:
Thanks.
Best Regards,
Yi-Li
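A brief aside on the two calls that appear in this thread, since they look similar: torch.cuda.set_device(rank) changes the current device for the whole process, while torch.cuda.device(rank) is a context manager that only takes effect inside a with block. A minimal sketch of the difference (the rank value is just an example):

import torch

rank = 1  # example device index for this process

# set_device switches the process-wide current device:
torch.cuda.set_device(rank)
x = torch.ones(1).cuda()      # allocated on GPU `rank`

# device(...) is a context manager; calling it bare has no effect.
with torch.cuda.device(rank):
    y = torch.ones(1).cuda()  # on GPU `rank`, but only inside this block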
@cyang49 I'm having the same issue as you: two Docker containers on one node, and using NCCL gives an error. With NCCL_DEBUG=INFO, the info is:
If I set the same CUDA_VISIBLE_DEVICES, it works fine, and Gloo is also OK. Have you solved the problem?
I have the same problem, but I found out that the main reason is that the sample code calls dist.new_group every iteration, which seems to allocate new memory on every call. I now create the group only once at initialization, and the problem is solved.
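A minimal sketch of that fix against the toy.py above (assuming two processes on one machine with one GPU per rank; the loop reuses a single group instead of building one per iteration):

import argparse
import torch
import torch.distributed as dist

def main(rank, size=2, backend='nccl'):
    torch.cuda.set_device(rank)                      # pin this rank to its own GPU
    dist.init_process_group(backend, rank=rank, world_size=size,
                            init_method='tcp://127.0.0.1:16543')
    group = dist.new_group([0, 1])                   # create the group once
    for i in range(100000):
        tensor = torch.ones(1).cuda()
        dist.all_reduce(tensor, op=dist.reduce_op.SUM, group=group)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--rank', type=int, required=True)
    main(parser.parse_args().rank)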
After removing the hacky "clear NCCL communicator cache" and adding all the tests, I think we are in good shape to release the NCCL backend from experimental.