Pytorch in multi-cpu cluster #2733
Just to confirm - when you say multi-CPU cluster you don't mean having 2 CPUs within a single computer (NUMA), but an actual cluster with multiple machines, right? The problem is that …
Yes, I meant a cluster with multiple machines. No problem, I'll keep an eye out for new updates. Thanks.
Hi,
and I call the file as below: …
I got the error that … Thanks for your assistance.
Excuse me, does …
@sth1997 @marcsv87 DDP on CPU devices should be available now (see doc). I am closing this one, but feel free to reopen this issue if it fails to work for you. @rabeeh that seems to be a different issue than the originally reported one. If you still need assistance with it, could you please create a new issue for it?
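For anyone landing here now, a minimal single-process sketch of CPU-only DDP is below. It uses the gloo backend; the loopback address, port, and world size of 1 are placeholder assumptions for a one-machine demo, not the issue author's cluster setup.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Sketch: rank 0 of a world of size 1, entirely on CPU. The address
# and port are placeholders; a real cluster would give each process
# its own rank and the shared world_size.
dist.init_process_group(
    backend="gloo",                      # gloo supports CPU tensors
    init_method="tcp://127.0.0.1:29500",
    rank=0,
    world_size=1,
)

model = torch.nn.Linear(10, 1)           # never moved to a GPU
ddp_model = DDP(model)                   # no device_ids -> CPU mode

x = torch.randn(4, 10)
loss = ddp_model(x).sum()
loss.backward()                          # gradients all-reduced over gloo
dist.destroy_process_group()
```

With more machines, each process would pass its own rank alongside the shared world_size, and DDP would average gradients across all of them automatically.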
Hi,
I would like to use the distributed module to train a convolutional net on a CPU cluster. Looking into your code, the function torch.cuda.device_count() is called in several places and is used to populate the device_ids list. Since my cluster has no GPU devices, device_count() always returns 0, and any subsequent attempt to access device_ids[0] raises an IndexError.
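The failure mode described above is easy to reproduce on a CPU-only machine. The sketch below mimics the device_ids construction; the variable names are illustrative, not PyTorch's internals verbatim.

```python
import torch

# On a machine with no CUDA devices, device_count() returns 0, so the
# device_ids list is empty and indexing it raises IndexError.
device_ids = list(range(torch.cuda.device_count()))

try:
    first = device_ids[0]
except IndexError:
    first = None  # this branch is taken on any CPU-only machine
```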
If I take the naive path of changing device_count so that it always returns the number of nodes I intend to use, I get a different error:
if not all(input.is_cuda for input in inputs):
    raise TypeError('Broadcast function not implemented for CPU tensors')
So I would like to ask whether you have any plans to extend the distributed module to support training networks on a multi-CPU cluster.
Many thanks