torch.distributed support on MacOS is missing #20380
@yaroslavvb from what I remember, we don't plan to add distributed support on OSX or Windows |
How hard would it be to add 1-worker DDP support to Mac? This would make it easier for people who prototype on Macs before deploying to Linux; currently it's a bit annoying. cc @jspisak, who would know if there are enough people using PyTorch on Mac. From my Bay Area vantage point, it seems to be everybody. |
@yaroslavvb Good point. I agree. This has been raised before and I agree we should revisit. Adding macOS support (and Windows support for that matter) will be contingent on making the Gloo TCP transport work with something other than |
+1 for Windows support of distributed training! Currently getting "AttributeError: module 'torch.distributed' has no attribute 'init_process_group'" when trying to run distributed training on Windows. The exact same code works fine on Linux. |
@Ownmarc In what kind of environment would you use this? Multiple machine or only multiple processes on a single machine? I have not heard of folks wanting to use |
To take advantage of 2 GPUs in the same machine. @pietern here is another person. I am using this repo too ultralytics/yolov3#336 (comment) |
@yaroslavvb This is now available in the nightlies. It is full blown support, so you should be able to run DDP on a big stack of MacBooks, if you wanted to... :D |
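For reference, a minimal single-machine DDP run on CPU with the gloo backend might look like the sketch below. This is an illustration, not code from the thread; the model, port number, and world size are arbitrary choices, and it assumes a build (e.g. a nightly at the time) with distributed support enabled.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # Rendezvous via environment variables; the port is arbitrary.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Wrapping the model in DDP all-reduces gradients across ranks
    # during backward, so each process sees averaged gradients.
    model = DDP(torch.nn.Linear(4, 2))
    out = model(torch.randn(8, 4))
    out.sum().backward()

    dist.destroy_process_group()

if __name__ == "__main__":
    # Spawn two worker processes on the local machine.
    mp.spawn(worker, args=(2,), nprocs=2, join=True)
```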
Great work Pieter!! |
@pietern is this going to support Windows too ? |
@Ownmarc It could, because the implementation is based on libuv. Is this something you'd be interested in contributing? I think the majority of time spent will be in 1) getting the Gloo build to work, 2) getting the PyTorch build to work (there are only 2 ifdef'ed pieces of code in |
@pietern, I just took a look at ProcessGroupGloo.cpp and was kinda lost. Unfortunately, I do not know much other than Python; I am still in my early days of programming and just recently switched from TensorFlow to PyTorch! |
Time to put all my 4-cores of macbook pro to good work! Seriously though, this is great for code consistency, time to finally banish nn.DataParallel from my code :) |
@Ownmarc Do you compile PyTorch from source on Windows? If so, I could guide you through some of the steps to do this, but do realize it likely won't be a walk in the park. |
@pietern awesome this is exactly what I needed! |
Perfect, thank you @pietern! |
@Ownmarc, have you solved this problem on Windows? I have tried the preview version and still get the same problem. |
@tbwxmu This issue tracked support for macOS, not for Windows. |
Opened it officially for Windows; let's see if I am the only one with more than 1 GPU on Windows, haha. |
@pietern Is GLOO supported on MacOS? This code hangs for me:

```python
import datetime
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    print(f"starting rank {rank}")
    dist.init_process_group(
        "gloo",
        init_method="tcp://localhost:12345",
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(seconds=10)
    )
    print(f"started rank {rank}")
    cleanup()

def cleanup():
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(setup, args=(2,), nprocs=2, join=True)
```
|
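As an aside, one known cause of hangs like this is Gloo binding to the wrong network interface for rendezvous. A common workaround (a sketch, assuming your Gloo build reads this environment variable) is to pin Gloo to the loopback interface before initializing the process group; `lo0` is the loopback interface name on macOS.

```python
import os

# Tell Gloo which network interface to use; must be set before
# dist.init_process_group is called. "lo0" is the macOS loopback.
os.environ["GLOO_SOCKET_IFNAME"] = "lo0"
```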
Has support been reverted with the 1.15 release? I was testing the
|
Can you share your work? |
@y78h11b09 I am not good enough with C++ to implement this, but I think Microsoft said they would support PyTorch recently, maybe they will implement this!! :D |
For Windows support, please check this RFC (#42095) |
Hey @neggert, yes PyTorch + Gloo works on MacOS, but you will need to compile from source using the following steps:
|
Currently, trying to use distributed on MacOS crashes because the torch.distributed namespace is empty.
I vaguely recall it working a year ago.
This is useful for quick sanity checks on my MacBook before deploying to cluster.
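A quick sanity check for this situation: builds compiled without distributed support expose an (almost) empty `torch.distributed` namespace, which `is_available()` reports. A minimal guard, as a sketch:

```python
import torch.distributed as dist

# is_available() is False when this PyTorch build was compiled
# without distributed support (the namespace is then mostly empty).
if dist.is_available():
    print("torch.distributed is available")
else:
    print("this PyTorch build was compiled without distributed support")
```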