Description
Hi,
Thanks for the easy-to-follow tutorial on distributed processing.
I followed your example, and it works fine on a single multi-GPU system. However, when I run it on multiple nodes with 2 GPUs each, I get the following error at runtime.
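For reference, the relevant part of my train function follows the tutorial's structure; a minimal sketch of it (the argument names mirror the tutorial, and the Conv2d layer is just a stand-in for the actual model):

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(gpu, args):
    # Global rank: args.nr is the node index, args.gpus is GPUs per node.
    rank = args.nr * args.gpus + gpu
    dist.init_process_group(
        backend='nccl',               # NCCL backend for GPU collectives
        init_method='env://',         # MASTER_ADDR / MASTER_PORT from env
        world_size=args.world_size,   # total processes = nodes * gpus
        rank=rank,
    )
    torch.cuda.set_device(gpu)
    model = nn.Conv2d(1, 16, 3).cuda(gpu)  # stand-in for the tutorial's model
    # conv_dist.py:74 in the traceback below -- where the NCCL error is raised:
    model = DDP(model, device_ids=[gpu])
```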
```
Traceback (most recent call last):
File "conv_dist.py", line 117, in
main()
File "conv_dist.py", line 51, in main
mp.spawn(train, nprocs=args.gpus, args=(args,), join=True)
File "/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/work/codebase/torch_dist/conv_dist.py", line 74, in train
model = DDP(model, device_ids=[gpu])
File "/dine2/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 285, in init
self.broadcast_bucket_size)
File "/dine2/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 496, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1591914838379/work/torch/lib/c10d/ProcessGroupNCCL.cpp:514, unhandled system error, NCCL version 2.4.8
```
I am not able to figure out the cause of this error. Any help would be appreciated, thanks.
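If NCCL's own diagnostics would help, I can re-run with debug logging enabled. A minimal sketch of what I would add before the mp.spawn call (NCCL_DEBUG and NCCL_SOCKET_IFNAME are standard NCCL environment variables; 'eth0' is a placeholder for whichever interface actually connects the nodes):

```python
import os

# Print detailed NCCL diagnostics to stderr on the next run.
os.environ['NCCL_DEBUG'] = 'INFO'
# Pin NCCL to the network interface connecting the nodes;
# 'eth0' is a placeholder -- check `ip addr` for the real name.
os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'
```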