This repository has been archived by the owner on Jan 6, 2023. It is now read-only.

Support for NCCL backend #61

Closed
raviskolli opened this issue Mar 18, 2020 · 2 comments

Comments

@raviskolli

❓ Questions and Help

Question

The ImageNet example provides two options for the distributed backend: gloo and nccl. I see the following error when I specify nccl as the backend:

INFO 2020-03-18 18:12:24,436 All peers arrived. Confirming membership.
INFO 2020-03-18 18:12:24,504 Waiting for confirmations from all peers.
INFO 2020-03-18 18:12:24,506 Rendezvous version 4 is complete. Final state: {'status': 'final', 'version': '4', 'participants': [0], 'keep_alives': ['/torchelastic/p2p/run_bc8168d4694311eaa33f000d3a77161e/rdzv/v_4/rank_0'], 'num_workers_waiting': 0}
INFO 2020-03-18 18:12:24,506 Creating EtcdStore as the c10d::Store implementation
[INFO] 2020-03-18 18:12:24,509 coordinator_p2p: Got next rendezvous: rank 0, world size 1
[INFO] 2020-03-18 18:12:24,516 coordinator_p2p: Initialized process group rank 0, world size 1
[ERROR] 2020-03-18 18:12:24,517 coordinator_p2p: Rank: 0
Error: Tensors must be CUDA and dense
ErrorType: <class 'RuntimeError'>
StackTrace: Traceback (most recent call last):
  File "/opt/miniconda/lib/python3.6/site-packages/torchelastic/train_loop.py", line 94, in run_train
    state.sync(world_size, rank)
  File "main.py", line 96, in sync
    self._sync_state(rank)
  File "main.py", line 130, in _sync_state
    dist.broadcast(state_size, src=max_rank)
  File "/opt/miniconda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 804, in broadcast
    work = _default_pg.broadcast([tensor], opts)
RuntimeError: Tensors must be CUDA and dense

  • Does elastic training support the NCCL backend?
  • If it isn't currently supported, are there plans to support NCCL in the future?
@kiukchung
Contributor

Thanks for reporting. This is a bug in our examples/imagenet/main.py; torchelastic supports all PyTorch backends. The issue is in the way ImagenetState._sync_state() is implemented, specifically on this line: https://github.com/pytorch/elastic/blob/master/examples/imagenet/main.py#L129

It creates a CPU tensor to broadcast the current size of the state object, but it uses the default process group (NCCL), which is not compatible with CPU tensors. A quick fix is to create a GPU tensor for the size broadcast, or to create a second process group with gloo as the backend for send/recv of these small messages, which are typically created on the main process (CPU).
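
A minimal sketch of the first workaround (moving the size scalar onto the GPU so the default NCCL group can broadcast it); the helper name `_broadcast_state_size` and its arguments are illustrative, not code from the example:

```python
import torch
import torch.distributed as dist

def _broadcast_state_size(state_size_value: int, max_rank: int) -> int:
    # NCCL only operates on CUDA tensors, so place the scalar on the
    # current GPU before broadcasting it through the default group.
    device = torch.device("cuda", torch.cuda.current_device())
    state_size = torch.tensor([state_size_value], dtype=torch.long, device=device)
    dist.broadcast(state_size, src=max_rank)
    return int(state_size.item())
```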

More fundamentally, the issue here is that there is no clear distinction between data-plane messages and control-plane messages in torchelastic user code. Our coordinator indeed uses a separate gloo-based process group ONLY for passing control-plane messages to workers. See: https://github.com/pytorch/elastic/blob/master/torchelastic/p2p/coordinator_p2p.py#L153

Perhaps users could follow a similar pattern.
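
A minimal sketch of that pattern, assuming a script launched with the usual rendezvous environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE); the group and tensor names here are illustrative, not the coordinator's actual code:

```python
import torch
import torch.distributed as dist

# Default group: NCCL, used for data-plane collectives on CUDA tensors.
dist.init_process_group(backend="nccl")

# Side group over the same ranks: gloo, used only for small CPU control messages.
control_pg = dist.new_group(backend="gloo")

# Control-plane broadcast stays on the CPU and goes through the gloo group...
state_size = torch.zeros(1, dtype=torch.long)  # CPU tensor
dist.broadcast(state_size, src=0, group=control_pg)

# ...while data-plane broadcasts use the default NCCL group with CUDA tensors.
payload = torch.zeros(int(state_size.item()), dtype=torch.uint8, device="cuda")
dist.broadcast(payload, src=0)
```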

@kiukchung
Contributor

I've filed a bug report, #64. Closing this; please track the fix on the other issue. Thanks!
