This repository has been archived by the owner on Jan 6, 2023. It is now read-only.

Support for NCCL backend #61

Closed
raviskolli opened this issue Mar 18, 2020 · 2 comments

Comments

@raviskolli

❓ Questions and Help

Question

The ImageNet example provides two options for the distributed backend: gloo and nccl. I see the following error when I specify nccl as the backend:

INFO 2020-03-18 18:12:24,436 All peers arrived. Confirming membership.
INFO 2020-03-18 18:12:24,504 Waiting for confirmations from all peers.
INFO 2020-03-18 18:12:24,506 Rendezvous version 4 is complete. Final state: {'status': 'final', 'version': '4', 'participants': [0], 'keep_alives': ['/torchelastic/p2p/run_bc8168d4694311eaa33f000d3a77161e/rdzv/v_4/rank_0'], 'num_workers_waiting': 0}
INFO 2020-03-18 18:12:24,506 Creating EtcdStore as the c10d::Store implementation
[INFO] 2020-03-18 18:12:24,509 coordinator_p2p: Got next rendezvous: rank 0, world size 1
[INFO] 2020-03-18 18:12:24,516 coordinator_p2p: Initialized process group rank 0, world size 1
[ERROR] 2020-03-18 18:12:24,517 coordinator_p2p: Rank: 0
Error: Tensors must be CUDA and dense
ErrorType: <class 'RuntimeError'>
StackTrace: Traceback (most recent call last):
  File "/opt/miniconda/lib/python3.6/site-packages/torchelastic/train_loop.py", line 94, in run_train
    state.sync(world_size, rank)
  File "main.py", line 96, in sync
    self._sync_state(rank)
  File "main.py", line 130, in _sync_state
    dist.broadcast(state_size, src=max_rank)
  File "/opt/miniconda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 804, in broadcast
    work = _default_pg.broadcast([tensor], opts)
RuntimeError: Tensors must be CUDA and dense

  • Does elastic training support the NCCL backend?
  • If it isn't currently supported, are there plans to support NCCL in the future?
@kiukchung
Contributor

Thanks for reporting. This is a bug in our examples/imagenet/main.py; torchelastic supports all PyTorch backends. The issue is in the way ImagenetState._sync_state() is implemented, specifically on this line: https://github.com/pytorch/elastic/blob/master/examples/imagenet/main.py#L129

It creates a CPU tensor to broadcast the current size of the state object, but it uses the default process group (NCCL), which is not compatible with CPU tensors. A quick fix is to create a GPU tensor for the size broadcast, or to create a second process group with gloo as the backend for send/recv of these small messages, which are typically created on the main process (CPU).
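
A minimal sketch of the first workaround (moving the size scalar onto the GPU so the default NCCL group can broadcast it); the helper name `_broadcast_state_size` and its arguments are illustrative, not code from the example:

```python
import torch
import torch.distributed as dist

def _broadcast_state_size(state_size_value: int, max_rank: int) -> int:
    # NCCL only operates on CUDA tensors, so place the scalar on the
    # current GPU before broadcasting it through the default group.
    device = torch.device("cuda", torch.cuda.current_device())
    state_size = torch.tensor([state_size_value], dtype=torch.long, device=device)
    dist.broadcast(state_size, src=max_rank)
    return int(state_size.item())
```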

More fundamentally, the issue here is that there is no clear distinction between data-plane messages and control-plane messages in torchelastic user code. Our coordinator indeed uses a separate gloo-based process group ONLY for passing control-plane messages to workers. See: https://github.com/pytorch/elastic/blob/master/torchelastic/p2p/coordinator_p2p.py#L153

Perhaps users could follow a similar pattern.
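
A minimal sketch of that pattern, assuming a script launched with the usual rendezvous environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE); the group and tensor names here are illustrative, not the coordinator's actual code:

```python
import torch
import torch.distributed as dist

# Default group: NCCL, used for data-plane collectives on CUDA tensors.
dist.init_process_group(backend="nccl")

# Side group over the same ranks: gloo, used only for small CPU control messages.
control_pg = dist.new_group(backend="gloo")

# Control-plane broadcast stays on the CPU and goes through the gloo group...
state_size = torch.zeros(1, dtype=torch.long)  # CPU tensor
dist.broadcast(state_size, src=0, group=control_pg)

# ...while data-plane broadcasts use the default NCCL group with CUDA tensors.
payload = torch.zeros(int(state_size.item()), dtype=torch.uint8, device="cuda")
dist.broadcast(payload, src=0)
```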

@kiukchung
Contributor

I've filed a bug report, #64. Closing this; please track the fix on the other issue. Thanks!
