The ImageNet example provides two options for the distributed backend: gloo and nccl. I see the following error when I specify nccl as the backend:
```
INFO 2020-03-18 18:12:24,436 All peers arrived. Confirming membership.
INFO 2020-03-18 18:12:24,504 Waiting for confirmations from all peers.
INFO 2020-03-18 18:12:24,506 Rendezvous version 4 is complete. Final state: {'status': 'final', 'version': '4', 'participants': [0], 'keep_alives': ['/torchelastic/p2p/run_bc8168d4694311eaa33f000d3a77161e/rdzv/v_4/rank_0'], 'num_workers_waiting': 0}
INFO 2020-03-18 18:12:24,506 Creating EtcdStore as the c10d::Store implementation
[INFO] 2020-03-18 18:12:24,509 coordinator_p2p: Got next rendezvous: rank 0, world size 1
[INFO] 2020-03-18 18:12:24,516 coordinator_p2p: Initialized process group rank 0, world size 1
[ERROR] 2020-03-18 18:12:24,517 coordinator_p2p: Rank: 0
Error: Tensors must be CUDA and dense
ErrorType: <class 'RuntimeError'>
StackTrace: Traceback (most recent call last):
  File "/opt/miniconda/lib/python3.6/site-packages/torchelastic/train_loop.py", line 94, in run_train
    state.sync(world_size, rank)
  File "main.py", line 96, in sync
    self._sync_state(rank)
  File "main.py", line 130, in _sync_state
    dist.broadcast(state_size, src=max_rank)
  File "/opt/miniconda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 804, in broadcast
    work = _default_pg.broadcast([tensor], opts)
RuntimeError: Tensors must be CUDA and dense
```
Does elastic training support the NCCL backend? If it isn't supported currently, are there plans to support NCCL in the future?
The example creates a CPU tensor to broadcast the current size of the state object, but it uses the default process group (NCCL), which is not compatible with CPU tensors. A quick fix is either to create a GPU tensor for the size broadcast, or to create a second process group with gloo as the backend for send/recv of these small messages, which are typically created on the main process (CPU).
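A minimal sketch of the first option, assuming each worker has already bound its CUDA device; the helper name and the `state_bytes` argument are hypothetical stand-ins for whatever the example actually serializes:

```python
import torch
import torch.distributed as dist

def broadcast_state_size(state_bytes: bytes, src_rank: int) -> int:
    # NCCL only operates on CUDA tensors, so place the size tensor on the
    # current GPU before broadcasting it over the default (NCCL) group.
    device = torch.device("cuda", torch.cuda.current_device())
    state_size = torch.tensor([len(state_bytes)], dtype=torch.long, device=device)
    dist.broadcast(state_size, src=src_rank)
    return int(state_size.item())
```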
More fundamentally, the issue here is that there is no clear distinction between data-plane messages and control-plane messages in torchelastic's user code. Our coordinator indeed uses a separate gloo-based process group ONLY for passing control-plane messages to workers. See: https://github.com/pytorch/elastic/blob/master/torchelastic/p2p/coordinator_p2p.py#L153
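For reference, a sketch of the second option (not the torchelastic source itself): keep NCCL as the default group for GPU collectives and add a gloo group purely for small CPU-side control messages. The tensor name and ranks here are illustrative:

```python
import torch
import torch.distributed as dist

# Default group uses NCCL for data-plane (GPU) collectives.
dist.init_process_group(backend="nccl", init_method="env://")

# Second group uses gloo so CPU tensors can carry control-plane messages.
control_pg = dist.new_group(backend="gloo")

state_size = torch.tensor([0], dtype=torch.long)      # CPU tensor is fine with gloo
dist.broadcast(state_size, src=0, group=control_pg)   # control plane over gloo
# GPU tensors keep using the default NCCL group for broadcasts/all-reduces.
```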