ray ddp fails with 2 gpu workers #174
(tensorflow2_p38) ubuntu@ip-10-0-2-18:~/ray_lightning/ray_lightning/examples$ python ray_ddp_example.py
Extension horovod.torch has not been built: /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-38-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
Warning! MPI libs are missing, but python applications are still avaiable.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
(RayExecutor pid=47136) Extension horovod.torch has not been built: /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-38-x86_64-linux-gnu.so not found
(RayExecutor pid=47136) If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
(RayExecutor pid=47136) Warning! MPI libs are missing, but python applications are still avaiable.
(RayExecutor pid=47136) /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/utilities/warnings.py:53: LightningDeprecationWarning: pytorch_lightning.utilities.warnings.rank_zero_deprecation has been deprecated in v1.6 and will be removed in v1.8. Use the equivalent function from the pytorch_lightning.utilities.rank_zero module instead.
(RayExecutor pid=47136) new_rank_zero_deprecation(
(RayExecutor pid=47136) /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/utilities/warnings.py:58: LightningDeprecationWarning: ParallelStrategy.torch_distributed_backend was deprecated in v1.6 and will be removed in v1.8.
(RayExecutor pid=47136) return new_rank_zero_deprecation(*args, **kwargs)
(RayExecutor pid=47136) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
(RayExecutor pid=47137) Extension horovod.torch has not been built: /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-38-x86_64-linux-gnu.so not found
(RayExecutor pid=47137) If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
(RayExecutor pid=47137) Warning! MPI libs are missing, but python applications are still avaiable.
(RayExecutor pid=47137) Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
(RayExecutor pid=47136) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=47136) distributed_backend=nccl
(RayExecutor pid=47136) All distributed processes registered. Starting with 2 processes
(RayExecutor pid=47136) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=47136)
Traceback (most recent call last):
File "ray_ddp_example.py", line 169, in <module>
train_mnist(
File "ray_ddp_example.py", line 78, in train_mnist
trainer.fit(model)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 61, in launch
ray_output = self.run_function_on_workers(
File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 214, in run_function_on_workers
results = process_results(self._futures, self.tune_queue)
File "/home/ubuntu/ray_lightning/ray_lightning/util.py", line 62, in process_results
ray.get(ready)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/ray/_private/worker.py", line 2178, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=47137, ip=10.0.2.18, repr=<ray_lightning.launchers.ray_launcher.RayExecutor object at 0x7fc7e5d1cac0>)
File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 329, in execute
return fn(*args, **kwargs)
File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 235, in _wrapping_function
results = function(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1172, in _run
self.__setup_profiler()
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1797, in __setup_profiler
self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 2249, in log_dir
dirpath = self.strategy.broadcast(dirpath)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 215, in broadcast
torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1877, in broadcast_object_list
broadcast(object_sizes_tensor, src=src, group=group)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast
work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
If we change the backend from nccl to gloo, it works and the two workers land on different GPUs: https://github.com/JiahaoYao/ray_lightning/blob/main/ray_lightning/ray_ddp.py#L161-L165
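For illustration, here is a minimal sketch of what the backend switch amounts to at the torch.distributed level (a hypothetical standalone snippet, not the ray_lightning code; it assumes MASTER_ADDR, MASTER_PORT, and RANK are set in the environment):

```python
import os
import torch.distributed as dist

# "gloo" runs collectives on CPU, so it never hits NCCL's
# one-rank-per-GPU check even when both workers end up pinned
# to the same CUDA device.
dist.init_process_group(
    backend="gloo",   # instead of "nccl"
    init_method="env://",
    world_size=2,
    rank=int(os.environ["RANK"]),
)
```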
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 48716 C ray::RayExecutor.execute() 933MiB |
| 0 N/A N/A 48717 C ray::RayExecutor.execute() 933MiB |
+-----------------------------------------------------------------------------+
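Note that both RayExecutor processes sit on GPU 0 even though two GPU workers were requested. A quick diagnostic sketch (hypothetical, outside the example script) to see which GPU Ray actually hands each worker:

```python
import os
import ray

@ray.remote(num_gpus=1)
def report_gpu():
    # Ray sets CUDA_VISIBLE_DEVICES per worker from the num_gpus request;
    # two tasks at num_gpus=1 each should report distinct physical GPU ids.
    return os.environ.get("CUDA_VISIBLE_DEVICES")

ray.init()
print(ray.get([report_gpu.remote() for _ in range(2)]))  # expect e.g. ['0', '1']
```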
(tensorflow2_p38) ubuntu@ip-10-0-2-18:~/ray_lightning$ pip list | grep torch
pytorch-lightning 1.6.4
torch 1.12.0
torchmetrics 0.9.2
torchvision 0.13.0

Following the suggestion from ultralytics/yolov5#4530, I then ran

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116

and now have the CUDA 11.6 builds:

(tensorflow2_p38) ubuntu@ip-10-0-2-18:~/ray_lightning$ pip list | grep torch
pytorch-lightning 1.6.4
torch 1.12.0+cu116
torchaudio 0.12.0+cu116
torchmetrics 0.9.2
torchvision 0.13.0+cu116

But it still fails at the same broadcast call:

broadcast(object_sizes_tensor, src=src, group=group)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast
work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1123, internal error, NCCL version 2.10.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption

With NCCL debug logging (NCCL_DEBUG=INFO) enabled, the worker logs reveal the underlying cause:
(RayExecutor pid=9152) ip-10-0-2-18:9152:9152 [0] NCCL INFO Bootstrap : Using ens5:10.0.2.18<0>
(RayExecutor pid=9152) ip-10-0-2-18:9152:9152 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
(RayExecutor pid=9152) ip-10-0-2-18:9152:9152 [0] NCCL INFO NET/IB : No device found.
(RayExecutor pid=9152) ip-10-0-2-18:9152:9152 [0] NCCL INFO NET/Socket : Using [0]ens5:10.0.2.18<0>
(RayExecutor pid=9152) ip-10-0-2-18:9152:9152 [0] NCCL INFO Using network Socket
(RayExecutor pid=9152) NCCL version 2.10.3+cuda11.3
(RayExecutor pid=9152)
(RayExecutor pid=9152) ip-10-0-2-18:9152:9263 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1b0
(RayExecutor pid=9152) ip-10-0-2-18:9152:9263 [0] NCCL INFO init.cc:904 -> 5
(RayExecutor pid=9152) ip-10-0-2-18:9152:9263 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
(RayExecutor pid=9153) ip-10-0-2-18:9153:9153 [0] NCCL INFO Bootstrap : Using ens5:10.0.2.18<0>
(RayExecutor pid=9153) ip-10-0-2-18:9153:9153 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
(RayExecutor pid=9153) ip-10-0-2-18:9153:9153 [0] NCCL INFO NET/IB : No device found.
(RayExecutor pid=9153) ip-10-0-2-18:9153:9153 [0] NCCL INFO NET/Socket : Using [0]ens5:10.0.2.18<0>
(RayExecutor pid=9153) ip-10-0-2-18:9153:9153 [0] NCCL INFO Using network Socket
(RayExecutor pid=9153)
(RayExecutor pid=9153) ip-10-0-2-18:9153:9264 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1b0
(RayExecutor pid=9153) ip-10-0-2-18:9153:9264 [0] NCCL INFO init.cc:904 -> 5
(RayExecutor pid=9153) ip-10-0-2-18:9153:9264 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
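The "Duplicate GPU detected" condition is easy to reproduce outside Ray. Here is a minimal hypothetical script (not from the repository) where both ranks deliberately select CUDA device 0 before the first collective:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank: int, world_size: int):
    # Both ranks pin themselves to device 0, mimicking the overwritten
    # root_device described below; NCCL aborts at the first collective with
    # "Duplicate GPU detected : rank 0 and rank 1 both on CUDA device ...".
    torch.cuda.set_device(0)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    dist.broadcast(torch.zeros(1, device="cuda"), src=0)

if __name__ == "__main__":
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    mp.spawn(run, args=(2,), nprocs=2)
```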
The reason behind this is that each worker's root_device gets overwritten by the strategy serialized from the main process, so every rank reports cuda:0 regardless of its assigned GPU:

(RayExecutor pid=34134) torch.device("cuda", device_id): device(type='cuda', index=2)
(RayExecutor pid=34133) ic| self._strategy.root_device: device(type='cuda', index=1)
(RayExecutor pid=34133) function.__self__.strategy.root_device: device(type='cuda', index=0)
(RayExecutor pid=34135) ic| self._strategy.root_device: device(type='cuda', index=3)
(RayExecutor pid=34135) function.__self__.strategy.root_device: device(type='cuda', index=0)
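A sketch of the direction of the fix (a hypothetical helper, not the actual patch; see the commit referenced below): derive the device from the worker's own local rank before the process group is created, rather than trusting the root_device carried over from the driver:

```python
import torch
import torch.distributed as dist

def setup_worker(global_rank: int, local_rank: int, world_size: int):
    # Pin this worker to its own GPU *before* NCCL initialization, so the
    # driver-side root_device (cuda:0 above) cannot leak into every rank.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl", rank=global_rank, world_size=world_size)
```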
JiahaoYao added a commit to JiahaoYao/ray_lightning that referenced this issue on Jul 6, 2022.