
ray ddp fails with 2 gpu workers #174

Closed · JiahaoYao opened this issue Jul 6, 2022 · 10 comments

@JiahaoYao
Contributor

  File "/home/ubuntu/anaconda3/envs/automm-dev-pl-latest/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1797, in __setup_profiler
    self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/home/ubuntu/anaconda3/envs/automm-dev-pl-latest/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 2249, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/home/ubuntu/anaconda3/envs/automm-dev-pl-latest/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 215, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/home/ubuntu/anaconda3/envs/automm-dev-pl-latest/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1817, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/home/ubuntu/anaconda3/envs/automm-dev-pl-latest/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1159, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:891, internal error, NCCL version 2.10.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption

Use this branch: https://github.com/sxjscience/autogluon/tree/kaggle_california_house and install AutoGluon via
bash full_install.sh
Afterwards, try this script: https://gist.github.com/sxjscience/53bc799e37cc0680ca9e53c2fea75cd7
Internally, the Ray strategy is constructed here: https://github.com/sxjscience/autogluon/blob/59f01b95381fba5651db17fd98fa84164ad168c2/multimodal/src/autogluon/multimodal/predictor.py#L1036-L1052
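For reference, a minimal sketch of what that construction roughly looks like when handing a Ray-backed strategy to a Lightning Trainer (assuming ray_lightning's RayStrategy API; the worker/GPU counts and the model are placeholders, not the values used in the linked predictor.py):

import pytorch_lightning as pl
from ray_lightning import RayStrategy  # Ray-backed DDP strategy for PL 1.6

# Placeholder values; predictor.py derives these from the AutoGluon config.
strategy = RayStrategy(num_workers=2, num_cpus_per_worker=1, use_gpu=True)

trainer = pl.Trainer(
    max_epochs=1,
    strategy=strategy,  # Lightning launches training through Ray actors
)
# trainer.fit(model)  # `model` is whatever LightningModule is being trained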

@JiahaoYao
Contributor Author

(tensorflow2_p38) ubuntu@ip-10-0-2-18:~/ray_lightning/ray_lightning/examples$ python ray_ddp_example.py 
Extension horovod.torch has not been built: /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-38-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
Warning! MPI libs are missing, but python applications are still avaiable.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
(RayExecutor pid=47136) Extension horovod.torch has not been built: /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-38-x86_64-linux-gnu.so not found
(RayExecutor pid=47136) If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
(RayExecutor pid=47136) Warning! MPI libs are missing, but python applications are still avaiable.
(RayExecutor pid=47136) /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/utilities/warnings.py:53: LightningDeprecationWarning: pytorch_lightning.utilities.warnings.rank_zero_deprecation has been deprecated in v1.6 and will be removed in v1.8. Use the equivalent function from the pytorch_lightning.utilities.rank_zero module instead.
(RayExecutor pid=47136)   new_rank_zero_deprecation(
(RayExecutor pid=47136) /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/utilities/warnings.py:58: LightningDeprecationWarning: ParallelStrategy.torch_distributed_backend was deprecated in v1.6 and will be removed in v1.8.
(RayExecutor pid=47136)   return new_rank_zero_deprecation(*args, **kwargs)
(RayExecutor pid=47136) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
(RayExecutor pid=47137) Extension horovod.torch has not been built: /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-38-x86_64-linux-gnu.so not found
(RayExecutor pid=47137) If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
(RayExecutor pid=47137) Warning! MPI libs are missing, but python applications are still avaiable.
(RayExecutor pid=47137) Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
(RayExecutor pid=47136) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=47136) distributed_backend=nccl
(RayExecutor pid=47136) All distributed processes registered. Starting with 2 processes
(RayExecutor pid=47136) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=47136) 
Traceback (most recent call last):
  File "ray_ddp_example.py", line 169, in <module>
    train_mnist(
  File "ray_ddp_example.py", line 78, in train_mnist
    trainer.fit(model)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 61, in launch
    ray_output = self.run_function_on_workers(
  File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 214, in run_function_on_workers
    results = process_results(self._futures, self.tune_queue)
  File "/home/ubuntu/ray_lightning/ray_lightning/util.py", line 62, in process_results
    ray.get(ready)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/ray/_private/worker.py", line 2178, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=47137, ip=10.0.2.18, repr=<ray_lightning.launchers.ray_launcher.RayExecutor object at 0x7fc7e5d1cac0>)
  File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 329, in execute
    return fn(*args, **kwargs)
  File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 235, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1172, in _run
    self.__setup_profiler()
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1797, in __setup_profiler
    self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 2249, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 215, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1877, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

@JiahaoYao
Contributor Author

If we change the backend from nccl to gloo, it works across different GPUs. The backend is selected here:

https://github.com/JiahaoYao/ray_lightning/blob/main/ray_lightning/ray_ddp.py#L161-L165
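At the torch.distributed level, switching the backend amounts to the following (a sketch for a standalone two-process setup, not the exact ray_ddp.py code; MASTER_ADDR/MASTER_PORT values are placeholders):

import os
import torch.distributed as dist

def init_pg(rank: int, world_size: int, backend: str = "gloo") -> None:
    # Falling back from "nccl" to "gloo" sidesteps the NCCL init failure,
    # at the cost of slower GPU collectives (gloo goes through sockets).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend, rank=rank, world_size=world_size)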

@JiahaoYao
Contributor Author

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     48716      C   ray::RayExecutor.execute()        933MiB |
|    0   N/A  N/A     48717      C   ray::RayExecutor.execute()        933MiB |
+-----------------------------------------------------------------------------+
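The nvidia-smi output above shows both RayExecutor processes sitting on GPU 0, which is consistent with the NCCL failure. A quick diagnostic to confirm what each Ray worker actually sees (hypothetical snippet, assuming a machine with at least 2 GPUs):

import ray
import torch

ray.init(ignore_reinit_error=True)

@ray.remote(num_gpus=1)
def report_device():
    # Each worker should get a distinct GPU id from Ray; if two ranks end
    # up on the same physical device, NCCL init fails as shown above.
    return ray.get_gpu_ids(), torch.cuda.device_count()

print(ray.get([report_device.remote() for _ in range(2)]))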

@JiahaoYao
Contributor Author

(tensorflow2_p38) ubuntu@ip-10-0-2-18:~/ray_lightning$ pip list | grep torch 
pytorch-lightning                  1.6.4
torch                              1.12.0
torchmetrics                       0.9.2
torchvision                        0.13.0

Following the suggestion from ultralytics/yolov5#4530, I then ran

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116

and now have the following PyTorch packages:

(tensorflow2_p38) ubuntu@ip-10-0-2-18:~/ray_lightning$ pip list | grep torch 
pytorch-lightning                  1.6.4
torch                              1.12.0+cu116
torchaudio                         0.12.0+cu116
torchmetrics                       0.9.2
torchvision                        0.13.0+cu116

but it still fails:

    broadcast(object_sizes_tensor, src=src, group=group)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1123, internal error, NCCL version 2.10.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption

@JiahaoYao
Contributor Author

@JiahaoYao
Contributor Author

(RayExecutor pid=9152) ip-10-0-2-18:9152:9152 [0] NCCL INFO Bootstrap : Using ens5:10.0.2.18<0>
(RayExecutor pid=9152) ip-10-0-2-18:9152:9152 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
(RayExecutor pid=9152) ip-10-0-2-18:9152:9152 [0] NCCL INFO NET/IB : No device found.
(RayExecutor pid=9152) ip-10-0-2-18:9152:9152 [0] NCCL INFO NET/Socket : Using [0]ens5:10.0.2.18<0>
(RayExecutor pid=9152) ip-10-0-2-18:9152:9152 [0] NCCL INFO Using network Socket
(RayExecutor pid=9152) NCCL version 2.10.3+cuda11.3
(RayExecutor pid=9152) 
(RayExecutor pid=9152) ip-10-0-2-18:9152:9263 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1b0
(RayExecutor pid=9152) ip-10-0-2-18:9152:9263 [0] NCCL INFO init.cc:904 -> 5
(RayExecutor pid=9152) ip-10-0-2-18:9152:9263 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
(RayExecutor pid=9153) ip-10-0-2-18:9153:9153 [0] NCCL INFO Bootstrap : Using ens5:10.0.2.18<0>
(RayExecutor pid=9153) ip-10-0-2-18:9153:9153 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
(RayExecutor pid=9153) ip-10-0-2-18:9153:9153 [0] NCCL INFO NET/IB : No device found.
(RayExecutor pid=9153) ip-10-0-2-18:9153:9153 [0] NCCL INFO NET/Socket : Using [0]ens5:10.0.2.18<0>
(RayExecutor pid=9153) ip-10-0-2-18:9153:9153 [0] NCCL INFO Using network Socket
(RayExecutor pid=9153) 
(RayExecutor pid=9153) ip-10-0-2-18:9153:9264 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1b0
(RayExecutor pid=9153) ip-10-0-2-18:9153:9264 [0] NCCL INFO init.cc:904 -> 5
(RayExecutor pid=9153) ip-10-0-2-18:9153:9264 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
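The "Duplicate GPU detected" warning means both ranks initialized NCCL on the same CUDA device. Outside of ray_lightning, the usual remedy is to pin each rank to its own device before creating the process group (sketch only; the rank-to-device mapping is the illustrative part):

import torch
import torch.distributed as dist

def setup(rank: int, world_size: int) -> None:
    # Bind this process to its own GPU *before* initializing NCCL,
    # so rank 0 and rank 1 do not both land on cuda:0.
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)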

@JiahaoYao
Contributor Author

@JiahaoYao
Contributor Author

rapidsai/dask-cuda#446

@JiahaoYao
Contributor Author

@JiahaoYao
Contributor Author

The reason behind this is that the root_device is overwritten by the strategy from the main process:

(RayExecutor pid=34134)     torch.device("cuda", device_id): device(type='cuda', index=2)
(RayExecutor pid=34133) ic| self._strategy.root_device: device(type='cuda', index=1)
(RayExecutor pid=34133)     function.__self__.strategy.root_device: device(type='cuda', index=0)
(RayExecutor pid=34135) ic| self._strategy.root_device: device(type='cuda', index=3)
(RayExecutor pid=34135)     function.__self__.strategy.root_device: device(type='cuda', index=0)
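A hypothetical sketch of deriving the worker's root device from the GPU Ray actually assigned to it, instead of reusing the root_device serialized from the driver (illustration only, not the committed fix; it assumes all GPUs stay visible inside the worker, which matches the ic() output above):

import ray
import torch

def worker_root_device() -> torch.device:
    # Use the GPU id Ray allocated to this worker rather than the driver's
    # cuda:0. If Ray restricted CUDA_VISIBLE_DEVICES per worker, the local
    # index would simply be 0 instead.
    gpu_ids = ray.get_gpu_ids()
    index = int(gpu_ids[0]) if gpu_ids else 0
    return torch.device("cuda", index)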

JiahaoYao added a commit to JiahaoYao/ray_lightning that referenced this issue Jul 6, 2022