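I'm currently trying to use `ray_lightning` to distribute model training over the resources in my Ray cluster, like so (a simplified sketch with hypothetical worker counts; the real call lives in molpal's `mpnmodels.py`):

```python
import pytorch_lightning as pl
from ray_lightning import RayStrategy

# hypothetical resource counts; `lit_model` and the dataloaders are
# defined elsewhere (see molpal's mpnmodels.py)
strategy = RayStrategy(num_workers=2, num_cpus_per_worker=1, use_gpu=True)
trainer = pl.Trainer(max_epochs=50, strategy=strategy)
trainer.fit(lit_model, train_dataloader, val_dataloader)
```

However, this code results in a `ValueError`: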
File "/home/gridsan/dgraff/molpal/molpal/models/mpnmodels.py", line 207, in train
trainer.fit(lit_model, train_dataloader, val_dataloader)
File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/ray_launcher.py", line 58, in launch
ray_output = self.run_function_on_workers(
File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/ray_launcher.py", line 249, in run_function_on_workers
results = process_results(self._futures, self.tune_queue)
File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/util.py", line 64, in process_results
ray.get(ready)
File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray/_private/worker.py", line 2289, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::RayExecutor.execute() (pid=49053, ip=172.31.130.105, repr=<ray_lightning.launchers.utils.RayExecutor object at 0x7f392469a6d0>)
File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/utils.py", line 52, in execute
return fn(*args, **kwargs)
File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/ray_launcher.py", line 295, in _wrapping_function
self._strategy._worker_setup(process_idx=global_rank)
File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 170, in _worker_setup
self._process_group_backend = self._get_process_group_backend()
File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 166, in _get_process_group_backend
or get_default_process_group_backend_for_device(self.root_device)
File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 295, in root_device
cuda_visible_list = [
File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 296, in <listcomp>
int(dev) for dev in cuda_visible_str.split(",")
ValueError: invalid literal for int() with base 10: 'GPU-dade4b6e-8461-eee0-e8bb-4f7e570856f4'
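The root cause is simple: the worker's `CUDA_VISIBLE_DEVICES` entries are UUID-style device names, which `int()` can't parse:

```python
>>> int("GPU-dade4b6e-8461-eee0-e8bb-4f7e570856f4")
Traceback (most recent call last):
  ...
ValueError: invalid literal for int() with base 10: 'GPU-dade4b6e-8461-eee0-e8bb-4f7e570856f4'
```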
It seems like the internal code relies on an ordinal GPU device naming scheme, i.e.,

```
$ echo $CUDA_VISIBLE_DEVICES
0,1
```

which seems reasonable, given that's what I typically encounter on most systems.
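But on my system, the GPU device naming looks something like this (using the device name from the traceback above):

```
$ echo $CUDA_VISIBLE_DEVICES
GPU-dade4b6e-8461-eee0-e8bb-4f7e570856f4
```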
So it seems like there are two options:

1. I could ask my sys-admins to rename the GPUs on the cluster to the more "standard" ordinal scheme. They'll probably tell me "No." and reference the `CUDA_VISIBLE_DEVICES` specification, which states that device names of the form `GPU-<UUID>` are a second valid option in addition to integer indices.
2. The `ray_lightning` code could be changed to handle both naming schemes (my suggestion follows the current code below).

The offending block in `ray_ddp.py` currently looks like this:
```python
gpu_id = ray.get_gpu_ids()[0]  # NOTE: in the main branch this value is cast with `int(...)`, so the code would break _here_; in v0.3 it breaks later, below
cuda_visible_str = os.environ.get("CUDA_VISIBLE_DEVICES", "")
if cuda_visible_str and cuda_visible_str != "NoDevFiles":
    cuda_visible_list = [
        int(dev) for dev in cuda_visible_str.split(",")
    ]
    device_id = cuda_visible_list.index(gpu_id)
    return torch.device("cuda", device_id)
```
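I think the block should be changed to something like the following (a sketch; the key change is comparing device names as strings rather than casting them to `int`, so both ordinal and `GPU-<UUID>` names work):

```python
gpu_id = str(ray.get_gpu_ids()[0])  # may be an ordinal ("0") or a UUID ("GPU-<UUID>")
cuda_visible_str = os.environ.get("CUDA_VISIBLE_DEVICES", "")
if cuda_visible_str and cuda_visible_str != "NoDevFiles":
    # compare device names as strings: CUDA_VISIBLE_DEVICES may contain
    # integer indices *or* GPU-<UUID> names, and torch numbers the visible
    # devices 0..n-1 in either case, so the positional index is all we need
    cuda_visible_list = cuda_visible_str.split(",")
    device_id = cuda_visible_list.index(gpu_id)
    return torch.device("cuda", device_id)
```

This keeps the ordinal case working unchanged, since `.index(...)` still returns the correct positional device ID, while no longer choking on UUID-style names.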
Thanks for the great work so far!