This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

Multi-GPU training fails with ValueError on systems with UUID GPU IDs #236

Closed
davidegraff opened this issue Dec 19, 2022 · 1 comment · Fixed by #239

Comments

@davidegraff

I'm currently trying to use ray_lightning to distribute model training over the resources in my ray cluster, like so:

import ray
from pytorch_lightning import Trainer as PlTrainer  # PlTrainer is assumed to alias the Lightning Trainer
from ray_lightning import RayStrategy

ngpu = int(ray.cluster_resources().get("GPU", 0))
use_gpu = ngpu > 0
num_workers = ngpu
ncpu = 8
strategy = RayStrategy(num_workers, ncpu, use_gpu, find_unused_parameters=False)
# define dataloaders
# define callbacks
trainer = PlTrainer(
    logger=False,
    max_epochs=50,
    callbacks=callbacks,
    gpus=1,
    enable_model_summary=False,
    enable_checkpointing=False,
    strategy=strategy,
)
trainer.fit(lit_model, train_dataloader, val_dataloader)

However, this code results in a ValueError:

  File "/home/gridsan/dgraff/molpal/molpal/models/mpnmodels.py", line 207, in train
    trainer.fit(lit_model, train_dataloader, val_dataloader)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/ray_launcher.py", line 58, in launch
    ray_output = self.run_function_on_workers(
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/ray_launcher.py", line 249, in run_function_on_workers
    results = process_results(self._futures, self.tune_queue)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/util.py", line 64, in process_results
    ray.get(ready)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray/_private/worker.py", line 2289, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::RayExecutor.execute() (pid=49053, ip=172.31.130.105, repr=<ray_lightning.launchers.utils.RayExecutor object at 0x7f392469a6d0>)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/utils.py", line 52, in execute
    return fn(*args, **kwargs)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/ray_launcher.py", line 295, in _wrapping_function
    self._strategy._worker_setup(process_idx=global_rank)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 170, in _worker_setup
    self._process_group_backend = self._get_process_group_backend()
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 166, in _get_process_group_backend
    or get_default_process_group_backend_for_device(self.root_device)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 295, in root_device
    cuda_visible_list = [
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 296, in <listcomp>
    int(dev) for dev in cuda_visible_str.split(",")
ValueError: invalid literal for int() with base 10: 'GPU-dade4b6e-8461-eee0-e8bb-4f7e570856f4'

It seems like the internal code relies on an ordinal GPU device naming scheme. I.e.,

$ echo $CUDA_VISIBLE_DEVICES
0,1

which seems reasonable, given that's what I typically encounter on most systems. But on my system, the GPU device naming looks something like this:

$ echo $CUDA_VISIBLE_DEVICES
GPU-23c5e712-9b16-e21a-df00-7dab564ade42,GPU-cdaae969-b14c-6b80-2fa2-de8e9efe87a1
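In other words, the int() cast in ray_ddp.py cannot handle entries of this form. A minimal, Ray-independent reproduction of the failing cast (using the example values above):

cuda_visible_str = (
    "GPU-23c5e712-9b16-e21a-df00-7dab564ade42,"
    "GPU-cdaae969-b14c-6b80-2fa2-de8e9efe87a1"
)
# mirrors the list comprehension in ray_ddp.py; raises
# ValueError: invalid literal for int() with base 10
cuda_visible_list = [int(dev) for dev in cuda_visible_str.split(",")]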

So it seems like there are two options:

  1. I could ask my sys-admins to rename the GPUs on the cluster to the more "standard" ordinal scheme. They'll probably tell me "No." and point to the CUDA_VISIBLE_DEVICES specification, which states that device names of the form GPU-<UUID> are a valid option in addition to integer indices.
  2. This block of code in ray_lightning/ray_ddp.py#L292 is altered:
gpu_id = ray.get_gpu_ids()[0]  # NOTE: on the main branch this value is cast with `int(...)`, so the code would break _here_; in v0.3 it breaks later, in the list comprehension below
cuda_visible_str = os.environ.get("CUDA_VISIBLE_DEVICES", "")
if cuda_visible_str and cuda_visible_str != "NoDevFiles":
    cuda_visible_list = [
        int(dev) for dev in cuda_visible_str.split(",")
    ]
    device_id = cuda_visible_list.index(gpu_id)
    return torch.device("cuda", device_id)

I think the block should be changed to:

gpu_id = ray.get_gpu_ids()[0]
cuda_visible_str = os.environ.get("CUDA_VISIBLE_DEVICES", "")
if cuda_visible_str and cuda_visible_str != "NoDevFiles":
    cuda_visible_list = cuda_visible_str.split(",")  # split() already returns a list of strings
    device_id = cuda_visible_list.index(gpu_id)
    return torch.device("cuda", device_id)
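
As a quick, standalone sanity check (assuming ray.get_gpu_ids() returns the id as a string on such systems, and using a throwaway device_index helper), the string-based lookup resolves the device index correctly under both naming schemes:

def device_index(gpu_id: str, cuda_visible_str: str) -> int:
    """Position of the Ray-assigned GPU id within CUDA_VISIBLE_DEVICES."""
    return cuda_visible_str.split(",").index(gpu_id)

# ordinal naming scheme
assert device_index("1", "0,1") == 1

# UUID naming scheme
assert device_index(
    "GPU-cdaae969-b14c-6b80-2fa2-de8e9efe87a1",
    "GPU-23c5e712-9b16-e21a-df00-7dab564ade42,GPU-cdaae969-b14c-6b80-2fa2-de8e9efe87a1",
) == 1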

Thanks for the great work so far!

@amogkam
Collaborator

amogkam commented Dec 20, 2022

Hey @davidegraff, this is a really great callout! Should be fixed by this PR: #239!

amogkam added a commit that referenced this issue Jan 6, 2023
GPU device ids can be specified with an integer index, but may also be specified as strings.

This PR ensures that both cases are supported by root_device. The code is taken from what is being done in Ray Train: https://sourcegraph.com/github.com/ray-project/ray/-/blob/python/ray/train/torch/train_loop_utils.py?L470-498

Closes #236

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
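
For illustration, a sketch of a root_device that accepts both integer and UUID-style ids, based on the PR description above rather than the actual PR code (str() normalizes the id in case Ray reports it as an integer):

import os

import ray
import torch


def resolve_root_device() -> torch.device:
    # GPU id Ray assigned to this worker; may be an ordinal or a GPU-<UUID> string
    gpu_id = ray.get_gpu_ids()[0]
    cuda_visible_str = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if cuda_visible_str and cuda_visible_str != "NoDevFiles":
        # compare as strings so both "0,1" and "GPU-<UUID>" schemes work
        cuda_visible_list = cuda_visible_str.split(",")
        device_id = cuda_visible_list.index(str(gpu_id))
        return torch.device("cuda", device_id)
    # fall back to the first visible device
    return torch.device("cuda", 0)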