
ray_ddp gpu issue #179

Open
JiahaoYao opened this issue Jul 12, 2022 · 3 comments

Comments

@JiahaoYao (Contributor)

ray::ImplicitFunc.train() (pid=27359, ip=172.31.59.24, repr=_inner_train)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/trainable.py", line 360, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 404, in step
    self._report_thread_runner_error(block=True)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 574, in _report_thread_runner_error
    raise e
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 277, in run
    self._entrypoint()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 349, in entrypoint
    return self._trainable_func(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in _trainable_func
    output = fn()
  File "test_tune.py", line 37, in _inner_train
    trainer.fit(model)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 62, in launch
    ray_output = self.run_function_on_workers(
  File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 224, in run_function_on_workers
    results = process_results(self._futures, self.tune_queue)
  File "/home/ray/default/ray_lightning/ray_lightning/util.py", line 62, in process_results
    ray.get(ready)
ray.exceptions.RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=27475, ip=172.31.59.24, repr=<ray_lightning.launchers.ray_launcher.RayExecutor object at 0x7f2c3c105610>)
  File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 356, in execute
    return fn(*args, **kwargs)
  File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 256, in _wrapping_function
    self._strategy.set_cuda_device_if_used()
  File "/home/ray/default/ray_lightning/ray_lightning/ray_ddp.py", line 233, in set_cuda_device_if_used
    torch.cuda.set_device(self.root_device)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/torch/cuda/__init__.py", line 264, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

The run fails with CUDA error: invalid device ordinal.
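
For context, torch.cuda.set_device raises this error whenever the requested ordinal is outside the range of GPUs the process can see. A minimal sketch of the failure mode, independent of ray_lightning (the CUDA_VISIBLE_DEVICES value here is illustrative):

import os
import torch

# Ray typically restricts each worker to its assigned GPUs via
# CUDA_VISIBLE_DEVICES, and PyTorch re-indexes the visible GPUs from 0.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # the worker sees a single GPU

torch.cuda.set_device(0)  # OK: local index 0 is visible
torch.cuda.set_device(1)  # RuntimeError: CUDA error: invalid device ordinal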

@JiahaoYao (Contributor, Author)

# Imports assumed from the ray_lightning test suite:
from ray import tune
from ray_lightning import RayStrategy
from ray_lightning.tune import TuneReportCallback, get_tune_resources


def tune_test(dir, strategy):
    callbacks = [TuneReportCallback(on="validation_end")]
    # train_func (defined elsewhere in test_tune.py) returns the
    # _inner_train trainable that appears in the traceback above.
    analysis = tune.run(
        train_func(dir, strategy, callbacks=callbacks),
        config={"max_epochs": tune.choice([1, 2, 3])},
        resources_per_trial=get_tune_resources(
            num_workers=strategy.num_workers, use_gpu=strategy.use_gpu),
        num_samples=2)
    assert all(analysis.results_df["training_iteration"] ==
               analysis.results_df["config.max_epochs"])


def test_tune_iteration_ddp():
    """Tests if each RayStrategy runs the correct number of iterations."""
    tmpdir = './'
    strategy = RayStrategy(num_workers=2, use_gpu=True)
    tune_test(tmpdir, strategy)

This is the code that reproduces the error.
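
As a hypothetical diagnostic (not part of the test above), one could log what each Ray worker actually sees before the strategy calls torch.cuda.set_device:

import os
import ray
import torch

@ray.remote(num_gpus=1)
def inspect_worker():
    # Compare Ray's GPU assignment with what CUDA exposes to the process.
    return {
        "ray_gpu_ids": ray.get_gpu_ids(),
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "torch_device_count": torch.cuda.device_count(),
    }

ray.init()
print(ray.get(inspect_worker.remote()))

If ray_gpu_ids reports a nonzero global ID while torch_device_count is 1, the global ordinal cannot be passed to torch.cuda.set_device directly.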

@JiahaoYao (Contributor, Author)

It seems to be a GPU ID issue: torch.cuda.set_device cannot assign the requested device ordinal.
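
If that is the case, one possible workaround is to translate the global ordinal into a local one. This is a hypothetical sketch of set_cuda_device_if_used (the method name comes from the traceback; the body below is an assumption, not the actual ray_lightning fix):

import os
import torch

def set_cuda_device_if_used(self):
    # Hypothetical patch: when Ray has narrowed CUDA_VISIBLE_DEVICES,
    # PyTorch re-indexes the visible GPUs from 0, so use the local
    # index instead of the global ordinal stored in self.root_device.
    if self.use_gpu:
        if os.environ.get("CUDA_VISIBLE_DEVICES"):
            torch.cuda.set_device(torch.device("cuda", 0))
        else:
            torch.cuda.set_device(self.root_device)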
