Description
What happened + What you expected to happen
Running the script below results in an error when a CPU-only node is used as the head node.
I didn't test other algorithms or versions, but I assume the issue exists there, too.
The root cause is that the MetricsLogger sometimes returns torch tensors that still live on the GPU, and torch serializes a tensor together with its device, so deserialization tries to restore it on the same device type. On a CPU-only head node this fails.
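Here is a minimal sketch of that mechanism using plain torch and pickle (not the actual RLlib code path, though the error message below suggests Ray's result transfer hits the same behavior):

```python
import pickle

import torch

# Pretend this is a metric value produced on a GPU learner. On a CPU-only
# machine this falls back to CPU so the snippet stays runnable.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
metric = torch.tensor([1.0], device=device)

# The device travels with the pickled bytes ...
payload = pickle.dumps(metric)

# ... so unpickling the same bytes on a CPU-only head node raises:
#   "Attempting to deserialize object on a CUDA device but
#    torch.cuda.is_available() is False. ..."
restored = pickle.loads(payload)

# torch.load() accepts map_location to remap storages, but the plain pickle
# path has no such hook, so the tensors have to be moved to CPU before they
# are serialized, e.g.:
safe_metric = metric.detach().cpu()
```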
Error:

```
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/rllib/algorithms/algorithm.py", line 2865, in get_state
state[COMPONENT_LEARNER_GROUP] = self.learner_group.get_state(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/rllib/core/learner/learner_group.py", line 481, in get_state
state[COMPONENT_LEARNER] = self._get_results(results)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/rllib/core/learner/learner_group.py", line 632, in _get_results
raise result_or_error
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/rllib/utils/actor_manager.py", line 861, in _fetch_result
result = ray.get(ready)
^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/worker.py", line 2822, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/worker.py", line 932, in get_objects
raise value
ray.exceptions.RaySystemError: System error: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
```
Versions / Dependencies
Ray 2.46
Reproduction script
```python
from ray.rllib.algorithms import appo

alg_config = (
    appo.APPOConfig()
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .environment(
        env="CartPole-v1",
        disable_env_checking=True,
    )
    .learners(
        num_learners=1,
        num_gpus_per_learner=1,
    )
    .reporting(min_time_s_per_iteration=1)
)

algo = alg_config.build()
algo.train()
state = algo.get_state()
```
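A possible mitigation, sketched with a hypothetical helper (`to_cpu` is not part of the RLlib API): move every tensor in the learner results/state to CPU before it leaves the GPU learner actor, so that the CPU-only head node can deserialize the structure.

```python
import torch


def to_cpu(obj):
    """Recursively detach torch tensors in a nested structure and move them to CPU."""
    if isinstance(obj, torch.Tensor):
        return obj.detach().cpu()
    if isinstance(obj, dict):
        return {key: to_cpu(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(to_cpu(value) for value in obj)
    return obj
```

Where exactly such a conversion would have to hook into the MetricsLogger or LearnerGroup is left open here.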
Issue Severity
None