[RLlib] Headnode without GPU triggers torch/CUDA de-serialization error #53467

@ArturNiederfahrenhorst

What happened + What you expected to happen

Running the reproduction script below results in an error when the cluster's head node is a CPU-only node.
I didn't test other algorithms or versions, but I assume the issue exists there, too.
The root cause is that the MetricsLogger sometimes returns torch tensors that live on the GPU, and torch deserializes a tensor onto the same device type it was serialized from, which fails on a node without CUDA.

Error:

File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/rllib/algorithms/algorithm.py", line 2865, in get_state
state[COMPONENT_LEARNER_GROUP] = self.learner_group.get_state(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/rllib/core/learner/learner_group.py", line 481, in get_state
state[COMPONENT_LEARNER] = self._get_results(results)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/rllib/core/learner/learner_group.py", line 632, in _get_results
raise result_or_error
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/rllib/utils/actor_manager.py", line 861, in _fetch_result
result = ray.get(ready)
^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/worker.py", line 2822, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/worker.py", line 932, in get_objects
raise value
ray.exceptions.RaySystemError: System error: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
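
The underlying torch behavior can be reproduced in isolation: serializing a CUDA tensor records its device, and restoring it on a machine without CUDA raises exactly this error. A minimal sketch of that behavior (the file path is hypothetical):

```python
import torch

# On a node with a GPU: serialize a tensor that lives on the GPU.
t = torch.ones(3, device="cuda")
torch.save(t, "/tmp/tensor.pt")  # hypothetical path

# On a CPU-only node (e.g. the head node): the default load fails, because
# torch tries to restore the storage onto the CUDA device it was saved from.
try:
    torch.load("/tmp/tensor.pt")
except RuntimeError as e:
    print(e)  # "Attempting to deserialize object on a CUDA device ..."

# map_location remaps the storage onto the CPU instead.
t = torch.load("/tmp/tensor.pt", map_location=torch.device("cpu"))
```

Ray serializes actor results via pickle, and unpickling a torch tensor goes through the same storage-restore path, so the `ray.get()` in the traceback hits the same failure on the GPU-less head node.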

Versions / Dependencies

Ray 2.46

Reproduction script

```python
from ray.rllib.algorithms import appo

alg_config = (
    appo.APPOConfig()
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .environment(
        env="CartPole-v1",
        disable_env_checking=True,
    )
    .learners(
        num_learners=1,
        num_gpus_per_learner=1,
    )
    .reporting(min_time_s_per_iteration=1)
)

algo = alg_config.build()

algo.train()

state = algo.get_state()
```
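
A possible fix direction is to move any torch tensors to CPU before they leave the Learner actor. A minimal sketch, assuming the metrics arrive as a nested structure; the helper name is hypothetical and not part of the RLlib API:

```python
import torch
import tree  # dm-tree, already an RLlib dependency

def tensors_to_cpu(struct):
    """Hypothetical helper: recursively detach any torch tensors in a nested
    structure and move them to CPU, so the result deserializes on nodes
    without a GPU."""
    return tree.map_structure(
        lambda v: v.detach().cpu() if isinstance(v, torch.Tensor) else v,
        struct,
    )
```

Applied to the MetricsLogger results before they are returned across the Ray object store (e.g. inside the Learner's `get_state()`), this would make the state device-agnostic.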

Issue Severity

None


Labels

P1: Issue that should be fixed within a few weeks
bug: Something that is supposed to be working, but isn't
rllib: RLlib related issues
