
[RLlib] Restoring an algorithm from_checkpoint expects same number of rollout workers available #36761

Open
mzat-msft opened this issue Jun 23, 2023 · 4 comments

@mzat-msft

What happened + What you expected to happen

When I train an algorithm with Tune, specifying for example num_tune_samples=10, and then try to restore the best algorithm using Algorithm.from_checkpoint(), Ray tries to get 10 CPUs from the machine.
If the machine does not have enough CPUs available, it starts throwing this warning and never restores the algorithm:

(autoscaler +2m24s) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.

I would expect this to be portable and to work on any machine I bring the checkpoints to.

Versions / Dependencies

Observed with ray==2.3.0 and tensorflow==2.11.1 on Linux, but I believe it is a common issue.

Reproduction script

from gymnasium import Env, spaces
from ray.rllib.algorithms.algorithm import Algorithm
from ray.tune.registry import register_env


class DummyEnv(Env):
    def __init__(self, env_config, observation_space=None, action_space=None):
        if observation_space is None:
            raise TypeError("observation_space cannot be of type None.")
        self.observation_space = observation_space

        if action_space is None:
            raise TypeError("action_space cannot be of type None.")
        self.action_space = action_space

    def step(self, action):
        return self.observation_space.sample(), 0, False, False, {}

    def reset(self, *, seed=None, options=None):
        return self.observation_space.sample(), {}


def restore_agent(
    observation_space,
    action_space,
    checkpoint_path,
    name_env="sim_env",
):
    register_env(name_env, lambda conf: DummyEnv(conf, observation_space, action_space))
    return Algorithm.from_checkpoint(checkpoint_path)

agent = restore_agent(
    spaces.Discrete(2),
    spaces.Discrete(2),
    'checkpoints/folder/with-10-rollout-workers',
    'name-of-environment-used-for-training',
)

Issue Severity

High: It blocks me from completing my task.

mzat-msft added the bug and triage labels Jun 23, 2023
mzat-msft added a commit to Azure/plato that referenced this issue Jun 23, 2023
Apparently when restoring an Algorithm using ``from_checkpoint``, Ray tries
to initialize the same number of workers used for training.
In order to avoid this, we re-implement the unpickle approach for restoring
the agent where we override the number of rollout workers in the Algorithm
config.
This issue was raised to the RLlib team: ray-project/ray#36761

Also, the number of rollout workers is set to 0 to avoid the bug solved in
071fd69.

In addition to this, the restore from pickle does not work correctly with
ray==2.5.0. In fact, it restores the agent correctly but when trying to use
``compute_single_action`` it throws a

    'NoneType' object has no attribute 'compute_single_action'
@avnishn
Contributor

avnishn commented Jun 26, 2023

Something you can do here is to directly restore the RLModule that is inside the policy instead, either for training or for inference.

Here are some tests that act as pretty good documentation of the new recommended way to restore trained policies/RLModules:

https://sourcegraph.com/github.com/ray-project/ray/-/blob/release/rllib_tests/checkpointing_tests/test_e2e_rl_module_restore.py?L176

Let me know if something like this works for you. Thanks :)
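
As a rough sketch of the idea (not the exact RLModule-restore flow from the linked test): you can also load just the trained policy out of the algorithm checkpoint with Policy.from_checkpoint, which does not create any rollout workers. Here, checkpoint_path is assumed to be the same checkpoint directory as in the report above.

from ray.rllib.policy.policy import Policy

# For an Algorithm checkpoint, Policy.from_checkpoint returns a dict mapping
# policy IDs to restored Policy objects; no rollout workers are spun up.
policies = Policy.from_checkpoint(checkpoint_path)
policy = policies["default_policy"]

# Inference without rebuilding the whole Algorithm (sampled obs just for illustration):
obs = policy.observation_space.sample()
action, _, _ = policy.compute_single_action(obs)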

avnishn removed the bug and triage labels Jun 26, 2023
@avnishn
Contributor

avnishn commented Jun 26, 2023

related: #36830

avnishn added the P2 (Important issue, but not time-critical) and rllib (RLlib related issues) labels Jun 26, 2023
@mzat-msft
Author

mzat-msft commented Jun 27, 2023

Hi, thanks for your suggestion.

Are you suggesting to basically rebuild the algorithm config, overriding the number of workers, and then use Algorithm.restore() to load the weights?
IIUC, is this equivalent to what I implemented here: Azure/plato@cfba87d?
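
To make sure we mean the same thing, here is a minimal sketch of that approach (assuming PPO and the env registration from the reproduction script; every setting other than the worker count would have to match the original training config for the weights to load correctly):

from ray.rllib.algorithms.ppo import PPOConfig

# Rebuild a config that matches training, except for the number of rollout workers.
config = (
    PPOConfig()
    .environment("sim_env")
    .rollouts(num_rollout_workers=0)
)
algo = config.build()

# Load the trained weights from the checkpoint into the freshly built algorithm.
algo.restore(checkpoint_path)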

@sven1977
Contributor

sven1977 commented Jan 5, 2024

Yes, we need to fix this. :)
We will (in the near future) go back to requiring the user to always bring along their (original or changed) configs when restoring.

For now, as a workaround, the following hack should work:

from ray.rllib.algorithms.algorithm import Algorithm
from ray.rllib.utils.checkpoints import get_checkpoint_info

# Instead of calling .from_checkpoint directly, do this procedure
# (`checkpoint` is the path to the checkpoint directory):
checkpoint_info = get_checkpoint_info(checkpoint)
state = Algorithm._checkpoint_info_to_algorithm_state(
    checkpoint_info=checkpoint_info,
    policy_ids=None,
    policy_mapping_fn=None,
    policies_to_train=None,
)

state["config"] = ...  # drop in your own, altered (num_rollout_workers?) AlgorithmConfig (not the old config dict!) object here

algo = Algorithm.from_state(state)

# This `algo` should now have/require fewer rollout workers.
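
For example, one hypothetical way to fill in the state["config"] = ... placeholder without retyping the whole training config (this assumes the restored state["config"] is already an AlgorithmConfig object, which may depend on the Ray version):

# Make a mutable copy of the checkpointed training config and only override the worker count.
new_config = state["config"].copy(copy_frozen=False)
state["config"] = new_config.rollouts(num_rollout_workers=0)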
