
[RLlib] Restoring an algorithm from_checkpoint expects same number of rollout workers available #36761

Open
mzat-msft opened this issue Jun 23, 2023 · 4 comments

@mzat-msft

What happened + What you expected to happen

When I train an algorithm with Tune, specifying for example num_tune_samples=10, and then try to restore the best algorithm using Algorithm.from_checkpoint(), Ray tries to get 10 CPUs from the machine.
If the machine does not have enough CPUs available, it starts throwing this warning and never restores the algorithm:

(autoscaler +2m24s) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.

I would expect this to be portable and to work on any machine I bring the checkpoints to.

Versions / Dependencies

Observed with ray==2.3.0 and tensorflow==2.11.1 on Linux, but I believe it is a common issue.

Reproduction script

from gymnasium import Env, spaces
from ray.rllib.algorithms.algorithm import Algorithm
from ray.tune.registry import register_env


class DummyEnv(Env):
    def __init__(self, env_config, observation_space=None, action_space=None):
        if observation_space is None:
            raise TypeError("observation_space cannot be of type None.")
        self.observation_space = observation_space

        if action_space is None:
            raise TypeError("action_space cannot be of type None.")
        self.action_space = action_space

    def step(self, action):
        return self.observation_space.sample(), 0, False, False, {}

    def reset(self, *, seed=None, options=None):
        return self.observation_space.sample(), {}


def restore_agent(
    observation_space,
    action_space,
    checkpoint_path,
    name_env="sim_env",
):
    register_env(name_env, lambda conf: DummyEnv(conf, observation_space, action_space))
    return Algorithm.from_checkpoint(checkpoint_path)

agent = restore_agent(
    spaces.Discrete(2),
    spaces.Discrete(2),
    'checkpoints/folder/with-10-rollout-workers',
    'name-of-environment-used-for-training',
)

Issue Severity

High: It blocks me from completing my task.

mzat-msft added the bug and triage labels Jun 23, 2023
mzat-msft added a commit to Azure/plato that referenced this issue Jun 23, 2023
Apparently when restoring an Algorithm using ``from_checkpoint``, Ray tries
to initialize the same number of workers used for training.
In order to avoid this, we re-implement the unpickle approach for restoring
the agent where we override the number of rollout workers in the Algorithm
config.
This issue was raised to the RLlib team: ray-project/ray#36761

Also, the number of rollout workers is set to 0 to avoid the bug solved in
071fd69.

In addition to this, the restore from pickle does not work correctly with
ray==2.5.0. In fact, it restores the agent correctly but when trying to use
``compute_single_action`` it throws a

    'NoneType' object has no attribute 'compute_single_action'
@avnishn
Contributor

avnishn commented Jun 26, 2023

Something you can do here is to directly restore the RLModule that is inside the policy instead, either for training or for inference.

Here are some tests that act as pretty good documentation of the new recommended way to restore trained policies/RLModules:

https://sourcegraph.com/github.com/ray-project/ray/-/blob/release/rllib_tests/checkpointing_tests/test_e2e_rl_module_restore.py?L176

Let me know if something like this works for you. Thanks :)
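
As a rough sketch of the idea (not the exact RLModule-restore flow from the linked test): you can also load just the trained policy out of the algorithm checkpoint with Policy.from_checkpoint, which does not create any rollout workers. Here, checkpoint_path is assumed to be the same checkpoint directory as in the report above.

from ray.rllib.policy.policy import Policy

# For an Algorithm checkpoint, Policy.from_checkpoint returns a dict mapping
# policy IDs to restored Policy objects; no rollout workers are spun up.
policies = Policy.from_checkpoint(checkpoint_path)
policy = policies["default_policy"]

# Inference without rebuilding the whole Algorithm (sampled obs just for illustration):
obs = policy.observation_space.sample()
action, _, _ = policy.compute_single_action(obs)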

avnishn removed the bug and triage labels Jun 26, 2023
@avnishn
Contributor

avnishn commented Jun 26, 2023

related: #36830

avnishn added the P2 (Important issue, but not time-critical) and rllib (RLlib related issues) labels Jun 26, 2023
@mzat-msft
Author

mzat-msft commented Jun 27, 2023

Hi, thanks for your suggestion.

Are you suggesting to basically rebuild the algorithm config, overriding the number of workers, and then use Algorithm.restore() to load the weights?
IIUC, is this equivalent to what I implemented here: Azure/plato@cfba87d?
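
To make sure we mean the same thing, here is a minimal sketch of that approach (assuming PPO and the env registration from the reproduction script; every setting other than the worker count would have to match the original training config for the weights to load correctly):

from ray.rllib.algorithms.ppo import PPOConfig

# Rebuild a config that matches training, except for the number of rollout workers.
config = (
    PPOConfig()
    .environment("sim_env")
    .rollouts(num_rollout_workers=0)
)
algo = config.build()

# Load the trained weights from the checkpoint into the freshly built algorithm.
algo.restore(checkpoint_path)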

@sven1977
Contributor

sven1977 commented Jan 5, 2024

Yes, we need to fix this. :)
We will (in the near future) go back to requiring the user to always bring along their (original or changed) configs when restoring.

For now, as a workaround, the following hack should work:

from ray.rllib.algorithms.algorithm import Algorithm
from ray.rllib.utils.checkpoints import get_checkpoint_info

# Instead of calling .from_checkpoint directly, do this procedure
# (`checkpoint` is the path to the checkpoint directory):
checkpoint_info = get_checkpoint_info(checkpoint)
state = Algorithm._checkpoint_info_to_algorithm_state(
    checkpoint_info=checkpoint_info,
    policy_ids=None,
    policy_mapping_fn=None,
    policies_to_train=None,
)

state["config"] = ...  # drop in your own, altered (num_rollout_workers?) AlgorithmConfig (not the old config dict!) object here

algo = Algorithm.from_state(state)

# This `algo` should now have/require fewer rollout workers.
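
For example, one hypothetical way to fill in the state["config"] = ... placeholder without retyping the whole training config (this assumes the restored state["config"] is already an AlgorithmConfig object, which may depend on the Ray version):

# Make a mutable copy of the checkpointed training config and only override the worker count.
new_config = state["config"].copy(copy_frozen=False)
state["config"] = new_config.rollouts(num_rollout_workers=0)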
