[RLlib] Restoring an algorithm from_checkpoint expects same number of rollout workers available #36761
Apparently, when restoring an Algorithm using ``from_checkpoint``, Ray tries to initialize the same number of workers that were used for training. To avoid this, we re-implement the unpickle approach for restoring the agent, overriding the number of rollout workers in the Algorithm config. This issue was raised with the RLlib team: ray-project/ray#36761. The number of rollout workers is also set to 0 to avoid the bug fixed in 071fd69. In addition, restoring from pickle does not work correctly with ray==2.5.0: it restores the agent, but calling ``compute_single_action`` then throws "'NoneType' object has no attribute 'compute_single_action'".
Something you can do here is directly restore the RLModule that is inside the policy instead, either for training or for inference. Here are some tests that act as pretty good documentation of the new way we recommend restoring trained policies/RLModules: Let me know if something like this works for you. Thanks :)
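(As an illustration of this direction, here is a minimal sketch using ``Policy.from_checkpoint``, a related Ray 2.x API rather than the exact tests referenced above, to restore just the trained policy for inference without spinning up rollout workers; the checkpoint path and environment are placeholders:)

```python
import gymnasium as gym
from ray.rllib.policy.policy import Policy

# Placeholder path: an Algorithm checkpoint directory, or a policy
# sub-checkpoint such as <algo_checkpoint>/policies/default_policy.
restored = Policy.from_checkpoint("path/to/checkpoint")

# For an Algorithm checkpoint this returns a dict mapping policy IDs to
# Policy objects; for a policy checkpoint it returns a single Policy.
policy = restored["default_policy"] if isinstance(restored, dict) else restored

# Run inference without any rollout workers.
env = gym.make("CartPole-v1")  # placeholder env
obs, _ = env.reset()
action = policy.compute_single_action(obs)[0]
```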
Related: #36830
Hi, thanks for your suggestion. Are you suggesting that we basically rebuild the algorithm config, override the number of workers, and then use
Yes, we need to fix this. :) For now, as a workaround, the following hack should work:
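(A minimal sketch of such a hack, assuming the Ray 2.x checkpoint layout with an ``algorithm_state.pkl`` file inside the checkpoint directory and that the pickled state stores the ``AlgorithmConfig`` under ``"config"``; the checkpoint path and the PPO class are placeholders for your own experiment:)

```python
import pickle
from pathlib import Path

from ray.rllib.algorithms.ppo import PPO  # placeholder: use your algorithm class

# Placeholder checkpoint directory produced by algo.save() / Tune.
checkpoint_dir = Path("path/to/checkpoint")

# Ray 2.x Algorithm checkpoints keep the pickled state here (the exact
# layout may differ across Ray versions).
with open(checkpoint_dir / "algorithm_state.pkl", "rb") as f:
    state = pickle.load(f)

# Override the worker count so the restore does not try to reserve the
# same number of CPUs that were used during training.
config = state["config"].copy(copy_frozen=False)
config.rollouts(num_rollout_workers=0)

# Build a fresh Algorithm from the patched config, then load the rest of
# the state (weights, counters) from the checkpoint.
algo = PPO(config=config)
algo.restore(str(checkpoint_dir))
```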
What happened + What you expected to happen
When I train an algorithm with Tune, specifying for example ``num_tune_samples=10``, and then try to restore the best algorithm using ``Algorithm.from_checkpoint()``, Ray tries to get 10 CPUs from the machine. If the machine does not have enough CPUs available, it starts throwing this warning and never restores the algorithm:
I would expect this to be portable and to work on any machine I bring the checkpoints to.
Versions / Dependencies
Observed with ``ray==2.3.0`` and ``tensorflow==2.11.1`` on Linux, but I believe it is a common issue.
Reproduction script
Issue Severity
High: It blocks me from completing my task.