-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RLlib] Introduce Checkpointable
API for RLlib components and subcomponents.
#46376
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. A question in regard to the class definition and I would like to see the code more pathlib-ish - but this is my taste.
|
||
|
||
@PublicAPI(stability="alpha") | ||
class Checkpointable(abc.ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dumb question at the beginning: Can't we directly derive from train.Checkpoint
or at least its metaclass '_CheckpointMetaClass`?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are different concepts, I believe. train.Checkpoint is just describing the checkpoint itself, NOT the component that does get checkpointed.
I would probably interpret the new Checkpointable more as a competitor to tune.Trainable with its existing save
and restore
APIs and subsequent (save_checkpoint and load_checkpoint). We'll have to see, how to fit the Algorithm class itself into this new API and whether it's even possible. But for all the sub-components (which - in isolation - will never be used with Tune), this new API will unify and simplify a lot of things for our users. Especially when going to production/deployment.
Delivers an important promise of the new API stack: Plugability and re-usability of all components outside RLlib.
|
||
def save_to_path( | ||
self, | ||
path: Optional[Union[str, pathlib.Path]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does exclude paths to cloud storage like S3 I think. In the future we should provide this. When running with tune
I guess tune
takes care of this b/c it syncs basically the driver files with the S3 bucket (or other object storage).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. Actually, for Algorithm
I'm not sure yet, how we are going to make this an implementer of Checkpointable API as it might collide with tune.Trainable
in exactly these functionalities.
Keep in mind that the subcomponents of RLlib (e.g. RLModule, Learner, etc..) - if used in isolation - will never be run in connection with Tune, so this should be fine. It would be super nice, though, to have Algorithm also abide to this new API, but we'll see. For now, it'll use the tune.Trainable's save_checkpoint
, which takes care of S3/remote synching.
rllib/utils/checkpoints.py
Outdated
component_dir = path / component_name | ||
# If subcomponent's dir is not in path, ignore it and don't restore this | ||
# subcomponent's state from disk. | ||
if not os.path.isdir(component_dir): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we stick here to the pathlib
module and use component_dir.is_dir()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great call. Always get confused with os.path vs pathlib. :|
def get_metadata(self) -> Dict: | ||
"""Returns JSON writable metadata further describing the implementing class. | ||
|
||
Note that this metadata is NOT part of any state and is thus NOT needed to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Like this ot becomes more handy.
Introduce
Checkpointable
API for RLlib components and subcomponents.The API should define common rules of engagement for checkpointable components of RLlib, such as Algorithm, RLModule, Learner, and EnvRunner.
Why are these changes needed?
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.