Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RLlib] Introduce Checkpointable API for RLlib components and subcomponents. #46376

Merged
merged 6 commits into from
Jul 3, 2024

Conversation

sven1977
Copy link
Contributor

@sven1977 sven1977 commented Jul 2, 2024

Introduce Checkpointable API for RLlib components and subcomponents.
The API should define common rules of engagement for checkpointable components of RLlib, such as Algorithm, RLModule, Learner, and EnvRunner.

  • Any Checkpointable will be able to restore its state from disk, save its state to disk, construct a new instance of itself from disk (incl. saved state) and define subcomponents that are themselves Checkpointable.
  • This will allow users to decompose the components of an Algorithm in a later stage of their experiments, for example, if they want to run only a trained RLModule in production and in an RLlib-free environment.
  • In a follow-up PR, we will attach this API to all classes that should implement it: Algorithm, LearnerGroup, Learner, RLModule, ConnectorV2, and EnvRunner.

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: sven1977 <svenmika1977@gmail.com>
Copy link
Collaborator

@simonsays1980 simonsays1980 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. A question in regard to the class definition and I would like to see the code more pathlib-ish - but this is my taste.



@PublicAPI(stability="alpha")
class Checkpointable(abc.ABC):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dumb question at the beginning: Can't we directly derive from train.Checkpoint or at least its metaclass '_CheckpointMetaClass`?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are different concepts, I believe. train.Checkpoint is just describing the checkpoint itself, NOT the component that does get checkpointed.

I would probably interpret the new Checkpointable more as a competitor to tune.Trainable with its existing save and restore APIs and subsequent (save_checkpoint and load_checkpoint). We'll have to see, how to fit the Algorithm class itself into this new API and whether it's even possible. But for all the sub-components (which - in isolation - will never be used with Tune), this new API will unify and simplify a lot of things for our users. Especially when going to production/deployment.

Delivers an important promise of the new API stack: Plugability and re-usability of all components outside RLlib.


def save_to_path(
self,
path: Optional[Union[str, pathlib.Path]] = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does exclude paths to cloud storage like S3 I think. In the future we should provide this. When running with tune I guess tune takes care of this b/c it syncs basically the driver files with the S3 bucket (or other object storage).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. Actually, for Algorithm I'm not sure yet, how we are going to make this an implementer of Checkpointable API as it might collide with tune.Trainable in exactly these functionalities.

Keep in mind that the subcomponents of RLlib (e.g. RLModule, Learner, etc..) - if used in isolation - will never be run in connection with Tune, so this should be fine. It would be super nice, though, to have Algorithm also abide to this new API, but we'll see. For now, it'll use the tune.Trainable's save_checkpoint, which takes care of S3/remote synching.

component_dir = path / component_name
# If subcomponent's dir is not in path, ignore it and don't restore this
# subcomponent's state from disk.
if not os.path.isdir(component_dir):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we stick here to the pathlib module and use component_dir.is_dir()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great call. Always get confused with os.path vs pathlib. :|

def get_metadata(self) -> Dict:
"""Returns JSON writable metadata further describing the implementing class.

Note that this metadata is NOT part of any state and is thus NOT needed to
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Like this ot becomes more handy.

Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
@sven1977 sven1977 enabled auto-merge (squash) July 2, 2024 19:15
@github-actions github-actions bot disabled auto-merge July 2, 2024 19:15
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Jul 2, 2024
Signed-off-by: sven1977 <svenmika1977@gmail.com>
@sven1977 sven1977 enabled auto-merge (squash) July 3, 2024 11:14
@sven1977 sven1977 merged commit b643ed8 into ray-project:master Jul 3, 2024
7 checks passed
@sven1977 sven1977 deleted the checkpointable_api branch July 5, 2024 11:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants