
[air] pyarrow.fs persistence (10/n): Unify Tune and Train sessions to support new persistence path in FunctionTrainable #38284

Merged
ericl merged 34 commits into ray-project:master from air/persistence/unify_sessions on Aug 12, 2023

Conversation

@justinvyu justinvyu commented Aug 10, 2023

Why are these changes needed?

This PR:

  • Unifies _StatusReporter (Tune session) and _TrainSession. _TrainSession is the only one that's used now.
  • This is so that train.report has a single implementation for both Train workers and Tune function trainables. This call to train.report uploads the checkpoints directly to storage (see the usage sketch below).
  • Now, function trainables don't need to rely on the old class Trainable syncing codepath.
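
For context, here is a minimal sketch of the unified reporting path this PR enables. It uses the public `ray.train` / `ray.tune` surface (roughly the Ray 2.7+ API); the function name, metric keys, and paths are illustrative, not taken from this PR.

```python
import os
import tempfile

from ray import train, tune
from ray.train import Checkpoint, RunConfig


def train_fn(config):
    # Illustrative training loop; `config` comes from the Tune search space.
    for step in range(3):
        loss = 1.0 / (step + config["lr"])
        with tempfile.TemporaryDirectory() as tmpdir:
            with open(os.path.join(tmpdir, "model.txt"), "w") as f:
                f.write(str(loss))
            # The same train.report call is used by Train workers and Tune
            # function trainables; with the new persistence path, the
            # checkpoint is persisted directly to the configured storage.
            train.report(
                {"loss": loss, "step": step},
                checkpoint=Checkpoint.from_directory(tmpdir),
            )


tuner = tune.Tuner(
    train_fn,
    param_space={"lr": tune.grid_search([0.1, 1.0])},
    run_config=RunConfig(storage_path="/tmp/ray_results", name="unify_sessions_demo"),
)
tuner.fit()
```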

TODO

These things need a more thorough review:

  • Is actor reuse working correctly? See next PR.
  • Eager vs. lazy training function execution (see the comment about eager_mode in the code). See discussion below.
  • Are there any other behavior changes introduced by switching from FunctionTrainable/_StatusReporter to _TrainSession?
    • Particularly the new part which adds a stop_event, as well as handling get_next() returning None.
    • Went through a few rounds of bug-fixing here.
  • Add to the e2e test case for Tuner.
  • Probably want to get rid of these references to global variables. Can probably get rid of the init_shared_storage_context, now that it's stored in a global session already.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

ericl commented Aug 10, 2023

Can you provide more context for the use cases behind eager vs. non-eager mode? When/why did we introduce this?

If I understand correctly, the difference is just whether we block result processing until the Tune driver has received the message?

@ericl ericl left a comment

In particular, I'm wondering why not always use "eager mode".

@justinvyu
@ericl The difference in flows:

Non-eager (what the Tune fn trainable did before this PR):
> Trainable.train > Unblock > Do work > Report result > Blocked > Driver fetches the result

Eager (what Train workers did before this PR):
> Already unblocked > Trainable.train > Keep doing work > Report result > Blocked > Driver fetches the result > Unblock

Tune's FunctionTrainable basically implemented the non-eager version before. This is because Tune trainables are expected not to do anything after a call to Trainable.train finishes and reports back a result. After this result comes back, Tune could pause the current trial and try to swap in a new trial. (Or run more complex PBT/BOHB logic.)

If we do "eager mode" here, then the training fn will have continued doing more work, and it can no longer be stopped gracefully in the middle when the scheduler tells it to stop.
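
To make the two flows above concrete, here is a toy, thread-based sketch (illustration only; `ToySession` and its method names are made up here and are not Ray's actual session internals):

```python
import queue
import threading


class ToySession:
    """Illustration only: not Ray's real session class or method names."""

    def __init__(self, synchronous_result_reporting: bool):
        self.synchronous = synchronous_result_reporting
        self._results = queue.Queue()
        self._resume = threading.Event()

    # --- training-thread side ---
    def wait_for_train_call(self):
        if self.synchronous:
            # Non-eager: park at the top of each step until the driver calls
            # train() again, so the scheduler can pause/stop the trial cleanly.
            self._resume.wait()
            self._resume.clear()
        # Eager: never parks here; the thread is "already unblocked".

    def report(self, result):
        self._results.put(result)
        # Both flows block here until the driver has fetched the result.
        self._results.join()

    # --- driver side ---
    def train(self):
        # Roughly the Trainable.train contract: unblock the training thread
        # (only matters in synchronous mode), then wait for its next result.
        self._resume.set()
        result = self._results.get()
        self._results.task_done()  # lets report() return
        return result


def training_fn(session):
    for step in range(3):
        session.wait_for_train_call()
        # ... do one step of work here ...
        session.report({"step": step})


if __name__ == "__main__":
    session = ToySession(synchronous_result_reporting=True)
    threading.Thread(target=training_fn, args=(session,), daemon=True).start()
    for _ in range(3):
        print(session.train())
```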

ericl commented Aug 10, 2023

Ok I see. I think my main confusion is just around the naming then. How about calling this synchronous_result_reporting, and adding a comment that synchronous reporting is needed for the advanced schedulers you mentioned, such as BOHB?

@justinvyu
Yeah, that name makes more sense.

I think this is actually only needed to support reuse_actors. If the training thread continues immediately and then cannot join within a few seconds, the reset operation will fail and the actor will be cleaned up.

Another thing I've been wondering: If you use Train with a scheduler/early stopping condition today, I'm actually not sure who cleans up the train worker actors. Will take a quick look into this.

cc: @krfricke

@krfricke
> Yeah, that name makes more sense.
>
> I think this is actually only needed to support reuse_actors. If the training thread continues immediately and then cannot join within a few seconds, the reset operation will fail and the actor will be cleaned up.

Should we then only set it when reuse_actors=True?

I've thought about this and actually think all other functionality should still work. In fact, we have "buffered training" where we run train multiple times before returning results. Maybe we can get rid of that code path and use the synchronous_result_reporting instead?

> Another thing I've been wondering: If you use Train with a scheduler/early stopping condition today, I'm actually not sure who cleans up the train worker actors. Will take a quick look into this.
>
> cc: @krfricke

We schedule terminations, but I think most things would also work when actors go out of scope/get gc'd.

ericl commented Aug 10, 2023

Hmm, what about the situation for PBT? I'm wondering if that could cause meaningful slowdowns if we cannot synchronously interrupt the trial (i.e., we'd need to wait minutes for the next report call). Is my thinking here right?

@justinvyu
> Should we then only set it when reuse_actors=True?

I was mostly just trying to get identical behavior with the old code for now. Maybe we can consider setting it to False when reuse_actors is disabled in the future? Fn trainables should set reuse_actors=True by default, so I think the behavior would only be different if the user manually turns off reuse_actors. (And different for class trainables, but I'm not touching that for now.)

> Maybe we can get rid of that code path and use the synchronous_result_reporting instead?

I'm not too familiar with the buffered training path -- what can be replaced there? In my mind, there still need to be multiple calls to train, regardless of synchronous_result_reporting.

> Hmm, what about the situation for PBT?

So, if a trial cannot be reset in time (i.e., with synchronous_result_reporting=False), Tune will just give up on trying to reuse the actor (after ~2 seconds) and terminate it instead. Then Tune will create a new actor to start the unpaused PBT trial. So, no waiting for the next report call.

Paused trials are a bit strange with synchronous_result_reporting=False though:

  • Trial A reports a result.
  • Then, PBT pauses trial A.
  • Its training fn keeps going and could report another result and upload a checkpoint that never gets processed. Actually... train workers would also run into this problem if their trial is paused. Maybe this is not so big of a deal though.

I'm also imagining some concurrent read/write issues with the global session. Ex: PBT trying to get the latest checkpoint while the training function has advanced to the next train.report and is setting the latest checkpoint.

I think it's just safer to keep synchronous_result_reporting=True for the Trainable.
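
To illustrate the read/write concern above, a contrived sketch (the names are made up; it just shows that with asynchronous reporting, the driver can read a "latest checkpoint" that the still-running training thread is about to replace):

```python
import threading
import time


class ToyGlobalSession:
    """Made-up stand-in for a per-process global session object."""

    def __init__(self):
        self.latest_checkpoint_result = None

    def report(self, step):
        # With asynchronous reporting, the training thread keeps advancing and
        # overwrites this field even after the scheduler has paused the trial.
        self.latest_checkpoint_result = {"step": step, "checkpoint": f"ckpt_{step}"}


session = ToyGlobalSession()


def training_fn():
    for step in range(1000):
        time.sleep(0.001)  # pretend to do some work
        session.report(step)


threading.Thread(target=training_fn, daemon=True).start()

# Driver side: PBT "pauses" the trial and grabs its latest checkpoint, but the
# training thread may already have produced newer, never-processed checkpoints.
time.sleep(0.05)
snapshot = session.latest_checkpoint_result
time.sleep(0.05)
print("driver saw:       ", snapshot)
print("session now holds:", session.latest_checkpoint_result)  # likely newer
```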

@ericl ericl added the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) Aug 11, 2023
Comment on lines 428 to 429
tune_session: _TrainSession = get_session()
assert tune_session, "`start_training` should only be called from within Tune"
@justinvyu (PR author) commented:
RLlib's LearnerGroup uses BackendExecutor, and it may not be inside a Tune session (??). But it never calls this code and only uses BackendExecutor to start and stop a WorkerGroup. Maybe it should just use the WorkerGroup abstraction directly 😅

Comment on lines -274 to -285
if _use_storage_context():
    assert isinstance(checkpoint, NewCheckpoint)
    logger.debug(f"Checkpoint received by the Tune session: {checkpoint}")
    self._fresh_checkpoint = True
    # TODO(justinvyu): `metrics` doesn't include the autofilled metrics
    # like `training_iteration` and `time_total_s`.
    # Should the session be the source of truth for these metrics?
    self._latest_checkpoint_result = _TrainingResult(
        checkpoint=checkpoint, metrics=metrics
    )

    self._last_checkpoint = None
@justinvyu (PR author) commented:
Note: Reverted this code back to what it was before, since we don't use _StatusReporter anymore in the new path.

Comment on lines +196 to +197
if not name:
    name = StorageContext.get_experiment_dir_name(run)
@justinvyu (PR author) commented:
For `tune.run` usage, `name` may not be provided, so we need to auto-generate one. `Tuner` will always populate `name`, though.

@justinvyu justinvyu added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) and removed the @author-action-required label Aug 12, 2023
@justinvyu
@ericl This one should be good to merge.

@ericl ericl merged commit e39b18c into ray-project:master Aug 12, 2023
54 of 62 checks passed
@justinvyu justinvyu deleted the air/persistence/unify_sessions branch August 12, 2023 17:50
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
…to support new persistence path in `FunctionTrainable` (ray-project#38284)

Signed-off-by: NripeshN <nn2012@hw.ac.uk>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
…to support new persistence path in `FunctionTrainable` (ray-project#38284)

Signed-off-by: harborn <gangsheng.wu@intel.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
…to support new persistence path in `FunctionTrainable` (ray-project#38284)
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…to support new persistence path in `FunctionTrainable` (ray-project#38284)

Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…to support new persistence path in `FunctionTrainable` (ray-project#38284)

Signed-off-by: Victor <vctr.y.m@example.com>