
[air] pyarrow.fs persistence (10/n): Unify Tune and Train sessions to support new persistence path in FunctionTrainable #38284

Merged
ericl merged 34 commits into ray-project:master from air/persistence/unify_sessions on Aug 12, 2023

Conversation

@justinvyu justinvyu commented Aug 10, 2023

Why are these changes needed?

This PR:

  • Unifies _StatusReporter (Tune session) and _TrainSession. _TrainSession is the only one that's used now.
  • This is so that train.report has a single implementation for both Train workers and Tune function trainables. This call to train.report uploads the checkpoints directly to storage (see the usage sketch below).
  • Now, function trainables don't need to rely on the old class Trainable syncing codepath.
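
For context, here is a minimal sketch of the unified reporting path this PR enables. It uses the public `ray.train` / `ray.tune` surface (roughly the Ray 2.7+ API); the function name, metric keys, and paths are illustrative, not taken from this PR.

```python
import os
import tempfile

from ray import train, tune
from ray.train import Checkpoint, RunConfig


def train_fn(config):
    # Illustrative training loop; `config` comes from the Tune search space.
    for step in range(3):
        loss = 1.0 / (step + config["lr"])
        with tempfile.TemporaryDirectory() as tmpdir:
            with open(os.path.join(tmpdir, "model.txt"), "w") as f:
                f.write(str(loss))
            # The same train.report call is used by Train workers and Tune
            # function trainables; with the new persistence path, the
            # checkpoint is persisted directly to the configured storage.
            train.report(
                {"loss": loss, "step": step},
                checkpoint=Checkpoint.from_directory(tmpdir),
            )


tuner = tune.Tuner(
    train_fn,
    param_space={"lr": tune.grid_search([0.1, 1.0])},
    run_config=RunConfig(storage_path="/tmp/ray_results", name="unify_sessions_demo"),
)
tuner.fit()
```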

TODO

These things need a more thorough review:

  • Is actor reuse working correctly? See next PR.
  • Eager vs. lazy training function execution (see the comment about eager_mode in the code). See discussion below.
  • Are there any other behavior changes introduced by switching from FunctionTrainable/_StatusReporter to _TrainSession?
    • Particularly the new part which adds a stop_event, as well as handling get_next() returning None.
    • Went through a few rounds of bug-fixing here.
  • Add to the e2e test case for Tuner.
  • Probably want to get rid of these references to global variables. Can probably get rid of the init_shared_storage_context, now that it's stored in a global session already.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

ericl commented Aug 10, 2023

Can you provide more context for the use cases behind eager vs. non-eager mode? When/why did we introduce this?

If I understand correctly, the difference is just whether we block result processing until the Tune driver has received the message?

@ericl ericl left a comment

In particular, I'm wondering why not always use "eager mode".

@justinvyu
@ericl The difference in flows:

Non-eager (what the Tune fn trainable did before this PR):
> Trainable.train > Unblock > Do work > Report result > Blocked > Driver fetches the result

Eager (what Train workers did before this PR):
> Already unblocked > Trainable.train > Keep doing work > Report result > Blocked > Driver fetches the result > Unblock

Tune's FunctionTrainable basically implemented the non-eager version before. This is because Tune trainables are expected not to do anything after a call to Trainable.train finishes and reports back a result. After this result comes back, Tune could pause the current trial and try to swap in a new trial. (Or run more complex PBT/BOHB logic.)

If we do "eager mode" here, then the training fn will have continued doing more work, and it can no longer be stopped gracefully in the middle when the scheduler tells it to stop.
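
To make the two flows above concrete, here is a toy, thread-based sketch (illustration only; `ToySession` and its method names are made up here and are not Ray's actual session internals):

```python
import queue
import threading


class ToySession:
    """Illustration only: not Ray's real session class or method names."""

    def __init__(self, synchronous_result_reporting: bool):
        self.synchronous = synchronous_result_reporting
        self._results = queue.Queue()
        self._resume = threading.Event()

    # --- training-thread side ---
    def wait_for_train_call(self):
        if self.synchronous:
            # Non-eager: park at the top of each step until the driver calls
            # train() again, so the scheduler can pause/stop the trial cleanly.
            self._resume.wait()
            self._resume.clear()
        # Eager: never parks here; the thread is "already unblocked".

    def report(self, result):
        self._results.put(result)
        # Both flows block here until the driver has fetched the result.
        self._results.join()

    # --- driver side ---
    def train(self):
        # Roughly the Trainable.train contract: unblock the training thread
        # (only matters in synchronous mode), then wait for its next result.
        self._resume.set()
        result = self._results.get()
        self._results.task_done()  # lets report() return
        return result


def training_fn(session):
    for step in range(3):
        session.wait_for_train_call()
        # ... do one step of work here ...
        session.report({"step": step})


if __name__ == "__main__":
    session = ToySession(synchronous_result_reporting=True)
    threading.Thread(target=training_fn, args=(session,), daemon=True).start()
    for _ in range(3):
        print(session.train())
```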

ericl commented Aug 10, 2023

Ok I see. I think my main confusion is just around the naming then. How about calling this synchronous_result_reporting, and adding a comment that synchronous reporting is needed for the advanced schedulers you mentioned, such as BOHB?

@justinvyu
Yeah, that name makes more sense.

I think this is actually only needed to support reuse_actors. If the training thread continues immediately and then cannot join within a few seconds, the reset operation will fail and the actor will be cleaned up.

Another thing I've been wondering: If you use Train with a scheduler/early stopping condition today, I'm actually not sure who cleans up the train worker actors. Will take a quick look into this.

cc: @krfricke

@krfricke
> Yeah, that name makes more sense.
>
> I think this is actually only needed to support reuse_actors. If the training thread continues immediately and then cannot join within a few seconds, the reset operation will fail and the actor will be cleaned up.

Should we then only set it when reuse_actors=True?

I've thought about this and actually think all other functionality should still work. In fact, we have "buffered training" where we run train multiple times before returning results. Maybe we can get rid of that code path and use the synchronous_result_reporting instead?

> Another thing I've been wondering: If you use Train with a scheduler/early stopping condition today, I'm actually not sure who cleans up the train worker actors. Will take a quick look into this.
>
> cc: @krfricke

We schedule terminations, but I think most things would also work when actors go out of scope/get gc'd.

ericl commented Aug 10, 2023

Hmm, what about the situation for PBT? I'm wondering if that could cause meaningful slowdowns if we cannot synchronously interrupt the trial (i.e., we'd need to wait minutes for the next report call). Is my thinking here right?

@justinvyu
> Should we then only set it when reuse_actors=True?

I was mostly just trying to get identical behavior with the old code for now. Maybe we can consider setting it to False when reuse_actors is disabled in the future? Fn trainables should set reuse_actors=True by default, so I think the behavior would only be different if the user manually turns off reuse_actors. (And different for class trainables, but I'm not touching that for now.)

> Maybe we can get rid of that code path and use the synchronous_result_reporting instead?

I'm not too familiar with the buffered training path -- what can be replaced there? In my mind, there still need to be multiple calls to train, regardless of synchronous_result_reporting.

> Hmm, what about the situation for PBT?

So, if a trial cannot be reset in time (i.e., with synchronous_result_reporting=False), Tune will just give up on trying to reuse the actor (after ~2 seconds) and terminate it instead. Then Tune will create a new actor to start the unpaused PBT trial. So, no waiting for the next report call.

Paused trials are a bit strange with synchronous_result_reporting=False though:

  • Trial A reports a result.
  • Then, PBT pauses trial A.
  • Its training fn keeps going and could report another result and upload a checkpoint that never gets processed. Actually... train workers would also run into this problem if their trial is paused. Maybe this is not so big of a deal though.

I'm also imagining some concurrent read/write issues with the global session. Ex: PBT trying to get the latest checkpoint while the training function has advanced to the next train.report and is setting the latest checkpoint.

I think it's just safer to keep synchronous_result_reporting=True for the Trainable.
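
To illustrate the read/write concern above, a contrived sketch (the names are made up; it just shows that with asynchronous reporting, the driver can read a "latest checkpoint" that the still-running training thread is about to replace):

```python
import threading
import time


class ToyGlobalSession:
    """Made-up stand-in for a per-process global session object."""

    def __init__(self):
        self.latest_checkpoint_result = None

    def report(self, step):
        # With asynchronous reporting, the training thread keeps advancing and
        # overwrites this field even after the scheduler has paused the trial.
        self.latest_checkpoint_result = {"step": step, "checkpoint": f"ckpt_{step}"}


session = ToyGlobalSession()


def training_fn():
    for step in range(1000):
        time.sleep(0.001)  # pretend to do some work
        session.report(step)


threading.Thread(target=training_fn, daemon=True).start()

# Driver side: PBT "pauses" the trial and grabs its latest checkpoint, but the
# training thread may already have produced newer, never-processed checkpoints.
time.sleep(0.05)
snapshot = session.latest_checkpoint_result
time.sleep(0.05)
print("driver saw:       ", snapshot)
print("session now holds:", session.latest_checkpoint_result)  # likely newer
```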

@ericl ericl added the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) Aug 11, 2023
Comment on lines 428 to 429
tune_session: _TrainSession = get_session()
assert tune_session, "`start_training` should only be called from within Tune"
@justinvyu (PR author) commented:
RLlib's LearnerGroup uses BackendExecutor, and it may not be inside a Tune session (??). But it never calls this code and only uses BackendExecutor to start and stop a WorkerGroup. Maybe it should just use the WorkerGroup abstraction directly 😅

Comment on lines -274 to -285
if _use_storage_context():
    assert isinstance(checkpoint, NewCheckpoint)
    logger.debug(f"Checkpoint received by the Tune session: {checkpoint}")
    self._fresh_checkpoint = True
    # TODO(justinvyu): `metrics` doesn't include the autofilled metrics
    # like `training_iteration` and `time_total_s`.
    # Should the session be the source of truth for these metrics?
    self._latest_checkpoint_result = _TrainingResult(
        checkpoint=checkpoint, metrics=metrics
    )

    self._last_checkpoint = None
@justinvyu (PR author) commented:
Note: Reverted this code back to what it was before, since we don't use _StatusReporter anymore in the new path.

Comment on lines +196 to +197
if not name:
    name = StorageContext.get_experiment_dir_name(run)
@justinvyu (PR author) commented:
For `tune.run` usage, `name` may not be provided, so we need to auto-generate one. `Tuner` will always populate `name`, though.

@justinvyu justinvyu added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) and removed the @author-action-required label Aug 12, 2023
@justinvyu
@ericl This one should be good to merge.

@ericl ericl merged commit e39b18c into ray-project:master Aug 12, 2023
54 of 62 checks passed
@justinvyu justinvyu deleted the air/persistence/unify_sessions branch August 12, 2023 17:50
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
…to support new persistence path in `FunctionTrainable` (ray-project#38284)

Signed-off-by: NripeshN <nn2012@hw.ac.uk>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
…to support new persistence path in `FunctionTrainable` (ray-project#38284)

Signed-off-by: harborn <gangsheng.wu@intel.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
…to support new persistence path in `FunctionTrainable` (ray-project#38284)
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…to support new persistence path in `FunctionTrainable` (ray-project#38284)

Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…to support new persistence path in `FunctionTrainable` (ray-project#38284)

Signed-off-by: Victor <vctr.y.m@example.com>