[air] `pyarrow.fs` persistence (10/n): Unify Tune and Train sessions to support new persistence path in `FunctionTrainable` #38284
Conversation
Can you provide more context for the use cases behind eager vs. non-eager mode? When/why did we introduce this? If I understand correctly, the difference is just whether we block result processing until the Tune driver has received the message?
In particular, I'm wondering why not always use "eager mode".
@ericl The difference is in the flows: Tune's flow blocks the training fn on `report` until the driver has processed the result. If we do "eager mode" here, then the training fn will have continued doing more stuff, and it can no longer be stopped gracefully in the middle when the scheduler tells it to stop.
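To make the difference concrete, here's a minimal, hypothetical sketch of the two modes (names like `_ResultQueue` are made up for illustration, not the actual session implementation):

```python
import queue
import threading

# Illustrative sketch only -- not the actual Tune/Train session code.
class _ResultQueue:
    def __init__(self, eager_mode: bool):
        self._eager_mode = eager_mode
        self._results: "queue.Queue[dict]" = queue.Queue()
        self._processed = threading.Event()

    def report(self, metrics: dict) -> None:
        # Called from the training function's thread.
        self._processed.clear()
        self._results.put(metrics)
        if not self._eager_mode:
            # Non-eager: block here until the driver has consumed the result.
            # While blocked, the trial can still be stopped cleanly "between"
            # reports, since the training fn hasn't run any further code.
            self._processed.wait()
        # Eager: return immediately; the training fn keeps going, so a stop
        # request may arrive mid-computation instead of at a report boundary.

    def get_next(self) -> dict:
        # Called from the driver side.
        result = self._results.get()
        self._processed.set()
        return result
```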
Ok, I see. I think my main confusion is just around the naming then. How about renaming it to something more descriptive?
Yeah, that name makes more sense. I think this is actually only needed to support the graceful stopping described above.

Another thing I've been wondering: if you use Train with a scheduler/early-stopping condition today, I'm actually not sure who cleans up the Train worker actors. Will take a quick look into this. cc: @krfricke
Should we then only set it when it's actually needed?

I've thought about this and actually think all other functionality should still work. In fact, we have "buffered training", where we run train multiple times before returning results. Maybe we can get rid of that code path and use the non-eager reporting instead (see the sketch below).
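For reference, a rough sketch of what "buffered training" means here, under the assumption that it simply batches several step results before handing them back (illustrative code, not Tune's actual implementation):

```python
from typing import Callable, List

# Illustrative sketch only -- not Tune's actual buffered-training code.
def train_buffered(step_fn: Callable[[], dict], buffer_length: int) -> List[dict]:
    """Run `step_fn` several times before returning the batch of results."""
    results = []
    for _ in range(buffer_length):
        results.append(step_fn())
    return results

# Example: buffer three steps' worth of results before reporting back.
counter = {"i": 0}
def step() -> dict:
    counter["i"] += 1
    return {"training_iteration": counter["i"]}

print(train_buffered(step, buffer_length=3))
```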
We schedule terminations, but I think most things would also work when actors go out of scope / get GC'd.
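As a sketch of the two cleanup paths being described (assuming standard Ray actor semantics; `TrainWorker` is a made-up actor):

```python
import ray

@ray.remote
class TrainWorker:
    def ping(self) -> str:
        return "alive"

worker = TrainWorker.remote()
ray.get(worker.ping.remote())

# Path 1: scheduled termination via an explicit kill.
ray.kill(worker)

# Path 2: simply dropping the last handle. Ray eventually garbage-collects
# a (non-detached) actor once no references to it remain.
another_worker = TrainWorker.remote()
del another_worker  # actor is terminated once the handle goes out of scope
```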
Hmm, what about the situation for PBT? I'm wondering if that could cause meaningful slowdowns if we cannot synchronously interrupt the trial (i.e., we'd need to wait for minutes until the next report call). Is my thinking here right?
I was mostly just trying to get behavior identical to the old code for now. Maybe we can consider changing the default as a follow-up.
I'm not too familiar with the Train buffered path -- what can be replaced there? In my mind there still need to be multiple calls to `get_next()`.
So, there's the case where a trial cannot be reset in time. Paused trials are also a bit strange with this.
I'm also imagining some concurrent read/write issues with the global session state. I think it's just safer to keep the current behavior.
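A hypothetical illustration of guarding a process-global session with a lock, in the spirit of the concern above (not the actual Ray code; the function names are illustrative):

```python
import threading
from typing import Optional

_session_lock = threading.Lock()
_session: Optional[object] = None  # would be a `_TrainSession` in practice

def get_session() -> Optional[object]:
    # Reads take the lock so a concurrent init/shutdown can't be observed
    # half-applied.
    with _session_lock:
        return _session

def init_session(session: object) -> None:
    global _session
    with _session_lock:
        if _session is not None:
            raise RuntimeError("A session is already active in this process.")
        _session = session

def shutdown_session() -> None:
    global _session
    with _session_lock:
        _session = None
```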
```python
tune_session: _TrainSession = get_session()
assert tune_session, "`start_training` should only be called from within Tune"
```
RLlib's `LearnerGroup` uses `BackendExecutor`, and they may not be inside a Tune session (??). But they never call this code and only use it to start and stop a `WorkerGroup`. Maybe they should just use the `WorkerGroup` abstraction directly 😅
```python
if _use_storage_context():
    assert isinstance(checkpoint, NewCheckpoint)
    logger.debug(f"Checkpoint received by the Tune session: {checkpoint}")
    self._fresh_checkpoint = True
    # TODO(justinvyu): `metrics` doesn't include the autofilled metrics
    # like `training_iteration` and `time_total_s`.
    # Should the session be the source of truth for these metrics?
    self._latest_checkpoint_result = _TrainingResult(
        checkpoint=checkpoint, metrics=metrics
    )
```

```python
self._last_checkpoint = None
```
Note: Reverted this code back to what it was before, since we don't use `_StatusReporter` anymore in the new path.
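Regarding the TODO about autofilled metrics in the hunk above: if the session became the source of truth, it could stamp each reported result itself. A hypothetical sketch (illustrative names, not actual Ray code):

```python
import time

# Hypothetical sketch: the session tracks iteration count and elapsed time
# and autofills `training_iteration` and `time_total_s` into each result.
class _SessionMetrics:
    def __init__(self) -> None:
        self._iteration = 0
        self._start_time = time.monotonic()

    def autofill(self, metrics: dict) -> dict:
        self._iteration += 1
        return {
            **metrics,
            "training_iteration": self._iteration,
            "time_total_s": time.monotonic() - self._start_time,
        }
```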
```python
if not name:
    name = StorageContext.get_experiment_dir_name(run)
```
For `tune.run` usage, `name` may not be provided, so we need to auto-generate one. `Tuner` will always populate `name`, though.
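A rough sketch of what auto-generating the experiment dir name could look like (the real `StorageContext.get_experiment_dir_name` may use a different format):

```python
import datetime

# Hypothetical sketch: derive a default experiment dir name from the run
# name plus a timestamp, so concurrent runs don't collide.
def get_experiment_dir_name(run_name: str) -> str:
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    return f"{run_name}_{timestamp}"
```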
@ericl This one should be good to merge.
Why are these changes needed?
This PR:
- Unifies `_StatusReporter` (the Tune session) and `_TrainSession`. `_TrainSession` is the only one that's used now.
- `train.report` has a single implementation for both Train workers and Tune function trainables. This call to `train.report` uploads the checkpoints directly to storage (see the usage sketch below).
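As a usage sketch of the unified call (assuming the new `ray.train` API this PR stack moves toward; the training function would be launched by a `Tuner` or a Train trainer, and the checkpoint contents here are placeholders):

```python
import tempfile

from ray import train
from ray.train import Checkpoint

def train_fn(config: dict) -> None:
    for i in range(3):
        with tempfile.TemporaryDirectory() as tmpdir:
            # ... write model state into `tmpdir` here ...
            checkpoint = Checkpoint.from_directory(tmpdir)
            # The same call works inside a Train worker and inside a Tune
            # function trainable; the unified session uploads the checkpoint
            # to storage.
            train.report({"loss": 1.0 / (i + 1)}, checkpoint=checkpoint)
```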
TODO
These things need a more thorough review:
- The eager vs. non-eager reporting behavior (`eager_mode` in the code). See discussion below.
- Moving logic from `FunctionTrainable`/`_StatusReporter` to `_TrainSession`?
- The `stop_event`, as well as handling `get_next()` returning `None`.
- Add to the e2e test case for `Tuner`.
- Probably want to get rid of these references to global variables. Can probably get rid of `init_shared_storage_context`, now that it's stored in a global session already.
Related issue number
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.