-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[air] pyarrow.fs
persistence (15/n): Trainable and Train worker artifact uploading
#38871
[air] pyarrow.fs
persistence (15/n): Trainable and Train worker artifact uploading
#38871
Conversation
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…persistence/trial_artifacts
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@ericl Should this instead force a WAIT on the previous task, then schedule a new one, and not necessarily block on the new one to finish?
|
Unfortunately I think we have to do this one, otherwise you might miss some recently written files. |
The code should be blocked at this point though, as we're waiting on the previous sync to finish up. What files are going to be missed? |
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
There's no guarantee the previous sync actually picked up a file that was written just before we called report. The sync could have started long before that point. |
If we first wait for the old sync, then launch a new sync in report, I think we should be uploading the most up to date artifacts. |
…persistence/trial_artifacts
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Necessary change for Train
…ifact uploading (ray-project#38871) * implement trial artifact syncing Signed-off-by: Justin Yu <justinvyu@anyscale.com> * update test Signed-off-by: Justin Yu <justinvyu@anyscale.com> * also support for class trainable Signed-off-by: Justin Yu <justinvyu@anyscale.com> * small comment fixes Signed-off-by: Justin Yu <justinvyu@anyscale.com> * guard w/ FF Signed-off-by: Justin Yu <justinvyu@anyscale.com> --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
…ifact uploading (ray-project#38871) * implement trial artifact syncing Signed-off-by: Justin Yu <justinvyu@anyscale.com> * update test Signed-off-by: Justin Yu <justinvyu@anyscale.com> * also support for class trainable Signed-off-by: Justin Yu <justinvyu@anyscale.com> * small comment fixes Signed-off-by: Justin Yu <justinvyu@anyscale.com> * guard w/ FF Signed-off-by: Justin Yu <justinvyu@anyscale.com> --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…ifact uploading (ray-project#38871) * implement trial artifact syncing Signed-off-by: Justin Yu <justinvyu@anyscale.com> * update test Signed-off-by: Justin Yu <justinvyu@anyscale.com> * also support for class trainable Signed-off-by: Justin Yu <justinvyu@anyscale.com> * small comment fixes Signed-off-by: Justin Yu <justinvyu@anyscale.com> * guard w/ FF Signed-off-by: Justin Yu <justinvyu@anyscale.com> --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Jim Thompson <jimthompson5802@gmail.com>
…ifact uploading (ray-project#38871) * implement trial artifact syncing Signed-off-by: Justin Yu <justinvyu@anyscale.com> * update test Signed-off-by: Justin Yu <justinvyu@anyscale.com> * also support for class trainable Signed-off-by: Justin Yu <justinvyu@anyscale.com> * small comment fixes Signed-off-by: Justin Yu <justinvyu@anyscale.com> * guard w/ FF Signed-off-by: Justin Yu <justinvyu@anyscale.com> --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Victor <vctr.y.m@example.com>
Why are these changes needed?
This PR enables artifact syncing on the Trainable (trial driver) and Train remote workers:
sync_on_checkpoint
is now used (previously it was only used in the head node syncing case). It forces a blocking artifact sync to happen on trial checkpoints.sync_artifacts=True
by default still -- to be updated in a follow-up.Related issue number
#38865
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.