[tune] Async restores and S3/GCP-capable trial FT #6376
Conversation
python/ray/tune/durable_trainable.py
class DurableTrainable(Trainable):
    """A fault-tolerant Trainable.
Why not make this a flag-enabled feature of Trainable?
As a user, I am going to be confused by multiple Trainables.
Hm, maybe it's more reasonable to just do the TrainableV2 abstraction then...
Though, TrainableV2 will block this PR further. We can mark this as experimental and merge it as a separate Trainable so our BAIR users can try it out.
Then expose it better once we implement the TrainableV2 abstraction (#6417).
Overview
This PR introduces the following:
- DurableTrainable (impl)
- Async restores
Why it's necessary
Autoscaler scale-up can take arbitrarily long and should not block the control loop. Currently we time out, but this is also problematic because it results in trial failures.
How it works
Currently, trial restoration fetches the worker IP and then syncs the checkpoint to that worker before restoring. Instead, we get rid of the worker-IP lookup and the subsequent sync (since restores are now S3/GS-enabled only). We then store the restore object ID the same way we store training-result object IDs.
The trial runner waits on these as usual, along with the other object IDs, so successful/unsuccessful restores are only handled once autoscaler scale-up is complete. Once a result is ready to fetch, we call `_process_trial_restore`, which fetches the restore result. We distinguish training results from restore results by checking `trial.is_restoring` before calling either `_process_trial` or `_process_trial_restore`.
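The dispatch described above can be sketched with plain-Python stand-ins (no Ray here; `Trial`, `_process_trial`, and `_process_trial_restore` are illustrative mocks, not Tune's actual classes):

```python
class Trial:
    """Minimal stand-in for a Tune trial with a restore-in-flight flag."""
    def __init__(self, name, is_restoring=False):
        self.name = name
        self.is_restoring = is_restoring

def _process_trial(trial, result, events):
    # Normal path: a training result came back.
    events.append(("training_result", trial.name, result))

def _process_trial_restore(trial, result, events):
    # Restore path: the remote restore() finished; resume normal training.
    trial.is_restoring = False
    events.append(("restore_result", trial.name, result))

def on_result_ready(trial, result, events):
    # The runner waits on restore and training futures together and only
    # branches once a result is actually ready to fetch.
    if trial.is_restoring:
        _process_trial_restore(trial, result, events)
    else:
        _process_trial(trial, result, events)

events = []
restoring = Trial("t1", is_restoring=True)
training = Trial("t2")
on_result_ready(restoring, "ckpt_ok", events)
on_result_ready(training, {"loss": 0.1}, events)
```

The point of the flag check is that restore futures can sit in the same wait set as training futures, so a slow autoscaler scale-up never blocks the control loop.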
Trial FT changes
Trial FT is provided conditioned on the user providing an S3 or GS path (NFS support to be added) through `tune.run(..., upload_dir)`. This simplifies the implementation because `trainable.restore.remote(path)` does everything, so we no longer need to perform a sync and make two blocking calls.
However, for backwards compatibility we continue to sync checkpoints to the driver synchronously by default via `sync_on_checkpoint=True`. Then, if no `upload_dir` is provided, we read the checkpoint into memory on the driver and call `trainable.restore_from_object.remote(chkpt)`. This is not ideal, so we log a warning for the user to consider passing in an `upload_dir`.
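The two restore paths above can be sketched in plain Python (no Ray; the function name and return values are illustrative stand-ins, not Tune's API):

```python
import warnings

def plan_restore(checkpoint_path, upload_dir):
    """Decide how a trial would be restored, per the scheme above.

    Returns an (illustrative) tuple naming which remote call the driver
    would issue for the given configuration.
    """
    if upload_dir:
        # Cloud path: trainable.restore.remote(path) does everything;
        # no driver-side sync or extra blocking calls are needed.
        return ("restore.remote", checkpoint_path)
    # Backwards-compatible path: the driver holds the checkpoint bytes
    # (synced via sync_on_checkpoint=True) and ships them as an object.
    warnings.warn(
        "No upload_dir provided; consider tune.run(..., upload_dir=...) "
        "for S3/GS-backed fault tolerance.")
    return ("restore_from_object.remote", checkpoint_path)

mechanism, _ = plan_restore("s3://bucket/exp/chkpt_10", "s3://bucket/exp")
```

With an `upload_dir`, one remote call suffices; without one, the driver stays in the loop, which is exactly why the warning nudges users toward the cloud path.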
Garbage Collection
We can only garbage-collect an old checkpoint once the driver is notified that the newest checkpoint has been persisted. In this PR, GC delegates checkpoint deletion to the trainable (rsync-based deletion is removed), which lets the trainable control how garbage collection is done. An alternative approach would be to have the trainable just delete its local copy as soon as the checkpoint is uploaded, and let the driver delete from the remote store with its own client (which isn't unreasonable, since the driver already has its own client, currently used for global checkpoints).
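A minimal sketch of that persist-then-delete ordering (an illustrative class, not the PR's actual implementation):

```python
class CheckpointGC:
    """Deletes old checkpoints only after newer ones are confirmed persisted."""

    def __init__(self, keep_num=1):
        self.keep_num = keep_num
        self.persisted = []  # checkpoints confirmed durable, oldest first
        self.deleted = []    # what the trainable was asked to delete

    def on_persist_confirmed(self, checkpoint):
        # The driver is notified that the newest checkpoint is safely
        # persisted; only now may older checkpoints be garbage-collected.
        self.persisted.append(checkpoint)
        while len(self.persisted) > self.keep_num:
            old = self.persisted.pop(0)
            # In the PR's scheme the trainable performs the actual deletion
            # (e.g. a remote delete call on the trainable; illustrative).
            self.deleted.append(old)

gc = CheckpointGC(keep_num=1)
gc.on_persist_confirmed("ckpt_1")
gc.on_persist_confirmed("ckpt_2")
```

Ordering matters here: deleting `ckpt_1` before `ckpt_2` is confirmed persisted would leave no valid checkpoint if the upload failed mid-way.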
Example
Extend `DurableTrainable` instead of `Trainable`.
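A sketch of what that subclassing might look like, with a local stub in place of `ray.tune.DurableTrainable` so the snippet is self-contained; the `_train`/`_save`/`_restore` hook names mirror the classic Trainable interface and are an assumption here:

```python
class DurableTrainable:
    """Stub standing in for ray.tune.DurableTrainable (which additionally
    handles cloud persistence of checkpoints), so this runs without Ray."""

    def __init__(self, config=None):
        self.config = config or {}
        self._setup(self.config)

    def _setup(self, config):
        pass

class MyTrainable(DurableTrainable):
    # Same subclassing surface as Trainable; only the base class changes.
    def _setup(self, config):
        self.lr = config.get("lr", 0.01)
        self.iteration = 0

    def _train(self):
        self.iteration += 1
        return {"training_iteration": self.iteration, "lr": self.lr}

    def _save(self, checkpoint_dir):
        # State returned here would be persisted to the upload_dir.
        return {"iteration": self.iteration}

    def _restore(self, checkpoint):
        self.iteration = checkpoint["iteration"]

t = MyTrainable({"lr": 0.1})
result = t._train()
```

From the user's perspective the only change is the base class; checkpoint durability and cloud restore come from `DurableTrainable` itself.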
Misc
Closes #6226.
Partially addresses #6345.
Follow-up PRs
Checks
- I've run `scripts/format.sh` to lint the changes in this PR.