
[tune] Async restores and S3/GCP-capable trial FT #6376

Merged
merged 54 commits into ray-project:master from tune-async-save-restore on Jan 3, 2020

Conversation

@ujvl (Contributor) commented Dec 6, 2019

Overview

This PR introduces the following:

  • asynchronous restores (better for performance, and fixes a restoration bug in the autoscaling case)
  • cloud storage options for trial fault tolerance (see the DurableTrainable implementation)
  • removal of the default rsync dependence for trial fault tolerance; rsync is now relied on only for best-effort syncing of logs and checkpoints to the driver, and restores no longer use it
  • garbage collection of checkpoints through the sync client

Async restores

Why it's necessary

Autoscaler scale-up can take an arbitrarily long time and should not block the control loop. Currently we time out, but this is also problematic because it results in trial failures.

How it works

Currently trial restoration is implemented as follows:

# Block on the worker IP lookup, then sync the checkpoint to that worker.
worker_ip = ray.get(trial.runner.worker_ip.remote())
trial.sync_to_worker(worker_ip)
# Block again until the restore completes.
ray.get(trial.runner.restore.remote())

Instead, we get rid of the worker-IP lookup and the subsequent sync (since restores are now S3/GS-enabled only). Then we store the object ID, as is done with training results:

self.running[trial.runner.restore.remote()] = trial

The trial runner waits on these object IDs as usual, alongside the others. Therefore we only handle successful/unsuccessful restores once autoscaler scale-up is complete. Once the object is ready to fetch, we call _process_trial_restore, which fetches the restore result. We distinguish between training results and restore results by checking trial.is_restoring before calling either _process_trial or _process_trial_restore, as sketched below.
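
A minimal sketch of this dispatch, assuming self.running maps in-flight object IDs to trials; the method name _process_events and the loop body are illustrative, not the exact trial runner code:

def _process_events(self):
    # Wait for any in-flight object ID (training result or restore result).
    [result_id], _ = ray.wait(list(self.running))
    trial = self.running.pop(result_id)
    if trial.is_restoring:
        # The object ID came from trial.runner.restore.remote().
        self._process_trial_restore(trial)
    else:
        # The object ID came from a training step.
        self._process_trial(trial)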

Trial FT changes

Trial FT is now provided conditioned on the user providing an S3 or GS path (NFS support to be added) through tune.run(..., upload_dir=...). This simplifies the implementation because trainable.restore.remote(path) does everything, so we no longer need to perform a sync and make two blocking calls.

However, for backwards compatibility we continue to sync checkpoints to the driver synchronously by default (sync_on_checkpoint=True). Then, if no upload_dir is provided, we read the checkpoint into memory on the driver and call trainable.restore_from_object.remote(chkpt). This is not ideal, so we log a warning asking the user to consider passing in an upload_dir. A sketch of the two paths follows.
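
A minimal sketch of how the runner might pick between the two restore paths; trial.upload_dir, checkpoint.value, and checkpoint_to_object are illustrative names, not necessarily the exact attributes:

if trial.upload_dir:
    # Cloud FT path: the worker pulls the checkpoint from S3/GS itself.
    remote = trial.runner.restore.remote(checkpoint.value)
else:
    # Legacy path: read the driver-local checkpoint and ship it as an object.
    obj = checkpoint_to_object(checkpoint.value)
    remote = trial.runner.restore_from_object.remote(obj)
self.running[remote] = trial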

Garbage Collection

We can only garbage collect an old checkpoint once the driver is notified that the newest checkpoint has been persisted. GC is implemented in this PR as follows (rsync-based deletion removed):

  1. Driver receives the "checkpoint persisted" notification.
  2. Driver decides whether a checkpoint needs to be GC'd.
  3. Driver deletes the driver-local copy of the checkpoint, if any.
  4. Driver calls runner.delete_checkpoint.remote().
  5. Worker deletes the worker-local copy of the checkpoint, if any.
  6. Worker calls sync_client.delete to delete the persisted checkpoint.

This lets the trainable control how garbage collection is done (a sketch of the worker-side step is below). An alternative approach would be to have the trainable delete its local copy as soon as the checkpoint is uploaded, and let the driver delete from the remote store with its own client (which isn't unreasonable, since the driver already has its own client, currently used for global checkpoints).
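
A minimal sketch of the worker-side deletion, assuming the trainable holds a sync_client and a hypothetical _storage_path helper that maps a local checkpoint directory to its remote URI:

import shutil

def delete_checkpoint(self, checkpoint_dir):
    # Remove the worker-local copy, if it still exists.
    shutil.rmtree(checkpoint_dir, ignore_errors=True)
    # Delete the persisted copy through the sync client.
    self.sync_client.delete(self._storage_path(checkpoint_dir))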

Example

Extend DurableTrainable instead of Trainable.

from ray import tune
from ray.tune import DurableTrainable

class MyTrainable(DurableTrainable):

    def _save(self, checkpoint_dir):
        # Write checkpoint data under checkpoint_dir; return checkpoint info.
        return {...}

    def _restore(self, checkpoint):
        # Rebuild training state from the checkpoint.
        ...

tune.run(MyTrainable, upload_dir="s3://my-bucket/experiments/")

Misc

Closes #6226.
Partially addresses #6345.

Follow-up PRs

  • Prevent MEMORY checkpoints from breaking FT.
  • Address a known rsync bug causing checkpoint failures.
  • Sync logs between workers.
  • Improve garbage collection; trials moving between nodes can cause a checkpoint to inadvertently escape GC.
  • Asynchronous checkpointing.
  • Avoid syncing down all trial checkpoints to the driver during restore (needs selective syncing).
  • Refactor cluster tests.
  • Finalize the durable training API, documentation, and log warnings.


@ujvl ujvl changed the title [WIP][tune] Asynchronous trial checkpoint/restore [WIP][tune] Process trial checkpoints/restores asynchronously Dec 6, 2019
@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/19252/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/19254/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/19587/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/19611/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/19629/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/19630/

@ujvl ujvl changed the title [WIP][tune] Process trial checkpoints/restores asynchronously [wip][tune] S3/GCP-capable checkpointing and async restores Dec 15, 2019
@ujvl ujvl changed the title [wip][tune] S3/GCP-capable checkpointing and async restores [tune] Async restores and S3/GCP-capable trial FT Dec 15, 2019
@ujvl ujvl marked this pull request as ready for review December 15, 2019 19:57
@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/19636/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/19633/

@ujvl ujvl mentioned this pull request Dec 16, 2019


class DurableTrainable(Trainable):
"""A fault-tolerant Trainable.
@ericl (Contributor) commented Dec 16, 2019

Why not make this a flag-enabled feature of Trainable?

As a user, I am going to be confused about multiple trainables.

Contributor

Hm, maybe it's more reasonable to just do the TrainableV2 abstraction then...

Contributor

Though, TrainableV2 will block this PR further. We can mark this as experimental and merge as a separate Trainable so our BAIR users can try it out.

Then expose it better once we implement the TrainableV2 abstraction #6417.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20276/

@ujvl (Contributor, Author) commented Jan 2, 2020

jenkins test tune

@ujvl (Contributor, Author) commented Jan 2, 2020

jenkins test tune

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Tune-Tests/337/
Tune tests passed.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20279/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Tune-Tests/338/
Tune tests passed.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20280/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20285/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20286/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20290/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20292/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20297/

@ujvl (Contributor, Author) commented Jan 3, 2020

jenkins test tune

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Tune-Tests/339/
Tune tests passed.

@richardliaw richardliaw merged commit ca651af into ray-project:master Jan 3, 2020
@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20312/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20315/

@ujvl ujvl deleted the tune-async-save-restore branch January 3, 2020 10:38
Successfully merging this pull request may close these issues:

[tune] Checkpoints are only partially transferred thus cannot be resumed