
[tune] Async restores and S3/GCP-capable trial FT #6376

Merged
merged 54 commits into ray-project:master from tune-async-save-restore on Jan 3, 2020

Conversation

@ujvl (Contributor) commented Dec 6, 2019

Overview

This PR introduces the following:

  • asynchronous restores (better for performance, and fixes a restoration bug in the autoscaling case)
  • cloud storage options for trial fault tolerance (see the DurableTrainable implementation)
  • removal of the default rsync dependence for trial fault tolerance; rsync is now relied on only for best-effort syncing of logs and checkpoints to the driver, and restores no longer use it
  • garbage collection of checkpoints through the sync client

Async restores

Why it's necessary

Autoscaler scale-up can take an arbitrarily long time and should not block the control loop. Currently we time out, but this is also problematic because it results in trial failures.

How it works

Currently trial restoration is implemented as follows:

# Block on the worker IP lookup, then sync the checkpoint to that worker.
worker_ip = ray.get(trial.runner.worker_ip.remote())
trial.sync_to_worker(worker_ip)
# Block again until the restore completes.
ray.get(trial.runner.restore.remote())

Instead, we get rid of the worker-IP lookup and the subsequent sync (since restores are now S3/GS-enabled only). Then we store the object ID, as is done with training results:

self.running[trial.runner.restore.remote()] = trial

The trial runner waits on these object IDs as usual, alongside the others. Therefore we only handle successful/unsuccessful restores once autoscaler scale-up is complete. Once the object is ready to fetch, we call _process_trial_restore, which fetches the restore result. We distinguish between training results and restore results by checking trial.is_restoring before calling either _process_trial or _process_trial_restore, as sketched below.
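
A minimal sketch of this dispatch, assuming self.running maps in-flight object IDs to trials; the method name _process_events and the loop body are illustrative, not the exact trial runner code:

def _process_events(self):
    # Wait for any in-flight object ID (training result or restore result).
    [result_id], _ = ray.wait(list(self.running))
    trial = self.running.pop(result_id)
    if trial.is_restoring:
        # The object ID came from trial.runner.restore.remote().
        self._process_trial_restore(trial)
    else:
        # The object ID came from a training step.
        self._process_trial(trial)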

Trial FT changes

Trial FT is now provided conditioned on the user providing an S3 or GS path (NFS support to be added) through tune.run(..., upload_dir=...). This simplifies the implementation because trainable.restore.remote(path) does everything, so we no longer need to perform a sync and make two blocking calls.

However, for backwards compatibility we continue to sync checkpoints to the driver synchronously by default (sync_on_checkpoint=True). Then, if no upload_dir is provided, we read the checkpoint into memory on the driver and call trainable.restore_from_object.remote(chkpt). This is not ideal, so we log a warning asking the user to consider passing in an upload_dir. A sketch of the two paths follows.
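
A minimal sketch of how the runner might pick between the two restore paths; trial.upload_dir, checkpoint.value, and checkpoint_to_object are illustrative names, not necessarily the exact attributes:

if trial.upload_dir:
    # Cloud FT path: the worker pulls the checkpoint from S3/GS itself.
    remote = trial.runner.restore.remote(checkpoint.value)
else:
    # Legacy path: read the driver-local checkpoint and ship it as an object.
    obj = checkpoint_to_object(checkpoint.value)
    remote = trial.runner.restore_from_object.remote(obj)
self.running[remote] = trial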

Garbage Collection

We can only garbage collect an old checkpoint once the driver is notified that the newest checkpoint has been persisted. GC is implemented in this PR as follows (rsync-based deletion removed):

  1. Driver receives the "checkpoint persisted" notification.
  2. Driver decides whether a checkpoint needs to be GC'd.
  3. Driver deletes the driver-local copy of the checkpoint, if any.
  4. Driver calls runner.delete_checkpoint.remote().
  5. Worker deletes the worker-local copy of the checkpoint, if any.
  6. Worker calls sync_client.delete to delete the persisted checkpoint.

This lets the trainable control how garbage collection is done (a sketch of the worker-side step is below). An alternative approach would be to have the trainable delete its local copy as soon as the checkpoint is uploaded, and let the driver delete from the remote store with its own client (which isn't unreasonable, since the driver already has its own client, currently used for global checkpoints).
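
A minimal sketch of the worker-side deletion, assuming the trainable holds a sync_client and a hypothetical _storage_path helper that maps a local checkpoint directory to its remote URI:

import shutil

def delete_checkpoint(self, checkpoint_dir):
    # Remove the worker-local copy, if it still exists.
    shutil.rmtree(checkpoint_dir, ignore_errors=True)
    # Delete the persisted copy through the sync client.
    self.sync_client.delete(self._storage_path(checkpoint_dir))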

Example

Extend DurableTrainable instead of Trainable.

from ray import tune
from ray.tune import DurableTrainable

class MyTrainable(DurableTrainable):

    def _save(self, checkpoint_dir):
        # Write checkpoint data under checkpoint_dir; return checkpoint info.
        return {...}

    def _restore(self, checkpoint):
        # Rebuild training state from the checkpoint.
        ...

tune.run(MyTrainable, upload_dir="s3://my-bucket/experiments/")

Misc

Closes #6226.
Partially addresses #6345.

Follow-up PRs

  • Prevent MEMORY checkpoints from breaking FT.
  • Address a known rsync bug causing checkpoint failures.
  • Sync logs between workers.
  • Improve garbage collection; trials moving between nodes can cause a checkpoint to inadvertently escape GC.
  • Asynchronous checkpointing.
  • Avoid syncing down all trial checkpoints to the driver during restore (needs selective syncing).
  • Refactor cluster tests.
  • Finalize the durable training API, documentation, and log warnings.


@ujvl ujvl changed the title [WIP][tune] Asynchronous trial checkpoint/restore [WIP][tune] Process trial checkpoints/restores asynchronously Dec 6, 2019
@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/19252/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/19254/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/19587/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/19611/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/19629/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/19630/

@ujvl ujvl changed the title [WIP][tune] Process trial checkpoints/restores asynchronously [wip][tune] S3/GCP-capable checkpointing and async restores Dec 15, 2019
@ujvl ujvl changed the title [wip][tune] S3/GCP-capable checkpointing and async restores [tune] Async restores and S3/GCP-capable trial FT Dec 15, 2019
@ujvl ujvl marked this pull request as ready for review December 15, 2019 19:57
@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/19636/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/19633/

@ujvl ujvl mentioned this pull request Dec 16, 2019


class DurableTrainable(Trainable):
"""A fault-tolerant Trainable.
@ericl (Contributor) commented Dec 16, 2019

Why not make this a flag-enabled feature of Trainable?

As a user, I am going to be confused about multiple trainables.

Contributor

Hm, maybe it's more reasonable to just do the TrainableV2 abstraction then...

Contributor

Though, TrainableV2 will block this PR further. We can mark this as experimental and merge as a separate Trainable so our BAIR users can try it out.

Then expose it better once we implement the TrainableV2 abstraction #6417.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20276/

@ujvl (Contributor, Author) commented Jan 2, 2020

jenkins test tune

@ujvl (Contributor, Author) commented Jan 2, 2020

jenkins test tune

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Tune-Tests/337/
Tune tests passed.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20279/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Tune-Tests/338/
Tune tests passed.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20280/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20285/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20286/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20290/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20292/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20297/

@ujvl (Contributor, Author) commented Jan 3, 2020

jenkins test tune

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Tune-Tests/339/
Tune tests passed.

@richardliaw richardliaw merged commit ca651af into ray-project:master Jan 3, 2020
@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20312/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20315/

@ujvl ujvl deleted the tune-async-save-restore branch January 3, 2020 10:38
Successfully merging this pull request may close these issues:

[tune] Checkpoints are only partially transferred thus cannot be resumed