
[Tune] Enable tune.ExperimentAnalysis to pull experiment checkpoint files from the cloud if needed #34461

Merged
merged 45 commits into ray-project:master from justinvyu:tune/fix_s3_get_results
Apr 21, 2023

Conversation

justinvyu
Contributor

@justinvyu justinvyu commented Apr 17, 2023

Why are these changes needed?

For post-experiment analysis of a Tune run that uploaded results and checkpoints to S3, the node where the analysis runs may not contain the experiment directory. In this case, the experiment checkpoint and other lightweight files (the JSON and CSV result files and the param space) should be pulled into a temp directory on the local filesystem.

While this adds functionality to `ExperimentAnalysis`, it also enables:

  1. `ResultGrid(ExperimentAnalysis("s3://..."))`, which is what `tuner.fit()` does
  2. `Tuner.restore("s3://...").get_results()`

Point 2 was the failure that surfaced this issue in the first place.

This PR also cleans up some confusing trial metadata loading code in ExperimentAnalysis.

New behavior

  1. The local directory path case remains the same.
  2. A remote URI can now be used to initialize the `ExperimentAnalysis`.
    • `Tuner.restore("s3://")` will use that remote path to create the experiment analysis.
    • Any files needed to initialize the `ExperimentAnalysis` will be downloaded to a temporary directory. These files are mostly lightweight (e.g., `experiment_state.json`, `params.pkl`, `result.json`, `progress.csv`).
    • No checkpoints are downloaded (they remain URI-backed). Users should use the Checkpoint API directly to work with checkpoints.
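As a rough sketch of the staging behavior described above (the helper name and file list are illustrative, not Ray's actual implementation; `fetch` stands in for the real cloud download):

```python
import os
import shutil
import tempfile

# Only these lightweight metadata files are staged locally -- never checkpoints.
LIGHTWEIGHT_FILES = ("experiment_state.json", "params.pkl", "result.json", "progress.csv")

def stage_experiment_files(remote_dir: str, fetch=shutil.copy) -> str:
    """Copy lightweight experiment files into a fresh temp directory."""
    local_dir = tempfile.mkdtemp(prefix="experiment_analysis_")
    for name in LIGHTWEIGHT_FILES:
        src = os.path.join(remote_dir, name)
        if os.path.exists(src):  # real code would list the remote prefix instead
            fetch(src, os.path.join(local_dir, name))
    return local_dir
```

By default `fetch` is a plain local copy, so the sketch can be exercised without cloud credentials.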

TODOs

  • Unit tests
    • This PR is getting too big, so I added a basic unit test for now and will follow up with more comprehensive coverage (testing the same things for both the local and remote `ExperimentAnalysis` constructors).
  • Some cleanup to do in ExperimentAnalysis constructor

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
jjyao and others added 6 commits April 17, 2023 17:32
There is a bug in the backpressure implementation with regard to actor pools, in that once a task is queued for an actor pool, it is no longer subject to backpressure. This is problematic when the output size of a task is much bigger than the input size. In this situation, the actor pool will keep executing tasks (converting small objects into larger objects), even when this would grossly exceed memory limits.

Put another way: it fixes the issue where the streaming executor queues tasks on an actor pool operator, but later on wants to "take it back" due to unexpectedly high memory usage. This avoids the issue by not queueing tasks that won't be immediately executed (so they won't need to be taken back).

Example:
1. Suppose there is an actor pool of size 10, each actor able to run 1 active task at a time.
2. Each input task is size 1GB. The memory limit is 100GB, so we add 100 of these inputs to an actor pool operator.
3. When the tasks run, each expands into roughly 10GB of output. With the first wave of tasks running, overall memory usage reaches about 200GB (2x over our limit!).
4. However, since we already added those 100 inputs to the actor pool, there is no way for the streaming scheduler to pause execution of the 90 remaining queued inputs.
5. Now the 90 queued inputs execute and we end up using 1TB, or 10x our intended memory limit.

We need to check for the memory limit right before executing a task in the actor pool; one way of doing this is to eliminate the internal queue in the actor pool operator and instead always queue work outside the operator.
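The fix can be sketched with a toy scheduler (names and the 10x expansion factor are illustrative, not Ray's actual code): work is queued outside the operator, and dispatch is gated on the memory budget, draining finished outputs before admitting more work.

```python
from collections import deque

def run_with_backpressure(input_sizes, memory_limit, expansion_factor=10):
    """Toy model: queue work *outside* the actor pool and check the memory
    budget right before dispatching each task."""
    queue = deque(input_sizes)
    outputs = []
    memory_in_use = sum(input_sizes)  # all inputs are materialized up front
    peak = memory_in_use
    while queue:
        next_output = queue[0] * expansion_factor
        # Backpressure check: if dispatching would blow the budget, let the
        # downstream consumer drain a finished output first instead.
        if memory_in_use + next_output > memory_limit and outputs:
            memory_in_use -= outputs.pop()
            continue
        task = queue.popleft()
        memory_in_use += task * expansion_factor - task  # input becomes output
        outputs.append(task * expansion_factor)
        peak = max(peak, memory_in_use)
    return peak
```

With 100 inputs of 1GB, a 100GB limit, and 10x expansion, peak usage stays near the limit instead of the ~1TB an unbounded internal queue would reach.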

TODO:
- [x] Performance testing
- [x] Unit tests
- [x] Perf test final version
The hypothesis is that the on_subscribe callback is invoked after the test finishes; the reference to the stack-allocated atomic counter is no longer valid, causing an ASan failure.
Release logs perf benchmark for 2.4.0
Also updated the tool to sort the regressions

Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
Co-authored-by: Clarence Ng <clarence@anyscale.com>
…e strongly typed (ray-project#34297)

We are now returning strongly typed dataclasses (with type checking enabled by pydantic) from list and get APIs.
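As a rough illustration of the pattern (using stdlib `dataclasses` with a manual check standing in for pydantic validation, and invented field names):

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class TaskState:
    task_id: str
    state: str

    def __post_init__(self):
        # Lightweight runtime type check, mimicking what pydantic enforces.
        for f in fields(self):
            if not isinstance(getattr(self, f.name), f.type):
                raise TypeError(f"{f.name} must be {f.type.__name__}")

def list_tasks(raw_rows):
    # The list/get APIs return typed objects instead of raw dicts.
    return [TaskState(**row) for row in raw_rows]
```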

Add sanity check to prevent infinite recursion

@justinvyu justinvyu marked this pull request as ready for review April 18, 2023 00:37
Contributor

@krfricke krfricke left a comment

Looks good to me!


# Create a temp directory to store downloaded checkpoint files if
# they are pulled from a remote `experiment_checkpoint_path`.
self._local_experiment_path = tempfile.TemporaryDirectory(
Contributor

Should we make this directory canonical? e.g. via uuid5

Contributor Author

I believe the TemporaryDirectory will already append a unique string after the prefix!
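Indeed -- a quick check confirms that `tempfile` appends a unique random suffix after the prefix, so two directories with the same prefix never collide:

```python
import os
import tempfile

# Two directories created with the same prefix get distinct random suffixes.
with tempfile.TemporaryDirectory(prefix="experiment_analysis_") as a, \
     tempfile.TemporaryDirectory(prefix="experiment_analysis_") as b:
    assert a != b
    assert os.path.basename(a).startswith("experiment_analysis_")
```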

Contributor

Oh, what I mean is: do we want to always use the same directory, so that we don't download multiple times if we restore multiple times?

The main reason is that this will download the full experiment dir (including checkpoints), which may be expensive.

Contributor

I guess this is more of an optimization, so also happy to go as is

Contributor Author

Oh, got it. This will only download a few files from the experiment (experiment state, results csv/json, params.pkl) -- no checkpoints or artifacts. So, I thought it was lightweight enough.

A few ideas I had for a follow-up:

  • Clean up the temp dir in the experiment analysis destructor.
  • Move result grid creation to `get_results` so that it doesn't always happen on restore.
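For the first follow-up idea, one possible approach (a sketch, not the eventual implementation; the class name is invented) is `weakref.finalize`, which runs cleanup on garbage collection or interpreter exit and avoids the usual `__del__` pitfalls:

```python
import shutil
import tempfile
import weakref

class AnalysisScratchDir:
    """Owns a temp directory that is removed when the object goes away."""

    def __init__(self):
        self.path = tempfile.mkdtemp(prefix="experiment_analysis_")
        # ignore_errors=True so cleanup never raises during finalization.
        self._finalizer = weakref.finalize(self, shutil.rmtree, self.path, True)
```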

Contributor

Yeah sounds good - we can do these in a follow-up!

…/fix_s3_get_results

@justinvyu justinvyu added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Apr 19, 2023
Contributor

@krfricke krfricke left a comment

Awesome, thanks!

@krfricke krfricke merged commit 333c300 into ray-project:master Apr 21, 2023
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
… files from the cloud if needed (ray-project#34461)

ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
… files from the cloud if needed (ray-project#34461)

architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
… files from the cloud if needed (ray-project#34461)

@justinvyu justinvyu deleted the tune/fix_s3_get_results branch June 1, 2023 07:32
krfricke pushed a commit that referenced this pull request Aug 22, 2023
…` implementation (#38648)

This PR simplifies the implementation of `ExperimentAnalysis` so that it works in the new codepath. While updating all tests/examples, this will be called `NewExperimentAnalysis`.

The main reasons why this couldn't be minimally updated:
1. Previously, the internal (private) data structures of `ExperimentAnalysis` indexed trial configs, results, and checkpoints by their local paths. Tying checkpoints to local paths no longer makes sense, so it was hard to modify the code in just a few places.
2. Some APIs didn't make much sense in the new codepath. For example, `get_best_logdir` also returns a local path.
3. #34461 added support for initializing the `ExperimentAnalysis` from a URI, but this was tightly tied to the old codepath (using the old `remote_storage` utils). The new implementation handles the remote URI case the same as the local case (just reading files from the `storage_filesystem`).

APIs removed:
* `get_best_logdir`  <- Workaround = `analysis.get_best_trial(...).local_path`
* `best_logdir` <- Workaround = `analysis.best_trial.local_path`
* `get_trial_checkpoints_paths` <- This was pretty much an internal method, but was public for some reason.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…` implementation (ray-project#38648)

vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…` implementation (ray-project#38648)

8 participants