
[Tune] Enable tune.ExperimentAnalysis to pull experiment checkpoint files from the cloud if needed #34461

Merged
merged 45 commits into ray-project:master from justinvyu:tune/fix_s3_get_results
Apr 21, 2023

Conversation

justinvyu
Contributor

@justinvyu justinvyu commented Apr 17, 2023

Why are these changes needed?

For post-experiment analysis of a Tune run that uploaded results and checkpoints to S3, the node where the analysis runs may not contain the experiment directory. In this case, the experiment checkpoint and other lightweight files (the JSON and CSV result files and the param space) should be pulled into a temp directory on the local filesystem.

While this adds functionality to `ExperimentAnalysis`, it also enables:

  1. `ResultGrid(ExperimentAnalysis("s3://..."))`, which is what `tuner.fit()` does
  2. `Tuner.restore("s3://...").get_results()`

Point 2 was the failure that surfaced this issue in the first place.

This PR also cleans up some confusing trial metadata loading code in ExperimentAnalysis.

New behavior

  1. The local directory path case remains the same.
  2. A remote URI can now be used to initialize the `ExperimentAnalysis`.
    • `Tuner.restore("s3://")` will use that remote path to create the experiment analysis.
    • Any files needed to initialize the `ExperimentAnalysis` will be downloaded to a temporary directory. These files are mostly lightweight (e.g., `experiment_state.json`, `params.pkl`, `result.json`, `progress.csv`).
    • No checkpoints are downloaded (they remain URI-backed). Users should use the Checkpoint API directly to work with checkpoints.
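As a rough sketch of the staging behavior described above (the helper name and file list are illustrative, not Ray's actual implementation; `fetch` stands in for the real cloud download):

```python
import os
import shutil
import tempfile

# Only these lightweight metadata files are staged locally -- never checkpoints.
LIGHTWEIGHT_FILES = ("experiment_state.json", "params.pkl", "result.json", "progress.csv")

def stage_experiment_files(remote_dir: str, fetch=shutil.copy) -> str:
    """Copy lightweight experiment files into a fresh temp directory."""
    local_dir = tempfile.mkdtemp(prefix="experiment_analysis_")
    for name in LIGHTWEIGHT_FILES:
        src = os.path.join(remote_dir, name)
        if os.path.exists(src):  # real code would list the remote prefix instead
            fetch(src, os.path.join(local_dir, name))
    return local_dir
```

By default `fetch` is a plain local copy, so the sketch can be exercised without cloud credentials.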

TODOs

  • Unit tests
    • This PR is getting too big, so I added a basic unit test for now and will follow up with more comprehensive coverage (testing the same things for both the local and remote `ExperimentAnalysis` constructors).
  • Some cleanup to do in ExperimentAnalysis constructor

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
jjyao and others added 6 commits April 17, 2023 17:32
There is a bug in the backpressure implementation with regard to actor pools, in that once a task is queued for an actor pool, it is no longer subject to backpressure. This is problematic when the output size of a task is much bigger than the input size. In this situation, the actor pool will keep executing tasks (converting small objects into larger objects), even when this would grossly exceed memory limits.

Put another way: it fixes the issue where the streaming executor queues tasks on an actor pool operator, but later on wants to "take it back" due to unexpectedly high memory usage. This avoids the issue by not queueing tasks that won't be immediately executed (so they won't need to be taken back).

Example:
1. Suppose there is an actor pool of size 10, each actor able to run 1 active task at a time.
2. Each input task is size 1GB. The memory limit is 100GB, so we add 100 of these inputs to an actor pool operator.
3. When the tasks run, each expands into roughly 10GB of output. With the first wave of tasks running, overall memory usage reaches about 200GB (2x over our limit!).
4. However, since we already added those 100 inputs to the actor pool, there is no way for the streaming scheduler to pause execution of the 90 remaining queued inputs.
5. Now the 90 queued inputs execute and we end up using 1TB, or 10x our intended memory limit.

We need to check for the memory limit right before executing a task in the actor pool; one way of doing this is to eliminate the internal queue in the actor pool operator and instead always queue work outside the operator.
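The fix can be sketched with a toy scheduler (names and the 10x expansion factor are illustrative, not Ray's actual code): work is queued outside the operator, and dispatch is gated on the memory budget, draining finished outputs before admitting more work.

```python
from collections import deque

def run_with_backpressure(input_sizes, memory_limit, expansion_factor=10):
    """Toy model: queue work *outside* the actor pool and check the memory
    budget right before dispatching each task."""
    queue = deque(input_sizes)
    outputs = []
    memory_in_use = sum(input_sizes)  # all inputs are materialized up front
    peak = memory_in_use
    while queue:
        next_output = queue[0] * expansion_factor
        # Backpressure check: if dispatching would blow the budget, let the
        # downstream consumer drain a finished output first instead.
        if memory_in_use + next_output > memory_limit and outputs:
            memory_in_use -= outputs.pop()
            continue
        task = queue.popleft()
        memory_in_use += task * expansion_factor - task  # input becomes output
        outputs.append(task * expansion_factor)
        peak = max(peak, memory_in_use)
    return peak
```

With 100 inputs of 1GB, a 100GB limit, and 10x expansion, peak usage stays near the limit instead of the ~1TB an unbounded internal queue would reach.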

TODO:
- [x] Performance testing
- [x] Unit tests
- [x] Perf test final version
The hypothesis is that the on_subscribe callback is invoked after the test finishes; the reference to the stack-allocated atomic counter is no longer valid, causing an ASan failure.
Release logs perf benchmark for 2.4.0
Also updated the tool to sort the regressions

Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
Co-authored-by: Clarence Ng <clarence@anyscale.com>
…e strongly typed (ray-project#34297)

We are now returning strongly typed dataclasses (with type checking enabled by pydantic) from list and get APIs.
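As a rough illustration of the pattern (using stdlib `dataclasses` with a manual check standing in for pydantic validation, and invented field names):

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class TaskState:
    task_id: str
    state: str

    def __post_init__(self):
        # Lightweight runtime type check, mimicking what pydantic enforces.
        for f in fields(self):
            if not isinstance(getattr(self, f.name), f.type):
                raise TypeError(f"{f.name} must be {f.type.__name__}")

def list_tasks(raw_rows):
    # The list/get APIs return typed objects instead of raw dicts.
    return [TaskState(**row) for row in raw_rows]
```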

Add sanity check to prevent infinite recursion

@justinvyu justinvyu marked this pull request as ready for review April 18, 2023 00:37
Contributor

@krfricke krfricke left a comment

Looks good to me!


# Create a temp directory to store downloaded checkpoint files if
# they are pulled from a remote `experiment_checkpoint_path`.
self._local_experiment_path = tempfile.TemporaryDirectory(
Contributor

Should we make this directory canonical? e.g. via uuid5

Contributor Author

I believe the TemporaryDirectory will already append a unique string after the prefix!
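Indeed -- a quick check confirms that `tempfile` appends a unique random suffix after the prefix, so two directories with the same prefix never collide:

```python
import os
import tempfile

# Two directories created with the same prefix get distinct random suffixes.
with tempfile.TemporaryDirectory(prefix="experiment_analysis_") as a, \
     tempfile.TemporaryDirectory(prefix="experiment_analysis_") as b:
    assert a != b
    assert os.path.basename(a).startswith("experiment_analysis_")
```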

Contributor

Oh, what I mean is: do we want to always use the same directory, so that we don't download multiple times if we restore multiple times?

The main reason is that this will download the full experiment dir (including checkpoints), which may be expensive.

Contributor

I guess this is more of an optimization, so also happy to go as is

Contributor Author

Oh, got it. This will only download a few files from the experiment (experiment state, results csv/json, params.pkl) -- no checkpoints or artifacts. So, I thought it was lightweight enough.

A few ideas I had for a follow-up:

  • Clean up the temp dir in the experiment analysis destructor.
  • Move result grid creation to `get_results` so that it doesn't always happen on restore.
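For the first follow-up idea, one possible approach (a sketch, not the eventual implementation; the class name is invented) is `weakref.finalize`, which runs cleanup on garbage collection or interpreter exit and avoids the usual `__del__` pitfalls:

```python
import shutil
import tempfile
import weakref

class AnalysisScratchDir:
    """Owns a temp directory that is removed when the object goes away."""

    def __init__(self):
        self.path = tempfile.mkdtemp(prefix="experiment_analysis_")
        # ignore_errors=True so cleanup never raises during finalization.
        self._finalizer = weakref.finalize(self, shutil.rmtree, self.path, True)
```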

Contributor

Yeah sounds good - we can do these in a follow-up!

…/fix_s3_get_results

@justinvyu justinvyu added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Apr 19, 2023
Contributor

@krfricke krfricke left a comment

Awesome, thanks!

@krfricke krfricke merged commit 333c300 into ray-project:master Apr 21, 2023
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
… files from the cloud if needed (ray-project#34461)

ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
… files from the cloud if needed (ray-project#34461)

architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
… files from the cloud if needed (ray-project#34461)

@justinvyu justinvyu deleted the tune/fix_s3_get_results branch June 1, 2023 07:32
krfricke pushed a commit that referenced this pull request Aug 22, 2023
…` implementation (#38648)

This PR simplifies the implementation of `ExperimentAnalysis` so that it works in the new codepath. While updating all tests/examples, this will be called `NewExperimentAnalysis`.

The main reasons why this couldn't be minimally updated:
1. Previously, the internal (private) data structures of `ExperimentAnalysis` indexed trial configs, results, and checkpoints by their local paths. Tying checkpoints to local paths no longer makes sense, so it was hard to modify the code in just a few places.
2. Some APIs didn't make much sense in the new codepath. For example, `get_best_logdir` also returns a local path.
3. #34461 added support for initializing the `ExperimentAnalysis` from a URI, but this was tightly tied to the old codepath (using the old `remote_storage` utils). The new implementation handles the remote URI case the same as the local case (just reading files from the `storage_filesystem`).

APIs removed:
* `get_best_logdir`  <- Workaround = `analysis.get_best_trial(...).local_path`
* `best_logdir` <- Workaround = `analysis.best_trial.local_path`
* `get_trial_checkpoints_paths` <- This was pretty much an internal method, but was public for some reason.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…` implementation (ray-project#38648)

vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…` implementation (ray-project#38648)

8 participants