[Tune] Enable tune.ExperimentAnalysis to pull experiment checkpoint files from the cloud if needed #34461
Conversation
Removed some dead code for 3.6
There is a bug in the backpressure implementation with regard to actor pools: once a task is queued for an actor pool, it is no longer subject to backpressure. This is problematic when the output size of a task is much bigger than the input size. In this situation, the actor pool will keep executing tasks (converting small objects into larger objects), even when this would grossly exceed memory limits. Put another way: this fixes the issue where the streaming executor queues tasks on an actor pool operator, but later wants to "take them back" due to unexpectedly high memory usage. The fix avoids the issue by not queueing tasks that won't be immediately executed (so they never need to be taken back).

Example:
1. Suppose there is an actor pool of size 10, each actor taking 1 active task at a time.
2. Each input task is size 1GB. The memory limit is 100GB, so we add 100 of these inputs to an actor pool operator.
3. When the tasks run, they each expand into 10GB of output. The 10 running tasks alone produce 100GB, so overall memory usage reaches 200GB (2x over our limit!).
4. However, since we already added those 100 inputs to the actor pool, there is no way for the streaming scheduler to pause execution of the 90 remaining queued inputs.
5. Now the 90 queued inputs execute and we end up using 1TB, or 10x our intended memory limit.

We need to check the memory limit right before executing a task in the actor pool; one way of doing this is to eliminate the internal queue in the actor pool operator and instead always queue work outside the operator.

TODO:
- [x] Performance testing
- [x] Unit tests
- [x] Perf test final version
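The admission check described above can be sketched as a toy function (a minimal illustration with hypothetical names, not Ray's actual scheduler code): tasks are admitted only while their *projected* output memory stays under the limit, instead of being queued up front.

```python
def admissible_tasks(num_pending: int, input_gb: float,
                     limit_gb: float, expansion: float) -> int:
    """How many queued tasks may start before projected memory exceeds the limit.

    Each task of size `input_gb` is estimated to produce
    `input_gb * expansion` GB of output.
    """
    projected, started = 0.0, 0
    while started < num_pending and projected + input_gb * expansion <= limit_gb:
        projected += input_gb * expansion  # reserve memory before admitting
        started += 1
    return started  # remaining tasks stay queued outside the operator
```

With the numbers from the example (100 pending 1GB inputs, a 100GB limit, and a 10x expansion), only 10 tasks are admitted at once; without the check, all 100 would run and produce 1000GB of output, 10x the limit.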
Hypothesis is that the on_subscribe callback is invoked after the test finishes; the reference to the stack-allocated atomic counter is no longer valid, causing the ASan failure.
Release logs perf benchmark for 2.4.0. Also updated the tool to sort the regressions. Signed-off-by: Clarence Ng <clarence.wyng@gmail.com> Co-authored-by: Clarence Ng <clarence@anyscale.com>
…e strongly typed (ray-project#34297) We are now returning strongly typed dataclasses (with type checking enabled by pydantic) from list and get APIs.
Looks good to me!
# Create a temp directory to store downloaded checkpoint files if
# they are pulled from a remote `experiment_checkpoint_path`.
self._local_experiment_path = tempfile.TemporaryDirectory(
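The pattern in the quoted diff can be sketched as a standalone helper (an illustration only; `resolve_experiment_path` and the scheme list are hypothetical, not Ray's actual code): a remote URI is mapped to a fresh local temp directory, while a local path is used as-is.

```python
import tempfile
from urllib.parse import urlparse

def resolve_experiment_path(experiment_checkpoint_path: str) -> str:
    """If the path is a remote URI, stage experiment files in a local temp dir."""
    scheme = urlparse(experiment_checkpoint_path).scheme
    if scheme in ("s3", "gs", "hdfs"):
        # Downloaded experiment files (experiment state, result json/csv,
        # params.pkl) would be synced into this directory.
        return tempfile.mkdtemp(prefix="experiment_analysis_")
    return experiment_checkpoint_path
```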
Should we make this directory canonical? e.g. via uuid5
I believe the TemporaryDirectory will already append a unique string after the prefix!
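This behavior is easy to verify with the standard library: `TemporaryDirectory` (like `mkdtemp`) appends a random suffix after the given prefix, so two directories created with the same prefix never collide.

```python
import os
import tempfile

# Two directories created with the same prefix get distinct random suffixes.
with tempfile.TemporaryDirectory(prefix="experiment_analysis_") as a, \
     tempfile.TemporaryDirectory(prefix="experiment_analysis_") as b:
    assert a != b
    assert os.path.basename(a).startswith("experiment_analysis_")
    # The randomized suffix is what makes each directory unique.
    assert len(os.path.basename(a)) > len("experiment_analysis_")
```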
Oh what I mean is do we want to always have the same directory, so we don't download multiple times if we restore multiple times.
The main reason for this is that this will download the full experiment dir (including checkpoints) which may be expensive
I guess this is more of an optimization, so also happy to go as is
Oh, got it. This will only download a few files from the experiment (experiment state, results csv/json, params.pkl) -- no checkpoints or artifacts. So, I thought it was lightweight enough.
A few ideas I had for a follow-up:
- Clean up the temp dir in the `ExperimentAnalysis` destructor.
- Move result grid creation to `get_results` so that it doesn't always happen on restore.
Yeah sounds good - we can do these in a follow-up!
Awesome, thanks!
…` implementation (#38648)

This PR simplifies the implementation of `ExperimentAnalysis` so that it works in the new codepath. While updating all tests/examples, this will be called `NewExperimentAnalysis`.

The main reasons why this couldn't be minimally updated:
1. Previously, the internal (private) data structures of `ExperimentAnalysis` indexed trial configs, results, and checkpoints by their local paths. Tying checkpoints to their local path doesn't really make sense to do anymore, so it was hard to just modify the code in a few places.
2. There were APIs that didn't make much sense in the new codepath. For example, `get_best_logdir` also returns a local path.
3. #34461 added support for initializing the `ExperimentAnalysis` from a URI, but this was very tied to the old codepath (using the old `remote_storage` utils). The new implementation handles the remote URI case the same as the local case (just reading files from the `storage_filesystem`).

APIs removed:
* `get_best_logdir` <- Workaround: `analysis.get_best_trial(...).local_path`
* `best_logdir` <- Workaround: `analysis.best_trial.local_path`
* `get_trial_checkpoints_paths` <- This was pretty much an internal method, but was public for some reason.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Why are these changes needed?

For post-experiment analysis of a Tune run that uploaded results and checkpoints to S3, the node where analysis is being done may not contain the experiment directory. In this case, the experiment checkpoint and other files (the json and csv result files and the param space) should be pulled to a temp directory in the local filesystem.

While this adds functionality to `ExperimentAnalysis`, it also provides the functionality to:
1. `ResultGrid(ExperimentAnalysis("s3://..."))`, which is what we do in `tuner.fit()`
2. `Tuner.restore("s3://...").get_results()`

Point 2 was the error that flagged this issue in the first place. This PR also cleans up some confusing trial metadata loading code in `ExperimentAnalysis`.

New behavior

- `ExperimentAnalysis` now accepts a remote `experiment_checkpoint_path`. The files needed by `ExperimentAnalysis` will be downloaded to a temporary directory. These files are mostly lightweight (e.g., `experiment_state.json`, `params.pkl`, `result.json`, `progress.csv`).
- Use the `Checkpoint` API directly to work with checkpoints.

TODOs

- … (`ExperimentAnalysis` constructor)

Related issue number

Checks

- I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- … method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.