Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[air] pyarrow.fs persistence (14/n): Simplified ExperimentAnalysis implementation #38648

Merged
merged 47 commits into from
Aug 22, 2023

Conversation

justinvyu
Copy link
Contributor

@justinvyu justinvyu commented Aug 21, 2023

Why are these changes needed?

This PR simplifies the implementation of ExperimentAnalysis so that it works in the new codepath. While updating all tests/examples, this will be called NewExperimentAnalysis.

The main reasons why this couldn't be minimally updated:

  1. Previously, the internal (private) data-structures of ExperimentAnalysis indexed trial configs, results, and checkpoints with the local path. Tying checkpoints to their local path doesn't really make sense to do anymore, so it was hard to just modify the code in a few places.
  2. There were APIs that didn't make so much sense in the new codepath. For example, get_best_logdir also returns a local path.
  3. [Tune] Enable tune.ExperimentAnalysis to pull experiment checkpoint files from the cloud if needed #34461 added support for initializing the ExperimentAnalysis from a URI, but this was very tied to the old codepath (using the old remote_storage utils). The new implementation handles the remote URI case the same as the local case (just reading files from the storage_filesystem).

APIs removed:

  • get_best_logdir <- Workaround = analysis.get_best_trial(...).local_path
  • best_logdir <- Workaround = analysis.best_trial.local_path
  • get_trial_checkpoints_paths <- This was pretty much an internal method, but was public for some reason.

Related issue number

Closes #38567

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…ne.run

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@@ -52,6 +64,645 @@
DEFAULT_FILE_TYPE = "csv"


@PublicAPI(stability="beta")
class NewExperimentAnalysis:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of this class is the same, except the init logic is simplified, we now read the json/csv files from the filesystem directly (instead of downloading), and modified the internal data structures a bit.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…persistence/exp_analysis

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…persistence/exp_analysis

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Comment on lines +91 to +92
location (path on the head node)."""
return self._experiment_analysis.experiment_path
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little different now. It's always returning a filesystem-qualified path. NOT a URI. This is in line with #38568

cc: @ericl

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup thx! Will use this in the followup PR.

Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very good. I only have minor questions. I'm happy to proceed with this assuming tests pass.

What is the plan to switch from NewExperimentAnalysis to ExperimentAnalysis? This may be a breaking change for libraries that interface with it. Are we going to support the old way longer?

Should we call them ExperimentAnalysis and LegacyExperimentAnalysis instead?

python/ray/tune/examples/pb2_ppo_example.py Show resolved Hide resolved
assert info["info"] == 4


def test_best_result(ray_start_2_cpus):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR removes quite a few tests - is the functionality covered somewhere else? Do we have to migrate this later?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of these have been consolidated in the new tests. I've also removed some that were just testing behavior that was passing through from ExperimentAnalysis.

python/ray/tune/analysis/experiment_analysis.py Outdated Show resolved Hide resolved
@justinvyu
Copy link
Contributor Author

Plan is to completely remove the old one once all CI has been converted.

For API compatibility, let's just add dummy methods for all the old "public" methods raising an error with the new workarounds. We can even re-implement the best_logdir methods if you think that's too much of a breaking change.

I'm just keeping the NewCheckpoint and NewExperimentAnalysis naming since that's what I've been doing already and will be easier for me to track once I want to rename them all -- the Legacy prefix probably would have been better..

Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm approving now pending the changes discussed above as this PR is 95% where it should be.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 21, 2023
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…persistence/exp_analysis

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Aug 22, 2023
@justinvyu
Copy link
Contributor Author

@krfricke Ok to merge.

@krfricke krfricke merged commit 2f17219 into ray-project:master Aug 22, 2023
90 of 95 checks passed
@justinvyu justinvyu deleted the air/persistence/exp_analysis branch August 22, 2023 17:47
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…` implementation (ray-project#38648)

This PR simplifies the implementation of `ExperimentAnalysis` so that it works in the new codepath. While updating all tests/examples, this will be called `NewExperimentAnalysis`.

The main reasons why this couldn't be minimally updated:
1. Previously, the internal (private) data-structures of `ExperimentAnalysis` indexed trial configs, results, and checkpoints with the local path. Tying checkpoints to their local path doesn't really make sense to do anymore, so it was hard to just modify the code in a few places.
2. There were APIs that didn't make so much sense in the new codepath. For example, `get_best_logdir` also returns a local path.
3. ray-project#34461 added support for initializing the `ExperimentAnalysis` from a URI, but this was very tied to the old codepath (using the old `remote_storage` utils). The new implementation handles the remote URI case the same as the local case (just reading files from the `storage_filesystem`).

APIs removed:
* `get_best_logdir`  <- Workaround = `analysis.get_best_trial(...).local_path`
* `best_logdir` <- Workaround = `analysis.best_trial.local_path`
* `get_trial_checkpoints_paths` <- This was pretty much an internal method, but was public for some reason.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…` implementation (ray-project#38648)

This PR simplifies the implementation of `ExperimentAnalysis` so that it works in the new codepath. While updating all tests/examples, this will be called `NewExperimentAnalysis`.

The main reasons why this couldn't be minimally updated:
1. Previously, the internal (private) data-structures of `ExperimentAnalysis` indexed trial configs, results, and checkpoints with the local path. Tying checkpoints to their local path doesn't really make sense to do anymore, so it was hard to just modify the code in a few places.
2. There were APIs that didn't make so much sense in the new codepath. For example, `get_best_logdir` also returns a local path.
3. ray-project#34461 added support for initializing the `ExperimentAnalysis` from a URI, but this was very tied to the old codepath (using the old `remote_storage` utils). The new implementation handles the remote URI case the same as the local case (just reading files from the `storage_filesystem`).

APIs removed:
* `get_best_logdir`  <- Workaround = `analysis.get_best_trial(...).local_path`
* `best_logdir` <- Workaround = `analysis.best_trial.local_path`
* `get_trial_checkpoints_paths` <- This was pretty much an internal method, but was public for some reason.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[train][GA] Update ExperimentAnalysis for new persistence mode
3 participants