[air] `pyarrow.fs` persistence (14/n): Simplified `ExperimentAnalysis` implementation #38648

justinvyu · 2023-08-21T01:19:23Z

Why are these changes needed?

This PR simplifies the implementation of `ExperimentAnalysis` so that it works in the new codepath. While all tests/examples are being updated, the new class is called `NewExperimentAnalysis`.

The main reasons why this couldn't be minimally updated:

1. Previously, the internal (private) data structures of `ExperimentAnalysis` indexed trial configs, results, and checkpoints by local path. Tying checkpoints to their local path no longer makes sense, so it was hard to modify the code in just a few places.
2. Some APIs didn't make much sense in the new codepath. For example, `get_best_logdir` also returns a local path.
3. #34461 ([Tune] Enable tune.ExperimentAnalysis to pull experiment checkpoint files from the cloud if needed) added support for initializing the `ExperimentAnalysis` from a URI, but this was closely tied to the old codepath (using the old `remote_storage` utils). The new implementation handles the remote URI case the same way as the local case (just reading files from the `storage_filesystem`).
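
To illustrate the last point, here is a minimal sketch of what "handling the remote case the same as the local case" can look like. It is not the code from this PR: `Opener`, `read_result_rows`, `local_opener`, and `make_memory_opener` are hypothetical names, and a plain callable stands in for the `pyarrow.fs` filesystem object so that the example runs with the standard library only. The file name `result.json` and the bucket-style path are illustrative.

```python
import io
import json
import os
import tempfile
from typing import BinaryIO, Callable, Dict, List

# Hypothetical stand-in for a filesystem handle: anything that opens a
# filesystem path for binary reading.
Opener = Callable[[str], BinaryIO]


def read_result_rows(open_file: Opener, fs_path: str) -> List[dict]:
    """Parse a newline-delimited JSON file through whichever opener is given."""
    with open_file(fs_path) as f:
        text = f.read().decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def local_opener(fs_path: str) -> BinaryIO:
    """Open a file on local disk."""
    return open(fs_path, "rb")


def make_memory_opener(files: Dict[str, bytes]) -> Opener:
    """Fake 'remote' storage backed by a dict, for illustration only."""
    def _open(fs_path: str) -> BinaryIO:
        return io.BytesIO(files[fs_path])
    return _open


payload = b'{"score": 0.1, "iter": 1}\n{"score": 0.4, "iter": 2}\n'

# Local case: the file lives on disk.
with tempfile.TemporaryDirectory() as tmp:
    local_file = os.path.join(tmp, "result.json")
    with open(local_file, "wb") as f:
        f.write(payload)
    local_rows = read_result_rows(local_opener, local_file)

# "Remote" case: same reader, different opener, no download step.
remote_rows = read_result_rows(
    make_memory_opener({"bucket/exp/trial_0/result.json": payload}),
    "bucket/exp/trial_0/result.json",
)
```

The reader never branches on where the data lives; only the opener changes, which is the design property the description attributes to reading from the `storage_filesystem`.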

APIs removed:

* `get_best_logdir` <- Workaround = `analysis.get_best_trial(...).local_path`
* `best_logdir` <- Workaround = `analysis.best_trial.local_path`
* `get_trial_checkpoints_paths` <- This was pretty much an internal method, but was public for some reason.

Related issue number

Closes #38567

Checks

- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu · 2023-08-21T01:27:28Z

python/ray/tune/analysis/experiment_analysis.py

@@ -52,6 +64,645 @@
 DEFAULT_FILE_TYPE = "csv"


+@PublicAPI(stability="beta")
+class NewExperimentAnalysis:


Most of this class is unchanged, except that the init logic is simplified, the JSON/CSV files are now read directly from the filesystem (instead of being downloaded first), and the internal data structures are modified slightly.

justinvyu · 2023-08-21T18:56:01Z

python/ray/tune/result_grid.py

+        location (path on the head node)."""
+        return self._experiment_analysis.experiment_path


This is a little different now: it always returns a filesystem-qualified path, NOT a URI. This is in line with #38568.

cc: @ericl

Yup thx! Will use this in the followup PR.
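
The difference between a URI and a filesystem-qualified path discussed above can be sketched as follows. This is a rough standard-library approximation, not the logic in `result_grid.py`: `to_fs_path` is a hypothetical helper, the example paths are made up, and the scheme-stripping rule is an assumption modeled on the path that `pyarrow.fs.FileSystem.from_uri` returns alongside the filesystem object.

```python
from urllib.parse import urlsplit


def to_fs_path(uri_or_path: str) -> str:
    """Drop the URI scheme, keeping only the path as the filesystem sees it.

    Assumption: plain POSIX-style paths have no scheme and pass through
    unchanged. Windows drive letters are not handled in this sketch.
    """
    parts = urlsplit(uri_or_path)
    if not parts.scheme:
        return uri_or_path
    return parts.netloc + parts.path


cloud_path = to_fs_path("s3://bucket/ray_results/exp")
file_uri_path = to_fs_path("file:///tmp/ray_results/exp")
plain_path = to_fs_path("/tmp/ray_results/exp")
```

Under this sketch a caller that needs a full URI has to re-attach the scheme, which is why the change in return value matters for downstream code.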

krfricke

This looks very good. I only have minor questions. I'm happy to proceed with this assuming tests pass.

What is the plan to switch from NewExperimentAnalysis to ExperimentAnalysis? This may be a breaking change for libraries that interface with it. Are we going to support the old way longer?

Should we call them ExperimentAnalysis and LegacyExperimentAnalysis instead?

python/ray/tune/examples/pb2_ppo_example.py

krfricke · 2023-08-21T19:32:35Z

python/ray/tune/tests/test_result_grid.py

-            assert info["info"] == 4
-
-
-def test_best_result(ray_start_2_cpus):


This PR removes quite a few tests - is the functionality covered somewhere else? Do we have to migrate this later?

Most of these have been consolidated in the new tests. I've also removed some that were just testing behavior that was passing through from ExperimentAnalysis.

python/ray/tune/analysis/experiment_analysis.py

justinvyu · 2023-08-21T20:12:03Z

Plan is to completely remove the old one once all CI has been converted.

For API compatibility, let's just add dummy methods for all the old "public" methods raising an error with the new workarounds. We can even re-implement the best_logdir methods if you think that's too much of a breaking change.

I'm just keeping the NewCheckpoint and NewExperimentAnalysis naming since that's what I've been doing already, and it will be easier for me to track once I want to rename them all. The Legacy prefix probably would have been better.
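
The "dummy methods raising an error with the new workarounds" idea can be sketched as below. This is a standalone illustration, not the code that was merged: `TrialStub` and `AnalysisSketch` are hypothetical stand-ins, `get_best_trial` has a simplified signature here, and the choice of `DeprecationWarning` as the raised exception is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TrialStub:
    """Hypothetical stand-in for a trial; only the fields the sketch needs."""
    trial_id: str
    local_path: str
    score: float


class AnalysisSketch:
    """Hypothetical stand-in showing removed APIs kept as error-raising stubs."""

    def __init__(self, trials: Iterable[TrialStub]):
        self._trials = list(trials)

    def get_best_trial(self, mode: str = "max") -> TrialStub:
        # Simplified signature for the sketch: pick by `score` only.
        pick = max if mode == "max" else min
        return pick(self._trials, key=lambda t: t.score)

    @property
    def best_trial(self) -> TrialStub:
        return self.get_best_trial()

    def get_best_logdir(self, *args, **kwargs):
        # Old public method: fail loudly and point at the workaround.
        raise DeprecationWarning(
            "`get_best_logdir` is removed. "
            "Use `get_best_trial(...).local_path` instead."
        )

    @property
    def best_logdir(self):
        raise DeprecationWarning(
            "`best_logdir` is removed. Use `best_trial.local_path` instead."
        )


analysis = AnalysisSketch(
    [TrialStub("a", "/tmp/exp/a", 0.1), TrialStub("b", "/tmp/exp/b", 0.9)]
)
best_dir = analysis.get_best_trial().local_path  # replaces get_best_logdir()
```

Keeping the old names as stubs means callers get an actionable message rather than an `AttributeError`, at the cost of carrying the stubs until the old API surface is dropped.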

krfricke

I'm approving now pending the changes discussed above as this PR is 95% where it should be.

justinvyu · 2023-08-22T07:09:19Z

@krfricke Ok to merge.

…` implementation (ray-project#38648)

This PR simplifies the implementation of `ExperimentAnalysis` so that it works in the new codepath. While updating all tests/examples, this will be called `NewExperimentAnalysis`.

The main reasons why this couldn't be minimally updated:

1. Previously, the internal (private) data-structures of `ExperimentAnalysis` indexed trial configs, results, and checkpoints with the local path. Tying checkpoints to their local path doesn't really make sense to do anymore, so it was hard to just modify the code in a few places.
2. There were APIs that didn't make so much sense in the new codepath. For example, `get_best_logdir` also returns a local path.
3. ray-project#34461 added support for initializing the `ExperimentAnalysis` from a URI, but this was very tied to the old codepath (using the old `remote_storage` utils). The new implementation handles the remote URI case the same as the local case (just reading files from the `storage_filesystem`).

APIs removed:

* `get_best_logdir` <- Workaround = `analysis.get_best_trial(...).local_path`
* `best_logdir` <- Workaround = `analysis.best_trial.local_path`
* `get_trial_checkpoints_paths` <- This was pretty much an internal method, but was public for some reason.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>

…` implementation (ray-project#38648)

This PR simplifies the implementation of `ExperimentAnalysis` so that it works in the new codepath. While updating all tests/examples, this will be called `NewExperimentAnalysis`.

The main reasons why this couldn't be minimally updated:

1. Previously, the internal (private) data-structures of `ExperimentAnalysis` indexed trial configs, results, and checkpoints with the local path. Tying checkpoints to their local path doesn't really make sense to do anymore, so it was hard to just modify the code in a few places.
2. There were APIs that didn't make so much sense in the new codepath. For example, `get_best_logdir` also returns a local path.
3. ray-project#34461 added support for initializing the `ExperimentAnalysis` from a URI, but this was very tied to the old codepath (using the old `remote_storage` utils). The new implementation handles the remote URI case the same as the local case (just reading files from the `storage_filesystem`).

APIs removed:

* `get_best_logdir` <- Workaround = `analysis.get_best_trial(...).local_path`
* `best_logdir` <- Workaround = `analysis.best_trial.local_path`
* `get_trial_checkpoints_paths` <- This was pretty much an internal method, but was public for some reason.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Victor <vctr.y.m@example.com>

justinvyu added 30 commits August 18, 2023 14:31

- `ca0fab7` Add simplified EA skeleton
- `45fdefc` fetch trial dfs working
- `3ed6876` get_all_configs works
- `339536f` small cleanup
- `49c7fab` small cleanup 2
- `faa23ec` remove unused apis
- `576e6aa` simplify get_trial_checkpoint_paths
- `a660682` simplify get_best_checkpoint
- `f3e1d86` remove trial as string input
- `8d8f404` remove unused stats/runner data APIs
- `549555e` Use new EA class in tune.run
- `683b75a` Update usage in tuner + result grid
- `8c2a628` Remove logdir methods + implement experiment_path attr
- `882ba4a` Update BUILD/Ci file to move tests to new persistence suite
- `87c5166` fix get_best_checkpoint new return type
- `64d47bc` some progress on the unit test
- `08ce692` Fixes + new unit tests for EA
- `fdde0b3` test get_last_checkpoint
- `b48c492` split
- `01f696d` port pickle test
- `74058a0` test dataframe
- `5e21de2` Delete old tests
- `bef86a9` Test from dir
- `d9689a6` test from cloud path
- `38005da` Remove ea mem test
- `68865e9` Use storage path version of exp path to initialize exp analysis in tune.run
- `67eaeb2` update checkpoint repr
- `555b726` don't hard fail on result read failure in exp analysis
- `c7430b1` Update test_result_grid
- `9114aee` fix excessive logging

justinvyu commented Aug 21, 2023

matthewdeng mentioned this pull request Aug 18, 2023

[tune/train] Implement new persistence strategy and roll out as default option #38294

Closed

justinvyu added 6 commits August 20, 2023 23:57

- `03f7500` exclude converted new persistence tests
- `28beb12` fix based on new result_grid.experiment_path (fs path) output
- `8643f86` fix metrics_df attr in result
- `0215999` Merge branch 'master' of https://github.com/ray-project/ray into air/persistence/exp_analysis
- `f4232e8` Remove other best_logdir usage
- `9dcc1a4` Remove remaining trial_dataframes path key usage

justinvyu requested a review from krfricke August 21, 2023 07:26

justinvyu assigned krfricke Aug 21, 2023

justinvyu added 5 commits August 21, 2023 09:35

- `045e56c` exclude new persistence test in old test suite
- `ce01a10` Merge branch 'master' of https://github.com/ray-project/ray into air/persistence/exp_analysis
- `bd6086d` fix exp analysis for custom fs
- `f2e8f47` dep -> tag typo
- `6896af7` Merge branch 'master' of https://github.com/ray-project/ray into air/persistence/exp_analysis

justinvyu commented Aug 21, 2023

krfricke reviewed Aug 21, 2023

krfricke approved these changes Aug 21, 2023

ericl added the @author-action-required label ("The PR author is responsible for the next step. Remove tag to send back to the reviewer.") Aug 21, 2023

justinvyu added 3 commits August 21, 2023 21:32

- `078293c` make fetch trial dfs private
- `b24bd2d` add deprecated methods
- `e0fec71` Merge branch 'master' of https://github.com/ray-project/ray into air/persistence/exp_analysis

justinvyu added the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") and removed the @author-action-required label Aug 22, 2023

krfricke merged commit 2f17219 into ray-project:master Aug 22, 2023
90 of 95 checks passed

justinvyu deleted the air/persistence/exp_analysis branch August 22, 2023 17:47
