Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tune] Storage: 🐙 🧠 Tune tests and examples {using RLlib} migration #38895

Merged
merged 23 commits into from
Aug 28, 2023

Conversation

krfricke
Copy link
Contributor

@krfricke krfricke commented Aug 25, 2023

Why are these changes needed?

Migrates failing tests of this suite: https://buildkite.com/ray-project/oss-ci-build-pr/builds/33691#018a2b92-326f-4336-9c75-4f358721c816

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Kai Fricke added 8 commits August 25, 2023 13:59
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Comment on lines 466 to 472
if not self.storage_filesystem or isinstance(
self.storage_filesystem, pyarrow.fs.LocalFileSystem
):
self.storage_local_path = self.storage_path
else:
self.storage_local_path = default_results_dir

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinvyu lmk what you think. Previously the storage_local_path was always the ray_results directory. But I think we should actually set it to the local path if a local path was set. This also matches previous behavior.

Copy link
Contributor

@justinvyu justinvyu Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with this. @ericl did not want ANY parsing of the storage path (i.e detecting if it's remote or not).

But, because this is AFTER pyrarrow.fs.from_uri, and we're just doing a typecheck of the pyarrow fs, I think this is ok.

The difference is that previously, we did some checks on the raw uri that was being passed by the user.

This will also remove the need for us to either re-introduce some kind of local_dir param or make the environment variable more visible.
storage_path is the 1 way to set the experiment path.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Copy link
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! One fix, and some other comments

Comment on lines 466 to 472
if not self.storage_filesystem or isinstance(
self.storage_filesystem, pyarrow.fs.LocalFileSystem
):
self.storage_local_path = self.storage_path
else:
self.storage_local_path = default_results_dir

Copy link
Contributor

@justinvyu justinvyu Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with this. @ericl did not want ANY parsing of the storage path (i.e detecting if it's remote or not).

But, because this is AFTER pyrarrow.fs.from_uri, and we're just doing a typecheck of the pyarrow fs, I think this is ok.

The difference is that previously, we did some checks on the raw uri that was being passed by the user.

This will also remove the need for us to either re-introduce some kind of local_dir param or make the environment variable more visible.
storage_path is the 1 way to set the experiment path.

Comment on lines 211 to 218
# Backwards compatibility
from ray import train

report_dict = kwargs
if _metric is not None:
report_dict["_metric"] = _metric

train.report(report_dict)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we just force users to switch to train.report? otherwise we'll have to force users to do another migration post 2.7

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, at minimum add a warning here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a deprecation warning for now. Let's check if we can hard deprecate in a follow-up

Comment on lines 466 to 471
if not self.storage_filesystem or isinstance(
self.storage_filesystem, pyarrow.fs.LocalFileSystem
):
self.storage_local_path = self.storage_path
else:
self.storage_local_path = default_results_dir
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if not self.storage_filesystem or isinstance(
self.storage_filesystem, pyarrow.fs.LocalFileSystem
):
self.storage_local_path = self.storage_path
else:
self.storage_local_path = default_results_dir
if isinstance(
self.storage_filesystem, pyarrow.fs.LocalFileSystem
):
self.storage_local_path = self.storage_path
else:
self.storage_local_path = default_results_dir

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

otherwise this will set local path to a s3 uri in the case of

storage_path="s3://...", storage_filesystem=None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, wouldn't pyarrow resolve the filesystem to s3 filesystem anyway? this is checking self.storage_filesystem after parsing, not storage_filesystem

@justinvyu justinvyu requested a review from ericl August 25, 2023 17:56
Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[flagging as a major API change]

It's super late in the release process, and we cannot make breaking API changes like this that go against choices previously made in the REP.

Let's revert the behavior changes from this PR and we can discuss them separately.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 25, 2023
Kai Fricke added 5 commits August 25, 2023 20:14
Signed-off-by: Kai Fricke <kai@anyscale.com>
# Conflicts:
#	python/ray/tune/BUILD
#	python/ray/tune/tests/test_api.py
#	python/ray/tune/tests/test_cluster.py
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke krfricke removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 26, 2023
@krfricke
Copy link
Contributor Author

Ok, reverted the changes for now.

It also looks like test_cluster was fixed in another PR, so this leaves only two test suites in this PR.

Kai Fricke added 5 commits August 26, 2023 20:55
Signed-off-by: Kai Fricke <kai@anyscale.com>
fix
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke krfricke changed the title [tune] Storage: Tune tests + examples (incl Rllib) migration [tune] Storage: 🐙 🧠 Tune tests and examples {using RLlib} migration Aug 27, 2023
@krfricke krfricke added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. Ray 2.7 labels Aug 27, 2023
@@ -59,7 +59,7 @@ py_test(
size = "large",
srcs = ["tests/test_api.py"],
deps = [":tune_lib"],
tags = ["team:ml", "exclusive", "rllib"],
tags = ["team:ml", "exclusive", "rllib", "new_storage"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit/for future reference - we don't need this if we're enabling the FF (or is this test run in multiple suites?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are disabled in the old path but enabled in the new path. When the FF is enabled for everything (and not overwritten) we don't need it anymore.

Let's clean those up in one go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, just to confirm my understanding - these tests are run in other suites in addition to ":octopus: :brain: Tune tests and examples {using RLlib}"?


register_trainable("f1", train_fn)
with unittest.mock.patch.dict(os.environ, {"TEST_TMPDIR": logdir}):
tune.run("f1")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test this supposed to test run_experiments specifically? Or does it need fail_fast="raise" in order for the test failure to be propagated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just testing the logdir location

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh woops, for some reason I was under the impression that Tune would swallow the error and tune.run would finish "successfully" with an error file.

python/ray/tune/tests/test_api.py Outdated Show resolved Hide resolved
python/ray/tune/tests/test_api.py Outdated Show resolved Hide resolved
python/ray/tune/tests/test_tune_restore.py Outdated Show resolved Hide resolved
@@ -64,7 +67,7 @@ def testTuneRestore(self):
name="TuneRestoreTest",
stop={"training_iteration": 2}, # train one more iteration.
checkpoint_config=CheckpointConfig(checkpoint_frequency=1),
restore=self.checkpoint_path, # Restore the checkpoint
restore=self.checkpoint_parent, # Restore the checkpoint
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinvyu to confirm that this is the intended behavior - restore takes in the directory instead of the file? Seems logical, but are there backwards compatibility issues with this?

@matthewdeng matthewdeng added v2.7.0-pick and removed tests-ok The tagger certifies test failures are unrelated and assumes personal liability. labels Aug 27, 2023
@matthewdeng
Copy link
Contributor

matthewdeng commented Aug 28, 2023

@krfricke I'm going to merge in master to re-trigger the tests. (I took off the tests-ok label because I saw a different test that failed, but can add it back after this run)

@matthewdeng matthewdeng added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Aug 28, 2023
@zhe-thoughts zhe-thoughts removed their assignment Aug 28, 2023
Kai Fricke added 2 commits August 28, 2023 17:26
@krfricke krfricke added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed tests-ok The tagger certifies test failures are unrelated and assumes personal liability. labels Aug 28, 2023
@matthewdeng matthewdeng merged commit ed7186a into ray-project:master Aug 28, 2023
58 of 60 checks passed
@krfricke krfricke deleted the train/test-api branch August 29, 2023 07:59
matthewdeng added a commit to matthewdeng/ray that referenced this pull request Aug 30, 2023
…ay-project#38895)

Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
GeneDer pushed a commit that referenced this pull request Aug 30, 2023
* [train] enable new persistence mode for core and serve tests (#38938)

Signed-off-by: Matthew Deng <matt@anyscale.com>

* [train] New persistence mode: Update 🐠 `ML Libraries w/ Ray Client Examples (Python 3.7)` (#38923)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

* [train] remove non-URI assertion (#38944)

Signed-off-by: Matthew Deng <matt@anyscale.com>

* [train] New persistence mode: Update 📖 `Doc tests and examples (excluding Ray AIR examples)` (#38940)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
Co-authored-by: Matthew Deng <matt@anyscale.com>

* disable legacy sync config logic in trainable (#38952)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

* [2.7 CI][New Persistent Mode][6/n] 📖 ✈️ Ray AIR examples (#38918)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* [2.7 CI][New Persistent Mode][2/n] 📺 📖 Doc GPU tests and examples (#38905)


Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* [2.7 CI][New Persistent Mode][4/n] 📺 🚂 Train GPU tests & 🚂 Datasets Train Integration GPU Tests and Examples (#38910)



Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>

* [2.7 CI][New Persistent Mode][1/n] 📺 ✈️ AIR GPU tests (ray/air) & ⚡ :python: Lightning 2.0 Train GPU tests  (#38903)


Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>

* [train] Fix broken tune tests and support ray storage (#38950)

This PR re-introduces support for ray storage ray.init(storage="s3://...") and fixes a broken tune controller test.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

* [train] New persistence mode: Finish migrating `xgb`, `lgbm` and `sklearn` trainers, checkpoints + tests (#38959)


Signed-off-by: Justin Yu <justinvyu@anyscale.com>

* [2.7 CI][New Persistent Mode][5/n] 📖 Doc examples for external code (#38915)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* [train][rllib] temporarily disable new persistence mode for rllib tests (#38965)

Signed-off-by: Matthew Deng <matt@anyscale.com>

* [2.7 CI][New Persistent Mode][8/n] ✈️ AIR tests (ray/air) (#38932)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* [tune] Storage: 🐙 🧠 Tune tests and examples {using RLlib} migration (#38895)

Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: matthewdeng <matt@anyscale.com>

* [train] Fix MosaicTrainer example and unit test (#38970)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

* [air/release] Fix dreambooth example image preprocessing logic (#39020)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

* [train] clean up ray.train._checkpoint imports (#38951)

Signed-off-by: Matthew Deng <matt@anyscale.com>

* [train] high level cleanup of Ray Train docs (#38971)

Signed-off-by: Matthew Deng <matt@anyscale.com>

* [wip][docs] update FrameworkPredictor examples (#38634)


Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: matthewdeng <matt@anyscale.com>

* [train] Add documentation for using metadata argument to save preprocessors (#38701)

* [Train] Restructure Ray Train Example Page (#38814)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* [air] Deprecate some fields/classes that are supposed to be gone in 2.6. (#38794)

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* [tune/storage] Fix Tune multinode tests (#39050)

Fixes multinode tests by using the new train.report() API.

Signed-off-by: Kai Fricke <kai@anyscale.com>

* [tune] Fix BOHB example for new storage (#38983)

The new storage path does not create "empty" checkpoints per default anymore. Previously, when no checkpoint is saved, PAUSEing a trial would create a dummy checkpoint that only contains trial metadata (such as the iteration number). This is not the case anymore.

Examples now have to implement checkpointing to properly restore previous state. This was also true previously - but some of our simple examples (e.g. the one in this PR) didn't implement it and still "worked".

I think it's fine to keep the functionality as is and require our examples to show checkpointing implementations. This will ensure that users don't shoot their feet trying to use e.g. BOHB.

Separately, BOHB was malfunctioning as trials were repeatedly PAUSED and restarted as they've never been removed from `bracket.trials_to_unpause`. @justinvyu mentioned this in the review where it was introduced and I believed at the time it wasn't necessary - turns out it is, as we can end up in a situation where a bracket is never finished because trials are constantly running. This was not caught by any tests. We should add one in a follow-up - for now we can proceed with this PR to pick onto Ray 2.7.

Signed-off-by: Kai Fricke <kai@anyscale.com>

* [Release Test] Fix `long_running_horovod_tune_test`. (#39012)

Signed-off-by: Yunxuan Xiao <yunxuanx@anyscale.com>
Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>

* [train] New persistence mode: `StorageContext` unit tests (#39023)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

* [train] enable train + tune tests and examples (#39021)

Signed-off-by: Matthew Deng <matt@anyscale.com>

* [rllib] Fix storage-path related tests (#38947)

This PR fixes rllib-related tests that didn't pass changes related to the new storage context.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: matthewdeng <matt@anyscale.com>
Co-authored-by: matthewdeng <matt@anyscale.com>

* [train] New persistence mode: Migrate 🐙 `Tune tests and examples (medium)` (#39081)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

---------

Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: matthewdeng <matt@anyscale.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: Yunxuan Xiao <yunxuanx@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Yunxuan Xiao <yunxuanx@anyscale.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…ay-project#38895)

Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
LeonLuttenberger pushed a commit to jaidisido/ray that referenced this pull request Sep 5, 2023
…ay-project#38895)

Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
jimthompson5802 pushed a commit to jimthompson5802/ray that referenced this pull request Sep 12, 2023
…ay-project#38895)

Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Signed-off-by: Jim Thompson <jimthompson5802@gmail.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…ay-project#38895)

Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ray 2.7 tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants