
[air] Fix behavior of multi-node checkpointing without an external storage_path to hard-fail #37543

Merged

Conversation

@justinvyu (Contributor) commented Jul 18, 2023

Why are these changes needed?

This PR builds on #37142 to fix the behavior of running a multi-node experiment without NFS or cloud storage: the correct behavior is to fail, but previously Tune caught the error and only logged it.
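For context, a minimal sketch of the two configurations at play (assuming the Ray AIR API of this era; the paths are illustrative, taken from the warning message in the diff below):

```python
from ray import train, tune

# Multi-node run that checkpoints with no external storage_path:
# with this PR, this configuration now hard-fails instead of having
# the sync error caught and logged.
failing = train.RunConfig(
    storage_path=None,
    sync_config=tune.SyncConfig(syncer="auto"),
)

# Pointing storage_path at NFS or cloud storage avoids the hard-fail.
passing = train.RunConfig(
    storage_path="/mnt/path/to/nfs_storage",  # or a cloud URI like "gs://bucket/dir"
)
```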

Testing

Manual testing

Verified outputs/errors are as expected for:

  • XGBoostTrainer
  • DataParallelTrainer
  • Tuner w/ function trainable

Automated testing of all combinations

This PR was tested with the following script on a cluster:

import ray
from ray import tune, air, train
from ray.train.data_parallel_trainer import DataParallelTrainer

import os
import pytest


def train_fn(config):
    if config["should_checkpoint"]:
        train.report({"a": 1}, checkpoint=train.Checkpoint.from_dict({"a": 1}))
    else:
        train.report({"a": 1})


@pytest.fixture(scope="module")
def ray_init():
    ray.init()
    yield
    ray.shutdown()

@pytest.mark.parametrize(
    "storage_path",
    [
        None,
        "/mnt/cluster_storage",
        f"gs://{os.environ['ANYSCALE_CLOUD_STORAGE_BUCKET']}/justinvyu-testing",
    ],
)
@pytest.mark.parametrize("syncer", ["auto", None])
@pytest.mark.parametrize("env_var_flag", ["0", "1"])
@pytest.mark.parametrize("should_checkpoint", [True, False])
@pytest.mark.parametrize("runner_type", ["trainer", "tuner"])
def test_all_storage_configurations(
    monkeypatch, storage_path, syncer, env_var_flag, should_checkpoint, runner_type
):
    if storage_path and storage_path.startswith("gs://") and syncer is None:
        return

    monkeypatch.setenv("RAY_AIR_REENABLE_DEPRECATED_SYNC_TO_HEAD_NODE", env_var_flag)

    run_config = train.RunConfig(
        storage_path=storage_path,
        sync_config=tune.SyncConfig(syncer=syncer),
    )

    # Make sure that the trainer/tune trials are scheduled on worker nodes.
    if runner_type == "trainer":
        runner = DataParallelTrainer(
            train_fn,
            train_loop_config={"should_checkpoint": should_checkpoint},
            scaling_config=air.ScalingConfig(
                num_workers=2,
                trainer_resources={"CPU": 20},
                resources_per_worker={"CPU": 12},
            ),
            run_config=run_config,
        )
    else:
        runner = tune.Tuner(
            tune.with_resources(train_fn, {"CPU": 32}),
            param_space={"should_checkpoint": should_checkpoint},
            run_config=run_config,
            tune_config=tune.TuneConfig(num_samples=2),
        )

    # Hard fail if:
    # - Checkpointing is enabled
    # - Syncing is not explicitly disabled
    # - The rollback environment variable is not set
    # - No storage path is set
    should_raise = (
        should_checkpoint
        and syncer is not None
        and env_var_flag != "1"
        and storage_path is None
    )

    if should_raise:
        with pytest.raises(Exception):
            runner.fit()
    else:
        runner.fit()
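The expectation encoded by `should_raise` can also be checked in isolation. This is a hypothetical helper (not in the PR) that just mirrors the boolean logic from the script above:

```python
def should_hard_fail(should_checkpoint, syncer, env_var_flag, storage_path):
    # Hard fail only when checkpointing is on, syncing is not explicitly
    # disabled, the rollback env var is unset, and no storage path is set.
    return (
        should_checkpoint
        and syncer is not None
        and env_var_flag != "1"
        and storage_path is None
    )

# Only the fully-default multi-node configuration fails; flipping any one
# of the four conditions keeps the old (non-raising) behavior:
assert should_hard_fail(True, "auto", "0", None)
assert not should_hard_fail(True, None, "0", None)        # syncing disabled
assert not should_hard_fail(True, "auto", "1", None)      # rollback env var set
assert not should_hard_fail(True, "auto", "0", "/mnt/cluster_storage")
assert not should_hard_fail(False, "auto", "0", None)     # no checkpointing
```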

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@@ -93,7 +98,8 @@
     "`RunConfig(storage_path='/mnt/path/to/nfs_storage')`\n"
     "See this Github issue for more details on transitioning to cloud storage/NFS "
     "as well as an explanation on why this functionality is "
-    "being removed: https://github.com/ray-project/ray/issues/37177\n\n"
+    "being removed: https://github.com/ray-project/ray/issues/37177\n"
+    "If you are already using NFS, you can ignore this warning message.\n\n"
A reviewer (Contributor) commented:
Can this ever happen? If they are already using storage_path, they shouldn't see this message, right?

@justinvyu (Contributor, Author) replied Jul 19, 2023:

I do log a warning message when running with multiple nodes without cloud storage, and it'll show up even if NFS is used.

This is because I don't have a good way of checking if storage_path is a local directory or NFS.

Is that ok? This warning message will only show up once, and it basically just serves as a deprecation reminder.
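To illustrate why reliably telling local from NFS storage is hard: a naive heuristic (hypothetical, not part of this PR) would match the path against a mount table and check the filesystem type, but bind mounts, FUSE wrappers, and any network filesystem missing from the allowlist all defeat it:

```python
NETWORK_FSTYPES = {"nfs", "nfs4", "cifs", "smb3", "glusterfs", "lustre", "ceph"}

def looks_like_network_fs(path, mounts_table):
    """Guess whether `path` lives on a network filesystem, given the text of
    a mount table (e.g. the contents of /proc/mounts on Linux). Picks the
    longest mountpoint that is a prefix of `path`. Hypothetical heuristic:
    any fstype not in the allowlist gives a wrong answer."""
    best_mountpoint, is_network = "", False
    for line in mounts_table.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mountpoint, fstype = fields[1], fields[2]
        if path.startswith(mountpoint) and len(mountpoint) >= len(best_mountpoint):
            best_mountpoint, is_network = mountpoint, fstype in NETWORK_FSTYPES
    return is_network

mounts = (
    "rootfs / ext4 rw 0 0\n"
    "fileserver:/data /mnt/nfs_storage nfs4 rw 0 0\n"
)
```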

The reviewer replied:

Sounds good to me :)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@matthewdeng matthewdeng merged commit 78649e0 into ray-project:master Jul 19, 2023
63 of 65 checks passed
@justinvyu justinvyu deleted the air/disable_head_node_sync_fix branch July 19, 2023 03:35
@pcmoritz (Contributor):
@justinvyu This is great! Can you also create a cherry-pick PR into the 2.6.0 release branch? :)

justinvyu added a commit to justinvyu/ray that referenced this pull request Jul 19, 2023
…torage_path` to hard-fail (ray-project#37543)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
can-anyscale pushed a commit that referenced this pull request Jul 19, 2023
…torage_path` to hard-fail (#37543) (#37567)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Bhav00 pushed a commit to Bhav00/ray that referenced this pull request Jul 24, 2023
…torage_path` to hard-fail (ray-project#37543)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Bhav00 pushed a commit to Bhav00/ray that referenced this pull request Jul 28, 2023
…torage_path` to hard-fail (ray-project#37543)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
…torage_path` to hard-fail (ray-project#37543)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: NripeshN <nn2012@hw.ac.uk>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
…torage_path` to hard-fail (ray-project#37543)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: harborn <gangsheng.wu@intel.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
…torage_path` to hard-fail (ray-project#37543)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…torage_path` to hard-fail (ray-project#37543)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…torage_path` to hard-fail (ray-project#37543)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Victor <vctr.y.m@example.com>