[air] New train.Checkpoint API: Update release tests (batch 2) #38550

Merged
krfricke merged 19 commits into ray-project:master from air/update_release_tests_batch2
Aug 17, 2023

Contributor

@justinvyu justinvyu commented Aug 17, 2023

Why are these changes needed?

This PR enables the new persistence mode feature flag for the following release tests:

workspace_template_many_model_training
air_benchmark_data_bulk_ingest
air_benchmark_tensorflow_mnist_cpu_1x4
air_benchmark_tensorflow_mnist_cpu_4x1
air_benchmark_tensorflow_mnist_cpu_4x4
air_benchmark_tensorflow_mnist_gpu_4x4
air_example_dreambooth_finetuning
alpa_opt_2_7b_sanity_check
alpa_opt_30b_inference
cluster_tune_scale_up_down
long_running_many_ppo
ml_user_horovod_user_test_latest
ml_user_horovod_user_test_master
ml_user_train_tensorflow_mnist_test
ml_user_train_torch_linear_test
ml_user_tune_rllib_connect_test
train_horovod_multi_node_test

See them passing here (I removed the dolly one from the PR for now): https://buildkite.com/ray-project/release-tests-pr/builds/49414#_

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Fix lint 2


missing

…vanilla tf exactly

Collaborator

@aslonnie aslonnie left a comment

(for @can-anyscale to review)

@justinvyu justinvyu marked this pull request as draft August 17, 2023 07:11
@justinvyu
Contributor Author

@aslonnie The CI changes here are temporary, just to launch a subset of the release tests. I'm going to revert them before merging, so there's no need to review them.

@justinvyu justinvyu marked this pull request as ready for review August 17, 2023 15:35
@justinvyu justinvyu requested a review from ericl August 17, 2023 15:36
Comment on lines +556 to +562

# Raise an error if the storage path is not accessible when
# attempting to upload a checkpoint from a remote worker.
# Ex: If storage_path is a local path, then a validation marker
# will only exist on the head node but not the worker nodes.
self._check_validation_file()

Contributor Author

Note: This change is needed so that a multi-node run with no storage path doesn't fail when no checkpoints are reported.

Contributor Author

@ericl This is the same error-raising behavior that we added for head node syncing in 2.6.
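
For readers following this thread, here is a minimal sketch of the idea behind the check in the snippet above, assuming a plain local or shared filesystem. It is not the code from this PR: `MARKER_NAME`, `create_validation_file`, `check_validation_file`, and `persist_checkpoint` are hypothetical names chosen for illustration. The driver writes a marker file into the storage path once, and a worker only looks for that marker at the moment it tries to upload a checkpoint, so a run that never reports a checkpoint never hits the error.

```python
import os

# Hypothetical marker name, used only in this sketch.
MARKER_NAME = ".storage_marker"


def create_validation_file(storage_path: str) -> None:
    """Driver / head node: write an empty marker into the storage path."""
    os.makedirs(storage_path, exist_ok=True)
    with open(os.path.join(storage_path, MARKER_NAME), "w"):
        pass


def check_validation_file(storage_path: str) -> None:
    """Worker: raise if the marker written by the driver is not visible here."""
    if not os.path.exists(os.path.join(storage_path, MARKER_NAME)):
        raise RuntimeError(
            f"Cannot find the validation marker under {storage_path!r}. "
            "The storage path is probably not shared between nodes."
        )


def persist_checkpoint(storage_path: str, name: str, payload: bytes) -> str:
    """Validate the storage path, then write the checkpoint payload."""
    # The check runs only at upload time, so a run that reports no
    # checkpoints never calls it and never raises.
    check_validation_file(storage_path)
    checkpoint_path = os.path.join(storage_path, name)
    with open(checkpoint_path, "wb") as f:
        f.write(payload)
    return checkpoint_path
```

In this sketch, calling `create_validation_file(path)` followed by `persist_checkpoint(path, ...)` on the same filesystem succeeds, while calling `persist_checkpoint` on a directory that never received the marker (the analogue of a worker node with its own local disk) raises `RuntimeError`.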

Contributor

@krfricke krfricke left a comment

Thanks!

@krfricke krfricke merged commit f693068 into ray-project:master Aug 17, 2023
54 of 62 checks passed
@justinvyu justinvyu deleted the air/update_release_tests_batch2 branch August 17, 2023 17:02
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023