[air] New train.Checkpoint API: Update release tests (batch 2) #38550
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com> Fix lint 2 Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com> missing Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…vanilla tf exactly Signed-off-by: Justin Yu <justinvyu@anyscale.com>
(for @can-anyscale to review)
@aslonnie The CI changes here are temporary, just to launch a subset of release tests. I'm going to revert these before merging, so no need to review here.
This reverts commit f1b9534.
…update_release_tests_batch2
This reverts commit e6608e8.
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
# Raise an error if the storage path is not accessible when
# attempting to upload a checkpoint from a remote worker.
# Ex: If storage_path is a local path, then a validation marker
# will only exist on the head node but not the worker nodes.
self._check_validation_file()
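The idea in the comment above can be illustrated with a small, self-contained sketch. This is not Ray's implementation: the `StorageSketch` class, its method names, and the marker file name are all hypothetical, and two temporary directories stand in for the node-local disks of a head node and a worker node. The point it tries to show is that the check runs only at upload time, so a run that never reports a checkpoint never hits the error.

```python
import os
import tempfile

# Hypothetical marker name, chosen for illustration only.
VALIDATION_MARKER = ".validation_marker_sketch"


class StorageSketch:
    """Toy stand-in for a storage context (an assumption, not Ray's class)."""

    def __init__(self, storage_path: str):
        # Constructing the object performs no validation.
        self.storage_path = storage_path

    def create_validation_file(self) -> None:
        # Would run once on the head node when the run starts.
        os.makedirs(self.storage_path, exist_ok=True)
        with open(os.path.join(self.storage_path, VALIDATION_MARKER), "w"):
            pass

    def check_validation_file(self) -> None:
        # If the marker written by the head node is not visible here,
        # the storage path is not shared between nodes.
        marker = os.path.join(self.storage_path, VALIDATION_MARKER)
        if not os.path.exists(marker):
            raise RuntimeError(
                f"Storage path {self.storage_path!r} is not accessible "
                "from this node; use shared or cloud storage."
            )

    def persist_checkpoint(self, name: str, payload: bytes) -> str:
        # The check happens only when a checkpoint is actually uploaded.
        self.check_validation_file()
        dest = os.path.join(self.storage_path, name)
        with open(dest, "wb") as f:
            f.write(payload)
        return dest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as head_disk, \
            tempfile.TemporaryDirectory() as worker_disk:
        StorageSketch(head_disk).create_validation_file()

        # Same path visible on the uploading node: the upload succeeds.
        print(StorageSketch(head_disk).persist_checkpoint("ckpt_0", b"weights"))

        # A node-local path without the marker: the upload raises.
        try:
            StorageSketch(worker_disk).persist_checkpoint("ckpt_0", b"weights")
        except RuntimeError as err:
            print("upload rejected:", err)
```

Placing the check inside the upload path, rather than in the constructor, is what lets a multi-node run with a local path proceed as long as no worker tries to persist a checkpoint.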
Note: This change is needed so that multi-node runs with no storage path don't fail when NO checkpoints are reported.
@ericl This is the same raise error behavior that we added for head node syncing in 2.6.
Thanks!
Why are these changes needed?
This PR enables the new persistence mode feature flag for the following release tests:
workspace_template_many_model_training
air_benchmark_data_bulk_ingest
air_benchmark_tensorflow_mnist_cpu_1x4
air_benchmark_tensorflow_mnist_cpu_4x1
air_benchmark_tensorflow_mnist_cpu_4x4
air_benchmark_tensorflow_mnist_gpu_4x4
air_example_dreambooth_finetuning
alpa_opt_2_7b_sanity_check
alpa_opt_30b_inference
cluster_tune_scale_up_down
long_running_many_ppo
ml_user_horovod_user_test_latest
ml_user_horovod_user_test_master
ml_user_train_tensorflow_mnist_test
ml_user_train_torch_linear_test
ml_user_tune_rllib_connect_test
train_horovod_multi_node_test
See them passing here (I removed the dolly one from the PR for now): https://buildkite.com/ray-project/release-tests-pr/builds/49414#_
Related issue number
Checks
I've signed off every commit (git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.