[tune] Fix trial checkpoint syncing after recovery from other node #28470

Merged
merged 4 commits into ray-project:master from tune/syncer-ip
Sep 14, 2022

Conversation

krfricke
Contributor

Signed-off-by: Kai Fricke <kai@anyscale.com>

Why are these changes needed?

On restore from a different IP, the SyncerCallback still tries to sync from a stale node IP, because `trial.last_result` has not been updated yet. Instead, the syncer callback should keep its own map of trials to node IPs and act only on that.
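The idea described above can be sketched roughly as follows. This is a minimal sketch, not the actual Ray implementation: the `Trial` stand-in, the `node_ip` result key, and the `_sync_ip` helper are assumed names for illustration (`_trial_ips` does appear in the PR's test code).

```python
from dataclasses import dataclass, field

# Minimal stand-in for a Tune trial; the real Trial class carries far more state.
@dataclass
class Trial:
    trial_id: str
    last_result: dict = field(default_factory=dict)

class SyncerCallback:
    """Sketch: track each trial's node IP instead of trusting last_result."""

    def __init__(self):
        # trial_id -> last node IP observed from a live result
        self._trial_ips = {}

    def on_trial_start(self, iteration, trials, trial):
        # A (re)starting trial may have moved to another node: drop any cached IP.
        self._trial_ips.pop(trial.trial_id, None)

    def on_trial_result(self, iteration, trials, trial, result):
        # Record the IP reported with a live result, if present.
        ip = result.get("node_ip")
        if ip:
            self._trial_ips[trial.trial_id] = ip

    def _sync_ip(self, trial):
        # Prefer the callback's own map; fall back to the (possibly stale)
        # last_result only when no live result has been seen yet.
        return self._trial_ips.get(trial.trial_id) or trial.last_result.get("node_ip")
```

With this shape, a trial restored on a new node never syncs against the IP stored in its checkpointed `last_result` once a fresh result arrives.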

Related issue number

Closes #28468

Checks

  • I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Kai Fricke added 2 commits September 13, 2022 16:28
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
# From the new unit test: seed the callback with a stale IP and verify
# that on_trial_start resets it before any syncing happens.
syncer_callback._trial_ips[trial1.trial_id] = "invalid"
syncer_callback.on_trial_start(iteration=0, trials=[], trial=trial1)

syncer_callback.on_trial_result(iteration=1, trials=[], trial=trial1, result={})
Contributor


I am a bit confused as to why we would encounter this.
If a result is reported, shouldn't there always be a NODE_IP field, which is automatically filled in by us?

Contributor Author


The problem is that `last_result` could be populated from a restored checkpoint and point to an IP address that no longer exists.

Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke krfricke merged commit b49f4b7 into ray-project:master Sep 14, 2022
@krfricke krfricke deleted the tune/syncer-ip branch September 14, 2022 14:34
matthewdeng pushed a commit that referenced this pull request Sep 15, 2022
…28470)

On restore from a different IP, the SyncerCallback currently still tries to sync from a stale node IP, because `trial.last_result` has not been updated, yet. Instead, the syncer callback should keep its own map of trials to IPs, and only act on this.

Signed-off-by: Kai Fricke <kai@anyscale.com>
ArturNiederfahrenhorst added a commit that referenced this pull request Sep 15, 2022
…tests (#28535)

* [core/ci] Disallow protobuf 3.19.5 (#28504)

This leads to hangs in Ray client (e.g. test_dataclient_disconnect)

Signed-off-by: Kai Fricke <kai@anyscale.com>

* [tune] Fix trial checkpoint syncing after recovery from other node (#28470)

On restore from a different IP, the SyncerCallback currently still tries to sync from a stale node IP, because `trial.last_result` has not been updated, yet. Instead, the syncer callback should keep its own map of trials to IPs, and only act on this.

Signed-off-by: Kai Fricke <kai@anyscale.com>

* [air] minor example fix. (#28379)

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* [cleanup] Remove memory unit conversion (#28396)

The internal memory unit was switched back to bytes years ago, there's no point in keeping confusing conversion code around anymore.

Recommendation: Review #28394 first, since this is stacked on top of it.

Co-authored-by: Alex <alex@anyscale.com>

* [RLlib] Sync policy specs from local_worker_for_synching while recovering rollout/eval workers. (#28422)

* Cast rewards as tf.float32 to fix error in DQN in tf2 (#28384)

* Cast rewards as tf.float32 to fix error in DQN in tf2

Signed-off-by: mgerstgrasser <matthias@gerstgrasser.net>

* Add test case for DQN with integer rewards

Signed-off-by: mgerstgrasser <matthias@gerstgrasser.net>

Signed-off-by: mgerstgrasser <matthias@gerstgrasser.net>

* [doc] [Datasets] Improve docstring and doctest for read_parquet (#28488)

This addresses some of the issues brought up in #28484

* [ci] Increase timeout on test_metrics (#28508)

10 milliseconds is ambitious for the CI to do anything.

Co-authored-by: Alex <alex@anyscale.com>

* [air/tune] Catch empty hyperopt search space, raise better Tuner error message (#28503)

* Add imports to object-spilling.rst Python code (#28507)

* Add imports to object-spilling.rst Python code

Also adjust a couple descriptions, retaining the same general information

Signed-off-by: Jake <DevJake@users.noreply.github.com>

* fix doc build / keep note formatting

Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>

* another tiny fix

Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>

Signed-off-by: Jake <DevJake@users.noreply.github.com>
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>

* [AIR] Make PathPartitionScheme a dataclass (#28390)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

* [Telemetry][Kubernetes] Distinguish Kubernetes deployment stacks (#28490)

Right now, Ray telemetry indicates the majority of Ray's CPU hour usage comes from Ray running within a Kubernetes cluster. However, we have no data on what method is used to deploy Ray on Kubernetes.

This PR enables Ray telemetry to distinguish between three methods of deploying Ray on Kubernetes:

  • KubeRay >= 0.4.0
  • Legacy Ray Operator with Ray >= 2.1.0
  • All other methods

The strategy is to have the operators inject an env variable into the Ray container's environment.
The variable identifies the deployment method.

This PR also modifies the legacy Ray operator to inject the relevant env variable.
A follow-up KubeRay PR changes the KubeRay operator to do the same thing: ray-project/kuberay#562
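On the consuming (telemetry) side, the env-variable strategy might look roughly like the sketch below. The variable name `RAY_K8S_DEPLOYMENT_METHOD` is an assumption for illustration, not the name Ray actually uses; `KUBERNETES_SERVICE_HOST` is the standard variable Kubernetes sets in every pod.

```python
import os

# Hedged sketch: classify the deployment method from an env variable that the
# operator injects into the Ray container. RAY_K8S_DEPLOYMENT_METHOD is an
# assumed name for illustration only.
def kubernetes_deployment_method(environ=None) -> str:
    env = os.environ if environ is None else environ
    # KUBERNETES_SERVICE_HOST is present in any pod running on Kubernetes.
    if "KUBERNETES_SERVICE_HOST" not in env:
        return "not-kubernetes"
    # Operators (e.g. KubeRay >= 0.4.0, the legacy Ray operator) would set a
    # marker value; everything else falls into the catch-all bucket.
    return env.get("RAY_K8S_DEPLOYMENT_METHOD", "other")
```

Passing `environ` explicitly keeps the helper testable without mutating the process environment.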

Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>

* [autoscaler][observability] Experimental verbose mode (#28392)

This PR introduces a super secret hidden verbose mode for ray status, which we can keep hidden while collecting feedback before going through the process of officially declaring it part of the public API.

Example output

======== Autoscaler status: 2020-12-28 01:02:03 ========
GCS request time: 3.141500s
Node Provider non_terminated_nodes time: 1.618000s

Node status
--------------------------------------------------------
Healthy:
 2 p3.2xlarge
 20 m4.4xlarge
Pending:
 m4.4xlarge, 2 launching
 1.2.3.4: m4.4xlarge, waiting-for-ssh
 1.2.3.5: m4.4xlarge, waiting-for-ssh
Recent failures:
 p3.2xlarge: RayletUnexpectedlyDied (ip: 1.2.3.6)

Resources
--------------------------------------------------------
Total Usage:
 1/2 AcceleratorType:V100
 530.0/544.0 CPU
 2/2 GPU
 2.00/8.000 GiB memory
 3.14/16.000 GiB object_store_memory

Total Demands:
 {'CPU': 1}: 150+ pending tasks/actors
 {'CPU': 4} * 5 (PACK): 420+ pending placement groups
 {'CPU': 16}: 100+ from request_resources()

Node: 192.168.1.1
 Usage:
  0.1/1 AcceleratorType:V100
  5.0/20.0 CPU
  0.7/1 GPU
  1.00/4.000 GiB memory
  3.14/4.000 GiB object_store_memory

Node: 192.168.1.2
 Usage:
  0.9/1 AcceleratorType:V100
  15.0/20.0 CPU
  0.3/1 GPU
  1.00/12.000 GiB memory
  0.00/4.000 GiB object_store_memory

Co-authored-by: Alex <alex@anyscale.com>

* [doc/tune] fix tune stopper attribute name (#28517)

* [doc] Fix tune stopper doctests (#28531)

* [air] Use self-hosted mirror for CIFAR10 dataset (#28480)

The CIFAR10 website host has been unreliable in the past. This PR injects our own mirror into our CI packages for testing.

Signed-off-by: Kai Fricke <kai@anyscale.com>

* draft

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: mgerstgrasser <matthias@gerstgrasser.net>
Signed-off-by: Jake <DevJake@users.noreply.github.com>
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex <alex@anyscale.com>
Co-authored-by: Jun Gong <jungong@anyscale.com>
Co-authored-by: mgerstgrasser <matthias@gerstgrasser.net>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Jake <DevJake@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Co-authored-by: Árpád Rózsás <rozsasarpi@gmail.com>
PaulFenton pushed a commit to PaulFenton/ray that referenced this pull request Sep 19, 2022
…ay-project#28470)

On restore from a different IP, the SyncerCallback currently still tries to sync from a stale node IP, because `trial.last_result` has not been updated, yet. Instead, the syncer callback should keep its own map of trials to IPs, and only act on this.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: PaulFenton <paul@fentonsoftware.com>

Successfully merging this pull request may close these issues.

[Tune] Resuming to different node ip then original one doesn’t work