
[Core] Trigger failover based on whether the previous cluster was ever UP #2977

Merged: 19 commits into master from fix-failover-logic on Jan 18, 2024

Conversation

@Michaelvll (Collaborator) commented on Jan 11, 2024:

The previous logic for triggering failover was based on the previous cluster status being None or INIT. This can be buggy: if the previous cluster had been UP before but is somehow in the INIT state at the moment, the provisioner may terminate the cluster when run_instance fails on the VM.

This PR makes the failover a bit more conservative and robust, so that it never terminates clusters that were ever UP before.
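
For illustration, here is a minimal sketch of the intended condition change; the names below are hypothetical stand-ins, not SkyPilot's actual internals:

```python
# Hypothetical sketch of the failover condition change; names are
# illustrative, not SkyPilot's actual internals.
from enum import Enum
from typing import Optional


class ClusterStatus(Enum):
    INIT = 'INIT'
    UP = 'UP'
    STOPPED = 'STOPPED'


def should_trigger_failover(prev_status: Optional[ClusterStatus],
                            ever_up: bool) -> bool:
    # Old logic: fail over (and possibly terminate) whenever the
    # previous status was None or INIT, even if the cluster had been
    # UP before and merely drifted back to INIT (e.g., Ray stopped).
    # New logic: additionally require that the cluster was never UP,
    # so the disks of a previously-UP cluster are preserved.
    return prev_status in (None, ClusterStatus.INIT) and not ever_up
```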

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    1. Put the cluster in INIT: sky launch -c test-vm --cloud gcp touch myfile.txt; ssh test-vm; ray stop; sky status -r
    2. Manually make the provisioning raise an error: add raise RuntimeError (see the sketch after this list)
    3. Launch again on the cluster: sky launch -c test-vm ls myfile.txt
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh
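
For step 2 above, a hypothetical way to inject the provisioning failure (the function below is a stand-in for the real provisioning entry point, not SkyPilot's actual code):

```python
# Hypothetical fault injection for manual testing: make the
# provisioning path raise, so the failover handling is exercised.
def run_instances(*args, **kwargs):
    raise RuntimeError('Injected provisioning failure for testing.')
```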

@concretevitamin (Collaborator) left a comment:

Thanks @Michaelvll! Some questions.

Comment on lines 4313 to 4318
# If we don't refresh the state of the cluster and reset it
# back to STOPPED, our failover logic will consider it as an
# abnormal cluster after hitting resources capacity limit on
# the cloud, and will start failover. This is not desired,
# because the user may want to keep the data on the disk of
# that cluster.
@concretevitamin (Collaborator):

Since now we have the "ever up" logic, is this comment still accurate? Do we still need to force-refresh for INIT (I guess it doesn't harm to properly restore it to STOPPED for nicer UX, but perhaps the avoid-failover reason no longer applies)?

@Michaelvll (Collaborator, Author):

Good point! Updated the comment. PTAL. : )

Comment on lines 1296 to 1300
message = (f'Failed to launch the cluster {cluster_name!r}. '
'It is now stopped.\n\tTo remove the cluster '
f'please run: sky down {cluster_name}\n\t'
'Try launching the cluster again with: '
f'sky start {cluster_name}')
@concretevitamin (Collaborator):

Shall we use the same message as L1305? Seems applicable in both cases.

@Michaelvll (Collaborator, Author):

No, they are different. Here it is the case where we restart/re-launch a cluster that was ever up and the launch fails: the cluster is stopped and we fail immediately, without triggering the failover.

However, in the case at L1305, the previous cluster is terminated and the failover is triggered. We do not need to print the hints, as the failover will automatically do the termination and re-launch.
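
For clarity, a hypothetical sketch of the two branches (helper names and structure are placeholders, not SkyPilot's actual functions):

```python
# Hypothetical sketch: how the two launch-failure cases diverge.
def handle_launch_failure(cluster_name: str, ever_up: bool) -> None:
    if ever_up:
        # Previously-UP cluster: stop it and fail fast with recovery
        # hints; failover is not triggered and the disks are kept.
        print(f'Failed to launch the cluster {cluster_name!r}. '
              f'It is now stopped.\n\tTo remove the cluster '
              f'please run: sky down {cluster_name}\n\t'
              f'Try launching the cluster again with: '
              f'sky start {cluster_name}')
    else:
        # Never-UP cluster: failover terminates it and retries the
        # launch elsewhere, so no manual hints are needed.
        print(f'Terminating {cluster_name!r} and failing over to '
              'other locations.')
```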

@concretevitamin (Collaborator) commented:

Minor UX: while testing ctrl-c at various points in time + relaunching, I noticed this message:

...

I 01-13 14:53:28 cloud_vm_ray_backend.py:1285] Cluster 'dbg' (status: INIT) was previously launched in GCP us-central1. Relaunching in that region.
I 01-13 14:53:33 provisioner.py:73] Launching on GCP us-central1 (us-central1-a)
I 01-13 14:53:59 provisioner.py:362] Successfully provisioned or found existing instance.
I 01-13 14:54:30 provisioner.py:425] Ray cluster on head is not up. Restarting...
I 01-13 14:54:41 provisioner.py:464] Successfully provisioned cluster: dbg
...

Shall we make the provisioner.py:425 message debug()? It seems too low-level to print.
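
A sketch of the suggested demotion, assuming provisioner.py uses a standard logging-style logger:

```python
import logging

logger = logging.getLogger('provisioner')

# Before: logger.info('Ray cluster on head is not up. Restarting...')
# After: the restart detail is too low-level for normal output.
logger.debug('Ray cluster on head is not up. Restarting...')
```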

@concretevitamin (Collaborator) left a comment:

Thanks @Michaelvll - some quick nits.

@concretevitamin (Collaborator) left a comment:

LGTM, thanks @Michaelvll.

Michaelvll and others added 2 commits January 16, 2024 16:30
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
@concretevitamin (Collaborator) commented:

Just tried to start a GCP(a2-ultragpu-8g, {'A100-80GB': 8}) cluster (previously stopped) using this branch. Got:

» sky start sky-f41e-zongheng
Restarting 1 cluster: sky-f41e-zongheng. Proceed? [Y/n]:
I 01-16 20:42:08 cloud_vm_ray_backend.py:1375] To view detailed progress: tail -n100 -f /Users/zongheng/sky_logs/sky-2024-01-16-20-42-07-560865/provision.log
I 01-16 20:42:13 cloud_vm_ray_backend.py:1286] Cluster 'sky-f41e-zongheng' (status: STOPPED) was previously launched in GCP us-central1. Relaunching in that region.
I 01-16 20:42:17 provisioner.py:73] Launching on GCP us-central1 (us-central1-a)
W 01-16 20:42:32 instance.py:321] The number of running instances is different from the requested number after provisioning (requested: 1, observed: 0). This could be some instances failed to start or some resource leak.
E 01-16 20:42:35 provisioner.py:498] *** Failed setting up cluster. ***
RuntimeError: Provision failed for cluster 'sky-f41e-zongheng'. Could not find any head instance.

Is this message expected? (Is it from the GCP provisioner V2 PR?) The instance.py:321 message and the last line, Could not find any head instance., are not as direct as before, which mentioned resource unavailability.

@Michaelvll (Collaborator, Author) commented on Jan 17, 2024:

> Just tried to start a GCP(a2-ultragpu-8g, {'A100-80GB': 8}) cluster (previously stopped) using this branch. […] Is this message expected? (Is it from the GCP provisioner V2 PR?) The instance.py:321 message and the last line, Could not find any head instance., are not as direct as before, which mentioned resource unavailability.

This is a great catch! It seems our original wait_for_operation was not functioning correctly: it did not raise an error when an operation failed, which let operations like start_instances pass even when an error occurred. 21760c7 fixes this by polling the status of each operation and checking its return value. PTAL @concretevitamin.
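
For context, here is a minimal sketch of the polling-with-error-checking approach described above, using the public GCP Compute zoneOperations API via google-api-python-client. This is an illustration of the idea, not the actual code in 21760c7:

```python
import time


def wait_for_operation(compute, project: str, zone: str,
                       operation_name: str, timeout: int = 300) -> dict:
    """Poll a zonal operation until DONE; raise if it reports errors.

    `compute` is a googleapiclient.discovery client for ('compute', 'v1').
    """
    start = time.time()
    while time.time() - start < timeout:
        result = compute.zoneOperations().get(
            project=project, zone=zone, operation=operation_name).execute()
        if result.get('status') == 'DONE':
            # The key fix: a DONE operation may still carry an error
            # payload; previously this was never checked, so failures
            # in e.g. start_instances passed silently.
            if 'error' in result:
                raise RuntimeError(
                    f'Operation {operation_name} failed: '
                    f'{result["error"].get("errors")}')
            return result
        time.sleep(1)
    raise TimeoutError(f'Operation {operation_name} timed out.')
```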

Tested:

  • pytest tests/test_smoke.py --gcp
  • sky launch -c test-a100-failover --cloud gcp --gpus A100-80GB:8 -i 0 --retry-until-up; after the cluster is stopped, run sky launch -c test-a100-failover and check that the provisioning works correctly.

@Michaelvll merged commit c1f28bc into master on Jan 18, 2024; 19 checks passed.
@Michaelvll deleted the fix-failover-logic branch on January 18, 2024 at 07:00.