
[Core] Trigger failover based on whether the previous cluster was ever UP #2977

Merged: 19 commits into master from fix-failover-logic on Jan 18, 2024

Conversation

@Michaelvll (Collaborator) commented on Jan 11, 2024:

The previous logic for triggering failover was based on the previous cluster status being None or INIT. This can be buggy: if the previous cluster had been UP before but is somehow in the INIT state at the moment, the provisioner may terminate the cluster when run_instance fails on the VM.

This PR makes the failover a bit more conservative and robust, so that it never terminates clusters that were ever UP before.
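
For illustration, here is a minimal sketch of the intended condition change; the names below are hypothetical stand-ins, not SkyPilot's actual internals:

```python
# Hypothetical sketch of the failover condition change; names are
# illustrative, not SkyPilot's actual internals.
from enum import Enum
from typing import Optional


class ClusterStatus(Enum):
    INIT = 'INIT'
    UP = 'UP'
    STOPPED = 'STOPPED'


def should_trigger_failover(prev_status: Optional[ClusterStatus],
                            ever_up: bool) -> bool:
    # Old logic: fail over (and possibly terminate) whenever the
    # previous status was None or INIT, even if the cluster had been
    # UP before and merely drifted back to INIT (e.g., Ray stopped).
    # New logic: additionally require that the cluster was never UP,
    # so the disks of a previously-UP cluster are preserved.
    return prev_status in (None, ClusterStatus.INIT) and not ever_up
```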

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    1. Put the cluster in INIT: sky launch -c test-vm --cloud gcp touch myfile.txt; ssh test-vm; ray stop; sky status -r
    2. Manually make the provisioning raise an error: add raise RuntimeError (see the sketch after this list)
    3. Launch again on the cluster: sky launch -c test-vm ls myfile.txt
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh
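
For step 2 above, a hypothetical way to inject the provisioning failure (the function below is a stand-in for the real provisioning entry point, not SkyPilot's actual code):

```python
# Hypothetical fault injection for manual testing: make the
# provisioning path raise, so the failover handling is exercised.
def run_instances(*args, **kwargs):
    raise RuntimeError('Injected provisioning failure for testing.')
```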

@concretevitamin (Collaborator) left a comment:

Thanks @Michaelvll! Some questions.

Comment on lines 4313 to 4318
# If we don't refresh the state of the cluster and reset it
# back to STOPPED, our failover logic will consider it as an
# abnormal cluster after hitting resources capacity limit on
# the cloud, and will start failover. This is not desired,
# because the user may want to keep the data on the disk of
# that cluster.
@concretevitamin (Collaborator):

Since now we have the "ever up" logic, is this comment still accurate? Do we still need to force-refresh for INIT (I guess it doesn't harm to properly restore it to STOPPED for nicer UX, but perhaps the avoid-failover reason no longer applies)?

@Michaelvll (Collaborator, Author):

Good point! Updated the comment. PTAL. : )

Comment on lines 1296 to 1300
message = (f'Failed to launch the cluster {cluster_name!r}. '
'It is now stopped.\n\tTo remove the cluster '
f'please run: sky down {cluster_name}\n\t'
'Try launching the cluster again with: '
f'sky start {cluster_name}')
@concretevitamin (Collaborator):

Shall we use the same message as L1305? Seems applicable in both cases.

@Michaelvll (Collaborator, Author):

No, they are different. Here it is the case where we restart/re-launch a cluster that was ever up and the launch fails: the cluster is stopped and we fail immediately, without triggering the failover.

However, in the case at L1305, the previous cluster is terminated and the failover is triggered. We do not need to print the hints, as the failover will automatically do the termination and re-launch.
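
For clarity, a hypothetical sketch of the two branches (helper names and structure are placeholders, not SkyPilot's actual functions):

```python
# Hypothetical sketch: how the two launch-failure cases diverge.
def handle_launch_failure(cluster_name: str, ever_up: bool) -> None:
    if ever_up:
        # Previously-UP cluster: stop it and fail fast with recovery
        # hints; failover is not triggered and the disks are kept.
        print(f'Failed to launch the cluster {cluster_name!r}. '
              f'It is now stopped.\n\tTo remove the cluster '
              f'please run: sky down {cluster_name}\n\t'
              f'Try launching the cluster again with: '
              f'sky start {cluster_name}')
    else:
        # Never-UP cluster: failover terminates it and retries the
        # launch elsewhere, so no manual hints are needed.
        print(f'Terminating {cluster_name!r} and failing over to '
              'other locations.')
```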

@concretevitamin (Collaborator) commented:

Minor UX: while testing ctrl-c at various points in time + relaunching, I noticed this message:

...

I 01-13 14:53:28 cloud_vm_ray_backend.py:1285] Cluster 'dbg' (status: INIT) was previously launched in GCP us-central1. Relaunching in that region.
I 01-13 14:53:33 provisioner.py:73] Launching on GCP us-central1 (us-central1-a)
I 01-13 14:53:59 provisioner.py:362] Successfully provisioned or found existing instance.
I 01-13 14:54:30 provisioner.py:425] Ray cluster on head is not up. Restarting...
I 01-13 14:54:41 provisioner.py:464] Successfully provisioned cluster: dbg
...

Shall we make the provisioner.py:425 message debug()? It seems too low-level to print.
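
A sketch of the suggested demotion, assuming provisioner.py uses a standard logging-style logger:

```python
import logging

logger = logging.getLogger('provisioner')

# Before: logger.info('Ray cluster on head is not up. Restarting...')
# After: the restart detail is too low-level for normal output.
logger.debug('Ray cluster on head is not up. Restarting...')
```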

@concretevitamin (Collaborator) left a comment:

Thanks @Michaelvll - some quick nits.

@concretevitamin (Collaborator) left a comment:

LGTM, thanks @Michaelvll.

Michaelvll and others added 2 commits January 16, 2024 16:30
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
@concretevitamin (Collaborator) commented:

Just tried to start a GCP(a2-ultragpu-8g, {'A100-80GB': 8}) cluster (previously stopped) using this branch. Got:

» sky start sky-f41e-zongheng
Restarting 1 cluster: sky-f41e-zongheng. Proceed? [Y/n]:
I 01-16 20:42:08 cloud_vm_ray_backend.py:1375] To view detailed progress: tail -n100 -f /Users/zongheng/sky_logs/sky-2024-01-16-20-42-07-560865/provision.log
I 01-16 20:42:13 cloud_vm_ray_backend.py:1286] Cluster 'sky-f41e-zongheng' (status: STOPPED) was previously launched in GCP us-central1. Relaunching in that region.
I 01-16 20:42:17 provisioner.py:73] Launching on GCP us-central1 (us-central1-a)
W 01-16 20:42:32 instance.py:321] The number of running instances is different from the requested number after provisioning (requested: 1, observed: 0). This could be some instances failed to start or some resource leak.
E 01-16 20:42:35 provisioner.py:498] *** Failed setting up cluster. ***
RuntimeError: Provision failed for cluster 'sky-f41e-zongheng'. Could not find any head instance.

Is this message expected? (Is it from the GCP provisioner V2 PR?) The instance.py:321 message and the last line, Could not find any head instance., are not as direct as before, which mentioned resource unavailability.

@Michaelvll (Collaborator, Author) commented on Jan 17, 2024:

> Just tried to start a GCP(a2-ultragpu-8g, {'A100-80GB': 8}) cluster (previously stopped) using this branch. […] Is this message expected? (Is it from the GCP provisioner V2 PR?) The instance.py:321 message and the last line, Could not find any head instance., are not as direct as before, which mentioned resource unavailability.

This is a great catch! It seems our original wait_for_operation was not functioning correctly: it did not raise an error when an operation failed, which let operations like start_instances pass even when an error occurred. 21760c7 fixes this by polling the status of each operation and checking its return value. PTAL @concretevitamin.
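
For context, here is a minimal sketch of the polling-with-error-checking approach described above, using the public GCP Compute zoneOperations API via google-api-python-client. This is an illustration of the idea, not the actual code in 21760c7:

```python
import time


def wait_for_operation(compute, project: str, zone: str,
                       operation_name: str, timeout: int = 300) -> dict:
    """Poll a zonal operation until DONE; raise if it reports errors.

    `compute` is a googleapiclient.discovery client for ('compute', 'v1').
    """
    start = time.time()
    while time.time() - start < timeout:
        result = compute.zoneOperations().get(
            project=project, zone=zone, operation=operation_name).execute()
        if result.get('status') == 'DONE':
            # The key fix: a DONE operation may still carry an error
            # payload; previously this was never checked, so failures
            # in e.g. start_instances passed silently.
            if 'error' in result:
                raise RuntimeError(
                    f'Operation {operation_name} failed: '
                    f'{result["error"].get("errors")}')
            return result
        time.sleep(1)
    raise TimeoutError(f'Operation {operation_name} timed out.')
```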

Tested:

  • pytest tests/test_smoke.py --gcp
  • sky launch -c test-a100-failover --cloud gcp --gpus A100-80GB:8 -i 0 --retry-until-up; after the cluster is stopped, run sky launch -c test-a100-failover and check that the provisioning works correctly.

@Michaelvll merged commit c1f28bc into master on Jan 18, 2024; 19 checks passed.
@Michaelvll deleted the fix-failover-logic branch on January 18, 2024 at 07:00.