Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[k8s] Handle kubernetes config and reachability errors #3043

Merged
merged 15 commits into from
Feb 20, 2024

Conversation

landscapepainter
Copy link
Collaborator

@landscapepainter landscapepainter commented Jan 29, 2024

This resolves #3013.

When ~/.kube/config was removed externally, two errors were noted each from running sky launch and sky down --purge.

sky launch: removing ~/.kube/config sets the Kubernetes cloud to be disabled. And as it was removed externally, the change does not get reflected to the state. To resolve this issue, the change was made to make sure sky check is ran when provisioning to update the state, so that write_cluster_config does not try to upload the credential file for Kubernetes cloud(~/.kube/config).

sky down --purge: runs cleanup_ports as part of the post_teardown_cleanup. While doing so for Kubernetes cloud, it tries to load the config and raises an error since it's removed. Error handling is implemented to handle the scenario.

Tested (run the relevant ones):

@landscapepainter landscapepainter changed the title [k8s] Handle errors from externally removed ~/.kube/config file [k8s,P0] Handle errors from externally removed ~/.kube/config file Jan 29, 2024
Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sky/exceptions.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
@romilbhardwaj
Copy link
Collaborator

To fix the whole user story ("sky down --purge doesn't work"), can this PR also fix #3093?

@romilbhardwaj romilbhardwaj changed the title [k8s,P0] Handle errors from externally removed ~/.kube/config file [k8s] Handle kubernetes config and reachability errors Feb 13, 2024
@romilbhardwaj
Copy link
Collaborator

Now also closes #3093.

Since the missing kubeconfig is actually an error, I have removed the silent disabling of Kubernetes cloud (cloud-specific sky.check()) and instead added a note in the error message for the user to run sky check if they want to disable Kubernetes.

Tested against repros from #3093 and #3013.

@romilbhardwaj romilbhardwaj added this to the v0.5 milestone Feb 13, 2024
…o kube-config-issue

# Conflicts:
#	sky/clouds/kubernetes.py
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this issue @landscapepainter!

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
@romilbhardwaj
Copy link
Collaborator

Thanks @Michaelvll - addressed your comments.

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix @landscapepainter @romilbhardwaj! LGTM.

@romilbhardwaj romilbhardwaj merged commit 63e5887 into skypilot-org:master Feb 20, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[k8s] sky down --purge fails on Kubernetes clusters that no longer exist
3 participants