[k8s] Increase ssh timeout when calling uptime #2785

Merged (3 commits) on Nov 16, 2023

Conversation

romilbhardwaj
Collaborator

Issue

When provisioning on a Kubernetes cluster over a high-latency connection (e.g., different geographic region + VPN), SkyPilot would get stuck on launching. This is because the uptime call that the ray autoscaler runs to check cluster liveness uses a very short SSH timeout (ConnectTimeout=5s). Using a larger SSH timeout fixes the issue.

Fix in this PR

This PR introduces a monkey patch to SSHCommandRunner in kubernetes/node_provider.py that uses a larger timeout when running uptime to check cluster liveness. Monkey patching SSHCommandRunner.run is necessary because the ray autoscaler sets the timeout on a per-call basis (as an arg to SSHCommandRunner.run), and the 5s timeout hardcoded in updater.py::NodeUpdater.wait_ready() is hard to modify without duplicating a large chunk of ray autoscaler code.

Thus, we monkey patch the run method to check if the command being run is 'uptime', and if so change the timeout to 10s.
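
For readers unfamiliar with the pattern, here is a minimal sketch of the idea, not the code merged in this PR. `FakeSSHCommandRunner` is a hypothetical stand-in for the ray autoscaler's `SSHCommandRunner`, and its simplified `run(cmd, timeout=...)` signature and the `patch_uptime_timeout` helper name are assumptions made for illustration; the real class takes more arguments.

```python
import functools


class FakeSSHCommandRunner:
    """Hypothetical, simplified stand-in for the ray autoscaler's SSHCommandRunner."""

    def run(self, cmd, timeout=120, **kwargs):
        # A real runner would invoke ssh; here we only report the timeout used.
        return f"ssh -o ConnectTimeout={timeout}s {cmd}"


UPTIME_TIMEOUT_SECONDS = 10  # larger timeout described in this PR


def patch_uptime_timeout(runner_cls, new_timeout=UPTIME_TIMEOUT_SECONDS):
    """Wrap runner_cls.run so that 'uptime' calls use new_timeout."""
    original_run = runner_cls.run

    @functools.wraps(original_run)
    def patched_run(self, cmd, *args, **kwargs):
        if cmd == 'uptime':
            if args:
                # Assumption: timeout is the first positional arg after cmd.
                args = (new_timeout,) + args[1:]
            else:
                kwargs['timeout'] = new_timeout
        return original_run(self, cmd, *args, **kwargs)

    runner_cls.run = patched_run


patch_uptime_timeout(FakeSSHCommandRunner)
runner = FakeSSHCommandRunner()
uptime_cmd = runner.run('uptime', timeout=5)  # caller's 5s is overridden
other_cmd = runner.run('ls', timeout=5)       # other commands are untouched
```

Only the liveness check is affected, so timeouts chosen by the caller for every other command are left as they were.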

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • sky launch -c test --cloud kubernetes on GKE and EKS clusters on a high latency connection.

@romilbhardwaj romilbhardwaj added the k8s Kubernetes related items label Nov 14, 2023
Collaborator

@landscapepainter landscapepainter left a comment

Confirmed it runs successfully to launch k8s instance. LGTM!

@romilbhardwaj romilbhardwaj merged commit ef8839a into master Nov 16, 2023
19 checks passed
@romilbhardwaj romilbhardwaj deleted the k8s_ssh_uptime_timeout branch November 16, 2023 06:57
Labels
k8s Kubernetes related items

2 participants