[k8s] Increase ssh timeout when calling uptime
#2785
Merged
Issue
When provisioning on a Kubernetes cluster over a high-latency connection (e.g., a different geographic region + VPN), SkyPilot would get stuck at launch. This is because the `uptime` call run by the ray autoscaler to check for cluster liveness uses a very short `ConnectTimeout=5s`. Using a larger SSH timeout fixes the issue.

Fix in this PR
This PR introduces a monkey patch to `SSHCommandRunner` in `kubernetes/node_provider.py`, using a larger timeout when running `uptime` to check cluster liveness. Monkey patching `SSHCommandRunner.run` is necessary because the ray autoscaler sets the timeout on a per-call basis (as an arg to `SSHCommandRunner.run`), and the 5s timeout hardcoded in `updater.py::NodeUpdater.wait_ready()` is hard to change without duplicating a large chunk of ray autoscaler code. Thus, we monkey patch the `run` method to check whether the command being run is `uptime` and, if so, raise the timeout to 10s.
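The pattern can be sketched as follows. This is a minimal, self-contained illustration with a stand-in class, not the PR's actual code: the real target is ray's `SSHCommandRunner`, and the exact argument names and values in the PR may differ.

```python
class SSHCommandRunner:
    """Stand-in for ray's SSHCommandRunner (illustration only)."""

    def run(self, cmd, timeout=5, **kwargs):
        # The real method executes `cmd` over SSH, passing `timeout`
        # through as the SSH ConnectTimeout.
        return f"ran {cmd!r} with ConnectTimeout={timeout}s"


# Keep a reference to the original method so the patch can delegate to it.
_original_run = SSHCommandRunner.run


def _run_with_longer_uptime_timeout(self, cmd, timeout=5, **kwargs):
    # The autoscaler hardcodes a 5s timeout for its `uptime` liveness
    # probe; intercept that one call and use a larger value instead.
    if cmd == 'uptime':
        timeout = 10
    return _original_run(self, cmd, timeout=timeout, **kwargs)


# Install the patch: every subsequent `run` call goes through the wrapper.
SSHCommandRunner.run = _run_with_longer_uptime_timeout

print(SSHCommandRunner().run('uptime'))  # liveness probe gets 10s
print(SSHCommandRunner().run('ls'))      # other commands keep the default 5s
```

Patching only the `uptime` case keeps the change narrow: all other autoscaler commands retain whatever timeout the caller passes in.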
Tested (run the relevant ones):
- `bash format.sh`
- `sky launch -c test --cloud kubernetes` on GKE and EKS clusters over a high-latency connection.