[k8s] Increase ssh timeout when calling uptime
#2785
Merged
Issue
When provisioning on a Kubernetes cluster over a high-latency connection (e.g., a different geographic region + VPN), SkyPilot would get stuck at launch. This is because the `uptime` call run by the ray autoscaler to check for cluster liveness uses a very short `ConnectTimeout=5s`. Using a larger SSH timeout fixes the issue.

Fix in this PR
This PR introduces a monkey patch to `SSHCommandRunner` in `kubernetes/node_provider.py`, using a larger timeout when running `uptime` to check cluster liveness. Monkey patching `SSHCommandRunner.run` is necessary because the ray autoscaler sets the timeout on a per-call basis (as an arg to `SSHCommandRunner.run`), and the 5s timeout hardcoded in `updater.py::NodeUpdater.wait_ready()` is hard to change without duplicating a large chunk of ray autoscaler code. Thus, we monkey patch the `run` method to check whether the command being run is `uptime` and, if so, raise the timeout to 10s.
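The pattern can be sketched as follows. This is a minimal, self-contained illustration with a stand-in class, not the PR's actual code: the real target is ray's `SSHCommandRunner`, and the exact argument names and values in the PR may differ.

```python
class SSHCommandRunner:
    """Stand-in for ray's SSHCommandRunner (illustration only)."""

    def run(self, cmd, timeout=5, **kwargs):
        # The real method executes `cmd` over SSH, passing `timeout`
        # through as the SSH ConnectTimeout.
        return f"ran {cmd!r} with ConnectTimeout={timeout}s"


# Keep a reference to the original method so the patch can delegate to it.
_original_run = SSHCommandRunner.run


def _run_with_longer_uptime_timeout(self, cmd, timeout=5, **kwargs):
    # The autoscaler hardcodes a 5s timeout for its `uptime` liveness
    # probe; intercept that one call and use a larger value instead.
    if cmd == 'uptime':
        timeout = 10
    return _original_run(self, cmd, timeout=timeout, **kwargs)


# Install the patch: every subsequent `run` call goes through the wrapper.
SSHCommandRunner.run = _run_with_longer_uptime_timeout

print(SSHCommandRunner().run('uptime'))  # liveness probe gets 10s
print(SSHCommandRunner().run('ls'))      # other commands keep the default 5s
```

Patching only the `uptime` case keeps the change narrow: all other autoscaler commands retain whatever timeout the caller passes in.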
Tested (run the relevant ones):
- `bash format.sh`
- `sky launch -c test --cloud kubernetes` on GKE and EKS clusters over a high-latency connection.