[Provisioner] Additional check to make sure ssh works #2797

Michaelvll · 2023-11-16T23:02:47Z

The additional SSH check does not significantly affect provisioning speed.
Tested with sky launch -c test-launch --cloud aws --cpus 2:
Original: 3m20s, 2m53s
New: 2m39s, 2m36s

Tested (run the relevant ones):

Code formatting: bash format.sh
Any manual or new tests for this PR (please specify below)
- sky launch -c test-launch --cloud aws --cpus 2
All smoke tests: pytest tests/test_smoke.py
Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

cblmemo · 2023-11-16T23:57:36Z

sky/provision/provisioner.py

@@ -241,12 +241,18 @@ def _wait_ssh_connection_direct(
        ssh_private_key: str,
        ssh_control_name: Optional[str] = None,
        ssh_proxy_command: Optional[str] = None) -> bool:
-    del ssh_control_name
    assert ssh_proxy_command is None, 'SSH proxy command is not supported.'
    try:
        with socket.create_connection((ip, 22), timeout=1) as s:
            if s.recv(100).startswith(b'SSH'):


What will the recv buffer look like if SSH is ready? I'm not very familiar with the new provisioner, and I'm not sure why the recursion here fixes the issue...

Previously, we only checked that the socket was available before moving on to the next step. However, even when the socket is ready, the actual SSH connection can still fail, which causes the issue in #2796 during internal file mounts.
Here we explicitly check the connection with an actual ssh call once the socket is ready, to make sure the later steps can connect to the cluster.
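
For readers following the thread, the two-stage check described here could be sketched roughly as below. This is an illustration under assumptions, not the code merged in this PR: the helper names (banner_is_ssh, socket_reports_ssh, ssh_login_works, ssh_ready) and the chosen ssh options are hypothetical, and only the socket banner check mirrors the diff above.

```python
import socket
import subprocess
from typing import Callable


def banner_is_ssh(banner: bytes) -> bool:
    """Return True if the first bytes read from the port look like an SSH banner."""
    # An SSH server opens with an identification string such as b'SSH-2.0-...'.
    return banner.startswith(b'SSH')


def socket_reports_ssh(ip: str, port: int = 22, timeout: float = 1.0) -> bool:
    """Stage 1: the port accepts a TCP connection and sends an SSH banner."""
    try:
        with socket.create_connection((ip, port), timeout=timeout) as s:
            return banner_is_ssh(s.recv(100))
    except OSError:
        # Covers connection refused, unreachable host, and socket timeouts.
        return False


def ssh_login_works(ip: str, ssh_user: str, ssh_private_key: str,
                    timeout: int = 10) -> bool:
    """Stage 2 (hypothetical helper): run a no-op command over a real ssh login."""
    cmd = [
        'ssh', '-T', '-i', ssh_private_key,
        '-o', 'BatchMode=yes',  # never prompt for a password
        '-o', 'StrictHostKeyChecking=no',
        '-o', 'UserKnownHostsFile=/dev/null',
        '-o', f'ConnectTimeout={timeout}',
        f'{ssh_user}@{ip}', 'true',
    ]
    try:
        proc = subprocess.run(cmd,
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL,
                              timeout=timeout + 5,
                              check=False)
    except (subprocess.TimeoutExpired, OSError):
        return False
    return proc.returncode == 0


def ssh_ready(
    ip: str,
    ssh_user: str,
    ssh_private_key: str,
    port_check: Callable[[str], bool] = socket_reports_ssh,
    login_check: Callable[[str, str, str], bool] = ssh_login_works,
) -> bool:
    """Report ready only if the banner check AND the real login both succeed."""
    if not port_check(ip):
        return False
    return login_check(ip, ssh_user, ssh_private_key)
```

The point of the second stage is that a listening sshd can send its banner before the login user is usable, so a banner alone does not guarantee that later steps can log in.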

Then why should we enter another recursion if SSH is already present in the recv buffer? IIUC it should be:

    if startswith('SSH'):
        return True
    return _wait_ssh_connection_indirect(...)

I just took a look at the wait_for_ssh function, and it seems that the return value indicates whether an error happened? Can we add a comment describing the typical error message pattern?

The problem is that even when SSH is available, we may not be able to actually connect to the VM as the current SSH user, as the error in #2796 suggests. That is why we do the additional check after SSH is available.

> I just took a look at the wait_for_ssh function, and it seems that the return value indicates whether an error happened? Can we add a comment describing the typical error message pattern?

What do you mean? wait_for_ssh does not return anything; it only raises an error on timeout.

The waiter returns whether we can connect to the ssh?

Oh, sorry, I misread the function name here. I thought you were calling _wait_ssh_connection_indirect inside _wait_ssh_connection_indirect... LGTM!
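
To make the contract discussed in this thread concrete (the waiter returns whether the connection works, while the outer wait returns nothing and raises on timeout), here is a minimal sketch. It is not the actual wait_for_ssh implementation: the name wait_until_ready, its parameters, and the RuntimeError type are assumptions chosen for illustration.

```python
import time
from typing import Callable


def wait_until_ready(waiter: Callable[[], bool],
                     timeout: float = 600.0,
                     interval: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> None:
    """Poll `waiter` until it returns True; raise if the deadline passes.

    Hypothetical helper: returns None on success and signals failure
    only by raising, so callers never inspect a return value.
    """
    deadline = clock() + timeout
    while True:
        if waiter():
            return
        if clock() >= deadline:
            raise RuntimeError(
                f'SSH did not become ready within {timeout} seconds.')
        sleep(interval)
```

A caller would pass a zero-argument waiter, for example a functools.partial wrapping whichever connection check applies to the node, and would treat any exception as a provisioning failure.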

Michaelvll added 2 commits November 16, 2023 22:58

Additional check to make sure ssh works

5140d3f

format

6cf3ed0

Michaelvll requested review from cblmemo and suquark November 16, 2023 23:32

cblmemo reviewed Nov 16, 2023

cblmemo approved these changes Nov 17, 2023

Michaelvll merged commit 6b1bbc9 into master Nov 17, 2023
19 checks passed

Michaelvll deleted the booting-ssh-error branch November 17, 2023 03:11
