Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Provisioner] Additional check to make sure ssh works #2797

Merged
merged 2 commits into from
Nov 17, 2023

Conversation

Michaelvll
Copy link
Collaborator

Fixes #2796

The additional check for the ssh does not affect the provisioning speed significantly.
Tested with sky launch -c test-launch --cloud aws --cpus 2
Original: 3m20s, 2m53s
New: 2m39s, 2m36s

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch -c test-launch --cloud aws --cpus 2
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@@ -241,12 +241,18 @@ def _wait_ssh_connection_direct(
ssh_private_key: str,
ssh_control_name: Optional[str] = None,
ssh_proxy_command: Optional[str] = None) -> bool:
del ssh_control_name
assert ssh_proxy_command is None, 'SSH proxy command is not supported.'
try:
with socket.create_connection((ip, 22), timeout=1) as s:
if s.recv(100).startswith(b'SSH'):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will the recv buffer look like if the ssh is ready? I'm not very familiar with the new provisioner and not sure why the recursion here will fix the issue...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously we only check the socket to be available before we move to the next step. However, when the socket is ready, the actual ssh can cause the issue in #2796 when we are doing internal file mounts.
Here we explicitly check the connection with the actual ssh once the socket is ready to make sure the later steps can connect the cluster.

Copy link
Collaborator

@cblmemo cblmemo Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then why should we enter another recursion if the SSH is already presented in the recv buffer? IIUC it should be:

if startswith('SSH'):
    return True
return _wait_ssh_connection_indirect(...)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just take a look at the wait_for_ssh function and it seems that the return value is whether an error happened? Can we add a comment to say typical error message pattern?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is: when the ssh is available, it may not be able to actually connect the VM with the current ssh user, as the error suggested in #2796. That is why we do the additional check after SSH is available.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just take a look at the wait_for_ssh function and it seems that the return value is whether an error happened? Can we add a comment to say typical error message pattern?

What do you mean? The wait_for_ssh does not return anything but raising an error when timeout.

The waiter returns whether we can connect to the ssh?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sry I misread the function name here. I was thinking you are calling _wait_ssh_connection_indirect inside _wait_ssh_connection_indirect... LGTM!

@Michaelvll Michaelvll merged commit 6b1bbc9 into master Nov 17, 2023
19 checks passed
@Michaelvll Michaelvll deleted the booting-ssh-error branch November 17, 2023 03:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[AWS] Unexpected ssh error log printed in CLI
2 participants