-
Notifications
You must be signed in to change notification settings - Fork 415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Provisioner] Additional check to make sure ssh works #2797
Conversation
@@ -241,12 +241,18 @@ def _wait_ssh_connection_direct( | |||
ssh_private_key: str, | |||
ssh_control_name: Optional[str] = None, | |||
ssh_proxy_command: Optional[str] = None) -> bool: | |||
del ssh_control_name | |||
assert ssh_proxy_command is None, 'SSH proxy command is not supported.' | |||
try: | |||
with socket.create_connection((ip, 22), timeout=1) as s: | |||
if s.recv(100).startswith(b'SSH'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will the recv buffer look like if the ssh is ready? I'm not very familiar with the new provisioner and not sure why the recursion here will fix the issue...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously we only check the socket to be available before we move to the next step. However, when the socket is ready, the actual ssh
can cause the issue in #2796 when we are doing internal file mounts.
Here we explicitly check the connection with the actual ssh
once the socket is ready to make sure the later steps can connect the cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then why should we enter another recursion if the SSH is already presented in the recv buffer? IIUC it should be:
if startswith('SSH'):
return True
return _wait_ssh_connection_indirect(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just take a look at the wait_for_ssh
function and it seems that the return value is whether an error happened? Can we add a comment to say typical error message pattern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is: when the ssh is available, it may not be able to actually connect the VM with the current ssh user, as the error suggested in #2796. That is why we do the additional check after SSH
is available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just take a look at the wait_for_ssh function and it seems that the return value is whether an error happened? Can we add a comment to say typical error message pattern?
What do you mean? The wait_for_ssh
does not return anything but raising an error when timeout.
The waiter
returns whether we can connect to the ssh?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sry I misread the function name here. I was thinking you are calling _wait_ssh_connection_indirect
inside _wait_ssh_connection_indirect
... LGTM!
Fixes #2796
The additional check for the ssh does not affect the provisioning speed significantly.
Tested with
sky launch -c test-launch --cloud aws --cpus 2
Original: 3m20s, 2m53s
New: 2m39s, 2m36s
Tested (run the relevant ones):
bash format.sh
sky launch -c test-launch --cloud aws --cpus 2
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh