
Improve #196 and #211 #214

Merged: 5 commits merged into warp-tech:main on Jan 18, 2024

Conversation

@mmirate (Contributor) commented Nov 13, 2023

  • Rewrite these two features using disjunctive futures, in order to cleanly make the timers optional.
    This works because Either<A, B> implements Future when both arms do: https://docs.rs/futures/latest/futures/future/enum.Either.html#impl-Future-for-Either%3CA,+B%3E.
    It should also be the most efficient approach, because reusing the same futures means Tokio has to do less bookkeeping. Every non-diverging exit from the select! resets the keepalive timer, and also resets the inactivity timer if any nontrivial I/O happened.
  • Add an analogue of OpenSSH's ServerAliveCountMax.
    A counter is persisted on the Session; incremented when a keepalive is sent; and reset when a REQUEST_SUCCESS or REQUEST_FAILURE is received (presumably in reply to a keepalive). If it passes a configurable threshold, the connection is cut. (Both this counter and the disjunctive-futures trick are sketched after this list.)
  • Mutate the Session to pass information back to the main bg loop from the plaintext packet reader, so that only nontrivial data transfer resets the inactivity timer (and so that ServerAliveCountMax is judged correctly).
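
A minimal sketch of the first two mechanisms, with illustrative names (keepalive_timer and KeepaliveState are made up for this example; they are not russh's actual items):

```rust
use std::time::Duration;

use futures::future::{self, Either, Pending};
use tokio::time::{sleep, Sleep};

/// If keepalives are enabled, a timer; otherwise a future that never
/// resolves. Both arms unify into one concrete `Either` type, which itself
/// implements `Future`, so `select!` can poll it without boxing.
fn keepalive_timer(interval: Option<Duration>) -> Either<Sleep, Pending<()>> {
    match interval {
        Some(d) => Either::Left(sleep(d)),
        None => Either::Right(future::pending()),
    }
}

/// Bookkeeping for the ServerAliveCountMax analogue described above.
struct KeepaliveState {
    unanswered: u32,
    max_unanswered: u32, // the configurable threshold
}

impl KeepaliveState {
    /// Called whenever a keepalive is sent; `false` means cut the connection.
    fn on_keepalive_sent(&mut self) -> bool {
        self.unanswered += 1;
        self.unanswered <= self.max_unanswered
    }

    /// Called on REQUEST_SUCCESS or REQUEST_FAILURE, presumed to answer
    /// the most recent keepalive.
    fn on_reply_received(&mut self) {
        self.unanswered = 0;
    }
}
```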

@mmirate (Contributor, Author) commented Nov 13, 2023

This is related to #196 and #211 but I guess that putting the numbers in the title doesn't do anything. @amtelekom, how badly did I misunderstand what you were looking for?

(Outdated review thread on russh/src/client/mod.rs, since resolved.)
@amtelekom (Contributor) commented:

> how badly did I misunderstand what you were looking for?

First of all thanks for looking out for my needs :)

Your branch works fine for our purposes - we just need to be able to detect, on long-standing SSH connections (Netconf transport tunnels), whether network connectivity was lost or the server exited without closing down the connection correctly.

Since our (OpenSSH-based) servers can send keepalives, just copying the inactivity_timeout from the russh server side was the quickest solution; but sending our own keepalives and then timing out on non-responsiveness is actually the better fit for our specific situation.

While it works, it feels like the activity tracking goes farther than needed?
As mentioned, using the inactivity_timeout config just came from the russh server side, where it is only used to add a timeout to the select! loop.
In your current PR, the interpretation of the flag differs between the server and client side.

It might still be the better way of tracking "inactivity", but then maybe we should also port this to the Server side for consistency, and expand the explanation in the doc comment in the Config struct?

@mmirate (Contributor, Author) commented Nov 14, 2023

> First of all thanks for looking out for my needs :)
>
> Your branch works fine for our purposes - we just need to be able to detect, on long-standing SSH connections (Netconf transport tunnels), whether network connectivity was lost or the server exited without closing down the connection correctly.

Well, I would marvel at the coincidence that our respective employers have the same motivation for us to use russh, but I guess there aren't many other client-side use cases of SSH that can't be done "simpler" by just shelling out to OpenSSH's client program with tokio::process. (Though I'd assume that would be less scalable, since each russh Session surely uses less memory than an entire ssh process...)

> Since our (OpenSSH-based) servers can send keepalives, just copying the inactivity_timeout from the russh server side was the quickest solution; but sending our own keepalives and then timing out on non-responsiveness is actually the better fit for our specific situation.

Ah, so that feature's use case now makes a lot more sense. I guess that's what I might have ended up doing, if I had any influence over the SSH servers' configuration.

> While it works, it feels like the activity tracking goes farther than needed?

My best guess is that you meant: the activity tracking should only say "even though a packet was read, it didn't merit resetting the inactivity timer" if the packet in question was clearly a keepalive request or reply. Is this correct?

(By contrast, what I wrote will also avoid resetting the inactivity timer in case of XON/XOFF, unhandled global requests, unhandled channel requests, and unhandled packets. Upon reflection, I see that's clearly a subjective judgment of mine.)
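
(Concretely, in terms of the standard message numbers from RFC 4253/4254, the stricter keepalive-only reading might look like the hypothetical check below; this is not the PR's actual code:)

```rust
/// Standard message numbers from RFC 4253 §12 and RFC 4254.
const SSH_MSG_IGNORE: u8 = 2;
const SSH_MSG_DEBUG: u8 = 4;
const SSH_MSG_REQUEST_SUCCESS: u8 = 81;
const SSH_MSG_REQUEST_FAILURE: u8 = 82;

/// Hypothetical classifier: under the stricter reading, only keepalive
/// replies and other no-op traffic fail to reset the inactivity timer.
fn resets_inactivity_timer(msg_type: u8) -> bool {
    !matches!(
        msg_type,
        SSH_MSG_IGNORE | SSH_MSG_DEBUG | SSH_MSG_REQUEST_SUCCESS | SSH_MSG_REQUEST_FAILURE
    )
}
```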

> As mentioned, using the inactivity_timeout config just came from the russh server side, where it is only used to add a timeout to the select! loop. In your current PR, the interpretation of the flag differs between the server and client side.
>
> It might still be the better way of tracking "inactivity", but then maybe we should also port this to the Server side for consistency, and expand the explanation in the doc comment in the Config struct?

You're right, I have thus far been paying zero attention to the Server side of russh. I'll take some time to see where this all fits into the Server.

@mmirate (Contributor, Author) commented Nov 20, 2023

> I'll take some time to see where this all fits into the Server.

I have concluded this process, and thus I believe my recent commit brings Client and Server into parity regarding these two features.

Based on what I saw on the Server side - and contrary to the code I wrote in my previous PR - the purpose of the inactivity timer, relative to OpenSSH-style ServerAliveInterval or ClientAliveInterval, seems to be twofold: on the one hand, it does not let the peer hold the connection open merely by responding to a ping-pong; on the other hand, it does let the local application hold the connection open by sending data to the peer, even without expecting any responses whatsoever to that data (not even window size adjustments, rekeyings, etc.). That is what I assumed when writing this last commit.
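
(Sketched as a standalone policy function, with hypothetical event names rather than russh's actual internals, that reading is:)

```rust
use std::time::{Duration, Instant};

/// What the bg loop just observed; a simplification of the packet reader.
enum Activity {
    NontrivialRead, // real data from the peer
    KeepaliveReply, // REQUEST_SUCCESS/FAILURE answering our ping-pong
    LocalWrite,     // the local application sent data to the peer
}

/// Reset policy as described above: ping-pong replies never extend the
/// inactivity deadline, but local sends (and nontrivial reads) do.
fn apply(activity: Activity, timeout: Duration, deadline: &mut Instant) {
    match activity {
        Activity::NontrivialRead | Activity::LocalWrite => {
            *deadline = Instant::now() + timeout;
        }
        Activity::KeepaliveReply => {} // the peer cannot hold the line open
    }
}
```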

@amtelekom (Contributor) commented Nov 22, 2023

Hmm,

running your updated commit with

```rust
let config = Arc::new(sshclient::Config {
    keepalive_interval: Some(Duration::from_secs(60)),
    ..sshclient::Config::default()
});
let client = SSHClient {};
let mut ssh = sshclient::connect(config, socket_addr, client).await?;
// ...
```

against a default-configuration OpenSSH server (i.e. one that doesn't send keepalives on its own) keeps running into timeouts.

Activating the debug outputs for russh, it seems there is no wait time between the keepalive attempts being sent, which also doesn't give the remote side more than a few nanoseconds to respond (the previous russh debug-message timestamp before these is exactly a minute older):

```
2023-11-22T09:53:18Z DEBUG russh::cipher: writing, seqn = 558
2023-11-22T09:53:18Z DEBUG russh::cipher: padding length 4
2023-11-22T09:53:18Z DEBUG russh::cipher: packet_length 32
2023-11-22T09:53:18Z DEBUG russh::cipher: writing, seqn = 559
2023-11-22T09:53:18Z DEBUG russh::cipher: padding length 4
2023-11-22T09:53:18Z DEBUG russh::cipher: packet_length 32
2023-11-22T09:53:18Z DEBUG russh::cipher: writing, seqn = 560
2023-11-22T09:53:18Z DEBUG russh::cipher: padding length 4
2023-11-22T09:53:18Z DEBUG russh::cipher: packet_length 32
2023-11-22T09:53:18Z DEBUG russh::cipher: writing, seqn = 561
2023-11-22T09:53:18Z DEBUG russh::cipher: padding length 4
2023-11-22T09:53:18Z DEBUG russh::cipher: packet_length 32
2023-11-22T09:53:18Z DEBUG russh::client: Timeout, server not responding to keepalives
2023-11-22T09:53:18Z DEBUG russh::client: disconnected
2023-11-22T09:53:18Z DEBUG russh::client: drop session
```

I've pushed a fix to https://github.com/amtelekom/russh/tree/keepalive-improvement (this also fixes the fact that an additional keepalive request was sent right before disconnecting [note the four seqns in the log above], but that's more of a nicety than a functional improvement). Not sure how to best collaborate on PRs on GitHub :)
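
(For context, the tight loop in the log above is the classic select! pitfall of polling an already-completed Sleep, which resolves again immediately on every iteration. A generic illustration of the failure mode and its usual remedy - not the actual fix commit - might look like this:)

```rust
use std::time::Duration;
use tokio::time::{sleep, Instant};

#[tokio::main]
async fn main() {
    let interval = Duration::from_secs(60);
    let keepalive = sleep(interval);
    tokio::pin!(keepalive);
    loop {
        tokio::select! {
            _ = keepalive.as_mut() => {
                // ... send the keepalive request here ...
                // Without this reset, the completed Sleep would resolve
                // again immediately on the next loop iteration, producing
                // the back-to-back seqns seen in the log.
                keepalive.as_mut().reset(Instant::now() + interval);
            }
        }
    }
}
```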

@mmirate (Contributor, Author) commented Nov 22, 2023

Aye, that was a nasty little oversight. Apologies. It should work correctly with that simple fix, and since your code was slightly cleaner than what I came up with prior to reading the entirety of your message, I cherry-picked your commit into here.

@mmirate (Contributor, Author) commented Dec 6, 2023

The semver checker failed when I rebased this onto the latest commit in main; I assume that's because the version numbers no longer have the prerelease indicator that allows breaking changes (e.g. these).

@mmirate requested a review from Eugeny, December 6, 2023 17:47
@mmirate (Contributor, Author) commented Dec 18, 2023

@Eugeny (Member) commented Dec 18, 2023

fixed!

mmirate and others added 3 commits, December 18, 2023 11:48:

* Add an analogue of OpenSSH's `ServerAliveCountMax`.
* Use disjunctive futures for cleanly making these timers optional.
* Use the `Session` to pass information back to the main bg loop from the plaintext packet reader, so that only nontrivial data transfer will reset the inactivity timer. (And so that `ServerAliveCountMax` will be judged correctly.)

Also bring client and server into parity regarding timers.

Also, per OpenSSH documentation, only reset the keepalive timer when receiving data, not when sending it.

Also, always reset the inactivity timer unless the iteration was ended via sending a keepalive request.
@mmirate (Contributor, Author) commented Dec 18, 2023

Excellent! I took an educated guess at the version bump, and CI seems to be in order now.

@amtelekom (Contributor) commented Jan 15, 2024

Is there something that still needs to happen before this can be merged?

In case it helps: I feel this PR now implements things as I would have expected them to work from a user's perspective, and we've been working off this branch successfully for the last few weeks. It would be good to be able to go back to the upstream version, though :)

@Eugeny merged commit 2a4b5a0 into warp-tech:main, Jan 18, 2024
4 checks passed
@Eugeny (Member) commented Jan 18, 2024

No, sorry, I just lost track of it.
