Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix [JENKINS-44568]: restore old ways of handling ssh disconnection #66

Closed
wants to merge 4 commits into from

Conversation

Jimilian
Copy link
Contributor

@Jimilian Jimilian commented Jul 18, 2017

If I understood the source code correctly, since 57163bb re-connect is broken.
Because we do not "close" the sshConnection, as it was done before. So sshConnection object assumes that connection still exists.

As I mentioned in https://issues.jenkins-ci.org/browse/JENKINS-44568, I'm not sure that this PR really fixes the stuff, because I didn't have time to test it, so, let's wait for some feedback.

@Jimilian Jimilian closed this Jul 18, 2017
@Jimilian Jimilian reopened this Jul 18, 2017
@ssbarnea
Copy link
Contributor

I am a little bit confused because this gives me the impression that it disconnects right after receiving the data which would make the idea of polling quite useless.

I doubt we can afford to be disconnected too often or for long periods of time because we will lose events, especially because not all servers do have the ability to play missed ones.

If this is fixing the current problem it means that somehow we can endup with stalled SSH sessions, ones that do report as connected but which in fact are not receiving any data.

@rinrinne
Copy link
Contributor

rinrinne commented Jul 18, 2017

Because we do not "close" the sshConnection, as it was done before.

Reconnection is performed if calling reconnect(). In this method, sshConnection.disconnect() is called. By this, sshConnection.isConnected() returns false. Also sshConnection.isClosed() returns true, Finally null is set to sshConnction inside of finally block in run().

sshConnection has possibility to be reused for executing other SSH commands on a generated SSH Channel. So it should not be updated in this location. Gerrit stream is output of a SSH command performed on a SSH channel. So please handle SSH Channel rather than SSH Connection.

Updating line 439 to "if (channel != null) {" would help you.

@Jimilian
Copy link
Contributor Author

Jimilian commented Jul 19, 2017

@ssbarnea, this code that I copy-pasted from old version must be called only in case of broken ssh connection or -1 from read (means end - channel is closed). Maybe we should distinguish this two, but in my PR I just tried to return something that was in place before to give it a try in a real life.


@rinrinne, maybe I understood the bug in wrong way, but here that we have:

  1. read at some point starts to return 0, it does not return -1.
  2. it's not clear how code moved to this state at first point.

So, I can only assume that:

  1. connection was broken or '-1` was received
  2. after that we started to re-use "broken" connection.

I can replace line 439, but I don't see how it could help. But I agree that there is some misalignment between channel.isClosed()/channel.close() and channel.isConnected()/channel.connect().

Regarding the rest:

Reconnection is performed if calling reconnect()

Correct me if I'm wrong, but it must be called explicitly. Before it was done implicitly by calling connect each time inside the loop. I can put reconnect instead of re-introduced code if you think that it improves readability. (small addition: nullifyWatchdog shutdowns StreamWatchdog on the line 438 - so, reconnect can not be called from watchdog anymore).

In this method, sshConnection.disconnect() is called

Do you mean current PR or upstream code? In current PR I call it, just to restore old behaviour (and trigger reconnection implicitly). If you think that disconnect is called in current upstream version as well, please, point me to this call, I failed to find it in new version.

Based on mawinter69 investigation
@rinrinne
Copy link
Contributor

Sorry for late reply. Maybe #67 is root cause.

  1. read at some point starts to return 0, it does not return -1.

-1 indicates connection is broken. 0 means buffer.read() reads nothing from InputStream. It is caused by:

  1. No incoming data is arrived.
  2. Buffer for stored data is already full.

In this case, I guess that '2' caused this issue.

I can replace line 439, but I don't see how it could help.

Replacing line 439 makes channel disconnecting without checking channel status if channel has been created.

Before it was done implicitly by calling connect each time inside the loop.

Before introduced 57163bb, SshChannel was managed within SshConnection. So needed SshConnection.disconnect(). But after introducing non-blocking operation, SshChannel can be accessed within GerritConnection. So no need to treat SshConnection.disconnect() for each Gerrit stream-events.

Regarding JSch source code, channel.connect() has actual connection to remote service.

(small addition: nullifyWatchdog shutdowns StreamWatchdog on the line 438 - so, reconnect can not be called from watchdog anymore).

nullfyWatchdog() is called within finally block. It means that while loop for consuming stream is already exited. If GerritConnection.shutdown() is not called explicitly, consuming loop will be re-entered. This is the same as reconnection process.

Do you mean current PR or upstream code?

Means upstream code. So I submit my comment to this PR instead of each lines in your commit.

@rsandell rsandell mentioned this pull request Aug 4, 2017
@rsandell
Copy link
Collaborator

rsandell commented Aug 4, 2017

So I now have three different approaches claiming to fix JENKINS-44568 and JENKINS-44414; #66, #68 and #69
That is great! but also confusing of which one is more credible. From what I can tell #68 has been soaking in the "real world" and claims to be working there and that is always a good sign and gets more credibility because of that.
But #69 actually has unit tests that verifies something along the lines of the problem so we don't get regressions in the future which makes me more inclined to merge that.
#66 has a healthy code review discussion that also makes it a bit safer to merge since it has had more eyes on it.

So which one should I go for?

@mawinter69
Copy link
Contributor

I'm sure that this #66 will not fix the problem, so does #67 not solve the problem though due to the high buffer of 256kb it becomes very unlikely to happen.

@rinrinne
Copy link
Contributor

rinrinne commented Aug 4, 2017

+1 to @mawinter69.

#66 handles connection but it is not essentially needed. As I mentioned, this may affect other channels created from the same connection.
Regarding #67, I had similar concern with @mawinter69. So I started to create a patch to remove buffer size issue.
#68 was submitted during my implementation. It solves such concern by another approach. Actually I just hesitated to upload my patch. However, I decided to do it because I thought that it would be helpful for something.

So I think it is better to adopt either #68 or #69.

@Jimilian
Copy link
Contributor Author

Jimilian commented Aug 4, 2017

I think somebody should give a try to #69 in real world or we can ask @mawinter69 to add a more unit tests (basically, it's the same test cases that were introduced in #69).

I agree that my PR doesn't solve the problem and I will close it after the real fix will be merged. But also I introduced JVM parameter to change the default size of SSH_RX_BUFFER and it can be used in case of different "reality".

@Jimilian
Copy link
Contributor Author

#68 was merged. Close this one.

@Jimilian Jimilian closed this Aug 16, 2017
@Jimilian Jimilian deleted the jenkins_44568 branch August 24, 2017 19:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants