Fixes socket.io#1910 by calling transport.close() on ping timeout #336

Merged
merged 1 commit into socketio:master on Aug 27, 2015


10 participants

@apeace
Contributor
apeace commented Aug 6, 2015

There have been a few of us struggling with socketio/socket.io#1910. @nkzawa was kind enough to replicate our results and reproduce the issue, as well as point me in the right direction to find the bug.

With this addition, I can no longer reproduce using the steps in that issue. I have not done any further testing beyond that. Note that I tried both with and without transports: ['polling'] as I described in the issue, and I couldn't reproduce any longer.

Interestingly, with this fix, I am still seeing a delay of about 10-15 seconds after receiving the disconnect event before the socket is actually closed. I'm not sure why that would be the case. But I verified that without this change, even after waiting > 60 seconds, the socket never closes. So I think this fixes the issue.

I ran make test and all tests were passing except one, but that one is also failing on engine.io#master.

Let me know if there is anything else I can do. I'm planning to test this change immediately in my production environment (~200 clients), so I'll report my results here.

@nkzawa
Contributor
nkzawa commented Aug 6, 2015

👍

The delay after the disconnect event should be the close timeout of the websocket.

@whifflevcq

Thanks @apeace and @nkzawa. Did you use the polling or the websocket transport?
On the client I use socket.io-client-java. When the client performs the handshake, 3 TCP connections are established, and the 3rd one becomes the socket connection. The remaining connections are supposed to be killed after 120s (idle timeout); my problem is that they were not. I think transport.close() only takes care of the 3rd connection described above, but I'll test it. Tomorrow there will be about 1000 clients; hope everything will be fine.

@apeace
Contributor
apeace commented Aug 7, 2015

After about 20 hours in production, I'm glad to report that this patch seems to be working. We're still hovering around ~200 open sockets for ~200 clients. Typically by this point we'd have > 1000 sockets for ~200 clients.

@whifflevcq

I also tested about an hour ago with over 1000 clients using socket.io-client-java. Normally there would be about 6k TCP connections in the ESTABLISHED state; now there are ~1200. It seems the issue was solved. I will continue to monitor and report back!

@whifflevcq

After about 30h, with ~700 clients, there are ~4000 TCP connections in the ESTABLISHED state. It seems there is still a leak, but this is a good sign: previously, with about 1000 clients, there were ~30,000 TCP connections after 12 hours.

@nkzawa
Contributor
nkzawa commented Aug 9, 2015

@whifflevcq thank you for your feedback. So it's still not perfect, but I believe closing all TCP sockets on client disconnection would solve the issue.

@nkzawa nkzawa referenced this pull request in socketio/socket.io-client-java Aug 10, 2015
Closed

socket.io LINUX TCP #195

@apeace
Contributor
apeace commented Aug 10, 2015

I can confirm what @whifflevcq said. There do still seem to be some sockets left open.

After almost four days of running with this change, I'm seeing ~280 sockets open, with ~200 clients. After restarting the server (all sockets closed, then clients reconnect), I'm back to ~200 sockets open.

So this fix is a drastic improvement, but there does still seem to be another leak of socket file descriptors.

@c8439
c8439 commented Aug 11, 2015

@apeace How do you apply the transport.close() change? The results look good.

@c8439
c8439 commented Aug 11, 2015

@whifflevcq Hello, your results look good. What exactly did you modify?

@whifflevcq

@c8439 In socket.io/node_modules/engine.io/lib/socket.js, fix setPingTimeout:

```js
Socket.prototype.setPingTimeout = function () {
  var self = this;
  clearTimeout(self.pingTimeoutTimer);
  self.pingTimeoutTimer = setTimeout(function () {
    self.onClose('ping timeout');
    self.transport.close(); // the fix: force-close the transport
  }, self.server.pingInterval + self.server.pingTimeout);
};
```

@whifflevcq

@c8439 You should reference this issue in netty-socketio; thanks to @mrniko for the help.

@nicofff
nicofff commented Aug 12, 2015

After almost a day of usage this PR is looking good.
I am seeing a slight difference between open sockets and connected clients, but it's remaining constant so far. I'm thinking it might have to do with counting sockets that are closing as active.
Just updated my log collection script to account for this.

@char-lieee

I had the same issue over last couple of weeks. My app is handling 2.5k users on 3.5k connected sockets online. I have up to 25k sessions per day.

I've noticed that during the night, when the number of users drops to 100, the number of open connections on the server doesn't decrease; it stays the same.

Of course it was a big issue for my app: I had to increase ulimit -n on the server and restart my node.js app every couple of days.
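(For anyone hitting the same descriptor exhaustion, the stopgap looks like this; the limit value below is only an example, not a recommendation:)

```shell
# Show the current per-process open-file limit, then try to raise it
# for this shell session. Raising the soft limit above the hard limit
# requires root, so the example tolerates failure.
ulimit -n
ulimit -n 65536 || true
```

A persistent change typically goes through /etc/security/limits.conf or the init system's service configuration rather than an interactive shell.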

This fix solves the problem. After implementing it, open connections are killed when sockets disconnect.

Check out the graph. I started counting open connections in the middle of the week; later on I implemented the fix and restarted the server. After the weekend you can see a significant improvement.

[graph]

Thanks a lot for this fix!

BTW: My app is using polling as a default transport method. My clients do not use websockets.

@3rd-Eden
Contributor

Move the self.transport.close(); under the onClose so no test regression is introduced. That should stop the tests from failing and will make people press the merge button.
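To make the intended call order concrete, here is a minimal stand-in simulation (the objects below are stubs for illustration, not the real engine.io Socket/Transport classes): onClose runs first with the 'ping timeout' reason, then the underlying transport is force-closed.

```javascript
// Stub simulation of the suggested ordering on ping timeout.
// All names here are stand-ins, not engine.io internals.
var events = [];

var socket = {
  transport: {
    close: function () { events.push('transport closed'); }
  },
  onClose: function (reason) { events.push('onClose: ' + reason); }
};

// What the ping-timeout callback would do, in the suggested order:
function firePingTimeout(self) {
  self.onClose('ping timeout'); // report the close reason first
  self.transport.close();       // then tear down the raw transport
}

firePingTimeout(socket);
console.log(events.join(' -> '));
// -> onClose: ping timeout -> transport closed
```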

@apeace
Contributor
apeace commented Aug 12, 2015

@3rd-Eden moving the line like you suggested didn't fix the failing test case for me. The same test case is failing on the master branch, it seems unrelated:

  1) server extraHeaders should arrive from client to server via WebSockets:
     Uncaught Error: expected undefined to equal 'my-secret-access-token'

Though maybe if we fix that one, we can get this PR merged :P

Thanks to everyone for your help! Glad this problem is fixed in production for all of you :)

@nicofff
nicofff commented Aug 17, 2015

We ran this fix with transports set to ["websockets"] for a few days, and then switched back to the default ["polling","websockets"].
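As an aside, a hedged configuration sketch (httpServer is a placeholder, and note that the canonical transport names in socket.io/engine.io are 'polling' and 'websocket', singular):

```javascript
// Illustrative only: restricting the allowed transports server-side.
var io = require('socket.io')(httpServer, {
  // websocket only, no polling:
  transports: ['websocket']
  // default is ['polling', 'websocket'] (polling with websocket upgrade)
});
```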

It does appear that this PR helps a lot with the issue when using websockets, but there is still another issue related to the polling transport.

[graph: websocketsThenPolling]

@char-lieee

I forgot to mention that the fix was applied to an application running on the polling method. I don't use websockets at all, so in my case it works for polling.

@nkzawa
Contributor
nkzawa commented Aug 19, 2015

I looked through the source code and found a few suspicious points.

Transports are not explicitly closed on error. Usually transports will close themselves when an error occurs, but that's not always the case.
https://github.com/socketio/engine.io/blob/master/lib/socket.js#L95
https://github.com/socketio/engine.io/blob/master/lib/socket.js#L117

WebSocket connections are not closed when the upgrade times out, since transport.readyState never changes to open (which looks like a bug).
https://github.com/socketio/engine.io/blob/master/lib/socket.js#L169

Regarding polling transports, the http server automatically closes sockets when they time out, so maybe we don't need to worry about them.
https://nodejs.org/api/http.html#http_server_settimeout_msecs_callback
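A minimal sketch of the kind of defensive close the first point suggests (stub objects only, not the actual engine.io code): even if a transport usually closes itself on error, closing it explicitly guarantees the underlying socket cannot leak.

```javascript
// Stub transport: tracks whether close() was actually called.
var transport = {
  readyState: 'open',
  close: function () { this.readyState = 'closed'; }
};

// Defensive error handler: do not rely on the transport closing
// itself; close it explicitly if it has not already done so.
// (Real code would also log or report err.)
function onTransportError(err) {
  if (transport.readyState !== 'closed') {
    transport.close();
  }
}

onTransportError(new Error('transport error'));
console.log(transport.readyState);
// -> closed
```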

@danpe
danpe commented Aug 23, 2015

👍

@nicofff
nicofff commented Aug 25, 2015

Still seeing problems if the polling transport is enabled (either alone or in conjunction with websockets), even after applying this PR and @nkzawa's.

@nkzawa
Contributor
nkzawa commented Aug 25, 2015

@nicofff I wonder why the PR doesn't solve your case. Do you have any particular settings for the http server, like timeout or keepAlive?

@nicofff
nicofff commented Aug 25, 2015

We're using https, no options other than cert and key.

@rauchg rauchg merged commit 90d3286 into socketio:master Aug 27, 2015

1 check failed

continuous-integration/travis-ci/pr The Travis CI build failed
@rikukissa

Any updates on this one? We're still experiencing a lot of connections not being closed when using the polling transport together with the websocket transport. We're also using HTTPS.

@rauchg
Contributor
rauchg commented Dec 28, 2015

This PR has been merged and included in a release already. What version are you running?

@rikukissa

The most recent one (1.6.4). Everything seems to be working fine now, after we stopped using socket.io's https config property and started proxying all connections through nginx.

@rauchg
Contributor
rauchg commented Dec 28, 2015

@rikukissa so is this specifically related to SSL handling?

@rikukissa

It seems so, but I really can't say for sure. Let's consider this solved, and I'll get back to you if we figure out what caused it in the first place.

@rauchg
Contributor
rauchg commented Dec 28, 2015

@rikukissa thanks a lot!!

@Nibbler999 Nibbler999 added a commit to Nibbler999/eynio that referenced this pull request Nov 30, 2016
@Nibbler999 Nibbler999 Add timeout socket close fix from socketio/engine.io#336
git-svn-id: svn://internal-repo/nsc/NHome/homeserver@125571 bc4cc02b-413e-4181-b44e-24092dc36e11
5935462