High rate of "client not handshaken should reconnect" #438

Closed
squidfunk opened this Issue Aug 1, 2011 · 297 comments
@squidfunk

I am running a chat server with node.js / socket.io and have a lot of "client not handshaken" warnings. At peak times there are around 1,000 to 3,000 open TCP connections.

For debugging purposes I plotted the actions that follow the server-side "set close timeout" event, because the warnings are always preceded by it. The log format is:

Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - set close timeout for client 2098080741242069807
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - xhr-polling closed due to exceeded duration
--
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - set close timeout for client 330973265416677743
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - setting request GET /socket.io/1/xhr-polling
--
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - set close timeout for client 10595896332140683620
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - cleared close timeout for client 10595896332140683620
--
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - set close timeout for client 21320636051749821863
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - cleared close timeout for client 21320636051749821863
--
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - set close timeout for client 3331715441803393577
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   warn  - client not handshaken client should reconnect

The plot below, explained:

  • x axis: the time elapsed between the first and last sighting of a client id.
  • y axis: the total number of clients with lifetime x whose log trail ends with a specific message (client not handshaken, cleared close timeout, etc.)

[Plot: distribution of client lifetimes, broken down by final log message]

I did not change the default timeouts and intervals provided by socket.io, but I find it very strange that there is a peak of handshake errors at around 10 seconds (even surpassing the successful "cleared close timeout" events!). Has anyone experienced a similar situation?

Best regards,
Martin

@squidfunk squidfunk closed this Aug 4, 2011
@denisu

Hi,

were you able to solve it? I am still having this problem with 0.7.8. I am not able to reproduce it on my machine, but I can see it in the debug logs; some clients go really crazy (looks like more than 50 reconnects/second). I only have this problem with jsonp and xhr connections, but turning off one of them didn't help.

   debug - setting request GET /socket.io/1/jsonp-polling/487577450665437510?t=1312872393095&i=1
   debug - setting poll timeout
   debug - clearing poll timeout
   debug - jsonppolling writing io.j[1]("7:::1+0");
   debug - set close timeout for client 487577450665437510
   warn  - client not handshaken client should reconnect
   info  - transport end
   debug - cleared close timeout for client 487577450665437510
   debug - discarding transport
   debug - setting request GET /socket.io/1/xhr-polling/487577450665437510
   debug - setting poll timeout
   debug - clearing poll timeout
   debug - xhr-polling writing 7:::1+0
   debug - set close timeout for client 487577450665437510
   warn  - client not handshaken client should reconnect
   info  - transport end
   debug - cleared close timeout for client 487577450665437510
   debug - discarding transport
   debug - setting request GET /socket.io/1/jsonp-polling/487577450665437510?t=1312872393150&i=1
   debug - setting poll timeout
   debug - clearing poll timeout
   debug - jsonppolling writing io.j[1]("7:::1+0");
   debug - set close timeout for client 487577450665437510
   warn  - client not handshaken client should reconnect
   info  - transport end
   debug - cleared close timeout for client 487577450665437510
   debug - discarding transport
   debug - setting request GET /socket.io/1/xhr-polling/487577450665437510
   debug - setting poll timeout
   debug - clearing poll timeout
   debug - xhr-polling writing 7:::1+0
   debug - set close timeout for client 487577450665437510
   warn  - client not handshaken client should reconnect
   ...

I am using custom namespaces btw.

@denisu

TCP dump of one of those terror-connections, if that helps (this goes on continuously).

10:21:13.450665 IP (removed).8433 > (removed).65471: Flags [P.], seq 1973377331:1973377336, ack 537321861, win 54, length 5
@.@....(!....E ...u.Y3 ...P..6L....2::.
10:21:13.480742 IP 77.12.111.190.64720 > 188.40.33.215.8433: Flags [P.], seq 29040:29700, ack 6557, win 4169, length 660
....E...V.@.z...M.o..(!... ..%.)M...P..I^p..GET /socket.io/1/jsonp-polling/12189419471609411629?t=1312878072071&i=1 HTTP/1.1
Host: (removed):8433
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: */*
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: (removed)
Cookie: (removed)


10:21:13.481279 IP (removed).8433 > (removed).64720: Flags [P.], seq 6557:6706, ack 29700, win 9911, length 149
...$!).q..E.....@.@....(!.M.o. ...M....%..P.&..7..HTTP/1.1 200 OK
Content-Type: text/javascript; charset=UTF-8
Content-Length: 19
Connection: Keep-Alive
X-XSS-Protection: 0

io.j[1]("7:::1+0");
10:21:13.504212 IP (removed).64725 > (removed).8433: Flags [P.], seq 21428:21915, ack 6249, win 4356, length 487
.M.o..(!... ....\W,..P.......GET /socket.io/1/xhr-polling/12189419471609411629 HTTP/1.1
Host: (removed):8433
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: (removed)
Origin: (removed)


10:21:13.504593 IP 188.40.33.215.8433 > 77.12.111.190.64725: Flags [P.], seq 6249:6391, ack 21915, win 7846, length 142
...$!).q..E.....@.@....(!.M.o. ...W,.....CP.......HTTP/1.1 200 OK
Content-Type: text/plain; charset=UTF-8
Content-Length: 7
Connection: Keep-Alive
Access-Control-Allow-Origin: *

7:::1+0
10:21:13.542058 IP (removed).64720 > (removed).8433: Flags [P.], seq 29700:30360, ack 6706, win 4132, length 660
....E...V.@.z...M.o..(!... ..%..M.._P..$Ro..GET /socket.io/1/jsonp-polling/12189419471609411629?t=1312878072149&i=1 HTTP/1.1
Host: (removed):8433
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: */*
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: (removed)
Cookie: (removed)


10:21:13.542557 IP (removed).8433 > (removed).64720: Flags [P.], seq 6706:6855, ack 30360, win 9921, length 149
...$!).q..E.....@.@....(!.M.o. ...M.._.%.QP.&.....HTTP/1.1 200 OK
Content-Type: text/javascript; charset=UTF-8
Content-Length: 19
Connection: Keep-Alive
X-XSS-Protection: 0

io.j[1]("7:::1+0");
10:21:13.567452 IP (removed).64725 >(removed)8433: Flags [P.], seq 21915:22402, ack 6391, win 4320, length 487
.M.o..(!... ....CW,.^P.......GET /socket.io/1/xhr-polling/12189419471609411629 HTTP/1.1
Host: (removed):8433
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: (removed)
Origin: (removed)
@squidfunk

Hi denisu,

actually, for the last 2 weeks this high rate of handshake warnings has not been as important as the crashes of my client's chat server due to memory leaks. Those were hopefully fixed with the latest pull of 3rd-eden into master. In the next days I will be at my client's site again to check whether the chat server has crashed and to investigate those handshake warnings again.

For now they don't seem to be very severe (at least not as severe as the crashes). I will keep you updated here.

@denisu

Hi, thank you for the reply!

You are right, it's not that critical. The server can handle this high rate of connections easily. I hope it doesn't cause any problems on the client side. I still wasn't able to reproduce it, but until now I have had this problem with clients using Firefox 5.0, Opera 11 and IE 8.0 over the xhr and jsonp transports.

I will let you know if I find out more.

@gasteve

I've been having problems with this too...wondering whether there's something I should be doing on the client side that I'm not.

@denisu

Yeah, for about 500 connected clients, there are 2 to 5 clients constantly reconnecting (absolutely no delay). Clients with a super fast internet connection were reconnecting so fast that I had to block their IP in the firewall, since it was affecting the server.

@squidfunk can you reopen this issue?

@squidfunk squidfunk reopened this Aug 16, 2011
@squidfunk

I tried to investigate the problem but haven't found anything so far. However, it may be related to another problem, which is caused by the fact that socket.io doesn't enable the flashsocket transport by default. This leads to a problem in IE 6 and 7 when the chat server is not on the same domain as the client script, as IE 6 and 7 seem to prohibit cross-origin long polling (no problem with IE 8 though). Therefore it is not possible to connect with IE 6 or 7 at all. Maybe this is related.

I reopened it. Actually I closed it because I thought I was the only one experiencing the problem and that it may be too specific. But it doesn't seem to be.

@rafaelbrizola

This problem happened to me when using Firefox 3.6.18 and changing the encoding from UTF-8 to ISO-8859-1.

@squidfunk

@rafaelbrizola: can you give a little more information? Did the "client not handshaken" warnings increase, or were you not able to connect at all? Actually, my chat server is running (has to run) in an ISO-8859-1 environment (sadly enough).

@denisu

My app runs in an ISO-8859-1 environment too, and the node server is not on the same domain as the client script. Maybe this could be a hint about what's wrong :). But I have seen all common browsers (except the ones that support WebSockets) having this problem. Still investigating...

@akostibas

I've been experiencing the same thing since moving to 0.7.7 (now on 0.7.9). Also with a cross-domain Node/webserver setup. The biggest problem is that I have to set the log level to 0 to avoid filling up the tiny EC2 disk.

Let me know if I can provide anything useful for debugging.

@cowboyrushforth

Hi there,

Also experiencing this issue intermittently with a brand new deployment of socket.io 0.8.4.

Seems to occur 20-30 minutes after server start. Until then, everything is normal, everyone is happy.

But after about 20-30 minutes in, 90% of the messages in the log are simply 'client not handshaken should reconnect'.

Once the node instances are restarted (we are running 8 of them), the issue goes away, until it comes back 20-30 minutes later.

For what it's worth, we are load balancing (with IP-based session persistence) all 8 of the node processes.

Would be happy to provide any more details on our setup. Cheers, and thanks for Socket.IO! Awesome!

@3rd-Eden

I suspect that changing https://github.com/LearnBoost/socket.io/blob/master/lib/transport.js#L157 from .end to .close might help, but I haven't been able to verify my findings yet. I also see the same rate of requests coming in.

But it's a combination of failures on the server side and in the client-side code.

@cowboyrushforth

@3rd-Eden - I have had your suggested fix in production for almost 45 minutes now, and the rate of 'client not handshaken...' messages has dropped significantly. These messages still come in occasionally, but they are about equal in volume to log messages of real message-passing activity. I will continue to watch this issue for a bit, but you may have solved it.

Thanks a million! Will keep this thread up to date if my situation changes.

@cowboyrushforth

I may have spoken a bit too soon. Given enough time, it seems that all the node instances still fall back into the loop of 'client not handshaken' messages. For some reason it just seems to take longer before they need to be restarted.

@joefaron

I'm seeing this issue as well with about 700 connected clients.. How do you re-connect?

My log is flooded with:
warn - client not handshaken client should reconnect
warn - client not handshaken client should reconnect
warn - client not handshaken client should reconnect
warn - client not handshaken client should reconnect
warn - client not handshaken client should reconnect
warn - client not handshaken client should reconnect
warn - client not handshaken client should reconnect
warn - client not handshaken client should reconnect

@tommypowerz

Having the same problem... also about 500-1000 connections...

But it is also crashing with this from time to time:

node: src/uv-common.c:92: uv_err_name: Assertion `0' failed.

and I don't know how to catch this...

@joefaron

I cleaned up a lot of my code around quit()s on disconnects and fixed my client-side socket.io code to not poll longer than 20 seconds, and now I'm not getting this. Pretty sure it's the client-side code going wacky, reconnecting and starting dozens of connections per client.

@xtassin

Only got this when using the XHR polling or JSONP transports. It stopped when forcing WebSockets and FlashSockets only.

@ryanto

I'm seeing this happen with IE clients using XHR polling

@nickiv

XHR polling seems most suitable for me, so I keep digging into it.
I think the problem is in the socket's reconnect method. It calls the handshake method again and again, while the handshake requests (both jsonp and xhr) are not being cancelled at all.
Under certain network conditions, responses to the handshake can be delayed, and when they eventually arrive a fierce reconnect begins.

Now I have a method to reproduce the bug. Suppose we have a socket.io server running on port 8080. Connect a client via xhr-polling from Firefox, then add a firewall rule on the server:
iptables -A OUTPUT -p tcp -m tcp --sport 8080 -j DROP
You can see some handshake requests pending in the network section of Firebug. Then drop the rule:
iptables -D OUTPUT -p tcp -m tcp --sport 8080 -j DROP
After that reconnection begins.

In my opinion we should not call handshake again unless the previous one has failed. And there must be a timer in it to decide failure.
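A rough sketch of that idea as a client-side monkey patch (not an official fix; it assumes the 0.8-era client, where io.Socket.prototype.handshake(fn) performs the handshake request and calls fn on success, and the 10-second timer is an arbitrary choice):

(function (io) {
  var handshake = io.Socket.prototype.handshake;

  io.Socket.prototype.handshake = function (fn) {
    var self = this;
    if (self.handshakePending) return;        // a previous attempt is still in flight
    self.handshakePending = true;

    // Timer that declares the pending attempt failed, so a later reconnect
    // is allowed to handshake again.
    var timer = setTimeout(function () {
      self.handshakePending = false;
    }, 10000);

    handshake.call(self, function () {
      clearTimeout(timer);
      self.handshakePending = false;
      fn.apply(self, arguments);
    });
  };
})(window.io);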

@dominiek

Any news on this? This is pretty serious

@joelanman

I have this issue on socket.io 0.8.5; it takes up 100% CPU, and even if I kill node and restart, it starts straight back up with loads of 'client not handshaken' warnings. Any workarounds?

Update:

Tracked it down to a client running on IE9 - killed it and the issue has gone, but surely a single client shouldn't be able to do this?

@ryanto

@joelanman make sure your IE9 client uses socket.io client version 0.8.5.

@joelanman

thanks - must have just been a client/server mismatch

@cowboyrushforth

Have been having this problem, even with matching client+server on 0.8.5, but this patch in this thread: #534 seems to have definitely helped.

@3rd-Eden

@cowboyrushforth helped or fixed?

@cowboyrushforth

I deployed the patch about 36 hours ago, and it seems that the handshake errors continue to slowly decrease over time. (As connected clients finally refresh and download new client side code)

I will continue to keep an eye on the rate of these things and report back in another day or two.

@cowboyrushforth

Ok its now been 72 hours+ since applying the patch in issue 534, things seem much more stable. No out of control clients, and no high rate of handshake errors. Cheers

@liamdon

In which file should I apply the patch? Is this going to be fixed in 0.8.6?

@3rd-Eden

@liamdon it's already in the master of socket.io-client, so yes, it will be available in the next release.

@denisu

I don't think the root of the problem is solved in 0.8.6. The warnings are gone, but the clients are still reconnecting at a really high rate.

io.sockets.on('connection', function(client) {
    console.log('Client ' + client.id + ' connected');
});

... outputs at the same overall rate as before.

@liamdon

Agreed, neither the above patch nor 0.8.6 has solved this - I'm getting a very high rate of "warn - client not handshaken client should reconnect" when using xhr-polling. @cowboyrushforth is the fix still working out for you?

@pfeiffer

Also seeing this with 0.8.6.

@cowboyrushforth

@liamdon and others - sorry for misleading you. With the patch I thought had solved it completely, it turns out the issue still comes back, albeit much less often; it seems like an infrequent occurrence now overall. I have not been extremely scientific about this, just recording/perusing/munging logs, etc., but as I look now, the logs are piling up with these handshake warnings again.

@cowboyrushforth

Also, fwiw, I am running socket.io behind a load-balanced setup. In recent days a theory has formed that it only occurs on some clients after the node.js/socket.io server they are speaking to goes away (EC2 drama, or a node/app crash) and the load balancer assigns them to another node.js/socket.io server.

In every test environment failure scenario with every browser (IE 8/9, FF 3-7, Safari, etc.) this works fine: the clients realize the session is invalid after a few seconds (sometimes the polling length, 20 or so). But somehow, in production, with some weird browsers, I think this causes the client to go into a fierce reconnection loop. I haven't been able to reproduce this one reliably, which is the crux of it. Is anyone else having this issue behind load balancers or high-availability setups where the socket.io session may be severed?

@pfeiffer

I'm seeing this with clients connecting directly to node.js/socket.io - no load balancers.

@dominiek

Same here

@liamdon

@cowboyrushforth Yep I'm behind an Amazon Elastic Load Balancer and using the redis store. I haven't been able to reliably reproduce it either, except in production where, after 20 mins, our socket.io servers are effectively DDOS'd by the handshaking clients.

Do you have a current interim workaround? Don't use xhr-polling?

@ryanto

The hardest thing about this is that no one is able to reproduce it, yet we all see it happening. I'm still getting these, not as much, but they do still happen.

@cowboy I originally thought it was load balancing, so I forced all of my traffic to one node for a week; it still continued to happen.

Fwiw, the only time I was able to reproduce this was running IE9 with a client/server version mismatch: I had 0.8.4 on the client and 0.8.5 on the server. Make sure your clients request the latest version when you update the server.

That's all I've got, sorry...

@thekiur

A high number of 'client not handshaken client should reconnect' warnings can be produced by restarting the server quickly.
It seems that the client will attempt to use its old handshake ID. A refresh on the client side is required to get a working connection. Not sure in which browsers that happens, but if I enable logging, I see lots of those warnings if I restart the server quickly enough. They will gradually fade away.
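A hedged client-side band-aid for that case; it assumes the client surfaces the server's error packet through the socket's 'error' event (worth verifying for your client version), and the URL is a placeholder:

var socket = io.connect('http://chat.example.com:8080');   // placeholder URL

socket.on('error', function (reason) {
  // If the server was just restarted it no longer knows our handshake id,
  // and plain reconnects keep reusing the stale id. Fall back to the full
  // page refresh described above instead of letting the client spin.
  if (/not handshaken|handshake/.test(String(reason))) {
    window.location.reload();
  }
});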

@hovu96

Same on my EC2 instances when using 'xhr-polling' transport. I cannot reproduce it myself (with my browsers) but the log file of the production instance contains tons of this warning.

If there is anything I can try that helps find out what's going wrong, please let me know!

@denisu

I also noticed that many clients which produced handshake warnings in 0.8.5 are recognized in 0.8.6 as connected for about half a second or less, which makes my node instance run at almost 100% CPU. 0.8.5 ran at 5% to 10%.

@hovu96

I just tried version 0.8.5 but get the same result as with 0.8.6: about 20 seconds after launching the node server (about 100 connections) I get a lot of "client not handshaken client should reconnect" warnings...

@crishoj

I was seeing a spew of "client not handshaken client should reconnect" with socket.io-client and Node v0.5.10-pre.
After downgrading to Node v0.4.12 the issue seems to have disappeared.

@denisu

I have this problem with v0.4.8.

@pfeiffer

I'm seeing this with v0.4.8 and v0.4.12.

I've tried reproducing it, but have been unsuccessful so far. It happens after a restart of the node instance, where some clients reconnect like crazy - this can cause the node instance to fail, and when it is restarted the reconnect loop starts again.

I'm seeing this issue more frequently in socket.io v0.8.6 than v0.8.5.

@rauchg

The solution for this will be in v0.8.7

@pfeiffer

@guille You're the man! :-)

@dominiek

Does this mean we've actually found the problem and been able to reproduce it?

I'm saying this because there's been a lot of "reappearances" of this problem in the past months :(

@brettkiefer

@guille Great! Can you give us any more details concerning the problem? I'm testing socket.io in production right now with 0.8.6, so I am interested in what's going on.

@liamdon

Can you guys shed any light on what the problem is? We'd like to patch in a fix without waiting for the next release!

@denisu

The warnings still appear at a high rate with 0.8.7, but the short connections of 0.8.6 are fixed, so everything just seems like it was with 0.8.5 :D.

@peepo

The peepo.com server has a similar issue, but only from boxes on the same side of the firewall as the server:
they never connect.

so if anyone has client-side tests, scriptlets or whatever, let me know...

3 macs: 2 intel, 1 ppc

@martintajur

I experienced the exact same issue when I launched http://listhings.com on Node.js version 0.6.0, Socket.IO version 0.8.7. Whenever I turned on xhr-polling as a transport option, I got a ton of "broken" connections with the "warn - client not handshaken client should reconnect" messages.

Right now I have disabled xhr-polling in production, but as a result I cannot support all browsers at the moment.

@martintajur

Okay, I seem to be getting a high rate of these messages even when the xhr-polling transport is not enabled. Can someone describe what problems and implications surround these messages from the Socket.IO side of things - e.g. is there anything I could do to help speed up patching this issue?

Edit: The reason I saw these errors even with xhr-polling disabled was that I had jsonp-polling enabled as well. Now I am only allowing websocket and flashsocket connections, and Socket.IO performs relatively well. Some clients behind heavy HTTP-specific firewalls are now blocked from connecting, though.

@peepo

And for whatever reason, websocket and flashsocket connections alone don't work on this side of our simple firewall...

@dominiek

I just upgraded to 0.8.7 and re-enabled Websockets. Do I have to pray now?

@einaros

@dominiek, I'd upgrade to master/head instead.

@liamdon

Is there a confirmed fix in HEAD? It seems like we still don't have a good test case for this. I have struggled to recreate the issue except in production with a few thousand clients.

@martintajur

I was able to reproduce this easily on localhost with 7-8 tabs open in Google Chrome, with the following steps:

  • First, Socket.IO is configured to only allow xhr-polling (and/or jsonp-polling)
  • The server process was instructed to disconnect all connected clients immediately upon SIGTERM
  • The clients are, in turn, instructed to try to reconnect to the server with a 5-second interval upon disconnect
  • Once I then terminated the server and restarted it, all those 7-8 clients started to reconnect, but (somewhere in the middle) those "client not handshaken" messages started to appear for one of the clients. Not every time, though.

The most worrying thing (and the reason this was a showstopper for me in production) is that the Socket.IO client, upon receiving such an error message from the server, goes into a never-ending loop: it just retries the same request, which again triggers the error message, which makes the client issue the request again, which again triggers the error message... Pretty quickly, CPU usage went to 100%.

So it seems @nickiv was right - the problem lies in the client code, which doesn't handle the error message it receives properly.
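For reference, a minimal sketch of the server side of that reproduction, assuming the 0.8/0.9-era API (the clients simply reconnect on their own, e.g. with the client's 'reconnection delay' option set to 5000 ms):

// Hedged repro sketch: polling transports only, and every connected client
// is disconnected when the process receives SIGTERM, as described above.
var io = require('socket.io').listen(8080);

io.set('transports', ['xhr-polling', 'jsonp-polling']);

io.sockets.on('connection', function (socket) {
  console.log('client ' + socket.id + ' connected');
});

process.on('SIGTERM', function () {
  io.sockets.clients().forEach(function (socket) {
    socket.disconnect();      // boot everyone, then exit; restart to trigger the bug
  });
  process.exit(0);
});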

@dominiek

I could recreate this EXACTLY like @martintajur said. This is very, very worrisome indeed.

@dominiek

(with version 0.8.7!)

@einaros

@liamdon, @martintajur, @dominiek, I've yet to see this issue on any of my production environments, but if I do manage to reproduce it with the HEAD, I'd be happy to spend time chasing down the cause.

That said, this is open source software. Those of you who are able to reliably reproduce the issue aren't violently far from obligated to attempt a fix.

Up until two months ago I had nothing to do with the project, but upon needing support for the HyBi websocket protocol, I sat down and wrote just that. Since that time I've contributed this and that, and intend to keep doing so. If you're on any level getting anything good from the project, my friendly suggestion is: Pay it forward, don't just consume.

@martintajur

@einaros I am absolutely on the same page with you, and I will try to take a stab at this issue in the client side code.

@dominiek

@einaros Noted. Totally with you that this is and should be a collaborative thing. I've recently contributed a unit test to the unicode issues.

I can confirm that I could recreate this issue by opening 8 tabs on master/HEAD, 0.8.6 and 0.7.9.

@dominiek

I tried recreating this in a test case but wasn't successful. It seems that to trigger this issue you need to simulate many separate client (browser) instances against one server. Simply creating many connections with 'force new connection' doesn't work.

@diegovar

Just adding my grain of sand to this issue: I can only reproduce it when I have more than one socket.io server under HAProxy. If I have only one, with or without HAProxy, I have no issues; but as soon as I add 2 node.js instances under HAProxy and open a number of tabs, I get random connections and reconnections, and once in a while I get the never-ending ultra-fast reconnect bug.

@rauchg

If you open many tabs make sure you're not saturating the socket limit per host (use many subdomains).

@gdiz

I have AVG antivirus [Free]. Enabling the Surf Shield causes this problem on IE and Safari [Chrome and FF work fine]. Disabling the Surf Shield resolves the issue. This is a MAJOR issue as a lot of customers will be using AVG and similar firewalls. Anyone have a solution yet? [I've tried ports 80, 8080, 843 and 443 without any luck]

@einaros

@diegovar, I have 12 running production applications, under one haproxy instance. So there's more to it than that.

@gdiz

I can confirm that FF, IE 8/9, Safari (Windows) and Chrome work fine if I disable the AVG Surf Shield. So this must be some sort of networking issue.

Also, I am using the entire transport stack:

io.set('transports', [ // enable all transports (optional if you want flashsocket)
  'websocket',
  'flashsocket',
  'htmlfile',
  'jsonp-polling',
  'xhr-polling'
]);

@dominiek

I'm sure firewalls like AVG can have an impact on this issue, but disabling them is definitely NOT a fix. We could recreate the issue easily without any firewalls.

@dominiek

After a day of debugging we found out that this issue doesn't show up when you point your server towards Mecca and touch your nose three times!

No really: we've isolated this problem into a unit test. It's a very deep one and we suspect this "reconnect loop" can be triggered in other ways than we're illustrating here: socketio/socket.io-client#339

Here's a short description of the problem:

The infinite loop can happen due to several causes. In this unit test we highlight one of those causes, which could be that the transport takes longer to "get ready" than the server-side handshake garbage collector allows (30 seconds). In the case of XHR this could happen due to the browser needing to load slow include files on the page (this in fact often happens when you open 8 tabs in Chrome). Here are the steps of this specific scenario:

  • The client does a io.connect
  • In connecting, the client does a successful handshake, the server stores the client id into its handshake buffer
  • For whatever reason, the client's transport is not ready yet (this means XHR#open will not start yet). Once the document.load event is triggered by the util.defer, the io.connect will continue and call XHR#open which does a get call.
  • If more than 30 seconds have passed, the server will have removed the client id and will tell the client to reconnect. The client receives the error packet, but continues to do an XHR#get on each successful XHR request. The error is escalated to Socket#onError, but because Socket#connected is still false, it will never attempt a reconnect and XHR#get will continue to loop. The server will keep telling the client to reconnect (although it really also means that the client should re-handshake).

The delay in transport loading is just one potential case in which this death spiral can be triggered.
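The same timing can be observed without the socket.io client at all, using the raw protocol endpoints that already appear in the logs above (host, port and the ~30-second window are assumptions based on this comment):

// Hedged sketch: handshake, wait past the server-side handshake GC window,
// then poll with the now-stale id and watch the "7:::1+0" error packet
// (client not handshaken, advice: reconnect) come back.
var http = require('http');

function get(path, cb) {
  http.get({ host: 'localhost', port: 8080, path: path }, function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () { cb(body); });
  });
}

get('/socket.io/1/', function (body) {
  var sid = body.split(':')[0];   // handshake response: "sid:heartbeat:close timeout:transports"
  setTimeout(function () {
    get('/socket.io/1/xhr-polling/' + sid, function (body) {
      console.log(body);          // expected: 7:::1+0
    });
  }, 35000);                      // longer than the ~30s garbage collection window
});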

@pfeiffer

@dominiek Great work - awesome with a test-case!

@diegovar

Just wanted to report that for some reason, if I run socket.io on port 80 I get this infinite reconnect behavior, but if I do so on some other port it works correctly. This infinite reconnect behavior only happens for transports other than websockets; if I leave websocket as the only available transport then the connection is simply never made (the 'connecting' handler is called but nothing else).

@bmentges

@guille 0.8.7 is out and you said it would be fixed there. Is it fixed in 0.8.7 ?

Thanks,
Bruno

@faeldt

I appear to have the same behavior as diegovar when running on port 80, and sadly the Japanese carrier SoftBank only allows port 80 for HTTP communication on Android.

@mikl

I seem to have this problem at very high rates – I had ~133k instances of this error within a 3 hour period with about 20 concurrent users yesterday…

@arnesten

I get this in my logs when I use IE9 in IE8 mode with XHR-polling. As long as I am in IE8 mode no messages are received to the browser from the server. If I turn off IE8 mode and use real IE9, messages are received correctly and I don't get the error in my logs anymore.

@cris

I've applied the patch from @dominiek myself and it works perfectly!
Thanks, @dominiek.

@dominiek

Any news on this, guys? My patch actually only fixed one part of the problem; it's definitely still happening in our multi-node production setup. The logs are full of it.

We are kind of hoping (making a little SocketIO prayer before hitting "cake deploy") that the reconnect refactor will solve this for us

Cheers

Dominiek

@cowboyrushforth

Fwiw, I have tried all the patches in this thread in production, and none have solved the issue, including 3rd-eden's bugs/reconnect branch, dominiek's patch, and both combined.

@rauchg

@cowboyrushforth have you tried extending the garbage collection timer ?

@cowboyrushforth

@guille no, but I am happy to give that a shot, can you point me to some docs/info on how to experiment with that? Thanks!

@rauchg

Please keep me posted

@cowboyrushforth

@guille, fix deployed, will keep you posted. So far so good. I have set it to 90,000.

@cowboyrushforth

No change unfortunately, the handshake errors are piling up as usual.

From my debugging, this appears to happen when the load balancer decides one of the node.js/socket.io servers is unavailable (for example when it restarts) and moves the client to a new node.js/socket.io server. From there, the handshake errors immediately start.

It feels like the client tries to continue its session on a new node.js/socket.io server, but this server never had the handshake details to begin with and says the client should reconnect. For some reason this drives the client haywire and it goes into the infinite reconnection storm, until the user does a full browser refresh.

@dominiek

Yes, it's vital to test this on load-balanced setups. We are using nginx with ip_hash-based load balancing and are also seeing a lot of these errors.

We also noticed that if we have any normal HTTP call not responding on our node app instances, the load balancer will decide to use a different machine. This will trigger this behavior as well.

Now that we have fewer of these load balancer switchovers the "storm" is smaller, but there is still a pretty heavy wind blowing.

@cris

I use socket.io without load balancing and, checking the log, observed that "not handshaken" still occurs, but at a very low pace compared to the case without @dominiek's patch.

@rauchg

Are you guys doing sticky load balancing?

@cowboyrushforth

@guille yes, all clients are always re-routed to the same node server. If that node server is restarted or crashes however, then clients are immediately routed to a different node server.

@rauchg

Well the problem is that if the server crashes, it loses the sessions, therefore it will advise clients to reconnect. Sounds like expected behavior. Why is your server crashing?

@cowboyrushforth

The crashing is irrelevant; it happens just the same on a server restart, i.e. when we deploy new code for a new feature.

@cowboyrushforth

The real problem is that when it advises clients to reconnect, this functionality is broken: it only works on some clients. For example, with WebKit browsers it seems to work as advertised, but with Firefox 3 and IE it doesn't. When those clients are advised to reconnect, they get into an infinite reconnection loop, which doesn't sound like expected behavior.

@dominiek

Yes, I forgot to report this: after my fix we still noticed (infinite) reconnect loops happening, but these were also happening on connection failures before we had a load balancer.

In XHRPolling#get I put in a hard stop if NUM_ERRORS_HAPPENED exceeds 50. In onPacket I put NUM_ERRORS_HAPPENED++ when an error packet is received.

This is severely fucked indeed
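Roughly, the hard stop described above as a client-side monkey patch (a sketch only; it assumes the 0.8/0.9 client exposes the polling transport as io.Transport['xhr-polling'] with a get() method and routes packets through io.Socket.prototype.onPacket - adjust the names to your version):

// Hedged sketch of the NUM_ERRORS_HAPPENED guard, not an official patch.
(function (io) {
  var MAX_ERRORS = 50
    , numErrorsHappened = 0;

  var onPacket = io.Socket.prototype.onPacket;
  io.Socket.prototype.onPacket = function (packet) {
    if (packet && packet.type === 'error') numErrorsHappened++;   // count error packets
    return onPacket.call(this, packet);
  };

  var get = io.Transport['xhr-polling'].prototype.get;
  io.Transport['xhr-polling'].prototype.get = function () {
    if (numErrorsHappened > MAX_ERRORS) return;   // hard stop: give up polling
    return get.apply(this, arguments);
  };
})(window.io);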

@rauchg

I see. So this is exclusively a client-side issue. Got confused. I'll review the code for the client asap.

@naxhh

I'm still having this bug, and haven't found any kind of help :(

@jmyrland

+1

This is a big issue in our production environment using node v 0.8.8 and socket.io v 0.9.10.

With around 1k connected clients (using the XHR polling transport, mostly IE8 clients) the CPU goes quickly through the roof. The only fix is restarting the server.

We were able to stabilize the server when we enabled flashsockets (instead of XHR polling).

@rauchg

@jmyrland are you using a single process? Any proxies behind socket.io?

@jmyrland

@guille That is a good question.

When using multiple processes with XHR as the transport, the client would sometimes not connect to the server. It became quite unstable. When using a single process, the client always got connected, thus we stayed with a single process (though set up with a redis store).

However, now that we are using flash- and websockets, we can cluster up to multiple processes without any connection instability.

No proxies that I'm aware of.

Let me know if I should provide any logs, to help you :)

@naxhh

Does XHR polling really make a difference?

I get the same error as you.

The server goes down with 35k+ unique connections every time.

Now I have a cron job checking for node and restarting it, but that isn't the way to go...

When the cron job restarts node I get more warnings than before, so restarting is the worst solution I've come up with.

@jmyrland

98% of the ~1k clients (IE8) connected to the server were using XHR polling. When moving to the flashsocket transport, the problem was marginally reduced - avoiding server meltdown.

So yes, in my case the XHR polling transport is the issue.

@naxhh

In my case I'm using the websocket transport.
Should the number of errors go down with XHR polling?

@wavded

@guille is this something addressed in engine.io?

@rauchg

@naxhh restart is precisely what explains the warning: you kill all the sessions, then subsequent GET requests in the polling cycles don't find the id.

@naxhh

@guille yes, but in theory socket.io sends the client a request to reconnect:

"user should reconnect"

But the client side never seems to reconnect that socket.

So the server gets a lot of failing requests and finally restarts itself because it can't handle them.

Is there anything I'm missing?

I have to add to this:
the error is given a lot when I restart the server, of course.

But it also seems to appear sometimes without restarting anything.

@anthonywebb

Still broken in 0.9.10. The only recourse was to apply the "line 677" fix, which has moved to line 711 in 0.9.10; commenting out that line saved my server... Hopefully someday this bug will be solved. Amazing that it has lingered for so long; this has to be the all-time longest thread in GitHub history :)

@davidfooks

@anthonywebb I thought the changes in the client fixed the need for the "line 677" fix. That should stop the DOS attack that the clients are doing (or at least throttle it a lot more).

Did you try my hack? It's not tested so I wouldn't fully trust it, but it works for us:
c382a6b.

@anthonywebb

@davidfooks to clarify, if I use your hack I don't need the line 677 fix, right? I'll give it a try.

@anthonywebb

@davidfooks your hack (I removed the line 677 hack) seems to be working pretty solidly; at least it looks good in the logs, no more DOS attack. CPU load spikes to 100% every 5 seconds or so but then comes back down. Not sure if this is normal, but I will watch it.

Another note: I see people talking about the number of "unique connections" - is there some easy way to get at that number? I'm curious what mine is.

@davidfooks

No, my change fixes the specific issue we were seeing with clients being disconnected even though they require a keep-alive connection (I'm not sure if it was put into the release) which resulted in the DOS. #438 (comment)

The 677 fix just stops the DOS spamming but not the original issue causing the DOS. This DOS bug fix was applied client-side see #438 (comment).

@davidfooks

Ok, after reviewing all this (it's been a long time) I remember that they put a version of this hack and the DOS fix into release 0.9.5. We haven't seen the issue since. Are you sure that your load balancer is redirecting reconnecting clients to the same machine? That can cause this behavior.

@anthonywebb
@shapeshed

Same issue here. Applying the '677 hack' (t-shirts anyone?) to what is now line 711 of lib/manager.js in v0.9.10 reduced CPU usage.

Running Socket.IO in production has been painful - before running on 0.8.14 we were seeing memory leaks even after disabling websockets. When running websockets there were also significant memory leaks. See this ticket.

@maxguzenski

I have an m1.small on EC2, just with node (0.8.14) and socket.io (0.9.11), and I have this issue. After a while, node.js hits "EMFILE" errors many times and does not stop or restart on its own (only a manual restart helps).

I think the '677 hack' helps a lot with the 'EMFILE' error as well.

@mdahiemstra

I'm running a node server to keep track of a user's playing time. On disconnect the duration gets saved to redis. I'm running node 0.8.11 with socket.io 1.1.62 and I also have this issue, but in some cases it's different:

When running the node app as a single process without RedisStore I can get up to 15k concurrent users; when the CPU can't handle it anymore (load around 100%) I get the "warn - client not handshaken client should reconnect" messages a lot.

When I run the node app clustered (with the cluster module), I run 4 processes using RedisStore and each can handle around 1.5k concurrent users (~5k total); the load is distributed and a lot lower than when running in single mode, but I get the handshake warnings a lot sooner. It looks like it's losing its client sessions or something. I also don't see any records in the Redis db - is this normal?

Anyway, it's no problem to run it in single mode, but I want to be able to scale the application up to 30k concurrent users. When running in single mode I occasionally get a "(node) warning: possible EventEmitter memory leak detected." warning.

I did not apply the 677 hack by the way.

Any fixes yet for the handshake warnings? Maybe my way of benchmarking is not sane, I don't know, but I haven't found a decent benchmark tool for node yet. For now I'm just using a ramp-up scenario which I wrote as a node app, running it from an external server to simulate connections that get killed at a random time after connecting.

@softwareprojects

The amount of time we spent wrestling with these "client not handshaken" errors is absolutely mind boggling!

This ticket is now a year old and as of socket.io version 0.9.10 the bug is still very much alive and kicking.

The issue reproduces when you have multi-instance / multi-process nodes, with clients getting switched between the different node servers. Sure, under normal circumstances you should use stickiness and have the same client connect to the same node server, but servers go down. And when they do, you -will- end up with a client that was previously talking to node_instance_1 attempting to reconnect to node_instance_2.

We tried everything to make this scenario work. But it just doesn't work.

We applied several different patches to the socket.io code (as recommended by users), including a few patches to node.js, trying to force socket.io to gracefully handle a disconnect-reconnect.

The only way to fix this is by using RedisStore, so that all node instances communicate with each other, sharing the session data. RedisStore eliminates about 95% of this issue. However, from time to time, even with RedisStore, the nasty "client not handshaken" error reappears. When that happens, the only recourse is a full page refresh.

After spending A LOT of time struggling to make this work, we eventually came up with a clean solution, that works 100% of the time, no ifs or buts.

We ended up ditching socket.io and replacing it with sockjs.
Sure it has no bells & whistles, no emit function and no auto-reconnect.

It just works.

@mdahiemstra

Hi,

Yes, it's very sad to see so little support from the package maintainers on such an important issue. I have my node application running in single-process mode now and it can support ~10k concurrent users. But my project requires it to scale up to ~30k, and I want multiple servers to guarantee uptime. So I'll look into your tip about sockjs, thanks.

@naxhh

I have 40k users now and a lot of handshake errors... thinking of changing socket libraries too :(

@davidfooks

SockJS is good and I've never seen this error in my testing. I've been meaning to switch to it at some point. The best thing about it is that it is lightweight and just like the WebSocket API (future-proof), unlike socket.io, which is far too powerful for what we are actually using it for.

@wavded

We switched our code base from socket.io to sockjs and generally it's been a lot easier to maintain (adding room/context and reconnect support was pretty simple); the module does less, which turned out to be a big maintenance win.

@rauchg

This issue doesn't exist in engine.io in my testing, so you basically need to wait until it's integrated into socket.io. Reproducing transport work on both projects at this point is less than ideal.

@maxguzenski

To say that this issue only happens when you have multiple instances/processes is not true.
I have a single instance (EC2 m1.small) running node as a single process... and I have this issue (the "677 hack" helps a lot).

node 0.8.14 and socket.io 0.9.11. Socket.io is open directly on port 3000 to clients (no nginx, no haproxy, nothing).

@softwareprojects

@guille, love everything you're doing - really appreciated!

But your comment about "so you basically need to wait" doesn't work for us.
We don't want to wait :-)

Having such a critical bug unattended for over a year is not reassuring. Two of our engineers spent a full week wrestling with this, applying various patches and reaching out to all the other users experiencing the same issues when socket.io runs in a multi-instance/multi-process setup.

There are more than 5,000 results on Google for "socket.io client not handshaken"!
We probably went through all of them.

Lots of pain, with no one offering any real solutions (other than integrating redisstore or using different hostnames).

No matter what we tried, socket.io failed to gracefully handle a disconnect-reconnect, when the server switches.

For us, sockjs was a life saver

@shapeshed

@guille firstly thank you for Socket.IO!

This issue doesn't exist in engine.io in my testing, so you basically need to wait until it's integrated into socket.io.

The direction of Socket.IO 0.9.x isn't really working at the moment for those of us using it in production. Just look at this thread. It would be great to have an understanding of

  • Whether Socket.IO 0.9.x is still supported / actively used / officially deprecated.
  • When and if to expect Socket.IO 1.x.x.
  • Is there a branch for Socket.IO 1.x.x?
  • Can the community help push the Socket.IO 1.x.x effort forward?
  • Whether to migrate to engine.io / sockjs now?
@Incisive

I've also switched to SockJS from socket.io, and have had great success running it on a production environment with tens of thousands of people on one server.

While SockJS may not include all the features that socket.io has, like reconnecting and 'rooms', rolling my own implementation of those along with writing my own redis backend has proven to be much more efficient than what I experienced with socket.io.

@guille I appreciate what you have done for the socket.io community; however, I feel that letting these issues continue to run amok isn't the best way to approach this. I wish you guys had a more solid documentation area that describes production scaling and clustering setups along with load balancer layouts. Until these issues are addressed I'll be sticking with sockjs.

@pygy

Haters gonna hate... Socket.IO is a gift, not something you are owed.

Race conditions in distributed systems are sometimes horrible to track down, and there are only so many hours in a day.

With Engine.IO around the corner, I don't think it's worth it to track this one down.

Keep on rocking, Guille!

@softwareprojects

@pygy this is not about hate.

We are all extremely appreciative of the work put into socket.io by Guille and others.
It's a very impressive package and has helped thousands of apps go real-time, across multiple devices.

Having said that, this critical bug is simply preventing the use of socket.io in a multi-server architecture.

Once your servers go down, clients will disconnect and fail to successfully re-connect again.
To the end-user this translates to an error-message that doesn't go away until a full page refresh.

engine.io sounds promising. If you can wait for it, by all means do so.

This is not propaganda against socket.io, but merely a description of a problem we ran into and how we eventually solved it.

We posted on this thread to provide value for others.

Hopefully others who run into this problem can take away something from our experience and save some time.

@dalssoft

@guille How can we help test the new version with Socket.io + Engine.io?

@rauchg
@ADumaine

I don't know if this has a direct impact on all the different configurations that everyone is using, but I think it may provide a clue as to what might be causing some of the "client not handshaken client should reconnect" issues.

Try using an ip address on the client side instead of a hostname.

To load test our application (an auction site) we used another node instance with socket.io-client on a remote server to create connections and bid in an auction. I found that we could get no more than 12 connections, but lots of "client not handshaken client should reconnect". The server acts very strangely, as the connection count seems to drift while the connections try to stay up or reconnect.

After much head-banging I decided to try using the IP address of the server on the client and was immediately able to make 1,000 connections. I found I had to use setInterval to stagger creating the connections by about 30ms, otherwise it overloads the server. At connection 1012 I got a new error, "warn - error raised: Error: accept EMFILE". I added "ulimit -n 200000" and tried for 2,000 connections. The EMFILE errors stopped. The socket.io connection count went to 1105 and then stabilized at 1017. Not sure why. Some sort of limit from one IP address?

I have not had a chance to dig into why an IP address has no issue connecting. It could be that DNS is causing a timing or handshake issue.

Using node 0.8.14 and socket.io 0.9.11 on both sides, running on Ubuntu 10. Socket.io is listening on port 3300 under a node HTTPS server.

Client side:

var client = require("socket.io-client");

var socketAddr = 'https://199.2xx.xxx.xxx';    // connecting by IP address works
//var socketAddr = 'https://xxx.mydomain.com'; // connecting by hostname caps out at ~12 connections

var pid = '786';

var idStart = 1;    // test user id range
var idEnd = 2000;

var socks = {};     // keep a reference to every socket, keyed by user id

var sCount = idStart - 1;
var doConnect = setInterval(function () {
    sCount += 1;
    if (sCount >= idEnd) {
        clearInterval(doConnect);
        console.log("Cleared connect loop");
    }
    var uid = sCount;
    console.log("New sock count=" + sCount);
    var socket = makeSocket(uid);

    // put it into the map
    socks[uid] = socket;
}, 30);             // stagger connections by ~30ms so the server isn't overloaded

function makeSocket(uid) {
    var s = client.connect(socketAddr, { 'force new connection': true, 'port': '3300', 'connect timeout': '5000' });
    init(s, uid);   // application-specific: emits the app's own connect/bid events
    return s;
}

I hope this helps shed some light on this. I would very much like to get this resolved and have a stable socket.io.

@Kamil93

I have the same issue on node v0.6.x or v0.8.x with socket.io v0.9.10.
I'm gonna try the 677 fix, but maybe ADumaine's solution works? Does anyone know?

Can the 677 fix cause any trouble?

@mdahiemstra

I ported my application from socket.io to SockJS and am running it on multiple servers load-balanced by HAProxy. My benchmark resulted in ~50k concurrent users with minimal load, without any problems.

@softwareprojects

@mdahiemstra, we are about to go down the same path.

We have been running on sockjs for the last few weeks and loving it! The next step is putting it behind HAProxy.

Would you be so kind as to share your haproxy config, or point us in the right direction?
We're running sockjs over SSL because of sockjs/sockjs-client#94

Much appreciated!

@mdahiemstra

@softwareprojects Hi, Hmm I should look into running sockjs on SSL thanks for the heads up.

I used (modified) this haproxy config: https://github.com/sockjs/sockjs-node/blob/master/examples/haproxy.cfg

The file that worked for me on our staging environment is: http://cl.ly/code/2k1o1C0t1Z43

I can't provide the production configuration because it's handled by an outsourced company.

@Kamil93

Should I wait for Socket.io 1.0 or port my app to SockJS?

I'm using Node v0.6.2 with Socket.io 0.9.10 right now, and even after the 677 fix (which has moved to line 711 in 0.9.10)
CPU usage still increases.

I commented out this line: //transport.error('client not handshaken', 'reconnect');

@konklone

Only you can make that decision. :) SockJS is good stuff, and I just switched to it - but if you were used to some of socket.io's nice features (like a named event API, and Redis interaction), then you will have to put in more work to replace those things. It's not super hard, but it is a time investment.

@Kamil93

I really like the Redis solution. :) Is Socket.io 1.0 going to have Redis support too? And does anyone know the Socket.io 1.0 release date?

@thalesfsp

Going to SockJS....

@guilloche

When will this issue be solved?

@Kamil93
@guilloche

When will socket.io 1.0 be released? Will this exact problem be gone in it?

@Kamil93
@rauchg

You can keep track of progress here:
https://github.com/learnboost/socket.io/tree/1.0
https://github.com/learnboost/socket.io-client/tree/1.0

It's very close. Working on tests, documentation, website and a thorough document about changes and scalability.

@guilloche

Very close - when is that? In a week, a month, a year?

@pygy

The usual answer for open source projects is "When it's done".

Keep an eye on the 1.0 repo, and don't put pressure on the author who offers you something for free.

@evanp

I just switched to sockjs instead.

@jondubois jondubois added a commit that referenced this issue Dec 28, 2012
@jondubois jondubois Fixed issue #438 which caused high rate of "client not handshaken should reconnect" - With correct indentation :p
fb17aa6
@Kamil93

topcloudsystems, could it be a cause of memory leaks in Socket.io 0.9.x?

@radius314

The error happens even more when using secure vs non-secure.

@jondubois

hashi101 - I don't know yet. I'm working on a project right now and after doing some stress testing, I've noticed a few memory leaks which I will be debugging shortly... Probably socket.io-related - I might have an answer to that soon.

@thalesfsp

One year later and this problem still persists...

@Kamil93

If I build my application without the native Redis support from Socket.io and just do my own synchronization through Redis, should it work correctly? Without any strange things like memory leaks or reconnecting for no reason?

@lessmind

@hashi101 I had the same problem about one year ago, no Redis, nothing strange, but it still happened again and again.

@ddude

We had the same issue. Our setup uses node-redis, secure connections and process forking. The very first client to connect would trigger the bug every time.

Turns out it was the "cluster" module of node.js, we disabled it and the bug went away.

@zdwalter

I didn't use the cluster module, but I get the same warning with about 1000 connections.

@magickaito

I am still facing this problem...

@magickaito magickaito referenced this issue in sockjs/sockjs-node Mar 20, 2013
Closed

unable to connect from iphone client. #114

@sbellone

Hello,
I am also able to easily reproduce it with HAProxy, like @diegovar.

Issue reproduction:
In my case, the problem comes from Socket.IO's handshake mechanism: there are two requests, which means that if we load-balance them, one server instance will receive the handshake and a second server will receive the persistent transport session, and this will fail with this error message.

Actually, with the recommended HAProxy configuration (http://stackoverflow.com/questions/4360221/haproxy-websocket-disconnection/4737648#4737648), it seems to work at first. But that's just because the first www request is redirected to the first available www_backend server (i.e. server1), and then the socket request is redirected to the first available socket_backend server (i.e. server1 again; as it points to the same address, it works). Same with the 2nd client, etc.

But if we restart HAProxy, all clients will try to reconnect at the same time, and the load balancing will mess up the handshake process: we have a huge amount of "client not handshaken client should reconnect".

Solutions:
One solution is to use the source balancing algorithm, which ensures that both requests from a given client are redirected to the same server. But this will not result in optimal load balancing.
The second solution is to use HAProxy's cookie mechanism. This works fine for clients coming from a browser, but I did not find a way to use cookies with the socket.io-client lib.

Questions:
So, as I would like to use HAProxy's roundrobin algorithm AND the socket.io-client lib, I have two questions:

  • Would it be possible to deactivate the handshake process and have a single direct connection? I don't need access to the header data.
  • Is it possible, in socket.io-client, to access the cookies received during the first part of the handshake and send them back when trying to establish the persistent connection?

Thanks.

@sbellone

Ok, I think you can forget my comment.
Using a RedisStore instead of the MemoryStore is the solution in my case. It works super fine! 👍
Thanks, and keep up the good work!
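For anyone landing here, the RedisStore setup mentioned in several comments is roughly the following (socket.io 0.9.x assumed; option names per the Configuring-Socket.IO wiki linked at the end of this thread):

var sio = require('socket.io')
  , RedisStore = require('socket.io/lib/stores/redis')
  , redis = require('redis')          // or socket.io's bundled redis module
  , io = sio.listen(8080);

io.set('store', new RedisStore({
    redisPub: redis.createClient()
  , redisSub: redis.createClient()
  , redisClient: redis.createClient()
}));

io.sockets.on('connection', function (socket) {
  // handshake and session data are now shared across all workers
});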

@mmadden mmadden referenced this issue in socketstream/realtime-transport Apr 20, 2013
Open

rtt-faye module #1

@jondubois

Ok I decided to have another look at this issue recently.
The solution I posted a while ago (https://github.com/LearnBoost/socket.io/pull/1120/files) only reduced the occurrence of these errors; it did not stop them altogether.
It seems that the cause of the problem is a race condition related to clustering socket.io across multiple processes.
This issue occurs when using the default redis store and also with socket.io-clusterhub.
If you look around line 800 of lib/manager.js, you can see that socket.io responds to the client's connection request before it publishes the handshake notification to other socket.io workers:

res.end(hs); // responds here

self.onHandshake(id, newData || handshakeData);
self.store.publish('handshake', id, newData || handshakeData); // publishes here

So occasionally, the client will know about the handshake before any other socket.io worker does.
When the client gets the response and tries to continue with the connection, it may be handled by a worker
that is not yet aware of the handshake notification; hence the client is not handshaken (according to that particular worker).
It's not practical to check that every worker has in fact received the handshake notification, so I think the best way
to solve this issue is to give the worker a second chance to check the handshake, after some timeout, in case it doesn't see it the first time.

Something like this might work around line 710 (it will probably need a 'handshake timeout' config property to use as the timeout):

Replace:

if (transport.open) {
  transport.error('client not handshaken', 'reconnect');
}

transport.discard();

With:

if (transport.open) {
  setTimeout(function() {
    // If still open after the timeout, THEN we will kill that connection
    if (transport.open) {
      transport.error('client not handshaken', 'reconnect');
    }
    transport.discard();
  }, replaceThisWithTheHandshakeTimeoutVariable);
}
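
The 'handshake timeout' above is not an existing socket.io option; one way it could be wired in (names here are hypothetical) is through the usual 0.9 settings mechanism:

    // In the application: register a value for the (hypothetical) option, in milliseconds.
    io.set('handshake timeout', 500);

    // Inside lib/manager.js, near the patched block, read it back with a fallback
    // (assuming `self` refers to the Manager instance at that point):
    var handshakeTimeout = self.get('handshake timeout') || 500;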

Note that I am not making a pull request for this because it refers to the pre-1.0 version.
This issue may have been resolved in the new version.

@SteveEdson

@topcloud I'm interested in trying this; what would you recommend setting the timeout variable to?

@jondubois

@SteveEdson I guess you could add a 'handshake timeout' property as a socket.io config option. That may not completely get rid of the error; there might be other places in the code that have similar race conditions. It did reduce the number of failures, though.

@SteveEdson

@topcloud Thanks, I'll give it a try. Any idea what the actual value should be? Would something like 1 second be too high?

@jondubois

@SteveEdson 500ms sounds about right for inter-process communication, but you should experiment. Also note that I made a mistake in the code I pasted above: it should be if (transport.open && !this.handshaken[data.id]) instead of the second if (transport.open).
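
Putting that correction together with the snippet above, the patched block would read roughly as follows (handshakeTimeout is the hypothetical config value discussed earlier, and `self` is assumed to be a reference to the manager, since `this` inside the setTimeout callback no longer points at it):

    if (transport.open) {
      setTimeout(function () {
        // Only reject the client if, after the grace period, this worker
        // still has no record of the handshake.
        if (transport.open && !self.handshaken[data.id]) {
          transport.error('client not handshaken', 'reconnect');
        }
        transport.discard();
      }, handshakeTimeout);
    }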

@Billydwilliams

@topcloud
@SteveEdson
It's been a couple of months; how have the changes worked for you?

@SteveEdson

I think it helped, but I've just realised that when I updated socket.io last month, it would have overwritten the change. I'll have to reapply it and see how it goes. Cheers.

@jwarkentin

Hopefully this helps someone. I just did a little more investigating to try to figure out what's happening in my case and found some interesting stuff. I have several instances of my Socket.io app running behind a load balancer. Normally this is just fine, but sometimes something goes wrong and it falls back to long polling. When this happens, the connections get bounced back and forth between the servers behind the load balancer. If a request hits any server other than the one the client originally authenticated with, it floods the logs with these errors. The simple solution is to use the RedisStore stuff to share session information between running instances. There are also other solutions. Here are some links that may be useful:

http://stackoverflow.com/questions/9617625/client-not-handshaken-client-should-reconnect-socket-io-in-cluster
http://stackoverflow.com/questions/9267292/examples-in-using-redisstore-in-socket-io
https://github.com/LearnBoost/Socket.IO/wiki/Configuring-Socket.IO
https://github.com/fent/socket.io-clusterhub

@ruanosaur

@topcloud @SteveEdson Hi guys, any news on whether this will be fixed in v1.0? Currently using Redis Cloud on Heroku and experiencing the same error; going to try the timeout if v1.0 doesn't fix it...

@bossonchan

I am using the Cluster module and RedisStore, and can reproduce this bug.
After some exploration I have reached the same conclusion as @topcloud.
My solution is to use Nginx instead of cluster.
Nginx is a load balancer and has an extension module, nginx-sticky-module, which can ensure that a given user is always redirected to the same backend server.
Now everything seems to work fine; I hope this helps. :)
(P.S. the versions of node and socket.io are 0.10.28 and 0.9.16)
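
The relevant part of the Nginx configuration looks roughly like this (upstream name, addresses and ports are placeholders; the sticky directive comes from nginx-sticky-module):

    upstream socketio_nodes {
        sticky;                                   # cookie-based stickiness (nginx-sticky-module)
        server 127.0.0.1:3000;
        server 127.0.0.1:3001;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://socketio_nodes;
            proxy_http_version 1.1;               # needed for WebSocket upgrades
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
        }
    }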

@edgework

I see a lot of people suggesting RedisStore as a solution. Though it may solve the "not handshaken" error, it will introduce a massive memory leak. There are various "solutions" to that as well, but none of them work reliably.

@jwarkentin

With the RedisStore I've also run into problems with leaking sockets. I'm thinking socket.io-clusterhub is probably a better solution, though I still haven't tried it myself. Ultimately, to really make things work right, there needs to be a reliable way of sharing the handshake data between servers in a cluster. Also, while I haven't looked that deeply into the code, I suspect that all of these handshake-sharing mechanisms are susceptible to a race condition where the connection could be attempted before the handshake data has actually been shared with all the servers in the cluster. I haven't seen anything that protects against it (though I could easily have missed something). This race condition is much less likely to be observed, though, because the network latency within the cluster is almost always going to be lower than the latency of the connecting client.

@toblerpwn

Cannot speak to the cause, but #1371 has totally removed the 'leaking sockets' issue for us in prod in a fairly high-volume application (100,000+ active users per day & 7,000+ concurrent connections when busy). See my comments in that issue for CPU leak analysis, etc.

Also notably: version 1.0 does not use Stores per se. I forget what they call them now ('connectors'?) - and I'm unsure whether the actual implementation differs, but presumably it does something totally new, because why else change a name? 😄

@edgework

We tried 1371 and were getting other failures. I forget which, but we had to quickly revert when our app blew up. The failures didn't crop up until actual users connected; our internal tests worked fine. It probably has something to do with dropping down to XHR polling or something. In any case, after it blew up we just gave up on it.

In the end we ditched any kind of store besides the memory store and synchronized the sockets ourselves using Redis pub/sub, and we actually saw a decent performance increase.

At some point we'll be presenting our various solutions at Node meetups, etc.

@jwarkentin

That's too bad. I've had to disable RedisStore in production because of the problems it was causing. I might still try 1371, especially since I've disabled everything but websockets now. If that doesn't work I'll try clusterhub or write something myself. I need to solve this in the next week or so though. If you have any suggestions I'd love to hear them.

@edgework

Basically our solution was to use Redis pub/sub to connect the server instances. If a packet originating from a socket on one server was meant for a socket on another server, it would be queued. This includes broadcasts. Every tenth of a second, the queue is sent to all the other instances, and they pick up the packets for their sockets and forward them on. Of course, if the packet is meant for a socket on the same server, it's just handled internally and the other servers don't get notified.
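
A rough illustration of that pattern, not our actual code (the channel name, message shape and the io variable are assumptions):

    var redis = require('redis');
    var pub = redis.createClient();
    var sub = redis.createClient();

    var CHANNEL = 'socketio-forward'; // hypothetical channel name
    var queue = [];                   // packets destined for sockets owned by other instances
    // Elsewhere, when a packet targets a socket this instance does not own:
    //   queue.push({ socketId: id, event: 'message', data: payload });

    // Deliver forwarded packets that belong to sockets connected to this instance.
    sub.subscribe(CHANNEL);
    sub.on('message', function (channel, message) {
      JSON.parse(message).forEach(function (packet) {
        var socket = io.sockets.sockets[packet.socketId]; // only set on the owning instance (0.9 API)
        if (socket) socket.emit(packet.event, packet.data);
      });
    });

    // Flush the queue to the other instances every tenth of a second.
    setInterval(function () {
      if (!queue.length) return;
      pub.publish(CHANNEL, JSON.stringify(queue));
      queue = [];
    }, 100);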

@jwarkentin

I'm already using sticky load balancing, but that doesn't seem to work very well when most of the people connecting are behind big NATs. @guille does socket.io-redis work with socket.io 0.9.x?

@edgework

We're using HAProxy with no problems.

@rauchg

This issue is not relevant for 1.0

@rauchg rauchg closed this Aug 19, 2014
@wavded

yey! it's closed!... it's closed :)

@alexpersian alexpersian referenced this issue in socketio/socket.io-client-swift Aug 7, 2015
Closed

Connection is connected again and again #101
