High rate of "client not handshaken should reconnect" #438

Closed
squidfunk opened this Issue Aug 1, 2011 · 297 comments

squidfunk commented Aug 1, 2011

I am running a chat server with node.js / socket.io and see a lot of "client not handshaken" warnings. At peak times there are around 1,000 to 3,000 open TCP connections.

For debugging purposes I plotted the actions that follow the server-side "set close timeout" event, because the warnings are always preceded by one. The log excerpts have this format:

Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - set close timeout for client 2098080741242069807
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - xhr-polling closed due to exceeded duration
--
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - set close timeout for client 330973265416677743
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - setting request GET /socket.io/1/xhr-polling
--
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - set close timeout for client 10595896332140683620
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - cleared close timeout for client 10595896332140683620
--
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - set close timeout for client 21320636051749821863
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - cleared close timeout for client 21320636051749821863
--
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - set close timeout for client 3331715441803393577
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   warn  - client not handshaken client should reconnect

How to read the following plot:

  • x axis: the time elapsed between the first and last sighting of a client id.
  • y axis: the total number of clients that, for a given time x, terminate with a specific message (client not handshaken, cleared close timeout, etc.)

[Plot: number of clients per terminating message, by time between first and last sighting of a client id]

I did not change the default timeouts and intervals provided by socket.io, but I find it very strange that there is a peak of handshake errors at around 10 seconds (even surpassing the successfully cleared close timeouts!). Has anyone experienced a similar situation?
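
The aggregation behind the plot could be sketched roughly as follows. This is not the script used for the original graph. It assumes the log has already been reduced to pairs of lines separated by `--` (as in the excerpt above), and `parseLine` and `summarize` are hypothetical names. Because the warning line carries no client id, each outcome is attributed to the client id on the preceding "set close timeout" line.

```javascript
// Sketch only: parses log excerpts shaped like the ones above, where each
// "set close timeout" line is followed by one outcome line and pairs are
// separated by "--". Function names are hypothetical.

function parseLine(line) {
  // e.g. "Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - some message"
  const m = /^(.*?GMT[+-]\d{4})(?:\s*\([^)]*\))?\s+(\w+)\s+-\s+(.*)$/.exec(line.trim());
  if (!m) return null;
  return { time: Date.parse(m[1]), level: m[2], message: m[3] };
}

function summarize(logText) {
  const clients = new Map(); // client id -> { first, last, outcome }

  for (const block of logText.split(/^--$/m)) {
    const [head, next] = block.split('\n').map(parseLine).filter(Boolean);
    if (!head || !next) continue;
    const match = /set close timeout for client (\d+)/.exec(head.message);
    if (!match) continue;

    const seen = clients.get(match[1]) || { first: head.time, last: head.time, outcome: null };
    seen.first = Math.min(seen.first, head.time);
    seen.last = Math.max(seen.last, next.time);
    seen.outcome = next.message.replace(/\s+\d+$/, ''); // strip a trailing client id
    clients.set(match[1], seen);
  }

  // outcome -> { lifetime in whole seconds -> number of clients }
  const histogram = {};
  for (const { first, last, outcome } of clients.values()) {
    const seconds = Math.round((last - first) / 1000);
    histogram[outcome] = histogram[outcome] || {};
    histogram[outcome][seconds] = (histogram[outcome][seconds] || 0) + 1;
  }
  return histogram;
}
```

Each key of the returned object is one curve of the plot; the nested object maps a lifetime in seconds (x axis) to a client count (y axis).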

Best regards,
Martin

@squidfunk squidfunk closed this Aug 4, 2011

denisu commented Aug 9, 2011

Hi,

Were you able to solve it? I am still having this problem with 0.7.8. I am not able to reproduce it on my machine, but I can see it in the debug logs: some clients go really crazy (looks like more than 50 reconnects/second). I have this problem only with jsonp and xhr connections; turning off one of them didn't help, though.

   debug - setting request GET /socket.io/1/jsonp-polling/487577450665437510?t=1312872393095&i=1
   debug - setting poll timeout
   debug - clearing poll timeout
   debug - jsonppolling writing io.j[1]("7:::1+0");
   debug - set close timeout for client 487577450665437510
   warn  - client not handshaken client should reconnect
   info  - transport end
   debug - cleared close timeout for client 487577450665437510
   debug - discarding transport
   debug - setting request GET /socket.io/1/xhr-polling/487577450665437510
   debug - setting poll timeout
   debug - clearing poll timeout
   debug - xhr-polling writing 7:::1+0
   debug - set close timeout for client 487577450665437510
   warn  - client not handshaken client should reconnect
   info  - transport end
   debug - cleared close timeout for client 487577450665437510
   debug - discarding transport
   debug - setting request GET /socket.io/1/jsonp-polling/487577450665437510?t=1312872393150&i=1
   debug - setting poll timeout
   debug - clearing poll timeout
   debug - jsonppolling writing io.j[1]("7:::1+0");
   debug - set close timeout for client 487577450665437510
   warn  - client not handshaken client should reconnect
   info  - transport end
   debug - cleared close timeout for client 487577450665437510
   debug - discarding transport
   debug - setting request GET /socket.io/1/xhr-polling/487577450665437510
   debug - setting poll timeout
   debug - clearing poll timeout
   debug - xhr-polling writing 7:::1+0
   debug - set close timeout for client 487577450665437510
   warn  - client not handshaken client should reconnect
   ...

I am using custom namespaces btw.

denisu commented Aug 9, 2011

TCP dump of one of those terror-connections, if that helps (this goes on continuously).

10:21:13.450665 IP (removed).8433 > (removed).65471: Flags [P.], seq 1973377331:1973377336, ack 537321861, win 54, length 5
@.@....(!....E ...u.Y3 ...P..6L....2::.
10:21:13.480742 IP 77.12.111.190.64720 > 188.40.33.215.8433: Flags [P.], seq 29040:29700, ack 6557, win 4169, length 660
....E...V.@.z...M.o..(!... ..%.)M...P..I^p..GET /socket.io/1/jsonp-polling/12189419471609411629?t=1312878072071&i=1 HTTP/1.1
Host: (removed):8433
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: */*
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: (removed)
Cookie: (removed)


10:21:13.481279 IP (removed).8433 > (removed).64720: Flags [P.], seq 6557:6706, ack 29700, win 9911, length 149
...$!).q..E.....@.@....(!.M.o. ...M....%..P.&..7..HTTP/1.1 200 OK
Content-Type: text/javascript; charset=UTF-8
Content-Length: 19
Connection: Keep-Alive
X-XSS-Protection: 0

io.j[1]("7:::1+0");
10:21:13.504212 IP (removed).64725 > (removed).8433: Flags [P.], seq 21428:21915, ack 6249, win 4356, length 487
.M.o..(!... ....\W,..P.......GET /socket.io/1/xhr-polling/12189419471609411629 HTTP/1.1
Host: (removed):8433
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: (removed)
Origin: (removed)


10:21:13.504593 IP 188.40.33.215.8433 > 77.12.111.190.64725: Flags [P.], seq 6249:6391, ack 21915, win 7846, length 142
...$!).q..E.....@.@....(!.M.o. ...W,.....CP.......HTTP/1.1 200 OK
Content-Type: text/plain; charset=UTF-8
Content-Length: 7
Connection: Keep-Alive
Access-Control-Allow-Origin: *

7:::1+0
10:21:13.542058 IP (removed).64720 > (removed).8433: Flags [P.], seq 29700:30360, ack 6706, win 4132, length 660
....E...V.@.z...M.o..(!... ..%..M.._P..$Ro..GET /socket.io/1/jsonp-polling/12189419471609411629?t=1312878072149&i=1 HTTP/1.1
Host: (removed):8433
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: */*
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: (removed)
Cookie: (removed)


10:21:13.542557 IP (removed).8433 > (removed).64720: Flags [P.], seq 6706:6855, ack 30360, win 9921, length 149
...$!).q..E.....@.@....(!.M.o. ...M.._.%.QP.&.....HTTP/1.1 200 OK
Content-Type: text/javascript; charset=UTF-8
Content-Length: 19
Connection: Keep-Alive
X-XSS-Protection: 0

io.j[1]("7:::1+0");
10:21:13.567452 IP (removed).64725 > (removed).8433: Flags [P.], seq 21915:22402, ack 6391, win 4320, length 487
.M.o..(!... ....CW,.^P.......GET /socket.io/1/xhr-polling/12189419471609411629 HTTP/1.1
Host: (removed):8433
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: (removed)
Origin: (removed)

squidfunk commented Aug 9, 2011

Hi denisu,

Actually, for the last 2 weeks this high rate of handshake warnings was not as important as the crashes of my client's chat server due to memory leaks. These were hopefully fixed with the latest pull from 3rd-Eden into master. In the next few days I will be at my client's site again to check whether the chat server has crashed and to investigate those handshake warnings again.

For now they don't seem to be very severe (at least not as severe as the crashes). I will keep you updated here.

denisu commented Aug 9, 2011

Hi, thank you for the reply!

You are right, it's not that critical. The server can handle this high rate of connections easily. I hope it doesn't cause any problems on the client side. I still wasn't able to reproduce it, but so far I have seen this problem with clients using Firefox 5.0, Opera 11 and IE 8.0 over the xhr and jsonp transports.

I will let you know if I find out more.

gasteve commented Aug 16, 2011

I've been having problems with this too...wondering whether there's something I should be doing on the client side that I'm not.

denisu commented Aug 16, 2011

Yeah, for about 500 connected clients, there are 2 to 5 clients constantly reconnecting (absolutely no delay). Clients with a super fast internet connection were reconnecting so fast that I had to block their IP in the firewall, since it was affecting the server.

@squidfunk can you reopen this issue?

@squidfunk squidfunk reopened this Aug 16, 2011

squidfunk commented Aug 16, 2011

I tried to investigate the problem but haven't found anything so far. However, it may be related to another problem, which is caused by the fact that socket.io doesn't enable the Flashsocket transport by default. This leads to a problem in IE 6 and 7 when the chat server is not on the same domain as the client script, as IE 6 and 7 seem to prohibit cross-origin long polling (no problem with IE 8, though). It is therefore not possible to connect with IE 6 or 7. Maybe this is related.

I reopened it. Actually I closed it because I thought I was the only one experiencing the problem and that it may be too specific. But it doesn't seem to be.

rafaelbrizola commented Aug 16, 2011

This problem happens to me when I use Firefox 3.6.18 and change the character encoding from UTF-8 to ISO-8859-1.

squidfunk commented Aug 17, 2011

@rafaelbrizola: can you give a little more information? Did the "client not handshaken" warnings increase, or were you unable to connect at all? Actually my chat server is running (has to run) in an ISO-8859-1 environment (sadly enough).

denisu commented Aug 17, 2011

My app runs in an ISO-8859-1 environment too, and the node server is not on the same domain as the client script. Maybe this is a hint at what's wrong :). But I have seen all common browsers (except the ones that support WebSockets) having this problem. Still investigating...

akostibas commented Aug 25, 2011

I've been experiencing the same thing since moving to 0.7.7 (now on 0.7.9). Also with a cross-domain Node/webserver setup. The biggest problem is that I have to set the log level to 0 to avoid filling up the tiny EC2 disk.

Let me know if I can provide anything useful for debugging.

cowboyrushforth commented Sep 16, 2011

Hi there,

Also experiencing this issue intermittently with a brand new deployment of socket.io 0.8.4.

Seems to occur 20-30 minutes after server start. Until then, everything is normal, everyone is happy.

But about 20-30 minutes in, 90% of the messages in the log are simply 'client not handshaken should reconnect'.

Once the node instances are restarted (we are running 8 of them), the issue goes away, until it comes back 20-30 minutes later.

For what it's worth, we are load balancing (with IP-based session persistence) across all 8 of the node processes.

Would be happy to provide any more details on our setup. Cheers, and thanks for Socket.IO! Awesome!

Contributor

3rd-Eden commented Sep 16, 2011

I suspect that changing https://github.com/LearnBoost/socket.io/blob/master/lib/transport.js#L157 from .end to .close would help,
but I haven't been able to verify my findings yet. I also see the same rate of requests coming in.

It's a combination of failures in the server and in the client-side code.

cowboyrushforth commented Sep 17, 2011

@3rd-Eden - I have had your suggested fix in production for almost 45 minutes now, and the rate of 'client not handshaken...' messages has dropped significantly. These messages still come in occasionally, but they are about equal in number to the log messages for real message-passing activity. I will continue to watch this issue for a bit, but you may have solved it.

Thanks a million! Will keep this thread up to date if my situation changes.

cowboyrushforth commented Sep 17, 2011

I may have spoken a bit too soon. Given enough time, it seems that all the node instances do still fall back into the loop of client not handshaken messages. But for some reason they seem to last longer before they need to be restarted.

joefaron commented Sep 27, 2011

I'm seeing this issue as well with about 700 connected clients.. How do you re-connect?

My log is flooded with:
warn - client not handshaken client should reconnect
warn - client not handshaken client should reconnect
warn - client not handshaken client should reconnect
warn - client not handshaken client should reconnect
warn - client not handshaken client should reconnect
warn - client not handshaken client should reconnect
warn - client not handshaken client should reconnect
warn - client not handshaken client should reconnect

tommypowerz commented Oct 1, 2011

Having the same problem... also about 500-1000 connections...

But it is also crashing with this from time to time:

node: src/uv-common.c:92: uv_err_name: Assertion `0' failed.

And I don't know how to catch this...

joefaron commented Oct 2, 2011

I cleaned up a lot of my code around quit()s on disconnects and fixed my client-side socket.io code to not poll longer than 20, and now I'm not getting this. Pretty sure it's the client-side code going wacky: reconnecting and starting dozens of connections per client.

xtassin commented Oct 2, 2011

Only got this when using XHR polling or JSONP transports. It stopped when forcing to Sockets and FlashSockets only.

ryanto commented Oct 3, 2011

I'm seeing this happen with IE clients using XHR polling

nickiv commented Oct 4, 2011

XHR polling seems most suitable for me, so I keep digging into it.
I think the problem is in the reconnect method of the socket. It calls the handshake method again and again, while pending handshake requests (both jsonp and xhr) are never cancelled.
Under certain network conditions, responses to the handshake can be delayed, and when they eventually arrive a fierce reconnect loop begins.

I now have a way to reproduce the bug. Suppose we have a socket.io server running on port 8080. Connect a client via xhr-polling from Firefox. Then add a firewall rule on the server:
    iptables -A OUTPUT -p tcp -m tcp --sport 8080 -j DROP
You can see some handshake requests pending in the network section of Firebug. Then drop the rule:
    iptables -D OUTPUT -p tcp -m tcp --sport 8080 -j DROP
After that, the reconnecting begins.

In my opinion we should not call handshake again unless the previous one has failed. And there must be a timer in it to decide on failure.
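
The rule proposed above (at most one handshake in flight, with a timer that decides failure) could look roughly like this. It is a minimal sketch, not socket.io-client code: `createHandshaker`, `doHandshake` and the injectable `timers` argument are hypothetical names introduced for illustration.

```javascript
// Hypothetical helper, not part of socket.io-client.
// doHandshake(done) starts one handshake request and calls done(err, result).
function createHandshaker(doHandshake, timeoutMs, timers) {
  timers = timers || {
    set: (fn, ms) => setTimeout(fn, ms),
    clear: (handle) => clearTimeout(handle),
  };
  let pending = null; // token identifying the attempt currently in flight

  return function handshake(callback) {
    if (pending) return false; // previous attempt is still undecided: do nothing

    const attempt = (pending = {});
    const finish = (err, result) => {
      if (pending !== attempt) return; // late or duplicate response: ignore it
      timers.clear(timer);
      pending = null;
      callback(err, result);
    };
    // The timer is what declares the attempt failed if no response arrives.
    const timer = timers.set(() => finish(new Error('handshake timeout')), timeoutMs);
    doHandshake(finish);
    return true;
  };
}
```

With this shape, a delayed response that arrives after the timeout is dropped instead of triggering another reconnect, which is the situation the iptables experiment provokes.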

dominiek commented Oct 13, 2011

Any news on this? This is pretty serious

joelanman commented Oct 17, 2011

I have this issue on socket.io 0.8.5. It takes up 100% CPU, and even if I kill node and restart, it starts straight back up with loads of 'client not handshaken' warnings. Any workarounds?

Update:

Tracked it down to a client running on IE9. I killed it and the issue has gone, but surely a single client shouldn't be doing this?

ryanto commented Oct 17, 2011

@joelanman make sure your IE9 client is running socket.io client version 0.8.5.

joelanman commented Oct 20, 2011

thanks - must have just been a client/server mismatch

cowboyrushforth commented Oct 20, 2011

Have been having this problem, even with matching client+server on 0.8.5, but this patch in this thread: #534 seems to have definitely helped.

Contributor

3rd-Eden commented Oct 20, 2011

@cowboyrushforth helped or fixed?

cowboyrushforth commented Oct 20, 2011

I deployed the patch about 36 hours ago, and it seems that the handshake errors continue to slowly decrease over time. (As connected clients finally refresh and download new client side code)

I will continue to keep an eye on the rate of these things and report back in another day or two.

sbellone commented Mar 26, 2013

Hello,
I am also able to easily reproduce it with HAProxy, like @diegovar.

Issue reproduction:
In my case, the problem comes from Socket.IO's handshake mechanism: there are two requests, which means that if we load-balance them, one server instance will receive the first part and a second server will receive the persistent transport session, which fails with this error message.

Actually, with the recommended HAProxy configuration (http://stackoverflow.com/questions/4360221/haproxy-websocket-disconnection/4737648#4737648), it seems to work at first. But that is just because the first www request is redirected to the first available www_backend server (i.e. server1), and then the socket request is redirected to the first available socket_backend server (i.e. server1, and as it points to the same address, it works). Same with the 2nd client, and so on.

But if we restart HAProxy, all clients will try to reconnect at the same time, and the load balancing will mess up the handshake process: we get a huge number of "client not handshaken client should reconnect" warnings.

Solutions:
One solution is to use the source algorithm, which will ensure that for a client, both requests are redirected to the same server. But this will not result in an optimal load balancing.
The second solution is to use the cookies mechanism of HAProxy. This works fine with clients coming from a browser. But I did not find a solution to use cookies with the socket.io-client lib.
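
To illustrate the failure mode and the first solution described above, here is a toy simulation. It does not model HAProxy or socket.io themselves: every name is hypothetical, each backend only remembers the handshakes it issued itself (as with a per-process memory store), and requests are assumed to arrive strictly one after another.

```javascript
// Each backend only knows the sessions it handshook itself.
function makeBackend() {
  const handshaken = new Set();
  return {
    handshake(id) { handshaken.add(id); },
    connect(id) { return handshaken.has(id); }, // false => "client not handshaken"
  };
}

// Round-robin: every request goes to the next backend in turn.
function roundRobin(backends) {
  let next = 0;
  return () => backends[next++ % backends.length];
}

// Source hashing: the client address alone decides the backend.
function bySource(backends) {
  return (clientIp) => {
    let hash = 0;
    for (const ch of clientIp) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
    return backends[hash % backends.length];
  };
}

// Runs the two-request sequence for every client and counts rejected connects.
function failedConnects(pick, clientIps) {
  let failures = 0;
  clientIps.forEach((ip, i) => {
    const id = 'session-' + i;
    pick(ip).handshake(id);                 // request 1: handshake
    if (!pick(ip).connect(id)) failures++;  // request 2: transport connection
  });
  return failures;
}
```

In this simplified model, round-robin over two or more backends always sends the second request to a different backend than the first, so every connect is rejected, while source hashing keeps both requests on one backend. Real traffic interleaves requests from many clients, so the failure rate under round-robin would be lower than in this worst case.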

Questions:
So, as I would like to use the roundrobin algorithm of HAProxy AND the socket.io-client lib, I have two questions:

  • Would it be possible to deactivate the handshake process and have a single direct connection? I don't need access to the header data.
  • Is it possible, in socket.io-client, to access the cookies received during the first part of the handshake and to send them back when trying to establish the persistent connection?

Thanks.

sbellone commented Mar 27, 2013

Ok, I think you can forget my comment.
Using a RedisStore instead of the MemoryStore is the solution in my case. It works super fine! 👍
Thanks, and keep up the good work!

jondubois commented May 19, 2013

OK, I decided to have another look at this issue recently.
The solution I posted a while ago (https://github.com/LearnBoost/socket.io/pull/1120/files) only reduced the occurrence of these errors; it did not stop them altogether.
It seems that the cause of the problem is a race condition related to clustering socket.io across multiple processes.
This issue occurs with the default redis store and also with socket.io-clusterhub.
If you look around line 800 of lib/manager.js, you can see that socket.io responds to the client's connection request before it publishes the handshake notification to other socket.io workers:

res.end(hs); // responds here

self.onHandshake(id, newData || handshakeData);
self.store.publish('handshake', id, newData || handshakeData); // publishes here

So occasionally, the client will know about the handshake before any other socket.io worker does.
When the client gets the response and tries to continue with the connection, it may be handled by a worker
that is not yet aware of the handshake notification, hence the client is not handshaken (according to that particular worker).
It's not practical to check that every worker has in fact received the handshake notification, so I think the best way
to solve this issue is to give the worker a second chance to check the handshake (after some timeout) in case it doesn't see it the first time.

Something like this might work around line 710 (it will probably need a 'handshake timeout' config property to use as the timeout):

Replace:

    if (transport.open) {
      transport.error('client not handshaken', 'reconnect');
    }

    transport.discard();

With:

    if (transport.open) {
      setTimeout(function() {
        // If still open after timeout, THEN we will kill that connection
        if (transport.open) {
          transport.error('client not handshaken', 'reconnect');
        }
        transport.discard();
      }, replaceThisWithTheHandshakeTimeoutVariable);
    }

Note that I am not making a pull request for this because it refers to the pre-1.0 version.
This issue may have been resolved in the new version.

SteveEdson commented May 28, 2013

@Topcloud I'm interested in trying this, what would you recommend the timeout variable to be set to?

jondubois commented May 28, 2013

@SteveEdson I guess you could add a 'handshake timeout' property as a socket.io config option. That may not completely get rid of the error, since there might be other places in the code with similar race conditions. It did reduce the number of failures.

SteveEdson commented May 28, 2013

@Topcloud Thanks, I'll give it a try. Any idea what the actual value should be? Would something like 1 second be too high?

jondubois commented May 28, 2013

@SteveEdson 500ms sounds about right for inter-process communication. You should experiment. Also note that I made a mistake in the code I pasted above, it should be if(transport.open && !this.handshaken[data.id]) instead of the second if(transport.open)...
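
Combining the proposal with the correction above, the "second chance" idea can be modelled in isolation. This is a stand-alone sketch, not a patch to lib/manager.js: `checkHandshake`, its parameters and the injectable `schedule` argument are hypothetical, and the `handshaken` object only mimics the per-worker lookup table referenced in the thread.

```javascript
// Hypothetical model of a worker deciding whether a client is handshaken.
// handshaken: plain object mapping client id -> truthy once the worker knows it.
function checkHandshake(handshaken, id, graceMs, onResult, schedule) {
  schedule = schedule || ((fn, ms) => setTimeout(fn, ms));

  if (handshaken[id]) {
    onResult(true); // already known: accept immediately
    return;
  }

  // Not known yet: the 'handshake' publish from another worker may still be
  // in flight, so look again after the grace period before rejecting.
  schedule(() => onResult(Boolean(handshaken[id])), graceMs);
}
```

A caller would treat `onResult(false)` as the point at which to emit 'client not handshaken' and discard the transport. The 500 ms figure mentioned above would be passed as `graceMs`; as noted in the thread, the right value needs experimentation.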

Billydwilliams commented Jul 22, 2013

@Topcloud
@SteveEdson
It's been a couple months, how have the changes worked for you?

SteveEdson commented Jul 23, 2013

I think it helped, but I've just realised that when I updated socket.io last month, it would have overwritten the change. I'll have to reapply it and see how it goes. Cheers.

jwarkentin commented Aug 21, 2013

Hopefully this helps someone. I just did a little more investigating to try to figure out what's happening in my case and found some interesting stuff. I have several instances of my Socket.io app running behind a load balancer. Normally this is just fine, but sometimes something goes wrong and it falls back to long polling. When this happens, the connections get bounced back and forth between the servers behind the load balancer. If a request hits any server other than the one the client originally authenticated with, it floods the logs with these errors. The simple solution is to use the RedisStore stuff to share session information between running instances. There are also other solutions. Here are some links that may be useful:

http://stackoverflow.com/questions/9617625/client-not-handshaken-client-should-reconnect-socket-io-in-cluster
http://stackoverflow.com/questions/9267292/examples-in-using-redisstore-in-socket-io
https://github.com/LearnBoost/Socket.IO/wiki/Configuring-Socket.IO
https://github.com/fent/socket.io-clusterhub

ruanosaur commented Apr 9, 2014

@Topcloud @SteveEdson Hi guys, any news on whether this would be fixed in v 1.0? Currently using redisCloud on heroku and experiencing the same error - going to try the timeout if v1.0 doesn't fix it...

bossonchan commented May 11, 2014

I am using the Cluster module with RedisStore and can reproduce this bug.
After some exploration, I reached the same conclusion as @Topcloud.
My solution is to use Nginx instead of cluster.
Nginx is a load balancer and it has an extension module, nginx-sticky-module, which can ensure that normal users are always redirected to the same backend server.
Now everything seems to work fine. I hope this helps. :)
(P.S. the versions of node and socket.io are 0.10.28 and 0.9.16)

edgework commented May 22, 2014

I see a lot of people suggesting RedisStore as a solution. Though it may solve the "not handshaken" error, it will introduce a massive memory leak. There are various "solutions" to that as well, but none work reliably.

jwarkentin commented May 22, 2014

With the RedisStore I've also run into problems with leaking sockets. I'm thinking Socket.io-clusterhub is probably a better solution, though I still haven't tried it myself. Ultimately, to really make things work right though there needs to be a reliable way of sharing the handshake data between servers in a cluster. Also, while I haven't looked that deeply into the code, I suspect that all of these handshake sharing mechanisms are susceptible to a race condition where the connection could be attempted before the handshake data has actually been shared with all the servers in the cluster. I haven't seen anything that protects against it (though I could easily have missed something). This race condition is much less likely to be observed though because the network latency within the cluster is almost always going to be less than the latency of the connecting client.

toblerpwn commented May 22, 2014

Cannot speak to the cause, but #1371 has totally removed the 'leaking sockets' issue for us in prod in a fairly high-volume application (100,000+ active users per day & 7,000+ concurrent connections when busy). See my comments in that issue for CPU leak analysis, etc.

Also notably: version 1.0 does not use Stores per se. I forget what they call them now ('connectors'?) - and I'm unsure whether the actual implementation differs, but presumably it does something totally new, because why else change a name? 😄

edgework commented May 22, 2014

We tried 1371 and were getting other failures. I forget which but we had to quickly revert when our app blew up. The failures didn't crop up until actual users connected. Our internal tests worked fine. It probably has something to do with dropping down to XHR polling or something. In any case after it blew up we just gave up on it.

In the end we ditched any kind of store besides the memory store and synchronized the sockets ourselves using redis pub/sub, and we actually saw a decent performance increase.

At some point we'll be presenting our various solutions to Node meet ups etc..

Contributor

rauchg commented May 22, 2014

Agreed. In the 1.0 branch we moved away from synchronization, and now socket.io-redis just does clean pub/sub.

jwarkentin commented May 22, 2014

That's too bad. I've had to disable RedisStore in production because of the problems it was causing. I might still try 1371, especially since I've disabled everything but websockets now. If that doesn't work I'll try clusterhub or write something myself. I need to solve this in the next week or so though. If you have any suggestions I'd love to hear them.

Contributor

rauchg commented May 22, 2014

Sticky (ip) load balancing + pubsub to communicate between nodes.

Check out http://github.com/Guille/weplay for an example of a multi node architecture.

edgework commented May 22, 2014

Basically our solution was to use Redis pub/sub to connect the server instances. If a packet originating from a socket on one server is meant for a socket on another server, it is queued. This includes broadcasts. Every tenth of a second the queue is sent to all the other instances, and they pick up the packets for their sockets and forward them on. Of course, if the packet is meant for a socket on the same server, it's just handled internally and the other servers don't get notified.
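
A rough sketch of the batching scheme described above. An in-process `createBus` stands in for Redis pub/sub, and `createInstance`, `send` and `flush` are hypothetical names, not edgework's actual code; broadcasts are left out for brevity.

```javascript
// In-process stand-in for a pub/sub channel (Redis in the description above).
function createBus() {
  const subscribers = [];
  return {
    subscribe(fn) { subscribers.push(fn); },
    publish(sender, batch) {
      for (const fn of subscribers) fn(sender, batch);
    },
  };
}

function createInstance(name, bus) {
  const localSockets = new Map(); // socket id -> payloads delivered to it
  let queue = [];                 // packets waiting for the next flush

  bus.subscribe((sender, batch) => {
    if (sender === name) return; // skip our own batch
    for (const { to, payload } of batch) {
      // Pick up only the packets addressed to sockets on this instance.
      if (localSockets.has(to)) localSockets.get(to).push(payload);
    }
  });

  return {
    addSocket(id) { localSockets.set(id, []); },
    inbox(id) { return localSockets.get(id); },
    send(to, payload) {
      if (localSockets.has(to)) {
        localSockets.get(to).push(payload); // same server: handled internally
      } else {
        queue.push({ to, payload });        // other server: wait for the flush
      }
    },
    flush() {
      if (queue.length === 0) return;
      const batch = queue;
      queue = [];
      bus.publish(name, batch); // one message per interval instead of one per packet
    },
  };
}
```

In a real deployment `flush` would run on a timer, for example `setInterval(() => instance.flush(), 100)` to match the tenth-of-a-second interval, and the bus would be backed by a Redis channel.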

jwarkentin commented May 22, 2014

I'm already using sticky load balancing, but that doesn't seem to work that well when most people that are connecting are behind some big NATs. @guille does socket.io-redis work with socket.io 0.9.x?

edgework commented May 22, 2014

We're using HAProxy with no problems.

Contributor

rauchg commented Aug 19, 2014

This issue is not relevant for 1.0

@rauchg rauchg closed this Aug 19, 2014

wavded commented Aug 19, 2014

yey! its closed!... it's closed :)
