First, I'd like to say thank you to the Socket.io devs for the great work, and for providing a way to scale this across processes with the Redis store! I think that this is a really promising thing, even though we've been encountering a few issues.
We've been having some problems scaling up socket.io with the Redis store. The problem appears to be that the pubsub is too chatty: each time a socket disconnects, every other process subscribes to that socket's dispatch:[id] channel, waits for the close message, then unsubscribes. Now consider restarting your entire service (say, for a non-backward-compatible release). With a total of n sockets connected across all worker processes, and p processes, each process has to subscribe and then unsubscribe once per socket. Since UNSUBSCRIBE in Redis is linear in the number of subscribers already on the channel (up to p here), that works out to roughly O(n*p^2) work for Redis, and in any case O(n) work for every worker process.
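A back-of-the-envelope sketch of that analysis (the function name and the example figures are mine, not measurements; n and p are as defined above):

```javascript
// Model of the pubsub churn on a full-cluster restart.
// n = total sockets across the cluster, p = worker processes.
// Each of the n disconnects causes each of the p processes to
// SUBSCRIBE and then UNSUBSCRIBE on that socket's dispatch:[id] channel.
function pubsubChurn(n, p) {
  const subscribesPerProcess = n;   // one per disconnecting socket
  const totalSubscribes = n * p;    // across the whole cluster
  // Each UNSUBSCRIBE scans the channel's subscriber list, which holds
  // up to p entries here, so the server-side work is roughly:
  const redisWork = n * p * p;      // O(n * p^2)
  return { subscribesPerProcess, totalSubscribes, redisWork };
}

// e.g. 100k sockets across 36 processes:
console.log(pubsubChurn(100000, 36));
// → { subscribesPerProcess: 100000, totalSubscribes: 3600000, redisWork: 129600000 }
```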
Since we are only targeting websockets, we are working around this by disabling all pubsub channels except handshake, and only doing the initial handshake broadcast (so all processes other than the one that completes the handshake simply discard the data from the 'handshaken' hash as though the handshake was never finished). This appears to mitigate the issues sufficiently that we can run websockets via socket.io in production across our 36 node.js processes on 3 machines, but it still obviously won't scale indefinitely, because every process still has to handle a message on the 'handshake' channel for each handshake initiated.
I propose that this would scale across processes if it used Redis as a store in these cases, rather than publishing all data to all workers. For instance, rather than publishing on 'handshake', write a handshake:[id] key in Redis with a timeout of 30s, and then let the process that completes the handshake update that key (or delete it and write a handshake-completed:[id] key).
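A minimal sketch of that key lifecycle (the key names and 30s TTL come from the suggestion above; `FakeRedis` is an in-memory stand-in for a real Redis client, mimicking only the SETEX/GET/DEL commands the flow needs):

```javascript
// In-memory stand-in for the handful of Redis commands the sketch uses.
class FakeRedis {
  constructor() { this.data = new Map(); }
  setex(key, ttlSeconds, value) { this.data.set(key, value); } // TTL ignored in stub
  get(key) { return this.data.has(key) ? this.data.get(key) : null; }
  del(key) { this.data.delete(key); }
}

const redis = new FakeRedis();

// The initiating process writes handshake:[id] with a 30s TTL instead of
// publishing on the shared 'handshake' channel.
function startHandshake(id, handshakeData) {
  redis.setex('handshake:' + id, 30, JSON.stringify(handshakeData));
}

// Whichever process completes the handshake deletes the pending key and
// writes handshake-completed:[id]; no other process ever hears about it.
function completeHandshake(id) {
  const data = redis.get('handshake:' + id);
  if (data === null) return false; // expired, never started, or already completed
  redis.del('handshake:' + id);
  redis.setex('handshake-completed:' + id, 30, data);
  return true;
}

startHandshake('abc123', { headers: {} });
console.log(completeHandshake('abc123')); // prints true
console.log(completeHandshake('abc123')); // prints false (already completed)
```

The point of the sketch is that handshake state moves through keyed reads and writes, so the work per handshake is constant regardless of how many worker processes are running.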
I understand that this gets more complicated for other transports, where buffering responses becomes important, but it seems like this might be better-solved by storing the buffer as a capped list of JSON objects in Redis (our AJAX short-polling fallback uses that approach, and it works well), rather than using pubsub to pass them around.
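Roughly what I mean by a capped list (in Redis this would be LPUSH followed by LTRIM; the in-memory array and the cap of 100 here are illustrative, not from our production setup):

```javascript
// Capped per-socket message buffer, mimicking LPUSH + LTRIM semantics.
const MAX_BUFFERED = 100; // illustrative cap

function bufferMessage(buffer, message) {
  buffer.unshift(JSON.stringify(message));               // LPUSH buffer:[id] msg
  buffer.length = Math.min(buffer.length, MAX_BUFFERED); // LTRIM buffer:[id] 0 99
}

// A polling worker drains whatever accumulated since the last poll:
function drainBuffer(buffer) {
  return buffer.splice(0, buffer.length).map((s) => JSON.parse(s));
}
```

Since any worker can push to or drain the buffer by key, no pubsub fan-out is needed to keep the processes in sync.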
I apologize for the lengthy and unsolicited suggestion, and we'd love to help if y'all are interested in going this direction. Also, if we're just doing something in a completely wrong or unanticipated way, please let me know. Thanks!
We already talked about this on IRC, and I'm definitely +1 on it, as I'm also seeing the same behavior in my own application.
It would be just as easy to have one listener for handshakes, disconnects, etc., instead of adding new listeners for each connection, which would probably make the stores much smoother to work with and give better overall performance.
Marked as bug + store. But I would love to see a pull request against this ;)
Right now the decision is for us to continue to run with our current hack (turning off all pubsub channels except for the initial handshake propagation). We're not a good candidate for fixing the store, since we only run websockets in production and are happy with our short-polling fallback there for downlevel browsers (which are decreasing in number anyway). Also, it looks like guille's work on websocket.io and engine.io may obviate this kind of work.
News about this?
@brettkiefer mind sharing the hack? :)
I believe this was our final set of patches: https://gist.github.com/2428207
Keep in mind that this only works if you turn off all transports except websockets.
The important part is where we turn off subscription and unsubscription on all channels other than handshake.
+1 any forks?
I implemented brettkiefer's hack and handle messages myself (each node.js process subscribes to the messages channel). I use websockets and flashsockets; those two transports seem to cover all major cases. However, I didn't test it on IE6 and older browsers because I don't care :P
Sticking only to the socket transports is also very good for your server: xhr-polling uses a lot of resources when there are a lot of messages.
I just implemented the hack and so far it seems to be working. Will send it into testing and see if anything else happens.
How will socket.io v1.0 affect this hack? Will the team have fixed the cross-cluster scaling issue by then, so we can stop using the hack?