handshake propagation issue when using cluster #952

Closed
ashwinphatak opened this Issue Jul 11, 2012 · 6 comments

@ashwinphatak

I'm using Node 0.8.1, socket.io 0.9.6, websockets and the cluster module. The Redis store uses pub/sub to communicate handshake events across processes, but looking at the code, it may not behave correctly under heavy load because of a timing issue.

I've experienced the following problem trying to replace Redis with RabbitMQ, but as far as I can tell the problem is timing related and independent of what pubsub tool we use.

Scenario:

Let's say there are two worker processes in the cluster: W1 & W2

The initial request to allocate a client session/websocket (say /socket.io/1/?t=1341994956158) comes to W1, which updates its list of 'handshaken' clients. It also publishes this handshake event so that other processes can update their lists.

Because of clustering, W2 receives the HTTP Upgrade request (say /socket.io/1/websocket/1860678371557773727) before it gets the 'handshake' event published by W1.

W2 doesn't find 1860678371557773727 in the list of 'handshaken' clients, and discards the transport with a "client not handshaken - should reconnect" error.

When the browser tries to reconnect, the same story repeats (with the workers interchanged), so the browser fails to establish a websocket connection with the server even after multiple retries.

If the 'handshake' event sent by W1 reaches W2 before the HTTP Upgrade request, everything seems to work fine.
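
To make the ordering dependency concrete, here is a minimal toy model of the scenario above. It is not socket.io code: `simulate`, the two `Set`s standing in for each worker's 'handshaken' list, and the arrival times are all hypothetical, and the pub/sub layer is reduced to "the event reaches W2 at some time".

```javascript
// Toy model of the race (hypothetical names, not socket.io APIs).
// W1 handles the handshake at t=0 and publishes it; W2 later sees both the
// published 'handshake' event and the HTTP Upgrade, in whichever order they arrive.
function simulate(handshakeArrivesAtW2, upgradeArrivesAtW2) {
  const id = '1860678371557773727';
  const w1Handshaken = new Set();
  const w2Handshaken = new Set();

  // t=0: W1 records the client locally, then publishes the event.
  w1Handshaken.add(id);

  // Everything W2 will see, sorted by arrival time.
  const arrivals = [
    { at: handshakeArrivesAtW2, run: () => { w2Handshaken.add(id); return null; } },
    {
      at: upgradeArrivesAtW2,
      run: () => (w2Handshaken.has(id)
        ? 'connected'
        : 'client not handshaken - should reconnect'),
    },
  ].sort((a, b) => a.at - b.at);

  let outcome = null;
  for (const arrival of arrivals) {
    const result = arrival.run();
    if (result !== null) outcome = result;
  }
  return outcome;
}

console.log(simulate(2, 5)); // event first   -> connected
console.log(simulate(5, 2)); // upgrade first -> client not handshaken - should reconnect
```

The only thing that differs between the two calls is which message reaches W2 first, which is why the failure looks intermittent and gets worse under load.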

Has anyone faced this or similar issues? Or, am I missing something?

@ashwinphatak
```javascript
Manager.prototype.handleUpgrade = function (req, socket, head) {
  var data = this.checkRequest(req)
    , self = this;

  if (!data) {
    if (this.enabled('destroy upgrade')) {
      socket.end();
      this.log.debug('destroying non-socket.io upgrade');
    }

    return;
  }

  req.head = head;

  // HOT FIX
  setTimeout(function() {
    self.handleClient(data, req);
  }, 1000);

  // ORIGINAL: this.handleClient(data, req);
};
```

If we introduce an artificial delay during Upgrade as above, it gives the 'handshake' events enough time to propagate, and the "client not handshaken - should reconnect" errors go away.

I'm not in any way suggesting this as a fix, just using it to illustrate the issue better.
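
For comparison, the same idea can be expressed without a fixed delay: hold an early upgrade only until its own handshake event arrives, and give up after a bounded wait. The sketch below is purely illustrative and is not part of socket.io; `HandshakeGate`, `onHandshake`, `onUpgrade`, `expire` and `maxWaitMs` are hypothetical names, and wiring it into `Manager` is left out.

```javascript
// Hypothetical sketch, not socket.io code: park an upgrade that arrives before
// its 'handshake' event, and release it when the event shows up or a timer fires.
class HandshakeGate {
  constructor(maxWaitMs) {
    this.maxWaitMs = maxWaitMs;
    this.handshaken = new Set(); // ids this worker already knows about
    this.parked = new Map();     // id -> { accept, reject, timer }
  }

  // Call when the pub/sub 'handshake' event for `id` reaches this worker.
  onHandshake(id) {
    this.handshaken.add(id);
    const waiting = this.parked.get(id);
    if (waiting) {
      clearTimeout(waiting.timer);
      this.parked.delete(id);
      waiting.accept();
    }
  }

  // Call when an upgrade request for `id` reaches this worker.
  onUpgrade(id, accept, reject) {
    if (this.handshaken.has(id)) return accept();
    this.expire(id); // reject any older upgrade still parked under the same id
    const timer = setTimeout(() => this.expire(id), this.maxWaitMs);
    this.parked.set(id, { accept, reject, timer });
  }

  // Give up on a parked upgrade whose handshake event never arrived.
  expire(id) {
    const waiting = this.parked.get(id);
    if (!waiting) return;
    clearTimeout(waiting.timer);
    this.parked.delete(id);
    waiting.reject(new Error('client not handshaken - should reconnect'));
  }
}
```

Unlike the 1000 ms `setTimeout` above, upgrades whose handshake is already known proceed immediately, and only the early ones wait. Whether this is appropriate for a real deployment is a separate question; it only illustrates that the failure disappears once the upgrade is ordered after the handshake event.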

@trungnb
trungnb commented Jul 23, 2012

I'm using Node.js, socket.io 0.9.6, nginx patched with the tcp_proxy module, and Redis for scaling socket processes. I'm now stuck in a situation similar to yours. The client cannot complete the handshake with the server (although it sometimes succeeds!), and in the log file I see:

```
debug: websocket writing 2::
debug: set heartbeat timeout for client 4767100961459878228
debug: got heartbeat packet
debug: cleared heartbeat timeout for client 4767100961459878228
debug: set heartbeat interval for client 4767100961459878228
```

The client's connect request does not succeed, so it keeps sending the request again. I would really appreciate any advice. Thanks.

@agubler
agubler commented Nov 13, 2013

@guille @LearnBoost Are there any plans to address this issue?

@gkorland
gkorland commented Feb 8, 2014

+1 blocker

@gkorland

Does anyone know whether the upcoming Socket.io 1.0 still has this issue?

@mkoryak
mkoryak commented Mar 8, 2014

+1

This issue was closed.