Can't get Node to emit connect_failed or reconnect_failed if server is down #311

mixu opened this Issue Sep 27, 2011 · 7 comments


None yet
5 participants

mixu commented Sep 27, 2011

As part of my tests under 0.6.17, I used to take down the server and bring it back up when the client noticed it was disconnected.

Additionally, I used to test specifying invalid connection port information to the client and use the connect_failed event to fall back to the correct port. This is basically because some corporate clients can't access the normal port used for, since it's not 443 or 80 and need to fall back to port 443 (which is a polling-only fallback through an Nginx rule).

However, I cannot get the client to emit connect_failed or reconnect_failed in my tests under Node - which breaks both of the above since I need those events in the client to respond correctly to server failure and connect failure due to firewall blocking the connection.

The test code consists of a runner - a separate process for the server, since we don't care why the server is inaccessible (and doing client disconnects() sends back "booted" which prevents reconnects) - sio.runner.js:

var http = require('http');
var io = require('');
var server = http.createServer(function(req, res) { res.end('Server running.'); } );
server.listen(8123, 'localhost');

And then the test itself:

// Starts the child process with the server
function spawn_server() {
  var srv = require('child_process').spawn('node', ['sio.runner.js'], { cwd: __dirname });
  srv.stdout.on('data', function (data) { console.log('stdout: ' + data); });
  srv.stderr.on('data', function (data) { console.log('stderr: ' + data); });
  srv.on('exit', function(code, signal) { console.log('Child process exited', code, signal); });
  return srv;
var server = spawn_server();
// Need to have a timeout here, since under Node will not retry connections
// if the server isn't up, and it will not be (timing is too quick as spawn happens in the background)
setTimeout(function() {
  var io = require('');
  var socket = io.connect('http://localhost:8123/', {
    transports: ['websocket'],
    'force new connection': true,
    'connect timeout': 1000,
    'reconnection limit': 2,
    'max reconnection attempts': 2
  socket.on('connecting', function() { console.log('SIO Connecting', arguments); });
  socket.on('connect', function() {
    console.log('SIO Connected', arguments);
    // terminate server after connecting
    console.log('Now, kill the server');
  socket.on('reconnecting', function() { console.log('SIO reconnecting', arguments); });
  socket.on('reconnect', function() { console.log('SIO reconnected', arguments); });
  socket.on('disconnect', function() { console.log('SIO disconnect', arguments); });
  socket.on('connect_failed', function() {
    console.log('SIO connect_failed', arguments);
    // here, I would configure the client with an alternative set of settings (omitted for simplicity)
    // and then restart the server -- test passed
    server = spawn_server();
  socket.on('reconnect_failed', function() {
    console.log('SIO reconnect_failed', arguments);
    // here, I would configure the client with an alternative set of settings (omitted for simplicity)
    // and then restart the server -- test passed
    server = spawn_server();
}, 1000);

Now, if you run that code, you get:

SIO Connecting { '0': 'websocket' }
SIO Connected {}
Now, kill the server
Child process exited 1 null
SIO disconnect { '0': undefined }
SIO reconnecting { '0': 500, '1': 1 }
SIO reconnecting { '0': 500, '1': 2 }

Note that you need to wait quite long for the client heartbeat to fail.

I would expect to get a "connect_failed" or "reconnect_failed" event to be emitted before giving up, but can't seem to get one.


mixu commented Sep 27, 2011

Oh, one more thing. To get the reconnect logic to even fire, you need to apply 3rd-Edens one-line patch to prevent Node errors from bubbling up:


After that, you will get the reconnecting events (but not connect_failed / reconnect_failed)

byrion commented Sep 28, 2011

I found the following useful to receive connect_failed events:



mixu commented Sep 29, 2011

@byrion Thanks, but that patch emits a connect_failed under a very specific case (setting a timeout after injecting a script element to detect script loading failures).

It doesn't solve the problem, which is that the *_failed events don't get emitted in the client at all under Node. The core of the problem is that something is wrong with the redoTransports code path:


mixu commented Sep 29, 2011

OK, so here is a bit more info - looking for the cause:

  1. The reason Node stops executing after two reconnects is that while the "non-redoTransports" code path sets a timeout, the redoTransports path does not. Since no timeout is active when the transports are redone, execution terminates when the first redoTransports connect fails.

  2. It seems that the redoTransports path expects a connect_failed event to occur in order to do cleanup:

However, since failures that occur during handshakes cause an "error" event to be emitted rather than some more descriptive event (connect_failed would probably be a better choice), the cleanup never occurs, and neither are the "connect_failed" and "reconnect_failed" events emitted.

In addition, even "error" events during the handshake are suppressed when the socket is reconnecting:

So I think the fixes would include:

  • Emitting connect_failed even when the handshake fails
  • Not suppressing connect_failed / error events when reconnecting, so that the sequence of events would be:
    • connecting
    • connected
    • disconnect
    • reconnecting
    • connect_failed (not emitted currently)
    • ... (more reconnecting + connect_failed events) ...
    • reconnect_failed (not emitted currently)
  • I would argue that the "redoTransports" path is confusing behavior:
    • It does not obey the "max reconnection attempts" setting
    • It does not emit any events when it fails
    • It cannot be disabled! This is a problem since it prolongs the period between "disconnect" and "reconnect_failed", making the user experience worse (long disconnect time) even when I have written logic for the "reconnect_failed" that will use an alternative port/server when the regular config fails. If I would want to retry every transport, I would write that to happen at reconnect_failed explicitly, but it is not a particularly useful as an unconfigurable default.
  • having tests in place where the server goes away to actually test the reconnect workflow (please!)

Having a mandatory "go through all the failed transports again" phase makes the time spent in the disconnected state without being able to apply custom recovery options even longer than necessary. If reconnect_failed happens, that should mean that the developer should do whatever they have planned as the fallback.

So I would remove the redoTransports code path and just emit a reconnect_failed directly + make handshakes emit connect_failed events.

Core devs, any guidance on this?

I've found similar results from my testing. To get around the reconnect_failed message not being sent, I instead use the reconnecting event and look for the last attempt and throw an error message on the screen so the user knows push updates won't occur anymore. I also throw the same message up on an error. Between the both it seems to reliably inform the user of the status of the connection.


amnjdm commented Nov 1, 2011

I am experiencing the same issue with: reconnect_failed
I set max reconnection attempts to 3
Start the node server and my status text is updated to say Connected
I then stop the node server and the status text is updated 3 times with Reconnecting in [num] seconds
Doesn't fire reconnect_failed

Initially I had:
'max reconnection attempts': 3
If I add:
'reconnection limit': 4000
reconnect_failed fires.

So it requires a reconnection limit that matches the max reconnection attempts in ms.

Another example:
'reconnection delay': 500,
'reconnection limit': 32000,
'max reconnection attempts': 6

I did run into an issue with the exponential back off, but I will open a new issue for that.

It turns out those numbers don't even match correctly (attempts and the limit numbers) and the above update doesn't even work properly all the time. I have narrowed down where the logic stops and prevents the firing of reconnect_failed (as long as 'max reconnection attempts' is given a value, reconnection limit is not necessary):

Within the reconnect prototype, the maybeReconnect method contains this logic:

if (self.reconnectionAttempts++ >= maxAttempts) {
if (!self.redoTransports) {
console.log('stuck with redo: ' +self.reconnectionDelay) <-------- logs reconnection delay after final attempt
self.on('connect_failed', maybeReconnect);
self.options['try multiple transports'] = true;
self.transport = self.getTransport();
self.redoTransports = true;
} else {

After the final reconnection attempt, its caught in the if(!self.redoTransports) statement.
Hope this helps fix the issue. I am going to continue to work on the issue.


amnjdm commented Nov 2, 2011

Reading @mixu post above, he makes a valid point about there not being a timeout in the redoTransport logic. So I am going to fork and add: self.reconnectionTimer = setTimeout(maybeReconnect, 1000); to the end of that logic. It adds an additional second to allow whatever that logic is attempting to do and then executes maybeReconnect again with self.redoTransports set to true.

This fork, coupled with my other pull request: LearnBoost#328 fixes both issues I had previously.

@rauchg rauchg closed this Nov 25, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment