TimeoutException during new connection creation stops processing #262

shakuzen · 2017-04-20T01:07:53Z

As asked on the mailing list, there appears to be some unexpected behavior in the logic for creating a new connection.

Scenario

GIVEN: multiple addresses available (a cluster of RabbitMQ nodes); first address (node) is unresponsive, second address (node) is healthy
WHEN: creating a new connection
THEN: a connection is successfully returned

Expected behavior

Connecting to the first node fails (after a timeout, i.e. handshake timeout), but connecting to the second node is attempted and succeeds. A connection to the second node should be returned.

Actual behavior

Connecting to the first node fails and a TimeoutException is thrown; no attempt is made to connect to any other node, as the TimeoutException is uncaught. Note that IOException, on the other hand, is caught and would lead to the next address, if any, being tried.

Solution

Would simply catching TimeoutException and treating it the same as IOException (see specific code portions below) be the right thing to do here?

Supplements

Stacktrace

Relevant portion of a stacktrace from this scenario:

java.util.concurrent.TimeoutException: null
        at com.rabbitmq.utility.BlockingCell.get(BlockingCell.java:77)
        at com.rabbitmq.utility.BlockingCell.uninterruptibleGet(BlockingCell.java:120)
        at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:36)
        at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:372)
        at com.rabbitmq.client.impl.AMQConnection.start(AMQConnection.java:297)
        at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:909)
        at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:859)
        at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:799)
        at org.springframework.amqp.rabbit.connection.AbstractConnectionFactory.createBareConnection(AbstractConnectionFactory.java:351)

Code

The code where this behavior manifests is in com.rabbitmq.client.ConnectionFactory#newConnection(java.util.concurrent.ExecutorService, com.rabbitmq.client.AddressResolver, java.lang.String) (lines 896 - 917):

if (isAutomaticRecoveryEnabled()) {
    // see com.rabbitmq.client.impl.recovery.RecoveryAwareAMQConnectionFactory#newConnection
    AutorecoveringConnection conn = new AutorecoveringConnection(params, fhFactory, addressResolver, metricsCollector);

    conn.init();
    return conn;
} else {
    List<Address> addrs = addressResolver.getAddresses();
    IOException lastException = null;
    for (Address addr : addrs) {
        try {
            FrameHandler handler = fhFactory.create(addr);
            AMQConnection conn = new AMQConnection(params, handler, metricsCollector);
            conn.start();
            this.metricsCollector.newConnection(conn);
            return conn;
        } catch (IOException e) {
            lastException = e;
        }
    }
    throw (lastException != null) ? lastException : new IOException("failed to connect");
}

And for AutorecoveringConnections the same issue would happen (though I haven't actually tried since auto-recovery is disabled in current versions of Spring AMQP). From com.rabbitmq.client.impl.recovery.RecoveryAwareAMQConnectionFactory:

RecoveryAwareAMQConnection newConnection() throws IOException, TimeoutException {
    IOException lastException = null;
    List<Address> shuffled = shuffle(addressResolver.getAddresses());

    for (Address addr : shuffled) {
        try {
            FrameHandler frameHandler = factory.create(addr);
            RecoveryAwareAMQConnection conn = new RecoveryAwareAMQConnection(params, frameHandler, metricsCollector);
            conn.start();
            metricsCollector.newConnection(conn);
            return conn;
        } catch (IOException e) {
            lastException = e;
        }
    }

    throw (lastException != null) ? lastException : new IOException("failed to connect");
}

(not directly related to this issue but I thought the difference of shuffling the address list in the RecoveryAwareAMQConnectionFactory but not in the regular ConnectionFactory was interesting)

The text was updated successfully, but these errors were encountered:

When auto-recovery is disabled. References #262

Fixes #262

acogoluegnes · 2017-05-16T09:38:48Z

@shakuzen I pushed a snapshot of 4.0.3, can you give it a try? Thanks!

acogoluegnes · 2017-05-17T13:51:17Z

@shakuzen I released 4.0.3.RC1 on our Bintray Milestone repo, can you give it a try?

Note there's also 4.1.1.RC1 which contains the bug fix as well.

We'll release both GA as-is in the next few days.

shakuzen · 2017-05-17T17:12:55Z

@acogoluegnes yesterday I gave the snapshots a run through the automated tests for one of our producer applications and did some ad hoc testing. Everything worked fine. Unfortunately, I was never able to reproduce the exact failure described in this issue. I've only ever had it happen in our production environment and my efforts to simulate it locally have been futile. If you have any ideas how the behavior could be forcefully reproduced, please let me know. I tried blocking outbound traffic on the AMQP port for one RabbitMQ instance hoping it would result in a TimeoutException, but it was a different exception that already failed over properly.
Sorry for the delayed reply, I've been at a conference all day.

michaelklishin · 2017-05-17T21:11:18Z

@shakuzen no worries, thank you for the testing. There definitely are problems that are very hard to isolate and the only real way to confirm that they've been addressed is after not seeing them in production for a while.

noahhaon · 2017-05-17T21:44:56Z

@shakuzen I think you could use qdisc to simulate the necessary conditions to reproduce using a delay just under the connection timeout, so the TCP connection is established but the handshake times out:

# export RMQ_SLOW_NODE=10.0.0.1
# tc qdisc add dev eth0 root handle 1: prio
# tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip dst $RMQ_SLOW_NODE flowid 2:1
# tc qdisc add dev eth0 parent 1:1 handle 2: netem delay 5000ms

michaelklishin · 2017-05-17T22:49:41Z

@noahhaon that you, qdisc is a traffic shaping option I personally was unaware of.

noahhaon · 2017-05-17T23:18:39Z

@michaelklishin you bet - FYI the alternative for *BSD and OSX is dummynet via ipfw

http://info.iet.unipi.it/~luigi/dummynet/

Some more details on using netem with qdisc for network fault injection:

https://wiki.linuxfoundation.org/networking/netem

Cheers

acogoluegnes · 2017-05-19T13:12:05Z

@shakuzen You can try the RC on your production environment and tell us if the bug fix works. It's not risky, the only change is the bug fix.

acogoluegnes self-assigned this May 15, 2017

acogoluegnes added bug effort-low labels May 15, 2017

acogoluegnes added this to the 4.0.3 milestone May 15, 2017

acogoluegnes added a commit that referenced this issue May 15, 2017

Handle TimeoutException in ConnectionFactory

518c340

When auto-recovery is disabled. References #262

acogoluegnes mentioned this issue May 15, 2017

DO NOT MERGE Handle TimeoutException in ConnectionFactory #269

Closed

acogoluegnes added a commit that referenced this issue May 15, 2017

Handle TimeoutException in RecoveryAwareAMQConnectionFactory

a77deb9

Fixes #262

acogoluegnes mentioned this issue May 16, 2017

Handle TimeoutException on connection creation #270

Merged

michaelklishin closed this as completed May 16, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TimeoutException during new connection creation stops processing #262

TimeoutException during new connection creation stops processing #262

shakuzen commented Apr 20, 2017

acogoluegnes commented May 16, 2017

acogoluegnes commented May 17, 2017

shakuzen commented May 17, 2017

michaelklishin commented May 17, 2017

noahhaon commented May 17, 2017

michaelklishin commented May 17, 2017

noahhaon commented May 17, 2017 •

edited

Loading

acogoluegnes commented May 19, 2017

TimeoutException during new connection creation stops processing #262

TimeoutException during new connection creation stops processing #262

Comments

shakuzen commented Apr 20, 2017

Scenario

Expected behavior

Actual behavior

Solution

Supplements

Stacktrace

Code

acogoluegnes commented May 16, 2017

acogoluegnes commented May 17, 2017

shakuzen commented May 17, 2017

michaelklishin commented May 17, 2017

noahhaon commented May 17, 2017

michaelklishin commented May 17, 2017

noahhaon commented May 17, 2017 • edited Loading

acogoluegnes commented May 19, 2017

noahhaon commented May 17, 2017 •

edited

Loading