Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TimeoutException during new connection creation stops processing #262

Closed
shakuzen opened this issue Apr 20, 2017 · 8 comments
Closed

TimeoutException during new connection creation stops processing #262

shakuzen opened this issue Apr 20, 2017 · 8 comments
Assignees
Milestone

Comments

@shakuzen
Copy link

As asked on the mailing list, there appears to be some unexpected behavior in the logic for creating a new connection.

/cc @michaelklishin

Scenario

GIVEN: multiple addresses available (a cluster of RabbitMQ nodes); first address (node) is unresponsive, second address (node) is healthy
WHEN: creating a new connection
THEN: a connection is successfully returned

Expected behavior

Connecting to the first node fails (after a timeout, i.e. handshake timeout), but connecting to the second node is attempted and succeeds. A connection to the second node should be returned.

Actual behavior

Connecting to the first node fails and a TimeoutException is thrown; no attempt is made to connect to any other node, as the TimeoutException is uncaught. Note that IOException, on the other hand, is caught and would lead to the next address, if any, being tried.

Solution

Would simply catching TimeoutException and treating it the same as IOException (see specific code portions below) be the right thing to do here?

Supplements

Stacktrace

Relevant portion of a stacktrace from this scenario:

java.util.concurrent.TimeoutException: null
        at com.rabbitmq.utility.BlockingCell.get(BlockingCell.java:77)
        at com.rabbitmq.utility.BlockingCell.uninterruptibleGet(BlockingCell.java:120)
        at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:36)
        at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:372)
        at com.rabbitmq.client.impl.AMQConnection.start(AMQConnection.java:297)
        at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:909)
        at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:859)
        at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:799)
        at org.springframework.amqp.rabbit.connection.AbstractConnectionFactory.createBareConnection(AbstractConnectionFactory.java:351)

Code

The code where this behavior manifests is in com.rabbitmq.client.ConnectionFactory#newConnection(java.util.concurrent.ExecutorService, com.rabbitmq.client.AddressResolver, java.lang.String) (lines 896 - 917):

if (isAutomaticRecoveryEnabled()) {
    // see com.rabbitmq.client.impl.recovery.RecoveryAwareAMQConnectionFactory#newConnection
    AutorecoveringConnection conn = new AutorecoveringConnection(params, fhFactory, addressResolver, metricsCollector);

    conn.init();
    return conn;
} else {
    List<Address> addrs = addressResolver.getAddresses();
    IOException lastException = null;
    for (Address addr : addrs) {
        try {
            FrameHandler handler = fhFactory.create(addr);
            AMQConnection conn = new AMQConnection(params, handler, metricsCollector);
            conn.start();
            this.metricsCollector.newConnection(conn);
            return conn;
        } catch (IOException e) {
            lastException = e;
        }
    }
    throw (lastException != null) ? lastException : new IOException("failed to connect");
}

And for AutorecoveringConnections the same issue would happen (though I haven't actually tried since auto-recovery is disabled in current versions of Spring AMQP). From com.rabbitmq.client.impl.recovery.RecoveryAwareAMQConnectionFactory:

RecoveryAwareAMQConnection newConnection() throws IOException, TimeoutException {
    IOException lastException = null;
    List<Address> shuffled = shuffle(addressResolver.getAddresses());

    for (Address addr : shuffled) {
        try {
            FrameHandler frameHandler = factory.create(addr);
            RecoveryAwareAMQConnection conn = new RecoveryAwareAMQConnection(params, frameHandler, metricsCollector);
            conn.start();
            metricsCollector.newConnection(conn);
            return conn;
        } catch (IOException e) {
            lastException = e;
        }
    }

    throw (lastException != null) ? lastException : new IOException("failed to connect");
}

(not directly related to this issue but I thought the difference of shuffling the address list in the RecoveryAwareAMQConnectionFactory but not in the regular ConnectionFactory was interesting)

@acogoluegnes
Copy link
Contributor

@shakuzen I pushed a snapshot of 4.0.3, can you give it a try? Thanks!

@acogoluegnes
Copy link
Contributor

@shakuzen I released 4.0.3.RC1 on our Bintray Milestone repo, can you give it a try?

Note there's also 4.1.1.RC1 which contains the bug fix as well.

We'll release both GA as-is in the next few days.

@shakuzen
Copy link
Author

@acogoluegnes yesterday I gave the snapshots a run through the automated tests for one of our producer applications and did some ad hoc testing. Everything worked fine. Unfortunately, I was never able to reproduce the exact failure described in this issue. I've only ever had it happen in our production environment and my efforts to simulate it locally have been futile. If you have any ideas how the behavior could be forcefully reproduced, please let me know. I tried blocking outbound traffic on the AMQP port for one RabbitMQ instance hoping it would result in a TimeoutException, but it was a different exception that already failed over properly.
Sorry for the delayed reply, I've been at a conference all day.

@michaelklishin
Copy link
Member

@shakuzen no worries, thank you for the testing. There definitely are problems that are very hard to isolate and the only real way to confirm that they've been addressed is after not seeing them in production for a while.

@noahhaon
Copy link

@shakuzen I think you could use qdisc to simulate the necessary conditions to reproduce using a delay just under the connection timeout, so the TCP connection is established but the handshake times out:

# export RMQ_SLOW_NODE=10.0.0.1
# tc qdisc add dev eth0 root handle 1: prio
# tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip dst $RMQ_SLOW_NODE flowid 2:1
# tc qdisc add dev eth0 parent 1:1 handle 2: netem delay 5000ms

@michaelklishin
Copy link
Member

@noahhaon that you, qdisc is a traffic shaping option I personally was unaware of.

@noahhaon
Copy link

noahhaon commented May 17, 2017

@michaelklishin you bet - FYI the alternative for *BSD and OSX is dummynet via ipfw

http://info.iet.unipi.it/~luigi/dummynet/

Some more details on using netem with qdisc for network fault injection:

https://wiki.linuxfoundation.org/networking/netem

Cheers

@acogoluegnes
Copy link
Contributor

@shakuzen You can try the RC on your production environment and tell us if the bug fix works. It's not risky, the only change is the bug fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants