
Expiring queues should dead-letter messages when a dead-letter-exchange is defined #29

Closed

Conversation

@arobson commented Jan 16, 2014

Present Behavior

When a queue has both a dead-letter-exchange and a queue TTL (x-expires) defined, any messages still in the queue when it expires are lost instead of being dead-lettered.
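For concreteness, here is a minimal pika (Python) sketch of the kind of configuration being described; the broker address, exchange and queue names, TTL value, and fallback topology are illustrative, not taken from this PR:

```python
import pika

# Illustrative setup: a queue with a queue TTL (x-expires) and a dead-letter-exchange.
# Broker address, credentials, and all names are assumptions for this sketch.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Where dead-lettered messages are supposed to end up.
channel.exchange_declare(exchange="recovery", exchange_type="fanout", durable=True)
channel.queue_declare(queue="recovered", durable=True)
channel.queue_bind(queue="recovered", exchange="recovery")

# The queue under discussion: deleted after 30s of disuse, with a DLX set.
# As reported above, messages still sitting in it when it expires are dropped
# instead of being routed to "recovery".
channel.queue_declare(
    queue="transient.work",
    arguments={
        "x-expires": 30000,
        "x-dead-letter-exchange": "recovery",
    },
)
```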

Use Case

One approach to providing isolation guarantees in addition to horizontal scalability is to give every service instance its own queue. Unfortunately, if the connection between the service instance and its queue is severed (or the instance crashes), messages are 'orphaned' in the queue with no elegant way to detect the problem and recover them. (Per-message TTL is very problematic because it depends on average message processing time and queue depth.)

Test Case

In a system that needs to provide some level of isolation guarantees (say, by using a consistent hash exchange with the correlation id as the hash key), each new service instance would define its own queue, bound to the consistent hash exchange, with a TTL and with the consistent hash exchange as its dead-letter-exchange.

Messages are published to the consistent hash exchange and split amongst the queues. One queue's consumer is then terminated; after the TTL expires, the queue is removed and all of its messages are republished to the consistent hash exchange, effectively redistributing them to the remaining queues.
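A rough sketch of that test topology in Python with pika, assuming the rabbitmq_consistent_hash_exchange plugin is enabled; queue names, TTLs, and binding weights are illustrative:

```python
import pika

# Sketch of the test-case topology: one consistent hash exchange, one expiring
# queue per service instance, each dead-lettering back into the same exchange.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

channel.exchange_declare(exchange="work", exchange_type="x-consistent-hash", durable=True)

for instance_queue in ("svc.1", "svc.2", "svc.3"):
    channel.queue_declare(
        queue=instance_queue,
        arguments={
            "x-expires": 60000,                # queue removed after 60s without use
            "x-dead-letter-exchange": "work",  # with this PR, leftover messages go back here
        },
    )
    # For a consistent hash exchange, the binding key is a relative weight.
    channel.queue_bind(queue=instance_queue, exchange="work", routing_key="1")

# Publishers hash on the routing key (e.g. a correlation id) to pin related
# messages to the same instance queue.
channel.basic_publish(exchange="work", routing_key="correlation-42", body=b"payload")
```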

Request For Comments

Please let me know if this behavior would be undesirable. It fits our use case perfectly, but if there's a different set of configurations we could use to get the same benefits, we'd like to know about them.

If there are problems with the code as it's written, I'd be happy to make any changes necessary in order for this PR to be suitable for further consideration.

If my implementation is so problematic that it would be easier to just ignore and have a core committer add the feature properly, I totally understand. I tried it partly for the learning experience and partly out of a desire to potentially contribute to an excellent piece of software I've been using every day for a long time.

Workflow

It looks like GitHub is not the normal avenue for submitting changes, but I'm far less familiar with Hg (Mercurial) and wasn't sure how to (or whether I could) create a ticket in your system for a feature, in order to match the branch to the ticket number.

It says a lot about RabbitMQ's team and implementation that I could pull something like this off in a few hours of source diving. Sorry if I've made a big mess; as I mentioned, this would be a huge win for us going forward if we could get this behavior implemented.

…o dead-letter messages in an expiring queue when a dead-letter-exchange was provided
@arobson (Author) commented Jan 16, 2014

Found an edge case in this code: when using a DLX that points to the expiring queue, the dead-letter feature sees this as a cycle and does not re-publish the messages. I tried calling rabbit_bindings:remove_from_destination to remove the bindings before dead-lettering the messages but it appears to still detect the cycle. All this to say, this PR needs more work.

…remove bindings from both the expiring queue (destination) and the DLX exchange, if they exist, to prevent the consistent hash exchange from routing dead-lettered messages back to the expiring queue.
@arobson (Author) commented Jan 17, 2014

This update cleans up the way the original code was written in addition to resolving the undesired behavior when used with a consistent-hash exchange as the DLX.

Interested to hear back on whether this seems like a feasible approach.

@simonmacmullen expressed concerns about this feature in general when used with large queues that have messages on disk. I'd imagine the performance would be poor, but I know I could cause similar behavior today by setting a DLX and a message TTL on a queue, throwing tons of messages at it until they hit disk, and then letting them start expiring.

@drapp commented Sep 12, 2014

Any word on this pull request? I could also use this functionality. It's very hard to have expiring queues without this. I'm trying to have a system that routes to queues that are transient, and I want to have a combination of an alternate exchange and dead lettering to ensure that messages that don't get consumed end up in a fallback queue.

If I unbind each transient queue before its consumer goes away, I can make sure the messages expire and get dead-lettered before the queue goes away, and once the queue is unbound new messages flow through the alternate exchange. There's a big hole if the consumer dies suddenly, though: if a queue expires while messages are still flowing into it, those messages are lost.
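Roughly, that setup might look like the following pika sketch (the names, TTL values, and the direct-exchange routing are assumptions on my part, not part of this PR):

```python
import pika

# Sketch: unroutable messages fall through the alternate exchange, expired
# messages are dead-lettered, and both paths land in the same fallback queue.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

channel.exchange_declare(exchange="fallback.x", exchange_type="fanout", durable=True)
channel.queue_declare(queue="fallback", durable=True)
channel.queue_bind(queue="fallback", exchange="fallback.x")

channel.exchange_declare(
    exchange="router",
    exchange_type="direct",
    durable=True,
    arguments={"alternate-exchange": "fallback.x"},  # catches unroutable messages
)

channel.queue_declare(
    queue="transient.worker",
    arguments={
        "x-expires": 30000,                    # the transient queue itself goes away
        "x-message-ttl": 15000,                # messages age out (and dead-letter) first
        "x-dead-letter-exchange": "fallback.x",
    },
)
channel.queue_bind(queue="transient.worker", exchange="router", routing_key="worker")

# On orderly shutdown: unbind first, so new messages take the alternate-exchange
# path while the remaining ones expire and are dead-lettered to the fallback queue.
channel.queue_unbind(queue="transient.worker", exchange="router", routing_key="worker")
```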

If there's a workaround I'd be open to it, but this PR seems like a fix for a dangerous bug.

@michaelklishin (Member)

There is still no consensus about whether this feature is worth including, given the edge cases (e.g. huge queues with most of the messages on disk).

@arobson (Author) commented Mar 19, 2015

@michaelklishin - just following up: I realize the edge case could cause performance issues, but the same edge case already exists for any queue that is backlogged and has a dead-letter exchange and a long enough message TTL. In other words, there are already other ways to create performance problems if one doesn't take the proper precautions when designing a topology.

Would the Rabbit team consider including cautionary language around this feature as part of the documentation? I've built several systems around Rabbit as the messaging backbone, and there has not been a single case where message loss was preferable to temporary performance degradation. The easy way to prevent the performance issue with this feature is to ensure your queues aren't being overrun; with a consistent-hash exchange (or exchange-queue pairs for each service instance), that is as simple as spinning up an additional service instance.

Without this feature, we're left with developing an entire subsystem to monitor queues and consumers: one that waits for a queue's consumer to drop, unbinds the queue from everything, directs an existing consumer to pick up the slack, and then monitors the queue so it can be removed once empty (a rough sketch of such a monitor follows the list below). This is quite problematic and has the following undesirable consequences:

  • load imbalance across consumers
  • uneven latency spikes in processing messages from the orphaned queue (since it can't be evenly redistributed)
  • constant load on rabbit's HTTP API, especially if you need to catch the issue quickly to minimize time lost
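For reference, the monitor mentioned above ends up being something like this sketch against the management HTTP API (the URL, credentials, polling approach, and queue-name prefix are all assumptions):

```python
import requests

# Poll the RabbitMQ management HTTP API for service queues that have lost
# their consumer. URL, credentials, and the "svc." prefix are illustrative.
API = "http://localhost:15672/api/queues/%2F"  # %2F = default vhost
AUTH = ("guest", "guest")

def orphaned_queues(prefix="svc."):
    """Names of service queues that currently have no consumers."""
    queues = requests.get(API, auth=AUTH, timeout=5).json()
    return [
        q["name"]
        for q in queues
        if q["name"].startswith(prefix) and q.get("consumers", 0) == 0
    ]

# A separate worker would then unbind each orphaned queue, direct a surviving
# consumer to drain it, and delete it once empty -- exactly the machinery this
# pull request would make unnecessary.
print(orphaned_queues())
```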

Distributed mutual exclusion is a very difficult problem to solve. I believe that with the inclusion of this feature (or a selective-consumer feature), Rabbit would be uniquely positioned to remove or at least greatly reduce the role of a distributed locking mechanism. Without this capability, we're left with complex work-arounds and a new set of problems to address.

@michaelklishin (Member)

I am positive about including this feature. Let's wait for other team members' opinions.

@dumbbell (Member)

I agree this seems a useful feature. It may even be what users expect when using TTL on a queue.

@simonmacmullen (Contributor)

I don't think anyone is debating whether dlx-on-queue-delete is a useful feature in the abstract. It totally is!

My concerns with the design of this pull request are:

  1. It doesn't use flow control; therefore, if the deleted queue is large, it can spam the queues behind the DLX with messages fast enough to use up all available memory and crash the server.

Yes, theoretically this issue already exists with messages getting DLXed by other means, but it's quite hard to hit; this patch makes it very easy. There's a good chance that deleting any large queue with a DLX set would trigger it, so this really needs addressing.

  2. It doesn't guarantee messages will not be lost when queues with a DLX set are deleted (e.g. if you shut down at the wrong time). That's more fixable by documentation, though.

I also have concerns with the implementation:

  1. The bit about removing bindings first is dubious; other clients could add bindings back after that point. And calling into exchange callbacks is especially dubious: it happens to work for the CHX, but I am not at all convinced that it would do the right thing for (e.g.) federation. See @rade's previous discussion here:

http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2014-January/033354.html

Now, @rade was not too concerned with offering solutions there beyond "it's complicated". But I would suggest that doing this in the termination phase (after rabbit_amqqueue:internal_delete/1 has already happened) is definitely a preferable way forward.

You still have the issue with "dark queues" he mentioned, but there is now infrastructure that could let you fix that too; we can show crashed queues which are there-but-dead, storing the fact that they're crashed in Mnesia:

terminate(_Reason, State = #q{q = Q}) ->

This could be extended to support deleting queues too.

@arobson (Author) commented Mar 20, 2015

@simonmacmullen - I can try to address your concerns, but I won't pretend that I understand Rabbit well enough to handle it. It's been over a year since I've even looked at this code; I was actually pleasantly surprised to see the PR hadn't been closed.

If someone on the core team wants to take on adding the feature properly and addressing your concerns, we can certainly close this out. Otherwise, do you have recommendations on how I could start working towards addressing your concerns and get some team feedback? Would you prefer I just continue making commits to this branch and request review?

Thanks for being open to discussion and for the guidance provided so far.

@michaelklishin (Member)

I'll take the liberty of closing this. Not because it's a feature we don't want, but because the implementation doesn't cover some edge cases we are quite worried about.

@dumbbell added this to the n/a milestone on Sep 11, 2015