
Expiring queues should dead-letter messages when a dead-letter-exchange is defined #29

Closed

Conversation

@arobson commented Jan 16, 2014

Present Behavior

When a queue has both a dead-letter-exchange and a queue TTL (x-expires) defined, any messages still in the queue when it expires are lost instead of being dead-lettered.
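For concreteness, here is a minimal pika (Python) sketch of the kind of configuration being described; the broker address, exchange and queue names, TTL value, and fallback topology are illustrative, not taken from this PR:

```python
import pika

# Illustrative setup: a queue with a queue TTL (x-expires) and a dead-letter-exchange.
# Broker address, credentials, and all names are assumptions for this sketch.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Where dead-lettered messages are supposed to end up.
channel.exchange_declare(exchange="recovery", exchange_type="fanout", durable=True)
channel.queue_declare(queue="recovered", durable=True)
channel.queue_bind(queue="recovered", exchange="recovery")

# The queue under discussion: deleted after 30s of disuse, with a DLX set.
# As reported above, messages still sitting in it when it expires are dropped
# instead of being routed to "recovery".
channel.queue_declare(
    queue="transient.work",
    arguments={
        "x-expires": 30000,
        "x-dead-letter-exchange": "recovery",
    },
)
```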

Use Case

One approach to providing isolation guarantees in addition to horizontal scalability is to give every service instance its own queue. Unfortunately, if the connection between the service instance and its queue is severed (or the instance crashes), messages are 'orphaned' in the queue with no elegant way to detect the problem and recover them. (Per-message TTL is very problematic because it depends on average message processing time and queue depth.)

Test Case

In a system that needs to provide some level of isolation guarantees (say, by using a consistent hash exchange with the correlation id as the hash key), each new service instance would define its own queue, bound to the consistent hash exchange, with a TTL and with the consistent hash exchange as its dead-letter-exchange.

Messages are published to the consistent hash exchange and split amongst the queues. One queue's consumer is then terminated; after the TTL expires, the queue is removed and all of its messages are republished to the consistent hash exchange, effectively redistributing them to the remaining queues.
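A rough sketch of that test topology in Python with pika, assuming the rabbitmq_consistent_hash_exchange plugin is enabled; queue names, TTLs, and binding weights are illustrative:

```python
import pika

# Sketch of the test-case topology: one consistent hash exchange, one expiring
# queue per service instance, each dead-lettering back into the same exchange.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

channel.exchange_declare(exchange="work", exchange_type="x-consistent-hash", durable=True)

for instance_queue in ("svc.1", "svc.2", "svc.3"):
    channel.queue_declare(
        queue=instance_queue,
        arguments={
            "x-expires": 60000,                # queue removed after 60s without use
            "x-dead-letter-exchange": "work",  # with this PR, leftover messages go back here
        },
    )
    # For a consistent hash exchange, the binding key is a relative weight.
    channel.queue_bind(queue=instance_queue, exchange="work", routing_key="1")

# Publishers hash on the routing key (e.g. a correlation id) to pin related
# messages to the same instance queue.
channel.basic_publish(exchange="work", routing_key="correlation-42", body=b"payload")
```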

Request For Comments

Please let me know if this behavior would be undesirable. It fits our use case perfectly, but if there's a different set of configurations we could use to get the same benefits, we'd like to know about them.

If there are problems with the code as it's written, I'd be happy to make any changes necessary in order for this PR to be suitable for further consideration.

If my implementation is so problematic that it would be easier to just ignore and have a core committer add the feature properly, I totally understand. I tried it partly for the learning experience and partly out of a desire to potentially contribute to an excellent piece of software I've been using every day for a long time.

Workflow

It looks like GitHub is not the normal avenue for submitting changes, but I'm far less familiar with Hg (Mercurial) and wasn't sure how to (or whether I could) create a ticket in your system for a feature, in order to match the branch to the ticket number.

It says a lot about RabbitMQ's team and implementation that I could pull something like this off in a few hours of source diving. Sorry if I've made a big mess; as I mentioned, this would be a huge win for us going forward if we could get this behavior implemented.

…o dead-letter messages in an expiring queue when a dead-letter-exchange was provided
@arobson (Author) commented Jan 16, 2014

Found an edge case in this code: when using a DLX that points to the expiring queue, the dead-letter feature sees this as a cycle and does not re-publish the messages. I tried calling rabbit_bindings:remove_from_destination to remove the bindings before dead-lettering the messages but it appears to still detect the cycle. All this to say, this PR needs more work.

…remove bindings from both the expiring queue (destination) and the DLX exchange, if they exist, to prevent the consistent hash exchange from routing dead-lettered messages back to the expiring queue.
@arobson (Author) commented Jan 17, 2014

This update cleans up the way the original code was written in addition to resolving the undesired behavior when used with a consistent-hash exchange as the DLX.

Interested to hear back on whether this seems like a feasible approach.

@simonmacmullen expressed concerns about this feature in general when used with large queues that have messages on disk. I'd imagine the performance would be poor, but I know I could cause similar behavior today by setting a DLX and a message TTL on a queue, throwing tons of messages at it until they hit disk, and then letting them start expiring.

@drapp commented Sep 12, 2014

Any word on this pull request? I could also use this functionality. It's very hard to have expiring queues without this. I'm trying to have a system that routes to queues that are transient, and I want to have a combination of an alternate exchange and dead lettering to ensure that messages that don't get consumed end up in a fallback queue.

If I unbind each transient queue before its consumer goes away, I can make sure the messages expire and get dead-lettered before the queue goes away, and once the queue is unbound new messages flow through the alternate exchange. There's a big hole if the consumer dies suddenly, though: if a queue expires while messages are still flowing into it, those messages are lost.
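Roughly, that setup might look like the following pika sketch (the names, TTL values, and the direct-exchange routing are assumptions on my part, not part of this PR):

```python
import pika

# Sketch: unroutable messages fall through the alternate exchange, expired
# messages are dead-lettered, and both paths land in the same fallback queue.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

channel.exchange_declare(exchange="fallback.x", exchange_type="fanout", durable=True)
channel.queue_declare(queue="fallback", durable=True)
channel.queue_bind(queue="fallback", exchange="fallback.x")

channel.exchange_declare(
    exchange="router",
    exchange_type="direct",
    durable=True,
    arguments={"alternate-exchange": "fallback.x"},  # catches unroutable messages
)

channel.queue_declare(
    queue="transient.worker",
    arguments={
        "x-expires": 30000,                    # the transient queue itself goes away
        "x-message-ttl": 15000,                # messages age out (and dead-letter) first
        "x-dead-letter-exchange": "fallback.x",
    },
)
channel.queue_bind(queue="transient.worker", exchange="router", routing_key="worker")

# On orderly shutdown: unbind first, so new messages take the alternate-exchange
# path while the remaining ones expire and are dead-lettered to the fallback queue.
channel.queue_unbind(queue="transient.worker", exchange="router", routing_key="worker")
```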

If there's a workaround I'd be open to it, but this PR seems like a fix for a dangerous bug.

@michaelklishin (Member)

There is still no consensus about whether this feature is worth including, given the edge cases (e.g. huge queues with most of the messages on disk).

@arobson (Author) commented Mar 19, 2015

@michaelklishin - just following up: I realize the edge case could cause performance issues, but the same edge case already exists for any queue that is backlogged and has a dead-letter exchange and a long enough message TTL. In other words, there are already other ways to create performance problems if one doesn't take the proper precautions when designing a topology.

Would the Rabbit team consider including cautionary language around this feature as part of the documentation? I've built several systems around Rabbit as the messaging backbone, and there has not been a single case where message loss was preferable to temporary performance degradation. The easy way to prevent the performance issue with this feature is to ensure your queues aren't being overrun; with a consistent-hash exchange (or exchange-queue pairs for each service instance), that is as simple as spinning up an additional service instance.

Without this feature, we're left with developing an entire subsystem to monitor queues and consumers: one that waits for a queue's consumer to drop, unbinds the queue from everything, directs an existing consumer to pick up the slack, and then monitors the queue so it can be removed once empty (a rough sketch of such a monitor follows the list below). This is quite problematic and has the following undesirable consequences:

  • load imbalance across consumers
  • uneven latency spikes in processing messages from the orphaned queue (since it can't be evenly redistributed)
  • constant load on rabbit's HTTP API, especially if you need to catch the issue quickly to minimize time lost
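For reference, the monitor mentioned above ends up being something like this sketch against the management HTTP API (the URL, credentials, polling approach, and queue-name prefix are all assumptions):

```python
import requests

# Poll the RabbitMQ management HTTP API for service queues that have lost
# their consumer. URL, credentials, and the "svc." prefix are illustrative.
API = "http://localhost:15672/api/queues/%2F"  # %2F = default vhost
AUTH = ("guest", "guest")

def orphaned_queues(prefix="svc."):
    """Names of service queues that currently have no consumers."""
    queues = requests.get(API, auth=AUTH, timeout=5).json()
    return [
        q["name"]
        for q in queues
        if q["name"].startswith(prefix) and q.get("consumers", 0) == 0
    ]

# A separate worker would then unbind each orphaned queue, direct a surviving
# consumer to drain it, and delete it once empty -- exactly the machinery this
# pull request would make unnecessary.
print(orphaned_queues())
```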

Distributed mutual exclusion is a very difficult problem to solve. I believe that with the inclusion of this feature (or a selective-consumer feature), Rabbit would be uniquely positioned to remove or at least greatly reduce the role of a distributed locking mechanism. Without this capability, we're left with complex work-arounds and a new set of problems to address.

@michaelklishin (Member)

I am positive about including this feature. Let's wait for other team members' opinions.

@dumbbell (Member)

I agree this seems a useful feature. It may even be what users expect when using TTL on a queue.

@simonmacmullen (Contributor)

I don't think anyone is debating whether dlx-on-queue-delete is a useful feature in the abstract. It totally is!

My concerns with the design of this pull request are:

  1. It doesn't use flow control; therefore, if the deleted queue is large, it can spam the queues behind the DLX with messages fast enough to use up all available memory and crash the server.

Yes, theoretically this issue already exists with messages getting DLXed by other means, but it's quite hard to hit; this patch makes it very easy. There's a good chance that deleting any large queue with a DLX set would trigger it, so this really needs addressing.

  2. It doesn't guarantee messages will not be lost when queues with a DLX set are deleted (e.g. if you shut down at the wrong time). That's more fixable by documentation, though.

I also have concerns with the implementation:

  1. The bit about removing bindings first is dubious; other clients could add bindings back after that point. And calling into exchange callbacks is especially dubious: it happens to work for the CHX, but I am not at all convinced that it would do the right thing for (e.g.) federation. See @rade's previous discussion here:

http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2014-January/033354.html

Now, @rade was not too concerned with offering solutions there beyond "it's complicated". But I would suggest that doing this in the termination phase (after rabbit_amqqueue:internal_delete/1 has already happened) is definitely a preferable way forward.

You still have the issue with "dark queues" he mentioned, but there is now infrastructure that could let you fix that too; we can show crashed queues which are there-but-dead, storing the fact that they're crashed in Mnesia:

terminate(_Reason, State = #q{q = Q}) ->

This could be extended to support deleting queues too.

@arobson (Author) commented Mar 20, 2015

@simonmacmullen - I can try to address your concerns, but I won't pretend that I understand Rabbit well enough to handle it. It's been over a year since I've even looked at this code; I was actually pleasantly surprised to see the PR hadn't been closed.

If someone on the core team wants to take on adding the feature properly and addressing your concerns, we can certainly close this out. Otherwise, do you have recommendations on how I could start working towards addressing your concerns and get some team feedback? Would you prefer I just continue making commits to this branch and request review?

Thanks for being open to discussion and for the guidance provided so far.

@michaelklishin (Member)

I'll take the liberty of closing this. Not because it's a feature we don't want, but because the implementation doesn't cover some edge cases we are quite worried about.

@dumbbell added this to the n/a milestone on Sep 11, 2015