Quorum Queue: Avoid unbounded log growth due to returned (reject, requeue=true) messages #3776

Closed
Tracked by #3121
kjnilsson opened this issue Nov 19, 2021 · 2 comments
kjnilsson commented Nov 19, 2021

Quorum queues do not handle repeated re-queues very well, as they require consumption to proceed in a roughly FIFO-ish order so that regular snapshots can be taken and the on-disk log truncated.
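The dependency between consumption order and log truncation can be sketched as follows (a minimal Python illustration, not RabbitMQ code; `truncation_point` and the data shapes are invented for this sketch):

```python
# Hypothetical sketch: the on-disk raft log can only be truncated up to
# just before the oldest *live* message's raft index, so one old message
# that keeps being returned pins the entire log behind it.

def truncation_point(live_messages):
    """Highest raft index the log may be truncated at.

    live_messages maps raft_index -> payload for every message still in
    the queue (ready or checked out by a consumer).
    """
    if not live_messages:
        return None  # nothing live: a snapshot can cover the whole log
    # Every entry from the oldest live index onward must be retained.
    return min(live_messages) - 1

# A single repeatedly re-queued message at index 5 holds back
# truncation even though the queue has advanced to index ~900:
live = {5: b"m5", 900: b"m900", 901: b"m901"}
print(truncation_point(live))  # → 4
```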

Unfortunately, many client consumer applications do not take action appropriate to the kind of consumer failure and simply re-queue in response to any failure. This will remain an anti-pattern, of course: there is little benefit in re-queuing messages on every failure, as they will just be delivered back to the consumer shortly after.

When a return command arrives at the quorum queue, rather than just moving the message record from the consumer back to the returns queue, we could instead convert it into a new command: replace_return, which appends the original message as part of the command. Once this command is applied, that raft index is no longer needed, which may allow us to increment the "release_cursor" (and truncate the log). Effectively we are re-writing the message at the front of the log, making its prior raft index superfluous.

Of course there is a risk that writing more data just exacerbates the problem, as there may be other messages still live with lower raft indexes. To help with this we could add a check to only re-write the message if its raft index is the lowest live message in the queue.

After implementing the struck-through suggestion above it soon became clear that this did not work. Because re-queues are still returned to the front of the queue, any ready messages that do not fit into the consumer's prefetch will never be processed and will thus hold the log back.

To handle the case where a consumer continuously returns messages we need to cycle the entire queue in FIFO (ish) order. To do so, explicit re-queues should be re-written and returned to the back of the queue. This will ensure ordered processing and timely snapshotting.

However, if a queue has a delivery_limit set, it would be acceptable to return to the front as is currently done, since re-processing attempts are limited and will thus ensure FIFO-ish progression through the queue.
In this case we can switch the behaviour: return at the front of the queue when a delivery_limit is set, and return at the back (with a message re-write) when no delivery_limit is set.
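The behaviour switch described above might look roughly like this (a hedged Python sketch; `handle_return`, the message shape, and the raft-index handling are hypothetical simplifications, not the actual implementation):

```python
# Hypothetical sketch of the proposed return handling. With a
# delivery_limit, bounded redelivery attempts make front-of-queue
# returns safe; without one, the message is re-written under a fresh
# raft index and placed at the back so the queue cycles in FIFO-ish
# order and the old log entry becomes superfluous.

from collections import deque

def handle_return(queue, msg, next_raft_index, delivery_limit=None):
    """Return the raft index the message is live under after the return."""
    if delivery_limit is not None:
        # Bounded redeliveries: return to the front, as today.
        queue.appendleft(msg)
        return msg["raft_index"]
    # Unbounded: re-write the message as a new command at the back;
    # its old raft index no longer pins the log.
    rewritten = {**msg, "raft_index": next_raft_index}
    queue.append(rewritten)
    return next_raft_index
```

Usage: with `delivery_limit=3` a returned message keeps its original index at the front; with no limit it reappears at the back under the new index.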

kjnilsson commented Nov 22, 2021

There are complications, of course. If, say, we only re-write a returned message when it is the oldest message on the queue, then a subsequent return from the same consumer may re-order the returns. Where there are multiple consumers this is ok-ish, as queue ordering guarantees are effectively random for at-least-once consumers, but with single active consumer we want returns to retain their original order. So once the queue starts re-writing returns for a consumer, it needs to re-write all future returns as well to retain order.

Rewriting all returns is of course more expensive and uses more disk space if for some reason the log still cannot be truncated (i.e. we return messages that are not the oldest message on disk and the oldest message is still retained by a faulty consumer).
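The "sticky" re-write rule for single active consumers could be expressed as follows (a hypothetical Python sketch of the invariant only; names are invented):

```python
# Hypothetical sketch: once a single-active-consumer queue re-writes one
# return for a consumer, it must re-write every later return from that
# consumer as well, otherwise a re-written (back-of-queue) return could
# overtake a plain (front-of-queue) return and invert their order.

class ConsumerState:
    def __init__(self):
        self.rewriting = False  # becomes sticky after the first re-write

    def should_rewrite(self, msg_raft_index, oldest_live_index):
        if self.rewriting:
            return True  # retain order: all subsequent returns re-written
        if msg_raft_index == oldest_live_index:
            self.rewriting = True  # first qualifying re-write flips the flag
            return True
        return False
```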

The above will never work, though: the state needed to check whether a message is the oldest in the queue is not deterministic at recovery (as we do not store the raft index lookup in the snapshot).

@kjnilsson
Done as part of #3221

@michaelklishin michaelklishin added this to the 3.10.0 milestone Mar 24, 2022