
v2 design ideas #229

Open
michaelklishin opened this issue May 26, 2023 · 2 comments
michaelklishin commented May 26, 2023

Updates

The design below evolved over time. The last major update is from Sep 26, 2023.

Problem Definition

The current design of this plugin has several limitations that make it unsuitable for many environments. They are called out in the README, but to reiterate:

  • All delayed messages are stored in a non-replicated Mnesia table
  • The only metric provided is the total number of delayed messages
  • Mnesia has a lot of peculiarities in how it approaches recovery from failures. It will be removed from RabbitMQ as of 4.0, so we need a replacement

These limitations call for a new plugin with the same goal but a different design. For lack of a more creative name, let's call it Delayed Message Exchange v2, or simply "v2".

This issue is an outline of some of the ideas our team has. I am intentionally filing it as
an issue, despite this being just a number of ideas.

You are welcome to share your ideas in the comments as long as

  1. The goal does not venture too far from what this plugin currently can do. In other words, the goal is to keep it small and focussed, and some ideas will not be accepted
  2. The discussion remains civil, unlike some other issues in this repo. If it ends up being too heated, it will be locked. Insults towards the maintainers will not be tolerated

Where to Store Messages

This plugin stores messages for future delivery, plus some metadata about them:

  • When the message is due for re-publishing
  • Where to publish it
  • How to route it
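The per-message record implied by the bullets above can be sketched roughly as follows (a Python illustration; all field names are hypothetical, not taken from any actual implementation):

```python
from dataclasses import dataclass
import time

# Hypothetical sketch of a delayed message record; field names are
# illustrative, not taken from the plugin's actual code.
@dataclass
class DelayedMessage:
    message_id: str      # unique ID, usable as the key in a K/V store
    due_at_ms: int       # when the message is due for re-publishing
    exchange: str        # where to publish it
    routing_key: str     # how to route it
    payload: bytes       # opaque message body

def make_delayed(message_id, exchange, routing_key, delay_ms, payload=b""):
    # "delay by" semantics: the due time is computed at publish time
    due_at_ms = int(time.time() * 1000) + delay_ms
    return DelayedMessage(message_id, due_at_ms, exchange, routing_key, payload)
```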

This information should be replicated. Unlike in the early 2010s, when this plugin was first envisioned, modern RabbitMQ versions have a mature and efficient Raft implementation
plus a few more sub-projects that can be considered for distributed plugins such
as this one in v2.

Streams do not allow for random reads of individual
messages, so using a stream or an Osiris log directly would only be possible
as a replication mechanism that local instances of the plugin will read from and
put the data into a suitable store providing random reads.

Perhaps ironically, we need a classic, efficient disk-based key/value store,
with local storage provided by something like

and the distribution layer provided by Ra. That example Ra-based K/V store would
have to be matured beyond its current use for running Jepsen tests against Ra.

This store would allow for random reads of individual messages stored for later publishing
and needs to provide a very minimalistic API.

Rust-based stores such as Marble can be accessed via Rustler.

Where to Store Metadata

RabbitMQ is in the middle of a transition to Khepri, a Raft-based tree-structured data store we
have developed for storing schema.

Khepri will be first introduced in RabbitMQ 3.13 and is going to be merged into main.

Specifically, Khepri is suitable for storing an index of delayed messages:

  • Timestamps can be used as keys on which we sort/filter
  • Values can be message IDs (or mc/message container data structures if we target RabbitMQ 3.13 from the start)

The index will help locate the IDs of messages that are up for delivery. The messages
themselves, possibly with their metadata, can be loaded from a durable key/value store
described above.
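As a rough illustration of how such a timestamp-keyed index could behave (plain Python, with a sorted list standing in for Khepri's tree paths; the class and method names are invented for this sketch):

```python
import bisect

# Minimal sketch of a timestamp-keyed index of delayed messages:
# keys are due timestamps (ms), values are lists of message IDs.
class DelayIndex:
    def __init__(self):
        self._keys = []        # sorted due timestamps
        self._ids_by_ts = {}   # timestamp -> [message_id, ...]

    def add(self, due_at_ms, message_id):
        if due_at_ms not in self._ids_by_ts:
            bisect.insort(self._keys, due_at_ms)
            self._ids_by_ts[due_at_ms] = []
        self._ids_by_ts[due_at_ms].append(message_id)

    def pop_due(self, now_ms):
        # Return and remove the IDs of all messages due at or before now_ms.
        cut = bisect.bisect_right(self._keys, now_ms)
        due_keys, self._keys = self._keys[:cut], self._keys[cut:]
        ids = []
        for ts in due_keys:
            ids.extend(self._ids_by_ts.pop(ts))
        return ids
```

The messages themselves would then be fetched from the K/V store by ID.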

Because Khepri will be available via a feature flag in 3.13, this plugin will have to
require that flag.

Using Fewer Timers

Assuming that all metadata, including expiration, is stored in Khepri in a way that makes it
possible to easily query a subset of delayed messages, this plugin can use a very small
number of timers, for example, just one or one per virtual host.

Not having a lot of far-in-the-future timers has another benefit: the current (2^32 - 1)
limitation for timer intervals will no longer apply. The timer will be used as a periodic
tick process that finds out what messages are up for re-publishing, loads their metadata and
hands it all off to a different process.

This way, if someone wants to delay a message for more than ≈ 24 days, they will be able to
do it, even though I would still not recommend it.
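A minimal sketch of such a periodic tick process (in Python; `find_due` and `republish` are hypothetical callbacks standing in for the Khepri query and the hand-off to a publishing process):

```python
import threading, time

# Sketch of a single periodic tick (e.g. one per node or per virtual host)
# instead of one timer per delayed message. Because the tick fires at a
# short, fixed interval, no single timer ever needs a far-in-the-future
# (> 2^32 - 1 ms) interval.
def start_tick(find_due, republish, interval_s=1.0):
    stop = threading.Event()

    def loop():
        while not stop.is_set():
            now_ms = int(time.time() * 1000)
            for message_id in find_due(now_ms):
                republish(message_id)
            stop.wait(interval_s)

    threading.Thread(target=loop, daemon=True).start()
    return stop  # call stop.set() to cancel the tick
```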

What Metrics to Provide

Right now this plugin provides a single metric: the number of delayed messages. There is
interest in turning it into a histogram of time periods.
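One possible shape for such a histogram, sketched in Python with illustrative bucket bounds (none of this reflects an agreed-upon metric format):

```python
import bisect

# Bucket the remaining delays of all pending messages instead of
# reporting only a single total. Bounds are in seconds and purely
# illustrative: ≤1 minute, ≤1 hour, ≤1 day, >1 day.
BUCKET_BOUNDS_S = [60, 3600, 86400]

def delay_histogram(remaining_delays_s):
    counts = [0] * (len(BUCKET_BOUNDS_S) + 1)
    for d in remaining_delays_s:
        counts[bisect.bisect_left(BUCKET_BOUNDS_S, d)] += 1
    return counts
```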

More Powerful CLI Tools

Besides inspecting various metrics provided by this plugin, the operator will occasionally need
to manipulate the delayed messages: delete individual messages or a subset, inspect them, and so on.

Moving storage to a stream will make inspection possible, and deletion can be done on just the
metadata, with stream retention policies taking care of cleaning up the newly orphaned messages
on disk.

Re-publishing from a "Leader"

As with many distributed systems that are not eventually consistent, we have to either perform
writes via a single elected leader, or partition the data set such that N writers can
co-exist within a single cluster.

Some existing (commercial) plugins use a per-virtual-host partitioning scheme. For this
plugin it makes more sense to partition per exchange. We cannot do this per queue because
the queues a message will route to are unknown in advance.

Mapping routing keys to a set of writers/publishers won't work either because we need to
guarantee message ordering within a single queue.

Khepri, much like etcd, can be used for leader elections, similar to how etcd is used by
Kubernetes-oriented systems.
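A toy illustration of per-exchange leader election on top of an atomic compare-and-set primitive (a locked dict stands in for the replicated store here; nothing below reflects Khepri's actual API):

```python
import threading

# Sketch: each exchange has at most one leader node, claimed via an
# atomic "set if unset" operation, the way etcd leases are often used.
class LeaderElector:
    def __init__(self):
        self._lock = threading.Lock()
        self._leaders = {}  # exchange name -> node name

    def try_acquire(self, exchange, node):
        # Atomically claim leadership for `exchange` if nobody holds it
        # (re-acquiring one's own leadership is a no-op success).
        with self._lock:
            current = self._leaders.get(exchange)
            if current is None or current == node:
                self._leaders[exchange] = node
                return True
            return False

    def release(self, exchange, node):
        with self._lock:
            if self._leaders.get(exchange) == node:
                del self._leaders[exchange]
```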

Known Problems and Limitations

The biggest issue before RabbitMQ 3.13.0 will be Khepri cluster formation. Getting it right
from a plugin can be painful. When RabbitMQ itself introduces Khepri in the core, it's not
clear whether a reasonable upgrade path can be provided.

Khepri is a tree-structured store, so certain types of queries will not be an option. This
means that the data model of this plugin has to be carefully designed to support
the subset of CLI commands we'd introduce.

Khepri, much like etcd, can be used for leader elections. But it's not RabbitMQ's use
case for Khepri, so there may be "unknown unknowns".

In the discussion about how many leaders/writers the cluster of plugin instances should have,
we completely avoid the issue of message ordering at their original publishing time.

For example, if messages M1, M2, …, Mn are published in order but their delay is such that
they all must be published at the same time, would the users expect the M1, M2, …, Mn order to be preserved? If so, what kind of leader/writer design trade-offs would that entail?
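One conceivable way to answer "yes", sketched in Python: assign a monotonically increasing sequence number at publish time and break timestamp ties with it (for this to hold across a cluster, the sequence source would have to live with a single leader or be partitioned per exchange; everything here is hypothetical):

```python
import itertools

# Global, monotonically increasing publish sequence (per-leader in a
# real system; a module-level counter suffices for this sketch).
_seq = itertools.count()

def delayed_entry(due_at_ms, message_id):
    # Tag each message with its publish-time sequence number.
    return (due_at_ms, next(_seq), message_id)

def delivery_order(entries):
    # Messages due at the same timestamp fall back to publish order,
    # because tuple comparison uses the sequence number as a tiebreaker.
    return [mid for _, _, mid in sorted(entries)]
```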

@michaelklishin (Member, Author)
Updated after a closer review with @SimonUnge and a few members of the core team.

@gomoripeti (Contributor)

Great initiative, great write-up. I don't have any comments on the details of the storage implementation itself (Khepri for metadata and a K/V store sounds good).
Just putting down some random ideas about the plugin:

Expectations

  • My 2c is that the delivery order of messages scheduled for the same
    timestamp is unspecified (not the publishing order)
    (it would be different if there were an explicit "schedule at"
    parameter rather than a "delay by" one)

  • It would be good to store as little data in memory as conveniently
    possible, so that more messages can be delayed (i.e. some data is
    stored only on disk, as opposed to in Mnesia or Khepri)
    (for example, the exchange, although metadata, could be stored in the K/V store rather than in Khepri)

Metrics

  • total size of messages (bodies) per exchange
  • max message body size per exchange
  • a timestamp histogram might be hard to maintain, but a first/last timestamp would still be informative
  • memory/disk usage of the plugin

Actions

  • purge messages per exchange

Feature ideas

  • Allow limiting delivery rate of expired messages

    Could be useful when the broker is stopped for a period and, at startup,
    tries to deliver all the messages that expired while it was stopped,
    creating a delivery (and hence CPU/memory) spike.

  • Configurable upper limit of memory used by the plugin

    Reject publishes above the limit.
    Alternatively, this could be similar to max-length or max-length-bytes (per-exchange or global)
    + Metric: number of rejected publishes per exchange
