Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow MQTT QoS 0 subscribers to reconnect #10244

Merged
merged 1 commit into from
Dec 28, 2023
Merged

Allow MQTT QoS 0 subscribers to reconnect #10244

merged 1 commit into from
Dec 28, 2023

Conversation

ansd
Copy link
Member

@ansd ansd commented Dec 27, 2023

The solution in #10203 has the following issues:

  1. Bindings can be left ofter in Mnesia table rabbit_durable_queue. One solution to 1. would be to first delete the old queue via rabbit_amqqueue:internal_delete(Q, User, missing_owner) and subsequently declare the new queue via
    rabbit_amqqueue:internal_declare(Q, false)
    However, even then, it suffers from:
  2. Race conditions between rabbit_amqqueue:on_node_down/1 and rabbit_mqtt_qos0_queue:declare/2:
    rabbit_amqqueue:on_node_down/1 could first read the queue records that need to be deleted, thereafter rabbit_mqtt_qos0_queue:declare/2 could re-create the queue owned by the new connection PID, and rabbit_amqqueue:on_node_down/1 could subsequently delete the re-created queue.

Unfortunately, rabbit_amqqueue:on_node_down/1 does not delete transient queues in one isolated transaction. Instead it first reads queues and subsequenlty deletes queues in batches making it prone to race conditions.

Ideally, this commit deletes all rabbit_mqtt_qos0_queue queues of the node that has crashed including their bindings.
However, doing so in one transaction is risky as there may be millions of such queues and the current code path applies the same logic on all live nodes resulting in conflicting transactions and therefore a long database operation.

Hence, this commit uses the simplest approach which should still be safe:
Do not remove rabbit_mqtt_qos0_queue queues if a node crashes. Other live nodes will continue to route to these dead queues. That should be okay, given that the rabbit_mqtt_qos0_queue clients auto confirm.
Continuing routing however has the effect of counting as routing result for AMQP 0.9.1 mandatory property.
If an MQTT client re-connects to a live node with the same client ID, the new node will delete and then re-create the queue. Once the crashed node comes back online, it will clean up its leftover queues and bindings.

The solution in #10203 has the following issues:
1. Bindings can be left ofter in Mnesia table rabbit_durable_queue.
One solution to 1. would be to first delete the old queue via
`rabbit_amqqueue:internal_delete(Q, User, missing_owner)`
and subsequently declare the new queue via
`rabbit_amqqueue:internal_declare(Q, false)`
However, even then, it suffers from:
2. Race conditions between `rabbit_amqqueue:on_node_down/1`
and `rabbit_mqtt_qos0_queue:declare/2`:
`rabbit_amqqueue:on_node_down/1` could first read the queue records that
need to be deleted, thereafter `rabbit_mqtt_qos0_queue:declare/2` could
re-create the queue owned by the new connection PID, and `rabbit_amqqueue:on_node_down/1`
could subsequently delete the re-created queue.

Unfortunately, `rabbit_amqqueue:on_node_down/1` does not delete
transient queues in one isolated transaction. Instead it first reads
queues and subsequenlty deletes queues in batches making it prone to
race conditions.

Ideally, this commit deletes all rabbit_mqtt_qos0_queue queues of the
node that has crashed including their bindings.
However, doing so in one transaction is risky as there may be millions
of such queues and the current code path applies the same logic on all
live nodes resulting in conflicting transactions and therefore a long
database operation.

Hence, this commit uses the simplest approach which should still be
safe:
Do not remove rabbit_mqtt_qos0_queue queues if a node crashes.
Other live nodes will continue to route to these dead queues.
That should be okay, given that the rabbit_mqtt_qos0_queue clients auto
confirm.
Continuing routing however has the effect of counting as routing result
for AMQP 0.9.1 `mandatory` property.
If an MQTT client re-connects to a live node with the same client ID,
the new node will delete and then re-create the queue.
Once the crashed node comes back online, it will clean up its leftover
queues and bindings.
@michaelklishin
Copy link
Member

This was a rebase to include a Selenium test suite runner fix.

@ansd ansd marked this pull request as ready for review December 28, 2023 10:44
@michaelklishin
Copy link
Member

Note that the whole problem will go away naturally once Khepri ships because transient entities in general will cease to exist internally, and transient entity (per protocol semantics) removal will inevitably be revisited and simplified.

@michaelklishin michaelklishin merged commit 824b2d8 into main Dec 28, 2023
19 checks passed
@michaelklishin michaelklishin deleted the qos0-queue branch December 28, 2023 12:37
michaelklishin added a commit that referenced this pull request Dec 28, 2023
Allow MQTT QoS 0 subscribers to reconnect (backport #10244)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants