Allow MQTT QoS 0 subscribers to reconnect #10244

ansd · 2023-12-27T16:11:17Z

The solution in #10203 has the following issues:

1. Bindings can be left over in the Mnesia table `rabbit_durable_queue`. One solution to 1. would be to first delete the old queue via `rabbit_amqqueue:internal_delete(Q, User, missing_owner)` and subsequently declare the new queue via `rabbit_amqqueue:internal_declare(Q, false)`. However, even then, it suffers from:
2. Race conditions between `rabbit_amqqueue:on_node_down/1` and `rabbit_mqtt_qos0_queue:declare/2`: `rabbit_amqqueue:on_node_down/1` could first read the queue records that need to be deleted, thereafter `rabbit_mqtt_qos0_queue:declare/2` could re-create the queue owned by the new connection PID, and `rabbit_amqqueue:on_node_down/1` could subsequently delete the re-created queue.

Unfortunately, `rabbit_amqqueue:on_node_down/1` does not delete transient queues in one isolated transaction. Instead, it first reads the queues and subsequently deletes them in batches, which makes it prone to race conditions.
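The interleaving described above can be reproduced with a small toy model. This is only a sketch: a plain dict stands in for the queue table, and every name in it (`read_queues_of_down_node`, `delete_batch`, `redeclare`, the queue name, the node names) is hypothetical and not part of RabbitMQ. It assumes the cleanup deletes by queue name without re-checking the owner, which is the behaviour the race relies on.

```python
# Toy model of the read-then-delete race. Not RabbitMQ code: the dict
# stands in for the queue table, and all names are made up for illustration.

def read_queues_of_down_node(table, down_node):
    """Cleanup step 1: snapshot the names of queues owned by the down node."""
    return [name for name, owner_node in table.items() if owner_node == down_node]

def delete_batch(table, names):
    """Cleanup step 2: delete by name, without re-checking the current owner."""
    for name in names:
        table.pop(name, None)

def redeclare(table, name, new_node):
    """A reconnecting client re-creates its queue, now owned by a live node."""
    table[name] = new_node

def run_interleaving():
    table = {"qos0-client-1": "node_a"}                # owned by a connection on node_a
    stale = read_queues_of_down_node(table, "node_a")  # cleanup reads first
    redeclare(table, "qos0-client-1", "node_b")        # client reconnects in between
    delete_batch(table, stale)                         # cleanup removes the new queue
    return table

print(run_interleaving())  # prints {}
```

Because the read and the delete are separate steps, the re-created queue is gone at the end even though its owner is alive. Running both steps inside one isolated transaction would rule out this interleaving.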

Ideally, this commit would delete all `rabbit_mqtt_qos0_queue` queues of the crashed node, including their bindings.
However, doing so in one transaction is risky: there may be millions of such queues, and the current code path applies the same logic on all live nodes, resulting in conflicting transactions and therefore a long database operation.

Hence, this commit uses the simplest approach which should still be safe:
Do not remove `rabbit_mqtt_qos0_queue` queues if a node crashes. Other live nodes will continue to route to these dead queues. That should be okay, given that `rabbit_mqtt_qos0_queue` clients auto-confirm.
Continuing to route does, however, mean that these dead queues count as a routing result for the AMQP 0.9.1 `mandatory` property.
If an MQTT client reconnects to a live node with the same client ID, the new node will delete and then re-create the queue. Once the crashed node comes back online, it will clean up its leftover queues and bindings.
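The lifecycle this approach implies can be sketched with a toy model. It is an illustration under stated assumptions, not the plugin's implementation: the `Cluster` class, its methods, the `qos0-<client id>` queue naming, and exact-match topic routing are all hypothetical simplifications. It also assumes that a restarted node removes only queues still recorded as owned by it.

```python
# Toy model of the chosen approach. Not RabbitMQ code; all names are hypothetical.

class Cluster:
    def __init__(self):
        self.queues = {}    # queue name -> node hosting the owning connection
        self.bindings = {}  # queue name -> set of topics (exact match only)

    def delete(self, name):
        """Remove a queue together with its bindings."""
        self.queues.pop(name, None)
        self.bindings.pop(name, None)

    def subscribe(self, node, client_id, topic):
        """(Re)connect a QoS 0 subscriber: delete any old queue, then re-create."""
        name = f"qos0-{client_id}"
        if name in self.queues:
            self.delete(name)
        self.queues[name] = node
        self.bindings[name] = {topic}
        return name

    def crash(self, node):
        """Node crash: queues and bindings of that node are deliberately kept."""

    def route(self, topic):
        """Queues a publish is routed to, including dead ones."""
        return sorted(q for q, topics in self.bindings.items() if topic in topics)

    def restart(self, node):
        """The node comes back and cleans up queues still recorded as its own."""
        for name in [q for q, owner in self.queues.items() if owner == node]:
            self.delete(name)

def scenario():
    c = Cluster()
    c.subscribe("node_a", "c1", "t/1")
    c.subscribe("node_a", "c2", "t/2")
    c.crash("node_a")
    routed_while_down = c.route("t/2")  # dead queue still counts as a route
    c.subscribe("node_b", "c1", "t/1")  # c1 reconnects to a live node
    c.restart("node_a")                 # leftover queue of c2 is removed
    return routed_while_down, dict(c.queues), c.route("t/2")

print(scenario())
```

In this model the publish to `t/2` is still routed while `node_a` is down, which mirrors the `mandatory` caveat, and after the restart only the queue re-created on `node_b` remains.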


michaelklishin · 2023-12-28T01:47:18Z

This was a rebase to include a Selenium test suite runner fix.

michaelklishin · 2023-12-28T12:36:46Z

Note that the whole problem will go away naturally once Khepri ships because transient entities in general will cease to exist internally, and transient entity (per protocol semantics) removal will inevitably be revisited and simplified.


ansd added the backport-v3.12.x label Dec 27, 2023

michaelklishin force-pushed the qos0-queue branch from eb1da36 to 78b4fcc Compare December 28, 2023 01:47

ansd marked this pull request as ready for review December 28, 2023 10:44

michaelklishin merged commit 824b2d8 into main Dec 28, 2023
19 checks passed

michaelklishin deleted the qos0-queue branch December 28, 2023 12:37

mergify bot mentioned this pull request Dec 28, 2023

Allow MQTT QoS 0 subscribers to reconnect (backport #10244) #10252

Merged

michaelklishin added a commit that referenced this pull request Dec 28, 2023

Merge pull request #10252 from rabbitmq/mergify/bp/v3.12.x/pr-10244

91b2964

Allow MQTT QoS 0 subscribers to reconnect (backport #10244)
