Fix MQTT test flake in Khepri mixed version mode (backport #13235) #13238

mergify · 2025-02-11T21:17:04Z

The following test flaked in CI under Khepri in mixed version mode:

make -C deps/rabbitmq_mqtt ct-v5 t=cluster_size_3:will_delay_node_restart RABBITMQ_METADATA_STORE=khepri SECONDARY_DIST=rabbitmq_server-4.0.5 FULL=1

The first node took exactly 30 seconds for draining:

2025-02-10 15:00:09.550824+00:00 [debug] <0.1449.0> MQTT accepting TCP connection <0.1449.0> (127.0.0.1:33376 -> 127.0.0.1:27005)
2025-02-10 15:00:09.550992+00:00 [debug] <0.1449.0> Received a CONNECT, client ID: sub0, username: undefined, clean start: true, protocol version: 5, keepalive: 60, property names: ['Session-Expiry-Interval']
2025-02-10 15:00:09.551134+00:00 [debug] <0.1449.0> MQTT connection 127.0.0.1:33376 -> 127.0.0.1:27005 picked vhost using plugin_configuration_or_default_vhost
2025-02-10 15:00:09.551219+00:00 [debug] <0.1449.0> User 'guest' authenticated successfully by backend rabbit_auth_backend_internal
2025-02-10 15:00:09.551530+00:00 [info] <0.1449.0> Accepted MQTT connection 127.0.0.1:33376 -> 127.0.0.1:27005 for client ID sub0
2025-02-10 15:00:09.551651+00:00 [debug] <0.1449.0> Received a SUBSCRIBE with subscription(s) [{mqtt_subscription,<<"my/topic">>,
2025-02-10 15:00:09.551651+00:00 [debug] <0.1449.0>                                             {mqtt_subscription_opts,0,false,
2025-02-10 15:00:09.551651+00:00 [debug] <0.1449.0>                                              false,0,undefined}}]
2025-02-10 15:00:09.556233+00:00 [debug] <0.896.0> RabbitMQ metadata store: follower leader cast - redirecting to {rabbitmq_metadata,'rmq-ct-mqtt-cluster_size_3-2-27054@localhost'}
2025-02-10 15:00:09.561518+00:00 [debug] <0.1456.0> MQTT accepting TCP connection <0.1456.0> (127.0.0.1:33390 -> 127.0.0.1:27005)
2025-02-10 15:00:09.561634+00:00 [debug] <0.1456.0> Received a CONNECT, client ID: will, username: undefined, clean start: true, protocol version: 5, keepalive: 60, property names: ['Session-Expiry-Interval']
2025-02-10 15:00:09.561715+00:00 [debug] <0.1456.0> MQTT connection 127.0.0.1:33390 -> 127.0.0.1:27005 picked vhost using plugin_configuration_or_default_vhost
2025-02-10 15:00:09.561828+00:00 [debug] <0.1456.0> User 'guest' authenticated successfully by backend rabbit_auth_backend_internal
2025-02-10 15:00:09.562596+00:00 [info] <0.1456.0> Accepted MQTT connection 127.0.0.1:33390 -> 127.0.0.1:27005 for client ID will
2025-02-10 15:00:09.565743+00:00 [warning] <0.1460.0> This node is being put into maintenance (drain) mode
2025-02-10 15:00:09.565833+00:00 [debug] <0.1460.0> Marking the node as undergoing maintenance
2025-02-10 15:00:09.570772+00:00 [info] <0.1460.0> Marked this node as undergoing maintenance
2025-02-10 15:00:09.570904+00:00 [info] <0.1460.0> Asked to suspend 9 client connection listeners. No new client connections will be accepted until these listeners are resumed!
2025-02-10 15:00:09.572268+00:00 [warning] <0.1460.0> Suspended all listeners and will no longer accept client connections
2025-02-10 15:00:09.572317+00:00 [warning] <0.1460.0> Closed 0 local client connections
2025-02-10 15:00:09.572418+00:00 [warning] <0.1449.0> MQTT disconnecting client <<"127.0.0.1:33376 -> 127.0.0.1:27005">> with client ID 'sub0', reason: maintenance
2025-02-10 15:00:09.572414+00:00 [warning] <0.1000.0> Closed 2 local (Web) MQTT client connections
2025-02-10 15:00:09.572499+00:00 [warning] <0.1456.0> MQTT disconnecting client <<"127.0.0.1:33390 -> 127.0.0.1:27005">> with client ID 'will', reason: maintenance
2025-02-10 15:00:09.572866+00:00 [alert] <0.1000.0> Closed 0 local STOMP client connections
2025-02-10 15:00:09.577432+00:00 [debug] <0.1456.0> scheduled delayed Will Message to topic my/topic for MQTT client ID will to be sent in 10000 ms
2025-02-10 15:00:12.991328+00:00 [debug] <0.1469.0> Will reconcile virtual host processes on all cluster members...
2025-02-10 15:00:12.991443+00:00 [debug] <0.1469.0> Will make sure that processes of 1 virtual hosts are running on all reachable cluster nodes
2025-02-10 15:00:12.992497+00:00 [debug] <0.1469.0> Done with virtual host processes reconciliation (run 3)
2025-02-10 15:00:16.511733+00:00 [debug] <0.1476.0> Will reconcile virtual host processes on all cluster members...
2025-02-10 15:00:16.511864+00:00 [debug] <0.1476.0> Will make sure that processes of 1 virtual hosts are running on all reachable cluster nodes
2025-02-10 15:00:16.514293+00:00 [debug] <0.1476.0> Done with virtual host processes reconciliation (run 4)
2025-02-10 15:00:24.897477+00:00 [debug] <0.1479.0> Will reconcile virtual host processes on all cluster members...
2025-02-10 15:00:24.897607+00:00 [debug] <0.1479.0> Will make sure that processes of 1 virtual hosts are running on all reachable cluster nodes
2025-02-10 15:00:24.898483+00:00 [debug] <0.1479.0> Done with virtual host processes reconciliation (run 5)
2025-02-10 15:00:24.898527+00:00 [debug] <0.1479.0> Will reschedule virtual host process reconciliation after 30 seconds
2025-02-10 15:00:32.994347+00:00 [debug] <0.1484.0> Will reconcile virtual host processes on all cluster members...
2025-02-10 15:00:32.994474+00:00 [debug] <0.1484.0> Will make sure that processes of 1 virtual hosts are running on all reachable cluster nodes
2025-02-10 15:00:32.996539+00:00 [debug] <0.1484.0> Done with virtual host processes reconciliation (run 6)
2025-02-10 15:00:32.996585+00:00 [debug] <0.1484.0> Will reschedule virtual host process reconciliation after 30 seconds
2025-02-10 15:00:39.576325+00:00 [info] <0.1460.0> Will transfer leadership of 0 quorum queues with current leader on this node
2025-02-10 15:00:39.576456+00:00 [info] <0.1460.0> Leadership transfer for quorum queues hosted on this node has been initiated
2025-02-10 15:00:39.576948+00:00 [info] <0.1460.0> Will stop local follower replicas of 0 quorum queues on this node
2025-02-10 15:00:39.576990+00:00 [info] <0.1460.0> Stopped all local replicas of quorum queues hosted on this node
2025-02-10 15:00:39.577120+00:00 [info] <0.1460.0> Will transfer leadership of metadata store with current leader on this node
2025-02-10 15:00:39.577282+00:00 [info] <0.1460.0> Khepri clustering: transferring leadership to node 'rmq-ct-mqtt-cluster_size_3-2-27054@localhost'
2025-02-10 15:00:39.577424+00:00 [info] <0.1460.0> Khepri clustering: skipping leadership transfer, leader is already in node 'rmq-ct-mqtt-cluster_size_3-2-27054@localhost'
2025-02-10 15:00:39.577547+00:00 [info] <0.1460.0> Leadership transfer for metadata store on this node has been done. The new leader is 'rmq-ct-mqtt-cluster_size_3-2-27054@localhost'
2025-02-10 15:00:39.577674+00:00 [info] <0.1460.0> Node is ready to be shut down for maintenance or upgrade
2025-02-10 15:00:39.595638+00:00 [notice] <0.64.0> SIGTERM received - shutting down
2025-02-10 15:00:39.595638+00:00 [notice] <0.64.0>
2025-02-10 15:00:39.595758+00:00 [debug] <0.44.0> Running rabbit_prelaunch:shutdown_func() as part of `kernel` shutdown

Running the same test locally revealed that rabbit_maintenance:status_consistent_read/1 takes exactly 30 seconds to complete.

The test case assumes a Will Delay higher than the time it takes to drain and shut down the node. Hence, this commit increases the Will Delay time from 10 seconds to 40 seconds.

This is an automatic backport of pull request #13235 done by Mergify.

The following test flaked in CI under Khepri in mixed version mode: ``` make -C deps/rabbitmq_mqtt ct-v5 t=cluster_size_3:will_delay_node_restart RABBITMQ_METADATA_STORE=khepri SECONDARY_DIST=rabbitmq_server-4.0.5 FULL=1 ``` The first node took exactly 30 seconds for draining: ``` 2025-02-10 15:00:09.550824+00:00 [debug] <0.1449.0> MQTT accepting TCP connection <0.1449.0> (127.0.0.1:33376 -> 127.0.0.1:27005) 2025-02-10 15:00:09.550992+00:00 [debug] <0.1449.0> Received a CONNECT, client ID: sub0, username: undefined, clean start: true, protocol version: 5, keepalive: 60, property names: ['Session-Expiry-Interval'] 2025-02-10 15:00:09.551134+00:00 [debug] <0.1449.0> MQTT connection 127.0.0.1:33376 -> 127.0.0.1:27005 picked vhost using plugin_configuration_or_default_vhost 2025-02-10 15:00:09.551219+00:00 [debug] <0.1449.0> User 'guest' authenticated successfully by backend rabbit_auth_backend_internal 2025-02-10 15:00:09.551530+00:00 [info] <0.1449.0> Accepted MQTT connection 127.0.0.1:33376 -> 127.0.0.1:27005 for client ID sub0 2025-02-10 15:00:09.551651+00:00 [debug] <0.1449.0> Received a SUBSCRIBE with subscription(s) [{mqtt_subscription,<<"my/topic">>, 2025-02-10 15:00:09.551651+00:00 [debug] <0.1449.0> {mqtt_subscription_opts,0,false, 2025-02-10 15:00:09.551651+00:00 [debug] <0.1449.0> false,0,undefined}}] 2025-02-10 15:00:09.556233+00:00 [debug] <0.896.0> RabbitMQ metadata store: follower leader cast - redirecting to {rabbitmq_metadata,'rmq-ct-mqtt-cluster_size_3-2-27054@localhost'} 2025-02-10 15:00:09.561518+00:00 [debug] <0.1456.0> MQTT accepting TCP connection <0.1456.0> (127.0.0.1:33390 -> 127.0.0.1:27005) 2025-02-10 15:00:09.561634+00:00 [debug] <0.1456.0> Received a CONNECT, client ID: will, username: undefined, clean start: true, protocol version: 5, keepalive: 60, property names: ['Session-Expiry-Interval'] 2025-02-10 15:00:09.561715+00:00 [debug] <0.1456.0> MQTT connection 127.0.0.1:33390 -> 127.0.0.1:27005 picked vhost using plugin_configuration_or_default_vhost 2025-02-10 15:00:09.561828+00:00 [debug] <0.1456.0> User 'guest' authenticated successfully by backend rabbit_auth_backend_internal 2025-02-10 15:00:09.562596+00:00 [info] <0.1456.0> Accepted MQTT connection 127.0.0.1:33390 -> 127.0.0.1:27005 for client ID will 2025-02-10 15:00:09.565743+00:00 [warning] <0.1460.0> This node is being put into maintenance (drain) mode 2025-02-10 15:00:09.565833+00:00 [debug] <0.1460.0> Marking the node as undergoing maintenance 2025-02-10 15:00:09.570772+00:00 [info] <0.1460.0> Marked this node as undergoing maintenance 2025-02-10 15:00:09.570904+00:00 [info] <0.1460.0> Asked to suspend 9 client connection listeners. No new client connections will be accepted until these listeners are resumed! 2025-02-10 15:00:09.572268+00:00 [warning] <0.1460.0> Suspended all listeners and will no longer accept client connections 2025-02-10 15:00:09.572317+00:00 [warning] <0.1460.0> Closed 0 local client connections 2025-02-10 15:00:09.572418+00:00 [warning] <0.1449.0> MQTT disconnecting client <<"127.0.0.1:33376 -> 127.0.0.1:27005">> with client ID 'sub0', reason: maintenance 2025-02-10 15:00:09.572414+00:00 [warning] <0.1000.0> Closed 2 local (Web) MQTT client connections 2025-02-10 15:00:09.572499+00:00 [warning] <0.1456.0> MQTT disconnecting client <<"127.0.0.1:33390 -> 127.0.0.1:27005">> with client ID 'will', reason: maintenance 2025-02-10 15:00:09.572866+00:00 [alert] <0.1000.0> Closed 0 local STOMP client connections 2025-02-10 15:00:09.577432+00:00 [debug] <0.1456.0> scheduled delayed Will Message to topic my/topic for MQTT client ID will to be sent in 10000 ms 2025-02-10 15:00:12.991328+00:00 [debug] <0.1469.0> Will reconcile virtual host processes on all cluster members... 2025-02-10 15:00:12.991443+00:00 [debug] <0.1469.0> Will make sure that processes of 1 virtual hosts are running on all reachable cluster nodes 2025-02-10 15:00:12.992497+00:00 [debug] <0.1469.0> Done with virtual host processes reconciliation (run 3) 2025-02-10 15:00:16.511733+00:00 [debug] <0.1476.0> Will reconcile virtual host processes on all cluster members... 2025-02-10 15:00:16.511864+00:00 [debug] <0.1476.0> Will make sure that processes of 1 virtual hosts are running on all reachable cluster nodes 2025-02-10 15:00:16.514293+00:00 [debug] <0.1476.0> Done with virtual host processes reconciliation (run 4) 2025-02-10 15:00:24.897477+00:00 [debug] <0.1479.0> Will reconcile virtual host processes on all cluster members... 2025-02-10 15:00:24.897607+00:00 [debug] <0.1479.0> Will make sure that processes of 1 virtual hosts are running on all reachable cluster nodes 2025-02-10 15:00:24.898483+00:00 [debug] <0.1479.0> Done with virtual host processes reconciliation (run 5) 2025-02-10 15:00:24.898527+00:00 [debug] <0.1479.0> Will reschedule virtual host process reconciliation after 30 seconds 2025-02-10 15:00:32.994347+00:00 [debug] <0.1484.0> Will reconcile virtual host processes on all cluster members... 2025-02-10 15:00:32.994474+00:00 [debug] <0.1484.0> Will make sure that processes of 1 virtual hosts are running on all reachable cluster nodes 2025-02-10 15:00:32.996539+00:00 [debug] <0.1484.0> Done with virtual host processes reconciliation (run 6) 2025-02-10 15:00:32.996585+00:00 [debug] <0.1484.0> Will reschedule virtual host process reconciliation after 30 seconds 2025-02-10 15:00:39.576325+00:00 [info] <0.1460.0> Will transfer leadership of 0 quorum queues with current leader on this node 2025-02-10 15:00:39.576456+00:00 [info] <0.1460.0> Leadership transfer for quorum queues hosted on this node has been initiated 2025-02-10 15:00:39.576948+00:00 [info] <0.1460.0> Will stop local follower replicas of 0 quorum queues on this node 2025-02-10 15:00:39.576990+00:00 [info] <0.1460.0> Stopped all local replicas of quorum queues hosted on this node 2025-02-10 15:00:39.577120+00:00 [info] <0.1460.0> Will transfer leadership of metadata store with current leader on this node 2025-02-10 15:00:39.577282+00:00 [info] <0.1460.0> Khepri clustering: transferring leadership to node 'rmq-ct-mqtt-cluster_size_3-2-27054@localhost' 2025-02-10 15:00:39.577424+00:00 [info] <0.1460.0> Khepri clustering: skipping leadership transfer, leader is already in node 'rmq-ct-mqtt-cluster_size_3-2-27054@localhost' 2025-02-10 15:00:39.577547+00:00 [info] <0.1460.0> Leadership transfer for metadata store on this node has been done. The new leader is 'rmq-ct-mqtt-cluster_size_3-2-27054@localhost' 2025-02-10 15:00:39.577674+00:00 [info] <0.1460.0> Node is ready to be shut down for maintenance or upgrade 2025-02-10 15:00:39.595638+00:00 [notice] <0.64.0> SIGTERM received - shutting down 2025-02-10 15:00:39.595638+00:00 [notice] <0.64.0> 2025-02-10 15:00:39.595758+00:00 [debug] <0.44.0> Running rabbit_prelaunch:shutdown_func() as part of `kernel` shutdown ``` Running the same test locally revealed that [rabbit_maintenance:status_consistent_read/1](https://github.com/rabbitmq/rabbitmq-server/blob/55ae91809433d9e6edfcc98563bcb2f0736ee79e/deps/rabbit/src/rabbit_maintenance.erl#L131) takes exactly 30 seconds to complete. The test case assumes a Will Delay higher than the time it takes to drain and shut down the node. Hence, this commit increases the Will Delay time from 10 seconds to 40 seconds. (cherry picked from commit 38cba9d)

Fix MQTT test flake in Khepri mixed version mode (backport #13235) (backport #13238)

mergify bot mentioned this pull request Feb 11, 2025

Fix MQTT test flake in Khepri mixed version mode #13235

Merged

michaelklishin added the backport-v4.0.x label Feb 11, 2025

michaelklishin merged commit 4c4a728 into v4.1.x Feb 11, 2025
38 of 39 checks passed

michaelklishin deleted the mergify/bp/v4.1.x/pr-13235 branch February 11, 2025 21:18

mergify bot mentioned this pull request Feb 11, 2025

Fix MQTT test flake in Khepri mixed version mode (backport #13235) (backport #13238) #13240

Merged

michaelklishin added a commit that referenced this pull request Feb 11, 2025

Merge pull request #13240 from rabbitmq/mergify/bp/v4.0.x/pr-13238

c11fd97

Fix MQTT test flake in Khepri mixed version mode (backport #13235) (backport #13238)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix MQTT test flake in Khepri mixed version mode (backport #13235) #13238

Fix MQTT test flake in Khepri mixed version mode (backport #13235) #13238

Uh oh!

mergify bot commented Feb 11, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix MQTT test flake in Khepri mixed version mode (backport #13235) #13238

Fix MQTT test flake in Khepri mixed version mode (backport #13235) #13238

Uh oh!

Conversation

mergify bot commented Feb 11, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants