mirrored_supervisor: Rework error handling after a failed update (backport #14018) #14020

mergify · 2025-06-03T12:31:34Z

Why

The retry logic I added in 4621fe7 was completely wrong. If Khepri reached its own timeout of 30 seconds (as of this writing), the mirrored supervisor would retry up to 50 times because it did not check the time already spent. In the worst case, that is 50 × 30 seconds = 25 minutes of retries. Nice.

That retry would be terminated forcefully by the parent supervisor after 5 minutes if it was part of a shutdown.
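To make the arithmetic concrete, here is a small Python sketch of the worst case. It is not the Erlang code from `mirrored_supervisor`; the constant and function names are hypothetical, and only the three figures (30 seconds, 50 retries, 5 minutes) come from the description above.

```python
# Figures taken from the PR description; names are illustrative only.
KHEPRI_TIMEOUT_S = 30          # Khepri's own timeout (as of this writing)
MAX_RETRIES = 50               # retries attempted by the old logic
PARENT_SHUTDOWN_S = 5 * 60     # parent supervisor kills the child after this


def worst_case_retry_seconds(retries=MAX_RETRIES, timeout_s=KHEPRI_TIMEOUT_S):
    """Total time spent if every attempt runs into the full timeout."""
    return retries * timeout_s


def attempts_before_forced_kill(shutdown_s=PARENT_SHUTDOWN_S,
                                timeout_s=KHEPRI_TIMEOUT_S):
    """How many full-timeout attempts fit before the parent gives up."""
    return shutdown_s // timeout_s


total_s = worst_case_retry_seconds()
print(f"worst case: {total_s} s ({total_s / 60:.0f} min)")
print(f"attempts before forced kill: {attempts_before_forced_kill()}")
```

Under these assumptions, a shutdown never gets past a fraction of the retry budget before the parent supervisor terminates it, which matches the 5+ minute node stops described here.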

How

This time, the code simply passes the error (timeout or something else) down to the following `case`. That `case` shuts the mirrored supervisor down.
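The control flow can be sketched roughly as follows. This is a Python illustration of the idea only, not the actual Erlang change: `update_metadata_store`, `handle_update`, and the return values are hypothetical stand-ins.

```python
def update_metadata_store(update):
    """Hypothetical stand-in for the store update.

    Returns "ok" on success or ("error", reason) on failure.
    """
    return update()


def handle_update(update):
    # Single attempt: no retry loop, so a timeout costs one timeout at most.
    result = update_metadata_store(update)

    # The result, success or error, is handed to the same dispatch,
    # mirroring "pass the error down to the following case".
    if result == "ok":
        return "continue"
    _tag, reason = result
    return ("shutdown", reason)


print(handle_update(lambda: "ok"))
print(handle_update(lambda: ("error", "timeout")))
```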

The change fixes very long RabbitMQ node termination (at least 5 minutes, sometimes more) in test suites. An example to reproduce:

gmake -C deps/rabbitmq_mqtt \
  RABBITMQ_METADATA_STORE=khepri \
  ct-v5 t=cluster_size_3:session_takeover_v3_v5

In this one, the third node of the cluster will take 5+ minutes to stop.

This is an automatic backport of pull request #14018 done by Mergify.

(cherry picked from commit 376dd2c)

mergify bot assigned dumbbell Jun 3, 2025

michaelklishin added this to the 4.1.1 milestone Jun 3, 2025

michaelklishin merged commit 280ce65 into v4.1.x Jun 3, 2025
539 of 540 checks passed

michaelklishin deleted the mergify/bp/v4.1.x/pr-14018 branch June 3, 2025 13:12
