Improve rabbit_fifo_dlx_worker resiliency #7677
Merged
[Jepsen dead lettering tests](https://github.com/rabbitmq/rabbitmq-ci/blob/5977f587e203698b8f281ed52b636d60489883b7/jepsen/scripts/qq-jepsen-test.sh#L108) of job `qq-jepsen-test-3-12` of Concourse pipeline `jepsen-tests` sometimes fail with the following error:

```
{{:try_clause, [{:undefined, #PID<12128.3596.0>, :worker, [:rabbit_fifo_dlx_worker]}, {:undefined, #PID<12128.10212.0>, :worker, [:rabbit_fifo_dlx_worker]}]}, [{:erl_eval, :try_clauses, 10, [file: 'erl_eval.erl', line: 995]}, {:erl_eval, :exprs, 2, []}]}
```

At the end of the Jepsen test, there are 2 DLX workers on the same node. Analysing the logs reveals the following:

The source quorum queue node becomes leader and starts its DLX worker:

```
2023-03-18 12:14:04.365295+00:00 [debug] <0.1645.0> started rabbit_fifo_dlx_worker <0.3596.0> for queue 'jepsen.queue' in vhost '/'
```

Less than 1 second later, Mnesia reports a network partition (introduced by Jepsen). The DLX worker fails to register as a consumer of its source quorum queue because the Ra command times out:

```
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> Failed to process command {dlx,{checkout,<0.3596.0>,32}} on quorum queue leader {'%2F_jepsen.queue',
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> 'rabbit@concourse-qq-jepsen-312-3'}: {timeout,
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> {'%2F_jepsen.queue',
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> 'rabbit@concourse-qq-jepsen-312-3'}}
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> Trying 5 more time(s)...
```

3 seconds after the DLX worker was created, the local source quorum queue node is no longer leader:

```
2023-03-18 12:14:07.289213+00:00 [notice] <0.1645.0> queue 'jepsen.queue' in vhost '/': leader -> follower in term: 17 machine version: 3
```

But because the DLX worker had not yet managed to register as a consumer at this point, it is not terminated in https://github.com/rabbitmq/rabbitmq-server/blob/865d533863c29ed52e03070ac8d9e1bcaee8b205/deps/rabbit/src/rabbit_fifo_dlx.erl#L264-L275

Eventually, when the local node becomes leader again, that DLX worker succeeds in registering as a consumer (due to the retries in https://github.com/rabbitmq/rabbitmq-server/blob/865d533863c29ed52e03070ac8d9e1bcaee8b205/deps/rabbit/src/rabbit_fifo_dlx_client.erl#L41-L58) and stays alive. By then a 2nd DLX worker is also active, because it was started when the local quorum queue node transitioned to leader.

This commit prevents this issue: the last consumer to do a `#checkout{}` wins and the “old one” has to terminate.
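The "last checkout wins" rule can be illustrated with a small, self-contained sketch. This is not the code in `rabbit_fifo_dlx.erl`: the module name `dlx_checkout_sketch`, the map-based state, and the `{stop_worker, Pid}` effect are all hypothetical and exist only to show the idea.

```erlang
-module(dlx_checkout_sketch).
-export([checkout/2]).

%% Hypothetical state: the pid of the currently registered consumer, or
%% 'undefined' when no worker has checked out yet.
-type state() :: #{consumer := pid() | undefined}.
%% Hypothetical effect asking the caller to terminate a replaced worker.
-type effect() :: {stop_worker, pid()}.

%% Register Pid as the single consumer. If a different worker was
%% registered before, it is replaced and an effect to stop it is returned.
-spec checkout(pid(), state()) -> {state(), [effect()]}.
checkout(Pid, #{consumer := Pid} = State) ->
    %% The same worker checks out again: nothing to replace.
    {State, []};
checkout(Pid, #{consumer := undefined} = State) ->
    %% First checkout: just record the consumer.
    {State#{consumer := Pid}, []};
checkout(Pid, #{consumer := Old} = State) when is_pid(Old) ->
    %% A newer worker checks out: it wins, the old one has to terminate.
    {State#{consumer := Pid}, [{stop_worker, Old}]}.
```

With a rule of this shape, the late-registering worker from the scenario above would either replace the newer worker or be replaced by it, so only one worker remains registered either way.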
ansd force-pushed the ensure-single-dlx-worker branch from dd1a45b to c23fba0 (March 20, 2023 14:45)
ansd changed the title from "Terminate replaced rabbit_fifo_dlx_worker" to "Improve rabbit_fifo_dlx_worker resiliency" (Mar 20, 2023)
The failing |
Previously, rabbit_fifo_dlx_sup used the default restart intensity: "intensity defaults to 1 and period defaults to 5." That is rather low given that there can be dozens or hundreds of DLX workers: if only 2 fail within 5 seconds, the whole supervisor terminates. Even with the new values, there should not be an infinite loop of the supervisor terminating and restarting children, because a rabbit_fifo_dlx_worker is terminated and started very quickly, given that the slow consumer registration happens in rabbit_fifo_dlx_worker:handle_continue/2.
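For readers less familiar with OTP supervisor flags, here is a minimal sketch of setting `intensity` and `period` explicitly instead of relying on the defaults. The module names `dlx_sup_sketch` and `dlx_worker_sketch`, the `simple_one_for_one` strategy, the `transient` restart type, and the numbers 100 and 1 are placeholders chosen for illustration; they are assumptions, not the values or structure used by this PR.

```erlang
-module(dlx_sup_sketch).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Without explicit values, OTP uses intensity 1 and period 5:
    %% a second restart within 5 seconds terminates the supervisor.
    SupFlags = #{strategy => simple_one_for_one, %% assumed strategy
                 intensity => 100,               %% placeholder value
                 period => 1},                   %% placeholder, in seconds
    %% Hypothetical worker module; 'transient' means a child is only
    %% restarted after an abnormal exit.
    Worker = #{id => dlx_worker_sketch,
               start => {dlx_worker_sketch, start_link, []},
               restart => transient,
               type => worker},
    {ok, {SupFlags, [Worker]}}.
```

With many children under one supervisor, a higher intensity over a short period tolerates a burst of unrelated worker failures while still giving up if restarts keep failing.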
The rabbit_fifo_dlx_worker should be co-located with the quorum queue leader. If a new leader on a different node gets elected before the rabbit_fifo_dlx_worker initialises (i.e. registers itself as a consumer), it should stop itself normally, such that it is not restarted by rabbit_fifo_dlx_sup. Another rabbit_fifo_dlx_worker should be created on the new quorum queue leader node.
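The behaviour described above can be sketched with a plain `gen_server`. This is an illustration only, not the actual `rabbit_fifo_dlx_worker`: the module name `dlx_worker_sketch`, the map-based state, and passing the leader node as a start argument are assumptions made to keep the example self-contained.

```erlang
-module(dlx_worker_sketch).
-behaviour(gen_server).
-export([start_link/1, init/1, handle_continue/2,
         handle_call/3, handle_cast/2]).

%% LeaderNode stands in for however the worker learns the current leader.
start_link(LeaderNode) ->
    gen_server:start_link(?MODULE, LeaderNode, []).

init(LeaderNode) ->
    %% Return quickly; the slow registration is deferred to
    %% handle_continue/2 so that starting the child never blocks.
    {ok, #{leader => LeaderNode}, {continue, register}}.

handle_continue(register, #{leader := Leader} = State)
  when Leader =:= node() ->
    %% Leader is local: consumer registration would happen here.
    {noreply, State#{registered => true}};
handle_continue(register, State) ->
    %% Leader is on another node: exit with reason 'normal' so that a
    %% supervisor using a 'transient' restart type does not restart us.
    {stop, normal, State}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

Stopping with reason `normal` is what keeps the supervisor from restarting the worker under a `transient` restart type, leaving the worker on the new leader node to take over.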
ansd force-pushed the ensure-single-dlx-worker branch from 7fcb99a to 029ce84 (March 20, 2023 16:30)
kjnilsson approved these changes (Mar 20, 2023)
LGTM
acogoluegnes approved these changes (Mar 20, 2023)
1. Terminate replaced rabbit_fifo_dlx_worker
The last consumer to do a `#checkout{}` wins and the “old one” has to terminate.
2. Make rabbit_fifo_dlx_sup more resilient
by not terminating itself if a low number of its children terminate
3. Do not restart DLX worker if leader is non-local
Instead, terminate, and let the new rabbit_fifo_dlx_worker on the other node take over.