Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve rabbit_fifo_dlx_worker resiliency #7677

Merged
merged 3 commits into from Mar 20, 2023
Merged

Conversation

ansd
Copy link
Member

@ansd ansd commented Mar 20, 2023

1. Terminate replaced rabbit_fifo_dlx_worker

Jepsen dead lettering tests of job qq-jepsen-test-3-12 of Concourse pipeline jepsen-tests fail sometimes with following error:

{{:try_clause, [{:undefined, #PID<12128.3596.0>, :worker, [:rabbit_fifo_dlx_worker]}, {:undefined, #PID<12128.10212.0>, :worker, [:rabbit_fifo_dlx_worker]}]}, [{:erl_eval, :try_clauses, 10, [file: 'erl_eval.erl', line: 995]}, {:erl_eval, :exprs, 2, []}]}

At the end of the Jepsen test, there are 2 DLX workers on the same node.

Analysing the logs reveals the following:

Source quorum queue node becomes leader and starts its DLX worker:

2023-03-18 12:14:04.365295+00:00 [debug] <0.1645.0> started rabbit_fifo_dlx_worker <0.3596.0> for queue 'jepsen.queue' in vhost '/'

Less than 1 second later, Mnesia reports a network partition (introduced by Jepsen).
The DLX worker does not succeed to register as consumer to its source quorum queue because the Ra command times out:

2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> Failed to process command {dlx,{checkout,<0.3596.0>,32}} on quorum queue leader {'%2F_jepsen.queue',
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0>                                                                                  'rabbit@concourse-qq-jepsen-312-3'}: {timeout,
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0>                                                                                                                        {'%2F_jepsen.queue',
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0>                                                                                                                         'rabbit@concourse-qq-jepsen-312-3'}}
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> Trying 5 more time(s)...

3 seconds after the DLX worker got created, the local source quorum queue node is not leader anymore:

2023-03-18 12:14:07.289213+00:00 [notice] <0.1645.0> queue 'jepsen.queue' in vhost '/': leader -> follower in term: 17 machine version: 3

But because the DLX worker at this point failed to register as consumer, it will not be terminated in

ensure_worker_terminated(#?MODULE{consumer = undefined}) ->
ok;
ensure_worker_terminated(#?MODULE{consumer = #dlx_consumer{pid = Pid}}) ->
case is_local_and_alive(Pid) of
true ->
%% Note that we can't return a mod_call effect here
%% because mod_call is executed on the leader only.
ok = supervisor:terminate_child(rabbit_fifo_dlx_sup, Pid),
rabbit_log:debug("terminated rabbit_fifo_dlx_worker ~tp", [Pid]);
false ->
ok
end.

Eventually, when the local node becomes a leader again, that DLX worker succeeds to register as consumer (due to retries in

process_command(Cmd, State, 5).
process_command(_Cmd, _State, 0) ->
{error, ra_command_failed};
process_command(Cmd, #state{leader = Leader} = State, Tries) ->
case ra:process_command(Leader, Cmd, 60_000) of
{ok, ok, Leader} ->
{ok, State#state{leader = Leader}};
{ok, ok, L} ->
rabbit_log:warning("Failed to process command ~tp on quorum queue leader ~tp because actual leader is ~tp.",
[Cmd, Leader, L]),
{error, ra_command_failed};
Err ->
rabbit_log:warning("Failed to process command ~tp on quorum queue leader ~tp: ~tp~n"
"Trying ~b more time(s)...",
[Cmd, Leader, Err, Tries]),
process_command(Cmd, State, Tries - 1)
end.
), and stays alive. When that happens, there is a 2nd DLX worker active because the 2nd got started when the local quorum queue node transitioned to become a leader.

This commit prevents this issue.
So, last consumer who does a #checkout{} wins and the “old one” has to terminate.

2. Make rabbit_fifo_dlx_sup more resilient

by not terminating itself if a low number of its children terminate

3. Do not restart DLX worker if leader is non-local

Instead, terminate, and let the new rabbit_fifo_dlx_worker on the other node take over.

[Jepsen dead lettering tests](https://github.com/rabbitmq/rabbitmq-ci/blob/5977f587e203698b8f281ed52b636d60489883b7/jepsen/scripts/qq-jepsen-test.sh#L108)
of job `qq-jepsen-test-3-12` of Concourse pipeline `jepsen-tests`
fail sometimes with following error:
```
{{:try_clause, [{:undefined, #PID<12128.3596.0>, :worker, [:rabbit_fifo_dlx_worker]}, {:undefined, #PID<12128.10212.0>, :worker, [:rabbit_fifo_dlx_worker]}]}, [{:erl_eval, :try_clauses, 10, [file: 'erl_eval.erl', line: 995]}, {:erl_eval, :exprs, 2, []}]}
```

At the end of the Jepsen test, there are 2 DLX workers on the same node.

Analysing the logs reveals the following:

Source quorum queue node becomes leader and starts its DLX worker:
```
2023-03-18 12:14:04.365295+00:00 [debug] <0.1645.0> started rabbit_fifo_dlx_worker <0.3596.0> for queue 'jepsen.queue' in vhost '/'
```
Less than 1 second later, Mnesia reports a network partition (introduced
by Jepsen).
The DLX worker does not succeed to register as consumer to its source quorum queue because the Ra command times out:
```
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> Failed to process command {dlx,{checkout,<0.3596.0>,32}} on quorum queue leader {'%2F_jepsen.queue',
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0>                                                                                  'rabbit@concourse-qq-jepsen-312-3'}: {timeout,
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0>                                                                                                                        {'%2F_jepsen.queue',
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0>                                                                                                                         'rabbit@concourse-qq-jepsen-312-3'}}
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> Trying 5 more time(s)...
```
3 seconds after the DLX worker got created, the local source quorum queue node is not leader anymore:
```
2023-03-18 12:14:07.289213+00:00 [notice] <0.1645.0> queue 'jepsen.queue' in vhost '/': leader -> follower in term: 17 machine version: 3
```
But because the DLX worker at this point failed to register as consumer, it will not be terminated in
https://github.com/rabbitmq/rabbitmq-server/blob/865d533863c29ed52e03070ac8d9e1bcaee8b205/deps/rabbit/src/rabbit_fifo_dlx.erl#L264-L275

Eventually, when the local node becomes a leader again, that DLX worker succeeds to register
as consumer (due to retries in https://github.com/rabbitmq/rabbitmq-server/blob/865d533863c29ed52e03070ac8d9e1bcaee8b205/deps/rabbit/src/rabbit_fifo_dlx_client.erl#L41-L58),
and stays alive. When that happens, there is a 2nd DLX worker active because the 2nd
got started when the local quorum queue node transitioned to become a leader.

This commit prevents this issue.
So, last consumer who does a `#checkout{}` wins and the “old one” has to terminate.
@ansd ansd changed the title Terminate replaced rabbit_fifo_dlx_worker Improve rabbit_fifo_dlx_worker resiliency Mar 20, 2023
@ansd
Copy link
Member Author

ansd commented Mar 20, 2023

The failing Test Mixed Version Clusters / Test (Mixed Version Cluster) (25_3) (pull_request) seems to be unrelated to this PR and caused by #7672

@ansd ansd marked this pull request as ready for review March 20, 2023 16:19
Previously, it used the default intensity:
"intensity defaults to 1 and period defaults to 5."

However, it's a bit low given there can be dozens or hundreds of DLX
workers: If only 2 fail within 5 seconds, the whole supervisor
terminates.

Even with the new values, there shouldn't be any infnite loop of the
supervisor terminating and restarting childs because the
rabbit_fifo_dlx_worker is terminated and started very quickly
given that the (the slow) consumer registration happens in
rabbit_fifo_dlx_worker:handle_continue/2.
The rabbit_fifo_dlx_worker should be co-located with the quorum
queue leader.

If a new leader on a different node gets elected before the
rabbit_fifo_dlx_worker initialises (i.e. registers itself as a
consumer), it should stop itself normally, such that it is not restarted
by rabbit_fifo_dlx_sup.

Another rabbit_fifo_dlx_worker should be created on the new quorum
queue leader node.
Copy link
Contributor

@kjnilsson kjnilsson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ansd ansd merged commit 557c23b into main Mar 20, 2023
15 of 16 checks passed
@ansd ansd deleted the ensure-single-dlx-worker branch March 20, 2023 16:54
michaelklishin added a commit that referenced this pull request Mar 20, 2023
Improve rabbit_fifo_dlx_worker resiliency (backport #7677) (backport #7680)
michaelklishin added a commit that referenced this pull request Mar 20, 2023
Improve rabbit_fifo_dlx_worker resiliency (backport #7677) (backport #7680) (backport #7681)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants