Improve rabbit_fifo_dlx_worker resiliency #7677

ansd · 2023-03-20T12:37:05Z

1. Terminate replaced rabbit_fifo_dlx_worker

Jepsen dead lettering tests of job qq-jepsen-test-3-12 of Concourse pipeline jepsen-tests fail sometimes with following error:

{{:try_clause, [{:undefined, #PID<12128.3596.0>, :worker, [:rabbit_fifo_dlx_worker]}, {:undefined, #PID<12128.10212.0>, :worker, [:rabbit_fifo_dlx_worker]}]}, [{:erl_eval, :try_clauses, 10, [file: 'erl_eval.erl', line: 995]}, {:erl_eval, :exprs, 2, []}]}

At the end of the Jepsen test, there are 2 DLX workers on the same node.

Analysing the logs reveals the following:

Source quorum queue node becomes leader and starts its DLX worker:

2023-03-18 12:14:04.365295+00:00 [debug] <0.1645.0> started rabbit_fifo_dlx_worker <0.3596.0> for queue 'jepsen.queue' in vhost '/'

Less than 1 second later, Mnesia reports a network partition (introduced by Jepsen).
The DLX worker does not succeed to register as consumer to its source quorum queue because the Ra command times out:

2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> Failed to process command {dlx,{checkout,<0.3596.0>,32}} on quorum queue leader {'%2F_jepsen.queue',
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0>                                                                                  'rabbit@concourse-qq-jepsen-312-3'}: {timeout,
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0>                                                                                                                        {'%2F_jepsen.queue',
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0>                                                                                                                         'rabbit@concourse-qq-jepsen-312-3'}}
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> Trying 5 more time(s)...

3 seconds after the DLX worker got created, the local source quorum queue node is not leader anymore:

2023-03-18 12:14:07.289213+00:00 [notice] <0.1645.0> queue 'jepsen.queue' in vhost '/': leader -> follower in term: 17 machine version: 3

But because the DLX worker at this point failed to register as consumer, it will not be terminated in

rabbitmq-server/deps/rabbit/src/rabbit_fifo_dlx.erl

Lines 264 to 275 in 865d533

    
           ensure_worker_terminated(#?MODULE{consumer = undefined}) -> 
        
               ok; 
        
           ensure_worker_terminated(#?MODULE{consumer = #dlx_consumer{pid = Pid}}) -> 
        
               case is_local_and_alive(Pid) of 
        
                   true -> 
        
                       %% Note that we can't return a mod_call effect here 
        
                       %% because mod_call is executed on the leader only. 
        
                       ok = supervisor:terminate_child(rabbit_fifo_dlx_sup, Pid), 
        
                       rabbit_log:debug("terminated rabbit_fifo_dlx_worker ~tp", [Pid]); 
        
                   false -> 
        
                       ok 
        
               end.

Eventually, when the local node becomes a leader again, that DLX worker succeeds to register as consumer (due to retries in

rabbitmq-server/deps/rabbit/src/rabbit_fifo_dlx_client.erl

Lines 41 to 58 in 865d533

    
               process_command(Cmd, State, 5). 
        
           process_command(_Cmd, _State, 0) -> 
        
               {error, ra_command_failed}; 
        
           process_command(Cmd, #state{leader = Leader} = State, Tries) -> 
        
               case ra:process_command(Leader, Cmd, 60_000) of 
        
                   {ok, ok, Leader} -> 
        
                       {ok, State#state{leader = Leader}}; 
        
                   {ok, ok, L} -> 
        
                       rabbit_log:warning("Failed to process command ~tp on quorum queue leader ~tp because actual leader is ~tp.", 
        
                                          [Cmd, Leader, L]), 
        
                       {error, ra_command_failed}; 
        
                   Err -> 
        
                       rabbit_log:warning("Failed to process command ~tp on quorum queue leader ~tp: ~tp~n" 
        
                                          "Trying ~b more time(s)...", 
        
                                          [Cmd, Leader, Err, Tries]), 
        
                       process_command(Cmd, State, Tries - 1) 
        
               end.

), and stays alive. When that happens, there is a 2nd DLX worker active because the 2nd got started when the local quorum queue node transitioned to become a leader.

This commit prevents this issue.
So, last consumer who does a #checkout{} wins and the “old one” has to terminate.

2. Make rabbit_fifo_dlx_sup more resilient

by not terminating itself if a low number of its children terminate

3. Do not restart DLX worker if leader is non-local

Instead, terminate, and let the new rabbit_fifo_dlx_worker on the other node take over.

[Jepsen dead lettering tests](https://github.com/rabbitmq/rabbitmq-ci/blob/5977f587e203698b8f281ed52b636d60489883b7/jepsen/scripts/qq-jepsen-test.sh#L108) of job `qq-jepsen-test-3-12` of Concourse pipeline `jepsen-tests` fail sometimes with following error: ``` {{:try_clause, [{:undefined, #PID<12128.3596.0>, :worker, [:rabbit_fifo_dlx_worker]}, {:undefined, #PID<12128.10212.0>, :worker, [:rabbit_fifo_dlx_worker]}]}, [{:erl_eval, :try_clauses, 10, [file: 'erl_eval.erl', line: 995]}, {:erl_eval, :exprs, 2, []}]} ``` At the end of the Jepsen test, there are 2 DLX workers on the same node. Analysing the logs reveals the following: Source quorum queue node becomes leader and starts its DLX worker: ``` 2023-03-18 12:14:04.365295+00:00 [debug] <0.1645.0> started rabbit_fifo_dlx_worker <0.3596.0> for queue 'jepsen.queue' in vhost '/' ``` Less than 1 second later, Mnesia reports a network partition (introduced by Jepsen). The DLX worker does not succeed to register as consumer to its source quorum queue because the Ra command times out: ``` 2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> Failed to process command {dlx,{checkout,<0.3596.0>,32}} on quorum queue leader {'%2F_jepsen.queue', 2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> 'rabbit@concourse-qq-jepsen-312-3'}: {timeout, 2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> {'%2F_jepsen.queue', 2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> 'rabbit@concourse-qq-jepsen-312-3'}} 2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> Trying 5 more time(s)... ``` 3 seconds after the DLX worker got created, the local source quorum queue node is not leader anymore: ``` 2023-03-18 12:14:07.289213+00:00 [notice] <0.1645.0> queue 'jepsen.queue' in vhost '/': leader -> follower in term: 17 machine version: 3 ``` But because the DLX worker at this point failed to register as consumer, it will not be terminated in https://github.com/rabbitmq/rabbitmq-server/blob/865d533863c29ed52e03070ac8d9e1bcaee8b205/deps/rabbit/src/rabbit_fifo_dlx.erl#L264-L275 Eventually, when the local node becomes a leader again, that DLX worker succeeds to register as consumer (due to retries in https://github.com/rabbitmq/rabbitmq-server/blob/865d533863c29ed52e03070ac8d9e1bcaee8b205/deps/rabbit/src/rabbit_fifo_dlx_client.erl#L41-L58), and stays alive. When that happens, there is a 2nd DLX worker active because the 2nd got started when the local quorum queue node transitioned to become a leader. This commit prevents this issue. So, last consumer who does a `#checkout{}` wins and the “old one” has to terminate.

ansd · 2023-03-20T16:19:21Z

The failing Test Mixed Version Clusters / Test (Mixed Version Cluster) (25_3) (pull_request) seems to be unrelated to this PR and caused by #7672

Previously, it used the default intensity: "intensity defaults to 1 and period defaults to 5." However, it's a bit low given there can be dozens or hundreds of DLX workers: If only 2 fail within 5 seconds, the whole supervisor terminates. Even with the new values, there shouldn't be any infnite loop of the supervisor terminating and restarting childs because the rabbit_fifo_dlx_worker is terminated and started very quickly given that the (the slow) consumer registration happens in rabbit_fifo_dlx_worker:handle_continue/2.

The rabbit_fifo_dlx_worker should be co-located with the quorum queue leader. If a new leader on a different node gets elected before the rabbit_fifo_dlx_worker initialises (i.e. registers itself as a consumer), it should stop itself normally, such that it is not restarted by rabbit_fifo_dlx_sup. Another rabbit_fifo_dlx_worker should be created on the new quorum queue leader node.

kjnilsson

LGTM

Improve rabbit_fifo_dlx_worker resiliency (backport #7677) (backport #7680)

Improve rabbit_fifo_dlx_worker resiliency (backport #7677) (backport #7680) (backport #7681)

ansd added backport-v3.11.x backport-v3.12.x backport-v3.10.x bug labels Mar 20, 2023

ansd force-pushed the ensure-single-dlx-worker branch from dd1a45b to c23fba0 Compare March 20, 2023 14:45

ansd changed the title ~~Terminate replaced rabbit_fifo_dlx_worker~~ Improve rabbit_fifo_dlx_worker resiliency Mar 20, 2023

ansd marked this pull request as ready for review March 20, 2023 16:19

ansd added 2 commits March 20, 2023 16:29

dumbbell mentioned this pull request Mar 20, 2023

Classic queues: make CQv2 the new default #7672

Merged

ansd force-pushed the ensure-single-dlx-worker branch from 7fcb99a to 029ce84 Compare March 20, 2023 16:30

kjnilsson approved these changes Mar 20, 2023

View reviewed changes

ansd merged commit 557c23b into main Mar 20, 2023
15 of 16 checks passed

ansd deleted the ensure-single-dlx-worker branch March 20, 2023 16:54

mergify bot mentioned this pull request Mar 20, 2023

Improve rabbit_fifo_dlx_worker resiliency (backport #7677) #7680

Merged

acogoluegnes approved these changes Mar 20, 2023

View reviewed changes

michaelklishin added a commit that referenced this pull request Mar 20, 2023

Merge pull request #7681 from rabbitmq/mergify/bp/v3.11.x/pr-7680

b8bfb87

Improve rabbit_fifo_dlx_worker resiliency (backport #7677) (backport #7680)

michaelklishin added a commit that referenced this pull request Mar 20, 2023

Merge pull request #7683 from rabbitmq/mergify/bp/v3.10.x/pr-7681

9d39cb0

Improve rabbit_fifo_dlx_worker resiliency (backport #7677) (backport #7680) (backport #7681)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve rabbit_fifo_dlx_worker resiliency #7677

Improve rabbit_fifo_dlx_worker resiliency #7677

ansd commented Mar 20, 2023 •

edited

ansd commented Mar 20, 2023

kjnilsson left a comment

	ensure_worker_terminated(#?MODULE{consumer = undefined}) ->
	ok;
	ensure_worker_terminated(#?MODULE{consumer = #dlx_consumer{pid = Pid}}) ->
	case is_local_and_alive(Pid) of
	true ->
	%% Note that we can't return a mod_call effect here
	%% because mod_call is executed on the leader only.
	ok = supervisor:terminate_child(rabbit_fifo_dlx_sup, Pid),
	rabbit_log:debug("terminated rabbit_fifo_dlx_worker ~tp", [Pid]);
	false ->
	ok
	end.

	process_command(Cmd, State, 5).

	process_command(_Cmd, _State, 0) ->
	{error, ra_command_failed};
	process_command(Cmd, #state{leader = Leader} = State, Tries) ->
	case ra:process_command(Leader, Cmd, 60_000) of
	{ok, ok, Leader} ->
	{ok, State#state{leader = Leader}};
	{ok, ok, L} ->
	rabbit_log:warning("Failed to process command ~tp on quorum queue leader ~tp because actual leader is ~tp.",
	[Cmd, Leader, L]),
	{error, ra_command_failed};
	Err ->
	rabbit_log:warning("Failed to process command ~tp on quorum queue leader ~tp: ~tp~n"
	"Trying ~b more time(s)...",
	[Cmd, Leader, Err, Tries]),
	process_command(Cmd, State, Tries - 1)
	end.

Improve rabbit_fifo_dlx_worker resiliency #7677

Improve rabbit_fifo_dlx_worker resiliency #7677

Conversation

ansd commented Mar 20, 2023 • edited

1. Terminate replaced rabbit_fifo_dlx_worker

2. Make rabbit_fifo_dlx_sup more resilient

3. Do not restart DLX worker if leader is non-local

ansd commented Mar 20, 2023

kjnilsson left a comment

Choose a reason for hiding this comment

ansd commented Mar 20, 2023 •

edited