queue process never be alive so queue declare blocked #349
Comments
It's not clear what you are trying to do. Please provide the exact steps you take and link to specific areas of the codebase if you have questions. I suggest doing that on rabbitmq-users: if there is an issue, we'll file one with a more specific description. Thank you.
I'll leave a note from the mailing list discussion of this. This may be #224, which was fixed in 3.5.5.
The situation itself is triggered by some other bug (though I'm not sure it's #224: there were no similar log records and no processes with similar stack traces). The loop happens inside rabbit_channel:handle_method(#'queue.declare'{}, ...). Even if something is wrong in a different part of the system, I don't think it's a good idea to loop forever at this point; as a result we have a server that stops responding. So at least some safeguard should be added that limits the number of rabbit_amqqueue:with/3 iterations, either with some reasonable number of attempts or with an overall timeout.
@binarin should we reopen this bug or log a new one?
@dims I'm going to provide a separate PR, for further discussion.
This is a manual rebase of rabbitmq-common #349 rabbitmq/rabbitmq-common@1c09c0f
I use RabbitMQ version 3.4.3 and have a cluster consisting of two nodes. I have a test case that runs a script on each node; the script restarts RabbitMQ at an interval of 30 seconds.
Some time after the test case begins (sometimes half an hour, sometimes 2 or 3 hours), something goes wrong with some queues: when I declare such a queue, the call blocks and never returns.
I also tried version 3.5.4; the issue exists there too.
I have been debugging this for a while and found something. Can you help check whether this is a real bug?
The queue details in mnesia:
Eshell V6.2 (abort with ^G)
(rabbit@rabbitmqNode0)1> rd(resource, {virtual_host, kind, name}).
resource
(rabbit@rabbitmqNode0)2> rd(amqqueue, {name, durable, auto_delete, exclusive_owner = none,arguments,pid,slave_pids, sync_slave_pids,down_slave_nodes,policy,gm_pids,decorators,state}).
amqqueue
(rabbit@rabbitmqNode0)3> QN2= #resource{virtual_host= <<"/">>, kind=queue,name= <<"q-servicechain-plugin">>}.
#resource{virtual_host = <<"/">>,kind = queue,
          name = <<"q-servicechain-plugin">>}
(rabbit@rabbitmqNode0)4> rabbit_misc:dirty_read({rabbit_queue, QN2}).
{ok,#amqqueue{name = #resource{virtual_host = <<"/">>,
kind = queue,name = <<"q-servicechain-plugin">>},
durable = false,auto_delete = false,exclusive_owner = none,
arguments = [],pid = <3079.1517.0>,        %% <- pid
slave_pids = [<0.823.0>],                  %% <- slave_pids
sync_slave_pids = [],down_slave_nodes = [],
policy = [{vhost,<<"/">>},
{name,<<"ha_length_ttl">>},
{pattern,<<"^(?!metering.sample).+">>},
{'apply-to',<<"queues">>},
{definition,[{<<"ha-mode">>,<<"all">>},
{<<"ha-sync-mode">>,<<"automatic">>},
{<<"max-length">>,59600},
{<<"message-ttl">>,86400000}]},
{priority,1}],
gm_pids = [{<0.828.0>,<0.823.0>}],
decorators = [],state = live}}
(rabbit@rabbitmqNode0)5> rabbit_misc:is_process_alive(pid(3079,1517,0)).
false
1: Here, if the queue pid is not alive, the declare will block:
with(Name, F, E) ->
case lookup(Name) of
{ok, Q = #amqqueue{state = crashed}} ->
E({absent, Q, crashed});
{ok, Q = #amqqueue{pid = QPid}} ->
%% We check is_process_alive(QPid) in case we receive a
%% nodedown (for example) in F() that has nothing to do
%% with the QPid. F() should be written s.t. that this
%% cannot happen, so we bail if it does since that
%% indicates a code bug and we don't want to get stuck in
%% the retry loop.
rabbit_misc:with_exit_handler(
fun () -> false = rabbit_mnesia:is_process_alive(QPid),
timer:sleep(25),
with(Name, F, E)
end, fun () -> F(Q) end);
{error, not_found} ->
E(not_found_or_absent_dirty(Name))
end.
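The safeguard suggested earlier in the thread could look something like the following. This is only a minimal sketch with a hypothetical attempt counter bolted onto the existing with/3; the MAX_RETRIES constant and the arity-4 helper are assumptions for illustration, not the actual fix:

```erlang
%% Sketch: bound the retry loop in rabbit_amqqueue:with/3 so that a dead
%% QPid cannot block queue.declare forever. MAX_RETRIES and with/4 are
%% hypothetical names, not the real API.
-define(MAX_RETRIES, 40).  %% ~1 second total at 25 ms per attempt

with(Name, F, E) -> with(Name, F, E, ?MAX_RETRIES).

with(Name, _F, E, 0) ->
    %% Retries exhausted: report the queue as absent instead of
    %% looping forever and wedging the channel.
    E(not_found_or_absent_dirty(Name));
with(Name, F, E, RetriesLeft) ->
    case lookup(Name) of
        {ok, Q = #amqqueue{state = crashed}} ->
            E({absent, Q, crashed});
        {ok, Q = #amqqueue{pid = QPid}} ->
            rabbit_misc:with_exit_handler(
              fun () -> false = rabbit_mnesia:is_process_alive(QPid),
                        timer:sleep(25),
                        with(Name, F, E, RetriesLeft - 1)
              end, fun () -> F(Q) end);
        {error, not_found} ->
            E(not_found_or_absent_dirty(Name))
    end.
```

Either a retry cap like this or an overall deadline would convert the hang into an explicit error the channel can report to the client.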
terminate(Reason,
State = #state { name = QName,
backing_queue = BQ,
backing_queue_state = BQS }) ->
%% Backing queue termination. The queue is going down but
%% shouldn't be deleted. Most likely safe shutdown of this
%% node.
{ok, Q = #amqqueue{sync_slave_pids = SSPids}} =
rabbit_amqqueue:lookup(QName),
case SSPids =:= [] andalso
rabbit_policy:get(<<"ha-promote-on-shutdown">>, Q) =/= <<"always">> of
true -> %% Remove the whole queue to avoid data loss
rabbit_mirror_queue_misc:log_warning(
QName, "Stopping all nodes on master shutdown since no "
"synchronised slave is available~n", []),
stop_all_slaves(Reason, State);
%% 2: here the master is down, but there is no synchronised
%% slave, so all slaves are stopped; the pid recorded in
%% mnesia, however, is not updated.
false -> %% Just let some other slave take over.
ok
end,
State #state { backing_queue_state = BQ:terminate(Reason, BQS) }.
on_node_down(Node) ->
rabbit_misc:execute_mnesia_tx_with_tail(
fun () -> QsDels =
qlc:e(qlc:q([{QName, delete_queue(QName)} ||
#amqqueue{name = QName, pid = Pid,
slave_pids = []}
<- mnesia:table(rabbit_queue),
node(Pid) == Node andalso
not rabbit_mnesia:is_process_alive(Pid)])),
%% 3: when all three conditions are satisfied, the queue is
%% deleted; but sometimes slave_pids is not [].
{Qs, Dels} = lists:unzip(QsDels),
T = rabbit_binding:process_deletions(
lists:foldl(fun rabbit_binding:combine_deletions/2,
rabbit_binding:new_deletions(), Dels)),
fun () ->
T(),
lists:foreach(
fun(QName) ->
ok = rabbit_event:notify(queue_deleted,
[{name, QName}])
end, Qs)
end
end).
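Note that the comprehension above only ever matches queues whose slave_pids field is exactly []. A self-contained illustration of that pattern match (a hypothetical record and helper for demonstration, not the real mnesia table):

```erlang
%% Illustration: the pattern #amqqueue{slave_pids = []} skips any queue
%% that still has slave pids recorded, even if its master pid is dead.
-record(amqqueue, {name, pid, slave_pids}).

deletable(Queues, Node) ->
    [Name || #amqqueue{name = Name, pid = Pid, slave_pids = []} <- Queues,
             node(Pid) == Node].
```

So a queue like the one in the transcript above, with slave_pids = [<0.823.0>], is never selected for deletion even though its master pid is not alive.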
4: Why is slave_pids not []?
on_node_up() ->
QNames =
rabbit_misc:execute_mnesia_transaction(
fun () ->
mnesia:foldl(
fun (Q = #amqqueue{name = QName,
pid = Pid,
slave_pids = SPids}, QNames0) ->
%% We don't want to pass in the whole
%% cluster - we don't want a situation
%% where starting one node causes us to
%% decide to start a mirror on another
PossibleNodes0 = [node(P) || P <- [Pid | SPids]],
PossibleNodes =
case lists:member(node(), PossibleNodes0) of
true -> PossibleNodes0;
false -> [node() | PossibleNodes0]
end,
{_MNode, SNodes} = suggested_queue_nodes(
Q, PossibleNodes),
case lists:member(node(), SNodes) of
true -> [QName | QNames0];
false -> QNames0
end
end, [], rabbit_queue)
end),
[add_mirror(QName, node(), async) || QName <- QNames],
%% here a slave is created. In my environment, when the master went
%% down, the other node had just come up and created a slave, so the
%% queue could not be deleted.
ok.