
queue process never be alive so queue declare blocked #349

Closed
baoyonglei opened this issue Oct 8, 2015 · 5 comments

Labels: mailing list material (this belongs to the mailing list: rabbitmq-users on Google Groups)

Comments

@baoyonglei

I use RabbitMQ 3.4.3 and have a cluster consisting of two nodes. I have a test case that runs a script on each node; the script restarts RabbitMQ at an interval of 30 seconds. Some time after the test case begins (sometimes half an hour, sometimes 2 or 3 hours), something goes wrong with some queues: when I declare such a queue, the call blocks and never returns. I have also tried version 3.5.4, and the issue exists there too.

I have debugged the issue for a while and found something. Can you check whether this is a real bug?

The queue's details in Mnesia:

Eshell V6.2 (abort with ^G)
(rabbit@rabbitmqNode0)1> rd(resource, {virtual_host, kind, name}).
resource
(rabbit@rabbitmqNode0)2> rd(amqqueue, {name, durable, auto_delete, exclusive_owner = none, arguments, pid, slave_pids, sync_slave_pids, down_slave_nodes, policy, gm_pids, decorators, state}).
amqqueue
(rabbit@rabbitmqNode0)3> QN2 = #resource{virtual_host = <<"/">>, kind = queue, name = <<"q-servicechain-plugin">>}.
#resource{virtual_host = <<"/">>,kind = queue,
          name = <<"q-servicechain-plugin">>}
(rabbit@rabbitmqNode0)4> rabbit_misc:dirty_read({rabbit_queue, QN2}).
{ok,#amqqueue{name = #resource{virtual_host = <<"/">>,
                               kind = queue,
                               name = <<"q-servicechain-plugin">>},
              durable = false,auto_delete = false,exclusive_owner = none,
              arguments = [],
              pid = <3079.1517.0>,        %% <- master pid (on the remote node)
              slave_pids = [<0.823.0>],   %% <- slave_pids (not empty)
              sync_slave_pids = [],down_slave_nodes = [],
              policy = [{vhost,<<"/">>},
                        {name,<<"ha_length_ttl">>},
                        {pattern,<<"^(?!metering.sample).+">>},
                        {'apply-to',<<"queues">>},
                        {definition,[{<<"ha-mode">>,<<"all">>},
                                     {<<"ha-sync-mode">>,<<"automatic">>},
                                     {<<"max-length">>,59600},
                                     {<<"message-ttl">>,86400000}]},
                        {priority,1}],
              gm_pids = [{<0.828.0>,<0.823.0>}],
              decorators = [],state = live}}
(rabbit@rabbitmqNode0)5> rabbit_misc:is_process_alive(pid(3079,1517,0)).
false
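
A diagnostic sketch to find all queues in this state (assuming the #amqqueue{} field layout declared with rd/2 above; dead_masters, find/0 and is_remote_process_alive/1 are illustrative names, not RabbitMQ API):

-module(dead_masters).
-export([find/0]).

%% Field layout as declared in the shell session above (RabbitMQ 3.4/3.5).
-record(amqqueue, {name, durable, auto_delete, exclusive_owner = none,
                   arguments, pid, slave_pids, sync_slave_pids,
                   down_slave_nodes, policy, gm_pids, decorators, state}).

%% Return the names of all queues whose master pid is no longer alive.
find() ->
    [QName || Key <- mnesia:dirty_all_keys(rabbit_queue),
              #amqqueue{name = QName, pid = Pid}
                  <- mnesia:dirty_read(rabbit_queue, Key),
              not is_remote_process_alive(Pid)].

%% Approximation of rabbit_mnesia:is_process_alive/1: the pid counts as
%% alive only if its node is reachable and confirms the process exists.
is_remote_process_alive(Pid) ->
    Node = node(Pid),
    lists:member(Node, [node() | nodes()]) andalso
        rpc:call(Node, erlang, is_process_alive, [Pid]) =:= true.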

1. Here, if the queue pid is not alive, the declare blocks: lookup/1 still finds the stale record, F(Q) exits because the pid is dead, the exit handler confirms the pid is not alive and retries after 25 ms, and since nothing ever removes the record, with/3 loops forever.
with(Name, F, E) ->
    case lookup(Name) of
        {ok, Q = #amqqueue{state = crashed}} ->
            E({absent, Q, crashed});
        {ok, Q = #amqqueue{pid = QPid}} ->
            %% We check is_process_alive(QPid) in case we receive a
            %% nodedown (for example) in F() that has nothing to do
            %% with the QPid. F() should be written s.t. that this
            %% cannot happen, so we bail if it does since that
            %% indicates a code bug and we don't want to get stuck in
            %% the retry loop.
            rabbit_misc:with_exit_handler(
              fun () -> false = rabbit_mnesia:is_process_alive(QPid),
                        timer:sleep(25),
                        with(Name, F, E)
              end, fun () -> F(Q) end);
        {error, not_found} ->
            E(not_found_or_absent_dirty(Name))
    end.

2. Why is the pid not alive?

terminate(Reason,
          State = #state { name                = QName,
                           backing_queue       = BQ,
                           backing_queue_state = BQS }) ->
    %% Backing queue termination. The queue is going down but
    %% shouldn't be deleted. Most likely safe shutdown of this
    %% node.
    {ok, Q = #amqqueue{sync_slave_pids = SSPids}} =
        rabbit_amqqueue:lookup(QName),
    case SSPids =:= [] andalso
         rabbit_policy:get(<<"ha-promote-on-shutdown">>, Q) =/= <<"always">> of
        true  -> %% Remove the whole queue to avoid data loss
                 rabbit_mirror_queue_misc:log_warning(
                   QName, "Stopping all nodes on master shutdown since no "
                          "synchronised slave is available~n", []),
                 %% <- the master is going down with no synchronised
                 %% slaves, so all slaves are stopped, but the queue's
                 %% pid in Mnesia is never modified.
                 stop_all_slaves(Reason, State);
        false -> %% Just let some other slave take over.
                 ok
    end,
    State #state { backing_queue_state = BQ:terminate(Reason, BQS) }.
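
Checking that condition against the record from the shell session above (a sketch; Promote = undefined assumes rabbit_policy:get/2 returns undefined when, as here, the policy definition does not set <<"ha-promote-on-shutdown">>):

%% Values for q-servicechain-plugin, taken from the dirty_read output:
SSPids  = [].          %% sync_slave_pids
Promote = undefined.   %% assumed result of rabbit_policy:get/2
true = (SSPids =:= [] andalso Promote =/= <<"always">>).
%% => the true branch runs: stop_all_slaves/2 stops every slave, while
%%    rabbit_queue keeps pointing at the now-dead master pid.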
3. I also found that for this type of queue (no synchronised slaves), RabbitMQ will delete them in on_node_down/1:

on_node_down(Node) ->
    rabbit_misc:execute_mnesia_tx_with_tail(
      fun () -> QsDels =
                    qlc:e(qlc:q([{QName, delete_queue(QName)} ||
                                    #amqqueue{name = QName, pid = Pid,
                                              slave_pids = []}
                                        <- mnesia:table(rabbit_queue),
                                    node(Pid) == Node andalso
                                    not rabbit_mnesia:is_process_alive(Pid)])),
                %% <- when all three conditions are satisfied the queue
                %% is deleted -- but sometimes slave_pids is not [].
                {Qs, Dels} = lists:unzip(QsDels),
                T = rabbit_binding:process_deletions(
                      lists:foldl(fun rabbit_binding:combine_deletions/2,
                                  rabbit_binding:new_deletions(), Dels)),
                fun () ->
                        T(),
                        lists:foreach(
                          fun(QName) ->
                                  ok = rabbit_event:notify(queue_deleted,
                                                           [{name, QName}])
                          end, Qs)
                end
      end).
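
The qlc filter restated as a standalone predicate makes the three conditions explicit (a sketch; should_delete_on_node_down/2 is an illustrative name):

should_delete_on_node_down(#amqqueue{pid = Pid, slave_pids = SPids}, Node) ->
    SPids =:= []                                         %% 1. no slaves recorded
        andalso node(Pid) =:= Node                       %% 2. master lived on the down node
        andalso not rabbit_mnesia:is_process_alive(Pid). %% 3. master pid is dead

For q-servicechain-plugin above, conditions 2 and 3 hold but condition 1 fails: slave_pids = [<0.823.0>], so the stale record is never deleted.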

4. Why is slave_pids not []?

on_node_up() ->
    QNames =
        rabbit_misc:execute_mnesia_transaction(
          fun () ->
                  mnesia:foldl(
                    fun (Q = #amqqueue{name       = QName,
                                       pid        = Pid,
                                       slave_pids = SPids}, QNames0) ->
                            %% We don't want to pass in the whole
                            %% cluster - we don't want a situation
                            %% where starting one node causes us to
                            %% decide to start a mirror on another
                            PossibleNodes0 = [node(P) || P <- [Pid | SPids]],
                            PossibleNodes =
                                case lists:member(node(), PossibleNodes0) of
                                    true  -> PossibleNodes0;
                                    false -> [node() | PossibleNodes0]
                                end,
                            {_MNode, SNodes} = suggested_queue_nodes(
                                                 Q, PossibleNodes),
                            case lists:member(node(), SNodes) of
                                true  -> [QName | QNames0];
                                false -> QNames0
                            end
                    end, [], rabbit_queue)
          end),
    %% <- here a slave is created. In my environment, when the master
    %% goes down, the other node has just come back up and creates a
    %% slave, so the queue can never be deleted.
    [add_mirror(QName, node(), async) || QName <- QNames],
    ok.
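
Extending the diagnostic sketch from above, the queues stuck in exactly this state (dead master plus a repopulated slave_pids) can be listed like this (stuck_queues/0 is an illustrative name; is_remote_process_alive/1 and the #amqqueue{} record are from the earlier sketch):

%% Queues on_node_down/1 can never delete: the master pid is dead, yet
%% slave_pids is non-empty because on_node_up/0 re-added a mirror.
stuck_queues() ->
    [QName || Key <- mnesia:dirty_all_keys(rabbit_queue),
              #amqqueue{name = QName, pid = Pid, slave_pids = SPids}
                  <- mnesia:dirty_read(rabbit_queue, Key),
              SPids =/= [],
              not is_remote_process_alive(Pid)].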

@michaelklishin
Member

It's not clear what you are trying to do. Please provide the exact steps you take and link to specific areas of the codebase if you have questions. I suggest doing that on rabbitmq-users: if there is an issue, we'll file one with a more specific description. Thank you.

michaelklishin self-assigned this Oct 8, 2015
michaelklishin added the "mailing list material" label (this belongs to the mailing list: rabbitmq-users on Google Groups) Oct 8, 2015
michaelklishin added this to the n/a milestone Oct 8, 2015
@michaelklishin
Member

I'll leave a note from the mailing list discussion of this. This may be #224, which was fixed in 3.5.5.

@binarin
Contributor

binarin commented Nov 19, 2015

The situation itself is triggered by some other bug (but I'm not sure this is #224; there were no similar log records and no processes with similar stack traces).

The loop happens inside rabbit_channel:handle_method(#'queue.declare'{}, ...). It calls maybe_stat/2, which in turn calls rabbit_amqqueue:stat/1. This call results in a nodedown exit, which finally triggers the infinite loop in rabbit_amqqueue:with/3.

While something is wrong in a different part of the system, I don't think it's a good idea to loop forever at this point - as a result we have a server that doesn't respond to rabbitmqctl list_channels and also a client that is forever blocked on queue.declare.

So at least some safeguard should be added that limits the number of rabbit_amqqueue:with/3 iterations - either with some reasonable number of attempts or with some overall timeout.
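
One possible shape of such a safeguard (a sketch for discussion, not the eventual PR; ?RETRIES and the {absent, Q, timeout} error term are illustrative choices; lookup/1, not_found_or_absent_dirty/1 and the rest come from the with/3 quoted above):

-define(RETRIES, 40).  %% ~1 second total at 25 ms per attempt

with(Name, F, E) ->
    with(Name, F, E, ?RETRIES).

with(Name, F, E, RetriesLeft) ->
    case lookup(Name) of
        {ok, Q = #amqqueue{state = crashed}} ->
            E({absent, Q, crashed});
        {ok, Q = #amqqueue{}} when RetriesLeft =:= 0 ->
            %% Out of retries: report the queue as absent instead of
            %% blocking the channel (and rabbitmqctl) forever.
            E({absent, Q, timeout});
        {ok, Q = #amqqueue{pid = QPid}} ->
            rabbit_misc:with_exit_handler(
              fun () -> false = rabbit_mnesia:is_process_alive(QPid),
                        timer:sleep(25),
                        with(Name, F, E, RetriesLeft - 1)
              end, fun () -> F(Q) end);
        {error, not_found} ->
            E(not_found_or_absent_dirty(Name))
    end.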

@dims

dims commented Nov 20, 2015

@binarin should we ask to reopen this bug or file a new one?

@binarin
Contributor

binarin commented Nov 20, 2015

@dims I'm going to provide a separate PR, for further discussion.

pjk25 added a commit that referenced this issue Jul 16, 2021
This is a manual rebase of rabbitmq-common #349

rabbitmq/rabbitmq-common@1c09c0f
pjk25 added a commit that referenced this issue Dec 6, 2021
This is a manual rebase of rabbitmq-common #349

rabbitmq/rabbitmq-common@1c09c0f