queue process never be alive so queue declare blocked #349
Comments
It's not clear what you are trying to do. Please provide the exact steps you take and link to specific areas of the codebase if you have questions. I suggest doing that on rabbitmq-users: if there is an issue, we'll file one with a more specific description. Thank you.
I'll leave a note from the mailing list discussion of this. This may be #224, which was fixed in 3.5.5.
The situation itself is triggered by some other bug (though I'm not sure it's #224: there were no similar log records and no processes with similar stack traces). The loop happens inside rabbit_channel:handle_method(#'queue.declare'{}, ...). Even if something is wrong in a different part of the system, I don't think it's a good idea to loop forever at this point; as a result we have a server that stops responding. So at least some safeguard should be added that limits the number of rabbit_amqqueue:with/3 iterations, either with some reasonable number of attempts or with an overall timeout.
@binarin should we reopen this bug or log a new one?
@dims I'm going to provide a separate PR, for further discussion.
This is a manual rebase of rabbitmq-common #349 rabbitmq/rabbitmq-common@1c09c0f
I use RabbitMQ version 3.4.3 and have a cluster consisting of two nodes. I have a test case that runs a script on each node; the script restarts RabbitMQ at an interval of 30 seconds.
Some time after the test case begins (sometimes half an hour, sometimes 2 or 3 hours), something goes wrong with some queues: when I declare such a queue, the call blocks and never returns.
I also tried version 3.5.4; the issue exists there too.
I have been debugging this for a while and found something. Can you help check whether this is a real bug?
The queue details in mnesia:
Eshell V6.2 (abort with ^G)
(rabbit@rabbitmqNode0)1> rd(resource, {virtual_host, kind, name}).
resource
(rabbit@rabbitmqNode0)2> rd(amqqueue, {name, durable, auto_delete, exclusive_owner = none,arguments,pid,slave_pids, sync_slave_pids,down_slave_nodes,policy,gm_pids,decorators,state}).
amqqueue
(rabbit@rabbitmqNode0)3> QN2= #resource{virtual_host= <<"/">>, kind=queue,name= <<"q-servicechain-plugin">>}.
#resource{virtual_host = <<"/">>,kind = queue,
          name = <<"q-servicechain-plugin">>}
(rabbit@rabbitmqNode0)4> rabbit_misc:dirty_read({rabbit_queue, QN2}).
{ok,#amqqueue{name = #resource{virtual_host = <<"/">>,
kind = queue,name = <<"q-servicechain-plugin">>},
durable = false,auto_delete = false,exclusive_owner = none,
arguments = [],pid = <3079.1517.0>,        %% <- pid
slave_pids = [<0.823.0>],                  %% <- slave_pids
sync_slave_pids = [],down_slave_nodes = [],
policy = [{vhost,<<"/">>},
{name,<<"ha_length_ttl">>},
{pattern,<<"^(?!metering.sample).+">>},
{'apply-to',<<"queues">>},
{definition,[{<<"ha-mode">>,<<"all">>},
{<<"ha-sync-mode">>,<<"automatic">>},
{<<"max-length">>,59600},
{<<"message-ttl">>,86400000}]},
{priority,1}],
gm_pids = [{<0.828.0>,<0.823.0>}],
decorators = [],state = live}}
(rabbit@rabbitmqNode0)5> rabbit_misc:is_process_alive(pid(3079,1517,0)).
false
1: Here, if the queue pid is not alive, the declare will block:
with(Name, F, E) ->
case lookup(Name) of
{ok, Q = #amqqueue{state = crashed}} ->
E({absent, Q, crashed});
{ok, Q = #amqqueue{pid = QPid}} ->
%% We check is_process_alive(QPid) in case we receive a
%% nodedown (for example) in F() that has nothing to do
%% with the QPid. F() should be written s.t. that this
%% cannot happen, so we bail if it does since that
%% indicates a code bug and we don't want to get stuck in
%% the retry loop.
rabbit_misc:with_exit_handler(
fun () -> false = rabbit_mnesia:is_process_alive(QPid),
timer:sleep(25),
with(Name, F, E)
end, fun () -> F(Q) end);
{error, not_found} ->
E(not_found_or_absent_dirty(Name))
end.
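The safeguard suggested earlier in the thread could look something like the following. This is only a minimal sketch with a hypothetical attempt counter bolted onto the existing with/3; the MAX_RETRIES constant and the arity-4 helper are assumptions for illustration, not the actual fix:

```erlang
%% Sketch: bound the retry loop in rabbit_amqqueue:with/3 so that a dead
%% QPid cannot block queue.declare forever. MAX_RETRIES and with/4 are
%% hypothetical names, not the real API.
-define(MAX_RETRIES, 40).  %% ~1 second total at 25 ms per attempt

with(Name, F, E) -> with(Name, F, E, ?MAX_RETRIES).

with(Name, _F, E, 0) ->
    %% Retries exhausted: report the queue as absent instead of
    %% looping forever and wedging the channel.
    E(not_found_or_absent_dirty(Name));
with(Name, F, E, RetriesLeft) ->
    case lookup(Name) of
        {ok, Q = #amqqueue{state = crashed}} ->
            E({absent, Q, crashed});
        {ok, Q = #amqqueue{pid = QPid}} ->
            rabbit_misc:with_exit_handler(
              fun () -> false = rabbit_mnesia:is_process_alive(QPid),
                        timer:sleep(25),
                        with(Name, F, E, RetriesLeft - 1)
              end, fun () -> F(Q) end);
        {error, not_found} ->
            E(not_found_or_absent_dirty(Name))
    end.
```

Either a retry cap like this or an overall deadline would convert the hang into an explicit error the channel can report to the client.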
terminate(Reason,
State = #state { name = QName,
backing_queue = BQ,
backing_queue_state = BQS }) ->
%% Backing queue termination. The queue is going down but
%% shouldn't be deleted. Most likely safe shutdown of this
%% node.
{ok, Q = #amqqueue{sync_slave_pids = SSPids}} =
rabbit_amqqueue:lookup(QName),
case SSPids =:= [] andalso
rabbit_policy:get(<<"ha-promote-on-shutdown">>, Q) =/= <<"always">> of
true -> %% Remove the whole queue to avoid data loss
rabbit_mirror_queue_misc:log_warning(
QName, "Stopping all nodes on master shutdown since no "
"synchronised slave is available~n", []),
stop_all_slaves(Reason, State);
%% 2: here the master is down, but there is no synchronised
%% slave, so all slaves are stopped; the pid recorded in
%% mnesia, however, is not updated.
false -> %% Just let some other slave take over.
ok
end,
State #state { backing_queue_state = BQ:terminate(Reason, BQS) }.
on_node_down(Node) ->
rabbit_misc:execute_mnesia_tx_with_tail(
fun () -> QsDels =
qlc:e(qlc:q([{QName, delete_queue(QName)} ||
#amqqueue{name = QName, pid = Pid,
slave_pids = []}
<- mnesia:table(rabbit_queue),
node(Pid) == Node andalso
not rabbit_mnesia:is_process_alive(Pid)])),
%% 3: when all three conditions are satisfied, the queue is
%% deleted; but sometimes slave_pids is not [].
{Qs, Dels} = lists:unzip(QsDels),
T = rabbit_binding:process_deletions(
lists:foldl(fun rabbit_binding:combine_deletions/2,
rabbit_binding:new_deletions(), Dels)),
fun () ->
T(),
lists:foreach(
fun(QName) ->
ok = rabbit_event:notify(queue_deleted,
[{name, QName}])
end, Qs)
end
end).
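Note that the comprehension above only ever matches queues whose slave_pids field is exactly []. A self-contained illustration of that pattern match (a hypothetical record and helper for demonstration, not the real mnesia table):

```erlang
%% Illustration: the pattern #amqqueue{slave_pids = []} skips any queue
%% that still has slave pids recorded, even if its master pid is dead.
-record(amqqueue, {name, pid, slave_pids}).

deletable(Queues, Node) ->
    [Name || #amqqueue{name = Name, pid = Pid, slave_pids = []} <- Queues,
             node(Pid) == Node].
```

So a queue like the one in the transcript above, with slave_pids = [<0.823.0>], is never selected for deletion even though its master pid is not alive.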
4: Why is slave_pids not []?
on_node_up() ->
QNames =
rabbit_misc:execute_mnesia_transaction(
fun () ->
mnesia:foldl(
fun (Q = #amqqueue{name = QName,
pid = Pid,
slave_pids = SPids}, QNames0) ->
%% We don't want to pass in the whole
%% cluster - we don't want a situation
%% where starting one node causes us to
%% decide to start a mirror on another
PossibleNodes0 = [node(P) || P <- [Pid | SPids]],
PossibleNodes =
case lists:member(node(), PossibleNodes0) of
true -> PossibleNodes0;
false -> [node() | PossibleNodes0]
end,
{_MNode, SNodes} = suggested_queue_nodes(
Q, PossibleNodes),
case lists:member(node(), SNodes) of
true -> [QName | QNames0];
false -> QNames0
end
end, [], rabbit_queue)
end),
[add_mirror(QName, node(), async) || QName <- QNames],
%% here a slave is created. In my environment, when the master went
%% down, the other node had just come up and created a slave, so the
%% queue could not be deleted.
ok.