
Crash report when broker stops #1004

Closed
Jooyyce opened this issue Dec 14, 2018 · 13 comments · Fixed by #1027

@Jooyyce commented Dec 14, 2018

VerneMQ version:

1.2.3

Problem:

When we stop the VerneMQ broker (with the systemctl stop vernemq.service or vernemq stop commands), we see the following crash report in the console.log file:

YYYY-MM-DD HH:MM:SS.XXX [error] <0.482.0>@vmq_cluster:is_ready:69 CRASH REPORT Process <0.482.0> with 0 neighbours crashed with reason: bad argument in call to ets:lookup(vmq_status, ready) in vmq_cluster:is_ready/0 line 68

Each crash report corresponds to one failed Erlang process associated with one MQTT client (<0.482.0> is an Erlang PID). In every case, the broker continues to operate; only that client's process fails. Other clients are not affected by this problem.

The configuration file does not seem to be in question. We can reproduce the crash report with the default VerneMQ config file plus the following changes:

  • allow anonymous connections (allow_anonymous)
  • bind the TCP listener to 0.0.0.0:1883 (instead of 127.0.0.1:1883)
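
Concretely, against a default vernemq.conf this corresponds to roughly the following (a sketch; the option names allow_anonymous and listener.tcp.default are taken from the VerneMQ documentation and may differ between versions):

```
allow_anonymous = on
listener.tcp.default = 0.0.0.0:1883
```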

The MQTT client seems to trigger the error: its session process does not exit correctly when the broker stops or the connection is dropped abruptly.

@Jooyyce (Author) commented Dec 14, 2018

We can also reproduce the problem on VerneMQ 1.6.2 with a simple MQTT client (using Paho MQTT).
The AutoReconnect parameter seems to determine whether crash reports appear:

  • when it is set to false: no crash report
  • when it is set to true: crash report

We used Wireshark to understand the network exchanges and the differences between these two types of connections.
A TCP connection is initiated before the MQTT connection request. This connection is kept active until the VerneMQ broker shuts down (at that point the client receives a TCP reset from the broker).
With AutoReconnect set to true, the client then immediately tries to establish a new TCP connection and a new MQTT connection. The TCP connection is accepted by the broker, but the MQTT connection is not: no CONNACK is returned (which is expected, since the broker is going away). The broker, while shutting down, still receives the connection request but is unable to find an available cluster node, so the client's session process crashes.

@larshesel (Contributor)

Hi, thanks for the report. Looks like there's a case we're not handling while shutting down VerneMQ. Do you have the output from the crash.log file available as well?

@m-manns commented Dec 14, 2018

Hi,
I work with @Jooyyce and am following this issue.

You can find the output from the crash.log file below (VerneMQ 1.6.2):

2018-12-13 15:03:45 =CRASH REPORT====
  crasher:
    initial call: vmq_ranch:init/5
    pid: <0.464.0>
    registered_name: []
    exception error: bad argument: [{ets,lookup,[vmq_status,ready],[]},{vmq_cluster,is_ready,0,[{file,"/opt/vernemq/distdir/1.6.2/_build/default/lib/vmq_server/src/vmq_cluster.erl"},{line,69}]},{vmq_reg,register_subscriber,4,[{file,"/opt/vernemq/distdir/1.6.2/_build/default/lib/vmq_server/src/vmq_reg.erl"},{line,121}]},{vmq_mqtt_fsm,check_user,2,[{file,"/opt/vernemq/distdir/1.6.2/_build/default/lib/vmq_server/src/vmq_mqtt_fsm.erl"},{line,556}]},{vmq_mqtt_fsm,init,3,[{file,"/opt/vernemq/distdir/1.6.2/_build/default/lib/vmq_server/src/vmq_mqtt_fsm.erl"},{line,144}]},{vmq_mqtt_pre_init,data_in,2,[{file,"/opt/vernemq/distdir/1.6.2/_build/default/lib/vmq_server/src/vmq_mqtt_pre_init.erl"},{line,67}]},{vmq_ranch,handle_message,2,[{file,"/opt/vernemq/distdir/1.6.2/_build/default/lib/vmq_server/src/vmq_ranch.erl"},{line,174}]},{vmq_ranch,loop_,1,[{file,"/opt/vernemq/distdir/1.6.2/_build/default/lib/vmq_server/src/vmq_ranch.erl"},{line,117}]}]
    ancestors: [<0.332.0>,<0.331.0>,ranch_sup,<0.108.0>]
    message_queue_len: 0
    messages: []
    links: [<0.332.0>,#Port<0.6733>]
    dictionary: [{rand_seed,{#{jump => #Fun<rand.16.15449617>,max => 288230376151711743,next => #Fun<rand.15.15449617>,type => exsplus},[285222742060469046|74841631306680564]}},{max_msg_size,0}]
    trap_exit: true
    status: running
    heap_size: 1598
    stack_size: 27
    reductions: 1153
  neighbours:
2018-12-13 15:03:45 =ERROR REPORT====
Ranch listener {{0,0,0,0},1883} had connection process started with vmq_ranch:start_link/4 at <0.464.0> exit with reason: {badarg,[{ets,lookup,[vmq_status,ready],[]},{vmq_cluster,is_ready,0,[{file,"/opt/vernemq/distdir/1.6.2/_build/default/lib/vmq_server/src/vmq_cluster.erl"},{line,69}]},{vmq_reg,register_subscriber,4,[{file,"/opt/vernemq/distdir/1.6.2/_build/default/lib/vmq_server/src/vmq_reg.erl"},{line,121}]},{vmq_mqtt_fsm,check_user,2,[{file,"/opt/vernemq/distdir/1.6.2/_build/default/lib/vmq_server/src/vmq_mqtt_fsm.erl"},{line,556}]},{vmq_mqtt_fsm,init,3,[{file,"/opt/vernemq/distdir/1.6.2/_build/default/lib/vmq_server/src/vmq_mqtt_fsm.erl"},{line,144}]},{vmq_mqtt_pre_init,data_in,2,[{file,"/opt/vernemq/distdir/1.6.2/_build/default/lib/vmq_server/src/vmq_mqtt_pre_init.erl"},{line,67}]},{vmq_ranch,handle_message,2,[{file,"/opt/vernemq/distdir/1.6.2/_build/default/lib/vmq_server/src/vmq_ranch.erl"},{line,174}]},{vmq_ranch,loop_,1,[{file,"/opt/vernemq/distdir/1.6.2/_build/default/lib/vmq_server/src/vmq_ranch.erl"},{line,117}]}]}

Feel free to ask us for more information. Thanks!

@larshesel (Contributor)

If I understand it correctly, I think we'd just have to protect the ETS table lookup so this doesn't cause errors to be logged when shutting down the broker. But just for my further understanding: does this cause an operational problem for you?
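
Something along these lines, as a minimal sketch only (not the actual patch; the exact entry format in vmq_status and the return values of is_ready/0 are assumptions):

```erlang
%% Sketch only: guard the lookup so a missing vmq_status table (e.g. while
%% the broker is shutting down) reads as "not ready" instead of crashing the
%% calling session process with badarg.
is_ready() ->
    try ets:lookup(vmq_status, ready) of
        [{ready, true}] -> true;   %% assumed entry format
        _ -> false
    catch
        error:badarg -> false      %% table already deleted during shutdown
    end.
```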

@bestouff

(I'm also on the team.)
We count on VerneMQ to be the secure gateway between our cloud and our client boxes in the wild. Seeing it crash during operation, even if it's harmless in theory, is not reassuring.

@larshesel added this to the 1.6.3 milestone Dec 17, 2018
@larshesel added the bug label Dec 18, 2018
@dergraf (Contributor) commented Jan 7, 2019

@bestouff @m-manns is this in a clustered setup?

@bestouff commented Jan 7, 2019

No, for now there's a single broker.

@larshesel (Contributor)

@dergraf I think this makes sense: we don't shut down the listeners before tearing down the server processes.

I think the ideal solution would be to have the listeners (the client-facing ones; we probably don't want to include the internal ones?) be the last thing started when the broker starts and the first thing stopped when the broker shuts down.

I was wondering if we could split vmq_config into two and have the listener part be its own child of vmq_server_sup, so that it is the last to be started and the first to be stopped...
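
For illustration: OTP supervisors start their children in list order and terminate them in reverse order, so a dedicated listener supervisor placed last in the child list would give exactly that behaviour. The child names below are hypothetical and not the real vmq_server_sup tree:

```erlang
%% Hypothetical child layout, only to illustrate start/stop ordering:
%% children are started top-to-bottom and terminated bottom-to-top, so the
%% client-facing listener supervisor starts last and stops first.
init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    Children =
        [#{id => config_sup,   start => {config_sup,   start_link, []}, type => supervisor},
         #{id => session_sup,  start => {session_sup,  start_link, []}, type => supervisor},
         #{id => listener_sup, start => {listener_sup, start_link, []}, type => supervisor}],
    {ok, {SupFlags, Children}}.
```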

@larshesel (Contributor)

We did a partial fix for this by making sure to stop the MQTT listeners before starting the shutdown of the rest of VerneMQ. It's still possible for an active connection to trigger the error if it subscribes or unsubscribes while VerneMQ is shutting down, but that edge case should be handled once we move the sessions themselves into the supervision hierarchy together with the listeners.
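
Roughly this kind of ordering, sketched here only to illustrate the idea (this is not the actual change in #1027; the use of prep_stop/1 is an assumption, and the listener ref is taken from the Ranch listener name in the crash log):

```erlang
%% Sketch: stop the client-facing Ranch listener(s) in the application's
%% prep_stop/1 callback, so no new MQTT connections reach the session code
%% while the rest of the supervision tree is torn down.
prep_stop(State) ->
    Listeners = [{{0,0,0,0}, 1883}],   %% listener ref taken from the crash log
    lists:foreach(fun(Ref) -> catch ranch:stop_listener(Ref) end, Listeners),
    State.
```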

@bestouff commented Jan 9, 2019

OK, thanks. In which release will we be able to see it?

@larshesel (Contributor)

1.6.3 or 1.7.0.

@sriram05

@larshesel I am also experiencing a similar issue. When two broker nodes are running successfully behind HAProxy and I stop one of them (vernemq stop), all the client connections are refused and no further clients can establish a connection.

@larshesel (Contributor)

@sriram05 please open a separate issue describing the exact details of your problem.
