
Processing a received PUBCOMP after client reconnect crashes the session #762

Closed
dergraf opened this issue Jul 10, 2018 · 1 comment
dergraf commented Jul 10, 2018

Expected behavior

A client dies, but it has unacked PUBREL frames. Once the client reconnects, VerneMQ resends the PUBREL and expects a PUBCOMP. When VerneMQ processes the PUBCOMP, it removes the PUBREL from the "waiting acks".

Actual behavior

A client dies, but it has unacked PUBREL frames. Once the client reconnects, VerneMQ resends the PUBREL and expects a PUBCOMP. When VerneMQ processes the PUBCOMP, the session crashes.

This case should be covered by https://github.com/erlio/vernemq/blob/master/apps/vmq_server/test/vmq_publish_SUITE.erl#L177; however, the test case doesn't check that the session is actually still alive after sending the PUBCOMP.
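To make the expected bookkeeping concrete, here is a minimal sketch (in Python, not VerneMQ's actual Erlang code; all names are illustrative) of the server-side QoS 2 flow the issue describes: the session tracks each in-flight message in a "waiting acks" map, swaps the PUBLISH for a PUBREL on PUBREC, resends pending PUBRELs after a reconnect, and drops the entry when the PUBCOMP arrives.

```python
class Session:
    """Toy model of the broker-side QoS 2 bookkeeping (illustrative only)."""

    def __init__(self):
        self.waiting_acks = {}  # msg_id -> (frame kind, frame) still awaiting an ack

    def deliver_qos2(self, msg_id, publish_frame):
        # Step 1: send PUBLISH and remember it until a PUBREC arrives.
        self.waiting_acks[msg_id] = ("publish", publish_frame)

    def handle_pubrec(self, msg_id):
        # Step 2: replace the pending PUBLISH with a PUBREL and send it.
        self.waiting_acks[msg_id] = ("pubrel", msg_id)

    def handle_pubcomp(self, msg_id):
        # Step 3: the flow is complete; forget the PUBREL.
        # This is the step that crashed in the reported bug, because after a
        # reconnect the resent PUBREL was stored in a different (serialized) form.
        self.waiting_acks.pop(msg_id, None)

    def on_reconnect(self):
        # Resend every PUBREL the client never acknowledged.
        return [mid for mid, (kind, _) in self.waiting_acks.items()
                if kind == "pubrel"]
```

With this shape, a reconnect followed by a PUBCOMP simply empties the map; the session keeps running.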

The Crash log:

2018-07-10 09:48:32 =CRASH REPORT====
  crasher:
    initial call: vmq_ranch:init/5
    pid: <0.3415.0>
    registered_name: []
    exception error:
      {{case_clause,<<98,2,0,1>>},
       [{vmq_mqtt_fsm,connected,2,[{file,"/home/graf/Projects/vernemq/_build/test/lib/vmq_server/src/vmq_mqtt_fsm.erl"},{line,350}]},
        {vmq_mqtt_fsm,in,3,[{file,"/home/graf/Projects/vernemq/_build/test/lib/vmq_server/src/vmq_mqtt_fsm.erl"},{line,168}]},
        {vmq_mqtt_fsm,data_in,3,[{file,"/home/graf/Projects/vernemq/_build/test/lib/vmq_server/src/vmq_mqtt_fsm.erl"},{line,142}]},
        {vmq_ranch,handle_message,2,[{file,"/home/graf/Projects/vernemq/_build/test/lib/vmq_server/src/vmq_ranch.erl"},{line,174}]},
        {vmq_ranch,loop_,1,[{file,"/home/graf/Projects/vernemq/_build/test/lib/vmq_server/src/vmq_ranch.erl"},{line,117}]},
        {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
    ancestors: [<0.3385.0>,<0.3384.0>,ranch_sup,<0.3125.0>]
    message_queue_len: 0
    messages: []
    links: [<0.3385.0>,#Port<0.332260>]
    dictionary: [{mqtt_pubrel_sent,#Ref<0.1645208890.790233092.75259>},{mqtt_publish_sent,#Ref<0.1645208890.790233092.75256>},{mqtt_connect_received,#Ref<0.1645208890.790233092.75240>},{max_msg_size,0},{bytes_sent,#Ref<0.1645208890.790233092.75239>},{bytes_received,#Ref<0.1645208890.790233092.75238>},{rand_seed,{#{jump => #Fun<rand.16.15449617>,max => 288230376151711743,next => #Fun<rand.15.15449617>,type => exsplus},[215213621333760556|442190220450265]}},{mqtt_connack_accepted_sent,#Ref<0.1645208890.790233092.75250>},{socket_open,#Ref<0.1645208890.790233092.75235>},{mqtt_pubcomp_received,#Ref<0.1645208890.790233092.75245>}]
    trap_exit: true
    status: running
    heap_size: 1598
    stack_size: 27
    reductions: 1627
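The binary in the `case_clause` error, `<<98,2,0,1>>`, is telling: it decodes as a serialized MQTT PUBREL frame (packet type 6, flags `0010`, remaining length 2, message id 1), so `vmq_mqtt_fsm:connected/2` was handed raw bytes where its clauses expected a parsed message term. A quick decode (Python sketch, standard MQTT 3.1.1 fixed-header layout):

```python
def decode_pubrel(frame: bytes):
    """Decode the 4-byte PUBREL frame from the crash log's case_clause value."""
    packet_type = frame[0] >> 4          # 98 = 0x62 -> high nibble 6 = PUBREL
    flags = frame[0] & 0x0F              # 0x2, the fixed PUBREL flag bits
    remaining_len = frame[1]             # 2 bytes of variable header follow
    msg_id = int.from_bytes(frame[2:4], "big")
    return packet_type, flags, remaining_len, msg_id

print(decode_pubrel(bytes([98, 2, 0, 1])))  # -> (6, 2, 2, 1)
```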
@dergraf dergraf changed the title A QoS2 publish (from broker to cluster) Processing a received PUBCOMP after client reconnect crashes the session Jul 10, 2018
dergraf commented Jul 10, 2018

Fixed in PR #763

@dergraf dergraf closed this as completed Jul 10, 2018
dergraf added a commit to dergraf/vernemq that referenced this issue Jul 23, 2018
- Store the *parsed* PUBREL inside the waiting acks instead of the
serialized binary. This lets the same code handle retried PUBRELs
regardless of whether the retry was triggered by a timeout or by a
reconnect. Clean up the places that relied on the serialized binary.
- Fixing the above enabled a small performance improvement: the
`length()` call on the outgoing batch of frames, used to increment the
`mqtt_publishes_sent` counter, is no longer needed.
- Improve `vmq_publish_SUITE.erl` to test that a client responds to
PINGREQs before closing the socket. This validates, first, that the
client hasn't crashed in the meantime (e.g. because it couldn't handle
the PUBCOMP, see bug vernemq#762), and second, that the last frame
received on this socket is the PINGRESP.
- The improvement above uncovered an edge case where a PUBREL frame was
retried too quickly, indicating a bug in the retry mechanism. This has
been fixed by tagging the message IDs inside the retry queue, so the
retry mechanism can differentiate between retrying a PUBLISH and
retrying a PUBREL. Without this fix, a retry for an already-acked
PUBLISH would have retried an unacked PUBREL right away.
- Remove several function clauses that were forgotten when fixing vernemq#750.
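The tagging idea in the retry-queue fix can be sketched as follows (a minimal Python illustration, not VerneMQ's Erlang implementation; the names and queue shape are assumptions). Each retry entry carries the frame kind alongside the message id, so a timer firing for an already-acked PUBLISH cannot be confused with, and prematurely trigger, a PUBREL retry for the same id:

```python
from collections import deque

def on_retry_timeout(queue, acked_publishes):
    """Drain the retry queue, dropping PUBLISH entries that were already acked.

    Entries are (kind, msg_id) tuples; tagging with `kind` is what lets us
    tell a PUBLISH retry apart from a PUBREL retry for the same msg_id.
    """
    to_retry = []
    while queue:
        kind, msg_id = queue.popleft()
        if kind == "publish" and msg_id in acked_publishes:
            continue  # already acked: drop silently, do not retry anything
        to_retry.append((kind, msg_id))
    return to_retry

# Message 1 awaits a PUBREC, message 2 awaits a PUBCOMP.
retry_queue = deque([("publish", 1), ("pubrel", 2)])
print(on_retry_timeout(retry_queue, acked_publishes={1}))  # [('pubrel', 2)]
```

Without the kind tag, an entry for message id 1 could match the wrong in-flight frame and fire an immediate, unwanted PUBREL retry.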
dergraf added a commit that referenced this issue Jul 23, 2018
…ry mechanism (#763)

* Fix QoS2 retry for reconnecting clients, and retry mechanism improvement

* rename deliver_bin to deliver_pubrel
dergraf added a commit to dergraf/vernemq that referenced this issue Aug 21, 2018
…ry mechanism (vernemq#763)
dergraf added a commit that referenced this issue Aug 21, 2018
…ry mechanism (#763)