
Processing a received PUBCOMP after client reconnect crashes the session #762

Closed
dergraf opened this issue Jul 10, 2018 · 1 comment
dergraf commented Jul 10, 2018

Expected behavior

A client dies, but it has unacked PUBREL frames. Once the client reconnects, VerneMQ resends the PUBREL and expects a PUBCOMP. When VerneMQ processes the PUBCOMP, it removes the PUBREL from the "waiting acks".

Actual behavior

A client dies, but it has unacked PUBREL frames. Once the client reconnects, VerneMQ resends the PUBREL and expects a PUBCOMP. When VerneMQ processes the PUBCOMP, the session crashes.

This case should be covered by https://github.com/erlio/vernemq/blob/master/apps/vmq_server/test/vmq_publish_SUITE.erl#L177; however, the test case doesn't check that the session is actually still alive after sending the PUBCOMP.
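To make the expected bookkeeping concrete, here is a minimal sketch (in Python, not VerneMQ's actual Erlang code; all names are illustrative) of the server-side QoS 2 flow the issue describes: the session tracks each in-flight message in a "waiting acks" map, swaps the PUBLISH for a PUBREL on PUBREC, resends pending PUBRELs after a reconnect, and drops the entry when the PUBCOMP arrives.

```python
class Session:
    """Toy model of the broker-side QoS 2 bookkeeping (illustrative only)."""

    def __init__(self):
        self.waiting_acks = {}  # msg_id -> (frame kind, frame) still awaiting an ack

    def deliver_qos2(self, msg_id, publish_frame):
        # Step 1: send PUBLISH and remember it until a PUBREC arrives.
        self.waiting_acks[msg_id] = ("publish", publish_frame)

    def handle_pubrec(self, msg_id):
        # Step 2: replace the pending PUBLISH with a PUBREL and send it.
        self.waiting_acks[msg_id] = ("pubrel", msg_id)

    def handle_pubcomp(self, msg_id):
        # Step 3: the flow is complete; forget the PUBREL.
        # This is the step that crashed in the reported bug, because after a
        # reconnect the resent PUBREL was stored in a different (serialized) form.
        self.waiting_acks.pop(msg_id, None)

    def on_reconnect(self):
        # Resend every PUBREL the client never acknowledged.
        return [mid for mid, (kind, _) in self.waiting_acks.items()
                if kind == "pubrel"]
```

With this shape, a reconnect followed by a PUBCOMP simply empties the map; the session keeps running.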

The Crash log:

2018-07-10 09:48:32 =CRASH REPORT====
  crasher:
    initial call: vmq_ranch:init/5
    pid: <0.3415.0>
    registered_name: []
    exception error:
      {{case_clause,<<98,2,0,1>>},
       [{vmq_mqtt_fsm,connected,2,[{file,"/home/graf/Projects/vernemq/_build/test/lib/vmq_server/src/vmq_mqtt_fsm.erl"},{line,350}]},
        {vmq_mqtt_fsm,in,3,[{file,"/home/graf/Projects/vernemq/_build/test/lib/vmq_server/src/vmq_mqtt_fsm.erl"},{line,168}]},
        {vmq_mqtt_fsm,data_in,3,[{file,"/home/graf/Projects/vernemq/_build/test/lib/vmq_server/src/vmq_mqtt_fsm.erl"},{line,142}]},
        {vmq_ranch,handle_message,2,[{file,"/home/graf/Projects/vernemq/_build/test/lib/vmq_server/src/vmq_ranch.erl"},{line,174}]},
        {vmq_ranch,loop_,1,[{file,"/home/graf/Projects/vernemq/_build/test/lib/vmq_server/src/vmq_ranch.erl"},{line,117}]},
        {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
    ancestors: [<0.3385.0>,<0.3384.0>,ranch_sup,<0.3125.0>]
    message_queue_len: 0
    messages: []
    links: [<0.3385.0>,#Port<0.332260>]
    dictionary: [{mqtt_pubrel_sent,#Ref<0.1645208890.790233092.75259>},{mqtt_publish_sent,#Ref<0.1645208890.790233092.75256>},{mqtt_connect_received,#Ref<0.1645208890.790233092.75240>},{max_msg_size,0},{bytes_sent,#Ref<0.1645208890.790233092.75239>},{bytes_received,#Ref<0.1645208890.790233092.75238>},{rand_seed,{#{jump => #Fun<rand.16.15449617>,max => 288230376151711743,next => #Fun<rand.15.15449617>,type => exsplus},[215213621333760556|442190220450265]}},{mqtt_connack_accepted_sent,#Ref<0.1645208890.790233092.75250>},{socket_open,#Ref<0.1645208890.790233092.75235>},{mqtt_pubcomp_received,#Ref<0.1645208890.790233092.75245>}]
    trap_exit: true
    status: running
    heap_size: 1598
    stack_size: 27
    reductions: 1627
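The binary in the `case_clause` error, `<<98,2,0,1>>`, is telling: it decodes as a serialized MQTT PUBREL frame (packet type 6, flags `0010`, remaining length 2, message id 1), so `vmq_mqtt_fsm:connected/2` was handed raw bytes where its clauses expected a parsed message term. A quick decode (Python sketch, standard MQTT 3.1.1 fixed-header layout):

```python
def decode_pubrel(frame: bytes):
    """Decode the 4-byte PUBREL frame from the crash log's case_clause value."""
    packet_type = frame[0] >> 4          # 98 = 0x62 -> high nibble 6 = PUBREL
    flags = frame[0] & 0x0F              # 0x2, the fixed PUBREL flag bits
    remaining_len = frame[1]             # 2 bytes of variable header follow
    msg_id = int.from_bytes(frame[2:4], "big")
    return packet_type, flags, remaining_len, msg_id

print(decode_pubrel(bytes([98, 2, 0, 1])))  # -> (6, 2, 2, 1)
```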
@dergraf dergraf changed the title A QoS2 publish (from broker to cluster) Processing a received PUBCOMP after client reconnect crashes the session Jul 10, 2018
dergraf commented Jul 10, 2018

Fixed in PR #763

@dergraf dergraf closed this as completed Jul 10, 2018
dergraf added a commit to dergraf/vernemq that referenced this issue Jul 23, 2018
- Store the *parsed* PUBREL inside the waiting acks instead of the
serialized binary. This lets the same code handle retried PUBRELs
regardless of whether the retry was triggered by a timeout or by a
reconnect. Clean up the places that relied on the serialized binary.
- Fixing the above enabled a small performance improvement: the
`length()` call on the outgoing batch of frames, used to increment the
`mqtt_publishes_sent` counter, is no longer needed.
- Improve `vmq_publish_SUITE.erl` to test that a client responds to
PINGREQs before closing the socket. This validates, first, that the
client hasn't crashed in the meantime (e.g. because it couldn't handle
the PUBCOMP, see bug vernemq#762), and second, that the last frame
received on this socket is the PINGRESP.
- The improvement above uncovered an edge case where a PUBREL frame was
retried too quickly, indicating a bug in the retry mechanism. This has
been fixed by tagging the message IDs inside the retry queue, so the
retry mechanism can differentiate between retrying a PUBLISH and
retrying a PUBREL. Without this fix, a retry for an already-acked
PUBLISH would have retried an unacked PUBREL right away.
- Remove several function clauses that were forgotten when fixing vernemq#750.
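The tagging idea in the retry-queue fix can be sketched as follows (a minimal Python illustration, not VerneMQ's Erlang implementation; the names and queue shape are assumptions). Each retry entry carries the frame kind alongside the message id, so a timer firing for an already-acked PUBLISH cannot be confused with, and prematurely trigger, a PUBREL retry for the same id:

```python
from collections import deque

def on_retry_timeout(queue, acked_publishes):
    """Drain the retry queue, dropping PUBLISH entries that were already acked.

    Entries are (kind, msg_id) tuples; tagging with `kind` is what lets us
    tell a PUBLISH retry apart from a PUBREL retry for the same msg_id.
    """
    to_retry = []
    while queue:
        kind, msg_id = queue.popleft()
        if kind == "publish" and msg_id in acked_publishes:
            continue  # already acked: drop silently, do not retry anything
        to_retry.append((kind, msg_id))
    return to_retry

# Message 1 awaits a PUBREC, message 2 awaits a PUBCOMP.
retry_queue = deque([("publish", 1), ("pubrel", 2)])
print(on_retry_timeout(retry_queue, acked_publishes={1}))  # [('pubrel', 2)]
```

Without the kind tag, an entry for message id 1 could match the wrong in-flight frame and fire an immediate, unwanted PUBREL retry.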
dergraf added a commit that referenced this issue Jul 23, 2018
…ry mechanism (#763)

* Fix QoS2 retry for reconnecting clients, and retry mechanism improvement

* rename deliver_bin to deliver_pubrel
dergraf added a commit to dergraf/vernemq that referenced this issue Aug 21, 2018
…ry mechanism (vernemq#763)
dergraf added a commit that referenced this issue Aug 21, 2018
…ry mechanism (#763)