Puzzle: Sometimes Retained messages seem to be skipped #1633
We need to check the message_queue_len of the Retain SRV process mailbox, and if this is in fact a growing mailbox, explore options to introduce concurrency.
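For reference, a minimal sketch of such a mailbox check from an attached Erlang shell (assuming the retain server is registered locally as vmq_retain_srv; the registered name is an assumption):

```erlang
%% inspect the mailbox size of the retain server process
case whereis(vmq_retain_srv) of
    undefined ->
        io:format("retain server not running~n");
    Pid ->
        {message_queue_len, Len} = erlang:process_info(Pid, message_queue_len),
        io:format("retain srv mailbox: ~p messages~n", [Len])
end.
```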
OK, the above (mailbox overload) is confirmed, while disk latency seems okay.
Suppose we
I have seen this retain issue on a topic that is rarely updated (2-3 times per day), though the broker cluster is busy processing 10-15kpps. It is difficult to detect on topics that are updated frequently, so it may be happening on others.
We have been noticing these "skips" during message bursts. The last example: three QoS 1 messages published by the same client to the same topic within a very short window (a few milliseconds). The messages went through in the order they were published, but when the topic was subscribed to afterwards, we got the second/middle one (not the third/last as expected). My guess is that the root cause is that multiple threads can handle messages published by a single client (to a single topic), so retaining those messages may finish in a different (unexpected) order. We also have a lot of retained topics/messages (~100k). Some change all the time (many times per second) and others only once per day or so.
@coder4803 |
Regardless of the fix, some additional Prometheus metrics around what's happening with the retain server would be useful. We are already graphing message counts and bytes, but this doesn't give any insight into an overload situation. It would be great to see CRUD metrics as well as the size of the backlog for the retain server, to aid troubleshooting/alerting. cheers
@elduds @coder4803 there are metrics (timings) for the metadata layer, in case you use SWC. These only mirror the storage-level performance of those messages, and they cover all metadata in general. I think the long-term solution to this is integration with OpenTelemetry (already working on it), which will improve the observability of the Retain SRV. It will obviously not resolve the overload issues; we still need to find a solution/refactoring for those.
I tested the Retain server a little bit over the weekend. I only did this with a couple hundred topics, but with a high update frequency.
Hi everybody, I discovered that the reason for this is a race condition in vmq_retain_srv. What basically happens is that an insert and a subsequent delete on the same topic can race each other in the periodic persist step, so the delete can be lost and a stale retained message survives.
Hope this helps. |
@ico77 thanks for your analysis. I've seen a couple of race conditions between memory and stores, so I'm certainly going to check your hypothesis. I was under the impression that the 1-second persister process was explicitly done to address issues like that, but we'll see. EDIT: another question is whether this is an insert/delete race, or also an insert/insert race.
No, sadly I haven't proved it by modifying code. I came to the conclusion described above by activating tracing on vmq_retain_srv:insert, vmq_retain_srv:delete and vmq_retain_srv:handle_info function calls. The traces revealed that some retained messages were processed twice and some were never processed. |
@ico77 okay, well that seems to validate your analysis. |
@ico77 one additional point: did you see the race condition on a single node only, in a cluster only, or in both cases?
I was using a single node while tracing, didn't try it with a cluster. |
I managed to find the file where I saved an excerpt from the trace output. This trace was made under high load (probably around 10K msg/s). For each randomly generated retained message a random topic was generated, and an empty-payload message was sent to the same topic 1 second later. There is a chance that the same topic was generated for multiple retained messages, in which case the trace output cannot, with 100% certainty, lead to the conclusion I described. But the chance of that happening was ~1:60,000,000 during the test, so I guess the trace can be used to point us in the right direction. I saw a lot of output like this in the trace: 1 insert, 1 delete, and then the 2 callbacks. The 2 callbacks are interesting: they show that handle_info(updated) was called twice, and in the second call the retained message was updated with the same value, which should be impossible with the test data I generated. The second callback should have been a handle_info(deleted).
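For illustration, the publish pattern described above could be generated roughly like this (a hedged sketch using the emqtt client; topic scheme, client options and timings are assumptions, not the actual test code):

```erlang
%% publish a retained message on a random topic, then "delete" it one second
%% later by publishing an empty retained payload to the same topic
{ok, C} = emqtt:start_link([{host, "localhost"}, {clientid, <<"retain-gen">>}]),
{ok, _} = emqtt:connect(C),
Topic = list_to_binary(io_lib:format("retain/test/~p", [rand:uniform(60000000)])),
{ok, _} = emqtt:publish(C, Topic, <<"some-payload">>, [{qos, 1}, {retain, true}]),
timer:sleep(1000),
{ok, _} = emqtt:publish(C, Topic, <<>>, [{qos, 1}, {retain, true}]),
ok = emqtt:disconnect(C).
```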
This would explain exactly the behaviour I have seen. It is not as if there is some giant backlog where retained messages eventually update to a correct value from some time in the past, albeit minutes or hours late; I am seeing that some topics are never updated with the most recent value. This is especially noticeable on one topic whose publish pattern is roughly: every 8 hours, publish a payload of v:1, then reset it to v:0 about 2 seconds later.
Very often the retained message on that topic has a payload of v:1, where that should only ever be the case for 2 seconds every 8 hours or so; until the next cycle it never updates, and it's a crapshoot whether the next cycle will reset it back to 0 or not. Would it be useful for me to get some trace output similar to @ico77's? How might I do this? thanks
@elduds you can enable tracing from an attached Erlang shell on the broker node. The trace will be visible in the erlang.log files. When you are finished, stop the tracer and clear the trace patterns.
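A minimal sketch using OTP's standard dbg module, tracing the three vmq_retain_srv functions mentioned earlier (assuming shell access via an attached console; the exact trace flags are a suggestion, not the original commands):

```erlang
%% start a tracer and enable call tracing on all processes
dbg:tracer().
dbg:p(all, call).
%% trace the retain server entry points, including return values
dbg:tpl(vmq_retain_srv, insert, x).
dbg:tpl(vmq_retain_srv, delete, x).
dbg:tpl(vmq_retain_srv, handle_info, x).

%% when finished: stop the tracer and clear all trace patterns
dbg:stop_clear().
```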
@ico77 @elduds Something like that might help, but I doubt you'll directly see the issue happening. You'd have to trace at the point where the second retain gets set.
OK, these brokers are doing c. 7-10kpps, so it is unlikely I will catch this. I may do some experimentation in development to see if it can be triggered without high load.
When this was originally implemented, we seem to have been aware of racing, and accepted double message inserts (of the same message) as a tradeoff: fe493dd. The thing is, we need to be subscribed to events from the RETAIN_DB, as those events can come from other nodes (and then they need to be added to the cache too).
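For context, the shape of such an event subscription looks roughly like this (a hedged sketch, not the actual vmq_retain_srv code; the updated/deleted event names follow the trace discussion earlier in the thread, everything else is made up):

```erlang
-module(retain_cache_sketch).
-behaviour(gen_server).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

init([]) ->
    %% local read cache for retained messages
    {ok, ets:new(retain_cache, [set])}.

handle_call(_Req, _From, Cache) -> {reply, ok, Cache}.
handle_cast(_Msg, Cache) -> {noreply, Cache}.

%% events from the replicated store: inserts/deletes that may originate on
%% other nodes must be applied to the local cache as well. This second write
%% path into the cache is what can race with local inserts/deletes.
handle_info({updated, Topic, Msg}, Cache) ->
    ets:insert(Cache, {Topic, Msg}),
    {noreply, Cache};
handle_info({deleted, Topic}, Cache) ->
    ets:delete(Cache, Topic),
    {noreply, Cache}.
```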
@elduds do you ever do explicit retain 'deletes', that is sending a retain message with payload <<>> (empty payload)? |
No we are never deleting retained messages in this way |
@elduds thanks for confirming. Currently reviewing better approaches. And maybe we don't need the cache for local reads at all.
This fold is suspicious: even if ets:update_counter is atomic, the foldl iterates over global state, which might interleave with concurrent updates. In other words, the original explanation of why this is safe would then be wrong.
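To illustrate the interleaving concern, a self-contained sketch (this is not the actual vmq_retain_srv fold, just the same shape):

```erlang
-module(fold_race_sketch).
-export([demo/0]).

%% Each individual ETS operation is atomic, but a fold visits the entries at
%% different points in time, so a concurrent writer interleaves with it and
%% the fold's "snapshot" mixes old and new state.
demo() ->
    Tab = ets:new(cache, [public, set, {write_concurrency, true}]),
    ets:insert(Tab, {<<"some/topic">>, 0}),
    spawn(fun() ->
              [ets:update_counter(Tab, <<"some/topic">>, {2, 1})
               || _ <- lists:seq(1, 1000)]
          end),
    %% the value observed here depends on when the fold reaches the entry
    ets:foldl(fun(Entry, Acc) -> [Entry | Acc] end, [], Tab).
```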
Ideally, I'd like to have this configurable (the number of retained messages to keep indexed in RAM). Within the retain server, we still need to subscribe to events from the on-disk store.
Currently looking into a couple of options and also different Erlang cache implementations. The tricky thing is to come up with an alternative that does not replace the existing approach with some other evil :) Most likely, anything with 'read+update' operations on the cache path is not concurrency safe with ETS.
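A quick self-contained illustration of that last point (a hedged sketch; none of this is VerneMQ code):

```erlang
-module(ets_rmw_sketch).
-export([demo/0]).

%% lookup+insert is a read-modify-write: each call is atomic on its own,
%% but the pair is not, so concurrent writers can lose updates.
bump(Tab, Key) ->
    [{Key, V}] = ets:lookup(Tab, Key),
    ets:insert(Tab, {Key, V + 1}).

demo() ->
    Tab = ets:new(t, [public, set, {write_concurrency, true}]),
    ets:insert(Tab, {counter, 0}),
    %% two concurrent bumpers can both read the same value and both write
    %% value+1, losing one increment
    [spawn(fun() -> [bump(Tab, counter) || _ <- lists:seq(1, 1000)] end)
     || _ <- [a, b]],
    timer:sleep(200),
    ets:lookup(Tab, counter).   %% often less than [{counter, 2000}]
```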
Have you made any progress? A "fixes" tag was added in January but the ticket is still open. We are soon running into a situation where we cannot tolerate those retained-message skips, and we have to find some solution to this.
@coder4803 do you have the same use case as @elduds and/or @ico77?
Hello @ioolkos, I think my first message describes our problem well (and yes, it is quite similar to the cases described by @elduds and @ico77, although we basically never delete a retained message). As I wrote earlier: if the subscription is done before the publish, all messages go through as expected. But if the topic is subscribed to afterwards, we may get an older message (not the latest). This seems to happen at least with message bursts to the same topic.
We faced a similar issue and can reproduce it using a simple setup. Details are as follows.

Environment

Test
There are two clients: client A publishes retained messages, client B subscribes to the target topic and receives the corresponding messages.
Client A: Publish message [2,2,2,2,2,2]

Actual behavior
The test program fails after some time. The number of iterations before the failure depends on the broker configuration and the test timing.

Expected behavior
The program does not fail after several days of execution, even if the delay between the "Client B: Receive message" and "Client B: Disconnect" events is 100ms or less.

Notes
Sometimes client B cannot receive the first message, so it can be necessary to restart the program. The test program is attached.
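The attached program isn't included above, but the described sequence could be sketched with the Erlang emqtt client roughly as follows (module name, client IDs, topic and timeouts are assumptions, not the actual test code):

```erlang
-module(retain_repro_sketch).
-export([run/0]).

run() -> loop(1).

loop(N) ->
    Topic = <<"retain/race">>,
    Expected = integer_to_binary(N),
    %% Client A: publish a retained message with a fresh payload
    {ok, A} = emqtt:start_link([{clientid, <<"client-a">>}]),
    {ok, _} = emqtt:connect(A),
    {ok, _} = emqtt:publish(A, Topic, Expected, [{qos, 1}, {retain, true}]),
    %% Client B: connect with a clean session, subscribe, and expect to be
    %% served the latest retained payload
    {ok, B} = emqtt:start_link([{clientid, <<"client-b">>}, {clean_start, true}]),
    {ok, _} = emqtt:connect(B),
    {ok, _, _} = emqtt:subscribe(B, {Topic, 1}),
    receive
        {publish, #{payload := Expected}} -> ok;
        {publish, #{payload := Stale}} -> exit({stale_retained_message, Stale})
    after 5000 ->
        exit(no_retained_message)
    end,
    ok = emqtt:disconnect(B),
    ok = emqtt:disconnect(A),
    loop(N + 1).
```

On a failing iteration, client B receives a stale retained payload instead of the latest one.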
@nikolchina Thanks, what do you mean by "fails after some time"? How exactly? (The original issue was fixed, so this must be different.)
I am afraid the fix did not fully solve the issue. We have been using 1.12.5 and it still seems to occur from time to time. If multiple messages (usually two in our case) are published to the same topic in a short time window, the latest message is not always the one retained on the broker (but one of the earlier messages is). Messages are otherwise delivered correctly (online clients receive them as expected). I did not run @nikolchina's test application, but the described sequence is exactly what should trigger the problem eventually. If you keep the test application running for a while, the problem should occur. If not, I believe adding some background load on the broker would help (like 10k/sec retained publishes to different topics).
I was involved in this investigation a bit, so I can comment.
We actually see that an unexpected retained message is read by a client on reconnect. This happens after several or many iterations of the test. The exact test is described in @nikolchina's comment. The number of iterations before the failure depends on the broker configuration and the test program. If there is a cluster with two nodes and the involved clients are connected to different nodes, the failure can be observed within 20 seconds or so. If a single node is used instead, you may need to wait minutes or hours before the next occurrence. Timing is important in this case, so it is better to start VerneMQ and the test program on the same machine.
I had a quick look into this, as it might hit us as well. I was not able to reproduce the "local" case, not even with some load; I had the example code running on a VM for over two weeks without any issues. The distributed case is easily explained: the "publish" subsystem and the "retain" subsystem are independent and currently do not "share" any information. The retain system is eventually consistent, so timing issues are possible. In a two-node scenario with messages A and B, the following happens: retained message "A" is published to Node 1, then retained message "B" is published to Node 1; a subscriber connecting to Node 2 right afterwards is still served "A" from Node 2's local retain store, because "B" has not yet been synchronized. This can only be observed in scenarios where a retained message is published at some point in time and the subscriber subscribes just at the moment right after. This is a general problem of lock-free distributed systems; in an n-node scenario one could create even crazier scenarios. What makes it look strange is that the publish system is faster than the retain system. So a couple of things could be done in this particular scenario; one option (option 2 below) is to wait briefly after a subscribe so the retain store can catch up.
I tried to play with option 2; it seems possible but would have some drawbacks.
@mths1 as you correctly state, Retained messages are part of the metadata layer in VerneMQ, which is eventually consistent. We deemed this acceptable for the reason that with high-frequency publishes, the "value" of a Retained message for a new subscriber is very short-lived: it will immediately receive normal messages, that is, "updates" to the original Retained message. We could, of course, explore fully consistent alternatives. Ra and Khepri (https://github.com/rabbitmq/khepri) come to mind.
Hello @ioolkos, do I understand correctly that "eventually consistent" is a property of the Retained-messages layer by design, and that the existing plugins vmq_plumtree and vmq_swc do not change this behavior, they just make it more stable and less buggy? Can the backend DB in use solve or minimize this issue?
@revikk yes, all synchronisation of metadata state (including Retain messages) is eventually consistent. This does not mean "slow" synchronisation, but it means it can be beaten by wall-clock observation, that is, by considering all your components under linear time. Consider the following observation fallacy, where a publisher is on node A, and a subscriber on node B:

1. The publisher publishes a new Retained message to node A.
2. A moment later (say 1 ms), the subscriber subscribes on node B and is served the previous Retained message from node B's local store, because the new value has not yet been synchronised.
Considerations about the decoupled nature of Pub/Sub led us to reason that eventual consistency is acceptable for retained messages. A consistent implementation would be possible with the usual protocols (Raft, etc.)
@ioolkos I disagree about your definition of acceptable: does it mean that VerneMQ is an acceptable MQTT broker implementation just in case of high frequency publishes? And therefore it is not suitable for not-so-frequent publishes use cases, such as @revikk's one and many others? BTW, in case of high frequency publishes, you could avoid to use retained messages at all: just wait for a while, and subscriber will receive live messages. So probably it could be acceptable not to implement support for retained messages at all, isn't it? The decoupled nature of Pub/Sub has nothing to do with the fact that a subscriber turns out to be not aligned with the status of a topic which holds a retained message. In the case 2, there is no reason to be unhappy about subscriber receiving old Retain value, as long as it will then receive also the message (as live one) published 1 ms before its subscription. But if it receives just the old Retain value while any another subscriber subscribing a few seconds later receive the latest (of course), this is not good. Just to summarize, (if I correctly understood) your conclusion is: a subscriber shall not rely on a message received as retained, as it may not be the latest published on the subscribed topic (that it will never receive). Another point not clear to me is: does this issue affect only multi-node scenarios, or also single-node? |
@aoiremax Thanks, disagreeing is perfectly fine; the question is a little bit what to do with it. Should you be working on concrete experiments/implementations, let me know.
@aoiremax It affects only multi-node scenarios. I was unable to reproduce this in a single-node environment (and a code review didn't suggest otherwise). I have played with some mitigation ideas, and the one that worked best without huge changes in the code base is what I would call "delayed registration". It waits n times the cycle time for new retained messages and only then continues with adding the subscriber. Messages that arrive in between are still part of the online queue and will be delivered (as the delay happens after the queue has already been initialized). I did not fully test it, and there are some downsides in the current implementation (such as the retained message possibly not being the first message the subscriber gets), but if it is something people are interested in, I could continue and prepare a pull request.
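For illustration only, the rough shape of that mitigation might look like this (a hedged sketch, not the actual branch code; all names are made up):

```erlang
-module(delayed_retain_sketch).
-export([deliver_retained_delayed/4]).

%% "delayed registration": give the retain subsystem a few sync cycles to
%% catch up before serving the retained message to a new subscriber. The
%% subscriber's queue is attached beforehand, so live messages arriving
%% during the wait are queued, not lost.
deliver_retained_delayed(SubscriberPid, Topic, SyncCycleMs, NumCycles) ->
    timer:sleep(NumCycles * SyncCycleMs),
    %% hypothetical local read of the (by now hopefully synced) retain store
    case lookup_retained(Topic) of
        undefined -> ok;
        Msg -> SubscriberPid ! {deliver_retained, Topic, Msg}, ok
    end.

%% stub standing in for a local read of the retain store
lookup_retained(_Topic) -> undefined.
```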
Here is the branch in case someone wants to give it a try (poorly tested): https://github.com/mths1/vernemq/tree/feature/delayed_retained. It will "ensure" (based on time) that the latest retained message is shown and no message loss occurs. You might receive the retained message twice (once as retained and once as the actual message that arrived before the retained one did).
@mths1 thanks, will take a look at it!
I think we are having issues with this as well. A case where this specifically harms us is when we subscribe very close to the very first retained message sent to a topic. In some cases we simply never get anything at all and have to work around it by resubscribing after some set amount of time without getting a message.
@nikolchina If a new subscriber connects to a specific cluster node and issues a subscription to the topic, the retain store will do a local read and deliver the retained message with the retain flag set to 1.
@nikolchina:
Hi @mths1,
Hi @nikolchina,
Hi @mths1,
@nikolchina: This typically happens when the VerneMQ config is not correct. Running 'vernemq console' should give more information. Maybe you can also share your config file.
@mths1, thanks, there were some parameters in the old config which are now out of date. If the publisher sends some messages while the subscriber is connecting, one of the messages will be delivered as retained, and all of them will be delivered in sequence after the retained message, as represented in the diagram.
@nikolchina: Great that it works for you! Btw, of course your code is somewhat artificial: your publisher publishes very frequently, and thus you see m4 (retained), m3 (because m4 came within the sync period as a new retained message), m4 (again from the publisher) and then m5. The typical use case I aimed for is a sensor that sends data very infrequently, like once a day (and you do not want, or cannot have, persistent sessions). In mid- to high-frequency use cases you typically won't use retained messages (or the very slim chance of getting an old retained message does not matter), and if every message counts then persistent sessions are typically better suited. So typically you would just get m4, or in some special situations m4, m4 (the retain server currently does not have a message id for deduplication, so we cannot tell if the second m4 is actually the same message or another message with the same content).
@mths1,
Hello @nikolchina: You can set the parameter on all nodes to syncwait. There is no strong requirement to do this only on subscriber nodes.
A user has sometimes seen Retained messages "skipped". The cluster will deliver the message on the normal subscription path, but for that topic, the cluster will still show the second-last retained message.
Looking for hypotheses/code paths for this "puzzle". As the Retained message is missing on all cluster nodes, this doesn't seem to be related to synchronization (but that's not excluded).
Look into: Retain Cache, Retain SRV, Overload, State Synchronisation (SWC)