Cannot get retained message from other node in cluster mode #507
Comments
@edwinAtWiz thanks for reporting! So the situation is this: We have a 2-node VerneMQ cluster (A, B).
Yes, this seems weird. What could possibly interfere here is queue migration, i.e. node A moving 1000 live queues to node B at the same moment.
@ioolkos Kind of. On Node C, I just wrote a simple for-loop to publish 1000 messages (test/1, test/2 ... test/1000) to Node A. On Node A, the webhook plugin also uses the MQTT.js module to publish each message locally again as retained (save/1, save/2 ... save/1000). On Node A, all retained messages are received when subscribing to the save/# topic, but I cannot get all of the messages from Node B. I have also tested publishing the retained message directly from Node C to Node A with topic save/1, and the result is the same as in the test described above. The message loss can be mitigated if I limit the request rate (e.g. lower the req./s).
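For readers following along, the workload described above can be sketched roughly as follows. This is only an illustration of the topic mapping (test/N re-published as retained save/N), not the reporter's actual webhook code; the helper names `toRetainedTopic` and `buildWorkload` are hypothetical, and no broker connection is made.

```javascript
// Hypothetical helper: maps an incoming topic such as "test/42" to the
// retained topic "save/42", mirroring the re-publish step described above.
function toRetainedTopic(topic) {
  const [prefix, ...rest] = topic.split("/");
  if (prefix !== "test" || rest.length === 0) return null; // not part of the test
  return ["save", ...rest].join("/");
}

// Build the list of publishes the for-loop would send (test/1 ... test/count),
// together with the retained re-publish each one is expected to trigger.
function buildWorkload(count) {
  const publishes = [];
  for (let i = 1; i <= count; i++) {
    const topic = `test/${i}`;
    publishes.push({
      topic,
      republish: { topic: toRetainedTopic(topic), retain: true },
    });
  }
  return publishes;
}

const workload = buildWorkload(1000);
console.log(workload.length, workload[0].republish.topic, workload[999].republish.topic);
```

With this list, a subscriber to save/# on either node would be expected to see 1000 retained messages once everything has propagated.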
OK, thanks. So this rules out the re-publishing of the topic in the webhook plugin as a cause of this problem.
I'm going to see if I can find time to reproduce/test this today. Anything of interest in the logs btw?
@ioolkos Thanks for your help! I confirm that there is no error or crash before or after the data insertion, and nothing is appended to the console file after the insertion. The following file contains all the info from when the cluster is started and connected.
One additional question: what is the client ID structure? Are the client IDs always the same or random? (Are they the same for the subscribers on node A and node B?)
Maybe related: we use a write cache for retained messages. If you publish a retained message, it takes at least 1 second until this retained message is committed to our clustered metadata storage. An unfortunate side effect of this load-prevention mechanism is that within that second a new subscriber on a different node will miss the retained message.
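The timing window described above can be illustrated with a toy model. This is not VerneMQ's implementation: the class `ToyRetainCache`, its methods, and the fixed `FLUSH_DELAY_MS` are assumptions made only to show why a subscriber on another node could miss a retained message published less than a second earlier.

```javascript
// Toy model, NOT VerneMQ's code: retained writes are buffered and only become
// visible to other nodes after a delay.
const FLUSH_DELAY_MS = 1000; // assumed from the "at least 1 second" mentioned above

class ToyRetainCache {
  constructor() {
    this.pending = [];          // writes not yet committed
    this.committed = new Map(); // what a subscriber on another node can see
  }

  // Record a retained publish at time nowMs (milliseconds, caller-supplied clock).
  publishRetained(topic, payload, nowMs) {
    this.pending.push({ topic, payload, commitAt: nowMs + FLUSH_DELAY_MS });
  }

  // Move every write whose delay has elapsed into the committed store.
  flush(nowMs) {
    const stillPending = [];
    for (const write of this.pending) {
      if (write.commitAt <= nowMs) this.committed.set(write.topic, write.payload);
      else stillPending.push(write);
    }
    this.pending = stillPending;
  }

  // Retained topics a new subscriber on a different node would receive at nowMs.
  remoteSubscribe(nowMs) {
    this.flush(nowMs);
    return [...this.committed.keys()];
  }
}

const cache = new ToyRetainCache();
cache.publishRetained("save/1", "hello", 0);
const early = cache.remoteSubscribe(500);  // inside the 1 s window
const late = cache.remoteSubscribe(1500);  // after the window
console.log(early.length, late.length); // 0 1
```

In this toy model the miss is only temporary: a subscriber that connects after the delay sees the message.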
I have not yet found the reason for this, but I guess I'm not testing exactly the same scenario you are.
@ioolkos You mean your test setup could not reproduce the issue? May I know what test setup you used?
I tested with singled out subscriptions (to 'save/clientid'), not to 'save/#' as you have. But I'll test that next.
I did a couple of very basic tests, publishing from 1000 clients and subscribing from 1 consumer to 'testtopic/#' with mosquitto_sub (and then counting the number of lines of its output). I get the same number of retained messages on both nodes A and B.
Before data insertion:
Node B:
Node A will be more or less the same, but gauge.router_subscriptions is 0.
After the insertion:
Node A:
@ioolkos The above trial published 1000 messages to Node A; only 604 messages were received on Node B.
for(var i = 0; i < 1000; i++) {
Hm, thanks! I wonder on the other hand why node B thinks it has published 1604 messages though.
@ioolkos I had the same thought, that 1604 = 1000 + 604, but I have no idea what those metrics represent. I tried it again: after the messages were published, counter.queue_message_in and counter.queue_message_out on Node B were both 1000, but once I subscribed to save/#, the value changed from 1000 to 621, then to 2242, and so on.
Just to make sure. What happens when you use a completely fresh topic? (instead of 'save/#')
@ioolkos What do you mean by a fresh topic? Do you mean other keywords?
@edwinAtWiz yes, I meant topics you've not used before, so that they get the retained msg for the first time.
@ioolkos I changed the topics from save/1, save/2 ... to msg/stor/1, msg/stor/2 ... etc. The result is just the same.
I can reproduce this with two cleanly built and clustered VerneMQ instances. Only thing I did was to
I then checked the retain count on node A:
And some seconds later on node B:
I'll look into it.
@edwinAtWiz I've created a fix for this issue, see #509. It would be much appreciated if you could test it out and see if this fixes the issue you're seeing.
The fix was merged to master. Could you test if it works for you?
@larshesel I am still testing it, but so far it works as expected! Many thanks for your help.
Good to hear! Thanks for finding this, @edwinAtWiz and @larshesel!
Environment
Expected behavior
All retained messages should be stored on all nodes within the cluster
Actual behaviour
I have set up a cluster with 2 nodes on AWS EC2 instances; each node implements the webhook plugin in Node.js to retain the message and publish it to another topic, like:
original topic: test/1, test/2 ... test/1000
retained to: save/1, save/2 ... save/1000
On the other hand, I have set up 2 nodes, C and D, to publish and subscribe using MQTT.js; the flow is like:
Note: when I publish the messages using mosquitto_pub to Node A at a slow insertion rate (~ 2/3 req. per second), the total number of retained messages lost on Node B is much lower (tested 2 times, publishing 2000 messages: < 10 messages lost, or none lost)
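A slow insertion rate like the one in the note can also be approximated from a Node.js publisher by spacing the publishes out. The sketch below only computes the delays; `paceSchedule` is a hypothetical helper, and the rate of 2 requests per second is an assumed value chosen for illustration.

```javascript
// Hypothetical pacing helper: returns the delay in milliseconds before each
// publish so that messages go out at roughly ratePerSecond.
function paceSchedule(count, ratePerSecond) {
  const intervalMs = 1000 / ratePerSecond;
  return Array.from({ length: count }, (_, i) => Math.round(i * intervalMs));
}

// 2000 messages at an assumed 2 req/s.
const schedule = paceSchedule(2000, 2);
console.log(schedule[0], schedule[1], schedule[1999]); // 0 500 999500
```

Each delay could then be passed to `setTimeout` around an MQTT.js `client.publish(topic, message, { retain: true })` call, assuming a connected client.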