
Race condition in all commands that expect a reply (sub/unsub/publish w QoS >0) #164

Open
davydnorris opened this issue May 4, 2019 · 7 comments

Comments

@davydnorris

I'm currently tracking down a problem with publishing and I think I've hit a race condition in the send path, where a second publish message sent at the wrong time stops the callbacks for the first from firing. This happens for QoS > 0 - when I look, I can see the data from the first send actually makes it to the platform, but the rest of the QoS 1 or 2 handshake never appears to happen.

I'm counting in-flight transactions to work out when I can sleep, and if I leave the unit running it all seems fine, which is a bit strange because initially I send 3-4 publish messages all at once and they work. The publishes gradually become separated over time, and it's when they start to drift apart that they mess each other up. The big problem for me is that the publish callback is not getting called, so my transaction count is not being decremented properly.

I'll capture a log and attach.

davydnorris changed the title from "Possible race condition in publish?" to "Possible race condition in publish? CONFIRMED" on May 5, 2019
@davydnorris

davydnorris commented May 5, 2019

OK, I've now looked at the code, and the race condition occurs when QoS > 0 and the network latency is larger than the time between publish commands. If you publish several messages in quick succession, that window doesn't have to be large at all. In fact my code has only been working by good luck rather than by design - if I were publishing several messages with different QoS levels it would break very quickly, and right now if a message fails to publish the library will likely report that the wrong one failed.

The problem occurs because the MQTT client only maintains a single pending message. When you publish more than one QoS > 0 message close together, you end up with several messages pending (waiting for an ACK) at once, and the state for all but the last one published is lost.

This is a pretty big architectural/design issue - to solve this you'd need to either:

  • make MQTT publishing a strictly serial process and only send the next message once the previous one's handshake has fully completed. On a slow network this would introduce a performance overhead; or
  • save pending message state in a list/queue instead of a single variable, and add logic to track the state of each message and pop it off the queue once its handshake completes (a rough sketch of this approach is below).

Either way it's quite a big change to the code.
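Roughly, the queue-based option could look something like the sketch below. None of these names (pending_msg_t, pending_push, pending_complete) come from this library - it's just an illustration of tracking every in-flight QoS > 0 message by its packet ID instead of overwriting a single pending slot:

```c
#include <stdint.h>
#include <stdlib.h>

/* One entry per QoS > 0 message that is still waiting for its ACK. */
typedef struct pending_msg {
    uint16_t msg_id;                         /* MQTT packet identifier       */
    uint8_t  qos;                            /* 1 or 2                       */
    void   (*on_complete)(uint16_t msg_id);  /* per-message publish callback */
    struct pending_msg *next;
} pending_msg_t;

static pending_msg_t *pending_head = NULL;

/* Remember a message that has been sent and is waiting for PUBACK/PUBCOMP. */
static void pending_push(uint16_t msg_id, uint8_t qos, void (*cb)(uint16_t))
{
    pending_msg_t *m = (pending_msg_t *)malloc(sizeof(*m));
    if (m == NULL) return;
    m->msg_id = msg_id;
    m->qos = qos;
    m->on_complete = cb;
    m->next = pending_head;
    pending_head = m;
}

/* When the matching ACK arrives, fire that message's callback and drop it,
 * leaving every other in-flight message untouched. */
static void pending_complete(uint16_t msg_id)
{
    pending_msg_t **p = &pending_head;
    while (*p != NULL) {
        if ((*p)->msg_id == msg_id) {
            pending_msg_t *done = *p;
            *p = done->next;
            if (done->on_complete) done->on_complete(msg_id);
            free(done);
            return;
        }
        p = &(*p)->next;
    }
}
```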

davydnorris changed the title from "Possible race condition in publish? CONFIRMED" to "Race condition in all commands that expect a reply (sub/unsub/publish w QoS >0)" on May 9, 2019
@davydnorris

Turns out the library has this problem with any command that requires a reply. If the timing between messages isn't right, you lose the state for all but the last message awaiting a response.

The rewritten MQTT in the current RTOS SDK is really neat, and they've implemented an outbox for messages, which is a nice solution, but there's no corresponding version for NonOS.

So I'm currently in the process of porting the new library over.

@maverickchongo

@davydnorris Have you been able to get this fixed? I was wondering whether you have a repo with a fix, as I am experiencing the same issue.

@davydnorris

@maverickchongo - no, at this point I have given up on this repo as it seems to no longer be actively maintained. I pulled down the ESP32 and ESP8266 RTOS MQTT code and got it to the point where it was almost working with NonOS, but they use the lwIP socket API and blocking calls for the transport, which performs really badly.

So I have temporarily gone back to QoS 0 to get my product out the door, and then I am going to write my own completely transport-agnostic MQTT based on ideas from the other libraries out there and on messaging protocol best practice. It will use a non-blocking transport layer that feeds a generic inbox and outbox. Both will be processed independently using event queues and loops, which means RTOS and NonOS can use the same library.
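To give an idea of what I mean by transport agnostic, the core could talk to the network only through a small struct of non-blocking function pointers, roughly like the sketch below. This is just an illustration of the idea - mqtt_transport_t and its members are made-up names, not from this library or the IDF one:

```c
#include <stddef.h>
#include <stdint.h>

/* Non-blocking transport interface: the MQTT core never calls lwIP, sockets
 * or TLS directly, only these hooks supplied by the host environment. */
typedef struct mqtt_transport {
    void *ctx;  /* opaque handle: raw lwIP pcb, BSD socket, TLS session, ... */

    /* Both calls must return immediately: >0 = bytes transferred,
     * 0 = would block (try again on the next loop/event), <0 = error. */
    int (*write)(void *ctx, const uint8_t *buf, size_t len);
    int (*read)(void *ctx, uint8_t *buf, size_t len);

    void (*close)(void *ctx);
} mqtt_transport_t;
```

Because nothing in the core blocks, the same state machine can be pumped from a NonOS callback, an RTOS task, or a plain select() loop.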

@someburner

@davydnorris Just ran into this issue myself. Will your new project be open source? If so, I'd be interested in helping out and/or testing for NonOS.

@davydnorris

davydnorris commented Aug 5, 2019

@someburner it absolutely will be open, and I'd love any help offered. Just got to get my sensor box out on some poles for a client and then I'll be looking at this.

The sources I've pulled to date are this library, the new RTOS and IDF versions (both ESP8266 and ESP32), the Eclipse Paho sources, and the wolfMQTT sources.

My architectural goals for the new library are:

  1. completely thread agnostic
  2. inbox and outbox use bip buffers, so messages or message pieces are always contiguous
  3. because of 2 above, no message copying and no double allocation - everything lives in the buffers
  4. because of 3 above, it will need mqtt_message_create(size) and mqtt_message_dispose(id) functions, but I figure that's an acceptable compromise (see the sketch after this list)
  5. transport layer agnostic, even more so than the new IDF version
  6. main control loop agnostic, so it can be used with callback-based, event-based or thread-based code, RTOS or NonOS
  7. non-blocking
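As a very rough illustration of points 2-4, the outbox could hand out regions of its bip buffer directly, so packets are serialised in place and released once their handshake finishes. Only mqtt_message_create(size) and mqtt_message_dispose(id) are names I've actually mentioned above; msg_handle_t and the rest are placeholders:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t msg_id_t;

/* Handle returned by the outbox: a contiguous region reserved inside the
 * bip buffer, which the caller serialises the MQTT packet into directly. */
typedef struct {
    msg_id_t id;    /* identifier later passed to mqtt_message_dispose() */
    uint8_t *buf;   /* start of the reserved region                      */
    size_t   len;   /* number of bytes reserved                          */
} msg_handle_t;

/* Reserve `size` contiguous bytes in the outbox - no copy, no second malloc. */
msg_handle_t mqtt_message_create(size_t size);

/* Release the region once the message's QoS handshake has completed
 * (or immediately after it has been written out, for QoS 0). */
void mqtt_message_dispose(msg_id_t id);
```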

@davydnorris
Copy link
Author

Of course, if this lib could easily be hacked to include an inbox/outbox so that QoS 1 and 2 could be implemented properly, then that may do us for now, but there's a lot that is good about the new IDF implementation.
