Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent lock binaryState/state in MQTT #51

Closed
alexdelprete opened this issue Nov 28, 2022 · 136 comments
Closed

Inconsistent lock binaryState/state in MQTT #51

alexdelprete opened this issue Nov 28, 2022 · 136 comments

Comments

@alexdelprete
Copy link
Collaborator

alexdelprete commented Nov 28, 2022

Today I noticed that lock sensor in HA was reporting unlocked when Nuki was locked, I checked MQTT and noticed this:

image

You will notice that lock/binaryState is unlocked but lock/state is locked.

I restarted Nuki Hub to force it to reset the states from the lock and everything was good again.

BTW: is it possible to add a restart button in the UI? right now I go into settings and hit save to force it to restart)

@alexdelprete alexdelprete changed the title Inconsistent lock state Inconsistent lock binaryState/state in MQTT Nov 28, 2022
@technyon
Copy link
Owner

Thanks, I'll look into it.

@alexdelprete
Copy link
Collaborator Author

alexdelprete commented Nov 28, 2022

Thanks to you.

Another weird thing, that might be related: I found some strange topics in MQTT that were similar to nuki_hub topics, but with parts missing: _hub_hub with battery info, then I found another weird topic, I made a screenshot yesterday but can't find it...I'll'attach it as soon as I find it. It's like there was some problem in sending data to MQTT from Nuki Hub. It only happened 2 times...I'm on latest version of the firmware.

UPDATE: found the screenshot. :)

image

@mundschenk-at
Copy link
Collaborator

mundschenk-at commented Nov 28, 2022

Now that you mention it, I have several of those too. Looks like there is some memory corruption going on somewhere (which would also explain why Mosquitto complains about malformed messages every once in a while).
Screenshot MQTT topics

@alexdelprete
Copy link
Collaborator Author

alexdelprete commented Nov 28, 2022

I hope it's a sw bug that will be solved. The inconsistency in the status of the Lock I noticed could also depend on this.

@technyon
Copy link
Owner

There's nothing obvious, the state and binaryState are always updated together. If it's an MQTT problem, it'll get more complicated. I don't have any such malformed topics, but I'm not using mosquitto. As far as I can see the correct topic is passed to the MQTT stack (pubsubclient), what happens from there I don't know since it's not my code.

@mundschenk-at
Copy link
Collaborator

Is it possible that memory on the ESP32 is getting corruped, maybe by a parallel thread? (I am thinking of the WiFi stack in particular, as that would be something you would not experience on your own device?)

@technyon
Copy link
Owner

I've reviewed the code building MQTT paths. It's a rather simple method, still there were some stupid mistakes which should be fixed now, you can try release 6.1. If this fixes the actual issue with the binary state I don't know, please test.

@alexdelprete
Copy link
Collaborator Author

alexdelprete commented Nov 29, 2022

I don't have any such malformed topics, but I'm not using mosquitto.

I don't use mosquitto either, I use EMQX v5. So same issue with 2 different brokers.

there were some stupid mistakes which should be fixed now, you can try release 6.1. If this fixes the actual issue with the binary state I don't know, please test.

That's what I hoped: some sw bug that could easily be solved. Thank you.

Strange that it did it never occur to you, and also strange why did it occur to me only after weeks of using Nuki Hub. Usually these stupid bugs trigger more often. :)

If this fixes the actual issue with the binary state I don't know, please test.

That inconsistency condition only happened 1 time since I installed Nuki Hub, so I will monitor things and report back in case I observe it again. What allows me to be optimistic is that I observed those malformed topics and the inconsistency in the same time-frame.

Will keep you poste, thank you for reacting so quickly.

@technyon
Copy link
Owner

After fixing those errors, I'm somewhat surprised that it did work before. I'm using iobrokers built-in MQTT broker, maybe it handles garbled messages/paths better than mosquitto.

@alexdelprete
Copy link
Collaborator Author

alexdelprete commented Nov 30, 2022

I'm using iobrokers built-in MQTT broker, maybe it handles garbled messages/paths better than mosquitto.

I think you have to follow a strict protocol/spec in order for an MQTT broker to accept data, so if the integrated broker is a true MQTT broker, it means it received the topics malformed, but correct from an MQTT spec perspective, and so it should process them as EMQX and Mosquitto did.

The fact that it worked before also for me for almost a month will remain a magic mistery. :)

I have some other issues with the web interface (also with previous 6.0 fw), pages render with no formatting often, if I refresh a couple of times I see the formatting (black background, etc.), but sometimes it throws an http error (a connection reset error) then after 2-3s it appears again. It seems like http connections are not managed very well...that's the feeling.

Thanks a lot for your support.

@technyon
Copy link
Owner

Probably related to issue #48, let's follow up there.

@giejay
Copy link

giejay commented Dec 2, 2022

Im still seeing this issue:

front-door/lock/state unlocked
front-door/lock/binaryState locked

Using 6.1

@alexdelprete
Copy link
Collaborator Author

Im still seeing this issue:

didn't occur to me yet, after 6.1. But I used nuki_hub for quite some time without the issue, then one day it happened...when I started seeing malformed topics in mqtt.

Do you have malformed topics? open all nuki_hub subtopics to check for weird names...

@technyon
Copy link
Owner

technyon commented Dec 3, 2022

Let's keep in mind that the malformed topics aren't necessarily related to the incosistent states.

@giejay
Copy link

giejay commented Dec 3, 2022

For me it seems te be related to having multiple nuki locks: #53 I have moved one back to another project and now I don't see this anymore.

@alexdelprete
Copy link
Collaborator Author

Let's keep in mind that the malformed topics aren't necessarily related to the incosistent states.

Correct, we don't know, that's why I asked, so we can check for patterns.

@mundschenk-at
Copy link
Collaborator

I have updated my device to 6.1 and cleaned out all the messed up topics using mosquitto_pub. After a few days, it looks like new broken topics have appeared. Is it possible there is still an issue somewhere else in the MQTT code?

@technyon
Copy link
Owner

technyon commented Dec 4, 2022

The question is where this happens. As far as I can tell, the paths should be built correctly now. After that comes the MQTT library. Just to verify everything is correct on my side, I've built a version that outputs all paths to the serial console.

nuki_hub-mqtt-dbg-1.zip

@alexdelprete
Copy link
Collaborator Author

I have updated my device to 6.1 and cleaned out all the messed up topics using mosquitto_pub. After a few days, it looks like new broken topics have appeared. Is it possible there is still an issue somewhere else in the MQTT code?

Since 6.1 I haven't seen malformed topics, but I guess it's too soon to say it's fixed. I'll keep monitoring it and keep you posted.

@mundschenk-at
Copy link
Collaborator

Unfortunately, for me they are still there. To make sure they are not artifacts, I cleaned up again after posting #51 (comment) and this is a screenshot of today:
055F71C7-B800-415A-B7FD-BC2AD1C39B4B

@Mincka
Copy link
Contributor

Mincka commented Jan 5, 2023

Few other examples from 6.7:
image
I am cleaning everything again. Could it be related to automated or triggered restarts?

@alexdelprete
Copy link
Collaborator Author

alexdelprete commented Jan 5, 2023

I am cleaning everything again. Could it be related to automated or triggered restarts?

So the bug is still there. I just checked mine (6.8), and very interestingly, the same Topic is malformed, can't be a coincidence:

image

I also noticed this rssice topic with a child topic...doesn't seem normal:

image

@Mincka
Copy link
Contributor

Mincka commented Jan 11, 2023

I attempted to find issues in the code but the only thing I found that could cause memory issues is a RSSI value that would not be not integer (not even sure that this possible). The itoa() would cause unexpected results.
https://github.com/technyon/nuki_hub/blob/master/Network.cpp#L376

Or maybe it's another bug in the Network::buildMqttPath function.

It would mean that checking for end of strings with "0x00" would not always work or that the fact that the generated path itself would cause issues because its array is too big and padded with "0x00". I am not familiar with this method of initializing string arrays with 0x00 so sorry if I say stupid things.
https://github.com/technyon/nuki_hub/blame/master/Network.cpp#L303

@mundschenk-at
Copy link
Collaborator

@Mincka In C (and derived languages like C++), strings are pointers to zero-terminated arrays of characters.

@technyon
Copy link
Owner

It's a mystery to me why this is happening. There's nothing too special about the rssi node, it's published via the publishInt() method, which is used by for a few other nodes too (batteryDrain, lockDistance, ledBrightness). Maybe the bug is triggered more easily with the rssi node because it's constantly updated ... same goes for wifiRssi and presenceDetection though.

@mundschenk-at
Copy link
Collaborator

I have the issue with other nodes as well (see previous screenshots). It is not limited to any single topic.

@Mincka
Copy link
Contributor

Mincka commented Jan 11, 2023

@technyon Do you have this issue yourself? Maybe we have something else in common that can be causing this. @giejay was thinking about something related to multiple locks. I have two locks with one ESP M5 Atom per lock. Do you have multiple locks @mundschenk-at and @alexdelprete ?

@mundschenk-at
Copy link
Collaborator

Only one lock. Also, the ESP can only be paired to one lock anyway.

@technyon
Copy link
Owner

I've thought about switching to to QoS 1, the library does support it ... it's just that QoS 0 is the default. It also supports QoS 2, there's some issue with it open though, I'd need to check.

@mundschenk-at
Copy link
Collaborator

At least QoS 1 sounds like a good idea for this usecase.

@alexdelprete
Copy link
Collaborator Author

alexdelprete commented Jan 23, 2023

I've thought about switching to to QoS 1, the library does support it ... it's just that QoS 0 is the default. It also supports QoS 2, there's some issue with it open though, I'd need to check.

QoS 2 is a little bit too much for resource constrained device, I think QoS 1 is a good compromise. Do we need to tackle duplication in our use case?

image

I'm studying a little bit MQTT because I never did before, and it's interesting. I wanted to understand better this Clean Sessions vs Persistent Sessions thing. And I think we definitely want to use Persistent Sessions, because it's true NH mainly publishes messages, but it's also a consumer of very important messages (commands) that it doesn't want to lose (e.g. after a sudden reboot), so with Persistent Sessions and QoS 1 we tell the broker to store those in a queue until they are consumed by NH:

image

@Mincka
Copy link
Contributor

Mincka commented Jan 23, 2023

@alexdelprete Any news about the official MQTT support by the way? Did they have such details in their specs?

Edit: Replying to myself.

From your post:
https://developer.nuki.io/uploads/short-url/tEQwbNLIuawOgr8ItMT5jLr2hYB.pdf

image

image

@alexdelprete
Copy link
Collaborator Author

alexdelprete commented Jan 23, 2023

Any news about the official MQTT support by the way? Did they have such details in their specs?

I'm not following them anymore, had a harsh discussion with Jurgen after I asked him a simple question: will MQTT support be available for the Nuki Bridge? He didn't answer clearly. So I pushed him again and he didn't want to answer. So from what I understand, they only want to support MQTT on SL3 model, to push sales (old SL2 users will have to upgrade).

So I got a little bit upset and wrote in the forum that would be my last post supporting the initiative, and I also wrote that with a $10 device and Nuki Hub project we have the best bridge there is, with active developement and everything working fine (almost...). And I left them to their destiny...I don't like their approach, no transparency, they lie on functionalities (remember zigbee support? never happened in the end).

On the technical side: QoS=2 might be a little overkill for a low resource device, but we could experiment, maybe a configurable QoS option with 1/2 to test things and then decide? cc @technyon

@Mincka
Copy link
Contributor

Mincka commented Jan 23, 2023

Thanks for the feedback. Their hybrid approach with QoS=2 only for device control seems to make sense but is more complicated to implement. Is it possible that with QoS=1, we'd risk to have an unlock command sent twice by HA and lead to side effects if lock command is sent by a third-party in-between? I mean, maybe QoS=1 could do more harm than QoS=0 in some cases. Very hard to tell.

@alexdelprete
Copy link
Collaborator Author

alexdelprete commented Jan 23, 2023

I think the hybrid approach is the best one: in the end, that's why we have different QoS levels. Not all messages have the same intrinsic importance. As you said, Lock/Unlock should definitely have QoS=2 because of the risk of duplication. But the duplication can also be managed at the client level, ensuring you never send 2 times those commands by waiting for the ACK, that QoS=1 provides.

It's a bit like choosing UDP vs TCP.

@alexdelprete
Copy link
Collaborator Author

Is it possible that with QoS=1, we'd risk to have an unlock command sent twice by HA and lead to side effects if lock command is sent by a third-party in-between? I mean, maybe QoS=1 could do more harm than QoS=0 in some cases. Very hard to tell.

QoS=0 is the worst of all, duplication is an issue at all lower levels than QoS=2, it's not a specific issue of QoS=1. So for me the baseline should be QoS=1, and QoS=2 for very critical commands.

@technyon
Copy link
Owner

technyon commented Jan 23, 2023

I'm a bit hesitant to go to QOS 2 because:

arduino-libraries/ArduinoMqttClient#64

You can give QOS 1 a try though:

nuki_hub-7.1-qos-1.zip

I don't think receiving the lock action twice will cause any issues ... it's unlikely anyway everything is running over TCP.

@alexdelprete
Copy link
Collaborator Author

alexdelprete commented Jan 23, 2023

I'm a bit hesitant to go to QOS 2 because:

it should answer with PUBREL and wait for PUBCOMP. You are right, let's wait for them to anwer to that.

You can give QOS 1 a try though:

This has the CS flag set to false, right? QoS=1 goes with CS=False.

@mundschenk-at
Copy link
Collaborator

From what I see in EMQX (nice admin interface, and there's even a HA addon in the works, thanks for the recommendation, @alexdelprete), it does not.

@alexdelprete
Copy link
Collaborator Author

alexdelprete commented Jan 23, 2023

@mundschenk-at thanks. I didn't know about the addon, it would be great, do you have a link?

@technyon if it's not too much work, can you prepare a test fw with all options we discussed configurable?

  • RSSI frequency
  • CS True/False
  • QoS 0/1/2 (so I can also test with 2 and EMQX vs Mosquitto)

It would be good so we can play with the settings and find a good configuration.

@technyon
Copy link
Owner

Any chance you could compile the binaries yourself, would make a lot easier. Check the readme, I've provider a virtual machine to compile. RSSI interval is already configurable via the web interface. I've put QOS and CS into "Config.h", you just have to change the defines and recompile.

@mundschenk-at
Copy link
Collaborator

@alexdelprete It's still quite new, but in the Community repository: https://github.com/hassio-addons/addon-emqx

@alexdelprete
Copy link
Collaborator Author

alexdelprete commented Jan 23, 2023

It's still quite new, but in the Community repository

Oh, I don't follow the addons because I use HA Core. Every app/service is installed outside of HA environment (docker and LXCs), and then integrated through integrations.

It's good that Frenck provided a better alternative MQTT broker.

@alexdelprete
Copy link
Collaborator Author

Any chance you could compile the binaries yourself, would make a lot easier. Check the readme, I've provider a virtual machine to compile. RSSI interval is already configurable via the web interface. I've put QOS and CS into "Config.h", you just have to change the defines and recompile.

Sure, I'll try. If I have problems I'll let you know. Thanks.

@alexdelprete
Copy link
Collaborator Author

alexdelprete commented Jan 24, 2023

I've put QOS and CS into "Config.h", you just have to change the defines and recompile.

I have the VM up and running in Proxmox.

I logged as nuki, but can't find the Config.h file in the project folder. I looked for it in the whole branch, and it finds only a couple of other files.

UPDATE: nevermind, I forgot to git pull. I see it now. ;)

@alexdelprete
Copy link
Collaborator Author

@technyon Jan, where do I set this?

image

I need it to check what I have enabled/disabled while testing.

@Mincka
Copy link
Contributor

Mincka commented Jan 25, 2023

Change the flags here before compiling and the name of the version here with what you set, eg. 7.2-cs-false-qos-1.

@alexdelprete
Copy link
Collaborator Author

I had already changed Config.h, I missed Version.h. :)

Thanks.

@technyon
Copy link
Owner

technyon commented Jan 25, 2023

They could be merged into one file actually.

@alexdelprete
Copy link
Collaborator Author

Yeah...probably it would be a good choice. Let me know if you merge them.

@Mincka
Copy link
Contributor

Mincka commented Jan 25, 2023

So what are you testing first @alexdelprete ?

@alexdelprete
Copy link
Collaborator Author

alexdelprete commented Jan 25, 2023

I didn't allocate the time yet to do a proper test. For now I allocated time to import the VM in my Proxmox and customize the image because the settings weren't working for me. Now the VM is ok and I know how to compile etc., but I need to schedule the test activity. Probably by tomorrow. :)

I will let you know...the only worry I have is that actually I'm using 7.0-cs-false and I'm not having serious issues (reboots, etc.). So I don't know how meaningful will my test be. :)

@Mincka
Copy link
Contributor

Mincka commented Jan 25, 2023

I understand. If it ain't broke, don't fix it. ;) That's better with your stable setup because we'll know if potential issues are related to the QoS overload. In my case, I could certainly not deduce anything from playing with this yet.

@alexdelprete
Copy link
Collaborator Author

I'm a tinkerer...so I get bored in steady states...:)

I will play with settings and try to tell you if I notice some degradation. Also, I want to experiment with QoS=1.

@alexdelprete
Copy link
Collaborator Author

BTW: if you guys use discord, I setup a dedicated chat for Nuki Hub, you can join here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants