
Health Check triggers node re-interview, which reports wrong security value afterwards #4184

darkbasic opened this issue Feb 3, 2022 · 13 comments · Fixed by #5571

@darkbasic
Contributor

Is your problem within Home Assistant (Core or Z-Wave JS Integration)?

NO, my problem is NOT within Home Assistant or the ZWave JS integration

Is your problem within ZWaveJS2MQTT?

NO, my problem is NOT within ZWaveJS2MQTT

Describe the bug

I've just updated my Z-Pi 7 firmware to 7.17.1. I didn't need to restore the NVM because another RaZberry firmware issue prevented me from migrating my network to a 500 series controller.

The first thing I did was a full network heal, which I'm not sure ever completed because I have 7 battery-powered nodes that were turned off. The day after, these nodes were still spinning in zwavejs2mqtt, but that might be a UI bug and the backend might have timed them out and completed the heal (or maybe not?). Anyway, I've attached both the zwave-js and zwavejs2mqtt logs of the heal.

At this point I restarted zwavejs2mqtt and tried a health check on node 83:

[screenshot: node 83 health check results]

I'm unsure why the latency is sometimes so high; unfortunately it doesn't show which route was taken for each ping, so this isn't very useful. Would it be possible to implement that?

Later I tried to do the same for node 28 and, to my disbelief, it triggered a node re-interview. What's even worse, after the re-interview the node is shown as having S0 security (previously it had none, which should be the correct value).

[screenshot: node 28 shown with S0 security after the re-interview]

What's going on?

I've attached both the zwave-js and zwavejs2mqtt logs.
The node 28 health check starts at 2022-02-03 15:08:09 in the zwavejs2mqtt log and at 2022-02-03T14:09 in the zwave-js log (I'm not sure why the former uses local time while the latter is GMT).

Device information

No response

How are you using node-zwave-js?

  • zwavejs2mqtt Docker image (latest)
  • zwavejs2mqtt Docker image (dev)
  • zwavejs2mqtt Docker manually built (please specify branches)
  • ioBroker.zwave2 adapter (please specify version)
  • HomeAssistant zwave_js integration (please specify version)
  • pkg
  • node-red-contrib-zwave-js (please specify version, double click node to find out)
  • Manually built from GitHub (please specify branch)
  • Other (please describe)

Which branches or versions?

version:
node-zwave-js: 8.11.2
zwavejs2mqtt: 6.5.0

Did you change anything?

no

If yes, what did you change?

No response

Did this work before?

Don't know, this is a new device

If yes, where did it work?

No response

Attach Driver Logfile

Health Check logs:
zwavejs_2022-02-03.log
zwavejs2mqtt_2022-02-03.log

Network Heal logs:
zwavejs_2022-02-02.log
zwavejs2mqtt_2022-02-02.log

@AlCalzone
Member

AlCalzone commented Feb 3, 2022

@darkbasic I don't think this is related to the new firmware, but unfortunately your driver log is not on loglevel "debug", so it doesn't tell much. I've tried my best with what I got here:

The healing logs look unsuspicious.

Not sure why the re-interview was triggered during the health check. At least the node doesn't list S0 as supported and S0 is also not used during the communication:

2022-02-03T14:08:33.659Z CNTRLR « [Node 028] received response for protocol info:
                                  basic device class:    Routing Slave
                                  generic device class:  Multilevel Switch
                                  specific device class: Motor Control Class C
                                  node type:             Routing End Node
                                  is always listening:   true
                                  is frequent listening: false
                                  can route messages:    true
                                  supports security:     false
                                  supports beaming:      true
                                  maximum data rate:     100000 kbps
                                  protocol version:      3
2022-02-03T14:08:33.861Z CNTRLR   [Node 028] Interview stage completed: ProtocolInfo
2022-02-03T14:08:33.865Z CNTRLR » [Node 028] querying node info...
2022-02-03T14:08:34.084Z CNTRLR « [Node 028] node info received
                                  supported CCs:
                                  · Z-Wave Plus Info
                                  · Device Reset Locally
                                  · Powerlevel
                                  · Version
                                  · Manufacturer Specific
                                  · All Switch
                                  · Binary Switch
                                  · Multilevel Switch
                                  · Meter
                                  · Notification
                                  · Multi Channel
                                  · Association
                                  · Multi Channel Association
                                  · Association Group Information
                                  · Configuration
                                  controlled CCs:
                                  · Multilevel Switch

The automatic re-interview doesn't feel right here; I've raised #4185 for that.

It's also a bit unfortunate that the re-interview starts in the middle of the health check while node neighbors are being queried; I've added this as a TODO to #3707.

unfortunately it doesn't show which route was taken for each ping, so this isn't very useful

This is in the works for v9.

@robertsLando why does the UI refresh neighbors while the health check is going on? This could influence the results because it turns off the radio for a short time.

@robertsLando
Member

@robertsLando why does the UI refresh neighbors while the health check is going on? This could influence the results because it turns off the radio for a short time.

The refresh is called when the mesh is loaded or when there are changes in the nodes array. It could be that a change is detected in the nodes object and that triggers a refresh. I need to investigate.

@darkbasic
Contributor Author

unfortunately your driver log is not on loglevel "debug", so it doesn't tell much

It was supposed to be on debug, but since I cross-compile from my laptop I might have restored an older profile which didn't have it set. By the time I noticed, it was already too late. I'll try to reproduce everything with the appropriate log levels.
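For anyone running the driver directly instead of through the zwavejs2mqtt UI, a minimal sketch of pinning the log level via the driver's logConfig options; the serial path and filename below are just placeholders:

```ts
import { Driver } from "zwave-js";

// Minimal sketch: force debug-level driver logging to a file so that routes,
// RSSI and transmit statuses show up in the health check logs.
// Serial path and filename are placeholders for this example.
const driver = new Driver("/dev/ttyAMA0", {
  logConfig: {
    enabled: true,
    level: "debug",
    logToFile: true,
    filename: "zwavejs_%DATE%.log",
  },
});

void driver.start();
```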

why does the UI refresh neighbors while the health check is going on

Good catch, I noticed it as well but I forgot to report it. It's not the first time it happened.

This is in the works for v9

Awesome. In the meantime, is there anything we can do to find out why the jitter is so high?

@AlCalzone
Member

anything we can do to find out why the jitter is so high?

You could try looking at the debug level logs of the health check. The information about routes etc. is already there, it just isn't used yet.

@robertsLando sometimes the refresh even causes an error if it happens more frequently than every 60s. I'd probably disable any automatic refresh during the health check or delay it until the window is closed.
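Roughly what I have in mind, as a sketch; the flag and helper names are made up for illustration, not zwavejs2mqtt internals:

```ts
// Hypothetical guard: postpone neighbor refreshes while a health check is
// running, so the radio isn't switched off mid-measurement.
let healthCheckRunning = false;
let refreshPending = false;

async function requestNeighborRefresh(refresh: () => Promise<void>): Promise<void> {
  if (healthCheckRunning) {
    refreshPending = true; // remember the request, run it after the check
    return;
  }
  await refresh();
}

async function runHealthCheck(
  check: () => Promise<void>,
  refresh: () => Promise<void>,
): Promise<void> {
  healthCheckRunning = true;
  try {
    await check();
  } finally {
    healthCheckRunning = false;
    if (refreshPending) {
      refreshPending = false;
      await refresh();
    }
  }
}
```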

@darkbasic
Contributor Author

Here are the logs with debug level:
zwavejs_2022-02-04.log
zwavejs2mqtt_2022-02-04.log

Initially I just did a network heal. Apart from a node being marked as dead only to come back immediately, I didn't notice anything unusual.

Then at 2022-02-04 12:48:47.391 GMT+1 I started a health check on node 83:
[screenshot: 83_health_check_1249]

Here I see a lot of jitter once again.
From the logs I couldn't figure out which route was taken for each ping: debug is quite verbose if you don't know what's going on behind the scenes.
The meaning of the SNR margin also isn't clear to me: since the node isn't in direct range what is it supposed to represent?

Then at 2022-02-04 12:51:28.064 GMT+1 I started a health check on node 28.
Again, I'm not sure what's going on, but it didn't return any result.

@AlCalzone
Member

The meaning of the SNR margin also isn't clear to me: since the node isn't in direct range what is it supposed to represent?

If you click on the (?) button, it gives you some explanation for each parameter and how the rating is done. SNR is

An estimation of the Signal-to-Noise Ratio Margin in dBm

basically the difference between measured RSSI of the node's ACK frames and the background RSSI that's measured between checks. According to the built-in IMA tool in PC Controller, it should be >= 17 dBm, so your values are very good.
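In other words, the computation boils down to a subtraction plus that threshold; a tiny sketch (the -95 dBm noise floor below is only an illustrative value, not from your log):

```ts
// SNR margin ≈ RSSI of the node's ACK minus the background RSSI measured
// between checks; the IMA guideline considers >= 17 dBm healthy.
function snrMargin(ackRssiDbm: number, backgroundRssiDbm: number): number {
  return ackRssiDbm - backgroundRssiDbm;
}

function rateSnrMargin(marginDbm: number): "good" | "potentially unreliable" {
  return marginDbm >= 17 ? "good" : "potentially unreliable";
}

// Example in the ballpark of this log: a -48 dBm ACK over a -95 dBm noise floor
const margin = snrMargin(-48, -95); // 47 dBm
console.log(margin, rateSnrMargin(margin));
```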

Apart from a node being marked as dead only to come back immediately, I didn't notice anything unusual.

That must be this. IMO, this looks terribly like the 700 series issue isn't entirely solved for all commands, since the controller immediately says "I can't send this", although the next command works just fine. This only happened the one time though.

2022-02-04T11:42:56.405Z CNTRLR » [Node 085] Deleting SUC return route...
2022-02-04T11:42:56.433Z SERIAL » 0x01050055553dc7                                                     (7 bytes)
2022-02-04T11:42:56.435Z DRIVER » [Node 085] [REQ] [DeleteSUCReturnRoute]
                                    payload: 0x553d
2022-02-04T11:42:56.445Z SERIAL « [ACK]                                                                   (0x06)
2022-02-04T11:42:56.450Z SERIAL « 0x0104015501ae                                                       (6 bytes)
2022-02-04T11:42:56.452Z SERIAL » [ACK]                                                                   (0x06)
2022-02-04T11:42:56.455Z DRIVER « [RES] [DeleteSUCReturnRoute]
                                    was executed: true
2022-02-04T11:42:56.462Z SERIAL « 0x012100553d0200000000000000000000000000000000000000000000000000000 (35 bytes)
                                  000b4
2022-02-04T11:42:56.468Z SERIAL » [ACK]                                                                   (0x06)
2022-02-04T11:42:56.471Z DRIVER « [REQ] [DeleteSUCReturnRoute]
                                    callback id:     61
                                    transmit status: Fail

Regarding the jitter:
This is the result of a ping with a low latency:

2022-02-04T11:48:50.468Z DRIVER « [REQ] [SendDataBridge]
                                    callback id:            89
                                    transmit status:        OK, took 30 ms
                                    repeater node IDs:      2
                                    routing attempts:       1
                                    protocol & route speed: Z-Wave, 40 kbit/s
                                    ACK RSSI:               -48 dBm
                                    ACK RSSI on repeaters:  N/A
                                    ACK channel no.:        1
                                    TX channel no.:         1

and this is the next one with higher latency:

2022-02-04T11:48:50.691Z DRIVER « [REQ] [SendDataBridge]
                                    callback id:            90
                                    transmit status:        OK, took 170 ms
                                    repeater node IDs:      2
                                    routing attempts:       1
                                    protocol & route speed: Z-Wave, 40 kbit/s
                                    ACK RSSI:               -48 dBm
                                    ACK RSSI on repeaters:  N/A
                                    ACK channel no.:        1
                                    TX channel no.:         1

So from the controller's POV, there isn't any difference, except the 2nd one takes longer. Because ACKs aren't retried AFAIK, the repeater node 2 probably needed 2 tries to reach node 83 on the longer attempts. Unfortunately, this can only be seen with a Zniffer.
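If you want to quantify the jitter yourself, the durations can be scraped straight out of the debug log; a quick sketch that only relies on the "transmit status: OK, took N ms" lines shown above (it doesn't filter by node, so run it on a log that only contains the pings you care about):

```ts
import { readFileSync } from "node:fs";

// Collect the "took N ms" durations from SendData callbacks in a zwave-js
// debug log and print some simple jitter statistics.
const log = readFileSync("zwavejs_2022-02-04.log", "utf8");
const durations = [...log.matchAll(/transmit status:\s+OK, took (\d+) ms/g)].map(
  (m) => Number(m[1]),
);

if (durations.length > 0) {
  const min = Math.min(...durations);
  const max = Math.max(...durations);
  const avg = durations.reduce((a, b) => a + b, 0) / durations.length;
  console.log({ pings: durations.length, min, max, avg: Math.round(avg), jitter: max - min });
}
```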

Then at 2022-02-04 12:51:28.064 GMT+1 I started a health check on node 28.
Again, I'm not sure what's going on, but it didn't return any result.

See #4075 (comment), 2nd bullet point. This is a bug.

@darkbasic
Contributor Author

That must be this. IMO, this looks terribly like the 700 series issue isn't entirely solved for all commands

I guess so; I'm not the first one to experience something similar, if I recall correctly. At least it's somewhat usable now, but I still wouldn't suggest anyone buy a 700 series stick yet.

basically the difference between measured RSSI of the node's ACK frames and the background RSSI that's measured between checks

But between which nodes? I guess it must be between the controller and node 2 (the one which routed the message), but that doesn't give the full picture if the signal between node 2 and node 83 is weak. I would be much more interested in the lowest SNR in the whole path (which is probably between node 2 and 83).

Because ACKs aren't retried AFAIK, the repeater node 2 probably needed 2 tries to reach node 83 on the longer attempts. Unfortunately, this can only be seen with a Zniffer.

The Zniffer is no problem; I've already decided to sacrifice one of my sticks (I just haven't done so yet for lack of time).
But once we confirm what we already suspect (that the SNR between 2 and 83 is bad and thus the link is unreliable), what can we do to fix it? Theoretically a network heal should take the SNR between each pair of nodes into account when deciding which route to prioritize, but apparently it just picks a random one. IIRC you said that the routing table is generated by the firmware, so there isn't anything we can do about it, except that it should be possible to override routes manually, right? Do we get SNRs when we ask nodes for neighbors, which is essential in order to compute a reasonable path?

P.S.
Some time ago you sent me links to some very useful in-depth explanations about zwave routing, but I can't find them anymore. Can you please re-send them? Thanks

@AlCalzone
Member

I would be much more interested in the lowest SNR in the whole path (which is probably between node 2 and 83).

Yeah, but that isn't available:

ACK RSSI on repeaters:  N/A

I'm not sure what needs to be done to get this information recorded. Maybe the repeater needs to be 700-series?
In any case, to my knowledge we can't instruct nodes to measure the background RSSI, only the controller.

Do we get SNRs when we ask nodes for neighbors, which is essential in order to compute a reasonable path?

No. You might get a somewhat reasonable estimation if the RSSI of the repeaters were known.

links to some very useful in-depth explanations about zwave routing,

These?
https://github.com/zwave-js/specs/blob/c146b14e61be5f0c0d1527eaa1936c63737e06a6/Z-Wave%20Long%20Range%20Specifications/Z-Wave%20and%20Z-Wave%20Long%20Range%20Network%20Layer%20Specification.pdf
and
https://github.com/zwave-js/specs/blob/c146b14e61be5f0c0d1527eaa1936c63737e06a6/INS13954-Instruction-Z-Wave-500-Series-Appl-Programmers-Guide-v6_8x_0x.pdf, chapter 3.4

@darkbasic
Contributor Author

I've updated Z-Pi 7 to fw v7.17.2 and zwavejs2mqtt to 6.7.1.2a08abd (zwave-js 9.0.2).

I've done a full heal, but unfortunately it still looks like nodes are dying/reviving mid-heal:

2022-04-15 18:45:56.545 INFO ZWAVE: Controller status: Healing process IN PROGRESS. Healed 34 nodes
2022-04-15 18:46:00.796 INFO ZWAVE: Node 9: value updated: 50-0-value-66049 0 => 0
2022-04-15 18:46:03.880 INFO ZWAVE: Controller status: Healing process IN PROGRESS. Healed 35 nodes
2022-04-15 18:46:09.684 INFO ZWAVE: Node 54 is now dead
2022-04-15 18:46:11.127 INFO ZWAVE: Node 54: has returned from the dead
2022-04-15 18:46:11.223 INFO ZWAVE: Node 54 is now dead
2022-04-15 18:46:12.766 INFO ZWAVE: Node 54: has returned from the dead
2022-04-15 18:46:12.865 INFO ZWAVE: Node 54 is now dead
2022-04-15 18:46:13.425 INFO ZWAVE: Node 54: has returned from the dead
2022-04-15 18:46:13.669 INFO ZWAVE: Controller status: Healing process IN PROGRESS. Healed 36 nodes

Doing a health check on node 83 still leads to bad routes w/ randomly lost packets:
[screenshot: node83_healthcheck]

Doing a health check on node 28 still fails, but at least now it apparently doesn't trigger re-interviews:
[screenshot: node28_healthcheck]

Debug logs:
zwavejs2mqtt_2022-04-15.log
zwavejs_2022-04-15.log

@AlCalzone
Member

2022-04-15 18:46:11.223 INFO ZWAVE: Node 54 is now dead
2022-04-15 18:46:12.766 INFO ZWAVE: Node 54: has returned from the dead

This is because the return routes can't be updated for some reason:

2022-04-15T16:46:09.591Z CNTRLR » [Node 054] Deleting SUC return route...
2022-04-15T16:46:09.624Z DRIVER » [Node 054] [REQ] [DeleteSUCReturnRoute]
                                    payload: 0x36e8
2022-04-15T16:46:09.640Z DRIVER « [RES] [DeleteSUCReturnRoute]
                                    was executed: true
2022-04-15T16:46:09.654Z DRIVER « [REQ] [DeleteSUCReturnRoute]
                                    callback id:     232
                                    transmit status: Fail

and this is incorrectly interpreted as a dead node (#4191).
You'll likely have to use a Zniffer to figure out the reason for this failure to update routes, though.
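Conceptually, the problem behind #4191 is that a callback failure of a controller-side route management command is treated the same as a failed transmission to the node itself; a purely hypothetical sketch of that distinction (not actual driver code):

```ts
// Hypothetical: only commands actually addressed to the node (e.g. SendData)
// should be allowed to mark it dead; controller-side route management failures
// like DeleteSUCReturnRoute get logged but don't change the node status.
type FailedCommand =
  | { kind: "SendData"; nodeId: number }
  | { kind: "DeleteSUCReturnRoute"; nodeId: number }
  | { kind: "AssignSUCReturnRoute"; nodeId: number };

function shouldMarkNodeDead(cmd: FailedCommand): boolean {
  return cmd.kind === "SendData";
}
```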

Doing a health check on node 83 still leads to bad routes w/ randomly lost packets

In the logs there is no difference, except that the attempts with more delay retried routing once - without changing the route though. Node 83 is routing through node 2 each time:

2022-04-15T17:01:14.307Z DRIVER « [REQ] [SendDataBridge]
                                    callback id:            125
                                    transmit status:        OK, took 30 ms
                                    repeater node IDs:      2
                                    routing attempts:       1
                                    protocol & route speed: Z-Wave, 40 kbit/s
                                    ACK RSSI:               -41 dBm
                                    ACK RSSI on repeaters:  N/A
                                    ACK channel no.:        1
                                    TX channel no.:         1

2022-04-15T17:01:14.525Z DRIVER « [REQ] [SendDataBridge]
                                    callback id:            126
                                    transmit status:        OK, took 150 ms
                                    repeater node IDs:      2
                                    routing attempts:       2
                                    protocol & route speed: Z-Wave, 40 kbit/s
                                    ACK RSSI:               -42 dBm
                                    ACK RSSI on repeaters:  N/A
                                    ACK channel no.:        1
                                    TX channel no.:         1

The question is whether the route between 2 and 83 is causing these, or the one between the controller and node 2.

Doing a health check on node 28 still fails

It looks like it is stuck in a reverse ping check from a lower powerlevel and that is influencing the results for the higher powerlevels. I'll probably have to revise the binary search for the limit powerlevel and start from the highest, working down.
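Roughly what "start from the highest, working down" would mean, as a sketch; `pingAtPowerlevel` is a placeholder for the Powerlevel-CC-based test, not an actual driver method:

```ts
// Hypothetical linear scan for the minimum powerlevel at which the target node
// still answers: start at normal power (0 dBm reduction) and reduce step by
// step instead of binary-searching, so a flaky low level can't skew the
// results for the higher ones.
async function findMinWorkingPowerlevel(
  pingAtPowerlevel: (dbmBelowNormal: number) => Promise<boolean>,
): Promise<number> {
  let lastWorking = 0;
  for (let reduction = 0; reduction <= 9; reduction++) {
    const ok = await pingAtPowerlevel(reduction);
    if (!ok) break; // first failure: the previous level is the limit
    lastWorking = reduction;
  }
  return lastWorking;
}
```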

@darkbasic
Contributor Author

The question is whether the route between 2 and 83 is causing these, or the one between the controller and node 2.

Definitely between node 2 and 83, because the controller is a few centimeters away from node 2. Honestly, I don't understand why it would pick such a useless hop when there are lots of other nodes in between.

Proof:
[screenshot: node2_healthcheck]

If I recall correctly, it should be the controller's firmware that computes these routes, right?
Can't we brute force our way to a better route? For example, automatically health-check all the nodes in direct range of the controller (and capable of reaching the target node), and then health-check each of those nodes against the target. With such an approach we can't go further than 1 hop without having to test an insanely high number of combinations, but considering 99% of targets can be reached in 1 hop, that could be a viable option IMO.
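Something like this could probably already be scripted against the existing health check API; a rough sketch assuming zwave-js's checkLifelineHealth()/checkRouteHealth() methods (check the docs of your version for the exact shape of the returned summaries):

```ts
import type { Driver } from "zwave-js";

// Rough sketch: probe which single repeater gives the healthiest 1-hop route
// to a target node. candidateRepeaters is assumed to already contain the nodes
// in direct range of the controller.
async function probeRoutes(
  driver: Driver,
  targetNodeId: number,
  candidateRepeaters: number[],
): Promise<void> {
  const target = driver.controller.nodes.get(targetNodeId);
  if (!target) throw new Error(`Node ${targetNodeId} not found`);

  for (const repeaterId of candidateRepeaters) {
    const repeater = driver.controller.nodes.get(repeaterId);
    if (!repeater) continue;

    // controller <-> repeater leg
    const lifeline = await repeater.checkLifelineHealth(3);
    // repeater <-> target leg
    const route = await target.checkRouteHealth(repeaterId, 3);

    console.log(
      `via node ${repeaterId}: lifeline rating ${lifeline.rating}, route rating ${route.rating}`,
    );
  }
}
```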

@AlCalzone
Member

Honestly, I don't understand why it would pick such a useless hop when there are lots of other nodes in between.

Me neither, but maybe there was a situation where the controller didn't "see" 83 and node 2 did, so it took that route and stuck with it.

If I recall correctly, it should be the controller's firmware that computes these routes, right?

Correct

Can't we brute force our way to a better route?

Sounds feasible IMO. We could even pin that route down as the application priority route if the user wants. I'll track that in a separate issue.
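Once that exists, pinning could conceptually look like the snippet below; the setPriorityRoute-style call and the data rate constant are assumptions for illustration, not an API of the zwave-js version discussed in this thread:

```ts
// Hypothetical sketch only: pin the controller -> node 83 route via repeater
// node 2 at 40 kbit/s. The method name, signature and data rate constant are
// assumptions; check the zwave-js Controller docs for what actually exists.
interface PriorityRouteApi {
  setPriorityRoute(
    destinationNodeId: number,
    repeaters: number[],
    routeSpeed: number,
  ): Promise<boolean>;
}

async function pinRouteToNode83(controller: PriorityRouteApi): Promise<void> {
  const DATA_RATE_40K = 0x02; // assumed Z-Wave data rate constant for 40 kbit/s
  await controller.setPriorityRoute(83, [2], DATA_RATE_40K);
}
```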

@darkbasic
Contributor Author

Health check on node 28 seems to be fixed in git master thanks to #4494:

[screenshot: node 28 health check results]

Unfortunately healing still picks terrible routes, and we absolutely need a way to pick better ones, otherwise the whole network is basically unusable :(
