Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Watchdog heartbeat timeout #8

Closed
dartanidi opened this issue Dec 29, 2021 · 29 comments
Closed

Watchdog heartbeat timeout #8

dartanidi opened this issue Dec 29, 2021 · 29 comments
Labels
bug Something isn't working

Comments

@dartanidi
Copy link

dartanidi commented Dec 29, 2021

@syssi first of all thank you for making this Esphome firmware possible for the zb-gw03.

I moved from a Sonoff Zbbridge where I had sometime some connection stability problems looking for a more reliable solution.
Unfortunately this seems totaly unstable, with bellows reporting a lot of watchdog heartbeat timeouts in the log.

It eventually stops working with HA reporting errors like this:

Error doing job: Exception in callback ThreadsafeProxy.__getattr__.<locals>.func_wrapper.<locals>.check_result_wrapper() at /usr/local/lib/python3.9/site-packages/bellows/thread.py:97
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.9/site-packages/bellows/thread.py", line 98, in check_result_wrapper
    result = call()
  File "/usr/local/lib/python3.9/site-packages/bellows/ezsp/__init__.py", line 258, in frame_received
    self._protocol(data)
  File "/usr/local/lib/python3.9/site-packages/bellows/ezsp/protocol.py", line 138, in __call__
    frame_name = self.COMMANDS_BY_ID[frame_id][0]
KeyError: 520

which I read are tipical in the ZB to WiFi bridges.
So it looks like a connection problem, but HA PC and the bridge are connected via the same switch, and the bridge is never reported to be unavailable.

Toggling the Zigbee Reset switch solves the issue, but this problems occurs say every hour or two.
Any clue?
Thanks for your help.

@dartanidi
Copy link
Author

I forgot to say I'm using your example.yaml and Esphome v2021.12.1

@syssi
Copy link
Owner

syssi commented Dec 29, 2021

I haven't seen this issue before. My setup is running 10 days++ without issues. I will try to provide some ideas to isolate the cause:

  1. Please replace the power supply of the gateway to rule out power issues.
  2. The connection between the bridge and HA core is wired and based on a TCP connection. Packet loss on stream/socket level cannot be an issue. If it's a connection issue (reset of the first connection + reconnect + dead-lock in some state) it should be visible on network level. Let's use wireshark/tcpdump to capture the traffic between the bridge and HA core (port 6638 should be sufficient). You can stop the capture as soon the issue occurs. Could you provide this capture?
  3. Keep in mind: It's a serial connection between the zigbee module and the ESP32. If the connection between the ESP32 and the EFR32MG21 is bad the bridge will deliver garbage too.

@dartanidi
Copy link
Author

Thanks for replying syssi.

  1. I tried without luck to use a differen power supply, so it's not the case.
  2. I'm not able to sniff any packet (except MDNS and broadcast) from/to the bridge using wireshark; I have limited experience in it. Anyway looking at the ESPhome log I found that the streamingserver is reportin some client disconnections and further reconnections with the HA IP. I set the logger to very verbose to check if there is something more.

One question: have you tried your setup with the latest ESPhome version (like in my case)?
Thanks again for your help.

@syssi
Copy link
Owner

syssi commented Dec 29, 2021

I'm not able to sniff any packet (except MDNS and broadcast) from/to the bridge using wireshark; I have limited experience in it. Anyway looking at the ESPhome log I found that the streamingserver is reportin some client disconnections and further reconnections with the HA IP. I set the logger to very verbose to check if there is something more.

If your network is switched you can only see broadcasts if you use a third host to sniff traffic. You have to make sure to capture the traffic at the network card of your HA instance (using tcpdump -i eth0 -s0 -w traffic.pcap). But it looks like this isn't necessary anymore because you already know the root-cause: Connection issues.

I've updated my setup vom 2021.11.4 to esphome 2021.12.2 now. I will report tomorrow it's stable or not.

@dartanidi
Copy link
Author

Update: I flashed the bridge with the custom tasmota firmware from thehelpfulidiot, and seems to be working flawlessly.
If this is the case, looks like the stream-server component could have issues with the very recent versions of ESPhome.

I will keep you updated if the actual setup will continue working.
Thanks anyway for your help, much appreciated!

@syssi
Copy link
Owner

syssi commented Dec 30, 2021

You are right. I can see multiple reconnects using 2021.12.2:

[17:21:57][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:21:58][D][streamserver:106]: New client connected from 192.168.1.100
[17:25:37][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:25:37][D][streamserver:106]: New client connected from 192.168.1.100
[17:28:04][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:28:05][D][streamserver:106]: New client connected from 192.168.1.100
[17:35:39][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:35:40][D][streamserver:106]: New client connected from 192.168.1.100
[17:40:00][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:40:00][D][streamserver:106]: New client connected from 192.168.1.100
[17:40:12][D][streamserver:106]: New client connected from 192.168.1.100
[17:42:49][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:42:49][D][streamserver:106]: New client connected from 192.168.1.100
[17:46:59][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:46:59][D][streamserver:106]: New client connected from 192.168.1.100
[17:48:25][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:48:26][D][streamserver:106]: New client connected from 192.168.1.100
[17:50:12][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:50:12][D][streamserver:106]: New client connected from 192.168.1.100
[17:52:08][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:52:09][D][streamserver:106]: New client connected from 192.168.1.100
[17:53:35][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:53:35][D][streamserver:106]: New client connected from 192.168.1.100
[17:57:24][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:57:25][D][streamserver:106]: New client connected from 192.168.1.100
[18:06:11][D][streamserver:055]: Client 192.168.1.100 disconnected
[18:06:11][D][streamserver:106]: New client connected from 192.168.1.100
[18:12:24][D][streamserver:055]: Client 192.168.1.100 disconnected
[18:12:24][D][streamserver:106]: New client connected from 192.168.1.100
[18:22:33][D][streamserver:055]: Client 192.168.1.100 disconnected
[18:22:33][D][streamserver:106]: New client connected from 192.168.1.100
[18:27:34][D][streamserver:055]: Client 192.168.1.100 disconnected
[18:27:35][D][streamserver:106]: New client connected from 192.168.1.100
[18:29:21][D][streamserver:055]: Client 192.168.1.100 disconnected
[18:29:21][D][streamserver:106]: New client connected from 192.168.1.100

I've flashed the device using esphome 2021.11.4 again to verify this version is better for real.

@syssi
Copy link
Owner

syssi commented Dec 30, 2021

The log message at the esphome node are the same on 2021.11.4:

[18:45:43][D][streamserver:106]: New client connected from 192.168.1.100
[18:53:17][D][streamserver:055]: Client 192.168.1.100 disconnected
[18:53:17][D][streamserver:106]: New client connected from 192.168.1.100
[18:57:37][D][streamserver:055]: Client 192.168.1.100 disconnected
[18:57:37][D][streamserver:106]: New client connected from 192.168.1.100
[19:23:18][D][streamserver:055]: Client 192.168.1.100 disconnected
[19:23:19][D][streamserver:106]: New client connected from 192.168.1.100
[19:27:18][D][streamserver:055]: Client 192.168.1.100 disconnected
[19:27:19][D][streamserver:106]: New client connected from 192.168.1.100
[19:32:50][D][streamserver:055]: Client 192.168.1.100 disconnected
[19:32:51][D][streamserver:106]: New client connected from 192.168.1.100
[19:36:09][D][streamserver:055]: Client 192.168.1.100 disconnected
[19:36:09][D][streamserver:106]: New client connected from 192.168.1.100
[19:52:46][D][streamserver:055]: Client 192.168.1.100 disconnected
[19:52:47][D][streamserver:106]: New client connected from 192.168.1.100
[19:55:55][D][streamserver:055]: Client 192.168.1.100 disconnected
[19:55:55][D][streamserver:106]: New client connected from 192.168.1.100
[20:04:42][D][streamserver:055]: Client 192.168.1.100 disconnected
[20:04:42][D][streamserver:106]: New client connected from 192.168.1.100
[20:14:50][D][streamserver:055]: Client 192.168.1.100 disconnected
[20:14:50][D][streamserver:106]: New client connected from 192.168.1.100

@dartanidi
Copy link
Author

Could it be possible that the ESPhome directories need to be cleaned before downgrading the files? Otherwise it doesn't have much sense since previously it was working

@syssi
Copy link
Owner

syssi commented Dec 30, 2021

May be my HA core instance is closing the connection intentionally. I will try to monitor the HA core log as next step.

@syssi
Copy link
Owner

syssi commented Dec 31, 2021

At the home-assistant.log I can see the warnings now:

2021-12-31 10:26:01 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2021-12-31 10:26:16 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2021-12-31 10:26:31 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2021-12-31 10:26:46 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2021-12-31 10:27:01 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout:

I will capture some traffic to get a better understanding who drops the connection.

@syssi
Copy link
Owner

syssi commented Dec 31, 2021

@dartanidi Do you use ZHA or Z2M now?

@dartanidi
Copy link
Author

dartanidi commented Dec 31, 2021

I'm using ZHA, btw with the tasmota custom firmware it's working perfectly, no errors are reported in the log.

I still suspect there is an issue within the stream-server component, I'm reading the issues in oxan's repository and he said the component should have to be rewritten using the new socket abstraction instead of async_tcp

@syssi
Copy link
Owner

syssi commented Dec 31, 2021

I assume you are on the right track. Let's see! :-)

@syssi syssi added the bug Something isn't working label Dec 31, 2021
@syssi
Copy link
Owner

syssi commented Dec 31, 2021

2021-12-31 11:57:22 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2021-12-31 11:57:37 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2021-12-31 11:57:52 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2021-12-31 11:58:07 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2021-12-31 11:58:22 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2021-12-31 11:58:24 WARNING (bellows.thread_0) [bellows.uart] Reset future is None
2021-12-31 11:58:24 ERROR (MainThread) [homeassistant] Error doing job: Exception in callback ThreadsafeProxy.__getattr__.<locals>.func_wrapper.<locals>.check_result_wrapper() at /srv/homeassistant/lib/python3.8/site-packages/bellows/thread.py:97
Traceback (most recent call last):
  File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/srv/homeassistant/lib/python3.8/site-packages/bellows/thread.py", line 98, in check_result_wrapper
    result = call()
  File "/srv/homeassistant/lib/python3.8/site-packages/bellows/ezsp/__init__.py", line 258, in frame_received
    self._protocol(data)
  File "/srv/homeassistant/lib/python3.8/site-packages/bellows/ezsp/protocol.py", line 138, in __call__
    frame_name = self.COMMANDS_BY_ID[frame_id][0]
KeyError: 520

@dartanidi
Copy link
Author

Exactly what was happening to me, see first post

@syssi
Copy link
Owner

syssi commented Dec 31, 2021

In short: I'm able to reproduce the issue now. ;-)

I captured some traffic. It looks like the connection is terminated by the watchdog because of no(?) serial/ezsp traffic.

https://github.com/zigpy/bellows/blob/dev/bellows/zigbee/application.py#L44-L50
https://github.com/zigpy/bellows/blob/dev/bellows/zigbee/application.py#L696-L752

@syssi
Copy link
Owner

syssi commented Dec 31, 2021

Quick update: It looks like if you downgrade esphome to 2021.9.3 the old stream server implementation is used. This implementation didn't throw any errors in about 6 hours.

@alelucc
Copy link

alelucc commented Jan 22, 2022

Hi,
i'm having this same issue but i am unable to easily rollback to 2021.9.3. Is there any new solution to this issue?
I was running 2021.12.3 and i was able to recover the 2021.11.1 backup and same issue.

Thanks

@syssi
Copy link
Owner

syssi commented Jan 22, 2022

It's pretty easy to install an old esphome version if you have access to a linux host. Do you need some instructions?

@alelucc
Copy link

alelucc commented Jan 22, 2022

If you are so kind to provide to me it wold be great!! Thanks!!!!
Btw i can imagine that i have to install the 2021.9.3 version on my linux distro and "update" the coordinator with that version, right?

@syssi
Copy link
Owner

syssi commented Jan 22, 2022

Yes. Just pick a linux machine, make sure the python package manager pip3 is available. If not please install the package manager. On debian/ubuntu it looks like this: sudo apt install python3-pip. As soon the python package manager is available you are able to install any esphome version by:

pip3 install esphome==2021.9.3

It's possible the path to the esphome CLI command isn't part of your $PATH. Just call the command with the full path:

~/.local/bin/esphome version

This should return 2021.9.3. Now you are able to install/flash your gateway:

~/.local/bin/esphome run your-configuration.yaml

You could upgrade the esphome install later on to the most recent version:

pip3 install esphome -U

But keep in mind to downgrade the version again if you want to update/improve your gateway configuration.

@alelucc
Copy link

alelucc commented Jan 22, 2022

Thank you so much!! It worked!! Now i have to remember to not "Update All" on the esphome dashboard :D
Now it's not necessary to reflash the zigbee module, rigth?

@syssi
Copy link
Owner

syssi commented Jan 22, 2022

Did you flash your ZigBee module ever? It's independent from the ESPHome version. The ZigBee module is equipped with the last flashed firmware.

@alelucc
Copy link

alelucc commented Jan 23, 2022

Did you flash your ZigBee module ever? It's independent from the ESPHome version. The ZigBee module is equipped with the last flashed firmware.

Yes, i followed the entire guide and the coordinator was working but poorly with EspHome 2021.12.3.

@syssi
Copy link
Owner

syssi commented Jan 23, 2022

Could you create a new issue first? We should avoid off-topic questions here.

@syssi
Copy link
Owner

syssi commented Mar 12, 2022

It looks like the issue is gone. This user don't have trouble with a recent esphome version: https://community.home-assistant.io/t/zb-gw03-ewelink-ethernet-zigbee-gateway-now-hacked-with-tasmota-and-esphome-so-can-be-used-via-mqtt-or-as-a-remote-zigbee-lan-adapter-with-home-assistants-zha-integration-or-zigbee2mqtt/341223/59

@dartanidi
Copy link
Author

@syssi
oxan has updated the stream server component, now your ESPhome firmware works as expected, much better than the Tasmota one.
Now your code can be used as is without any issue.
Thanks!

@syssi
Copy link
Owner

syssi commented Feb 15, 2023

Cool! I will update the project as soon as possible. I would be happy about a pull request too!

@dartanidi
Copy link
Author

Cool! I will update the project as soon as possible. I would be happy about a pull request too!

There is no no need, your yaml files are working out of the box since the component works exactly as the previous version.
I'm testing everything using the latest ESPhome version, and no issues since more than one day

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants