Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Application should re-establish connection to uart after error #37

Closed
rolf-tx opened this issue Aug 4, 2017 · 19 comments
Closed

Application should re-establish connection to uart after error #37

rolf-tx opened this issue Aug 4, 2017 · 19 comments

Comments

@rolf-tx
Copy link

rolf-tx commented Aug 4, 2017

Running HA in docker session on Synology NAS bellows will stop showing any sign of life in log after 1-2 hours.
Restarting HA will make zigbee devices and bellows work for another 1-2 hours.

The setup has two zigbee motion sensors and a zigbee switch (for repeating), and a zwave switch.

To eliminate docker/NAS issues tried running natively on a Pine64 arm sbc (HA 0.50.2).
Mostly bellows will stop working after 1-2 hours.
This morning I had one motion sensor and the switch still alive after 8 hours, but one motion sensor was dead.

Would you please advice how I can help debug this issue.
Is there any way of bellows to give more more info or some kind of heartbeat?

This in continuation of: https://community.home-assistant.io/t/0-49-zigbee-stopped-working-after-adding-smart-socket-packet-routing-issue/22453

@rolf-tx
Copy link
Author

rolf-tx commented Aug 4, 2017

I noticed I could control the zigbee switch, but not get status.
Restarted HA, and now switch reported status.

In conclusion I think Pine64 is giving slightly better reliability, but still has some issues.
Are there any timing critical operations for bellows talking to the usb stick?

@rolf-tx
Copy link
Author

rolf-tx commented Aug 4, 2017

Updated to branch no-removal.

I'm thinking this may work :-)

2017-08-04 12:26:33 DEBUG (MainThread) [bellows.uart] Data frame: b'1451b1ed502a15b459944a2dab55922f63f3e30d12316e8e00396788fd673fa7e3437e'
2017-08-04 12:26:33 DEBUG (MainThread) [bellows.uart] Sending: b'82503a7e'
2017-08-04 12:26:33 DEBUG (MainThread) [bellows.uart] Sending: b'4256219c54eeb3b25992ca25aa1593499c4920adeace678bfcc791847e'
2017-08-04 12:26:33 DEBUG (MainThread) [bellows.uart] Data frame: b'2556a19c5451b99b7e'
2017-08-04 12:26:33 DEBUG (MainThread) [bellows.uart] Sending: b'83401b7e'
2017-08-04 12:26:33 DEBUG (MainThread) [bellows.uart] Data frame: b'3556b19754eeb3b25992ca25aa1593499c3520abed233a7e'
2017-08-04 12:26:33 DEBUG (MainThread) [bellows.uart] Sending: b'8430fc7e'
2017-08-04 12:26:33 WARNING (MainThread) [bellows.zigbee.application] Unexpected message send notification
2017-08-04 12:26:35 DEBUG (MainThread) [bellows.uart] Data frame: b'4556b1ed542e14ab59954b65ab55922e63f0e30d12316b8ab7c763563e7f3fb7bbcdcc84ee7e'
2017-08-04 12:26:35 DEBUG (MainThread) [bellows.uart] Sending: b'8520dd7e'
2017-08-04 12:26:38 DEBUG (MainThread) [bellows.uart] Data frame: b'5556b1ed542e14ab59954b65ab55922163f0e30d12316b8ab6c763563e7f3fb7bbcdcc7a917e'
2017-08-04 12:26:38 DEBUG (MainThread) [bellows.uart] Sending: b'8610be7e'
2017-08-04 12:26:40 DEBUG (MainThread) [bellows.uart] Data frame: b'6556b1ed542e14ab59954b65ab55922063f3e30d12316b8ab1c763563e7f3fb7bbcdcc52577e'
2017-08-04 12:26:40 DEBUG (MainThread) [bellows.uart] Sending: b'87009f7e'
2017-08-04 12:26:43 DEBUG (MainThread) [bellows.uart] Data frame: b'7556b1ed542e14ab59954b65ab55922363f0e30d12316b8ab0c763563e7f3fb7bbcdccc72f7e'
2017-08-04 12:26:43 DEBUG (MainThread) [bellows.uart] Sending: b'8070787e'
2017-08-04 12:26:46 DEBUG (MainThread) [bellows.uart] Data frame: b'0556b1ed542e14ab59954b65ab55922263f0e30d12316b8ab3c763563e7f3fb7bbcdccf0847e'
2017-08-04 12:26:46 DEBUG (MainThread) [bellows.uart] Sending: b'8160597e'
2017-08-04 12:28:00 DEBUG (MainThread) [bellows.uart] Data frame: b'1556b1ed502a15b459944a2dab5592ac63f016c212316ebc00396788fd673fa7bfc87e'

@rolf-tx
Copy link
Author

rolf-tx commented Aug 4, 2017

Burst of error frames:

2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Data frame: b'2c59b1ed502a15b259944a2dae55923b61f7720b12316c22fcc69e03fcd66d77ebcde6837e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Sending: b'83401b7e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Data frame: b'3c59b5ca4c903eb659fb472566747e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Sending: b'8430fc7e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-04 13:28:59 ERROR (MainThread) [homeassistant.core] Timer got out of sync. Resetting
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-04 13:28:59 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
...
2017-08-04 13:45:51 DEBUG (MainThread) [bellows.uart] Sending: b'445e219c54eeb3b658924a24ab1593499c4128a8ecc166b1407e'
2017-08-04 13:45:51 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'

@rcloran
Copy link
Collaborator

rcloran commented Aug 5, 2017

Do you always get these errors when it stops working? I'm guessing no based on some of the other information you provided.

@rolf-tx
Copy link
Author

rolf-tx commented Aug 6, 2017

normal.log.txt
@rcloran
I get the "[bellows.zigbee.application] Unexpected message send notification" warning around every 5 minutes. Does not seem related to devices working or not.
Update: After I redid the setup I'm not seeing the frame errors anymore @8MIN since restart.

Good progress with no-removal branch. I don't loose access to all devices any more.
It has run ok for 18hours with only temporary loss of one motion sensor for a few hours.

I'm again suspecting smart-socket acting as repeater, as it is the sensor furthest away that has the issue, I noticed a lot of time lag on the motion sensor furthest away.

I have restarted from scratch, deleting zigbee.db, and now only have the two motion sensors to see how things work for the next 12 hours.

Does bellows support a way to remove devices from zigbee.db ?

Do you recommend a particular set of log settings for debugging zigbee?

I have attached the setup log, and a normal startup log
setup_motion.log.txt
normal.log.txt

@rolf-tx
Copy link
Author

rolf-tx commented Aug 7, 2017

@rcloran Apparently I was wrong, the Frame Errors do seem to appear just as it stops responding.
This time it kept working for 12+ hours
Full log attached.

2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Data frame: b'5e67b5ca4c903eb659fb47257f437e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Sending: b'8610be7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.ezsp] Application frame 98 (incomingSenderEui64Handler) received
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Data frame: b'6e67b1ed502a15b259944a2dae5592bf638cc1d812316c55fcc616fc8ed66d77ebcdce517e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Sending: b'87009f7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.ezsp] Application frame 69 (incomingMessageHandler) received
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.zigbee.zdo] [0x73e6:zdo] ZDO request 0x0000: [d0:52:a8:72:75:75:00:01, 0, 0]
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Data frame: b'7e67b5ca4c903eb659fb472511e37e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Sending: b'8070787e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.ezsp] Application frame 98 (incomingSenderEui64Handler) received
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Data frame: b'3e67b5ca4c903eb659fb4725cca37e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Sending: b'8430fc7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.ezsp] Application frame 98 (incomingSenderEui64Handler) received
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Data frame: b'4e67b1ed502a15b259944a2dae5592bc638cc1d812316c56fcc69e03fcd66d77ebcdeaaf7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Sending: b'8520dd7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.ezsp] Application frame 69 (incomingMessageHandler) received
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.zigbee.zdo] [0x73e6:zdo] ZDO request 0x0000: [d0:52:a8:00:8a:fd:00:01, 0, 0]
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Data frame: b'5e67b5ca4c903eb659fb47257f437e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Sending: b'8610be7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.ezsp] Application frame 98 (incomingSenderEui64Handler) received
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Data frame: b'6e67b1ed502a15b259944a2dae5592bf638cc1d812316c55fcc616fc8ed66d77ebcdce517e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Sending: b'87009f7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.ezsp] Application frame 69 (incomingMessageHandler) received
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.zigbee.zdo] [0x73e6:zdo] ZDO request 0x0000: [d0:52:a8:72:75:75:00:01, 0, 0]
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Data frame: b'7e67b5ca4c903eb659fb472511e37e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Sending: b'8070787e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.ezsp] Application frame 98 (incomingSenderEui64Handler) received
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-07 14:09:16 ERROR (MainThread) [homeassistant.core] Timer got out of sync. Resetting
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2017-08-07 14:09:16 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'

stopped_working.log.txt

@rcloran rcloran changed the title Bellows stops reporting/controlling zigbee devices after 1-2 hrs Application should re-establish connection to uart after error Aug 11, 2017
@rcloran
Copy link
Collaborator

rcloran commented Aug 11, 2017

These error frames are for error 0x51, described as "Error: Exceeded maximum ACK timeout count" in the documentation (https://www.silabs.com/documents/public/user-guides/UG101.pdf). This is an unrecoverable error. bellows writes the ACK as soon as it reads the frame. Given all the timestamps in the logs you've provided are in the same second, one of your hass components is probably blocking on the event loop.

bellows could possibly improve in a couple of ways here -- re-establishing the connection to the NCP, and/or running the uart communication in a separate thread. I'm unlikely to work on either of these in the foreseeable future. I suggest you hunt down the component that's blocking the event loop and get that fixed.

@rolf-tx
Copy link
Author

rolf-tx commented Aug 13, 2017

@rcloran Thanks for explaining cause and possible fixes.
I will patiently wait for a fix.

I have very few components loaded (frontend, updater, zwave, recorder, history, logger, sun, ios).
Possibly file access speed could be an issue. I will try the armbian 3.10 legacy kernel to see if it make a difference.

@dshokouhi
Copy link

Hello, I believe I am also experiencing this error as I am finding myself needing to restart home assistant every few hours to gain control my zigbee devices. Is there anything I can do to find the cause of this error?

@ryanwinter
Copy link

Running into the same issue, devices stop responding after a few hours. Captured a few logs and I always see the Error frame: b'c20251a8bd7e' before it stops responding.

2018-01-28 14:53:00 INFO (MainThread) [homeassistant.core] Bus:Handling <Event call_service[L]: domain=homeassistant, service_data=entity_id=light.ikea_of_sweden_tradfri_bulb_e26_opal_1000lm_fe481ca1_1, service_call_id=34504723536-119, service=turn_on>
2018-01-28 14:53:00 INFO (MainThread) [homeassistant.core] Bus:Handling <Event call_service[L]: domain=light, service_data=entity_id=['light.ikea_of_sweden_tradfri_bulb_e26_opal_1000lm_fe481ca1_1'], service_call_id=34504723536-120, service=turn_on>
2018-01-28 14:53:01 DEBUG (MainThread) [bellows.ezsp] Send command sendUnicast
2018-01-28 14:53:01 DEBUG (MainThread) [bellows.uart] Sending: b'10db219c545500b6589c4a24ab1593499cc4adadec446325f8c6ad7d337e'
2018-01-28 14:53:01 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2018-01-28 03:06:25 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2018-01-28 03:06:25 DEBUG (MainThread) [bellows.uart] Data frame: b'71dab1ed502a15b459944a2dab5592db07e69c4d12316ec700396788fd673fa71e477e'
2018-01-28 03:06:25 DEBUG (MainThread) [bellows.uart] Sending: b'8070787e'
2018-01-28 03:06:25 INFO (Thread-26) [pynetgear] Get attached devices
2018-01-28 03:06:25 DEBUG (MainThread) [bellows.ezsp] Application frame 69 (incomingMessageHandler) received
2018-01-28 03:06:26 DEBUG (MainThread) [bellows.zigbee.zdo] [0xe6bb:zdo] ZDO request 0x0006: [65533, 260, [25], []]
2018-01-28 03:06:26 DEBUG (MainThread) [bellows.ezsp] Send command sendUnicast
2018-01-28 03:06:27 ERROR (MainThread) [homeassistant.core] Timer got out of sync. Resetting
2018-01-28 03:06:28 WARNING (MainThread) [homeassistant.helpers.entity] Update of sensor.yweather_temperature_min is taking over 10 seconds
2018-01-28 03:06:29 WARNING (MainThread) [homeassistant.helpers.entity] Update of binary_sensor.living_room_thermostat_online is taking over 10 seconds
2018-01-28 03:06:29 WARNING (MainThread) [homeassistant.helpers.entity] Update of sensor.smartthings_multiv4_010658b7_1 is taking over 10 seconds
2018-01-28 03:06:30 WARNING (MainThread) [homeassistant.helpers.entity] Update of sensor.smartthings_multiv4_010789d1_1 is taking over 10 seconds
2018-01-28 03:06:30 WARNING (MainThread) [homeassistant.helpers.entity] Update of sensor.smartthings_multiv4_0108e4e7_1 is taking over 10 seconds
2018-01-28 03:06:30 WARNING (MainThread) [homeassistant.helpers.entity] Update of sensor.smartthings_multiv4_0108bc64_1 is taking over 10 seconds
2018-01-28 03:06:31 WARNING (MainThread) [homeassistant.helpers.entity] Update of sensor.smartthings_motionv4_0105a6cb_1 is taking over 10 seconds
2018-01-28 03:06:31 WARNING (MainThread) [homeassistant.helpers.entity] Update of sensor.smartthings_multiv4_010658b8_1 is taking over 10 seconds
2018-01-28 03:06:31 WARNING (MainThread) [homeassistant.helpers.entity] Update of sensor.smartthings_multiv4_01078907_1 is taking over 10 seconds
2018-01-28 03:06:32 WARNING (MainThread) [homeassistant.helpers.entity] Update of binary_sensor.smartthings_multiv4_01078907_1 is taking over 10 seconds
2018-01-28 03:06:32 WARNING (MainThread) [homeassistant.helpers.entity] Update of binary_sensor.smartthings_multiv4_0108e4e7_1 is taking over 10 seconds
2018-01-28 03:06:32 WARNING (MainThread) [homeassistant.helpers.entity] Update of binary_sensor.smartthings_multiv4_010658b7_1 is taking over 10 seconds
2018-01-28 03:06:33 WARNING (MainThread) [homeassistant.helpers.entity] Update of binary_sensor.smartthings_multiv4_0108bc64_1 is taking over 10 seconds
2018-01-28 03:06:34 WARNING (MainThread) [homeassistant.helpers.entity] Update of binary_sensor.smartthings_multiv4_010658b8_1 is taking over 10 seconds
2018-01-28 03:06:34 WARNING (MainThread) [homeassistant.helpers.entity] Update of binary_sensor.smartthings_motionv4_0105a6cb_1 is taking over 10 seconds
2018-01-28 03:06:34 WARNING (MainThread) [homeassistant.helpers.entity] Update of binary_sensor.smartthings_multiv4_010789d1_1 is taking over 10 seconds
2018-01-28 03:06:35 WARNING (MainThread) [homeassistant.helpers.entity] Update of switch.smartthings_outletv4_0106ffc8_1 is taking over 10 seconds
2018-01-28 03:06:37 WARNING (MainThread) [homeassistant.helpers.entity] Updating state for sensor.smartthings_multiv4_010658b7_1 (<class 'homeassistant.components.sensor.zha.TemperatureSensor'>) took 1.229 seconds. Please report platform to the developers at https://goo.gl/Nvioub
2018-01-28 03:06:52 WARNING (MainThread) [homeassistant.helpers.entity] Updating state for switch.smartthings_outletv4_0106ffc8_1 (<class 'homeassistant.components.switch.zha.Switch'>) took 0.873 seconds. Please report platform to the developers at https://goo.gl/Nvioub
2018-01-28 03:07:07 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'
2018-01-28 03:07:08 ERROR (MainThread) [homeassistant.core] Timer got out of sync. Resetting

@Photo64
Copy link

Photo64 commented Jan 30, 2018

I am also now having this issue after moving from 0.57.2 to 0.61.1. I am not running in a docker container though. I see mostly these warnings every 1-5 minutes. Along with the occasional "timer got out of sync resetting" although they seem to happen in bunches and then I don't see them for awhile.

2018-01-30 13:25:36 WARNING (MainThread) [bellows.zigbee.application] Unexpected message send notification
2018-01-30 13:26:48 WARNING (MainThread) [bellows.zigbee.application] Unexpected message send notification
2018-01-30 13:29:30 WARNING (MainThread) [bellows.zigbee.application] Unexpected message send notification
2018-01-30 13:30:16 WARNING (MainThread) [bellows.zigbee.application] Unexpected message send notification
2018-01-30 13:35:33 WARNING (MainThread) [bellows.zigbee.application] Unexpected message send notification

@therve
Copy link

therve commented Jun 11, 2018

I have the same issue with current hass version. It happens randomly every few hours. Do you know how we can debug the component that's causing this? Thanks.

@StephenWetzel
Copy link

I have this issue as well. It seems to have started around HA 0.71 when power monitoring and battery level reporting was added (home-assistant/core#14561). I just updated to HA 0.78 with bellows 0.7 and zigpy 0.2 and that seems to have had no affect.

It always starts with a ERROR (MainThread) [homeassistant.core] Timer got out of sync. Resetting message, and then repeated WARNING (MainThread) [homeassistant.components.sensor] Updating zha sensor took longer than the scheduled update interval 0:00:30. With debugging turned on I see DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'.

Here are two recent logs:
https://pastebin.com/qkNc04ih
https://pastebin.com/zxzWXwRV

I originally posted in this issue but I see now this one my real issue. @rcloran I'm not sure this is an issue with another component. It seems to be getting worse as I add zigbee devices to my network. I recently removed almost all my components and that seems to have made no difference.

All my zigbee devices are Sengled bulbs or Centralite outlets, but I have about 5 of each now, and the problem is making HA almost unusable; the zigbee network crashes at least once an hour. I have a script that checks for the Updating zha sensor took longer than the scheduled update interval message and restarts HA, which helps, but it breaks any automations that rely on timed conditions, as well as making the states of lights wrong.

I'm running Home Assistant on a Raspberry Pi 3B+ (via Hassbian) and am using the HUSBZB-1 stick. I'm curious if anyone else is experiencing this with different hardware? If this is a component is there any way we can attempt to debug which component it is? Is there some way we can restart just zigpy/bellows without restarting Home Assistant (that would help me at least)? Any other ideas for how we can about debugging this in general?

@Adminiuga
Copy link
Collaborator

@StephenWetzel when did you get the HUSBZB-1 stick? People were reporting issues similar to yours with sticks obtained in Apr--Jun 2018 IIRC which were solved after stick were replaced. Something about improperly loaded firmware.
Another possibility is UART port reset on Pi as those communication are a bit time sensitive. Have you tried zha_new with Yoda-x's zigpy/bellows? He had some changes to uart handling incorporated into his code. Otherwise as an experiment I would try running HA on a more powerful host or maybe a VM, just to eliminate UART timing as a possibility.

@StephenWetzel
Copy link

@Adminiuga Thanks for the feedback. I got my stick November 2017, and it was fine from then until around June when HA 0.71 was released. So, I don't think it's that, but perhaps. Wonder if there's any way to check or update the firmware on there.

I was looking into Yoda-x's versions, that's probably my next step. If that doesn't work I suppose I may try running HA on my desktop.

I wonder if there's any way for me to restart the device itself without restarting HA. Although I suppose bellows would have to support recovering from that still.

@Adminiuga
Copy link
Collaborator

@StephenWetzel I've got mine end of Jan'18 so I don't think it is the version. Not sure how to check for firmware update, couldn't find anything on vendor site.

There's really no way to restart the device and really the problem is in zha/bellows and my guess bellows really should reinitialize UART and reset the NCP, once it starts receiving 'error' frames

But it is also interesting that reports were around version 0.71, unless I was reading your reports in the forum :)

@Yoda-x
Copy link
Contributor

Yoda-x commented Sep 25, 2018

Imho, the problem is the pi and python with asyncio. the ncp has some strict timing requirements expecting an ack for every send packet with a defined latency. asynio is not made for low latency, its for massive parallel tasks where latency is not an issue. Also hass uses only one cpu, even its multi-threaded. On a faster cpu this seems no issue. I came from an intel nuc and the problem started right after switching to a 3b+. Even if the cpu is most of the time idle, it happened that the uart throw this ack-retransmit error right after a restart or sometime during night, mostly together with an DB error or a timeout with a query for a router. Maybe the daily DB cleanup, don't know. I get this error more often when I enable the device discovery for routers.
I do a restart and full init of the uart every time this error comes. this needed a lot of changes in the uart code. I assume the right way to fix would be

  • catch the error and restart the NCP,
  • make the uart process multi-threaded to get the uart process more often running with a reliable schedule,
  • use a RT kernel

There also some tuning options, as using on cpu exlusive for hass, thus it has no context switches to other cpus and other processes can't steal precious cpu cycles.
PS: also using uvloop instead pythons default event loop is recommended, you can easily install it with pip

@jstys
Copy link

jstys commented Sep 27, 2018

@StephenWetzel I am experiencing the same exact problem where I get an error frame on the Centralite 3210-L and afterwards the zha component fails to work anymore without reboot.

Would this benefit from running as its own standalone process? I've considered switching to the zigbee2mqtt route because I could see the benefit of isolating the zigbee stack to have its own dedicated resources. I haven't made the switch because I am already invested in using the Linear HUSBZB-1.

I'm not really familiar with the codebase but potentially there could be a standalone python mqtt client that acts as a proxy between bellows and home assistant so that the zigbee stack isn't subject to the home assistant event loops.

@Adminiuga
Copy link
Collaborator

Fixed in #147

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants