Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reset EZSP stack on error frames or missing heartbeats #147

Merged
merged 14 commits into from
Mar 27, 2019

Conversation

Adminiuga
Copy link
Collaborator

This is one of the fixes targeted to address #124

  1. Reset EZSP if we receive error frames
  2. Reset EZSP if there're missing heartbeats (EZSP nop command) from EZSP
  3. Reset EZSP if serial connection is lost

in my test it would successfully reconnect on USB stick un-plug/plug back in. I was able to reproduce the Error farme with "Maximum NAK exceeded" condition, by artificially suppressing sending ACKs to NCP and successfully recovering from it.
With that said, I'm looking for volunteers to test drive it and additional feedback.

@walthowd
Copy link

I can throw this on my old problematic system early next week and do some testing for you.

@Adminiuga
Copy link
Collaborator Author

@walthowd I appreciate the feedback. Keep in mind this PR addresses one area only (resseting NCP on error frame) so I does not include any other work arrounds related to pending sequence handling.

@Adminiuga
Copy link
Collaborator Author

and please also keep in mind that HA dev tree depends on bellows-homeassistant now so you may need to update the dependencies in homeassistant/components/zha/__init__.py

@Adminiuga Adminiuga changed the title RFC: Reset EZSP stack on error frames or missing heartbeats WIP & RFC: Reset EZSP stack on error frames or missing heartbeats Mar 1, 2019
@Adminiuga Adminiuga changed the title WIP & RFC: Reset EZSP stack on error frames or missing heartbeats RFC: Reset EZSP stack on error frames or missing heartbeats Mar 1, 2019
@Adminiuga Adminiuga changed the title RFC: Reset EZSP stack on error frames or missing heartbeats RFC & WIP: Reset EZSP stack on error frames or missing heartbeats Mar 2, 2019
@Adminiuga
Copy link
Collaborator Author

please merge after PR #150 which should make code much cleaner

@walthowd
Copy link

walthowd commented Mar 5, 2019

@Adminiuga Running on my old system today, I pulled the USB adapter out at 12:09:53 and it was successfully reset and ZHA commands continued to work after it was reset.

I'll leave it running longer today and see if I can get a real UART error

https://www.dropbox.com/s/ysiaa1sipy44xsd/home-assistant-serial-reset.txt?dl=0

@Adminiuga
Copy link
Collaborator Author

Adminiuga commented Mar 5, 2019

That's really good feedback and offers some more visibility into what was happening (but not why)

2019-03-05 12:09:38 ERROR (MainThread) [bellows.uart] Lost serial connection: read failed: [Errno 6] Device not configured
2019-03-05 12:09:38 DEBUG (MainThread) [bellows.ezsp] /dev/tty.GoControl_zigbee connection lost unexpectedly: read failed: [Errno 6] Device not configured
2019-03-05 12:09:38 ERROR (MainThread) [bellows.ezsp] NCP entered failed state. Requesting APP controller restart
2019-03-05 12:09:38 DEBUG (MainThread) [bellows.zigbee.application] Resetting ControllerApplication. Cause: 'Serial connection loss: read failed: [Errno 6] Device not configured'
2019-03-05 12:09:38 DEBUG (MainThread) [bellows.ezsp] Reconnecting /dev/tty.GoControl_zigbee serial port on 57600 bauds
2019-03-05 12:09:38 DEBUG (MainThread) [bellows.zigbee.application] ControllerApplication reset unsuccessful: [Errno 2] could not open port /dev/tty.GoControl_zigbee: [Errno 2] No such file or directory: '/dev/tty.GoControl_zigbee'
2019-03-05 12:09:43 DEBUG (MainThread) [bellows.ezsp] Reconnecting /dev/tty.GoControl_zigbee serial port on 57600 bauds
2019-03-05 12:09:43 DEBUG (MainThread) [bellows.zigbee.application] ControllerApplication reset unsuccessful: [Errno 2] could not open port /dev/tty.GoControl_zigbee: [Errno 2] No such file or directory: '/dev/tty.GoControl_zigbee'
 ...[skipping  some noise] ...
2019-03-05 12:09:48 DEBUG (MainThread) [bellows.ezsp] Reconnecting /dev/tty.GoControl_zigbee serial port on 57600 bauds
2019-03-05 12:09:48 DEBUG (MainThread) [bellows.zigbee.application] ControllerApplication reset unsuccessful: [Errno 2] could not open port /dev/tty.GoControl_zigbee: [Errno 2] No such file or directory: '/dev/tty.GoControl_zigbee'
2019-03-05 12:09:53 DEBUG (MainThread) [bellows.ezsp] Reconnecting /dev/tty.GoControl_zigbee serial port on 57600 bauds
2019-03-05 12:09:53 DEBUG (MainThread) [bellows.ezsp] Resetting EZSP
2019-03-05 12:09:53 DEBUG (MainThread) [bellows.uart] Resetting ASH
2019-03-05 12:09:53 DEBUG (MainThread) [bellows.uart] Sending: b'1ac038bc7e'
2019-03-05 12:09:54 DEBUG (MainThread) [bellows.uart] RSTACK Version: 2 Reason: RESET_SOFTWARE frame: b'c1020b0a527e'
2019-03-05 12:09:54 DEBUG (MainThread) [bellows.ezsp] Send command version

For whatever reason it looks like the serial port is disconnected, like when the device is unplugged. You get that weird device not configure error on connection loss and then the actual device file is gone for like 12s before HASS can access the serial device TTY again. I don't know why OS looses the device, but bellows did not handle serial port disconnects at all until this PR
I also think the EZSP itself was not resetting, just the serial connection flipping off and on. That could explain why are never were seeing some of the messageSent callback frames causing TSN leak.

PS: this does not mean that exactly the same issue was happening to everyone else with the similar symptoms.

@Adminiuga
Copy link
Collaborator Author

@walthowd do you mind runing PR 150, 149 and 148 mashup branch? It has all three PRs mashed together and shouldn't leak any TSNs

@walthowd
Copy link

walthowd commented Mar 6, 2019

@Adminiuga I pulled the mashup branch and ran the system for another few hours -- no natural occurring errors but I pulled the USB again at 19:26:21 - same behaviour as before.

https://www.dropbox.com/s/vu6kdqud9yv07wk/home-assistant-serial-reset.log.gz?dl=0

It looks like the mashup branch does not have the extra logging to see if any sequence IDs are leaking?

@Adminiuga
Copy link
Collaborator Author

Nah, don't have any extra debug logging in this branch as I'm hope to get it included into upstream. I'd be very surprised if it leaks any TSNs, as everything is cleaned up by the "context manager" exception or no exception.

This line pushes new "pending" request to self._pending so would be a good place to check for other pending TSNs. Something like

for tsn, req in self._pending.items():
    LOGGER.debug("Currently pending TSN %03i: Send: %s, Reply: %s", tsn, req.send, req.reply)
LOGGER.debug("Adding new TSN %03i", self.sequence

Make ControllerApplication.initialize() a retryable method.
Add bellows specific exceptions.
Handle EzspError exception in requests.
Refactor ControllerApplication reset flow.
Reset ControllerApplication on watchod failures.
Allow UART reconnects.
Initiate ControllerApplication reset on UART connection loss.
Serial connection loss restarts reset task.
Error frames do not pre-empts a running reset task.
If we receive a  Reset ACK frame to a reset we've not initiated, then
request ControllerApplication reset.
New ControllerError exception.
Execute broadacast() and/or request() only if ControllerApplication is
running.
@dmulcahey
Copy link
Contributor

@damarco any chance we can get this in a release in time for .91?

@Adminiuga
Copy link
Collaborator Author

give me two days, need to do some more testing, but so far I was satisfied with the results on Pi2

@Adminiuga Adminiuga force-pushed the fixes/reset-ncp branch 4 times, most recently from 8e2606b to ddbfa2e Compare March 23, 2019 00:17
@Adminiuga
Copy link
Collaborator Author

I think this is the version I want to go with. Will be running it for a few days and remove WIP tag once the dust settles.

@Adminiuga Adminiuga changed the title RFC & WIP: Reset EZSP stack on error frames or missing heartbeats Reset EZSP stack on error frames or missing heartbeats Mar 25, 2019
@Adminiuga
Copy link
Collaborator Author

@damarco @dmulcahey I'm ready with this one.
For the note: this still doesn't make Pi2 a great platform for running ZHA, but at least now it can run more or less.

Properly handle replies arriving past EZSP cmd timeout.
Stop watchdog on controller shutdown.
@damarco damarco merged commit 55a1345 into zigpy:master Mar 27, 2019
@damarco
Copy link
Contributor

damarco commented Mar 27, 2019

Thanks!

@rjgrandy
Copy link

I believe I have similar issue from #124 (zigbee devices initially work, after some time they stop), is there a way I can test this new code on my system? I can provide feedback on the the effect.

@Adminiuga Adminiuga deleted the fixes/reset-ncp branch March 27, 2019 23:46
@Adminiuga
Copy link
Collaborator Author

@rjgrandy what hardware are you running on and what environment. I hope this would make into 0.92

@rjgrandy
Copy link

@Adminiuga I am using a HUSBZB with a RPI3B+, I have everything running using docker.

I get a number of the below errors:

2019-03-26 07:17:53 DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e'

Then I start getting the below message over and over:

2019-03-26 07:18:10 WARNING (MainThread) [homeassistant.components.sensor] Updating zha sensor took longer than the scheduled update interval 0:00:30

@Adminiuga
Copy link
Collaborator Author

@rjgrandy looks like it would be in HA 0.91 next Wed. It is in beta now.

@rjgrandy
Copy link

@Adminiuga Ok awesome, if I install one of the betas now, I can check to see if it helps my issue?

@Adminiuga
Copy link
Collaborator Author

Adminiuga commented Mar 29, 2019 via email

@rjgrandy
Copy link

@Adminiuga I pulled the beta2, so far my zigbee devices do seem to be able to maintain controllability, however it is putting out a lot more to the log than before (i do have all the relevant zigbee on debug in the logger) the zigbee devices appear to react slowly and it looks like it is constantly resetting EZSP and ASH.

2019-03-30 01:32:47 DEBUG (MainThread) [bellows.ezsp] Reconnecting /dev/zigbee serial port on 57600 bauds
2019-03-30 01:32:47 DEBUG (MainThread) [bellows.ezsp] Resetting EZSP
2019-03-30 01:32:47 DEBUG (MainThread) [bellows.uart] Resetting ASH
2019-03-30 01:32:47 DEBUG (MainThread) [bellows.uart] Sending: b'1ac038bc7e'

@rjgrandy
Copy link

Just to follow up on my last comment, I installed beta3 this morning, disabled all logger debug output for zigbee related things, deleted the home-assistant_v2.db (it was almost 3GB). After this, so far the zigbee devices seem pretty stable and responsive.

Also I like the new ZHA "Add Devices" screen :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants