Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix UART port close on RSTACK message during startup #459

Merged
merged 5 commits into from
May 10, 2022

Conversation

agners
Copy link
Contributor

@agners agners commented May 10, 2022

When a RSTACK message is processed right after the UART has been opened,
it causes EZSP.enter_failed_state() getting called at a point where the
application callbacks are not registered yet. In that case the UART
will get closed and it won't get opened again. Bellows is stuck with a
closed transport.

Avoid this issue by not closing the port in case there is no application
callback registered yet.

Typically, it is unlikely that a RSTACK message arrives right when
the port gets opened (the race window is very narrow). However, with
hardware flow control opening the port leads to RTS signal to get
asserted which causes the radio to send pending messages, e.g. resets
caused by EmberZNet watchdog.

Note: With hardware flow control this is only the case if the tty "hupcl"
option is set. The option is set by default, but cleared by tools like
GNU screen. This option makes sure that the RTS signal is deasserted
while the port is closed. Pyserial/bellows does not change the state
of that option.

Fixes: home-assistant/core#71575

When a RSTACK message is processed right after the UART has been opened,
it causes EZSP.enter_failed_state() getting called at a point where the
application callbacks are not registered yet. In that case the UART
will get closed and it won't get opened again. Bellows is stuck with a
closed transport.

Avoid this issue by not closing the port in case there is no application
callback registered yet.

Typically, it is unlikely that a RSTACK message arrives right when
the port gets opened (the race window is very narrow). However, with
hardware flow control opening the port leads to RTS signal to get
asserted which causes the radio to send pending messages, e.g. resets
caused by EmberZNet watchdog.

Note: With hardware flow control this is only the case if the tty "hupcl"
option is set. The option is set by default, but cleared by tools like
GNU screen. This option makes sure that the RTS signal is deasserted
while the port is closed. Pyserial/bellows does not change the state
of that option.
@agners
Copy link
Contributor Author

agners commented May 10, 2022

Log outputs when triggering the race with a manual (hardware) reset during bellows/ZHA restart:

2022-05-10 18:27:39 DEBUG (MainThread) [bellows.ezsp] connect(): calling uart.connect()
2022-05-10 18:27:39 DEBUG (MainThread) [bellows.ezsp] connect(): uart.connect() called, _gw is <bellows.thread.ThreadsafeProxy object at 0x7f838a8340>
2022-05-10 18:27:39 DEBUG (MainThread) [bellows.ezsp] Resetting EZSP
2022-05-10 18:27:39 DEBUG (MainThread) [bellows.ezsp] _gw.reset()
2022-05-10 18:27:39 DEBUG (bellows.thread_0) [bellows.uart] RSTACK Version: 2 Reason: RESET_POWER_ON frame: b'c102029b7b7e'
2022-05-10 18:27:39 INFO (MainThread) [bellows.ezsp] NCP entered failed state. No application handler registered, ignoring...
2022-05-10 18:27:39 ERROR (MainThread) [bellows.ezsp] NCP entered failed state. Requesting APP controller restart
2022-05-10 18:27:39 DEBUG (MainThread) [bellows.ezsp] ezsp.close()
2022-05-10 18:27:39 DEBUG (MainThread) [bellows.ezsp] calling _gw.close()
2022-05-10 18:27:39 DEBUG (MainThread) [bellows.ezsp] called _gw.close()
2022-05-10 18:27:39 DEBUG (bellows.thread_0) [bellows.uart] Resetting ASH
2022-05-10 18:27:39 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'1ac038bc7e'
2022-05-10 18:27:39 DEBUG (bellows.thread_0) [bellows.uart] Closed serial connection

In this case the message got processed between opening the port and assignment of EZSP._gw. This does not lead to a port close. So the actual race window is only after assigning EZSP._gw up until the callback is registered.

2022-05-10 17:58:36 DEBUG (MainThread) [bellows.ezsp] connect(): calling uart.connect()
2022-05-10 17:58:36 DEBUG (bellows.thread_0) [bellows.uart] RSTACK Version: 2 Reason: RESET_POWER_ON frame: b'c102029b7b7e'
2022-05-10 17:58:36 ERROR (MainThread) [bellows.ezsp] NCP entered failed state. Requesting APP controller restart
2022-05-10 17:58:36 DEBUG (MainThread) [bellows.ezsp] ezsp.close()
2022-05-10 17:58:36 DEBUG (MainThread) [bellows.ezsp] connect(): uart.connect() called, _gw is <bellows.thread.ThreadsafeProxy object at 0x7f88b3e8b0>
2022-05-10 17:58:36 DEBUG (MainThread) [bellows.ezsp] Resetting EZSP
2022-05-10 17:58:36 DEBUG (MainThread) [bellows.ezsp] _gw.reset()
2022-05-10 17:58:36 DEBUG (bellows.thread_0) [bellows.uart] Resetting ASH
2022-05-10 17:58:36 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'1ac038bc7e'

@codecov-commenter
Copy link

codecov-commenter commented May 10, 2022

Codecov Report

Merging #459 (b331ef5) into dev (ad1ea56) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##              dev     #459   +/-   ##
=======================================
  Coverage   99.91%   99.91%           
=======================================
  Files          44       44           
  Lines        3467     3469    +2     
=======================================
+ Hits         3464     3466    +2     
  Misses          3        3           
Impacted Files Coverage Δ
bellows/ezsp/__init__.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ad1ea56...b331ef5. Read the comment docs.

Copy link
Collaborator

@Adminiuga Adminiuga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Adminiuga Adminiuga merged commit 0b3d934 into zigpy:dev May 10, 2022
@Adminiuga Adminiuga added the bug label May 10, 2022
@agners agners deleted the fix-uart-close-race-at-startup branch May 11, 2022 08:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ZHA not loading properly sometimes
3 participants