Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lunatone HID: occasional stuck communication, and assert current_command == None in dali/driver/hid.py:592 #124

Open
jktjkt opened this issue Jul 24, 2023 · 0 comments

Comments

@jktjkt
Copy link
Contributor

jktjkt commented Jul 24, 2023

Every now and then my DALI connection (with Lunatone USB module) gets stuck; sometimes it's once a few days, sometimes it's just after a few minutes. When it failed for the last time, the bus traffic right before that was the following:

2023-07-24 06:56:57,367 dali: INFO: OccupancyEvent(short_address=46, data=EventData(movement=False, occupied=True, repeat=True, sensor_type='movement'))
2023-07-24 06:57:27,398 movement-koupelna-vana: INFO: Dimming automatically
2023-07-24 06:57:27,419 dali: INFO: DTR0(40)
2023-07-24 06:57:27,477 dali: INFO: SetExtendedFadeTime(<address (control gear) 19>)
2023-07-24 06:57:27,507 dali: INFO: DTR0(0)
2023-07-24 06:57:27,565 dali: INFO: SetFadeTime(<address (control gear) 19>)
2023-07-24 06:57:27,596 dali: INFO: ArcPower(<address (control gear) 19>,1)
2023-07-24 06:57:27,625 dali: INFO: DTR0(19)

I've added some async task debugging inspired by from python/cpython#91048 and some abuse of the traceback module, so that upon a SIGUSR1 I get a dump of asyncio.all_tasks and what they are waiting for. Apart from the usual, boring stuff, I saw this sequence of awaits:

  File "/home/jkt/work/prog/svetla-dali-mqtt/svetla/movement.py", line 201, in Room.run
    await self.on_event(event)
    self = <Room at 0x7fad22c99050 name=koupelna-vana state=auto-dim timeout=8 ignore=False>

  File "/home/jkt/work/prog/svetla-dali-mqtt/svetla/movement.py", line 144, in Room.on_event
    await self.do_auto_dim()
    self = <Room at 0x7fad22c99050 name=koupelna-vana state=auto-dim timeout=8 ignore=False>

  File "/home/jkt/work/prog/svetla-dali-mqtt/svetla/movement.py", line 115, in Room.do_auto_dim
    await self.activate_quick_fade()
    self = <Room at 0x7fad22c99050 name=koupelna-vana state=auto-dim timeout=8 ignore=False>

  File "/home/jkt/work/prog/svetla-dali-mqtt/svetla/movement.py", line 90, in Room.activate_quick_fade
    await self.dev.send(SetExtendedFadeTime(self.target))
    self = <Room at 0x7fad22c99050 name=koupelna-vana state=auto-dim timeout=8 ignore=False>

  File "/home/jkt/work/prog/python-dali/dali/driver/hid.py", line 203, in hid.send
    response = await self._send_raw(command)
    self = <dali.driver.hid.tridonic object at 0x7fad22c9ae50>

  File "/home/jkt/work/prog/python-dali/dali/driver/hid.py", line 444, in tridonic._send_raw
    await event.wait()
    self = <dali.driver.hid.tridonic object at 0x7fad22c9ae50>

  File "/nix/store/k43ppvy4jxbvg65gwihsn48c10nnaali-python3-3.11.4/lib/python3.11/asyncio/locks.py", line 213, in Event.wait
    await fut
    self = <asyncio.locks.Event object at 0x7fad22ad63d0 [unset, waiters:1]>

I still have no idea why that happens, but then I killed the program via Ctrl-C, and the following was printed:

2023-07-24 07:07:02,917 asyncio: ERROR: Task exception was never retrieved
future: <Task finished name='Task-2' coro=<tridonic._bus_watch() done, defined at /home/jkt/work/prog/python-dali/dali/driver/hid.py:477> exception=AssertionError()>
Traceback (most recent call last):
  File "/home/jkt/work/prog/python-dali/dali/driver/hid.py", line 592, in _bus_watch
    assert current_command == None
           ^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

Apparently, the _bus_watch_task is created, but never await-ed. That looks like a bug, but I'm not familiar enough with asyncio and its use from the context of the HID driver to fix this. I also do not know if this is related to my original issue with locking up. As far as I can tell, the asyncio.Event that the _send_raw is waiting on is never touched from the _bus_watch_task, but perhaps there's some subtle interaction that I'm not aware of. Or, maybe, if there's an exception in this task it somehow affects the IO loop? That's probably unlikely because it's the same IO loop which handles my SIGUSR1, and that one worked just fine and produced the trace.

Does that make sense, and are these two issues perhaps related to each other?

Would you like me to get more debugging logs from the driver so that (maybe) we will know what exactly happened before that assertion was thrown? (I'll have to RTFM for logger selectivity because it's very, very verbose.)

jktjkt added a commit to jktjkt/python-dali that referenced this issue Jul 26, 2024
These were all logged as debug messages, which means that they are not
shown by default. This makes any debugging of failures much harder, see,
e.g., sde1000#124.
jktjkt added a commit to jktjkt/python-dali that referenced this issue Jul 26, 2024
I was playing with some commands, and one of my devices apparently just
won't send any reply to a DT5 ResetConverterSettings. That's a
sendtwice=True command, and when that happens, I was getting an
assertion failure later on on line 600, which looks pretty much like
bug sde1000#124:

 15:00:59.333780 D: tridonic: Sending with seq e7
 15:00:59.339550 D: tridonic: waiting for outstanding_transmissions=1 {response=}
 15:00:59.357560 D: tridonic: _handle_read 12730000C105FFFFE7
 15:00:59.363037 D: tridonic: bus_watch message 12730000C105FFFFE7
 15:00:59.368941 D: tridonic: Command EnableDeviceType(5), immediate
 15:00:59.384012 D: tridonic: remembering device type 5
 15:00:59.387287 D: tridonic: Bus watch waiting for data, no timeout
 15:00:59.400013 D: tridonic: message mode=12 rtype=73 frame=b'\x00\x00\xc1\x05' interval=ffff seq=e7
 15:00:59.403352 D: tridonic: waiting for outstanding_transmissions=0 {response=}
 15:00:59.407774 I: dali: EnableDeviceType(5)
 15:00:59.431360 D: tridonic: _handle_read 127100000000008AE7
 15:00:59.454305 D: tridonic: bus_watch message 127100000000008AE7
 15:00:59.457782 D: tridonic: Bus watch waiting for data, no timeout
 15:00:59.461983 D: tridonic: message mode=12 rtype=71 frame=b'\x00\x00\x00\x00' interval=008a seq=e7
 15:00:59.693383 D: tridonic: Sending with seq e8
 15:00:59.698993 D: tridonic: waiting for outstanding_transmissions=1 {response=}
 15:00:59.716567 D: tridonic: _handle_read 1273000001E6103BE8
 15:00:59.721514 D: tridonic: bus_watch message 1273000001E6103BE8
 15:00:59.727816 D: tridonic: Bus watch waiting with timeout
 15:00:59.732960 D: tridonic: message mode=12 rtype=73 frame=b'\x00\x00\x01\xe6' interval=103b seq=e8
 15:00:59.736298 D: tridonic: waiting for outstanding_transmissions=0 {response=}
 15:00:59.741356 D: tridonic: _handle_read 127100000000008AE8
 15:00:59.746287 D: tridonic: message mode=12 rtype=71 frame=b'\x00\x00\x00\x00' interval=008a seq=e8
 15:00:59.785298 D: tridonic: bus_watch message 127100000000008AE8
 15:00:59.788843 D: tridonic: Unexpected response waiting for retransmit of config command
 15:00:59.794324 E: tridonic: current_command is not None: ResetConverterSettings(<address (control gear) 0>)

I'm not saying that this is surely a fix for sde1000#124, but that one was
happening in an environment where DT8 Tc commands were sent regularly.
sde1000 pushed a commit that referenced this issue Aug 2, 2024
These were all logged as debug messages, which means that they are not
shown by default. This makes any debugging of failures much harder, see,
e.g., #124.
sde1000 pushed a commit that referenced this issue Aug 2, 2024
I was playing with some commands, and one of my devices apparently just
won't send any reply to a DT5 ResetConverterSettings. That's a
sendtwice=True command, and when that happens, I was getting an
assertion failure later on on line 600, which looks pretty much like
bug #124:

 15:00:59.333780 D: tridonic: Sending with seq e7
 15:00:59.339550 D: tridonic: waiting for outstanding_transmissions=1 {response=}
 15:00:59.357560 D: tridonic: _handle_read 12730000C105FFFFE7
 15:00:59.363037 D: tridonic: bus_watch message 12730000C105FFFFE7
 15:00:59.368941 D: tridonic: Command EnableDeviceType(5), immediate
 15:00:59.384012 D: tridonic: remembering device type 5
 15:00:59.387287 D: tridonic: Bus watch waiting for data, no timeout
 15:00:59.400013 D: tridonic: message mode=12 rtype=73 frame=b'\x00\x00\xc1\x05' interval=ffff seq=e7
 15:00:59.403352 D: tridonic: waiting for outstanding_transmissions=0 {response=}
 15:00:59.407774 I: dali: EnableDeviceType(5)
 15:00:59.431360 D: tridonic: _handle_read 127100000000008AE7
 15:00:59.454305 D: tridonic: bus_watch message 127100000000008AE7
 15:00:59.457782 D: tridonic: Bus watch waiting for data, no timeout
 15:00:59.461983 D: tridonic: message mode=12 rtype=71 frame=b'\x00\x00\x00\x00' interval=008a seq=e7
 15:00:59.693383 D: tridonic: Sending with seq e8
 15:00:59.698993 D: tridonic: waiting for outstanding_transmissions=1 {response=}
 15:00:59.716567 D: tridonic: _handle_read 1273000001E6103BE8
 15:00:59.721514 D: tridonic: bus_watch message 1273000001E6103BE8
 15:00:59.727816 D: tridonic: Bus watch waiting with timeout
 15:00:59.732960 D: tridonic: message mode=12 rtype=73 frame=b'\x00\x00\x01\xe6' interval=103b seq=e8
 15:00:59.736298 D: tridonic: waiting for outstanding_transmissions=0 {response=}
 15:00:59.741356 D: tridonic: _handle_read 127100000000008AE8
 15:00:59.746287 D: tridonic: message mode=12 rtype=71 frame=b'\x00\x00\x00\x00' interval=008a seq=e8
 15:00:59.785298 D: tridonic: bus_watch message 127100000000008AE8
 15:00:59.788843 D: tridonic: Unexpected response waiting for retransmit of config command
 15:00:59.794324 E: tridonic: current_command is not None: ResetConverterSettings(<address (control gear) 0>)

I'm not saying that this is surely a fix for #124, but that one was
happening in an environment where DT8 Tc commands were sent regularly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant