
Endless "initialization loop" after restart #227

Closed
3 tasks
Lurker00 opened this issue May 11, 2024 · 32 comments
Closed
3 tasks

Endless "initialization loop" after restart #227

Lurker00 opened this issue May 11, 2024 · 32 comments
Labels
bug Something isn't working

Comments

@Lurker00

LocalTuya Version

3.2.5.2b7

Home Assistant Version

2024.5.3

Environment

  • Does the device work using the Home Assistant Tuya Cloud component?
  • Is this device connected to another local integration, including Home Assistant and any other tools?
  • The devices are within the same HA subnet, and they get discovered automatically when I add them

What happened?

After some recent changes (moved HA from a VM to a dedicated armbian box, added some devices, HA updates to 2024.5.*), a weird problem arose: after an HA restart, LocalTuya falls into an endless loop of the initial sequence, trying to connect to all my devices to obtain initial entity values. During this time, HA becomes useless, because all the devices become unavailable every few seconds.

If I don't touch the system, it may continue running like this for a long time. It seems that adding or reconfiguring a device (even one not related to LocalTuya, e.g. via the direct BLE integration) stops the loop.

I applied your #226 fix before the last restart (the one with the attached log), just for information.

I don't mark this issue as a bug because I have no idea what caused this problem to start: the HA update, the newly added devices, or a combination of events...

I attach the current diagnostics of all my devices, and a log which ends when that loop just stopped. Note that the device "[bf5...wrf]" is disabled (a gateway with 2 DPs). But the log contains plenty of records like

2024-05-11 09:24:30.688 DEBUG (MainThread) [custom_components.localtuya.number] [bf5...wrf] Entity up_channel (DP 1) - Not restoring as restore on reconnect is disabled for this entity and the entity has an initial status or it is not a passive entity
2024-05-11 09:24:30.688 DEBUG (MainThread) [custom_components.localtuya.number] [bf5...wrf] Entity down_channel (DP 2) - Not restoring as restore on reconnect is disabled for this entity and the entity has an initial status or it is not a passive entity

which should only happen once at the very beginning. That's why I call it an "endless initialization loop".

Steps to reproduce.

Don't know

Relevant log output

No response

Diagnostics information.

home-assistant.log
config_entry-localtuya-944c1616039a1d1ec6b9745179bff74e (5).json

@Lurker00 Lurker00 added the bug Something isn't working label May 11, 2024
@xZetsubou
Owner

xZetsubou commented May 11, 2024

It should happen for every entity that is added; still, I'm not sure if that is the reason?

I don't understand how "LocalTuya" can cause this: there is no waiting on initialization; it only waits for one device to connect or fail, then continues. Is there a notification message at the bottom left, "waiting for LocalTuya to start up"? Maybe the full HA logs would help.

@Lurker00
Author

I don't understand how can "LocalTuya" causes this

Neither do I. That's why I didn't use a "Bug" prefix in the title. But you can see in the log that LocalTuya connects to all the devices over and over again, all with successful results! There must be something that triggers this process.

Is there a notification message at the bottom left "waiting for "LocalTuya" to startup" ?

No. It showed the message that HA started. Moreover, I have an automation that sends a message to Telegram on the HA "started" event, and I received it.

maybe full HA logs may help.

Should I restart HA with "default: debug" for the logger, or what? That log was recorded with the default warning level. Writing huge full logs may affect the system by itself, so if you have a suggestion for the minimum set of "logger" options you need, you're welcome!

@xZetsubou
Owner

Yeah, enable debug level for the logs and then post the full HA logs; I find it difficult to figure out what might cause the infinite loop in localtuya.

@Lurker00
Author

Here it is:
home-assistant.log.zip

At first I noticed Bluetooth-related messages right before each LocalTuya "restart" and disabled the Bluetooth integration, to no avail. Then I tried to reconfigure a device (it had helped to stop the loop before), but, alas, the loop continued.

Currently, I've restarted HA without debug logging, and I can't stop this process: LocalTuya restarts connecting to the devices every 30 seconds, making them unavailable each time.

It probably would not be a big problem if the devices kept their statuses instead of becoming unavailable, but as it is, no automation works.

@Lurker00
Author

I've managed to get HA working after a restart. Beforehand, I uninstalled Bluetooth support in armbian, and deleted all Bluetooth-related integrations and devices. But I can't be sure about what caused this misbehavior the first time: I also removed some other stuff, which hadn't made a difference when I did it. Currently I'm afraid to continue the investigation. It has already been running for 10 minutes without a glitch, and I prefer to keep it running for a while.

@Lurker00 Lurker00 changed the title Endless "initializtion loop" after restart Endless "initialization loop" after restart May 12, 2024
@Lurker00
Author

Please note that this sequence

2024-05-12 11:15:57.088 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf5...whd] Trying to connect to КЛ: Zigbee Bulb...
2024-05-12 11:15:57.101 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf2...ath] Trying to connect to ГСЛ: Zigbee Bulb...
2024-05-12 11:15:57.104 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bfe...7yz] Trying to connect to ГС: Motion Sensor...
2024-05-12 11:15:57.109 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bfa...hoi] Trying to connect to К: 3x Переключатель...
2024-05-12 11:15:57.113 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf0...tdq] Trying to connect to ГС: Выключатель...
2024-05-12 11:15:57.118 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf1...kn4] Trying to connect to ГС: 3x Переключатель...
2024-05-12 11:15:57.121 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf1...xw5] Trying to connect to К: Выключатель...
2024-05-12 11:15:57.127 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf1...z1q] Trying to connect to ГС: HP&L Сенсор...
2024-05-12 11:15:57.132 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf0...hwo] Trying to connect to К: M&L Сенсор...
2024-05-12 11:15:57.137 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf8...mlc] Trying to connect to К: Роутер...
2024-05-12 11:15:57.141 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bfe...tf0] Trying to connect to ГС: Роутер...
2024-05-12 11:15:57.146 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf6...stj] Trying to connect to Тест: Smart Plug+...
2024-05-12 11:15:57.151 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bfb...cbv] Trying to connect to К: Реле 4х...
2024-05-12 11:15:57.155 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bfa...dbm] Trying to connect to ГСЛ: Реле 4х...
2024-05-12 11:15:57.161 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf2...aky] Trying to connect to ГС: 2х Переключатель...
2024-05-12 11:15:57.165 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bff...ns9] Trying to connect to ПЛ: Диммер...
2024-05-12 11:15:57.170 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bf0...epk] Trying to connect to П: Выключатель...
2024-05-12 11:15:57.174 DEBUG (MainThread) [custom_components.localtuya.coordinator] [bfc...b03] Trying to connect to П: Motion Sensor...

which repeats over and over again, consists of the devices of one gateway: Zigbee only, wired (Ethernet). I also have one BLE-only wireless gateway and one Zigbee+BLE wireless gateway. For the latter, I sometimes see such sequences as well, but only 2-3 loops, never more. I also have a number of WiFi devices, and none of them experiences such misbehaviour.

So, LocalTuya re-initializes not all the devices, but only the devices connected to this one particular gateway (bf504372fcc2cc8b10fwrf, 192.168.0.31). Based on this, I doubt that it is an HA-related or BLE-related problem...

The problem started to happen within a short period of time (some hours) after I added 3 new devices to this gateway and created automations for them. But everything was working for some hours.

These new devices are: a dimmer (bffb4b344babe630c4lns9), a scene switch (bf0c1d4f7f66ddd4b2lepk), and a PIR motion sensor (bfcaf22d28414bacb8cb03). For the dimmer ("tdq"), LocalTuya has no support for its DPs (switch and brightness control), but I added them manually.

@fabien4444

Hello,
I have to add that only my sub-device is unavailable! I have 3 other Tuya main devices which are working well.

@Lurker00
Author

I've restored core 2024.5.1 from a backup, added my new devices back, and the problem started again.

@Lurker00
Author

From my experience, I'm able to stop that loop by disabling or enabling a LocalTuya device. It seems it does not really matter which device; e.g. it can be the BLE gateway.

@xZetsubou

This comment was marked as outdated.

@Lurker00
Author

Lurker00 commented May 14, 2024

Can you try disabling Sub-Devices only,

Yes, I did. Moreover, once I (by mistake) disabled and then enabled only one entity (the switch of the dimmer), and that stopped the loop.

But it is a random process. I just woke up, and I see in the log that some hours ago the loop started again, running every second, and stopped by itself after 9 iterations. Then, around 30 minutes later, it started again every 30 seconds and stopped after 14 iterations. Still, the endless loop itself didn't happen.

I have a thought that it may be caused by "too many" Zigbee devices bound to one gateway. The aforementioned gateway now has 17 active devices, and 3 of them were added recently. When I reverted back to core 2024.5.1, I lost them, and the problem did not occur after that HA restart. But it returned after I added them back, along with the manual steps to configure the DPs of the dimmer and PIR sensor. - I disabled more than 3 devices, to no avail.

@Lurker00
Author

Lurker00 commented May 14, 2024

I changed some messages from debug to info, and added a "Sub-device name disconnected" message to def disconnected(self) here:

def disconnected(self):
    """Device disconnected."""
    sleep_time = self._device_config.sleep_time
    self._interface = None

I also changed HEARTBEAT_SUB_DEVICES_INTERVAL to 60, and HEARTBEAT_INTERVAL to 20. After a restart, the problem disappeared within some minutes. Here is the log:
home-assistant_2024-05-14T19-35-07.947Z.log

Please note:

  1. 192.168.0.37 is a low-power WiFi T&H sensor. Ignore it.
  2. The period has changed to 60 seconds.
  3. For each device, from the very beginning, there are two "Trying to connect" records one second apart. Is that correct?
  4. Only for the Zigbee devices connected to that gateway is there a disconnect callback almost immediately after a successful connect (in subsequent attempts they disconnect after HEARTBEAT_SUB_DEVICES_INTERVAL). Unfortunately, you don't log disconnect events, so they are missing from the full logs I submitted before.

Here is an extraction for one device:

2024-05-14 22:08:30.667 INFO (MainThread) [custom_components.localtuya.coordinator] [bf5...whd] Trying to connect to КЛ: Zigbee Bulb...
2024-05-14 22:08:31.127 INFO (MainThread) [custom_components.localtuya.coordinator] [bf5...whd] Trying to connect to КЛ: Zigbee Bulb...
2024-05-14 22:08:31.225 INFO (MainThread) [custom_components.localtuya.coordinator] [bf5...whd] Started heartbeat for КЛ: Zigbee Bulb
2024-05-14 22:08:31.226 INFO (MainThread) [custom_components.localtuya.coordinator] [bf5...whd] Success: connected to КЛ: Zigbee Bulb
2024-05-14 22:08:31.909 INFO (MainThread) [custom_components.localtuya.coordinator] [bf5...whd] Sub-device КЛ: Zigbee Bulb disconnected.
2024-05-14 22:08:35.796 INFO (MainThread) [custom_components.localtuya.coordinator] [bf5...whd] Trying to connect to КЛ: Zigbee Bulb...

I'm quite happy that it finally "calmed down" (I had to add one more device this evening, and then couldn't stop the loop), and I can go to sleep.

By the way, I've updated to 2024.5.0b1.

@xZetsubou
Owner

xZetsubou commented May 14, 2024

Actually, I'm adjusting the way the sub-devices' connection process works.

  1. Sub-devices don't need to be waited on during the connection process like parent devices.
  2. The heartbeat of sub-devices should start after the sub-devices have connected, not while they are still in the connecting process.
    ^ This can be fixed by moving the asyncio sleep above the subdevices_query call in start_sub_devices_heartbeat in the pytuya module (see the sketch after the next list).

I think this is what's causing the loop.

  1. The sub-devices connect successfully even though they are not actually connected, because the gateway may still have cached their status.
  2. The heartbeat for sub-devices starts in the meantime, while the sub-devices are still in the connecting process.
  3. The sub-devices' HB may report that a device is actually not connected, and disconnected() will be called. This is still during the 1st connection attempt.
  4. Now that the disconnected() function has been called, localtuya will immediately retry the connection, and that may cause the loop? This is probably why there are 2 connection attempts at the start.
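
A minimal sketch of that reordering, assuming the heartbeat loop in pytuya looks roughly like this (the class stub, the interval value, and the loop body are assumptions based on this thread, not the actual pytuya code):

import asyncio

HEARTBEAT_SUB_DEVICES_INTERVAL = 30  # hypothetical value for this sketch

class SubDeviceHeartbeat:
    """Stand-in for the pytuya protocol class (assumption)."""

    async def subdevices_query(self):
        ...  # send the sub-devices status query to the gateway

    async def start_sub_devices_heartbeat(self):
        while True:
            # Sleep first (the reordering described above), so the first
            # query never races the initial sub-device connection process.
            await asyncio.sleep(HEARTBEAT_SUB_DEVICES_INTERVAL)
            await self.subdevices_query()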

@Lurker00
Author

I've added logging to def _setup_dispatcher(self) -> MessageDispatcher

on_devices = data.get("online", [])
off_devices = data.get("offline", [])
for cid, device in listener._sub_devices.items():
    if cid in off_devices or cid not in on_devices:
        device.disconnected()

and yes: LocalTuya disconnects here!
home-assistant_2024-05-15T07-32-30.556Z.log

2. This can be fixed by moving the asyncio sleep above the subdevices_query call in start_sub_devices_heartbeat in the pytuya module.

It does not help to solve this particular problem: LocalTuya continues disconnecting the devices. Nor does it prevent the double connections at the start. Here is the log:
home-assistant_2024-05-15T07-40-22.231Z.log

I've changed the condition at line 835 there to

                        if cid in off_devices:

and finally my system is up and running 😄
home-assistant_2024-05-15T07-47-01.946Z.log

I believe it was bad logic.

But, with both fixes in place, it still connects twice to each device within 1 second.

My opinion is that LocalTuya needs a deep architectural change here: create one and only one connection to each gateway, and use it for all the sub-devices connected to that gateway. A gateway, as I understand it, has a limited capacity for serving many connections.
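
To illustrate the idea, a rough sketch of one shared connection per gateway host (all names here are hypothetical; this is the shape of the suggestion, not LocalTuya's actual API):

import asyncio

class GatewayConnectionPool:
    """Hypothetical cache: one shared connection per gateway host."""

    def __init__(self, connect_func):
        self._connect = connect_func  # coroutine: host -> connection
        self._connections = {}        # host -> open connection
        self._locks = {}              # host -> asyncio.Lock

    async def get(self, host: str):
        # Serialize connection attempts per host, so two sub-devices
        # added at the same time don't open two sockets to one gateway.
        lock = self._locks.setdefault(host, asyncio.Lock())
        async with lock:
            if host not in self._connections:
                self._connections[host] = await self._connect(host)
            return self._connections[host]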

@Lurker00
Author

Look at the following two answers from my gateway:

2024-05-15 10:34:41.728 INFO (MainThread) [custom_components.localtuya.core.pytuya] [bf5...whd] Sub-Devices States Update: {'reqType': 'subdev_online_stat_report', 'data': {'online': ['a4c138e51d9c71a2', 'a4c13825abcf5819', 'a4c138ba5b2576e5', 'a4c13825798525df', 'a4c138de3eb1f3f3', 'a4c1380ad40d076d', 'a4c1380e8e46d3c4', '00124b002d8e6a68', 'a4c138717b6a9928', '00124b00272127b7', 'a4c138f46f838ee0', 'a4c1389aeca68fbf', 'a4c138a112e22ebe', 'a4c1384bb9cd4e1d', 'a4c1380f8554dc9d'], 'offline': ['a4c1382e58833a9c', 'a4c138c67c8e556e']}}. a4c138b97aac2ae0 disconnected
2024-05-15 10:34:41.983 INFO (MainThread) [custom_components.localtuya.core.pytuya] [bf5...whd] Sub-Devices States Update: {'reqType': 'subdev_online_stat_report', 'data': {'online': ['a4c138c4b14360be', 'a4c138d43c2db604', 'a4c138b97aac2ae0'], 'offline': []}}. a4c1389aeca68fbf disconnected

It seems that if a gateway has many connected devices, it reports the online/offline statuses of its sub-devices in a series of messages. I faced the problem exactly when I added the 3 aforementioned new devices (note the second message: exactly 3 cids!). I thought disabling them in LocalTuya would help, but the reason was on the gateway side.

So, checking against the 'offline' list is always good.

To check against 'online', it is required to combine several answers into one. But I doubt LocalTuya needs that at all, because if a sub-device was disconnected from the gateway (the only reason to be missing from both lists), LocalTuya would find it out by other means anyway, e.g. by checking against the device list from the cloud.

My suggestion is to remove the check against on_devices as the fix for this problem.

But that double connection attempt at the start had better be fixed as well!

@Lurker00
Author

Lurker00 commented May 15, 2024

One more thought: when you send the subdev_online_stat_report request to the gateway, it may answer with more than one message! Currently LocalTuya expects only one reply. The second reply is read later, and can be interpreted the wrong way.
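
One way to handle that, sketched under the assumption that each partial reply is shaped like the subdev_online_stat_report payloads quoted above (the function name is made up for illustration):

def merge_subdev_reports(reports):
    """Merge several partial subdev_online_stat_report payloads."""
    online, offline = set(), set()
    for report in reports:
        data = report.get("data", {})
        online.update(data.get("online", []))
        offline.update(data.get("offline", []))
    # A cid seen online in any part wins over a stale offline entry.
    return online, offline - online

msg1 = {"data": {"online": ["a4c138c4b14360be"], "offline": ["a4c1382e58833a9c"]}}
msg2 = {"data": {"online": ["a4c138b97aac2ae0"], "offline": []}}
online, offline = merge_subdev_reports([msg1, msg2])
# Treat a sub-device as disconnected only if the merged offline set
# contains its cid, never because one partial message omitted it.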

@Lurker00
Author

I'd suggest you change the log messages in _make_connection to include both name and host, like this:

self.debug(f"Trying to connect to {name} ({self._device_config.host})...", force=True)
self.debug(f"Started heartbeat for {name} ({self._device_config.host})", force=True)
self.debug(f"Success: connected to {name} ({self._device_config.host})", force=True)

so as to better understand which WiFi device produced a message, and which gateway is used for a sub-device. The same goes for warnings, errors and exceptions.

@Lurker00
Author

FYI, with the fix suggested in #227 (comment), it runs very well. I had a device with a weak signal, and its offline state was handled correctly by the sub-devices heartbeat several times. So, I believe it has been tested enough.

@xZetsubou
Owner

The issue with this approach is that if a device has been removed from the gateway, it won't show as "offline"; rather, it will disappear from both the on and off device lists.

@Lurker00
Author

That is not a big issue! If required, it can be detected from the information from the cloud. The user can delete the device if (s)he is concerned.

Moreover, I would not disconnect a sub-device based on the offline list, especially if it is used as a fake gateway. The reason is that it does not hurt to use an offline device as a fake gateway, but disconnecting it causes all the other sub-devices to disconnect, and then it falls into a connect-disconnect loop (I saw it in the logs). In brief, the current implementation destroys the functionality for all sub-devices if the first one in the list goes offline.

My opinion is that the online and offline lists could be used to prevent connections to offline devices until they appear online, but that is not implemented.

The current handling of subdev_online_stat_report results looks more harmful than useful to me.

@xZetsubou
Owner

xZetsubou commented May 17, 2024

I know, but I'm still looking for a better solution if possible. I don't have a lot of sub-devices to test the query command on.
Does the command always report all the sub-devices you have, or does it sometimes miss some of them?

@Lurker00
Author

Lurker00 commented May 17, 2024

Yes, currently the gateway always replies with the two messages shown in the log records in #227 (comment), covering all the devices attached to the gateway. I have two powered-off devices (disabled in LocalTuya) which are listed in the 1st message. The two online lists together contain all my online devices when they are online. Of course, when I temporarily turn some of them off, or in case of a weak signal, both messages still show all the devices, with the correct division into online and offline.

I presume that with even more devices, the number of messages would grow.

xZetsubou added a commit that referenced this issue May 18, 2024
* Sub-devices will no longer be waited to connect only the gateway device will.
* A workaround to handle sub-devices messages that comes on parts.
* Revert back connected on check.
* Add static seqno for sub-devices query command.
* cleans up some codes.
@xZetsubou
Owner

Can you give 2024.5.0b2 a try

@Lurker00
Author

Can you give 2024.5.0b2 a try

I updated to it, but changed some log messages from debug to info for better understanding. It started, and twice restarted, with no problems. Let's see how it behaves in the long run.

I see you didn't implement the TimeoutError handling in start_sub_devices_heartbeat here 😞 I re-implemented it, because, from the logs, I found it useful.

@xZetsubou
Owner

There will be no false timeouts anymore for the sub-devices HB like before, unless the device actually doesn't respond at all, so you can try it without adding timeout retries. As a matter of fact, your tests give me more information; I don't have the number of sub-devices you have 😺, so I'll be glad if you test with the normal release.

@Lurker00
Author

I've turned off a dimmer, and within a few seconds received the corresponding message in the log. So, the new design detects offline devices, and so far it works with no false disconnect events.

Currently, LocalTuya logs are either debug or empty. I found it useful to have more information at the info level:

self.debug(f"Success: connected to {name} - {host}", force=True)

self.debug(f"Sub-devices heartbeat stopped due to: {ex}")

Inserted after this line

        if self._node_id:
            self.info(f"Disconnected: {self._device_config.name} {self._device_config.host} {self._node_id}")
        else:
            self.info(f"Disconnected: {self._device_config.name} {self._device_config.host}")

(currently the first message is a warning; that's why the if is there)

I also changed one more debug to warning, and wrote a comment on the commit.

@Lurker00
Author

Lurker00 commented May 18, 2024

There will be no false timeouts anymore

Yesterday I had a big problem with my internet provider for a whole day. All my automations are local, with no direct involvement of the cloud. But every device with a direct connection to the cloud (Wi-Fi devices and gateways) reports its state, or the state changes of its sub-devices. When they can't connect or send data to the cloud, they become very unresponsive! Lots of timeouts, lots of Zigbee sub-devices went offline because they failed to inform their gateway (which was just busy), very slow reaction (if any) to events.

So, don't presume that a device either must answer within 5 seconds or can't answer at all. It may just be busy with other tasks right now.

But I see you've rewritten it even further. Let's see later. For now - sorry! - I prefer to have TimeoutError handled in this particular place, for the sake of robustness, based on my experience. But I'll check the logs to see if timeouts ever happen.
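
For reference, the kind of handling meant here is simply tolerating a few consecutive timeouts before giving up - a sketch under assumed names, not the actual pytuya code:

import asyncio

MAX_CONSECUTIVE_TIMEOUTS = 3  # hypothetical tolerance

async def heartbeat_with_timeout_tolerance(query, interval=30):
    """Sketch: `query` is a coroutine function that raises
    asyncio.TimeoutError when the gateway is too busy to answer."""
    misses = 0
    while True:
        await asyncio.sleep(interval)
        try:
            await query()
            misses = 0  # the gateway answered; reset the counter
        except asyncio.TimeoutError:
            misses += 1
            if misses >= MAX_CONSECUTIVE_TIMEOUTS:
                raise  # only now treat the device as really gone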

@xZetsubou
Owner

Technically, if there's a timeout error, the loop of subdevices_query will break and will not raise anything! Probably what caused the mass timeout errors is the status command that is called in the connect process.

@Lurker00
Author

In addition to #227 (comment), I uncommented

because otherwise I could not understand from the log what was happening.

@xZetsubou xZetsubou added the master/next-release Fixed in master branch, Will be ready in the next release label May 18, 2024
@Lurker00
Author

Can you give 2024.5.0b2 a try

In general, it runs very well.

But I happened to restart HA many times, and I noticed that sometimes, right after a restart, some entities are in an unavailable state. The corresponding devices were online and there were connections to them, but their entities were disabled, and even after some minutes they didn't come alive. Another restart was always free of this problem. I also saw such a problem once with 3.2.5.2b7. The LocalTuya log level is info (with my corresponding changes to the source code), so I have nothing to support this report.

@Lurker00
Author

FYI, I installed 2024.5.0b4 an hour ago, patched for more info in the logs, and it started and is running smoothly so far. I tested a sub-device going offline and back online, and it worked as expected.


This issue was closed because it was resolved on the release:

@github-actions github-actions bot added stale and removed master/next-release Fixed in master branch, Will be ready in the next release stale labels May 26, 2024