Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] Android looses connection and won't reconnect #40

Open
albaintor opened this issue Apr 7, 2024 · 14 comments
Open

[bug] Android looses connection and won't reconnect #40

albaintor opened this issue Apr 7, 2024 · 14 comments
Labels
bug Something isn't working

Comments

@albaintor
Copy link
Contributor

Description

Following the last updates, it addressed the commands not handled when the transport stream is down.
However I still have disconnections issues which are not handled automatically : I mean that the driver won't try to reconnect by itself
When this happens, this message is raised :
2024/04/07 18:49:44,stdout,ERROR:tv:[SHIELD Salon] Cannot send command: Disconnected from device

In the code there should be a reconnection task when this occurs : an exception ucapi.StatusCodes.SERVICE_UNAVAILABLE is raised instead.

    async def wrapper(self: _AndroidTvT, *args: _P.args, **kwargs: _P.kwargs) -> ucapi.StatusCodes:
        try:
            # use the same exceptions as the func is throwing (e.g. AndroidTVRemote.send_key_command)
            state = self.state
            if state != DeviceState.CONNECTED:
                if state in (DeviceState.DISCONNECTED, DeviceState.CONNECTING) or self.is_on is None:
                    raise ConnectionClosed("Disconnected from device")
                if state in (DeviceState.AUTH_ERROR, DeviceState.PAIRING_ERROR):
                    raise InvalidAuth("Invalid authentication, device requires to be paired again")
                raise CannotConnect(f"Device connection not active (state={state})")

            # workaround for "swallowed commands" since _atv.send_key_command doesn't provide a result
            # pylint: disable=W0212
            if (
                not (self._atv and self._atv._remote_message_protocol and self._atv._remote_message_protocol.transport)
                or self._atv._remote_message_protocol.transport.is_closing()
            ):
                _LOG.warning(
                    "[%s] Cannot send command, remote protocol is no longer active. Resetting connection.",
                    self.log_id,
                )
                self.disconnect()
                self._loop.create_task(self.connect())
                return ucapi.StatusCodes.SERVICE_UNAVAILABLE

            return await func(self, *args, **kwargs)
        except (CannotConnect, ConnectionClosed) as ex:
            _LOG.error("[%s] Cannot send command: %s", self.log_id, ex)
            return ucapi.StatusCodes.SERVICE_UNAVAILABLE
        except InvalidAuth as ex:
            _LOG.error("[%s] Cannot send command: %s", self.log_id, ex)
            return ucapi.StatusCodes.CONFLICT
        except ValueError as ex:
            _LOG.error("[%s] Cannot send command, invalid key_code: %s", self.log_id, ex)
            return ucapi.StatusCodes.BAD_REQUEST

How to Reproduce

  1. "Wait" for a "Cannot send command: Disconnected from device" raised from the tv.py
  2. The following commands are not processed and a red error is displayed on the remote

Expected behavior

Automatic reconnection

Integration version

No response

Additional context

No response

@albaintor albaintor added the bug Something isn't working label Apr 7, 2024
@albaintor
Copy link
Contributor Author

I'll try to figure out a fix : inside your wrapper I will add a reconnect task. It seems that the right way to reconnect is to call init() ?
I also added asyncio.locks.Lock : the "connecting" flag is not enough to prevent 2 simultaneous calls to the same method (I don't know why as python is safe thread but I saw that on another driver that I develop)
We'll see...

@zehnm
Copy link
Contributor

zehnm commented Apr 8, 2024

Please wait with changing the logic. There's already an automatic reconnect task in the androidtvremote2 library, plus the additional connection reset workaround in the wrapper. Also the connection is re-established after the Remote Two wakes up from standby.
We first have to find out the real cause. Just adding a reconnect in the wrapper will interfere with the reconnection in androidtvremote2! (https://github.com/tronikos/androidtvremote2/blob/v0.0.14/src/androidtvremote2/androidtv_remote.py#L240-L279).

  • Reconnection can take up to 30 seconds when the Android TV device is available.
    • Backoff factor and reconnection delay are hardcoded in androidtvremote2.
  • Returning SERVICE_UNAVAILABLE while not connected is correct.
    • We cannot wait until the device is connected. This can take a long time and would lead to further timeout errors.
    • The device state (connected / disconnected) is propagated to the UI. Unfortunately it still allows to send commands in the disconnected state. This has to be fixed.

I also require a bit more information:

  • The full log when you encounter this issue.
    The debug log of androidtvremote2 is now included as well, which will give much more information, including the reconnection.
  • What exact disconnect issues are you encountering?
    • No reconnection at all for several minutes?
    • (Slow) reconnect within 1-60 seconds?
    • When does it happen? Randomly while the Remote Two is active, or after Remote Two wakes up from standby?
  • What Android TV device do you have?
  • Are you running the beta firmware or are you running an external integration instance?

@albaintor
Copy link
Contributor Author

albaintor commented Apr 8, 2024

Hi,
Ok I remember you added a reconnection logic but didn't find where exactly.
Here is the logs of the driver side (I configured it as an external driver with the last git main branch :
uc-android.csv

I think I found the culprit : from the logs I can see a bug in the ucapi library (line number 755):
RuntimeError: Set changed size during iteration for websocket in self._clients File \"/usr/local/lib/python3.11/site-packages/ucapi/api.py\", line 267, in _broadcast_ws_event
ucapi/api.py :

    async def _broadcast_ws_event(
        self, msg: str, msg_data: dict[str, Any], category: uc.EventCategory
    ) -> None:
      ...
        for websocket in self._clients:
...

Should be replaced with : for websocket in self._clients.copy() because the list can change (a client may disconnect and reconnect while iterating)
This causes the websocket of the driver to crash.

After that I had to relaunch the driver as you can see in the logs because I had the red banner error message at each press

EDIT : here is the corresponding logs on the remote side, although it won't be helpful I think. There is -1 hour in those logs compared to the driver timestamps. So driver error at 18:50 -> remote error at 17:46
UCR2_logs_2024-04-08.zip
core ERROR [uc_androidtv2_driver] Error connecting to driver ws://192.168.1.10:9090/ws: ServiceUnavailable("Failed to connect to host: Connection refused (os error 111)")
and
2024-04-07 18:02:15.415424 +00:00 core ERROR [uc_androidtv2_driver] Websocket server heartbeat failed, disconnecting!
And then the driver restart made the connection work again with the remote

@albaintor
Copy link
Contributor Author

I have patched the method from the driver, we'll see if this is the only rootcause :

async def patched_broadcast_ws_event(
        self, msg: str, msg_data: dict[str, Any], category: uc.EventCategory
) -> None:
   ...
    for websocket in self._clients.copy():
        if _LOG.isEnabledFor(logging.DEBUG):
            _LOG.debug("[%s] ->: %s", websocket.remote_address, data_log)
        try:
            await websocket.send(data_dump)
        except websockets.exceptions.WebSocketException:
            pass

async def main():
    ...
    IntegrationAPI._broadcast_ws_event = patched_broadcast_ws_event
    await api.init("driver.json", setup_flow.driver_setup_handler)
    # Patched method

@albaintor
Copy link
Contributor Author

albaintor commented Apr 10, 2024

Hi,

I have made another test this time with the driver from the remote and I still have the problem.
Here are the logs and an extract :
UCR2_logs_2024-04-10.txt

2024-04-10 05:44:13.656904 +00:00 core NOTICE Removed standby inhibitor: Active activity 'Regarder Shield' 2024-04-10 05:44:13.656340 +00:00 core ERROR [Regarder Shield] error executing activity command uc_androidtv_driver.main.00044BE87F74:media_player.on (ignore error: false): Service unavailable 2024-04-10 05:44:13.654292 +00:00 intg-androidtv INFO DEBUG:ucapi.api:[('127.0.0.1', 48564)] ->: {"kind": "resp", "req_id": 12, "code": 503, "msg": "result", "msg_data": {}} 2024-04-10 05:44:13.654292 +00:00 intg-androidtv INFO ERROR:tv:[SHIELD Salon] Cannot send command: Disconnected from device 2024-04-10 05:44:13.654292 +00:00 intg-androidtv INFO INFO:driver:[SHIELD Salon] command: on 2024-04-10 05:44:13.650661 +00:00 intg-androidtv INFO DEBUG:ucapi.api:[('127.0.0.1', 48564)] <-: {"kind":"req","id":12,"msg":"entity_command","msg_data":{"cmd_id":"on","entity_id":"00044BE87F74","entity_type":"media_player","params":{}}}

The driver is in a dead state since a while and it seems to come from this event :
2024-04-09 19:15:18.076925 +00:00 intg-androidtv INFO DEBUG:androidtvremote2:Connection lost. Error: [SSL: APPLICATION_DATA_AFTER_CLOSE_NOTIFY] application data after close notify (_ssl.c:2706)

or previous ones :
2024-04-09 19:15:07.908713 +00:00 intg-androidtv INFO DEBUG:androidtvremote2:Couldn't connect to 192.168.1.20:6466. Error: Closed idle connection 2024-04-09 19:15:07.905920 +00:00 intg-androidtv INFO DEBUG:androidtvremote2:Closing idle connection 2024-04-09 19:15:07.741069 +00:00 intg-androidtv INFO DEBUG:androidtvremote2:Connected to ('192.168.1.20', 6466)

During that time I think that the remote had general connection issues (multiple error messages : no connection to the docks, appletv...). Except that the other drivers recovered their connections, although I didn't see any disconnection event from wifi

TO RESOLVE THE PROBLEM : I had to reboot the remote and it worked again.
I think something more has to be done around the reconnection logic : there is no reconnection attempts after "Error: Closed idle connection"

@albaintor
Copy link
Contributor Author

albaintor commented Apr 10, 2024

I checked after the logic, not sure about my analysis : the problem seems to occur when the remote itself has connections issues (wifi lost at line 4467 at 19:06:45 and then connection seems unstable) and takes some time to get connection back. This is another problem why there is this instability.

Error raised from remote.py:208
2024-04-09 19:15:07.908713 +00:00 intg-androidtv INFO DEBUG:androidtvremote2:Couldn't connect to 192.168.1.20:6466. Error: Closed idle connection

    async def _async_idle_disconnect(self) -> None:
        await asyncio.sleep(16)
        LOGGER.debug("Closing idle connection")
        if self.transport and not self.transport.is_closing():
            self.transport.close()
        if not self.on_con_lost.done():
            self.on_con_lost.set_result(Exception("Closed idle connection"))

Which is caught by async_connect in androidtv_remote.py:228:

async def async_connect(self) -> None:
...
             if on_con_lost.done():
            con_lost_exc = on_con_lost.result()
            LOGGER.debug(
                "Couldn't connect to %s:%s. Error: %s",
                self.host,
                self._api_port,
                con_lost_exc,
            )
            if isinstance(con_lost_exc, ssl.SSLError):
                raise InvalidAuth("Need to pair again") from con_lost_exc
            raise ConnectionClosed("Connection closed") from con_lost_exc

Something is wrong or that I don't understand : the on_con_lost callback is local in the async_connect method within androidtv_remote.py : so it is only caught during connection, not after. Why isn't it a global callback that triggers a reconnect call ?

This would explain why the following exception is not handled :
2024-04-09 19:15:18.076925 +00:00 intg-androidtv INFO DEBUG:androidtvremote2:Connection lost. Error: [SSL: APPLICATION_DATA_AFTER_CLOSE_NOTIFY] application data after close notify

@albaintor
Copy link
Contributor Author

I think I nailed it: the raise ConnectionClosed("Connection closed") from async_connect is not caught by _async_reconnect task. So the task will crash
It should be :

    async def _async_reconnect(
        self, invalid_auth_callback: Callable | None = None
    ) -> None:
        while self._remote_message_protocol:
            exc = await self._remote_message_protocol.on_con_lost
            self._on_is_available_updated(False)
            LOGGER.debug("Disconnected from %s. Error: %s", self.host, exc)
            delay_seconds = 0.1
            LOGGER.debug(
                "Trying to reconnect to %s in %s seconds", self.host, delay_seconds
            )
            while self._remote_message_protocol:
                await asyncio.sleep(delay_seconds)
                try:
                    await self.async_connect()
                    self._on_is_available_updated(True)
                    break
                except (CannotConnect, ConnectionClosed) as exc:
                    delay_seconds = min(2 * delay_seconds, 30)
                    LOGGER.debug(
                        "Couldn't reconnect to %s. Will retry in %s seconds. Error: %s",
                        self.host,
                        delay_seconds,
                        exc,
                    )
                except InvalidAuth as exc:
                    LOGGER.debug(
                        "Couldn't reconnect to %s. Won't retry. Error: %s",
                        self.host,
                        exc,
                    )
                    if invalid_auth_callback:
                        invalid_auth_callback()
                    return

@albaintor
Copy link
Contributor Author

albaintor commented Apr 10, 2024

I made a local copy of the androidtvremote2 library and made the change, will see if it resolves the issue
EDIT : the same problem again, just right after a wifi disconnection. This is really strange that I have all this disconnections (the remote sits next to a Deco wifi 6E mesh, in a non crowdy environment)...

Anyway the same consequence : "androidtvremote2:Connection lost. Error: None" line 732
UCR2_logs_2024-04-10 (1).zip

This is not exactly the same chain of errors but maybe the same cause : the "Trying to reconnect " doesn't seem to be executed and then no more reconnection attempts.. so I guess that the reconnection task crashed
I cannot test the fix on my side because my docker instance never loses network connection, unlike the remote

@zehnm
Copy link
Contributor

zehnm commented Apr 10, 2024

Thanks for the logs and futher analysis. I'll look into it this week.

Just saw the following while quickly glancing over the logs: androidtvremote2:Couldn't connect to 192.168.1.20:6466. Error: SSL handshake is taking longer than 60.0 seconds: aborting the connection.
The reconnection task might still be active, just hanging for a long time instead of trying again in shorter intervals.
It probably makes sense to try setting different connection timeouts in androidtvremote2 and adding more logs to pinpoint the different issues. A 1 minute timeout seems way too long. This also depends on the used communication libraries. The default timeouts are not always ideal (or not even set). I have little experience how Python handles this, but worth looking into.

Should be replaced with : for websocket in self._clients.copy() because the list can change (a client may disconnect and reconnect while iterating)

I'll fix this in the integration wrapper. This can only happen when running it as an external integration. On the Remote itself it's a single connection with the core service only and never disconnects (unless there's something seriously wrong).

@albaintor
Copy link
Contributor Author

Hi Markus, indeed the second issue should occur only on external integrations

About the main problem, there may be some tuning around timeout, but this is not the only cause as in the previous logs the reconnection will never occur. I may be wrong but I think that adding the ConnectionClosed in the try/catch in _async_reconnect would resolve the problem.

@albaintor
Copy link
Contributor Author

Hi again, I had the same issue again new logs here :
UCR2_logs_2024-04-12.zip

And the same exception :
2024-04-11 17:53:08.234992 +00:00 intg-androidtv INFO DEBUG:androidtvremote2:Connection lost. Error: None
This follows a reconnection to network. The other drivers are fine, only android is KO

@albaintor
Copy link
Contributor Author

albaintor commented Apr 19, 2024

Hi,

following mails exchange with Marton, this problem is confirmed to be a side effect of disconnections due to wifi dropouts (due to AP handoff not handled correctly) : my setup is a wifi mesh (tplink) which is incompatible with the remote two. I had to setup a third party wifi router (not mesh) and now I don't have any disconnections and the problem didn't occur again.
But we agree that this issue has to be fixed because wifi drops can occur

@albaintor
Copy link
Contributor Author

Hi @zehnm , I submitted a PR to androidtvremote2 project : tronikos/androidtvremote2#21

@albaintor
Copy link
Contributor Author

Hi again, the PR has been merged and a new release is available : can you update the androidtv integration with this new release : https://github.com/tronikos/androidtvremote2/releases/tag/v0.0.15

Thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants