Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Desktop and mobile apps loses connection to server #3449

Closed
8 of 21 tasks
vitoreiji opened this issue Sep 1, 2021 · 8 comments · Fixed by #3942
Closed
8 of 21 tasks

Desktop and mobile apps loses connection to server #3449

vitoreiji opened this issue Sep 1, 2021 · 8 comments · Fixed by #3942
Labels
android issues that concern the android client but not all clients bug broken functionality, usability problems, unexpected errors desktop Desktop client related issues ios issues that concern the ios client but not all clients state:done meets our definition of done state:tested We tested it and are about to release it
Milestone

Comments

@vitoreiji
Copy link
Contributor

vitoreiji commented Sep 1, 2021

We are getting reports from macOS desktop app users that the app stops getting updates and they are unable to move/delete mails.

  • The issue persists until the app is restarted.
  • Notifications are received.
  • Screenshots show that the "reconnecting" bar is not displayed.
  • The webapp on the same machine is able to get the updates.

App versions: at least 3.85.10 - 3.87.1

Possibly related to #2337

Test notes

  • Suspend your desktop device. Wait at least 15 seconds. Wake device up. Websocket should disconnect immediately.

Websocket Spec for regression

  • Before logging in Websocket is not connected
  • Once logged in app connects Websocket to the server
  • While app is establishing the connecting the indefinite progress bar is shown on top of the app
  • Once connection is established app shows progress bar for synchronization. Progress bar approximately indicates each step of sync (there's a step for refreshing the events of each group plus few more steps)
  • During the initial synchronization app downloads the latest entityEventBatch for each group to determine the point for the later sync
  • Once the initial sync is done the progress bar is not shown anymore and app can process realtime updates
  • In response to modifications on this or another client app receives websocket updates. With those updates app updates the cache, indexer and the rest of the app
  • For each group we remember the latest received event for downloading missing events when we reconnect
  • If we lose connection to the server one way or another we go into disconnected state. The indefinite progress bar is shown and no updates are received
  • While browser thinks that it is offline app does not try to reconnect
  • Once browser is online again app tries to establish websocket connection again
  • Once websocket connection is re-established we download all missing entityEventBatches for each group. We download all updates since the last update we received
  • Updates are sorted by the time they occurred and processed in that order
  • After connecting app already receives updates but they are not processed until all the missing events are processed
  • Updates are optimized e.g. update and consequent delete will not cause the app to download the instance because it is already known that the instance cannot be updated
  • During reconnect app shows the same sync progress bar as during the initial sync
  • If the app is online and is disconnected it will pick a random delay time to reconnect. Small interval on first attempt, medium interval on second and large interval on third attempt or if the server closed connection normally
  • iOS/Android: if app is coming from background it will try to establish connecting almost immediately if it's not there
  • Android: after 30s in background the app will disconnect websocket
  • Websocket is connected during the signup to receive PayPal updates
@vitoreiji vitoreiji added bug broken functionality, usability problems, unexpected errors desktop Desktop client related issues labels Sep 1, 2021
@vitoreiji
Copy link
Contributor Author

We got some log files and the last line on the worker log shows a "failed request", PUT, .../rest/tutanota/mail/..., 404.

@charlag
Copy link
Contributor

charlag commented Sep 10, 2021

Turns out it is related to the sleep mode. Steps to reproduce:

  1. Login on desktop application
  2. Turn your computer into sleep mode
  3. Receive a new email.
  4. Read it on your mobile phone and delete it.
  5. When you turn on your computer you will still receive the notification for that email but it is not visible as it has been deleted already.

@ghost ghost added android issues that concern the android client but not all clients ios issues that concern the ios client but not all clients labels Oct 21, 2021
@ghost
Copy link

ghost commented Oct 21, 2021

This issue is also occurring on Windows devices and mobile apps.

@ghost ghost changed the title [macOS] Desktop app loses connection to server Desktop and mobile apps loses connection to server Oct 21, 2021
@bedhub bedhub added this to the Next release milestone Oct 29, 2021
@bedhub bedhub removed this from the Next release milestone Jan 31, 2022
@charlag charlag self-assigned this Feb 24, 2022
@charlag
Copy link
Contributor

charlag commented Feb 24, 2022

I struggle to reproduce it on mac but I was able to reproduce in on Linux with sleep mode.

I would suggest detecting the sleep mode by timer and forcible reconnecting websocket. By spec websocket must send pings and pongs but somehow it takes a long time for it to recover.

charlag added a commit that referenced this issue Mar 3, 2022
It seems like the issue is caused by how Chromium handles suspension
(or rather doesn't). After coming back from suspension of the device it
doesn't detect that the connection might have been broken until ping
interval expires (which is minutes for us). In this time websocket is in
the peculiar state where it thinks that it is connected, but it missed
some events that were sent by the server but not acknowledged by the
client. Once ping interval occurs the server would re-send all the
events missed by the client which is what we would want ideally, but it
seems to be too slow to be reliable. In this time users don't see the
new state and the results of their actions are not displayed. This is
a suboptimal experience.

To address this we introduced SleepDetector. It uses simple idea with
scheduling a recurring task and measuring the time between invocations.
Our time period is hopefully loose enough that slight throttling by
browsers won't affect it. Once we realize that more time than needed
has passed we force reconnect.

Alternatively, we could try to send an event over the socket to force
re-synchronization, but it might not produce the result we want,
restarting the connection seems to be the most reliable method.
@charlag charlag added this to the after 3.93.5 milestone Mar 7, 2022
@charlag charlag added the state:done meets our definition of done label Mar 9, 2022
charlag added a commit that referenced this issue Mar 9, 2022
It seems like the issue is caused by how Chromium handles suspension
(or rather doesn't). After coming back from suspension of the device it
doesn't detect that the connection might have been broken until ping
interval expires (which is minutes for us). In this time websocket is in
the peculiar state where it thinks that it is connected, but it missed
some events that were sent by the server but not acknowledged by the
client. Once ping interval occurs the server would re-send all the
events missed by the client which is what we would want ideally, but it
seems to be too slow to be reliable. In this time users don't see the
new state and the results of their actions are not displayed. This is
a suboptimal experience.

To address this we introduced SleepDetector. It uses simple idea with
scheduling a recurring task and measuring the time between invocations.
Our time period is hopefully loose enough that slight throttling by
browsers won't affect it. Once we realize that more time than needed
has passed we force reconnect.

Alternatively, we could try to send an event over the socket to force
re-synchronization, but it might not produce the result we want,
restarting the connection seems to be the most reliable method.
@charlag charlag removed their assignment Mar 9, 2022
@ganthern ganthern self-assigned this Mar 10, 2022
@ganthern
Copy link
Contributor

ganthern commented Mar 10, 2022

Reconnect works, but it seems a little noisy with ERR_NOT_RESOLVED errors before it manages to get back on track. Just in case that's problematic:

log
worker init time (ms): 152
Logger.ts:89 _initIndexer after log in
Logger.ts:89 ws connect reconnect: false state: automatic
Logger.ts:89 ws displayed state:  0
Logger.ts:89 ws open state: automatic
Logger.ts:89 ws displayed state:  1
Logger.ts:89 New leader status set: true
Logger.ts:89 disabling outline
Logger.ts:89 ws sleep detected, reconnecting...
Logger.ts:89 ws tryReconnect closeIfOpen: true enableAutomaticState: true delay: null
Logger.ts:89 ws reconnect socket.readyState: (CONNECTING=0, OPEN=1, CLOSING=2, CLOSED=3): 1 state: automatic closeIfOpen: true enableAutomaticState: true
Logger.ts:89 2022-03-10T09:01:39.474Z offline - pause event bus
Logger.ts:89 ws close closeOption:  pause state: automatic
Logger.ts:89 ws displayed state:  0
Logger.ts:89 2022-03-10T09:01:39.697Z online - try reconnect
Logger.ts:89 ws tryReconnect closeIfOpen: true enableAutomaticState: true delay: 2000
Logger.ts:89 ws reconnect socket.readyState: (CONNECTING=0, OPEN=1, CLOSING=2, CLOSED=3): 2 state: suspended closeIfOpen: true enableAutomaticState: true
Logger.ts:89 ws connect reconnect: true state: automatic
Logger.ts:89 ws displayed state:  0
WorkerLocator.ts:155 WebSocket connection to 'wss://test.tutanota.com/event?modelVersions=72.51&clientVersion=3.93.6&userId=MfCQs9T----0&accessToken=XrNXrNXrNXrNXrNXrNXrNXrNXrNXrNXrNQ' failed: Error in connection establishment: net::ERR_NAME_NOT_RESOLVED
(anonymous) @ WorkerLocator.ts:155
connect @ EventBusClient.ts:199
(anonymous) @ EventBusClient.ts:597
Logger.ts:89 ws error: Event {isTrusted: true, type: 'error', target: WebSocket, currentTarget: WebSocket, eventPhase: 2, …} {"isTrusted":true} state: automatic
Logger.ts:89 ws close event: CloseEvent {isTrusted: true, wasClean: false, code: 1006, reason: '', type: 'close', …} state: automatic
Logger.ts:89 New leader status set: false
Logger.ts:89 ws displayed state:  0
Logger.ts:89 ws tryReconnect closeIfOpen: false enableAutomaticState: false delay: null
Logger.ts:89 ws reconnect socket.readyState: (CONNECTING=0, OPEN=1, CLOSING=2, CLOSED=3): 3 state: automatic closeIfOpen: false enableAutomaticState: false
Logger.ts:89 ws connect reconnect: true state: automatic
Logger.ts:89 ws displayed state:  0
WorkerLocator.ts:155 WebSocket connection to 'wss://test.tutanota.com/event?modelVersions=72.51&clientVersion=3.93.6&userId=MfCQs9T----0&accessToken=XrNXrNXrNXrNXrNXrNXrNXrNXrNXrNXrNQ' failed: Error in connection establishment: net::ERR_NAME_NOT_RESOLVED
(anonymous) @ WorkerLocator.ts:155
connect @ EventBusClient.ts:199
(anonymous) @ EventBusClient.ts:597
Logger.ts:89 ws error: Event {isTrusted: true, type: 'error', target: WebSocket, currentTarget: WebSocket, eventPhase: 2, …} {"isTrusted":true} state: automatic
Logger.ts:89 ws close event: CloseEvent {isTrusted: true, wasClean: false, code: 1006, reason: '', type: 'close', …} state: automatic
Logger.ts:89 New leader status set: false
Logger.ts:89 ws displayed state:  0
Logger.ts:89 ws tryReconnect closeIfOpen: false enableAutomaticState: false delay: 40000
Logger.ts:89 ws reconnect socket.readyState: (CONNECTING=0, OPEN=1, CLOSING=2, CLOSED=3): 3 state: automatic closeIfOpen: false enableAutomaticState: false
Logger.ts:89 ws connect reconnect: true state: automatic
Logger.ts:89 ws displayed state:  0
Logger.ts:89 ws open state: automatic
Logger.ts:89 ws displayed state:  1
Logger.ts:89 New leader status set: true

This does not occur every time, suspend during signup was fine

@charlag
Copy link
Contributor

charlag commented Mar 10, 2022

I think the ERR_NOT_RESOLVED is just part of our RestClient. It's odd that it failed reconnect at first, probably because you have Wifi, it probably is disconnecting during sleep, best tested with cable. I also tested it, seems fine

@ganthern
Copy link
Contributor

on windows, it doesn't show the ws sleep detected message for sleep or hibernation, but otherwise gets updates fine.

@ganthern ganthern added the state:tested We tested it and are about to release it label Mar 10, 2022
@charlag
Copy link
Contributor

charlag commented Mar 10, 2022

Tested on iOS, no change in behavior, immediate reconnect

@ganthern ganthern removed their assignment Dec 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
android issues that concern the android client but not all clients bug broken functionality, usability problems, unexpected errors desktop Desktop client related issues ios issues that concern the ios client but not all clients state:done meets our definition of done state:tested We tested it and are about to release it
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants