Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Confusing desync with blocking calls #3143

Open
FeldrinH opened this issue May 29, 2024 · 15 comments · May be fixed by #3199
Open

Confusing desync with blocking calls #3143

FeldrinH opened this issue May 29, 2024 · 15 comments · May be fixed by #3199

Comments

@FeldrinH
Copy link

FeldrinH commented May 29, 2024

Description

Doing blocking calls in some places (e.g. in event handlers) can sometimes cause silent desync between the browser and Python.

For example, take this code:

import time
from nicegui import ui

counter = 0

def increment():
    global counter
    counter += 1
    time.sleep(5)
    label.text = counter
    log.push(f'Increment {counter}')

ui.button('Increment', on_click=increment)
ui.button('Print', on_click=lambda: log.push('Print'))
label = ui.label('-')
log = ui.log()

ui.run()

Here is an example of me playing around with this code:

NiceGUI.-.Google.Chrome.2024-05-30.02-34-09.mp4

Note in particular two things:

  1. After the first increment completes, the GUI works, but the counter and log are out of sync with the actual state in Python (which is revealed when reloading the page).
  2. There is no indication that anything has gone wrong. The desync occurs completely silently. (Sometimes a 'Connection lost' message pops up briefly, but in this case it didn't. Whether it shows up or not seems to be random.)

Now obviously you're not supposed to use long-running blocking calls in event handlers, because it blocks the async event loop. However, the problem is that if you accidentally introduce a blocking call then it can lead to silent desync which is very confusing and hard to debug (speaking from experience).

In my opinion, there are at least two issues here:

  1. These desyncs can occur with no indication that anything has gone wrong.
  2. Even after the system recovers from the blocking call and resumes working as normal the desync is not corrected.
@falkoschindler
Copy link
Contributor

Hi @FeldrinH,

That's an interesting observation. If I understand correctly, "desync" basically means that the UI is not updated after increment() has been called, right? All log messages - if displayed after reload - are in the right order.

So it boils down to something like this:

def test():
    time.sleep(5)
    button.text += '!'

button = ui.button('Test', on_click=test)

With a delay of 4 seconds the UI is correctly updated, but not with 5 seconds.

I guess what is happening is that the socket connection disconnects while the server is sleeping, but afterwards the server automatically reconnects. Until the reconnect is done, updates get lost.

We'll need to investigate this more thoroughly to see if we can do anything about it.

@falkoschindler
Copy link
Contributor

Here is another simple example to reproduce lost messages:

@ui.page('/', reconnect_timeout=10.0)
def page():
    log = ui.log()
    ui.timer(1.0, lambda: log.push(f'{time.time():.0f}'))

When disabling the network connection for a few seconds, there will be a gap in the logged time stamps.

@afullerx
Copy link
Contributor

afullerx commented Jun 2, 2024

Did you want to look into this one yourself @falkoschindler? If not, I'd be happy to take it on.

Using your example, I already discovered that client.has_socket_connection does not seem to be evaluating to false when the connection is interrupted. This leads to outbox continuing to dispatch updates, which are then lost.

Let me know if I should keep going or not.

@falkoschindler
Copy link
Contributor

@afullerx Feel free to take it on.

Questioning client.has_socket_connection is an interesting approach. But we might not always know the exact state, because there are several intervals and timeouts involved before the connection is considered lost:

nicegui/nicegui/nicegui.py

Lines 122 to 123 in 2e00244

sio.eio.ping_interval = max(app.config.reconnect_timeout * 0.8, 4)
sio.eio.ping_timeout = max(app.config.reconnect_timeout * 0.4, 2)

To ensure every single message is received by the client, I was thinking about adding some UID to each socket message as well as another outbox buffer collecting all sent messages per client (or target_id). Then we could modify the handshake to include the UID of the last successfully transferred message, such that the server can re-send buffered messages.

@afullerx
Copy link
Contributor

afullerx commented Jun 3, 2024

@falkoschindler I think I spoke too soon on this one. While fixing client.has_socket_connection resolves the issue when using your example and manually disabling networking, it doesn't work for the OP's example.

I completely agree that in the absence of perfect knowledge about the connection state, a system of message accounting is the only way to completely ensure delivery.

I think it's still worth submitting a pull request to improve has_socket_connection. Right now, it's a property that just checks if client.tab_id has been set. As far as I can tell, it will never evaluate to false once the initial connection is established. Given that, it doesn't make sense to be checking it on every iteration of outbox.loop, as we currently are.

I was thinking it should be changed to a simple attribute that would be toggled on and off to reflect the connection state as it's known to Socket.IO. I realize this might be irrelevant in this case if message handling is reworked. It could still be a worthwhile improvement to bring its behavior in line with its name and avoid any future confusion. A minimal alternative would be to just change the docstring to reflect the fact that it only tracks the initial connection.

@me21
Copy link
Contributor

me21 commented Jun 4, 2024

Hmmm, isn't the connection between NiceGUI and browser based on the websocket which is based on TCP? How could messages be lost in that case? 🤔
Surely, the connection can be broken, but in that case we need to somehow get the last message that was delivered to the browser. I think that the browser should acknowledge the received messages back. Only after the message is ackowledged it should be removed from the queue. Otherwise after the connection is reestablished, the messages should be retransmitted.

@afullerx
Copy link
Contributor

afullerx commented Jun 4, 2024

@me21 The message retransmission described above by @falkoschindler has the same purpose. Acknowledging every message would probably be excessive overhead since most of the time, the connection is fine.

@afullerx
Copy link
Contributor

afullerx commented Jun 4, 2024

@me21 You have me thinking though. If there's a chance that a message could fail to be delivered, but the connection could recover without a subsequent reconnection. Then acknowledging every message would be necessary. Hopefully, Socket.IO wouldn't allow this.

Edit: Actually, this would still be OK because the client could detect a gap in the message ID's it's receiving and request retransmission of any missed ones or simply perform a reconnection to synchronize.

@FeldrinH
Copy link
Author

FeldrinH commented Jun 4, 2024

For Socket.IO this part of the official documentation might be helpful: https://socket.io/docs/v4/tutorial/handling-disconnections.

@afullerx
Copy link
Contributor

afullerx commented Jun 4, 2024

"Connection state recovery" looks interesting:
https://socket.io/docs/v4/connection-state-recovery
https://socket.io/docs/v4/tutorial/step-6

I'm surprised to see this because on another page of the documentation, it says if you want this kind of functionality, you have to implement it yourself. I'll have to take a close look at this later.

@afullerx
Copy link
Contributor

afullerx commented Jun 5, 2024

Unfortunately, "Connection state recovery" doesn't appear to be implemented for the Python version.

@me21
Copy link
Contributor

me21 commented Jun 5, 2024

@me21 You have me thinking though. If there's a chance that a message could fail to be delivered, but the connection could recover without a subsequent reconnection. Then acknowledging every message would be necessary. Hopefully, Socket.IO wouldn't allow this.

I think it's impossible to lose a message without the need to reconnect. The underlying TCP protocol implements a stream of bytes meaning it guarantees all data to be received in the same order it was sent.

@afullerx
Copy link
Contributor

afullerx commented Jun 5, 2024

@me21 That's true, but since we're relying on the socket.io library to handle the underlying TCP connection, we're at the mercy of its behavior. The documentation does say the sequence of messages is guaranteed though, just not the delivery. So, I think you're right that it's a non-issue.

@afullerx
Copy link
Contributor

afullerx commented Jun 6, 2024

How long should the message history be retained? The longest we can have a non-functional connection before socket.io detects it is pingInterval + pingTimeout. In our case, as @falkoschindler pointed out above:

sio.eio.ping_interval = max(app.config.reconnect_timeout * 0.8, 4)
sio.eio.ping_timeout = max(app.config.reconnect_timeout * 0.4, 2)

ui.page

For the case of ui.page we get:

dead_connection_max = sio.eio.ping_interval + sio.eio.ping_timeout
history_duration = dead_connection_max + page.reconnect_timeout

For the default case of reconnect_timeout = 3.0, this means we only need 9 seconds of message history. Although, with this change, much longer timeouts could make sense.

Auto-index page

In the case of the auto-index page, the client can reconnect after any amount of time. This leaves us two choices:

  1. Keep messages for the app's lifetime.
  2. Keep messages for a reasonable time (TBD), and force a full page reload for disconnections longer than this.

Number 2 makes more sense both from a resource standpoint, and from the user standpoint. You wouldn't want a whole bunch of stale notifications popping up after an extended disconnection, for example. You would just want the current UI state.

@afullerx
Copy link
Contributor

afullerx commented Jun 7, 2024

I've got the message history/retransmission pretty much done. I hugely overestimated how much time it would take. I'm going to blame it on how easy it is to work in this codebase. 😉

@afullerx afullerx linked a pull request Jun 10, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants