Event system can cause deadlock #5372

Closed
calmh opened this Issue Dec 12, 2018 · 1 comment

calmh commented Dec 12, 2018

There's an issue with the event system that can cause what is effectively a deadlock during normal operation. It's not a logical deadlock; it's just that sending events can end up taking forever, which blocks whatever the sender was doing. We often send events while holding other locks, which makes it worse.

Specifically, when an event listener no longer polls for events, the Log() call blocks for 150 ms per non-polling listener. If there are several dead listeners this adds up, and since the lock is held for the whole duration, other Log() calls are blocked during this time as well.
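
To make the mechanism concrete, here is a rough model of the delivery path as described; the package layout, type names and fields are assumptions for illustration, not the actual lib/events code:

```go
package events

import (
	"sync"
	"time"
)

// Assumed types, mirroring the description rather than the real package.
type EventType uint64

type Event struct {
	Type EventType
	Data interface{}
}

type subscription struct {
	mask   EventType
	events chan Event // buffered, read by the subscriber
}

type Logger struct {
	mutex sync.Mutex
	subs  []*subscription
}

const eventLogTimeout = 150 * time.Millisecond

func (l *Logger) Log(t EventType, data interface{}) {
	l.mutex.Lock()
	defer l.mutex.Unlock()
	e := Event{Type: t, Data: data}
	for _, s := range l.subs {
		if t&s.mask == 0 {
			continue
		}
		timer := time.NewTimer(eventLogTimeout)
		select {
		case s.events <- e:
			timer.Stop()
		case <-timer.C:
			// The subscriber isn't reading: we just spent 150 ms holding
			// l.mutex, and we'll do it again for every dead listener on
			// every single Log() call.
		}
	}
}
```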

There are a couple of places where we can end up with dead event listeners. The watch aggregator can exit but never unsubscribes. More seriously, anything that 1) ceases to process events and then 2) unsubscribes, perhaps as part of a defer, causes this issue. For example, sendIndexes.

The reason for the latter is that once we stop reading events, the Unsubscribe() can take effectively forever. There might be a thousand event senders producing events quite frequently (for example, lots of folders which are scanning on a short interval). These all compete for the mutex which must be held to unsubscribe. As each of these calls can take over a second and there are thousands and thousands of them waiting for the lock, the odds that our Unsubscribe() call actually gets to acquire the lock in reasonable time are slim.
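
In hypothetical form, the problematic consumer pattern looks something like this (consumeUntilStopped and its parameters are invented for illustration and build on the sketch above; sendIndexes itself is structured differently):

```go
// A consumer in the spirit of sendIndexes: it subscribes, reads events for a
// while, then stops reading and unsubscribes via a defer.
func consumeUntilStopped(evs <-chan Event, unsubscribe func(), stop <-chan struct{}) {
	defer unsubscribe() // must win the same mutex every queued Log() is fighting for

	for {
		select {
		case ev := <-evs:
			_ = ev // process the event
		case <-stop:
			// From here on nobody reads evs, so each Log() call burns its
			// full 150 ms timeout on this subscription until the deferred
			// unsubscribe finally gets the lock -- which can take a very
			// long time under heavy event traffic.
			return
		}
	}
}
```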

I have a test that reproduces this sad situation.

I see several potential solutions. First of all, we must ensure that every subscribe is matched by an unsubscribe. Then:

  1. We can use a priority lock to prioritize Subscribe()/Unsubscribe() over Log(); see the sketch after this list. This works nicely in my testing, because the unsubscribe gets to happen ahead of the queue and then everything else is quick again.

  2. We can use BufferedSubscriptions() everywhere, as these continue reading the actual event stream in the background until the unsubscribe has actually gotten through.

  3. We can make the event system more lossy, putting the burden to keep up with the events more on the receiving side. A normal Subscription already has a buffered channel; we could simply skip the timeout handling in Log() and use a non-blocking send.
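
For option 1, a minimal sketch of what such a priority lock could look like, using a sync.Cond in the same hypothetical package as the sketch above (an illustration, not the lock Syncthing actually adopted):

```go
// prioMutex lets "priority" acquirers (Subscribe/Unsubscribe) skip ahead of
// normal acquirers (Log).
type prioMutex struct {
	mut     sync.Mutex
	cond    *sync.Cond
	waiting int  // priority acquirers currently waiting
	held    bool // whether the lock is taken
}

func newPrioMutex() *prioMutex {
	m := &prioMutex{}
	m.cond = sync.NewCond(&m.mut)
	return m
}

// Lock is the normal acquisition, used by Log(). It yields to any waiting
// priority acquirer.
func (m *prioMutex) Lock() {
	m.mut.Lock()
	for m.held || m.waiting > 0 {
		m.cond.Wait()
	}
	m.held = true
	m.mut.Unlock()
}

// PLock is the priority acquisition, used by Subscribe()/Unsubscribe().
func (m *prioMutex) PLock() {
	m.mut.Lock()
	m.waiting++
	for m.held {
		m.cond.Wait()
	}
	m.waiting--
	m.held = true
	m.mut.Unlock()
}

func (m *prioMutex) Unlock() {
	m.mut.Lock()
	m.held = false
	m.cond.Broadcast()
	m.mut.Unlock()
}
```

Log() would use Lock()/Unlock() and Subscribe()/Unsubscribe() would use PLock()/Unlock(); a waiting unsubscribe then gets the lock as soon as the current Log() call releases it, instead of queueing behind every other pending Log().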

I think we should do number 1 at minimum, and possibly number 3. In the cases where we expect many events and have to catch 'em all, that's what the buffered subscription is for.
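
For option 3, the change is roughly this, again against the hypothetical Logger from the first sketch: drop the timeout and make the send non-blocking.

```go
// Lossy delivery: if the subscription's buffered channel is full, drop the
// event for that subscriber instead of waiting up to 150 ms.
func (l *Logger) Log(t EventType, data interface{}) {
	l.mutex.Lock()
	defer l.mutex.Unlock()
	e := Event{Type: t, Data: data}
	for _, s := range l.subs {
		if t&s.mask == 0 {
			continue
		}
		select {
		case s.events <- e:
		default:
			// Receiver isn't keeping up; it loses this event rather than
			// stalling every other Log() caller.
		}
	}
}
```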

As a motivator, here's a massive backtrace showing the problem (on 0.14.48).

backtrace.txt

calmh added the bug label Dec 12, 2018

calmh self-assigned this Dec 12, 2018

imsodin commented Dec 12, 2018

What about making Logger a "service", i.e. instead of locking, send to a channel and have the usual goroutine with a central for loop, which can then easily prioritize subscription operations over logging?

btw: Very nice find!

calmh added a commit to calmh/syncthing that referenced this issue Dec 13, 2018

lib/events: Become a service (fixes syncthing#5372)
Here the event Logger is rewritten as a service with a main loop instead
of mutexes. This loop has a select with essentially two legs: incoming
events, and subscription changes. When both are possible, select will
choose one randomly, thus ensuring that in practice unsubscribes will
happen in a timely manner and not block the system.
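
Very roughly, the shape of that loop, reusing the hypothetical types from the sketches in the issue description (the real implementation is in the commit):

```go
// A Logger as a service: one goroutine owns the subscriber list, so no mutex
// is needed and subscription changes compete fairly with event delivery.
type serviceLogger struct {
	events chan Event
	funcs  chan func() // Subscribe/Unsubscribe posted here as closures
	stop   chan struct{}
	subs   []*subscription
}

func (l *serviceLogger) Serve() {
	for {
		select {
		case e := <-l.events:
			for _, s := range l.subs {
				if e.Type&s.mask == 0 {
					continue
				}
				select {
				case s.events <- e:
				default: // or a short timeout, as before
				}
			}
		case f := <-l.funcs:
			// Runs in the loop goroutine; when both legs are ready, select
			// picks one at random, so an unsubscribe never starves behind
			// a long queue of pending events.
			f()
		case <-l.stop:
			return
		}
	}
}

func (l *serviceLogger) Log(t EventType, data interface{}) {
	l.events <- Event{Type: t, Data: data}
}

func (l *serviceLogger) Unsubscribe(s *subscription) {
	done := make(chan struct{})
	l.funcs <- func() {
		for i, cur := range l.subs {
			if cur == s {
				l.subs = append(l.subs[:i], l.subs[i+1:]...)
				break
			}
		}
		close(done)
	}
	<-done
}
```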

calmh added a commit to calmh/syncthing that referenced this issue Dec 13, 2018

imsodin added a commit that referenced this issue Dec 13, 2018

calmh closed this in abb3fb8 Dec 13, 2018

calmh added this to the v0.14.55 milestone Dec 14, 2018
