New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pubsub masks complex deadlock bugs. #951
Comments
Suggested solution:
The general rule here is that pubsub is blocking and synchronous, while the subscriber must provide non-blocking callback functions. Why not provide or require the subscriber to pass in a 0 capacity channel? That makes pubsub handle blocking channels. Then what? Depending on the subscriber, failure to deliver a message may or may not be fatal. The subscriber should deal with it, not pubsub. What's the benefit again? To reveal deadlock issues sooner, e.g. if in the example above, if Should pubsub even use an internal channel? Why not use mtxs? Nothing wrong with mutexes, but IFF the code is correct, then you may be able to increase performance by increasing the pubsub internal channel capacity. |
👍 I prefer channels to pass state between goroutines, but I get your point. Actually, believe it or not I was going to change pubsub interface to return ch with 0 cap here #893.
not necessary. as with callbacks!, we can make it a contract that every reader must be non-blocking. the code you wrote:
|
Ok, great. :)
You still can, (e.g. mustSendMessage uses egressCh). You're not forced to start a goroutine though, which I think is nice.
I don't think so. You can't always assume that you can immediately send onto a channel, because the client may be slow. So if you do the code snippet above, the pubsub clients would randomly be closed with error. In order to give flexibility to the client (i.e. the may or may not be fatal part), you end up passing this information into the client in order to get guidance on whether to close the connection with error or not. |
client may be slow in case of callbacks too! that does not change anything. But I see the benefit of callbacks where you are free to use whatever you want (channels, mutexes). Anyway, callbacks have some problems too https://github.com/tendermint/internal/issues/36#issuecomment-350960775 |
We no longer use the async pubsub for internal events, just for the RPC. Can we close this ? |
@ebuchman Can we create a follow-up issue to remove synchronous pubsub for internal communication? |
Done #1692 |
Sweet, let's close and re-open if the issue surfaces again. |
I think this is a design issue rather than a particular bug. Let's keep it open for now to mark the discussion |
Even if we decide to stay with channels, we still need to change a few things.
|
Refs #951 Jae: I don't know a good way to catch these errors in general, but forcing pubsub's internal channel to have a capacity of 0 will reveal bugs sooner, if the subscriber also has a 0 or small capacity ch to pull from.
The pubsub channel can fill up and end up blocking the entire program.
This can happen with sufficient activity and a single function somewhere that (in the same goroutine) reads from and also writes to the pubsub.
The example above is problematic because between line 1 and line 2, another goroutine somewhere else may have published a million other events, and by the time we run line 2, we may be blocking. It helps that ch's capacity is larger, but it's still not correct because ch may get full depending on goroutine racing.
I believe other pubsub solves this with persistent storage.
I don't know a good way to catch these errors in general, but forcing pubsub's internal channel to have a capacity of 0 will reveal bugs sooner, if the subscriber also has a 0 or small capacity
ch
to pull from.So currently we have pubsub with an elastic inner capacity (which I've tried to prove is actually bad), which somewhat (but not completely) compensates by having subscriber
ch
be flexible. But I think this bad in that it's prone to being used incorrectly resulting in buggy software.An alternative is to have a synchronous pubsub/event system that only provides the caller with a 0-capacity
ch
to receive from, or a callback function (I prefer callback functions when they suffice, I'll explain more later here: https://github.com/tendermint/internal/issues/36). AND, we create something on the client connection side (i.e. outside the pubsub module) that does the following:The text was updated successfully, but these errors were encountered: