Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pubsub masks complex deadlock bugs. #951

Closed
jaekwon opened this issue Dec 11, 2017 · 11 comments
Closed

pubsub masks complex deadlock bugs. #951

jaekwon opened this issue Dec 11, 2017 · 11 comments
Assignees
Labels
T:bug Type Bug (Confirmed) T:security Type: Security (specify priority)
Milestone

Comments

@jaekwon
Copy link
Contributor

jaekwon commented Dec 11, 2017

The pubsub channel can fill up and end up blocking the entire program.

This can happen with sufficient activity and a single function somewhere that (in the same goroutine) reads from and also writes to the pubsub.

func main() {
  // ...
  pubsub.Subscribe(ctx, "client", query, ch)
  go myRoutine(ch, pubsub)
}

func myRoutine(ch, pubsub) {
  for {
    item := <- ch // line 1
    pubsub.PublishWithTags(...) // line 2
  }
}

The example above is problematic because between line 1 and line 2, another goroutine somewhere else may have published a million other events, and by the time we run line 2, we may be blocking. It helps that ch's capacity is larger, but it's still not correct because ch may get full depending on goroutine racing.

I believe other pubsub solves this with persistent storage.

I don't know a good way to catch these errors in general, but forcing pubsub's internal channel to have a capacity of 0 will reveal bugs sooner, if the subscriber also has a 0 or small capacity ch to pull from.

So currently we have pubsub with an elastic inner capacity (which I've tried to prove is actually bad), which somewhat (but not completely) compensates by having subscriber ch be flexible. But I think this bad in that it's prone to being used incorrectly resulting in buggy software.

An alternative is to have a synchronous pubsub/event system that only provides the caller with a 0-capacity ch to receive from, or a callback function (I prefer callback functions when they suffice, I'll explain more later here: https://github.com/tendermint/internal/issues/36). AND, we create something on the client connection side (i.e. outside the pubsub module) that does the following:

// NOTE: does not block.
func (c client) mustSendMessage(msg interface{}) {
  select {
    case c.egressCh <- msg:
    case <- c.quitCh: c.close()
    default:  c.closeWithError()
  }
}
@jaekwon
Copy link
Contributor Author

jaekwon commented Dec 11, 2017

Suggested solution:

  • Make pubsub capacity 0 for testing, otherwise 0 by default in config.toml.
  • Make subscriber pass pubsub a function. This function must not block. The function is responsible for dealing with buffer issues or whatever.

The general rule here is that pubsub is blocking and synchronous, while the subscriber must provide non-blocking callback functions.

Why not provide or require the subscriber to pass in a 0 capacity channel?

That makes pubsub handle blocking channels. Then what? Depending on the subscriber, failure to deliver a message may or may not be fatal. The subscriber should deal with it, not pubsub.

What's the benefit again?

To reveal deadlock issues sooner, e.g. if in the example above, if ch had a capacity of 0 (or if the API required the subscriber to provide a callback function) it would immediately deadlock.

Should pubsub even use an internal channel? Why not use mtxs?

Nothing wrong with mutexes, but IFF the code is correct, then you may be able to increase performance by increasing the pubsub internal channel capacity.

@melekes
Copy link
Contributor

melekes commented Dec 11, 2017

Make pubsub capacity 0 for testing, otherwise 0 by default in config.toml

👍

I prefer channels to pass state between goroutines, but I get your point. Actually, believe it or not I was going to change pubsub interface to return ch with 0 cap here #893.

That makes pubsub handle blocking channels.

not necessary. as with callbacks!, we can make it a contract that every reader must be non-blocking. the code you wrote:

// NOTE: does not block.
func (c client) mustSendMessage(msg interface{}) {
  select {
    case c.egressCh <- msg:
    case <- c.quitCh: c.close()
    default:  c.closeWithError()
  }
}

@jaekwon
Copy link
Contributor Author

jaekwon commented Dec 15, 2017

Actually, believe it or not I was going to change pubsub interface to return ch with 0 cap here #893.

Ok, great. :)

I prefer channels to pass state between goroutines

You still can, (e.g. mustSendMessage uses egressCh). You're not forced to start a goroutine though, which I think is nice.

we can make it a contract that every reader must be non-blocking

I don't think so. You can't always assume that you can immediately send onto a channel, because the client may be slow. So if you do the code snippet above, the pubsub clients would randomly be closed with error.

In order to give flexibility to the client (i.e. the may or may not be fatal part), you end up passing this information into the client in order to get guidance on whether to close the connection with error or not.

@melekes
Copy link
Contributor

melekes commented Dec 15, 2017

I don't think so. You can't always assume that you can immediately send onto a channel, because the client may be slow. So if you do the code snippet above, the pubsub clients would randomly be closed with error.

client may be slow in case of callbacks too! that does not change anything. But I see the benefit of callbacks where you are free to use whatever you want (channels, mutexes). Anyway, callbacks have some problems too https://github.com/tendermint/internal/issues/36#issuecomment-350960775

@greg-szabo greg-szabo added this to the launch milestone Feb 19, 2018
@ebuchman ebuchman added T:bug Type Bug (Confirmed) T:security Type: Security (specify priority) labels Apr 27, 2018
@ebuchman
Copy link
Contributor

ebuchman commented Jun 5, 2018

We no longer use the async pubsub for internal events, just for the RPC.

Can we close this ?

@xla
Copy link
Contributor

xla commented Jun 5, 2018

@ebuchman Can we create a follow-up issue to remove synchronous pubsub for internal communication?

@ebuchman
Copy link
Contributor

ebuchman commented Jun 5, 2018

Done #1692

@xla
Copy link
Contributor

xla commented Jun 5, 2018

Sweet, let's close and re-open if the issue surfaces again.

@xla xla closed this as completed Jun 5, 2018
@ebuchman
Copy link
Contributor

ebuchman commented Jun 6, 2018

I think this is a design issue rather than a particular bug. Let's keep it open for now to mark the discussion

@ebuchman ebuchman reopened this Jun 6, 2018
@melekes
Copy link
Contributor

melekes commented Jun 6, 2018

Even if we decide to stay with channels, we still need to change a few things.

  • Make pubsub capacity 0 for testing, otherwise 0 by default in config.toml.

melekes added a commit that referenced this issue Jun 19, 2018
Refs #951

Jae: I don't know a good way to catch these errors in general, but
forcing pubsub's internal channel to have a capacity of 0 will reveal
bugs sooner, if the subscriber also has a 0 or small capacity ch to pull
from.
melekes added a commit that referenced this issue Oct 2, 2018
melekes added a commit that referenced this issue Oct 4, 2018
melekes added a commit that referenced this issue Oct 4, 2018
melekes added a commit that referenced this issue Nov 1, 2018
melekes added a commit that referenced this issue Nov 2, 2018
* pubsub adr

Refs #951, #1879, #1880

* highlight question

* fix typos after Ismail's review
@ebuchman ebuchman modified the milestones: v1.0, v0.30.0 Jan 13, 2019
@melekes melekes self-assigned this Jan 29, 2019
@melekes melekes mentioned this issue Feb 23, 2019
4 tasks
@ebuchman
Copy link
Contributor

ebuchman commented Feb 23, 2019

Design addressed in #2532 and implementation in #3227

The pubsub now differentiates between buffered and unbuffered subscriptions and never blocks on the former (if the buffer is full, it kills the subscription).

firelizzard18 pushed a commit to AccumulateNetwork/tendermint that referenced this issue Nov 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T:bug Type Bug (Confirmed) T:security Type: Security (specify priority)
Projects
None yet
Development

No branches or pull requests

5 participants