
Deadlock spotted in v0.0.80 #108

Closed · GiedriusS opened this issue Oct 24, 2022 · 17 comments

@GiedriusS

I am trying Rueidis before updating my PR here: thanos-io/thanos#5593

I have noticed that under heavy load, some goroutines get stuck (deadlock) in Rueidis code:

[image: goroutine count graph]

I would expect the goroutines count to go down but it stays the same.

Here's the goroutine dump from pprof:
rueidis_dump.txt

My guess is that under heavy load it tries to switch to a replica in Redis Cluster and then fails to time out some operation somewhere? I don't know how to reproduce this ATM, but maybe you'll be able to get some ideas from the dump?

@rueian (Collaborator) commented Oct 24, 2022

Hi @GiedriusS, thank you for providing the dump. I will try to find out the cause ASAP.

@rueian (Collaborator) commented Oct 24, 2022

Hi @GiedriusS, it looks like 4 DoCache calls were stuck, probably due to a deadlock or a missing response from a redis node. To narrow down the possibilities, did any of your cluster nodes fail over before the stuck period?

My guess is that under heavy load it tries to switch to a replica in Redis Cluster and then fails to time out some operation somewhere?

rueidis should not switch to a replica proactively.

@rueian (Collaborator) commented Oct 26, 2022

Hi @GiedriusS, I am still trying hard to find the cause, but haven't found it yet. Could you also provide some information about your setup and how you generate the heavy load?

@GiedriusS (Author)

Yeah, this happens when the network usage gets saturated, so the connection with the Redis servers is probably lost at that point. I'll try to make a reproducible case tomorrow or when time permits. It should probably be enough to make lots of calls to Redis and then introduce network errors to reproduce this 🤔
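
Roughly what I have in mind, as a sketch (not the actual Thanos code; the address, key names, and goroutine counts here are made up): hammer the cluster with concurrent DoCache calls and cut the network underneath it while the load is running.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"github.com/rueian/rueidis"
)

func main() {
	client, err := rueidis.NewClient(rueidis.ClientOption{
		InitAddress: []string{"127.0.0.1:7000"}, // any node of the test cluster
	})
	if err != nil {
		panic(err)
	}
	defer client.Close()

	var wg sync.WaitGroup
	for i := 0; i < 200; i++ { // lots of concurrent callers
		wg.Add(1)
		go func(worker int) {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				ctx, cancel := context.WithTimeout(context.Background(), time.Second)
				key := fmt.Sprintf("bench-%d-%d", worker, j%1024)
				// client-side-caching reads, similar to what the Thanos cache client does
				if err := client.DoCache(ctx, client.B().Get().Key(key).Cache(), time.Minute).Error(); err != nil {
					_ = err // errors are expected once the network is disturbed; keep going
				}
				cancel()
			}
		}(i)
	}
	// While this is running, drop or saturate the network to the cluster
	// (e.g. with a TCP proxy or a firewall rule), then restore it and watch
	// whether the goroutine count ever comes back down.
	wg.Wait()
}
```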

@rueian (Collaborator) commented Oct 30, 2022

Hi @GiedriusS,

Thank you for your goroutine dump. I did find the deadlock in it.

In the dump, there are some stack traces that helped me find the deadlock. The first one is:

1 @ 0x43cd76 0x406d9b 0x406898 0x1953588 0x46e2a1
#	0x1953587	github.com/rueian/rueidis.(*pipe)._background+0x227	/home/giedrius/go/pkg/mod/github.com/rueian/rueidis@v0.0.80/pipe.go:231

It indicates that _backgroundRead had exited and the pipe was waiting for _backgroundWrite to exit before cleaning up all pending goroutines, but _backgroundWrite was stuck.
https://github.com/rueian/rueidis/blob/defba736b89d5ce25cc3a1a21986d00316ecd11f/pipe.go#L218-L231

Since the underlying connection had already been closed after _backgroundRead exited, _backgroundWrite could only be stuck at p.queue.WaitForWrite(), waiting for the next write:
https://github.com/rueian/rueidis/blob/defba736b89d5ce25cc3a1a21986d00316ecd11f/pipe.go#L291

Normally, it wouldn't get stuck here after _backgroundRead exited, because a QUIT command is injected right after _backgroundRead returns.
https://github.com/rueian/rueidis/blob/defba736b89d5ce25cc3a1a21986d00316ecd11f/pipe.go#L227

However, in your goroutine dump, the injected QUIT was also stuck:

1 @ 0x43cd76 0x46a36c 0x46a34c 0x47698c 0x19600a5 0x1953a22 0x46e2a1
#	0x46a34b	sync.runtime_notifyListWait+0x12b				/usr/lib/go-1.19/src/runtime/sema.go:517
#	0x47698b	sync.(*Cond).Wait+0x8b						/usr/lib/go-1.19/src/sync/cond.go:70
#	0x19600a4	github.com/rueian/rueidis.(*ring).PutOne+0x84			/home/giedrius/go/pkg/mod/github.com/rueian/rueidis@v0.0.80/ring.go:58
#	0x1953a21	github.com/rueian/rueidis.(*pipe)._background.func4+0x41	/home/giedrius/go/pkg/mod/github.com/rueian/rueidis@v0.0.80/pipe.go:227

This indicates that the ring was full. Releasing the ring was exactly what the pipe was about to do next, but it was stuck waiting for _backgroundWrite to exit, hence the deadlock. This is also why many DoCache calls in the dump were stuck at sync.(*Cond).Wait or rueidis.(*entry).Wait.
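
To make the circular wait easier to see, here is a tiny standalone sketch that mimics the shape of the problem. It is not the real pipe/ring code, just an illustration with made-up, simplified types:

```go
// A much simplified picture of the circular wait (not the real pipe/ring code):
//   cleanup  -> waits for the writer to exit, then would free the ring slots
//   writer   -> has already sent every queued command, waits for the next one
//   injector -> wants to queue QUIT, but every slot is still occupied
// None of the three can make progress; `go run` reports
// "fatal error: all goroutines are asleep - deadlock!".
package main

import "sync"

type fakeRing struct {
	mu    sync.Mutex
	cond  *sync.Cond
	slots []string // commands queued but not yet freed by a response
	cap   int
	write int // index of the next command the writer will send
}

func newFakeRing(capacity int) *fakeRing {
	r := &fakeRing{slots: make([]string, 0, capacity), cap: capacity}
	r.cond = sync.NewCond(&r.mu)
	return r
}

// PutOne blocks while the ring is full -- this is where the injected QUIT hangs.
func (r *fakeRing) PutOne(cmd string) {
	r.mu.Lock()
	for len(r.slots) == r.cap {
		r.cond.Wait()
	}
	r.slots = append(r.slots, cmd)
	r.cond.Broadcast()
	r.mu.Unlock()
}

// WaitForWrite blocks until there is an unsent command -- this is where the
// writer hangs once everything queued has already been sent.
func (r *fakeRing) WaitForWrite() string {
	r.mu.Lock()
	for r.write == len(r.slots) {
		r.cond.Wait()
	}
	cmd := r.slots[r.write]
	r.write++
	r.mu.Unlock()
	return cmd
}

func main() {
	r := newFakeRing(2)
	r.PutOne("GET a")
	r.PutOne("GET b")
	_ = r.WaitForWrite() // sent, but the responses never arrive,
	_ = r.WaitForWrite() // because the reader has already exited

	var writer sync.WaitGroup
	writer.Add(1)
	go func() { defer writer.Done(); _ = r.WaitForWrite() }() // _backgroundWrite
	go r.PutOne("QUIT")                                       // injected QUIT
	writer.Wait() // cleanup: stuck here, so the slots are never freed
}
```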

PR #113 fixes this deadlock by releasing the ring right after _backgroundRead exits, so that _backgroundWrite won't get stuck at p.queue.WaitForWrite().

There is also a new test case that reproduces the deadlock condition. The old code doesn't pass this test:
https://github.com/rueian/rueidis/blob/b26c3bada71d8eac685e38f10219acc71e59548b/pipe_test.go#L2448-L2473

@rueian (Collaborator) commented Oct 30, 2022

Hi @GiedriusS,

The fix has been released in v0.0.82. Please check whether the same issue or any other issue still occurs.

@GiedriusS (Author)

Thank you for the detailed report and the amazing library! Let me update it and give it a try. Sorry for not helping you out much with the code; I'm a bit preoccupied with other projects :(

@GiedriusS (Author)

Couldn't reproduce it so far with v0.0.82, even during periods of really high load. I'll close this in a few days if nothing comes up. Thank you again 💪

@rueian (Collaborator) commented Nov 4, 2022

Hi @GiedriusS,

Thank you very much for testing rueidis! Please feel free to report your findings. ❤️

@GiedriusS (Author)

Yeah, it's not reproducible anymore. Thanks for your fix ❤️

@GiedriusS (Author) commented Nov 11, 2022

Somehow it still happens :(
[image: goroutine count graph]

goroutine dump 0.0.82.txt

48 (the count from the graphic) goroutines are hanging here:

48 @ 0x43cd76 0x44dcfe 0x44dcd5 0x46a065 0x47a112 0x990be7 0x1a6ae26 0x1462643 0x1a69a05 0x13fbe50 0x128d02c 0x142333a 0x128d02c 0x1460cfb 0x142333a 0x128d02c 0x142333a 0x1ce8f69 0x142333a 0x1cedd5e 0x142333a 0x14231de 0xdfd23b 0xdfea4a 0xdf7fb8 0x46e2a1
#	0x46a064	sync.runtime_Semacquire+0x24											/usr/lib/go-1.19/src/runtime/sema.go:62
#	0x47a111	sync.(*WaitGroup).Wait+0x51											/usr/lib/go-1.19/src/sync/waitgroup.go:139
#	0x990be6	golang.org/x/sync/errgroup.(*Group).Wait+0x26									/home/giedrius/go/pkg/mod/golang.org/x/sync@v0.0.0-20220907140024-f12130a52804/errgroup/errgroup.go:53

Hanging here, I think:

119 @ 0x43cd76 0x44dcfe 0x44dcd5 0x46a065 0x47a112 0x196c0c5 0x196bef2 0x194df67 0x194d9e8 0x1977ce7 0x19aeecb 0x1a76e7a 0x1a6685b 0x1a6be45 0x990d84 0x46e2a1
#	0x46a064	sync.runtime_Semacquire+0x24								/usr/lib/go-1.19/src/runtime/sema.go:62
#	0x47a111	sync.(*WaitGroup).Wait+0x51								/usr/lib/go-1.19/src/sync/waitgroup.go:139
#	0x196c0c4	github.com/rueian/rueidis/internal/util.closeThenParallel[...]+0x184			/home/giedrius/go/pkg/mod/github.com/rueian/rueidis@v0.0.82/internal/util/parallel.go:44
#	0x196bef1	github.com/rueian/rueidis/internal/util.ParallelVals[...]+0xf1				/home/giedrius/go/pkg/mod/github.com/rueian/rueidis@v0.0.82/internal/util/parallel.go:21

@GiedriusS reopened this Nov 11, 2022
@rueian (Collaborator) commented Nov 11, 2022

Hi @GiedriusS,

Thank you for keeping at it. While I am still investigating your latest dump, could you also try v0.0.83? There is a goroutine leak fix in v0.0.83.

@rueian (Collaborator) commented Nov 11, 2022

Hey @GiedriusS,

Thank you for your goroutine dump.

Hanging here, I think:

119 @ 0x43cd76 0x44dcfe 0x44dcd5 0x46a065 0x47a112 0x196c0c5 0x196bef2 0x194df67 0x194d9e8 0x1977ce7 0x19aeecb 0x1a76e7a 0x1a6685b 0x1a6be45 0x990d84 0x46e2a1
#	0x46a064	sync.runtime_Semacquire+0x24								/usr/lib/go-1.19/src/runtime/sema.go:62
#	0x47a111	sync.(*WaitGroup).Wait+0x51								/usr/lib/go-1.19/src/sync/waitgroup.go:139
#	0x196c0c4	github.com/rueian/rueidis/internal/util.closeThenParallel[...]+0x184			/home/giedrius/go/pkg/mod/github.com/rueian/rueidis@v0.0.82/internal/util/parallel.go:44
#	0x196bef1	github.com/rueian/rueidis/internal/util.ParallelVals[...]+0xf1				/home/giedrius/go/pkg/mod/github.com/rueian/rueidis@v0.0.82/internal/util/parallel.go:21

From the dump, the above goroutines were actually all hanging on rueidis.(*entry).Wait, indicating that *entry values were leaking from the lru cache. I will try to figure out the cause ASAP.

@rueian (Collaborator) commented Nov 13, 2022

Hi @GiedriusS,

While I am still investigating how an *entry could leak, as well as other possible causes of *entry.Wait getting stuck, would you like to try v0.0.84? *entry.Wait now respects ctx cancellation, so it can exit if the ctx is done.
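
The idea is roughly the following. This is a simplified sketch of the pattern, not the exact rueidis code (the value type is simplified to a string here): the entry is resolved by closing a channel, and waiters select on that channel together with ctx.Done().

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// entry is resolved by closing a channel; waiters select on that channel
// together with ctx.Done(), so a canceled or timed-out ctx lets them bail out
// instead of blocking forever on an entry that will never be filled.
type entry struct {
	ch  chan struct{}
	val string
	err error
}

func newEntry() *entry { return &entry{ch: make(chan struct{})} }

func (e *entry) set(val string, err error) {
	e.val, e.err = val, err
	close(e.ch)
}

func (e *entry) Wait(ctx context.Context) (string, error) {
	select {
	case <-e.ch:
		return e.val, e.err
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func main() {
	e := newEntry()
	go func() {
		time.Sleep(time.Second) // simulate a response that arrives too late (or never)
		e.set("value", nil)
	}()

	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	if _, err := e.Wait(ctx); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("gave up waiting:", err)
	}
}
```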

@GiedriusS (Author)

Let me try and get back to you 👍

@rueian (Collaborator) commented Nov 20, 2022

Hi @GiedriusS,

I have fixed the leaking *entry in commit 069b77a.

It turns out that this was a naive bug in a common sync.RWMutex scenario that I had missed during the development of the lru cache. It looks like this:

func (c *lru) GetEntry(key, cmd string) (entry *entry) {
	c.mu.RLock()
	store, ok := c.store[key]
	if ok {
		entry, ok = store.cache[cmd]
	}
	c.mu.RUnlock()
	if entry != nil {
		return entry
	}
	c.mu.Lock()
	if store == nil { // <-- this check is the bug, the store might have already been evicted
		if store, ok = c.store[key]; !ok {
			store = newStore()
			c.store[key] = store
		}
	}
	if entry, ok = store.cache[cmd]; !ok {
		entry = newEntry()
		store.cache[cmd] = entry
	}
	c.mu.Unlock()
	return entry
}

It tries to shortcut the lookup with the cheap mu.RLock() first and only takes mu.Lock() when the entry does not exist.

You have probably noticed that the if store == nil check under mu.Lock() is risky. It is indeed the cause of the leaking *entry: it can cause a newEntry() to be stored into an already-evicted store, and then entry.Wait() will never be released.

The if store == nil check was valid in the early days of development because I didn't evict the store until commit ec19a5e.
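
For reference, the general shape of the fix is the usual double-checked pattern: after taking mu.Lock(), look the store up again instead of trusting the pointer that was read under mu.RLock(). A sketch reusing the type names from the snippet above (not necessarily the exact change in 069b77a):

```go
func (c *lru) GetEntry(key, cmd string) *entry {
	// Fast path: read-only lookup under the cheap RLock.
	c.mu.RLock()
	store := c.store[key]
	var e *entry
	if store != nil {
		e = store.cache[cmd]
	}
	c.mu.RUnlock()
	if e != nil {
		return e
	}
	// Slow path: re-check everything under the write lock. The store observed
	// above may have been evicted (or created) between RUnlock and Lock, so it
	// must be looked up again here rather than reused.
	c.mu.Lock()
	store, ok := c.store[key]
	if !ok {
		store = newStore()
		c.store[key] = store
	}
	e, ok = store.cache[cmd]
	if !ok {
		e = newEntry()
		store.cache[cmd] = e
	}
	c.mu.Unlock()
	return e
}
```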

The fix has been released in v0.0.85. Thank you for being patient, and please let me know if you find any further problems.

@GiedriusS (Author)

Coming back to this - cannot reproduce this anymore. Thx for your work 🍻

appleboy pushed a commit to appleboy/rueidis that referenced this issue Oct 15, 2023
Return custom error implementations as pointers