fix(matching): leaked RLock in describe() blocks all writers indefinitely #10077
When buildIds contains an empty string and defaultQueue() returns nil, describe() returned errDefaultQueueNotInit without first releasing the RLock acquired at the start of the function, leaving the mutex permanently locked and blocking any concurrent writers.
Friendly ping @temporalio/oss-matching @temporalio/oss-foundations: this fixes a real deadlock.
It's not a real deadlock: all code paths that get there wait until the default queue is initialized. The check is there just to turn a panic into an error in case there's a logic bug. We should still have the RUnlock for consistency, but I don't think we need the test. If you remove the test from the PR, I can merge it.
…rtitionManager (pr comments)
@dnr Done! Thanks for the feedback!
When buildIds contains an empty string and defaultQueue() returns nil, describe() returned errDefaultQueueNotInit without first releasing the RLock acquired at the start of the function, leaving the mutex permanently locked and blocking any concurrent writers. I detected this bug using my own concurrency linter: https://github.com/sanbricio/goconcurrencylint
What changed?
Added missing versionedQueuesLock.RUnlock() in describe() before the early return when defaultQueue() returns nil.
Why?
The function acquires the RLock at the top but does not use defer; instead, it releases the lock manually at each exit point. One early-return path was missing the RUnlock: when buildIds contains "" and the default queue is not yet initialized, the function returned errDefaultQueueNotInit without releasing the lock, leaving the mutex permanently read-locked and blocking any concurrent writer.
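For illustration, here is a minimal sketch of the pattern described above. It is not the actual Temporal code: the `queueHolder` type, its fields, and the `describeLeaky`, `describeFixed`, and `writerCanProceed` helpers are hypothetical stand-ins. Only the names `versionedQueuesLock`, `defaultQueue`, and `errDefaultQueueNotInit` come from this PR, and the error message text is an assumption.

```go
package main

import (
	"errors"
	"sync"
)

// errDefaultQueueNotInit reuses the error name from the PR; the message is assumed.
var errDefaultQueueNotInit = errors.New("default queue not initialized")

// queueHolder is a hypothetical, heavily simplified stand-in for the real type.
type queueHolder struct {
	versionedQueuesLock sync.RWMutex
	defaultQ            *string // nil until the default queue is initialized
}

func (h *queueHolder) defaultQueue() *string { return h.defaultQ }

// describeLeaky mimics the pre-fix shape: manual unlocks, one exit path missed.
func (h *queueHolder) describeLeaky(buildIds []string) error {
	h.versionedQueuesLock.RLock()
	for _, b := range buildIds {
		if b == "" && h.defaultQueue() == nil {
			return errDefaultQueueNotInit // BUG: the read lock is still held
		}
	}
	h.versionedQueuesLock.RUnlock()
	return nil
}

// describeFixed adds the missing RUnlock before the early return.
func (h *queueHolder) describeFixed(buildIds []string) error {
	h.versionedQueuesLock.RLock()
	for _, b := range buildIds {
		if b == "" && h.defaultQueue() == nil {
			h.versionedQueuesLock.RUnlock()
			return errDefaultQueueNotInit
		}
	}
	h.versionedQueuesLock.RUnlock()
	return nil
}

// writerCanProceed reports whether a writer could acquire the lock right now.
// It uses TryLock (Go 1.18+) so the check never blocks.
func (h *queueHolder) writerCanProceed() bool {
	if h.versionedQueuesLock.TryLock() {
		h.versionedQueuesLock.Unlock()
		return true
	}
	return false
}
```

In this sketch, a call to `describeLeaky([]string{""})` on a holder with no default queue leaves the read lock held, so `writerCanProceed()` reports false afterwards, whereas the same call to `describeFixed` leaves the lock free. Whether the leaky path is reachable in the real code depends on callers waiting for the default queue to initialize, as discussed in the review comments.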
How did you test it?
Potential risks
No impact; the change only fixes a deadlock bug.