stm/dist_kv_stm: fix locking sequence for remove_all #17402
Conversation
Nice catch, thanks for fixing this
new failures in https://buildkite.com/redpanda/redpanda/builds/46822#018e7c71-1ba7-4479-823c-5dd8807187f5:
src/v/cluster/distributed_kv_stm.h (outdated diff):

```cpp
units.return_all();
auto read_units = _snapshot_lock.attempt_read_lock();
vassert(
  read_units,
```
I was under the impression that this wouldn't work, because the underlying semaphore attempts to prevent starvation by forcing new acquirers to queue up behind other waiters?
Yeah, you're right, I think (nice catch). A sequence of events where this could fail: u = write_lock (current holder), queue = [write_lock, ...]; all the units deposited by the current holder are grabbed by the first waiter, so the subsequent attempt_read_lock() fails.
Let me fix this.
Currently remove_all takes a write lock for its entire duration, which prevents replicate_and_wait() from making progress because it needs a read lock. Fix this by making remove_all best effort, so it only needs to grab a read lock.
Don't forget to update the cover letter
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46837#018e7d47-6be0-47d0-8e4e-7628af06898a
/backport v23.3.x
Failed to create a backport PR to v23.3.x branch. I tried:
This code doesn't exist in 23.3.x, so we don't need a backport IIRC
lgtm