Fixed background apply fiber race condition in raft::state_machine_manager
#16850
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45620#018e0a1b-7973-4376-88d9-999ed8bd5a42
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45628#018e0ab3-5550-45bf-8afe-ae8988f35806
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45683#018e0e08-d13b-4646-900f-a29b52690f21
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45683#018e0e08-d138-4f1e-92b8-f357023a4e6c
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45683#018e0e04-010d-4da6-9c4f-79be628cff35
new failures in https://buildkite.com/redpanda/redpanda/builds/45620#018e0a1b-796c-4865-ba57-5fc00b08a5e8:
new failures in https://buildkite.com/redpanda/redpanda/builds/45628#018e0ab3-5547-482a-9e1a-35279f46e5b1:
`raft::state_machine_manager` uses a background apply fiber to individually apply batches to state machines that are behind the main apply fiber. While the background apply fiber is active, it reads and applies batches up to the current committed offset, and it holds the mutex. While the mutex is held, the main apply fiber does not consider the stm up to date.

The code was prone to a very rare race condition because the background apply finished in one continuation but the units were released in a subsequent `finally` block. Normally this approach is harmless, as the semaphore is waited for and is signaled after the `finally` continuation executes. In `state_machine_manager` we only check whether the semaphore is available, which makes the solution vulnerable to timing.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
Added a state machine manager test that waits for batches to be applied after each replicate. The test is designed to detect a situation in which the background apply fiber finishes but the managed stm is still behind the others.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
2aceef5 to a55d01a
/ci-repeat 1
/backport v23.3.x
`raft::state_machine_manager` uses a background apply fiber to individually apply batches to state machines that are behind the main apply fiber. While the background apply fiber is active, it reads and applies batches up to the current committed offset, and it holds the mutex. While the mutex is held, the main apply fiber does not consider the stm up to date.

The code was prone to a very rare race condition because the background apply finished in one continuation but the units were released in a subsequent `finally` block. Normally this approach is harmless, as the semaphore is waited for and is signalled after the `finally` continuation executes. In `state_machine_manager` we only check whether the semaphore is available, which makes the solution vulnerable to timing.
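
To make the timing window concrete, here is a minimal single-threaded sketch of the race described above. It does not use Seastar or any Redpanda code: `probe_only_mutex`, `sim_result`, `simulate`, and the offsets are all hypothetical names and values chosen for illustration, and the "fixed" branch (releasing the lock in the same continuation that finishes the background apply) is an assumption about the shape of a fix, not a description of the actual change in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for a mutex that callers only probe, never wait on.
struct probe_only_mutex {
    bool held = false;
    bool ready() const { return !held; }
};

struct sim_result {
    int64_t main_offset;    // offset of the stms driven by the main apply fiber
    int64_t lagging_offset; // offset of the stm that was catching up in the background
    bool background_active; // true if the background lock is still held at the end
};

// Runs two "fibers" on a FIFO run queue that models continuation ordering.
// release_with_completion == false models releasing units in a later
// `finally`-style continuation; true models releasing them immediately.
sim_result simulate(bool release_with_completion) {
    std::deque<std::function<void()>> run_queue;
    probe_only_mutex bg_lock;
    int64_t committed = 10;
    int64_t main_offset = 10;
    int64_t lagging_offset = 5;

    // Background apply fiber: hold the lock and catch up to the committed offset.
    bg_lock.held = true;
    run_queue.push_back([&] {
        lagging_offset = committed; // background apply is finished here
        if (release_with_completion) {
            bg_lock.held = false;
        } else {
            // Units are released one continuation later.
            run_queue.push_back([&] { bg_lock.held = false; });
        }
    });

    // Main apply fiber: a new batch commits; it is applied only to stms
    // whose background lock is free at this instant (probe, no wait).
    run_queue.push_back([&] {
        committed = 11;
        main_offset = committed;
        if (bg_lock.ready()) {
            lagging_offset = committed;
        }
    });

    while (!run_queue.empty()) {
        auto task = std::move(run_queue.front());
        run_queue.pop_front();
        task();
    }
    return {main_offset, lagging_offset, bg_lock.held};
}
```

In this model, `simulate(false)` ends with the lagging stm one offset behind while no background fiber is active, which is the situation the added test is designed to detect; `simulate(true)` ends with both offsets equal.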
Backports Required
Release Notes