The following deadlock can happen if a partition has at least one `persisted_stm` (i.e. when transactions or shadow indexing is enabled) and log eviction happens concurrently with creation of a new segment during append:
fiber1:

1. Before log prefix truncation, the `log_eviction_stm` fiber waits for the `persisted_stm`s to ensure that snapshots exist at least at the truncation offset.
2. Before creating snapshots, the state machines wait until records up to the truncation offset get applied.
3. The state machine apply fibers in turn wait for raft committed index notifications.
fiber2:

1. The `append_entries` fiber appends new entries to the log.
2. A log roll happens and a new segment gets created.
3. `persisted_stm::make_snapshot()` is called.
4. It tries to take `persisted_stm::_op_lock`.
The deadlock happens when `persisted_stm::_op_lock` is held by `persisted_stm::ensure_snapshot_exists`, which waits for notifications that never arrive.
There are several ways "applied to state machine" or "committed index increased" notifications can be lost:
- There is no notification in `do_hydrate_snapshot` if `_commit_index` increases due to an installed snapshot.
- There is no notification if the state machine offset increases due to `state_machine::handle_eviction`.
- A `_commit_index` notification may get lost in `consensus::maybe_update_follower_commit_idx` if `_flushed_offset` hasn't yet reached the new `_commit_index` (we do call `consensus::refresh_commit_index` before `persisted_stm::ensure_snapshot_exists`, but it is a no-op for followers).
- Possibly others.
The problem is easily reproduced with the following setup:
1. Create a 3-node redpanda cluster.
2. Create a topic with a small max segment size and a small `retention.bytes` setting: `rpk topic create foo -p 1 -r 3 -c 'retention.bytes=1500000' -c 'segment.bytes=1000000'`
3. Apply produce load.
4. Start/stop a node several times.
To fix the problem we could patch the missed notifications one by one, but a more robust fix would be to remove `persisted_stm::make_snapshot` from the critical append path. It is strictly an optimization and can be done either in the `state_machine::apply` fiber or in its own fiber after the state machine has processed enough records.

cc @rystsov
One of the points about lost committed index update notifications doesn't seem right, as the notification will be dispatched during a subsequent heartbeat, i.e.:
> A `_commit_index` notification may get lost in `consensus::maybe_update_follower_commit_idx` if `_flushed_offset` hasn't yet reached the new `_commit_index` (we do call `consensus::refresh_commit_index` before `persisted_stm::ensure_snapshot_exists`, but it is a no-op for followers).
This won't happen if the `append_entries` fiber is blocked (and thus unable to process heartbeats) :( That's the reason I think it is important to remove the `persisted_stm::make_snapshot` calls from the `append_entries` fiber, so that it can make progress and dispatch the notifications even if they get lost at some point.