
Deadlock in raft append_entries fiber #3528

Closed
ztlpn opened this issue Jan 18, 2022 · 3 comments · Fixed by #3537
Labels: area/raft, kind/bug

ztlpn (Contributor) commented Jan 18, 2022

The following deadlock can happen when a partition has at least one persisted_stm (i.e. when transactions or shadow indexing are enabled) and log eviction runs concurrently with the creation of a new segment during append:

fiber1:

  • before log prefix truncation, the log_eviction_stm fiber waits for the persisted_stms to ensure that snapshots exist at least at the truncation offset
  • before creating those snapshots, the state machines wait until records up to the truncation offset have been applied
  • the state machine apply fibers in turn wait for raft committed-index notifications

fiber2:

  • append_entries fiber appends new entries to the log
  • a log roll happens and a new segment is created
  • persisted_stm::make_snapshot() is called
  • it tries to take persisted_stm::_op_lock

The deadlock happens when persisted_stm::_op_lock is already held by persisted_stm::ensure_snapshot_exists, which is waiting for notifications that never arrive (see the sketch below).
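
A minimal stand-alone sketch of that cycle, using std::thread and a plain mutex/semaphore in place of Seastar fibers, persisted_stm::_op_lock, and the "applied up to offset" notification (the names below only mirror the real ones; this is not the Redpanda code):

```cpp
#include <mutex>
#include <semaphore>
#include <thread>

std::mutex op_lock;                  // stands in for persisted_stm::_op_lock
std::binary_semaphore applied{0};    // released when apply reaches the truncation offset

// fiber1: log_eviction_stm -> persisted_stm::ensure_snapshot_exists
void ensure_snapshot_exists() {
    std::lock_guard guard(op_lock);  // takes _op_lock and keeps holding it...
    applied.acquire();               // ...while waiting for the apply fiber
    // snapshot would be written here
}

// fiber2: append_entries -> segment roll -> persisted_stm::make_snapshot
void make_snapshot() {
    std::lock_guard guard(op_lock);  // blocks: _op_lock is held by fiber1
    // snapshot would be written here
}

int main() {
    std::jthread f1(ensure_snapshot_exists);
    std::jthread f2(make_snapshot);
    // The apply fiber would release `applied` only after append_entries makes
    // progress, but append_entries is stuck in make_snapshot(), so this
    // program hangs forever -- the deadlock described above.
}
```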

There are several ways "applied to state machine" or "committed index increased" notifications can be lost:

  • there is no notification in do_hydrate_snapshot if _commit_index increases due to installed snapshot
  • there is no notification if the state machine offset increases due to state_machine::handle_eviction
  • a _commit_index notification may get lost in consensus::maybe_update_follower_commit_idx if _flushed_offset hasn't yet reached the new _commit_index (we do call consensus::refresh_commit_index before persisted_stm::ensure_snapshot_exists, but it is a no-op for followers); a simplified sketch of this check follows the list
  • maybe others
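
A hedged sketch of the follower-side check described in the third bullet; the field and method names only approximate the real consensus code and are not the actual implementation:

```cpp
#include <algorithm>
#include <cstdint>

// Approximation of the follower-side commit index update described above;
// not the actual consensus::maybe_update_follower_commit_idx.
struct follower {
    int64_t _commit_index = 0;
    int64_t _flushed_offset = 0;

    // Called when the leader advertises a higher committed index.
    void maybe_update_follower_commit_idx(int64_t leader_commit_idx) {
        // The follower only advances the commit index up to what it has
        // already flushed to disk.
        int64_t new_commit = std::min(leader_commit_idx, _flushed_offset);
        if (new_commit > _commit_index) {
            _commit_index = new_commit;
            notify_commit_index_waiters();  // the only place waiters are woken
        }
        // If _flushed_offset still lags, nothing is notified here. Normally a
        // later append or heartbeat retries, but that requires the
        // append_entries fiber to be making progress.
    }

    void notify_commit_index_waiters() { /* wake fibers waiting on _commit_index */ }
};
```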

The problem is easily reproduced with the following setup:

  • create a 3-node redpanda cluster
  • create a topic with small max segment size and small retention.bytes setting: rpk topic create foo -p 1 -r 3 -c 'retention.bytes=1500000' -c 'segment.bytes=1000000'
  • apply produce load
  • start/stop a node several times

We could fix the missed notifications one by one, but a more robust fix is to remove persisted_stm::make_snapshot from the critical append path. Snapshotting there is strictly an optimization; it can instead be done in the state_machine::apply fiber or in a dedicated fiber once the state machine has processed enough records.
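
One possible shape of that change, sketched under assumptions (spawn_background, _snapshot_interval, and on_applied are hypothetical names, and this is not the code from the eventual fix): snapshotting is triggered from the apply path and runs in its own fiber instead of being called synchronously on segment roll.

```cpp
#include <cstdint>
#include <functional>
#include <thread>

// Hypothetical stand-in for launching work in its own fiber (in Redpanda this
// would be a Seastar background task, not a raw detached thread).
void spawn_background(std::function<void()> work) {
    std::thread(std::move(work)).detach();
}

struct persisted_stm_like {
    int64_t _last_snapshot_offset = 0;
    int64_t _snapshot_interval = 1 << 20;  // hypothetical trigger threshold
    bool _snapshot_in_progress = false;

    void make_snapshot() { /* existing snapshot logic, now off the append path */ }

    // Called from the apply fiber after a batch of records has been applied.
    void on_applied(int64_t applied_offset) {
        if (!_snapshot_in_progress
            && applied_offset - _last_snapshot_offset >= _snapshot_interval) {
            _snapshot_in_progress = true;
            // Snapshotting is only an optimization, so it runs in its own
            // fiber; append_entries and apply keep making progress and keep
            // delivering commit-index / applied notifications.
            spawn_background([this, applied_offset] {
                make_snapshot();
                _last_snapshot_offset = applied_offset;
                _snapshot_in_progress = false;
            });
        }
    }
};
```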

cc @rystsov

ztlpn added the kind/bug and area/raft labels on Jan 18, 2022
mmaslankaprv (Member) commented

One of the points about lost committed-index update notifications doesn't seem right, as the notification will be dispatched during a subsequent heartbeat, i.e.:

> _commit_index notification may get lost in consensus::maybe_update_follower_commit_idx if _flushed_offset hasn't yet reached the new _commit_index (we do call consensus::refresh_commit_index before persisted_stm::ensure_snapshot_exists but it is a no-op for followers).

ztlpn (Contributor, Author) commented Jan 18, 2022

> One of the points in lost committed-index update notifications doesn't seem right, as the notification will be dispatched during a subsequent heartbeat

This won't happen if the append_entries fiber is blocked (and thus unable to process heartbeats) :( That's why I think it is important to remove the persisted_stm::make_snapshot calls from the append_entries fiber, so that it can keep making progress and dispatch the notifications again even if they get lost at some point.

mmaslankaprv self-assigned this on Jan 18, 2022
mmaslankaprv (Member) commented

I think I was able to fix this; I am going to write a ducktape test and send a PR tomorrow.
