Fixed not applying snapshot for new stms #17112

Merged
merged 2 commits into redpanda-data:dev from fix-stm-manager on Mar 27, 2024

Conversation


@mmaslankaprv mmaslankaprv commented Mar 15, 2024

State machine manager manages the aggregated Raft snapshot for all the
state machines created on top of a single Raft instance. The managed
snapshot is a map containing individual snapshot data for each state
machine. When a new STM is created after the managed snapshot has
already been taken, `state_machine_base::apply_raft_snapshot` should
still be called even if no entry for that STM exists in the managed
snapshot map. This way the STM is informed that the log no longer
starts from offset 0.

Previously, when a snapshot entry was not present in the managed
snapshot map, we skipped calling `apply_snapshot` on the STM and didn't
advance its `_next` offset, which led to a stuck background apply fiber
loop.

Fixes: #17086
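
A minimal sketch of the behaviour described above, in plain C++ with stand-in types (the function name `apply_managed_snapshot`, the `std::string` payloads, and the empty-buffer call for a missing entry are illustrative assumptions, not the actual Redpanda implementation):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Stand-in for state_machine_base; the real code uses iobuf and model::offset.
struct stm {
    std::string name;
    int64_t last_applied = -1;
    int64_t next = 0;

    // Counterpart of state_machine_base::apply_raft_snapshot. `data` may be
    // empty when the aggregated snapshot has no entry for this STM.
    void apply_raft_snapshot(const std::string& data, int64_t last_offset) {
        (void)data; // a real STM would restore its state from `data`
        last_applied = last_offset;
        next = last_offset + 1; // the log no longer starts at offset 0
    }
};

// Sketch of the fixed manager behaviour: every STM that is behind the
// snapshot gets apply_raft_snapshot called, even when the snapshot map has
// no entry for it.
void apply_managed_snapshot(
  std::vector<stm>& machines,
  const std::map<std::string, std::string>& snapshot_map,
  int64_t last_offset) {
    for (auto& machine : machines) {
        if (machine.last_applied >= last_offset) {
            continue; // already caught up past the snapshot
        }
        auto it = snapshot_map.find(machine.name);
        // Before the fix a missing entry skipped the call entirely, leaving
        // `next` behind the new log start and wedging the apply fiber loop.
        machine.apply_raft_snapshot(
          it != snapshot_map.end() ? it->second : std::string{}, last_offset);
    }
}
```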

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

Bug Fixes

  • fixed enabling cloud storage in existing clusters

Review threads (resolved):
  • tests/rptest/tests/tiered_stoage_enable_test.py (3, outdated)
  • src/v/raft/state_machine_manager.cc
Signed-off-by: Michal Maslanka <michal@redpanda.com>
@mmaslankaprv
Member Author

/ci-repeat 1


if (stm_entry->stm->last_applied_offset() < last_offset) {
    if (it != snapshot.snapshot_map.end()) {
        co_await stm_entry->stm->apply_raft_snapshot(it->second);
Member


This looks like a use-after-free waiting to happen. `apply_raft_snapshot` takes an `iobuf&`, but if it `co_await`s internally, the snapshot map may change in the meantime, leaving the iterator (and the associated `iobuf`) invalid.

Member Author


The snapshot map is not changing here. Once deserialized, it is set in stone.
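
For context, a minimal plain-C++ illustration of the hazard the reviewer describes (hypothetical names, no coroutines; the `erase` stands in for "the map changes while the coroutine is suspended at a `co_await`"): a reference obtained through a map iterator stays valid only while the element stays in the map.

```cpp
#include <iostream>
#include <map>
#include <string>

std::map<std::string, std::string> snapshot_map;

// Stand-in for a suspending apply_raft_snapshot(iobuf&): if the map entry
// that `data` refers to is removed while we are "suspended", `data` dangles.
void apply_raft_snapshot(const std::string& data) {
    snapshot_map.erase("stm_a"); // simulated concurrent mutation
    std::cout << data << '\n';   // use-after-free if `data` was that entry
}

int main() {
    snapshot_map["stm_a"] = "snapshot bytes";
    auto it = snapshot_map.find("stm_a");
    apply_raft_snapshot(it->second); // undefined behaviour after the erase
}
```

As noted in the reply above, this hazard does not apply here because the deserialized snapshot map is not mutated while the snapshots are being applied.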

@piyushredpanda piyushredpanda added this to the v23.3.10 milestone Mar 26, 2024
@mmaslankaprv
Member Author

/dt

@mmaslankaprv
Member Author

ci failure: #17247

@mmaslankaprv mmaslankaprv merged commit fe83597 into redpanda-data:dev Mar 27, 2024
14 of 18 checks passed
@mmaslankaprv mmaslankaprv deleted the fix-stm-manager branch March 27, 2024 07:21
@vbotbuildovich
Collaborator

/backport v23.3.x


Successfully merging this pull request may close these issues.

"Reader cannot read before start of the log" logging endlessly on enabling TS
5 participants