Fixed not applying snapshot for new stms #17112

Merged
merged 2 commits into redpanda-data:dev from fix-stm-manager on Mar 27, 2024

Conversation


@mmaslankaprv mmaslankaprv commented Mar 15, 2024

State machine manager manages the aggregated Raft snapshot for all the
state machines created on top of a single Raft instance. The managed
snapshot is a map containing individual snapshot data for each state
machine. When a new STM is created after the managed snapshot has
already been taken, `state_machine_base::apply_raft_snapshot` should
still be called even if no entry for that STM exists in the managed
snapshot map. This way the STM is informed that the log no longer
starts from offset 0.

Previously, when a snapshot entry was not present in the managed
snapshot map, we skipped calling `apply_snapshot` on the STM and didn't
advance its `_next` offset, which led to a stuck background apply fiber
loop.

Fixes: #17086
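
A minimal sketch of the behaviour described above, in plain C++ with stand-in types (the function name `apply_managed_snapshot`, the `std::string` payloads, and the empty-buffer call for a missing entry are illustrative assumptions, not the actual Redpanda implementation):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Stand-in for state_machine_base; the real code uses iobuf and model::offset.
struct stm {
    std::string name;
    int64_t last_applied = -1;
    int64_t next = 0;

    // Counterpart of state_machine_base::apply_raft_snapshot. `data` may be
    // empty when the aggregated snapshot has no entry for this STM.
    void apply_raft_snapshot(const std::string& data, int64_t last_offset) {
        (void)data; // a real STM would restore its state from `data`
        last_applied = last_offset;
        next = last_offset + 1; // the log no longer starts at offset 0
    }
};

// Sketch of the fixed manager behaviour: every STM that is behind the
// snapshot gets apply_raft_snapshot called, even when the snapshot map has
// no entry for it.
void apply_managed_snapshot(
  std::vector<stm>& machines,
  const std::map<std::string, std::string>& snapshot_map,
  int64_t last_offset) {
    for (auto& machine : machines) {
        if (machine.last_applied >= last_offset) {
            continue; // already caught up past the snapshot
        }
        auto it = snapshot_map.find(machine.name);
        // Before the fix a missing entry skipped the call entirely, leaving
        // `next` behind the new log start and wedging the apply fiber loop.
        machine.apply_raft_snapshot(
          it != snapshot_map.end() ? it->second : std::string{}, last_offset);
    }
}
```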

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

Bug Fixes

  • fixed enabling cloud storage in existing clusters

Review threads (resolved):
  • tests/rptest/tests/tiered_stoage_enable_test.py (3, outdated)
  • src/v/raft/state_machine_manager.cc
Signed-off-by: Michal Maslanka <michal@redpanda.com>
@mmaslankaprv
Member Author

/ci-repeat 1


if (stm_entry->stm->last_applied_offset() < last_offset) {
    if (it != snapshot.snapshot_map.end()) {
        co_await stm_entry->stm->apply_raft_snapshot(it->second);
Member


This looks like a use-after-free waiting to happen. `apply_raft_snapshot` takes an `iobuf&`, but if it `co_await`s internally, the snapshot map may change in the meantime, leaving the iterator (and the associated `iobuf`) invalid.

Member Author


The snapshot map is not changing here. Once deserialized, it is set in stone.
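
For context, a minimal plain-C++ illustration of the hazard the reviewer describes (hypothetical names, no coroutines; the `erase` stands in for "the map changes while the coroutine is suspended at a `co_await`"): a reference obtained through a map iterator stays valid only while the element stays in the map.

```cpp
#include <iostream>
#include <map>
#include <string>

std::map<std::string, std::string> snapshot_map;

// Stand-in for a suspending apply_raft_snapshot(iobuf&): if the map entry
// that `data` refers to is removed while we are "suspended", `data` dangles.
void apply_raft_snapshot(const std::string& data) {
    snapshot_map.erase("stm_a"); // simulated concurrent mutation
    std::cout << data << '\n';   // use-after-free if `data` was that entry
}

int main() {
    snapshot_map["stm_a"] = "snapshot bytes";
    auto it = snapshot_map.find("stm_a");
    apply_raft_snapshot(it->second); // undefined behaviour after the erase
}
```

As noted in the reply above, this hazard does not apply here because the deserialized snapshot map is not mutated while the snapshots are being applied.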

@piyushredpanda piyushredpanda added this to the v23.3.10 milestone Mar 26, 2024
@mmaslankaprv
Member Author

/dt

@mmaslankaprv
Member Author

ci failure: #17247

@mmaslankaprv mmaslankaprv merged commit fe83597 into redpanda-data:dev Mar 27, 2024
14 of 18 checks passed
@mmaslankaprv mmaslankaprv deleted the fix-stm-manager branch March 27, 2024 07:21
@vbotbuildovich
Collaborator

/backport v23.3.x


Successfully merging this pull request may close these issues.

"Reader cannot read before start of the log" logging endlessly on enabling TS
5 participants