[v24.2.x] rm_stm: fix a race during partition shutdown #24938

Merged
lf-rep merged 2 commits into redpanda-data:v24.2.x from vbotbuildovich:backport-pr-24936-v24.2.x-497 on Jan 27, 2025

@vbotbuildovich
Collaborator

Backport of PR #24936

Currently, the apply fiber can continue to run (and possibly add new
producers to the _producers map) while the state machine is shutting
down. This can manifest as hard-to-diagnose crashes, because the cleanup
destroys _producers without deregistering the producers properly.

First manifestation

Iterator invalidation in reset_producers(): it loops through _producers
with scheduling points in the loop body, and the state machine apply can
add new producers to the map while the loop is suspended.

future<> rm_stm::stop() {
.....
    co_await _gate.close();
    co_await reset_producers();  <---- interferes with state machine apply
    _metrics.clear();
    co_await raft::persisted_stm<>::stop();
.....
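
To make the interleaving above concrete, here is a minimal sketch in plain C++ (no Seastar). Everything in it is hypothetical: `toy_stm`, its `_stopping` flag, and `run_shutdown_demo()` are illustration-only names, and the guard shown is just one possible way to keep apply from adding producers once shutdown has begun. It is not necessarily what this PR does.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Toy stand-in for the state machine (hypothetical, not Redpanda's rm_stm).
class toy_stm {
public:
    // Models the apply fiber handling a batch from producer `pid`.
    // Returns true if the producer is tracked after the call.
    bool apply(int pid) {
        if (_stopping) {
            // Assumed guard: never create producers once stop() has begun.
            return false;
        }
        _producers.try_emplace(pid, "producer-" + std::to_string(pid));
        return true;
    }

    // Models stop(): flip the flag first, then clean up the map.
    void stop() {
        _stopping = true;
        _producers.clear();
    }

    std::size_t producer_count() const { return _producers.size(); }

private:
    bool _stopping = false;
    std::map<int, std::string> _producers;
};

// Usage example: an apply that arrives after stop() must leave the map empty.
inline void run_shutdown_demo() {
    toy_stm stm;
    assert(stm.apply(1));
    stm.stop();
    assert(!stm.apply(2));
    assert(stm.producer_count() == 0);
}
```

Without the `_stopping` check, the late `apply(2)` would repopulate the map after cleanup, which is the state the two manifestations in this description start from.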

Second manifestation

Crashes: every producer, on creation, registers with an intrusive list
in producer_state_manager using a safe link. If a new producer is
registered after reset_producers() has run, the map is later destroyed in
the state machine destructor without unlinking that producer from the
producer_state_manager, and the safe_link fires an assert.
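
The failure mode can be sketched with a toy hook that only records the violation instead of asserting. All names here (`toy_hook`, `toy_producer`, `toy_manager`, `destroy_map`) are hypothetical stand-ins, and the counter replaces the assert purely so the example can run to completion.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Toy safe-link style hook: it knows whether it is linked and records a
// violation if destroyed while still linked. A real safe-link hook would
// assert at this point instead of counting.
struct toy_hook {
    toy_hook() = default;
    toy_hook(const toy_hook&) = delete;
    toy_hook& operator=(const toy_hook&) = delete;
    ~toy_hook() {
        if (linked) {
            ++violations;
        }
    }
    bool linked = false;
    static inline std::size_t violations = 0;
};

struct toy_producer {
    toy_hook hook;
};

// Hypothetical manager that tracks producers through their hooks.
struct toy_manager {
    void link(toy_producer& p) { p.hook.linked = true; }
    void unlink(toy_producer& p) { p.hook.linked = false; }
};

// Destroys a map holding one registered producer and returns how many
// hooks were destroyed while still linked.
inline std::size_t destroy_map(bool unlink_first) {
    toy_hook::violations = 0;
    toy_manager mgr;
    {
        std::map<int, toy_producer> producers;
        auto it = producers.try_emplace(1).first;
        mgr.link(it->second);
        if (unlink_first) {
            for (auto& entry : producers) {
                mgr.unlink(entry.second);
            }
        }
    } // the map and its producers are destroyed here
    return toy_hook::violations;
}
```

Calling `destroy_map(false)` models a producer registered after cleanup already ran: the hook is still linked when the map goes away. `destroy_map(true)` models the intended order, where every producer is unlinked before destruction.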

As far as I can tell, this bug has always been present; it may have been
made worse by recent changes that added more scheduling points in the
surrounding code.

(cherry picked from commit fb57ccd)
(cherry picked from commit 873b282)
@vbotbuildovich vbotbuildovich added this to the v24.2.x-next milestone Jan 27, 2025
@vbotbuildovich vbotbuildovich added the kind/backport PRs targeting a stable branch label Jan 27, 2025
@vbotbuildovich
Collaborator Author

Retry command for Build#61213

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/scaling_up_test.py::ScalingUpTest.test_fast_node_addition

@vbotbuildovich
Collaborator Author

CI test results

test results on build#61213
| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| rptest.tests.scaling_up_test.ScalingUpTest.test_fast_node_addition | ducktape | https://buildkite.com/redpanda/redpanda/builds/61213#0194a7bc-fc70-4b10-ae96-fc5aabf0e61c | FAIL | 0/1 |

@lf-rep lf-rep merged commit e99985d into redpanda-data:v24.2.x Jan 27, 2025
4 checks passed
@BenPope BenPope modified the milestones: v24.2.x-next, v24.2.17 Jan 31, 2025

Labels

area/redpanda, kind/backport (PRs targeting a stable branch)
