
stm/dist_kv_stm: fix locking sequence for remove_all #17402

Merged: 1 commit into redpanda-data:dev on Mar 27, 2024

Conversation

@bharathv (Contributor) commented Mar 26, 2024

Currently remove_all takes a write lock for the entire duration, which prevents
replicate_and_wait() from making progress because it needs a read lock. Fix this
by making remove_all best effort, so it can just grab a read lock.
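
For context, here is a rough sketch of the before/after shape described above, written against std::shared_mutex purely for illustration. The names (kv_stm_sketch, remove_all_old, remove_all_new) are made up, and the real stm uses a Seastar-style rwlock rather than std::shared_mutex; this is not the dist_kv_stm code.

#include <mutex>
#include <shared_mutex>

struct kv_stm_sketch {
    std::shared_mutex _snapshot_lock;

    // Before: the exclusive (write) lock is held for the whole operation,
    // which blocks replicate_and_wait() because it needs the shared lock.
    void remove_all_old() {
        std::unique_lock lock(_snapshot_lock);
        // ... enumerate keys and replicate removals under the write lock ...
    }

    // After: best effort. Try the shared (read) lock; if it is not
    // immediately available (e.g. a writer holds it), give up rather than wait.
    bool remove_all_new() {
        std::shared_lock lock(_snapshot_lock, std::try_to_lock);
        if (!lock.owns_lock()) {
            return false; // best effort: nothing removed this time
        }
        // ... replicate removals; replicate_and_wait() can hold the read
        // lock concurrently, so it is no longer starved ...
        return true;
    }
};

The point is only that the write lock is no longer held across the whole operation, so a concurrent replicate_and_wait() that needs the read lock can make progress.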

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

Bug Fixes

  • Fixes lock starvation during transform offset commits.

rockwotj previously approved these changes Mar 26, 2024

@rockwotj (Contributor) left a comment


Nice catch, thanks for fixing this

@vbotbuildovich (Collaborator)

new failures in https://buildkite.com/redpanda/redpanda/builds/46822#018e7c71-1ba7-4479-823c-5dd8807187f5:

"rptest.tests.retention_policy_test.ShadowIndexingCloudRetentionTest.test_cloud_time_based_retention.cloud_storage_type=CloudStorageType.ABS"

@bharathv (Contributor, Author)

Failures unrelated:
#11269
#16561

Comment on lines 320 to 323
units.return_all();
auto read_units = _snapshot_lock.attempt_read_lock();
vassert(
read_units,

Member

I was under the impression that this wouldn't work because the underlying semaphore attempts to prevent starvation by forcing new acquirers to queue up behind other waiters?

@bharathv (Contributor, Author)

Yeah, you're right I think (nice catch). A sequence of events where this could fail:

u = write_lock (current holder), queue = [write_lock, ...]; all the units deposited by the current holder are grabbed by the first waiter.

Let me fix this.
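
To make that failure mode concrete, here is a minimal single-threaded model of a "fair" (FIFO) reader/writer semaphore. The names (fair_rwlock, attempt_read_lock, return_all) are made up to mirror the shape of the snippet quoted above; this is only a model, not the Seastar/Redpanda lock.

#include <cstdio>
#include <deque>

// Model: a write lock takes all units, a read lock takes one unit, and
// freed units are handed to the waiter at the front of the queue before
// any new acquirer gets a chance (starvation prevention).
struct fair_rwlock {
    static constexpr int capacity = 16;
    int available = capacity;
    std::deque<int> waiters; // pending requests, FIFO, by unit count

    // Another fiber queues up for the write lock (needs all units).
    void enqueue_write() { waiters.push_back(capacity); }

    // A new acquirer fails if anyone is still queued or no unit is free.
    bool attempt_read_lock() {
        if (!waiters.empty() || available < 1) {
            return false;
        }
        available -= 1;
        return true;
    }

    // The current write-lock holder returns all its units; the front
    // waiter is serviced first.
    void return_all() {
        available = capacity;
        if (!waiters.empty() && available >= waiters.front()) {
            available -= waiters.front();
            waiters.pop_front();
        }
    }
};

int main() {
    fair_rwlock lock;
    lock.available = 0;   // the current fiber holds the write lock (all units)
    lock.enqueue_write(); // another fiber is already waiting for a write lock

    lock.return_all();    // release: the freed units go to the queued writer
    bool got_read = lock.attempt_read_lock();

    // Prints "failed": the vassert(read_units, ...) above would have fired.
    std::printf("attempt_read_lock after return_all: %s\n",
                got_read ? "succeeded" : "failed (units went to queued writer)");
    return 0;
}

Because the freed units go to the queued writer first, the immediate attempt_read_lock() by the releasing fiber can fail, which is why asserting on it was unsafe.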

@rockwotj (Contributor)

Don't forget to update the cover letter

@bharathv requested a review from dotnwat on March 27, 2024 at 00:23

@piyushredpanda merged commit 1873856 into redpanda-data:dev on Mar 27, 2024
18 checks passed
@vbotbuildovich (Collaborator)

/backport v23.3.x

@vbotbuildovich (Collaborator)

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-17402-v23.3.x-426 remotes/upstream/v23.3.x
git cherry-pick -x 2c7b25cd04ae1605e006a807f756cb17d507e524

Workflow run logs.

@rockwotj (Contributor)

This code doesn't exist in v23.3.x, so we don't need a backport IIRC.

@dotnwat (Member) left a comment

lgtm
