
stm/dist_kv_stm: fix locking sequence for remove_all #17402

Merged: 1 commit into redpanda-data:dev on Mar 27, 2024

Conversation

@bharathv (Contributor) commented Mar 26, 2024

Currently remove_all takes a write lock for the entire duration, which prevents
replicate_and_wait() from making progress because it needs a read lock. Fix this
by making remove_all best effort, so it can just grab a read lock.
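
For context, here is a rough sketch of the before/after shape described above, written against std::shared_mutex purely for illustration. The names (kv_stm_sketch, remove_all_old, remove_all_new) are made up, and the real stm uses a Seastar-style rwlock rather than std::shared_mutex; this is not the dist_kv_stm code.

#include <mutex>
#include <shared_mutex>

struct kv_stm_sketch {
    std::shared_mutex _snapshot_lock;

    // Before: the exclusive (write) lock is held for the whole operation,
    // which blocks replicate_and_wait() because it needs the shared lock.
    void remove_all_old() {
        std::unique_lock lock(_snapshot_lock);
        // ... enumerate keys and replicate removals under the write lock ...
    }

    // After: best effort. Try the shared (read) lock; if it is not
    // immediately available (e.g. a writer holds it), give up rather than wait.
    bool remove_all_new() {
        std::shared_lock lock(_snapshot_lock, std::try_to_lock);
        if (!lock.owns_lock()) {
            return false; // best effort: nothing removed this time
        }
        // ... replicate removals; replicate_and_wait() can hold the read
        // lock concurrently, so it is no longer starved ...
        return true;
    }
};

The point is only that the write lock is no longer held across the whole operation, so a concurrent replicate_and_wait() that needs the read lock can make progress.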

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

Bug Fixes

  • Fixes lock starvation during transform offset commits.

rockwotj previously approved these changes Mar 26, 2024

@rockwotj (Contributor) left a comment


Nice catch, thanks for fixing this

@vbotbuildovich (Collaborator)

new failures in https://buildkite.com/redpanda/redpanda/builds/46822#018e7c71-1ba7-4479-823c-5dd8807187f5:

"rptest.tests.retention_policy_test.ShadowIndexingCloudRetentionTest.test_cloud_time_based_retention.cloud_storage_type=CloudStorageType.ABS"

@bharathv (Contributor, Author)

Failures unrelated:
#11269
#16561

Comment on lines 320 to 323
units.return_all();
auto read_units = _snapshot_lock.attempt_read_lock();
vassert(
read_units,

Member

I was under the impression that this wouldn't work because the underlying semaphore attempts to prevent starvation by forcing new acquirers to queue up behind other waiters?

@bharathv (Contributor, Author)

Yeah, you're right I think (nice catch). A sequence of events where this could fail:

u = write_lock (current holder), queue = [write_lock, ...]; all the units deposited by the current holder are grabbed by the first waiter.

Let me fix this.
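
To make that failure mode concrete, here is a minimal single-threaded model of a "fair" (FIFO) reader/writer semaphore. The names (fair_rwlock, attempt_read_lock, return_all) are made up to mirror the shape of the snippet quoted above; this is only a model, not the Seastar/Redpanda lock.

#include <cstdio>
#include <deque>

// Model: a write lock takes all units, a read lock takes one unit, and
// freed units are handed to the waiter at the front of the queue before
// any new acquirer gets a chance (starvation prevention).
struct fair_rwlock {
    static constexpr int capacity = 16;
    int available = capacity;
    std::deque<int> waiters; // pending requests, FIFO, by unit count

    // Another fiber queues up for the write lock (needs all units).
    void enqueue_write() { waiters.push_back(capacity); }

    // A new acquirer fails if anyone is still queued or no unit is free.
    bool attempt_read_lock() {
        if (!waiters.empty() || available < 1) {
            return false;
        }
        available -= 1;
        return true;
    }

    // The current write-lock holder returns all its units; the front
    // waiter is serviced first.
    void return_all() {
        available = capacity;
        if (!waiters.empty() && available >= waiters.front()) {
            available -= waiters.front();
            waiters.pop_front();
        }
    }
};

int main() {
    fair_rwlock lock;
    lock.available = 0;   // the current fiber holds the write lock (all units)
    lock.enqueue_write(); // another fiber is already waiting for a write lock

    lock.return_all();    // release: the freed units go to the queued writer
    bool got_read = lock.attempt_read_lock();

    // Prints "failed": the vassert(read_units, ...) above would have fired.
    std::printf("attempt_read_lock after return_all: %s\n",
                got_read ? "succeeded" : "failed (units went to queued writer)");
    return 0;
}

Because the freed units go to the queued writer first, the immediate attempt_read_lock() by the releasing fiber can fail, which is why asserting on it was unsafe.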

@rockwotj (Contributor)

Don't forget to update the cover letter

@bharathv requested a review from dotnwat on March 27, 2024 at 00:23

@piyushredpanda merged commit 1873856 into redpanda-data:dev on Mar 27, 2024
18 checks passed
@vbotbuildovich (Collaborator)

/backport v23.3.x

@vbotbuildovich (Collaborator)

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-17402-v23.3.x-426 remotes/upstream/v23.3.x
git cherry-pick -x 2c7b25cd04ae1605e006a807f756cb17d507e524

Workflow run logs.

@rockwotj (Contributor)

This code doesn't exist in v23.3.x, so we don't need a backport IIRC.

@dotnwat (Member) left a comment

lgtm
