[v25.3.x] cloud_io/cache: fix shard 0 misuse in put ENOSPC handler #30352

Merged
nvartolomei merged 2 commits into redpanda-data:v25.3.x from vbotbuildovich:ai-backport-pr-30335-v25.3.x-1777547943 on May 2, 2026

@vbotbuildovich
Collaborator

Backport of PR #30335

  • Command: git cherry-pick -x 47364d0 96d229f
  • Commits backported: 2
  • Conflicts resolved: 1
  • Commits skipped (already on target): 0
  • Backport branch: ai-backport-pr-30335-v25.3.x-1777547943

Conflict details

  • 96d229f (src/v/cloud_io/cache_service.cc): trim_throttled_unlocked signature on v25.3.x lacks the std::optional<ss::lowres_clock::time_point> deadline parameter that exists on dev. Kept the v25.3.x signature unchanged and applied the incoming vassert(ss::this_shard_id() == 0, "Method can only be invoked on shard 0") at the top of the function.

trim() asserts shard 0 only, so the local trim_throttled() call from
put's ENOSPC handler aborted when put ran on a non-zero shard. And
set_block_puts only flipped the calling shard's flag, leaving other
shards racing into a full disk.

Route trim_throttled to shard 0 via invoke_on(0, ...) and broadcast
the block-puts flag with invoke_on_all, mirroring notify_disk_status.

(cherry picked from commit 47364d0)
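
The routing described in this commit message can be sketched with a small standard-library model. This is not Seastar or Redpanda code: `shard_cache`, `sharded_cache`, and `handle_enospc` are hypothetical names, `invoke_on` and `invoke_on_all` run synchronously here instead of returning futures, and the order of the two calls is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for one per-shard cache instance.
struct shard_cache {
    std::size_t shard_id;
    bool block_puts;
    int trims_run;

    explicit shard_cache(std::size_t id)
      : shard_id(id), block_puts(false), trims_run(0) {}

    // Models the "shard 0 only" contract. The real assertion aborts the
    // process; this model throws so the failure can be observed.
    void trim_throttled() {
        if (shard_id != 0) {
            throw std::logic_error("trim_throttled invoked off shard 0");
        }
        ++trims_run;
    }
};

// Hypothetical stand-in for a sharded service: one instance per shard.
struct sharded_cache {
    std::vector<shard_cache> shards;

    explicit sharded_cache(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            shards.push_back(shard_cache(i));
        }
    }

    // Run fn against the instance owned by one specific shard.
    void invoke_on(
      std::size_t shard, const std::function<void(shard_cache&)>& fn) {
        fn(shards.at(shard));
    }

    // Run fn against every shard's instance.
    void invoke_on_all(const std::function<void(shard_cache&)>& fn) {
        for (auto& s : shards) {
            fn(s);
        }
    }
};

// ENOSPC handling as the commit message describes it: the block-puts flag is
// broadcast to all shards and the trim is routed to shard 0. The calling
// shard never trims locally, so the parameter is deliberately unused.
void handle_enospc(sharded_cache& cache, std::size_t /*calling_shard*/) {
    cache.invoke_on_all([](shard_cache& c) { c.block_puts = true; });
    cache.invoke_on(0, [](shard_cache& c) { c.trim_throttled(); });
}
```

In this model, calling `handle_enospc(cache, 2)` on a four-shard cache sets `block_puts` on every shard and runs exactly one trim on shard 0, whereas calling `trim_throttled()` directly on shard 2 would throw.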

Add vasserts to trim_throttled, trim_throttled_unlocked,
maybe_background_trim, and sync_access_time_tracker so that any future
caller from a non-zero shard fails loudly rather than silently
operating on shard-local state that's only meaningful on shard 0.
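
A minimal sketch of the guard pattern, under stated assumptions: `current_shard`, `this_shard_id()`, `shard0_assert`, and `cache_model` are hypothetical stand-ins rather than Seastar or Redpanda definitions, and the guard throws so that the failure is observable, whereas the real assertion aborts the process.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for ss::this_shard_id(); set directly by the caller.
thread_local unsigned current_shard = 0;
unsigned this_shard_id() { return current_shard; }

// Simplified stand-in for the vassert guard added at the top of each method.
inline void shard0_assert(const char* method) {
    if (this_shard_id() != 0) {
        throw std::logic_error(
          std::string(method) + ": Method can only be invoked on shard 0");
    }
}

struct cache_model {
    int trims = 0;

    void trim_throttled_unlocked() {
        shard0_assert("trim_throttled_unlocked");
        ++trims; // shard-local state that is only meaningful on shard 0
    }

    void trim_throttled() {
        shard0_assert("trim_throttled");
        trim_throttled_unlocked();
    }
};
```

Guarding both the public entry point and the unlocked helper means a new caller that reaches the helper directly from another shard still fails before touching any state.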

Also break() the cleanup, access-tracker-writer, and tracker-sync
semaphores on non-zero shards from the constructor: those semaphores
exist solely to serialise shard 0 work, and any get_units call on
another shard is a bug.

(cherry picked from commit 96d229f)
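
The semaphore change in the second commit can be illustrated the same way. The sketch below assumes a simplified, non-blocking `breakable_semaphore` and hypothetical member names (`cleanup_sm`, `writer_sm`, `sync_sm`); it is not the Seastar semaphore, which hands out units asynchronously.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical minimal semaphore that can be marked broken.
class breakable_semaphore {
public:
    explicit breakable_semaphore(std::size_t units) : _units(units) {}

    // After this call every acquisition attempt fails.
    void broken() { _broken = true; }

    // Non-blocking acquire: throws if broken or if too few units remain.
    void get_units(std::size_t n) {
        if (_broken) {
            throw std::runtime_error("semaphore broken");
        }
        if (n > _units) {
            throw std::runtime_error("not enough units");
        }
        _units -= n;
    }

    std::size_t available() const { return _units; }

private:
    std::size_t _units;
    bool _broken = false;
};

// Hypothetical cache model: the semaphores serialise shard 0 work only, so
// the constructor breaks them on every other shard.
struct cache_model {
    breakable_semaphore cleanup_sm{1};
    breakable_semaphore writer_sm{1};
    breakable_semaphore sync_sm{1};

    explicit cache_model(unsigned shard_id) {
        if (shard_id != 0) {
            cleanup_sm.broken();
            writer_sm.broken();
            sync_sm.broken();
        }
    }
};
```

With this arrangement an acquisition on shard 0 succeeds as usual, while the same call on any other shard fails immediately instead of quietly taking units from a semaphore that protects nothing there.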
@vbotbuildovich vbotbuildovich added this to the v25.3.x-next milestone Apr 30, 2026
@vbotbuildovich vbotbuildovich added the kind/backport PRs targeting a stable branch label Apr 30, 2026
@vbotbuildovich
Collaborator Author

CI test results

test results on build#83878
  • test_status: FLAKY(PASS)
  • test_class: SIPartitionMovementTest
  • test_method: test_cross_shard
  • test_arguments: {"cloud_storage_type": 1, "num_to_upgrade": 0, "with_cloud_topics": false}
  • test_kind: integration
  • job_url: https://buildkite.com/redpanda/redpanda/builds/83878#019dde30-3713-4f47-8b68-e5f73973967a
  • passed: 10/11
  • reason: Test PASSES after retries. No significant increase in flaky rate (baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)
  • test_history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_cross_shard

@nvartolomei nvartolomei merged commit c225b7a into redpanda-data:v25.3.x May 2, 2026
19 checks passed
@tyson-redpanda tyson-redpanda modified the milestones: v25.3.x-next, v25.3.15 May 29, 2026
Labels

area/redpanda kind/backport PRs targeting a stable branch