
[v26.1.x] ct/l1: integrate metastore with cloud cache #30201

Merged
andrwng merged 11 commits into redpanda-data:v26.1.x from
vbotbuildovich:backport-pr-30045-v26.1.x-29
Apr 16, 2026

Conversation

@vbotbuildovich
Collaborator

Backport of PR #30045

andrwng added 11 commits April 16, 2026 20:56
Plumb an optional lowres_clock deadline through the reserve_space
path so blocking waits respect it instead of waiting indefinitely.

Upcoming work to integrate the metastore with the cloud cache will use
this to bound reservation waits during staging file SST writes.

(cherry picked from commit 56ab6c3)
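A minimal sketch of the idea, using the standard library rather than Seastar's `lowres_clock` and not the actual Redpanda types (the `space_tracker` class and its members are hypothetical): with no deadline the caller blocks until space frees up, as before; with a deadline the wait is bounded and the reservation fails on timeout.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <optional>

// Hypothetical reservation tracker illustrating a deadline-bounded wait.
class space_tracker {
public:
    explicit space_tracker(long capacity) : free_(capacity) {}

    // Returns true if `bytes` were reserved before the deadline expired.
    bool reserve_space(long bytes,
                       std::optional<std::chrono::steady_clock::time_point>
                           deadline = std::nullopt) {
        std::unique_lock lk{mu_};
        auto have_space = [&] { return free_ >= bytes; };
        if (deadline) {
            if (!cv_.wait_until(lk, *deadline, have_space)) {
                return false; // deadline hit while still out of space
            }
        } else {
            cv_.wait(lk, have_space); // unbounded wait, the old behavior
        }
        free_ -= bytes;
        return true;
    }

    void release(long bytes) {
        {
            std::scoped_lock lk{mu_};
            free_ += bytes;
        }
        cv_.notify_all();
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    long free_;
};
```

The optional parameter keeps existing callers unchanged while letting the staging-file write path bound its waits.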
Lets a single guard accumulate reservations from multiple chunks.

Upcoming work will add a staging_file handle that reserves cache space
incrementally as data is appended, consolidating each chunk into
one guard so commits have exactly one reservation to finalize.

(cherry picked from commit 9ad51e9)
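A sketch of the guard shape being described, with hypothetical names (`cache_tracker`, `reservation_guard`) rather than the real cache types: each chunk reservation folds into one guard, the commit path finalizes exactly one reservation, and an uncommitted guard returns everything on destruction.

```cpp
#include <cstddef>

// Hypothetical tracker of total reserved cache space.
struct cache_tracker {
    std::size_t reserved = 0;
};

// RAII guard that accumulates reservations from multiple chunks.
class reservation_guard {
public:
    explicit reservation_guard(cache_tracker& t) : tracker_(&t) {}
    reservation_guard(const reservation_guard&) = delete;
    ~reservation_guard() {
        // Not committed: give every accumulated chunk back.
        if (tracker_) {
            tracker_->reserved -= bytes_;
        }
    }

    // Reserve one more chunk and fold it into this guard.
    void add_chunk(std::size_t chunk_bytes) {
        tracker_->reserved += chunk_bytes;
        bytes_ += chunk_bytes;
    }

    std::size_t bytes() const { return bytes_; }

    // Finalize: the reservation now belongs to the committed file.
    std::size_t commit() {
        tracker_ = nullptr;
        return bytes_;
    }

private:
    cache_tracker* tracker_;
    std::size_t bytes_ = 0;
};
```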
Open files with O_EXCL instead of truncate in disk_persistence and
cloud_persistence. Reject duplicate file handles in
memory_persistence. The LSM engine guarantees unique file handles,
so duplicates indicate a bug and should fail loudly.

This is useful because I intend to add another persistence type with
this behavior, and it will be simpler for it and the other persistence
implementations to share this guarantee than to keep the overwrite
behavior (especially since the LSM should guarantee unique IDs).

(cherry picked from commit 630d2b8)
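The fail-loud open can be sketched with a plain POSIX call (the helper name is hypothetical; the real code goes through the persistence layer): `O_CREAT | O_EXCL` makes a second create of the same path fail with `EEXIST` instead of silently truncating, surfacing a duplicate file handle as a bug.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cerrno>
#include <stdexcept>
#include <string>

// Create a file that must not already exist; a duplicate is a bug.
int create_exclusive(const std::string& path) {
    // O_EXCL: fail with EEXIST rather than truncating an existing file.
    int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0 && errno == EEXIST) {
        throw std::runtime_error("duplicate file handle: " + path);
    }
    return fd;
}
```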
.part files represent in-flight writes whose space is already
accounted for via reservations. Deleting them during trim_exhaustive
would cause subsequent reads to fail, only to retry the download and
recreate the file, so deleting them gains little.

Upcoming work will add a staging_file handle that creates new .part
files for writes, making this race more likely.

The original behavior was added (redpanda-data#11860) more as a
precaution against hypothetical runtime bugs than as a functional fix.
As is, .part files are cleaned up at cache startup anyway.

(cherry picked from commit 41a3853)
Adds a handle bundling everything needed to write a file into the cache.
Callers append data (cache space is reserved automatically in chunks),
then commit or close depending on whether the appends were successful.

This will be used to write SST staging files in an upcoming LSM
persistence implementation that uses the cloud cache. The idea will be
that SSTs will append to these staging files (rather than in the L1
staging directory) and then after uploading, commit them into the cloud
cache so subsequent reads can use the local file.

(cherry picked from commit 6f3a535)
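The append-then-commit-or-close lifecycle can be sketched as follows; this is a simplified stand-in (in-memory buffer, hypothetical `staging_file` shape) for a handle that would really write a .part file, but it shows the chunked reservation growth and the two exit paths.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical staging-file handle: reserves cache space in fixed
// chunks as data is appended; commit() keeps the data (one reservation
// to finalize), close() without commit returns the reservation.
class staging_file {
public:
    static constexpr std::size_t chunk = 1024;

    explicit staging_file(std::size_t& cache_reserved)
      : reserved_(cache_reserved) {}

    void append(const std::string& data) {
        data_.insert(data_.end(), data.begin(), data.end());
        // Grow the reservation one chunk at a time as the file grows.
        while (reserved_bytes_ < data_.size()) {
            reserved_ += chunk;
            reserved_bytes_ += chunk;
        }
    }

    // Success path: hand the file (and its reservation) to the cache.
    std::vector<char> commit() {
        committed_ = true;
        return std::move(data_);
    }

    // Failure path: drop the data and release the reservation.
    void close() {
        if (!committed_) {
            reserved_ -= reserved_bytes_;
            reserved_bytes_ = 0;
        }
    }

private:
    std::size_t& reserved_;
    std::size_t reserved_bytes_ = 0;
    bool committed_ = false;
    std::vector<char> data_;
};
```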
Factor reusable code out of cloud_persistence.cc into
cloud_data_persistence_base so the cache-backed implementation can reuse
it. At a high level, this is code that interacts with the cloud
directly, with the idea that we'll use this in a persistence
implementation that is similar to the cloud persistence but uses cloud
cache instead of the staging directory for local files. No behavior
change.

(cherry picked from commit bb20d8f)
Cache-backed data_persistence. Reads check the cache first and
download on miss. Writes stage through the cache and upload to cloud,
so subsequent reads hit locally.

(cherry picked from commit c117cba)
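The read-through/write-through flow being described can be sketched with toy in-memory stand-ins (`cloud_store` and `cached_persistence` are illustrative names, not the real interfaces): reads consult the cache first and download on miss, populating the cache; writes stage into the cache and upload, so the next read hits locally.

```cpp
#include <map>
#include <optional>
#include <string>

// Toy stand-in for the remote object store.
struct cloud_store {
    std::map<std::string, std::string> objects;
};

// Cache-backed persistence: read-through on miss, write-through on put.
class cached_persistence {
public:
    explicit cached_persistence(cloud_store& cloud) : cloud_(cloud) {}

    std::optional<std::string> read(const std::string& key) {
        if (auto it = cache_.find(key); it != cache_.end()) {
            return it->second; // cache hit: no cloud round trip
        }
        auto it = cloud_.objects.find(key);
        if (it == cloud_.objects.end()) {
            return std::nullopt;
        }
        cache_[key] = it->second; // download on miss, fill the cache
        return it->second;
    }

    void write(const std::string& key, const std::string& data) {
        cache_[key] = data;         // stage through the cache
        cloud_.objects[key] = data; // then upload to cloud
    }

    bool cached(const std::string& key) const {
        return cache_.count(key) > 0;
    }

private:
    cloud_store& cloud_;
    std::map<std::string, std::string> cache_;
};
```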
replicated_db::open() takes cloud_io::cache* instead of a staging
directory path. Thread the cache pointer through domain_supervisor
and db_domain_manager in place of the staging directory.

(cherry picked from commit 7dda78c)
The LSM previously staged SST files in the l1_staging directory via
cloud_data_persistence. Now that it uses the cloud cache, those files
are orphaned. Widen the startup cleanup to delete all regular files
in the staging directory, not just .tmp files.

(cherry picked from commit 98dc328)
Switch the read replica database_refresher to cache-backed LSM
persistence. The staging directory is still passed through for
l1::file_io which needs it for L1 object staging.

(cherry picked from commit 5d92c77)
The cloud_persistence code is now superseded by
cloud_cache_persistence. Updates the remaining callers (just tests) to
use cloud_cache_persistence instead, and moves the helpers and shared
code into cloud_cache_persistence.

(cherry picked from commit f4781ff)
@vbotbuildovich vbotbuildovich added this to the v26.1.x-next milestone Apr 16, 2026
@vbotbuildovich vbotbuildovich added the kind/backport PRs targeting a stable branch label Apr 16, 2026
@vbotbuildovich vbotbuildovich requested a review from andrwng April 16, 2026 20:57
@andrwng andrwng enabled auto-merge April 16, 2026 22:45
@vbotbuildovich
Collaborator Author

CI test results

test results on build#83275
FLAKY(PASS) DataMigrationsMultiClusterTest.test_with_consumer_groups (integration, 10/11 passed)
  Test passes after retries. No significant increase in flaky rate
  (baseline=0.0000, p0=1.0000, reject_threshold=0.0100; adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000).
  job: https://buildkite.com/redpanda/redpanda/builds/83275#019d9831-423f-4e32-91b1-073a285826b8
  history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DataMigrationsMultiClusterTest&test_method=test_with_consumer_groups

FLAKY(PASS) WriteCachingFailureInjectionE2ETest.test_crash_all {"use_transactions": false} (integration, 10/11 passed)
  Test passes after retries. No significant increase in flaky rate
  (baseline=0.0688, p0=1.0000, reject_threshold=0.0100; adj_baseline=0.1926, p1=0.1177, trust_threshold=0.5000).
  job: https://buildkite.com/redpanda/redpanda/builds/83275#019d9830-61d1-4bef-ba31-863500574a5f
  history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all

@andrwng andrwng merged commit 8aed8ce into redpanda-data:v26.1.x Apr 16, 2026
20 checks passed

Labels

area/build area/redpanda kind/backport PRs targeting a stable branch
