
archival: Adjacent segment merging #8238

Merged

Conversation

Contributor

@Lazin Lazin commented Jan 14, 2023

This PR introduces a shard-local upload_housekeeping_service that can run jobs in the background. ntp_archiver instances can create and register housekeeping_job objects, which the upload_housekeeping_service then runs periodically.

The service uses the following scheduling strategy:

  • if the cloud_storage::remote is idle, the service activates and executes jobs one by one;
  • when all jobs have been executed, it goes back into the idle state;
  • if one of the ntp_archiver or remote_partition objects starts using the cloud storage API, the service pauses execution without forcibly interrupting the current job;
  • when cloud_storage_housekeeping_interval_ms passes, the service goes into draining mode and executes all jobs that have not been executed yet;
  • when all jobs have been executed, the service goes back into the idle state.

This approach guarantees that every registered job is executed once per cloud_storage_housekeeping_interval_ms. A simplified sketch of these state transitions is shown below.
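
Purely as an illustration of the transitions described above: this is a synchronous toy model with hypothetical names, not the real Seastar-based service.

enum class housekeeping_state { idle, active, paused, draining };

struct scheduler_model {
    housekeeping_state state{housekeeping_state::idle};
    unsigned pending_jobs{0};

    // cloud_storage::remote has been quiet for the idle timeout
    void on_remote_idle() {
        if (state == housekeeping_state::idle && pending_jobs > 0) {
            state = housekeeping_state::active;   // run jobs one by one
        }
    }

    // an archiver or remote_partition started using the cloud storage API
    void on_remote_activity() {
        if (state == housekeeping_state::active) {
            state = housekeeping_state::paused;   // the current job is not interrupted
        }
    }

    // cloud_storage_housekeeping_interval_ms elapsed
    void on_interval_elapsed() {
        if (pending_jobs > 0) {
            state = housekeeping_state::draining; // run everything that is left
        }
    }

    // a housekeeping job finished
    void on_job_done() {
        if (pending_jobs > 0 && --pending_jobs == 0) {
            state = housekeeping_state::idle;
        }
    }
};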

The remote object was extended with a notification mechanism that can be used to subscribe to all cloud storage API events. A subscriber can also filter out some events; this is used to filter out the events triggered by the housekeeping jobs themselves.
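
A rough, hypothetical sketch of the subscribe-with-filter idea using plain standard-library types (the real cloud_storage::remote API and its event types differ):

#include <functional>
#include <vector>

enum class api_event { segment_upload, segment_download, segment_delete, manifest_upload, manifest_download };

struct event_hub {
    struct subscription {
        std::function<void(api_event)> on_event;    // subscriber callback
        std::function<bool(api_event)> is_filtered; // drop events the subscriber itself caused
    };

    std::vector<subscription> subscribers;

    void notify(api_event ev) {
        for (auto& s : subscribers) {
            if (!s.is_filtered || !s.is_filtered(ev)) {
                s.on_event(ev);
            }
        }
    }
};

In this model the housekeeping service would register a filter that discards events originating from its own uploads, so its own activity cannot pause it.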

There are two housekeeping jobs per archiver: local adjacent segment reupload and remote adjacent segment reupload. The first reuploads small segments when the data is still available on disk; the remote job downloads the segments from the cloud in order to reupload them.
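
Conceptually, both jobs plug into the same job abstraction and are registered with the shard-local service. The sketch below is a simplified, hypothetical rendering; the real housekeeping_job::run() is asynchronous and takes a retry_chain_node, and the concrete class names differ.

#include <memory>
#include <utility>
#include <vector>

struct housekeeping_job {
    virtual ~housekeeping_job() = default;
    virtual void run() = 0;       // do one round of housekeeping work
    virtual void interrupt() = 0; // ask the job to stop early
};

// One flavour reuploads small segments straight from local disk, the other
// downloads them from the cloud first and then reuploads the merged result.
enum class merge_source { local, remote };

struct adjacent_segment_merge_job final : housekeeping_job {
    explicit adjacent_segment_merge_job(merge_source s) : source(s) {}
    merge_source source;
    void run() override { /* find a run of small adjacent segments and reupload it */ }
    void interrupt() override {}
};

struct upload_housekeeping_service_sketch {
    std::vector<std::unique_ptr<housekeeping_job>> jobs;
    void register_job(std::unique_ptr<housekeeping_job> j) { jobs.push_back(std::move(j)); }
};

Each ntp_archiver would register both flavours with the shard-local service.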

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

Release Notes

Features

  • Small adjacent segments in Tiered Storage are now periodically merged together in the background to improve efficiency

@Lazin Lazin marked this pull request as draft January 14, 2023 20:06
@Lazin Lazin force-pushed the feature/background-segment-merging branch 2 times, most recently from fbc2116 to 579bc91 on January 18, 2023 23:41
@Lazin Lazin marked this pull request as ready for review January 18, 2023 23:44
@Lazin Lazin requested review from jcsp and abhijat January 18, 2023 23:45
Contributor

@andrwng andrwng left a comment

Just leaving some high level thoughts for now.

src/v/archival/segment_reupload.cc (outdated, resolved)
src/v/archival/ntp_archiver_service.cc (resolved)
src/v/archival/upload_housekeeping_service.cc (resolved)
src/v/archival/ntp_archiver_service.cc (outdated, resolved)
src/v/archival/adjacent_segment_merger.cc (resolved)
src/v/archival/ntp_archiver_service.cc (outdated, resolved)
return o;
}

upload_housekeeping_service::upload_housekeeping_service(
Contributor

I'm interested in the overarching design of this, having a single loop that goes through the available tasks and executes them. Long term I think it can be helpful to sort/prioritize work fairly based on the current needs of the system.

That said, was this out of necessity? Could adjacent segment merging have just been added to the existing archiver housekeeping path? Curious why the architectural shift?

src/v/archival/upload_housekeeping_service.cc (resolved)
src/v/redpanda/application.cc (resolved)
src/v/config/configuration.cc (resolved)
src/v/archival/ntp_archiver_service.cc (resolved)
src/v/archival/tests/CMakeLists.txt (resolved)
src/v/archival/types.cc (resolved)
"offset: {}",
_manifest.get_ntp(),
start_offset);
return {};
}

const auto& segment = *it;
if (segment->finished_self_compaction()) {
auto is_compacted = segment->finished_self_compaction();
Contributor

Variable names are confusing here: "is_compacted" refers to whether the segment has been compacted yet, but non_compacted refers to the configuration of the topic. Maybe change these to is_segment_compacted and is_topic_compacted or similar.

Contributor Author

fixed

src/v/archival/segment_reupload.cc (outdated, resolved)
src/v/archival/adjacent_segment_merger.cc (resolved)
src/v/archival/adjacent_segment_merger.cc (outdated, resolved)
"Smallest acceptable segment size in the cloud storage. Default: "
"cloud_storage_segment_size_target/2",
{.needs_restart = needs_restart::no, .visibility = visibility::tunable},
std::nullopt)
, cloud_storage_upload_ctrl_update_interval_ms(
*this,
Contributor

Please can you add an additional bool configuration for entirely switching off the feature (i.e. not registering any housekeeping jobs at all), in case we need it in the field.

We can mostly disable it by setting the size target to something small, but I would like to be able to switch off even the scheduling + inspection of the manifest.
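
For illustration, such a kill switch could follow the property pattern visible in the surrounding diff; the option name, description, and default below are assumptions rather than the final implementation:

, cloud_storage_enable_segment_merging(
    *this,
    "cloud_storage_enable_segment_merging",
    "Enable adjacent segment merging. When disabled, no merger housekeeping "
    "jobs are registered and the manifest is never scanned for merge candidates",
    {.needs_restart = needs_restart::no, .visibility = visibility::tunable},
    true)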

@Lazin Lazin force-pushed the feature/background-segment-merging branch 2 times, most recently from 63899b6 to 5ba4f52 on January 20, 2023 13:07

namespace archival {

enum class housekeeping_state {
// Housekeeping is on pause
Contributor

nit: maybe say "Housekeeping is idle" to avoid confusion with the pause state.


ss::future<> run(retry_chain_node& rtc) override {
vlog(test_log.info, "mock job executed");
executed++;
Contributor

nit: executed actually means started here, right? We increment executed before the job finishes.

src/v/archival/ntp_archiver_service.cc (outdated, resolved)
super().setUp() # topic is created here

def tearDown(self):
self.s3_client.empty_bucket(self.s3_bucket_name)
Contributor

The name of s3_client changed to cloud_storage_client to support Azure parametrisation.

Comment on lines 110 to 127
def _find_partition_manifests(self):
res = []
for obj in self.s3_client.list_objects(self.s3_bucket_name):
if obj.Key.endswith("manifest.json") and not obj.Key.endswith(
"topic_manifest.json"):
res.append(obj.Key)
return res

def _download_partition_manifest(self, manifest_path):
"""Find and download individual partition manifest"""
manifest = self.s3_client.get_object_data(self.s3_bucket_name,
manifest_path)
self.logger.info(f"manifest found: {manifest}")
return json.loads(manifest)
Contributor

Rename s3_client to cloud_storage_client and use key instead of Key.
See 46d5495 and 011363d.

@Lazin Lazin force-pushed the feature/background-segment-merging branch from 5ba4f52 to aab6caa on January 20, 2023 15:47
const cloud_storage::partition_manifest& manifest)
-> std::optional<adjacent_segment_run> {
adjacent_segment_run run(_archiver.get_ntp());
auto [low_watermark, high_watermark] = get_low_high_segment_size(
Contributor

nit: can we rename these to size_lower_bound and size_upper_bound or somesuch? "watermark" is a bit ambiguous IMO

if (meta.size_bytes + s.size_bytes <= max_size) {
// Move the end of the small segment run forward
meta.committed_offset = s.committed_offset;
meta.max_timestamp = s.max_timestamp;
Contributor

Is it possible for segments to have out of order timestamps? Should we std::max here?
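
For illustration only, the guard the comment suggests would look roughly like this, using toy stand-in types rather than the real cloud_storage::segment_meta:

#include <algorithm>
#include <cstdint>

struct toy_segment_meta {
    int64_t committed_offset{0};
    int64_t max_timestamp{0};
};

// Extend the run with segment `s`: take the max instead of overwriting, so an
// out-of-order segment timestamp cannot move the run's max_timestamp backwards.
inline void extend_run(toy_segment_meta& run, const toy_segment_meta& s) {
    run.committed_offset = s.committed_offset;
    run.max_timestamp = std::max(run.max_timestamp, s.max_timestamp);
}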

Comment on lines +120 to +132
bool
maybe_add_segment(const cloud_storage::segment_meta& s, size_t max_size);
Contributor

nit: could you add a comment, at least mentioning what this returns? Without reading the rest of the code, I would've assumed it meant "Returns true if the segment was actually added", but it looks like it means "Returns true if the resulting run has more than one segment"?

src/v/archival/types.cc (resolved)
Comment on lines +89 to +88
if (!run.maybe_add_segment(meta, high_watermark)) {
continue;
}
Contributor

Should this be a break?

src/v/archival/adjacent_segment_merger.cc (resolved)
cloud_storage::remote& _remote;
/// Idle timer, used to detect idle state
ss::timer<ss::lowres_clock> _idle_timer;
ss::timer<ss::lowres_clock> _epoch_timer;
Contributor

nit: maybe we can call this _force_housekeep_timer or somesuch? "epoch" is a bit too general and doesn't quite capture the context without already understanding how it works IMO

@Lazin Lazin force-pushed the feature/background-segment-merging branch from aab6caa to 0ae4021 on January 20, 2023 22:50
Contributor

@andrwng andrwng left a comment

PR looks like it has all the necessary pieces, minus a small bug (I think). Looks like CI is unhappy with it though (segfaults in most DT tests).

}
}

void adjacent_segment_merger::interrupt() { _as.request_abort(); }
Contributor

Should this abort_source be used for anything else? At the moment it just looks like a bool, and we don't really use it in our async work.

Comment on lines +531 to +533
co_await dynamic_cast<adjacent_segment_merger*>(
_local_segment_merger.get())
->stop();
Contributor

nit: would it make sense to give housekeeping_job an abstract stop() method?
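
As a sketch of that suggestion (simplified, synchronous signatures; the real stop() would return a seastar future), hoisting stop() into the abstract interface would let the owner stop any job without the dynamic_cast above:

struct housekeeping_job {
    virtual ~housekeeping_job() = default;
    virtual void run() = 0;
    virtual void interrupt() = 0;
    virtual void stop() = 0; // each concrete job knows how to stop itself
};

// With a future-returning stop(), the call site above would reduce to:
//   co_await _local_segment_merger->stop();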

si_settings = SISettings(
test_context,
cloud_storage_max_connections=10,
log_segment_size=0x10000,
Contributor

nit: probably easier for future readers as 65536

Comment on lines +65 to +70
def setUp(self):
super().setUp() # topic is created here

def tearDown(self):
self.cloud_storage_client.empty_bucket(self.bucket_name)
super().tearDown()
Contributor

nit: can omit all of these? They should happen by default already

# Every 'produce' call should create at least one segment
# in the cloud which is 1MiB
self.kafka_tools.produce(self.topic, 1024, 1024, acks)
time.sleep(1)
Contributor

Would it be non-flaky to add a wait that we have more than one segment in the cloud?

@Lazin Lazin force-pushed the feature/background-segment-merging branch from 0ae4021 to 9196d9e on January 21, 2023 00:38
andrwng previously approved these changes Jan 21, 2023
Contributor

@andrwng andrwng left a comment

CI failure looks like #8293

Seems like this has all the major components in; if we're in a rush to get this merged, we can address feedback in follow-on PRs. Would be good to get a look from others, since the latest commits landed recently

The service is not doing anything yet. This is only scaffolding for the
next commits. It contains an intrusive list of all ntp_archiver
instances. The instances are registered by the cloud::partition after
being created. The upload_housekeeping_service is created in
application.cc as a sharded service and passed to the partition_manager
and further to the partition.

The method can be used to compare two retry_chain_node objects. If they
both have the same root the method returns true.

The method can be used to subscribe to usage events. It returns a
future that becomes available when the remote is used to
upload/download/delete a segment or upload/download a manifest. The
caller can then subscribe to the next event.

The idle timeout is used by the cloud storage to decide when the system
becomes idle. If no uploads or downloads happened during the idle timeout,
the system is considered idle and housekeeping jobs can be triggered
without waiting for the full cloud storage housekeeping interval.

Implement scheduling logic for the upload housekeeping service. The service
is not tied to the ntp_archiver anymore. It schedules housekeeping_job
objects. The ntp_archiver is supposed to register its periodic job, which
will be executed by the service.

The service runs all jobs once per epoch. The epoch is one
cloud_storage_housekeeping_interval_ms period. If the cloud storage is
idle for cloud_storage_idle_timeout_ms, housekeeping may be triggered
earlier, but in that case it may also be paused if the cloud storage
becomes active again. The goal is to prevent housekeeping from running
in bursts and to run it more evenly if possible.

This is a wrapper for seastar::sleep_abortable which accepts more than
one abort source. It can be useful when you can't decide which
abort_source to pick!
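
A hedged, non-Seastar illustration of the idea (the real helper wraps seastar::sleep_abortable and Seastar abort sources; here condition variables and atomic flags stand in, and all names are hypothetical):

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <initializer_list>
#include <mutex>

struct toy_abort_source {
    std::atomic<bool> aborted{false};
};

// Sleep for `d`, waking early if any of the given sources gets aborted.
// Whoever sets `aborted` must also notify the condition variable.
// Returns true if the sleep was aborted, false if the timeout expired.
inline bool sleep_abortable_any(
  std::chrono::milliseconds d,
  std::initializer_list<const toy_abort_source*> sources,
  std::condition_variable& cv,
  std::mutex& m) {
    auto any_aborted = [&] {
        for (const auto* s : sources) {
            if (s->aborted.load()) {
                return true;
            }
        }
        return false;
    };
    std::unique_lock<std::mutex> lk(m);
    return cv.wait_for(lk, d, any_aborted);
}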

The parameter 'cloud_storage_segment_size_min' represents the minimal
segment size that redpanda should re-upload to the cloud storage.

The object represents a series of adjacent segments. It contains
inclusive begin and end offsets, the number of segments, and their total
size.
segments. This is needed to reupload small segments.

The object can be used to re-upload segments from local data. The
housekeeping job is a local_adjacent_segment_merger which uploads
segments from local disk to S3 when there are a lot of small
segments.

The 'ntp_archiver' is extended. A method that returns the next upload
candidate based on a predicate is added. The new 'upload' method can be
used to upload this candidate to the cloud storage. Both methods are
supposed to be used externally, by the upload_housekeeping_service.
The 'upload' method takes hold of the semaphore that protects the
upload path.

The 'upload', 'upload_tx', and 'upload_segment' methods now accept
the 'retry_chain_node' optionally. By default they use the internal
retry chain of the 'ntp_archiver'. But when the
'upload_housekeeping_service' calls the 'upload' method it has to pass
its own 'retry_chain_node' to suppress the notification that the upload
may trigger. Otherwise the first upload done by the
'upload_housekeeping_service' would send a notification back to the
housekeeping service and put subsequent uploads on pause.

The 'find_upload_candidates' method first uses the predicate to find
candidates in the manifest. Then it uses 'segment_collector' to create a
multi-segment upload candidate with locks. The
'local_adjacent_segment_merger' uses a predicate that scans the manifest
for the next series of small segments that can be merged and re-uploaded.
The merging does not happen on disk, only in the cloud storage.
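
A self-contained sketch of the candidate-selection idea with toy types (the real logic lives in the archiver's predicate and the segment_collector, and the size thresholds come from configuration):

#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

struct toy_segment {
    int64_t base_offset;
    int64_t committed_offset;
    size_t size_bytes;
};

struct toy_run {
    int64_t base_offset{0};
    int64_t committed_offset{0};
    size_t size_bytes{0};
    size_t num_segments{0};
};

// Scan the manifest for the first run of adjacent segments that are all
// below `min_size` and whose merged size stays below `max_size`.
inline std::optional<toy_run> find_merge_candidate(
  const std::vector<toy_segment>& manifest, size_t min_size, size_t max_size) {
    toy_run r;
    for (const auto& s : manifest) {
        bool is_small = s.size_bytes < min_size;
        bool fits = r.size_bytes + s.size_bytes <= max_size;
        if (is_small && fits) {
            if (r.num_segments == 0) {
                r.base_offset = s.base_offset;
            }
            r.committed_offset = s.committed_offset;
            r.size_bytes += s.size_bytes;
            ++r.num_segments;
            continue;
        }
        if (r.num_segments > 1) {
            return r; // found a run of small adjacent segments worth reuploading
        }
        // The run was broken by a large segment or by the size cap; restart it.
        r = toy_run{};
        if (is_small) {
            r.base_offset = s.base_offset;
            r.committed_offset = s.committed_offset;
            r.size_bytes = s.size_bytes;
            r.num_segments = 1;
        }
    }
    return r.num_segments > 1 ? std::optional<toy_run>(r) : std::nullopt;
}

In the real code, such a run would then be handed to the segment_collector to build a locked multi-segment upload candidate.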
The test verifies that adjacent segment merging works for data present
on disk.

Previously, upload_candidate was used only to represent data stored
locally. With adjacent segment merging we no longer upload only local
data. In order to support reupload of remote data the struct was
extended with a 'remote_sources' field. The field contains a list of
object ids of segments in the cloud storage. A valid struct can contain
'sources' or 'remote_sources' but not both of them.

The 'adjacent_segments_run' was also updated. Instead of storing
individual fields it now stores 'cloud_storage::segment_meta'. This is
needed to compute the segment_meta for the reuploaded data.

The 'find_reupload_candidate' method of the 'ntp_archiver' can now find
both local and remote upload candidates. The 'upload' method of the
'ntp_archiver' has two separate implementations for local and remote
data. The latter is not implemented yet. The
'local_adjacent_segment_merger' is now 'adjacent_segment_merger' and can
be used to upload both local and remote data (once the 'upload' method
is fully implemented).
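
Illustrative only: the mutually exclusive 'sources'/'remote_sources' invariant described above could be pictured like this (field types are simplified stand-ins for the real ones):

#include <string>
#include <vector>

struct upload_candidate_sketch {
    std::vector<std::string> sources;        // local log segment files
    std::vector<std::string> remote_sources; // object ids of segments already in the cloud

    // Valid only if exactly one of the two source lists is populated.
    bool valid() const { return sources.empty() != remote_sources.empty(); }
};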
Replace three methods with the same implementation:
upload_loop_can_continue, sync_manifest_loop_can_continue, and
housekeeping_can_continue.

The reason behind this is that the implementation is the same and
it could be confusing to have three different methods for all cases
where we are updating the metadata.
into test_archival_service_rpunit binary.

Explicitly deregister housekeeping jobs before the partition stops the
ntp_archiver instance.
Contributor

@andrwng andrwng left a comment

Again nothing truly blocking, as long as we address the follow-ups before the release

Comment on lines +83 to +85
/// When the underlying partition is stopped or the
/// entire shard is stopped the job is 'completed'
/// by calling 'complete' method.
Contributor

nit: needs an update? Or the completed() method is in a follow-up PR

@@ -1014,6 +1016,13 @@ void application::wire_up_redpanda_services(model::node_id node_id) {
cloud_configs.local().bucket_name,
std::ref(cloud_storage_api))
.get();

construct_service(
_archival_upload_housekeeping, std::ref(cloud_storage_api))
Contributor

Should pass in the archival scheduling group?

Contributor Author

yes, will add this in the followup

@piyushredpanda piyushredpanda merged commit e4aef52 into redpanda-data:dev Jan 23, 2023