archival: Add offset_range_size method to the storage::log #13927
Conversation
Looks neat. I understand avoiding the batch cache, but why not go through the readers cache?
Force-pushed from 5a9682c to f5b9af4.
ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/38576#018b15bf-a837-4a2d-bb9e-9dcc405bc341
Force-pushed from 4d025a5 to 6fda80f.
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40235#018b8b12-6e0e-4316-b2d3-ad6cd8e762ba: "rptest.tests.cluster_features_test.FeaturesSingleNodeTest.test_get_features"
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40235#018b8b12-6e11-49da-acf9-e7d2d52e0588: "rptest.tests.cluster_features_test.FeaturesSingleNodeUpgradeTest.test_upgrade"
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40235#018b8b12-6e15-461b-98f6-f3f4add83f13: "rptest.tests.node_pool_migration_test.NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool.balancing_mode=node_add"
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40235#018b8b12-6e18-4b90-8a98-df658a37a2f7: "rptest.tests.node_pool_migration_test.NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool.balancing_mode=off"
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40311#018b905d-db3b-499d-b8c7-f9884842adf2: "rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type=CloudStorageType.S3.test_case=.TS_Read==True.AdjacentSegmentMergerReupload==True"
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40311#018b905d-db42-4df8-8ced-824474b78247: "rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type=CloudStorageType.ABS.test_case=.TS_Read==True.AdjacentSegmentMergerReupload==True"
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40311#018b905d-db45-4711-929c-5ad180e97c75: "rptest.tests.offset_for_leader_epoch_test.OffsetForLeaderEpochTest.test_offset_for_leader_epoch_transfer"
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40311#018b905d-db3e-4d6b-be5b-61bb7c0b881e: "rptest.tests.rpk_registry_test.RpkRegistryTest.test_produce_consume_proto"
/ci-repeat 1
Force-pushed from 6fda80f to fbeb495.
new failures in https://buildkite.com/redpanda/redpanda/builds/42192#018c3450-60bd-4327-b78f-6c1cdfb0b985:
new failures in https://buildkite.com/redpanda/redpanda/builds/43904#018d1e55-2030-45ce-a2b7-a6437ed73083:
new failures in https://buildkite.com/redpanda/redpanda/builds/43904#018d1e55-202a-4620-9510-6a3dd28b53d3:
new failures in https://buildkite.com/redpanda/redpanda/builds/44004#018d2739-57c8-41c3-89a8-b06d7b32d1ab:
new failures in https://buildkite.com/redpanda/redpanda/builds/44004#018d2739-57d3-4fbe-980c-16d74ee0f278:
new failures in https://buildkite.com/redpanda/redpanda/builds/44004#018d2739-57cc-4694-8929-6e0172af439c:
new failures in https://buildkite.com/redpanda/redpanda/builds/44004#018d2739-57cf-4952-afbb-4ad8cb55b15b:
new failures in https://buildkite.com/redpanda/redpanda/builds/44240#018d3d9e-ffb1-41df-957c-5d67c059f032:
new failures in https://buildkite.com/redpanda/redpanda/builds/44535#018d6097-cb18-4eb6-9d0f-24e28ec59d17:
It seems like we're at most doing a subscan of two segments. Is it possible that the upload range covers multiple segments? If so, why don't we need to accumulate the contents of the intermediate segments?
Maybe I'm missing something, but it seems possible that there are no data records in the left or right subscan, but there are entire segments in the middle that have offsets to upload and data timestamps that get skipped.
Curious whether you considered accumulating over the entire upload range rather than just left and right, and found it to be insufficient. Even if it'd be more IO, it feels like it'd be worth it to avoid the complexity and the new file-based dependencies (figuring out file size, etc.) on the storage layer.
src/v/archival/async_data_uploader.h (outdated)
/// Result of the upload size calculation.
/// Contains size of the region in bytes and locks.
struct cloud_upload_parameters {
nit: I think "metadata" would be a more accurate name. To me "parameter" implies that these will affect how the upload happens
It actually affects the created upload. This is a result of the reconciliation process, which is used to find the upload candidate. The actual upload is then created from this structure. The upload is a bytestream plus metadata for the manifest. I'll try to come up with a better name, but "metadata" is a bit loaded. Maybe upload_reconciliation_result?
private:
    model::record_batch_reader _reader;
    iobuf _buffer;
    ssize_t _max_bytes;
nit: why does this need to be signed?
src/v/archival/async_data_uploader.h (outdated)
/// Scan the beginning of the segment range
ss::future<result<subscan_res>> subscan_left(
  std::vector<ss::lw_shared_ptr<storage::segment>> segments,
  model::offset range_base,
  model::timeout_clock::time_point deadline);

/// Scan the end of the segment range
ss::future<result<subscan_res>> subscan_right(
  std::vector<ss::lw_shared_ptr<storage::segment>> segments,
  model::offset range_last,
  model::timeout_clock::time_point deadline);
nit: I feel like I'm missing a mental image of what "left" and "right" mean in this context. Could you elaborate? Maybe it'd help to be more explicit about what range_base and range_last are and how they are intended to be used.
The subscans here are referring to having to scan partial segments, right? If so, I'm surprised that these don't take individual segments as arguments.
We have to upload an offset range that may span multiple segments, and it may also start and stop in the middle of a segment. We have to find an upload candidate based on the start and end offsets of the upload, or based on the start offset and the desired size of the upload. The algorithm is:
- call subscan_left to find the exact beginning of the upload
- call subscan_right to find the end of the upload
- compute the size and metadata of the upload based on the two subscan results and the segments in the middle
The subscan uses the segment index to find the index entry closest to the target and then scans up to the target to calculate the precise file location. It may also scan past the target to find the data timestamp.
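The index-assisted scan described above can be sketched as follows. This is a hedged simplification with hypothetical flat types (offsets and file positions as plain integers, batches in a vector); the real code works with segment_index and a record_batch_reader under Seastar:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sparse index entry: sampled (offset, file position) pairs.
struct index_entry {
    int64_t offset;
    size_t filepos;
};

// Hypothetical batch header: each batch covers an offset range and has a size.
struct batch {
    int64_t base_offset;
    int64_t last_offset;
    size_t size_bytes;
};

// Find the precise file position of the batch containing `target`:
// 1) binary-search the sparse index for the closest entry at or below target,
// 2) short linear scan forward from that position to the exact batch.
size_t subscan_filepos(
  const std::vector<index_entry>& index, // sorted by offset
  const std::vector<batch>& batches,     // sorted by base_offset
  int64_t target) {
    auto it = std::upper_bound(
      index.begin(), index.end(), target,
      [](int64_t t, const index_entry& e) { return t < e.offset; });
    size_t pos = 0;
    int64_t start = 0;
    if (it != index.begin()) {
        --it; // closest index entry not above the target
        pos = it->filepos;
        start = it->offset;
    }
    for (const auto& b : batches) {
        if (b.last_offset < start) {
            continue; // already covered by the index entry
        }
        if (b.base_offset <= target && target <= b.last_offset) {
            return pos; // exact file position of the target batch
        }
        pos += b.size_bytes;
    }
    return pos;
}
```

The sparse index only gets the scan close; the linear pass is what makes the resulting file position exact.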
src/v/archival/async_data_uploader.h (outdated)
ss::future<result<subscan_res>> subscan(
  ss::lw_shared_ptr<cluster::partition> part,
  inclusive_offset_range range,
  size_t initial_offset,
nit: should this be some offset type? Or perhaps name it to some filepos or something
fixed
src/v/archival/async_data_uploader.h (outdated)
/// the timestamp and offset of the 'target'.
/// The timestamp is a data timestamp (if the target is a config batch
/// the subscan will scan until the data batch is found and use first data
I think these are referring to the members of subscan_res? If so, could you add them to the struct definition?
updated the comment
src/v/archival/async_data_uploader.h (outdated)
/// Calculate upload size using segment indexes
ss::future<result<cloud_upload_parameters>> compute_upload_parameters(
  inclusive_offset_range range, model::timeout_clock::time_point deadline);
It'd be nice to avoid this dependency on file index... Perhaps we should bake a size estimate interface into the storage layer or log reader
Agree, but at the moment this is not possible.
    co_return std::move(upl);
}

ss::future<result<void>> segment_upload::initialize(
nit: alternatively make this a static constructor, and make segment_upload's constructor take the params? Then we wouldn't need to worry about checking for inited.
The problem is that this initialization is asynchronous so we have to expose some sort of async method anyway.
    // timestamp.
    result->done = true;
    co_return ss::stop_iteration::yes;
}
It's still possible that the returned timestamp isn't initialized if none of the records are data batches. Is that intentional?
If we're unable to find a timestamp in this range, I wonder if it makes sense to just use the preceding remote segment's max timestamp (not at this level of abstraction, somewhere else).
Yes, it is possible to do this and I was actually going to use it in this case (if we have a preceding segment in the manifest). But if we're scanning the end of the segment we don't have such an option. This could be tuned in a follow-up when this code is used to start uploads.
    }
}

ss::future<result<segment_upload::subscan_res>> segment_upload::subscan(
nit: gate and logger aside, feels like this should be a static storage utility method
Eliminate the scheduling point inside the loop. The scheduling point may cause the iterator to become invalidated, which is UB.
Allow the index entry to be an optional to get rid of the code that initializes the entry to default values.
Force-pushed from bd5800f to 06b2c72.
Return std::optional instead of encoding absence of the value using the fields of the struct.
Test failure is #16308
auto holders = co_await ss::when_all_succeed(
  std::begin(f_locks), std::end(f_locks));
I suspect we'll need to check if the segment is closed while the locks are held, and throw if so.
The close method of the segment tries to acquire the write lock, so it should be impossible to get the read lock for a closed segment. But I added the check just in case.
It does take the write lock, but it drops it once the segment is closed, after which, even if we have the read lock, I don't think we'll be able to read the segment.
std::optional<entry> find_above_size_bytes(size_t distance);
/// Find entry by file offset (the value will undershoot or find precise
/// match)
std::optional<entry> find_below_size_bytes(size_t distance);
Bumping this question
Force-pushed from 5b594e0 to 8070e50.
CI failure is unrelated: #16402
Sorry the reviews have dragged on here. Structurally, and ignoring side conversations about the right interface between storage and archival, I think this is looking pretty good.
While I appreciate that this has already split up the storage/archival changes, it'd really help as a reviewer to break things down even further. One approach to consider is to split PRs into independently testable units. For instance, this PR could probably have been split into:
- new method for the segment_index
- batch cache option for the log reader
- get_file_offset() + maybe batch_size_accumulator
- offset_range_size(start, last)
- offset_range_size(start, size)
It's much harder to introduce bugs when the PRs are small. That said, given there's already quite a lot of review on this PR, I can understand if you would prefer to punt on this.
for (; it < _segs.end(); it++) {
    if (it->get()->is_closed()) {
        co_return std::nullopt;
    }
    if (*it == first_segment) {
        vlog(
          stlog.debug,
Can/should we be iterating over segments? It doesn't look like there are scheduling points below, but it's a little surprising not to use segments.
maybe in a followup
if (num_segments == 1) {
    // There are two cases here.
    // 1. The segment has at least target_size bytes after
    //    base_file_pos,
    // 2. The segment is too small. In this case we need to clamp
    //    the result.
    truncate_after = first_segment_file_pos + target.target_size;
    truncate_after = std::clamp(
      truncate_after,
      first_segment_file_pos,
      it->get()->file_size());
} else {
    // In this case we need to find the truncation point
    // always starting from the beginning of the segment.
    //
    // prev is guaranteed to be smaller than target_size
    // because we reached this branch.
    auto prev = current_size - it->get()->file_size();
    auto delta = target.target_size - prev;
    truncate_after = delta;
}
I'm a little confused about why this has to be two distinct cases. In either case, isn't the idea to remove current_size - target.target_size bytes? And aren't we guaranteed that it's this segment that pushed us over the top?
Would something like this work:
auto cur_seg_pos = it->get()->file_size(); // Above calculation `current_size` has assumed that we've accepted to the end of the file, and now we need to truncate.
auto truncate_by = current_size - target.target_size;
auto truncate_after = cur_seg_pos - truncate_by;
the computation is different in the two cases:
- single segment: current_size includes the full segment size minus the prefix, which is not included in the result. We need to use the size of the truncated prefix to locate the end of the offset range.
- more than one segment: current_size includes more than one segment, so we don't need to use the size of the prefix.
we need to use the size of the truncated prefix to locate the end of the offset range.
Right, but doesn't current_size already account for the missing (or not) prefix?
I suspect these are the same, translating my suggestion, ignoring the clamp:
case of 1 segment:
cur_seg_pos = first_seg_file_size
current_size = first_seg_file_size - first_seg_file_pos
truncate_by = first_seg_file_size - first_seg_file_pos - target.target_size
truncate_after = first_seg_file_size - (first_seg_file_size - first_seg_file_pos - target.target_size)
... = first_seg_file_pos + target.target_size (same as L2147, and sure we can clamp after)
case of more than 1 segment:
cur_seg_pos = iter->file_size()
current_size = size of previous segments + iter->file_size()
truncate_by = size of previous segments + iter->file_size() - target.target_size
truncate_after = iter->file_size() - (size of previous segments + iter->file_size() - target.target_size)
... = target.target_size - size of previous segments (same as L2160)
After writing it all out, I'm more convinced these are the same, though it's somewhat a nit. I do think it makes the code more readable though.
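The algebra in the comment above can be checked mechanically. This is a hedged sketch with hypothetical free functions (the names mirror the discussion, not the actual disk_log_impl code), showing that the two per-case formulas and the reviewer's unified form compute the same truncation point:

```cpp
#include <cstddef>

// Single-segment case, original form: prefix position plus target size.
size_t truncate_single(size_t file_pos, size_t target_size) {
    return file_pos + target_size;
}

// Single-segment case, reviewer's form: current_size counts bytes after
// the prefix, and we cut truncate_by bytes off the end of the file.
size_t truncate_single_alt(size_t file_size, size_t file_pos, size_t target_size) {
    size_t current_size = file_size - file_pos;
    size_t truncate_by = current_size - target_size;
    return file_size - truncate_by;
}

// Multi-segment case, original form: target size minus bytes already
// accumulated in the preceding segments.
size_t truncate_multi(size_t prev_segments_size, size_t target_size) {
    return target_size - prev_segments_size;
}

// Multi-segment case, reviewer's form: same cut, expressed via current_size.
size_t truncate_multi_alt(size_t prev_segments_size, size_t file_size, size_t target_size) {
    size_t current_size = prev_segments_size + file_size;
    size_t truncate_by = current_size - target_size;
    return file_size - truncate_by;
}
```

Expanding either pair symbolically gives the same expression, which supports the reviewer's point that a single formulation (plus the clamp) would cover both branches.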
} else if (current_size > target.min_size) {
    vlog(
      stlog.debug,
      "Setting offset range to {} - {}",
      first,
      it->get()->offsets().committed_offset);
    // We can include full segment to the list of segments
    last_included_offset = it->get()->offsets().committed_offset;
    continue;
Why is this gated on current_size > target.min_size? Doesn't the current_size < target.min_size check below kind of encompass that? Or put another way, if the first segments we iterated through didn't meet the min_size requirement, why is it important to avoid setting this?
In the previous version(s) of this code the variable was used to indicate that we found the offset range. After the return type was changed to optional I started using std::nullopt to return an empty result. I'm inclined to keep it the way it is.
Got it, thanks for the explanation. IMO it makes it more confusing leaving it in, especially because there are so many branches in this code already
Lock the whole segment range in advance. Also, fix the error in compacted test case by handling the situation when the whole offset range is fully compacted and has size 0.
Force-pushed from 8070e50 to 0e95ffc.
Remaining comments are pretty much nits. I know it's been a long process, but thanks for continuing to improve this code!
@@ -153,6 +153,12 @@ class segment_index {
      size_t filepos);
    std::optional<entry> find_nearest(model::offset);
    std::optional<entry> find_nearest(model::timestamp);
    /// Find entry by file offset (the value may overshoot or find precise
    /// match)
    std::optional<entry> find_above_size_bytes(size_t distance);
can this be const?
std::optional<entry> find_above_size_bytes(size_t distance);
/// Find entry by file offset (the value will undershoot or find precise
/// match)
std::optional<entry> find_below_size_bytes(size_t distance);
can this be const?
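A minimal sketch of what these lookups might do, and of the reviewer's const question. The entry layout here is hypothetical (the real segment_index stores more fields); the point is that both methods are pure binary searches over a sorted vector and can indeed be const:

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

struct entry {
    size_t filepos; // cumulative bytes from the start of the segment
};

class sparse_index {
public:
    explicit sparse_index(std::vector<entry> e)
      : _entries(std::move(e)) {}

    // Smallest entry at or above `distance` (may overshoot or match exactly).
    std::optional<entry> find_above_size_bytes(size_t distance) const {
        auto it = std::lower_bound(
          _entries.begin(), _entries.end(), distance,
          [](const entry& e, size_t d) { return e.filepos < d; });
        if (it == _entries.end()) {
            return std::nullopt;
        }
        return *it;
    }

    // Largest entry at or below `distance` (may undershoot or match exactly).
    std::optional<entry> find_below_size_bytes(size_t distance) const {
        auto it = std::upper_bound(
          _entries.begin(), _entries.end(), distance,
          [](size_t d, const entry& e) { return d < e.filepos; });
        if (it == _entries.begin()) {
            return std::nullopt;
        }
        return *std::prev(it);
    }

private:
    std::vector<entry> _entries; // sorted by filepos
};
```

Neither lookup mutates the index, so marking them const costs nothing.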
struct batch_size_accumulator {
    ss::future<ss::stop_iteration> operator()(model::record_batch b) {
        vassert(
          result_size_bytes != nullptr,
it seems to me like you could add a constructor to this struct and not pay this cost on each operator() invocation?
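The suggestion could look roughly like the sketch below. It is a hypothetical simplification (plain size_t batches instead of model::record_batch, no Seastar futures); the idea is just moving the pointer check from per-invocation to construction:

```cpp
#include <cassert>
#include <cstddef>

struct batch_size_accumulator {
    // Validate the output pointer once, at construction time,
    // instead of asserting on every operator() invocation.
    explicit batch_size_accumulator(size_t* out)
      : result_size_bytes(out) {
        assert(result_size_bytes != nullptr);
    }

    // Hypothetical stand-in for the per-batch consume callback.
    void operator()(size_t batch_size) {
        *result_size_bytes += batch_size;
    }

    size_t* result_size_bytes;
};
```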
auto reader = co_await make_reader(reader_cfg);

try {
    co_await std::move(reader).consume(acc, model::no_timeout);
I thought that consume() allowed the consumer to return a value, which would mean you probably don't need to pass in a pointer to size_bytes in order to get size_bytes back out after the consumer is done processing data from the reader?
src/v/storage/disk_log_impl.cc (outdated)
const auto& offsets = it->get()->offsets();
auto r_lock = co_await it->get()->read_lock();
if (offsets.base_offset <= first && offsets.committed_offset >= first) {
generally speaking, anytime we have a reference to something accessed via an iterator held across co_await calls we should see a comment that explains why it is safe. can you add that here?
src/v/storage/disk_log_impl.cc (outdated)
auto file_pos = co_await get_file_offset(
  *it, index_entry, first, false, io_priority);
auto sz = it->get()->file_size() - file_pos;
how is any of this safe w.r.t. iterator invalidations?
@@ -3667,3 +3668,183 @@ FIXTURE_TEST(test_offset_range_size, storage_test_fixture) {
#endif
};

FIXTURE_TEST(test_offset_range_size2, storage_test_fixture) {
#ifdef NDEBUG
there should definitely be a comment explaining why this doesn't work in debug mode.
@@ -3848,3 +3848,175 @@ FIXTURE_TEST(test_offset_range_size2, storage_test_fixture) {
#endif
};

FIXTURE_TEST(test_offset_range_size_compacted, storage_test_fixture) {
#ifdef NDEBUG
there should definitely be a comment explaining why this doesn't work in debug mode.
@@ -4020,3 +4031,262 @@ FIXTURE_TEST(test_offset_range_size_compacted, storage_test_fixture) {
#endif
};

FIXTURE_TEST(test_offset_range_size2_compacted, storage_test_fixture) {
#ifdef NDEBUG
why doesn't this work in debug mode
This PR adds an offset_range_size method to the storage::log. The implementation in storage::disk_log_impl uses segment indexes to find the approximate location of the start of the range and the approximate location of the last batch. Then it performs a short scan to find the precise location. There are two overloads of the method.
The method is supposed to be used to create S3 uploads. Currently, the code reads from disk directly and is therefore coupled with the storage format and a lot of internal storage details. Our goal is to decouple local and remote storage by providing a clear API boundary.
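To illustrate the two overloads described above, here is a hedged sketch over a hypothetical flat batch model (the real overloads live on storage::log, take model::offset and locking/timeout arguments, and consult segment indexes). One overload sizes an explicit inclusive range; the other grows a range from a start offset until it reaches a target size:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical flat model: one entry per batch.
struct batch_info {
    int64_t base_offset;
    int64_t last_offset;
    size_t size_bytes;
};

// Overload 1: exact on-disk size of the inclusive offset range [first, last].
std::optional<size_t> offset_range_size(
  const std::vector<batch_info>& log, int64_t first, int64_t last) {
    size_t total = 0;
    bool found = false;
    for (const auto& b : log) {
        if (b.base_offset >= first && b.last_offset <= last) {
            total += b.size_bytes;
            found = true;
        }
    }
    if (!found) {
        return std::nullopt;
    }
    return total;
}

struct upload_candidate {
    int64_t last_offset;
    size_t size_bytes;
};

// Overload 2: starting at `first`, extend the range batch by batch until
// it covers at least target_size bytes.
std::optional<upload_candidate> offset_range_size(
  const std::vector<batch_info>& log, int64_t first, size_t target_size) {
    size_t total = 0;
    for (const auto& b : log) {
        if (b.base_offset < first) {
            continue;
        }
        total += b.size_bytes;
        if (total >= target_size) {
            return upload_candidate{b.last_offset, total};
        }
    }
    return std::nullopt; // not enough data yet
}
```

The first overload answers "how many bytes is this range", the second "where should an upload of roughly this size end", which matches the two ways the PR description says upload candidates are found.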
Backports Required
Release Notes