
archival: Add offset_range_size method to the storage::log #13927

Merged · 21 commits · Feb 2, 2024

Conversation

@Lazin Lazin (Contributor) commented Oct 4, 2023

This PR adds an offset_range_size method to storage::log. The implementation in storage::disk_log_impl uses segment indexes to find the approximate location of the start of the range and the approximate location of the last batch, then performs a short scan to find the precise locations.

There are two overloads of the method:

  • The first accepts two offsets and returns the size of the range.
  • The second accepts a start offset and a target size. It finds the last offset of the range and its size.

The method is intended to be used to create S3 uploads. Currently, the code reads from disk directly and is therefore coupled with the storage format and many internal storage details. Our goal is to decouple local and remote storage by providing a clear API boundary.
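For illustration, the two overloads might look roughly like this on the storage::log interface (a hedged sketch; the result struct, parameter names, and exact return/timeout types here are illustrative, not the merged signatures):

// Overload 1: size in bytes of the inclusive offset range [first, last].
// Returns std::nullopt if the range is not fully available locally.
ss::future<std::optional<size_t>>
offset_range_size(model::offset first, model::offset last);

// Overload 2: given a start offset and a target size, find the last
// offset of the range and the actual size of [first, last_offset].
struct offset_range_size_result {
    model::offset last_offset;
    size_t size_bytes;
};
ss::future<std::optional<offset_range_size_result>>
offset_range_size(model::offset first, size_t target_size);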

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

  • none

@VladLazar VladLazar (Contributor) left a comment:

Looks neat. I understand avoiding the batch cache, but why not go through the readers cache?

Resolved review threads:
  • src/v/storage/types.h
  • src/v/archival/async_data_uploader.cc (3 outdated threads)
@Lazin Lazin force-pushed the feature/split-upload-path branch 4 times, most recently from 5a9682c to f5b9af4 on October 9, 2023 16:41
@Lazin Lazin self-assigned this Oct 9, 2023
@Lazin Lazin added the area/cloud-storage Shadow indexing subsystem label Oct 9, 2023
@vbotbuildovich (Collaborator) commented:

new failures detected in https://buildkite.com/redpanda/redpanda/builds/40235#018b8b12-6e0e-4316-b2d3-ad6cd8e762ba: "rptest.tests.cluster_features_test.FeaturesSingleNodeTest.test_get_features"
"rptest.tests.read_replica_e2e_test.ReadReplicasUpgradeTest.test_upgrades.cloud_storage_type=CloudStorageType.S3"
"rptest.tests.upgrade_test.UpgradeFromPriorFeatureVersionTest.test_basic_upgrade"
"rptest.tests.upgrade_test.RedpandaInstallerTest.test_install_by_line"
"rptest.tests.partition_movement_test.PartitionMovementTest.test_move_consumer_offsets_intranode.num_to_upgrade=2"
"rptest.tests.upgrade_test.UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback.upgrade_after_rollback=True"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_cancel_ongoing_movements"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_rebalancing_node.shutdown_decommissioned=False"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_working_node.delete_topic=True.tick_interval=3600000"
"rptest.tests.nodes_decommissioning_test.NodeDecommissionFailureReportingTest.test_allocation_failure_reporting"
"rptest.tests.acls_test.AccessControlListTestUpgrade.test_upgrade_sasl"
"rptest.tests.cluster_bootstrap_test.ClusterBootstrapUpgrade.test_change_bootstrap_configs_during_upgrade.empty_seed_starts_cluster=False"
"rptest.tests.cluster_config_test.ClusterConfigLegacyDefaultTest.test_legacy_default_explicit_after_upgrade.wipe_cache=False"
"rptest.tests.cluster_features_test.FeaturesMultiNodeTest.test_explicit_activation"
"rptest.tests.cluster_features_test.FeaturesMultiNodeUpgradeTest.test_upgrade"
"rptest.tests.partition_movement_test.PartitionMovementTest.test_empty.num_to_upgrade=2"

@vbotbuildovich (Collaborator) commented:

new failures detected in https://buildkite.com/redpanda/redpanda/builds/40235#018b8b12-6e11-49da-acf9-e7d2d52e0588: "rptest.tests.cluster_features_test.FeaturesSingleNodeUpgradeTest.test_upgrade"
"rptest.tests.partition_movement_test.PartitionMovementTest.test_bootstrapping_after_move.num_to_upgrade=2"
"rptest.tests.partition_movement_test.PartitionMovementTest.test_overlapping_changes.num_to_upgrade=2"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_node_is_not_allowed_to_join_after_restart.new_bootstrap=False"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_crashed_node"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_rebalancing_node.shutdown_decommissioned=True"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_working_node.delete_topic=True.tick_interval=5000"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_recommissioning_node_finishes"
"rptest.tests.cluster_features_test.FeaturesNodeJoinTest.test_old_node_join"
"rptest.tests.partition_movement_test.PartitionMovementTest.test_static.num_to_upgrade=2"
"rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=False"
"rptest.tests.cluster_bootstrap_test.ClusterBootstrapUpgrade.test_change_bootstrap_configs_during_upgrade.empty_seed_starts_cluster=True"
"rptest.tests.cluster_features_test.FeaturesMultiNodeTest.test_get_features"
"rptest.tests.cluster_config_test.ClusterConfigLegacyDefaultTest.test_legacy_default_explicit_after_upgrade.wipe_cache=True"
"rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade"
"rptest.tests.license_upgrade_test.UpgradeToLicenseChecks.test_basic_upgrade.cloud_storage_type=CloudStorageType.S3"
"rptest.tests.pandaproxy_test.BasicAuthUpgradeTest.test_upgrade_and_enable_basic_auth.base_release=.22.2.next_release=.22.3"
"rptest.tests.offset_retention_test.OffsetRetentionDisabledAfterUpgrade.test_upgrade_from_pre_v23.initial_version=.22.2.9"
"rptest.tests.upgrade_test.UpgradeFromSpecificVersion.test_basic_upgrade"

@vbotbuildovich (Collaborator) commented:

new failures detected in https://buildkite.com/redpanda/redpanda/builds/40235#018b8b12-6e15-461b-98f6-f3f4add83f13: "rptest.tests.node_pool_migration_test.NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool.balancing_mode=node_add"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_working_node.delete_topic=False.tick_interval=3600000"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_finishes_after_manual_cancellation.delete_topic=False"
"rptest.tests.partition_movement_upgrade_test.PartitionMovementUpgradeTest.test_basic_upgrade"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_flipping_decommission_recommission.node_is_alive=False"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_recommissioning_one_of_decommissioned_nodes"
"rptest.tests.partition_movement_test.PartitionMovementTest.test_dynamic.num_to_upgrade=2"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_node_is_not_allowed_to_join_after_restart.new_bootstrap=True"
"rptest.tests.upgrade_test.UpgradeWithWorkloadTest.test_rolling_upgrade"
"rptest.tests.cluster_features_test.FeaturesNodeJoinTest.test_synthetic_old_node_join"
"rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=True"
"rptest.tests.workload_upgrade_runner_test.RedpandaUpgradeTest.test_workloads_through_releases.cloud_storage_type=CloudStorageType.S3"
"rptest.tests.cluster_bootstrap_test.ClusterBootstrapUpgrade.test_change_bootstrap_configs_after_upgrade.empty_seed_starts_cluster=False"
"rptest.tests.cluster_config_test.ClusterConfigLegacyDefaultTest.test_legacy_default.wipe_cache=False"
"rptest.tests.cluster_features_test.FeaturesMultiNodeTest.test_license_upload_and_query"
"rptest.tests.cluster_features_test.FeaturesUpgradeAssertionTest.test_upgrade_assertion"
"rptest.tests.pandaproxy_test.BasicAuthUpgradeTest.test_upgrade_and_enable_basic_auth.base_release=.22.3.next_release=.23.1"
"rptest.tests.partition_movement_test.PartitionMovementTest.test_invalid_destination.num_to_upgrade=2"
"rptest.tests.topic_creation_test.CreateTopicUpgradeTest.test_retention_config_on_upgrade_from_v22_2_to_v22_3.cloud_storage_type=CloudStorageType.S3"
"rptest.tests.transactions_test.GATransaction_v22_1_UpgradeTest.upgrade_coordinator_test"

@vbotbuildovich (Collaborator) commented:

new failures detected in https://buildkite.com/redpanda/redpanda/builds/40235#018b8b12-6e18-4b90-8a98-df658a37a2f7: "rptest.tests.node_pool_migration_test.NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool.balancing_mode=off"
"rptest.tests.cluster_bootstrap_test.ClusterBootstrapUpgrade.test_change_bootstrap_configs_after_upgrade.empty_seed_starts_cluster=True"
"rptest.tests.cluster_features_test.FeaturesNodeJoinTest.test_synthetic_too_new_node_join"
"rptest.tests.cluster_config_test.ClusterConfigLegacyDefaultTest.test_legacy_default.wipe_cache=True"
"rptest.tests.cluster_features_test.FeaturesMultiNodeUpgradeTest.test_rollback"
"rptest.tests.controller_snapshot_test.ControllerSnapshotTest.test_upgrade_compat"
"rptest.tests.controller_snapshot_test.ControllerSnapshotPolicyTest.test_upgrade_auto_enable"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_and_upgrade"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_finishes_after_manual_cancellation.delete_topic=True"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_working_node.delete_topic=False.tick_interval=5000"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_flipping_decommission_recommission.node_is_alive=True"
"rptest.tests.partition_movement_test.PartitionMovementTest.test_deletion_stops_move.num_to_upgrade=2"
"rptest.tests.upgrade_test.UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback.upgrade_after_rollback=False"
"rptest.tests.upgrade_test.UpgradeFromPriorFeatureVersionCloudStorageTest.test_rolling_upgrade.cloud_storage_type=CloudStorageType.S3"
"rptest.tests.partition_movement_test.PartitionMovementTest.test_moving_not_fully_initialized_partition.num_to_upgrade=2"
"rptest.tests.topic_creation_test.CreateTopicUpgradeTest.test_retention_upgrade_with_cluster_remote_write.cloud_storage_type=CloudStorageType.S3"
"rptest.tests.transactions_test.GATransaction_v22_1_UpgradeTest.upgrade_topic_test"

@vbotbuildovich (Collaborator) commented:

new failures detected in https://buildkite.com/redpanda/redpanda/builds/40311#018b905d-db3b-499d-b8c7-f9884842adf2: "rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type=CloudStorageType.S3.test_case=.TS_Read==True.AdjacentSegmentMergerReupload==True"

@vbotbuildovich (Collaborator) commented:

new failures detected in https://buildkite.com/redpanda/redpanda/builds/40311#018b905d-db42-4df8-8ced-824474b78247: "rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type=CloudStorageType.ABS.test_case=.TS_Read==True.AdjacentSegmentMergerReupload==True"

@vbotbuildovich (Collaborator) commented:

new failures detected in https://buildkite.com/redpanda/redpanda/builds/40311#018b905d-db45-4711-929c-5ad180e97c75: "rptest.tests.offset_for_leader_epoch_test.OffsetForLeaderEpochTest.test_offset_for_leader_epoch_transfer"

@vbotbuildovich (Collaborator) commented:

new failures detected in https://buildkite.com/redpanda/redpanda/builds/40311#018b905d-db3e-4d6b-be5b-61bb7c0b881e: "rptest.tests.rpk_registry_test.RpkRegistryTest.test_produce_consume_proto"

@piyushredpanda (Contributor) commented:

/ci-repeat 1

@vbotbuildovich (Collaborator) commented Dec 4, 2023:

new failures in https://buildkite.com/redpanda/redpanda/builds/42192#018c3450-60bd-4327-b78f-6c1cdfb0b985:

"rptest.tests.recovery_mode_test.RecoveryModeTest.test_recovery_mode"

new failures in https://buildkite.com/redpanda/redpanda/builds/43904#018d1e55-2030-45ce-a2b7-a6437ed73083:

"rptest.tests.pandaproxy_test.BasicAuthUpgradeTest.test_upgrade_and_enable_basic_auth.base_release=.22.2.next_release=.22.3"

new failures in https://buildkite.com/redpanda/redpanda/builds/43904#018d1e55-202a-4620-9510-6a3dd28b53d3:

"rptest.tests.cloud_storage_timing_stress_test.CloudStorageTimingStressTest.test_cloud_storage_with_partition_moves.cleanup_policy=delete"

new failures in https://buildkite.com/redpanda/redpanda/builds/44004#018d2739-57c8-41c3-89a8-b06d7b32d1ab:

"rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type=CloudStorageType.ABS.test_case=.TS_Read==True.SpilloverManifestUploaded==True"
"rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type=CloudStorageType.ABS.test_case=.TS_Delete==True.SpilloverManifestUploaded==True.TS_Spillover_ManifestDeleted==True"
"rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type=CloudStorageType.S3.test_case=.TS_Read==True.AdjacentSegmentMergerReupload==True.SpilloverManifestUploaded==True"

new failures in https://buildkite.com/redpanda/redpanda/builds/44004#018d2739-57d3-4fbe-980c-16d74ee0f278:

"rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type=CloudStorageType.ABS.test_case=.TS_Read==True.TS_Timequery==True.SpilloverManifestUploaded==True"
"rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type=CloudStorageType.S3.test_case=.TS_Read==True.TS_TxRangeMaterialized==True.SpilloverManifestUploaded==True"

new failures in https://buildkite.com/redpanda/redpanda/builds/44004#018d2739-57cc-4694-8929-6e0172af439c:

"rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type=CloudStorageType.S3.test_case=.TS_Read==True.TS_Timequery==True.SpilloverManifestUploaded==True"
"rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type=CloudStorageType.ABS.test_case=.TS_Read==True.TS_TxRangeMaterialized==True.SpilloverManifestUploaded==True"

new failures in https://buildkite.com/redpanda/redpanda/builds/44004#018d2739-57cf-4952-afbb-4ad8cb55b15b:

"rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type=CloudStorageType.ABS.test_case=.TS_Read==True.AdjacentSegmentMergerReupload==True.SpilloverManifestUploaded==True"
"rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type=CloudStorageType.S3.test_case=.TS_Delete==True.SpilloverManifestUploaded==True.TS_Spillover_ManifestDeleted==True"
"rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type=CloudStorageType.S3.test_case=.TS_Read==True.SpilloverManifestUploaded==True"

new failures in https://buildkite.com/redpanda/redpanda/builds/44240#018d3d9e-ffb1-41df-957c-5d67c059f032:

"rptest.tests.cloud_storage_timing_stress_test.CloudStorageTimingStressTest.test_cloud_storage_with_partition_moves.cleanup_policy=delete"

new failures in https://buildkite.com/redpanda/redpanda/builds/44535#018d6097-cb18-4eb6-9d0f-24e28ec59d17:

"rptest.tests.data_transforms_test.DataTransformsTest.test_tracked_offsets_cleaned_up"

@andrwng andrwng (Contributor) left a comment:

It seems like we're doing a subscan of at most two segments. Is it possible that the upload range covers multiple segments? If so, why don't we need to accumulate the contents of the intermediate segments?

Maybe I'm missing something, but it seems possible that there are no data records in the left or right subscan, but there are entire segments in the middle that have offsets to upload and data timestamps that get skipped.

Curious whether you considered accumulating over the entire upload range rather than just the left and right, and found it to be insufficient. Even if it'd be more IO, it feels like it'd be worth avoiding the complexity and the new file-based dependencies (figuring out file size, etc.) on the storage layer.

Resolved review thread: src/v/model/timeout_clock.h (outdated)

/// Result of the upload size calculation.
/// Contains size of the region in bytes and locks.
struct cloud_upload_parameters {
Contributor: nit: I think "metadata" would be a more accurate name. To me, "parameter" implies that these will affect how the upload happens.

Contributor (author): It actually affects the created upload. This is the result of the reconciliation process, which is used to find the upload candidate; the actual upload is then created out of this structure. The upload is a bytestream plus metadata for the manifest. I'll try to come up with a better name, but "metadata" is a bit loaded. Maybe upload_reconciliation_result?

private:
model::record_batch_reader _reader;
iobuf _buffer;
ssize_t _max_bytes;
Contributor: nit: why does this need to be signed?

Comment on lines 220 to 308
// Scan the beginning of the segment range
ss::future<result<subscan_res>> subscan_left(
std::vector<ss::lw_shared_ptr<storage::segment>> segments,
model::offset range_base,
model::timeout_clock::time_point deadline);

/// Scan the end of the segment range
ss::future<result<subscan_res>> subscan_right(
std::vector<ss::lw_shared_ptr<storage::segment>> segments,
model::offset range_last,
model::timeout_clock::time_point deadline);
Contributor: nit: I feel like I'm missing a mental image of what "left" and "right" mean in this context. Could you elaborate? Maybe it'd help to be more explicit about what range_base and range_last are and how they are intended to be used.

Contributor: The subscans here are referring to having to scan partial segments, right? If so, I'm surprised that these don't take individual segments as arguments.

Contributor (author): We have to upload an offset range that may span multiple segments, and it may also start and stop in the middle of a segment. We have to find an upload candidate based on the start and end offsets of the upload, or based on the start offset and the desired size of the upload. The algorithm is:

  • call subscan_left to find the exact beginning of the upload;
  • call subscan_right to find the end of the upload;
  • compute the size and metadata of the upload based on the two subscan results and the segments in the middle (see the sketch below).

The subscan uses the segment index to find the index entry closest to the target, then scans up to the target to calculate the precise file location. It may also scan past the target to find the data timestamp.
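Sketched as code, the composition might look roughly like this (illustrative only; the filepos field and the result handling are assumptions based on this thread, not the PR's exact code):

// Find the precise file position of the start of the range.
auto left = co_await subscan_left(segments, range.base, deadline);
// Find the precise file position (and timestamp) of the end of the range.
auto right = co_await subscan_right(segments, range.last, deadline);
// Sum of the full segments strictly between the first and the last.
size_t middle_segments_size = 0;
for (size_t i = 1; i + 1 < segments.size(); ++i) {
    middle_segments_size += segments[i]->file_size();
}
// Total = tail of the first segment + middle segments + head of the last.
size_t upload_size = (segments.front()->file_size() - left.value().filepos)
                     + middle_segments_size
                     + right.value().filepos;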

ss::future<result<subscan_res>> subscan(
ss::lw_shared_ptr<cluster::partition> part,
inclusive_offset_range range,
size_t initial_offset,
Contributor: nit: should this be some offset type? Or perhaps name it filepos or something?

Contributor (author): fixed

Comment on lines 204 to 206
/// the timestamp and offset of the 'target'.
/// The timestamp is a data timestamp (if the target is a config batch
/// the subscan will scan until the data batch is found and use first data
Contributor: I think these are referring to the members of subscan_res? If so, could you add them to the struct definition?

Contributor (author): updated the comment

Comment on lines 216 to 223
/// Calculate upload size using segment indexes
ss::future<result<cloud_upload_parameters>> compute_upload_parameters(
inclusive_offset_range range, model::timeout_clock::time_point deadline);
Contributor: It'd be nice to avoid this dependency on the file index... Perhaps we should bake a size-estimate interface into the storage layer or log reader.

Contributor (author): Agree, but at the moment this is not possible.

co_return std::move(upl);
}

ss::future<result<void>> segment_upload::initialize(
Contributor: nit: alternatively, make this a static constructor and have segment_upload's constructor take the params? Then we wouldn't need to worry about checking for inited.

Contributor (author): The problem is that this initialization is asynchronous, so we have to expose some sort of async method anyway.
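For context, a sketch of the async static-constructor shape being discussed, wrapping the async initialization (the make name and exact types are assumptions, not the PR's code; initialize is quoted above):

class segment_upload {
public:
    // Async factory: run initialize() before handing the object out, so
    // callers can never observe an uninitialized instance.
    static ss::future<result<std::unique_ptr<segment_upload>>> make(
      ss::lw_shared_ptr<cluster::partition> part,
      inclusive_offset_range range,
      model::timeout_clock::time_point deadline) {
        // Private constructor, so make_unique can't be used here.
        std::unique_ptr<segment_upload> upl(
          new segment_upload(std::move(part)));
        auto res = co_await upl->initialize(range, deadline);
        if (res.has_error()) {
            co_return res.error();
        }
        co_return std::move(upl);
    }
private:
    explicit segment_upload(ss::lw_shared_ptr<cluster::partition> part);
    ss::future<result<void>> initialize(
      inclusive_offset_range range,
      model::timeout_clock::time_point deadline);
};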

// timestamp.
result->done = true;
co_return ss::stop_iteration::yes;
}
Contributor: It's still possible that the returned timestamp isn't initialized if none of the records are data batches. Is that intentional?

If we're unable to find a timestamp in this range, I wonder if it makes sense to just use the preceding remote segment's max timestamp (not at this level of abstraction, somewhere else).

Contributor (author): Yes, it is possible to do this, and I was actually going to use it in this case (if we have a preceding segment in the manifest). But if we're scanning the end of the segment we don't have that option. This could be tuned in a follow-up when this code is used to start uploads.

}
}

ss::future<result<segment_upload::subscan_res>> segment_upload::subscan(
Contributor: nit: gate and logger aside, feels like this should be a static storage utility method.

Eliminate the scheduling point inside the loop. The scheduling point may
cause the iterator to become invalidated, which is UB.
Allow the index entry to be an optional, to get rid of the code that
initializes the entry to default values.
@Lazin Lazin (Contributor, Author) commented Jan 29, 2024:

The test failures are tracked in #16308

Resolved review threads: src/v/storage/disk_log_impl.cc (2 outdated threads)

auto holders = co_await ss::when_all_succeed(
std::begin(f_locks), std::end(f_locks));

Contributor: I suspect we'll need to check if the segment is closed while the locks are held, and throw if so.

Contributor (author): The close method of the segment tries to acquire the write lock, so it should be impossible to get the read lock for a closed segment. But I added the check just in case.
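The added check might look something like this (a sketch; the exact error handling is illustrative):

// Take a strong reference so the segment stays alive across the co_await.
auto seg = *it;
auto r_lock = co_await seg->read_lock();
if (seg->is_closed()) {
    // close() drops the write lock after closing; the underlying file
    // may no longer be readable, so bail out.
    throw std::runtime_error(
      "segment closed while computing offset range size");
}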

Contributor: It does take the write lock, but it drops it once the segment is closed, after which, even if we have the read lock, I don't think we'll be able to read the segment.

Review threads:
  • src/v/storage/disk_log_impl.h (2 threads, resolved)
  • src/v/storage/disk_log_impl.cc (3 outdated threads, resolved)
std::optional<entry> find_above_size_bytes(size_t distance);
/// Find entry by file offset (the value will undershoot or find precise
/// match)
std::optional<entry> find_below_size_bytes(size_t distance);
Contributor: Bumping this question.

Resolved review thread: src/v/storage/disk_log_impl.cc (outdated)
@Lazin Lazin requested a review from andrwng January 31, 2024 14:25
@Lazin Lazin force-pushed the feature/split-upload-path branch 2 times, most recently from 5b594e0 to 8070e50 on January 31, 2024 16:19
@Lazin Lazin (Contributor, Author) commented Jan 31, 2024:

The CI failure is unrelated: #16402

@Lazin Lazin changed the title from "archival: Add alternative upload mechanism" to "archival: Add offset_range_size method to the storage::log" on Feb 1, 2024
@andrwng andrwng (Contributor) left a comment:

Sorry the reviews here have dragged on. Structurally, and ignoring side conversations about the right interface between storage and archival, I think this is looking pretty good.

While I appreciate that this PR already splits up the storage/archival changes, it'd really help reviewers to break things down even further. One approach to consider is to split PRs into independently testable units. For instance, this PR could probably have been split into:

  • new method for the segment_index
  • batch cache option for the log reader
  • get_file_offset() + maybe batch_size_accumulator
  • offset_range_size(start, last)
  • offset_range_size(start, size)

It's much harder to introduce bugs when the PRs are small. That said, given there's already quite a lot of review on this PR, I can understand if you would prefer to punt on this.

Review threads: src/v/storage/disk_log_impl.cc (3 threads, 2 outdated; resolved)
for (; it < _segs.end(); it++) {
if (it->get()->is_closed()) {
co_return std::nullopt;
}
if (*it == first_segment) {
vlog(
stlog.debug,
Contributor: Can/should we be iterating over segments? It doesn't look like there are scheduling points below, but it's a little surprising to not use segments.

Contributor (author): maybe in a follow-up

Comment on lines +2147 to +2161
if (num_segments == 1) {
// There are two cases here.
// 1. The segment has at least target_size bytes after
// base_file_pos,
// 2. The segment is too small. In this case we need to clamp
// the result.
truncate_after = first_segment_file_pos + target.target_size;
truncate_after = std::clamp(
truncate_after,
first_segment_file_pos,
it->get()->file_size());
} else {
// In this case we need to find the truncation point
// always starting from the beginning of the segment.
//
// prev is guaranteed to be smaller than target_size
// because we reached this branch.
auto prev = current_size - it->get()->file_size();
auto delta = target.target_size - prev;
truncate_after = delta;
}
Contributor: I'm a little confused about why this has to be two distinct cases. In either case, isn't the idea to remove current_size - target.target_size bytes? And aren't we guaranteed that it's this segment that pushed us over the top?

Would something like this work:

auto cur_seg_pos = it->get()->file_size(); // Above calculation `current_size` has assumed that we've accepted to the end of the file, and now we need to truncate.
auto truncate_by = current_size - target.target_size;
auto truncate_after = cur_seg_pos - truncate_by;

Contributor (author): The computation is different in the two cases:

  • single segment: current_size includes the full segment size minus the prefix that is not included in the result, so we need to use the size of the truncated prefix to locate the end of the offset range;
  • more than one segment: current_size includes more than one segment, so we don't need to use the size of the prefix.

@andrwng (Contributor) commented Feb 1, 2024:

> we need to use the size of the truncated prefix to locate the end of the offset range.

Right, but doesn't current_size already account for the missing (or not) prefix?

I suspect these are the same, translating my suggestion, ignoring the clamp:

case of 1 segment:

cur_seg_pos = first_seg_file_size
current_size = first_seg_file_size - first_seg_file_pos
truncate_by = first_seg_file_size - first_seg_file_pos - target.target_size
truncate_after = first_seg_file_size - (first_seg_file_size - first_seg_file_pos - target.target_size)
... = first_seg_file_pos + target.target_size (same as L2147, and sure we can clamp after)

case of more than 1 segment:

cur_seg_pos = iter->file_size()
current_size = size of previous segments + iter->file_size()
truncate_by = size of previous segments + iter->file_size() - target.target_size
truncate_after = iter->file_size() - (size of previous segments + iter->file_size() - target.target_size)
... = target.target_size - size of previous segments (same as L2160)

After writing it all out, I'm more convinced these are the same, though it's somewhat a nit. I do think it makes the code more readable though.

Comment on lines +2202 to +2204
} else if (current_size > target.min_size) {
vlog(
stlog.debug,
"Setting offset range to {} - {}",
first,
it->get()->offsets().committed_offset);
// We can include full segment to the list of segments
last_included_offset = it->get()->offsets().committed_offset;
continue;
Contributor: Why is this gated on current_size > target.min_size? Doesn't the current_size < target.min_size check below kind of encompass that? Or, put another way, if the first segments we iterated through didn't meet the min_size requirement, why is it important to avoid setting this?

Contributor (author): In previous versions of this code the variable was used to indicate that we had found the offset range. After the return type was changed to optional, I started using std::nullopt to return an empty result. I'm inclined to keep it the way it is.

Contributor: Got it, thanks for the explanation. IMO leaving it in makes the code more confusing, especially because there are so many branches in this code already.

Lock the whole segment range in advance.
Also, fix the error in the compacted test case by handling the situation
when the whole offset range is fully compacted and has size 0.
@andrwng andrwng (Contributor) left a comment:

Remaining comments are pretty much nits. I know it's been a long process, but thanks for continuing to improve this code!

@Lazin Lazin merged commit e3acdec into redpanda-data:dev Feb 2, 2024
17 checks passed
@@ -153,6 +153,12 @@ class segment_index {
size_t filepos);
std::optional<entry> find_nearest(model::offset);
std::optional<entry> find_nearest(model::timestamp);
/// Find entry by file offset (the value may overshoot or find precise
/// match)
std::optional<entry> find_above_size_bytes(size_t distance);
Member: can this be const?

std::optional<entry> find_above_size_bytes(size_t distance);
/// Find entry by file offset (the value will undershoot or find precise
/// match)
std::optional<entry> find_below_size_bytes(size_t distance);
Member: can this be const?

struct batch_size_accumulator {
ss::future<ss::stop_iteration> operator()(model::record_batch b) {
vassert(
result_size_bytes != nullptr,
Member: It seems to me like you could add a constructor to this struct and not pay this cost on each operator() invocation?
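A sketch of that suggestion, with the invariant checked once at construction (illustrative; the accumulation body and message text are assumed):

struct batch_size_accumulator {
    explicit batch_size_accumulator(size_t* out)
      : result_size_bytes(out) {
        // Validate once here instead of on every operator() invocation.
        vassert(result_size_bytes != nullptr, "output pointer can't be null");
    }
    ss::future<ss::stop_iteration> operator()(model::record_batch b) {
        *result_size_bytes += b.size_bytes();
        co_return ss::stop_iteration::no;
    }
    size_t* result_size_bytes;
};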

auto reader = co_await make_reader(reader_cfg);

try {
co_await std::move(reader).consume(acc, model::no_timeout);
Member: I thought that consume() allowed the consumer to return a value, which would mean you probably don't need to pass in a pointer to size_bytes in order to get size_bytes back out after the consumer is done processing data from the reader?
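Roughly what the reviewer is pointing at: if consume() returns whatever the consumer's end_of_stream() returns, the size can come back as the return value instead of through a pointer (a sketch under that assumption):

struct size_consumer {
    size_t size_bytes{0};
    ss::future<ss::stop_iteration> operator()(model::record_batch b) {
        size_bytes += b.size_bytes();
        co_return ss::stop_iteration::no;
    }
    // consume() forwards this value to the caller when the reader is done.
    size_t end_of_stream() const { return size_bytes; }
};

auto total_bytes = co_await std::move(reader).consume(
  size_consumer{}, model::no_timeout);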

Resolved review thread: src/v/storage/disk_log_impl.cc
Comment on lines 2025 to 2027
const auto& offsets = it->get()->offsets();
auto r_lock = co_await it->get()->read_lock();
if (offsets.base_offset <= first && offsets.committed_offset >= first) {
Member: Generally speaking, anytime we have a reference to something accessed via an iterator held across co_await calls, we should see a comment that explains why it is safe. Can you add that here?

Comment on lines 2042 to 2044
auto file_pos = co_await get_file_offset(
*it, index_entry, first, false, io_priority);
auto sz = it->get()->file_size() - file_pos;
Member: How is any of this safe w.r.t. iterator invalidations?
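One common way to make this safe is to take a strong reference before the first suspension point and document the invariant (a sketch, not the PR's final code):

// Safety: copy the lw_shared_ptr out of the iterator before suspending.
// The shared pointer keeps the segment alive even if _segs is mutated
// while we are parked at a co_await; `it` must not be touched again.
auto seg = *it;
auto file_pos = co_await get_file_offset(
  seg, index_entry, first, false, io_priority);
auto sz = seg->file_size() - file_pos;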

@@ -3667,3 +3668,183 @@ FIXTURE_TEST(test_offset_range_size, storage_test_fixture) {

#endif
};

FIXTURE_TEST(test_offset_range_size2, storage_test_fixture) {
#ifdef NDEBUG
Member: There should definitely be a comment explaining why this doesn't work in debug mode.

@@ -3848,3 +3848,175 @@ FIXTURE_TEST(test_offset_range_size2, storage_test_fixture) {

#endif
};

FIXTURE_TEST(test_offset_range_size_compacted, storage_test_fixture) {
#ifdef NDEBUG
Member: There should definitely be a comment explaining why this doesn't work in debug mode.

@@ -4020,3 +4031,262 @@ FIXTURE_TEST(test_offset_range_size_compacted, storage_test_fixture) {

#endif
};

FIXTURE_TEST(test_offset_range_size2_compacted, storage_test_fixture) {
#ifdef NDEBUG
Member: Why doesn't this work in debug mode?

Labels: area/cloud-storage (Shadow indexing subsystem), area/redpanda
Projects: None yet
Development: successfully merging this pull request may close these issues — none yet
7 participants