cst/inv: locate latest report #16962

abhijat · 2024-03-08T01:31:34Z

Adds a utility to scan through all the dates available for a given inventory config and find the latest report which has been generated and is ready for download.

Returns the path to manifest describing the report found as well as report paths and the datetime.

The caller can then use the metadata to download the actual report.

Backports Required

Release Notes

none

An object-exists method is extracted from the segment-exists method, and the segment existence check is made to reuse the new method. The new method will be used in the next couple of changes.

vbotbuildovich · 2024-03-08T04:05:15Z

new failures in https://buildkite.com/redpanda/redpanda/builds/45839#018e1bf6-0767-458b-a486-baed655953d9:

"rptest.tests.consumer_group_balancing_test.ConsumerGroupBalancingTest.test_coordinator_nodes_balance"

new failures in https://buildkite.com/redpanda/redpanda/builds/46107#018e37ac-1422-4a69-a397-aa0b1722f6c1:

"rptest.tests.data_transforms_test.DataTransformsLeadershipChangingTest.test_leadership_changing_randomly"

list-object and object-exists methods are moved up to the cloud_storage_api interface, to ease testing in the next set of changes.

abhijat · 2024-03-08T14:46:28Z

Note for review: the core of the changes is in https://github.com/redpanda-data/redpanda/blob/fa416080b77b3a1add63bb75bd374b503aa6c1e1/src/v/cloud_storage/inventory/aws_ops.cc where a new method is added latest_report_metadata

The rest of the changes are unit tests and the supporting types.

vbotbuildovich · 2024-03-11T11:26:25Z

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45963#018e2cfa-e6e4-431e-8fd9-9ba09546af85

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46107#018e37bc-e1d4-4451-ad73-15bd1bd8063c

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46107#018e37bc-e1cb-4138-84ea-d7c1333ac201

abhijat · 2024-03-11T13:23:44Z

CI failure (gtest raft timeout) #14425

Lazin

LGTM overall
I think that the PR can be improved in two ways:

error handling
some extra logging in aws_ops.cc

Lazin · 2024-03-12T08:25:46Z

src/v/cloud_storage/types.cc

@@ -547,4 +547,14 @@ void transfer_details::on_backoff(remote_probe& probe) {
    run_callback(backoff_cb, probe);
 }

+std::ostream& operator<<(std::ostream& os, existence_check_type head) {
+    switch (head) {
+        using enum cloud_storage::existence_check_type;


Lazin · 2024-03-12T09:19:33Z

src/v/cloud_storage/inventory/aws_ops.cc

+    // inventory is ready, so we probe the checksum files from latest to
+    // earliest. We stop at the first checksum we find.
+    for (const auto& path_with_datetime : checksum_paths) {
+        if (const auto result = co_await remote.object_exists(


Why do we have to probe checksum objects?

There is a pattern that all three cloud vendors follow that the checksum file for a report is written to bucket last, so the checksum file being present guarantees that the actual report and its manifest are ready for download.

Lazin · 2024-03-12T09:33:55Z

src/v/cloud_storage/inventory/aws_ops.cc

@@ -111,4 +146,120 @@ ss::future<bool> aws_ops::inventory_configuration_exists(
    co_return dl_res == download_result::success;
 }

+ss::future<result<report_metadata, error_outcome>>
+aws_ops::latest_report_metadata(
+  cloud_storage::cloud_storage_api& remote, retry_chain_node& parent_rtc) {


could this method be noexcept?
it returns a result which can contain an error but it can also throw an exception which makes error handling more complicated because you have to check the result and catch potential exceptions

good point, I will try it out

added for this and the other method

src/v/cloud_storage/inventory/aws_ops.cc

Lazin · 2024-03-12T10:22:11Z

Maybe I also missed something or it's planned for later PR's but it'd be great to add a metric just to see if the feature is working and how much work is done using the inventory and with probing.

andrwng · 2024-03-12T22:59:05Z

src/v/cloud_storage/inventory/aws_ops.cc

+  cloud_storage::cloud_storage_api& remote, retry_chain_node& parent_rtc) {
+    // The prefix is generated according to
+    // https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory-location.html
+    const auto prefix = cloud_storage_clients::object_key{


nit: maybe call this full_prefix or something to differentiate it from _prefix?

andrwng · 2024-03-12T23:02:10Z

src/v/cloud_storage/inventory/aws_ops.cc

+        return checksum_path(prefix().native(), datetime);
+    };
+
+    // The prefix may contain non-datetime folders such as hive data, which we


Wow, cool that S3's inventory natively supports Apache Hive as a first-class citizen

andrwng · 2024-03-13T00:29:06Z

src/v/cloud_storage/inventory/aws_ops.cc

+    const auto doc = parse_json_response(std::move(json_response));
+    if (doc.HasParseError()) {


nit: here may not be the best place, but it would be great for debugging if somewhere in the error path, we logged the bad json (maybe truncated, in case it's huge)

andrwng · 2024-03-13T00:57:27Z

src/v/cloud_storage/inventory/types.h

+using report_datetime = named_type<ss::sstring, struct report_datetime_t>;
+
+// Stores a path to one or more reports. While in most cases one report run will
+// generate just one CSV file, in some cases (e.g. Google with more than 1


Does this refer to the manifest.json example here https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory-location.html ? It's curious that the manifest's files field is an array if it only expects to generate one file. (not really anything to do about it, just seems a bit odd)

I think this is a forward compatibility thing, they might be prepping for sharding the results into different files like google does. Our implementation is expected to always work with a list of files.

andrwng · 2024-03-13T01:11:43Z

src/v/cloud_storage/inventory/inv_ops.cc

 inv_ops::maybe_create_inventory_configuration(
  cloud_storage_api& remote, retry_chain_node& parent) {
    if (const auto exists = co_await inventory_configuration_exists(
          remote, parent);
-        exists) {
+        exists.has_value() && exists.value()) {
        co_return inventory_creation_result::already_exists;
    }


If the result contains an error, should we propagate that error up instead of creating a new one?

In this specific line there is no error as the config exists, so we return a success value. In general though, in earlier PRs we propagated the http client errors up to the caller, I initially thought of somehow mapping http client errors to values in the new error_outcome enum that we are using for inventory module, but I decided against it as the caller does not care about the http error type in this case, just that something has failed.

We need logging of the underlying error here which is missing right now, once logs are added here (WIP) I think this should be sufficient for troubleshooting.

However if you think returning the http client errors is clearer then I can give mapping the error outcomes another try.

I don't think progragating the exact http error is necessarily important (especially if you're adding some logging), but at least it seems like inventory_configuration_exists() can return failed_to_check_inv_status which seems like it should be bubbled up? As is it looks like we proceed to creation. I suppose it's likely that whatever error was seen in checking existence will also be seen below, but still seems like surprising behavior

I see what you mean now. Yeah that needs to be fixed, will change this now. Thanks for pointing out.

changed the structure and added error logging

andrwng · 2024-03-13T01:19:00Z

src/v/cloud_storage/inventory/aws_ops.cc

+              path_with_datetime.path,
+              parent_rtc,
+              existence_check_type::object);
+            result == download_result::success) {


It's a bit surprising to proceed in the event this result is an error. Wouldn't that mean we'd proceed with a knowingly stale inventory file?

My assumption was that if there is an error, it would be not_found because the report is still being prepared.

Additionally the caller (WIP PR) is going to compare the returned metadata's datetime with whatever data the caller has last consumed, so it will only every consume newer reports than whatever it last saw, but there is still a problem here.

If we are encountering a non-404 error even after retries built into remote, then we will probably encounter the same errors for all reports and should probably bail out at the first non 404 error instead of progressively trying older reports.

I see, thanks for explaining. Would you mind documenting that assumption in a comment, and/or make sure that if the assumption isn't true that we'll still do something reasonable? For instance, it doesn't necessarily seem impossible for some calls to get unlucky and time out, while subsequent calls succeed.

I've made the code path more explicit in the latest push, now if we fail with 404, we try the next latest report. In all other failure cases we abort and the caller can retry later.

A new method is added which performs the following actions: * scan the bucket for inventory reports * find the latest report which is ready for download * download the manifest for report and extract paths to report files Finally metadata for the report is returned which contains the path to manifest, paths to actual reports and the date-time as an ISO-8601 formatted string. This metadata can be used by the caller to determine if it wants the report (i.e. the date-time is newer than last consumed report by caller) and to actually download the report files.

abhijat · 2024-03-13T16:21:35Z

/ci-repeat

cloud_storage: add object-exists method

41815d6

An object-exists method is extracted from the segment-exists method, and the segment existence check is made to reuse the new method. The new method will be used in the next couple of changes.

github-actions bot added the area/redpanda label Mar 8, 2024

mergify bot mentioned this pull request Mar 8, 2024

cst/inv: Download latest report #16896

Closed

6 tasks

abhijat added 2 commits March 8, 2024 13:31

cloud_storage: move remote methods to interface

8b84f3a

list-object and object-exists methods are moved up to the cloud_storage_api interface, to ease testing in the next set of changes.

cst/inv: Extract common definitions for tests

0981c24

abhijat force-pushed the feat/inv-scrub/locate-reports branch 5 times, most recently from 35d45a4 to fa41608 Compare March 8, 2024 14:43

abhijat force-pushed the feat/inv-scrub/locate-reports branch from fa41608 to 2a94838 Compare March 11, 2024 09:14

abhijat requested review from andrwng, Lazin and andijcr March 11, 2024 13:29

dotnwat self-requested a review March 11, 2024 15:54

Lazin previously approved these changes Mar 12, 2024

View reviewed changes

andrwng reviewed Mar 13, 2024

View reviewed changes

abhijat dismissed Lazin’s stale review via 092c8f2 March 13, 2024 06:24

abhijat force-pushed the feat/inv-scrub/locate-reports branch 5 times, most recently from 6de8bdf to bc602ac Compare March 13, 2024 08:53

abhijat added 2 commits March 13, 2024 14:28

cst/inv: wire report metadata fetcher to inv ops

38cb577

cst/inv: Add tests for manifest parsing

1da4b1c

abhijat force-pushed the feat/inv-scrub/locate-reports branch from bc602ac to 1da4b1c Compare March 13, 2024 08:58

cst/inv: Add error logging

3d4c24b

Lazin approved these changes Mar 13, 2024

View reviewed changes

abhijat merged commit fd49299 into redpanda-data:dev Mar 14, 2024
16 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cst/inv: locate latest report #16962

cst/inv: locate latest report #16962

abhijat commented Mar 8, 2024

vbotbuildovich commented Mar 8, 2024 •

edited

abhijat commented Mar 8, 2024

vbotbuildovich commented Mar 11, 2024 •

edited

abhijat commented Mar 11, 2024

Lazin left a comment

Lazin Mar 12, 2024

Lazin Mar 12, 2024

abhijat Mar 12, 2024

Lazin Mar 12, 2024

abhijat Mar 12, 2024

abhijat Mar 13, 2024

Lazin commented Mar 12, 2024

andrwng Mar 12, 2024

abhijat Mar 13, 2024

andrwng Mar 12, 2024

andrwng Mar 13, 2024

abhijat Mar 13, 2024

andrwng Mar 13, 2024

abhijat Mar 13, 2024

andrwng Mar 13, 2024

abhijat Mar 13, 2024

andrwng Mar 13, 2024

abhijat Mar 13, 2024

abhijat Mar 13, 2024

andrwng Mar 13, 2024

abhijat Mar 13, 2024

andrwng Mar 13, 2024

abhijat Mar 13, 2024

abhijat commented Mar 13, 2024

		const auto doc = parse_json_response(std::move(json_response));
		if (doc.HasParseError()) {

cst/inv: locate latest report #16962

cst/inv: locate latest report #16962

Conversation

abhijat commented Mar 8, 2024

Backports Required

Release Notes

vbotbuildovich commented Mar 8, 2024 • edited

abhijat commented Mar 8, 2024

vbotbuildovich commented Mar 11, 2024 • edited

abhijat commented Mar 11, 2024

Lazin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Lazin commented Mar 12, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

abhijat commented Mar 13, 2024

vbotbuildovich commented Mar 8, 2024 •

edited

vbotbuildovich commented Mar 11, 2024 •

edited