cloud_storage/inventory: Add inventory create-config API for AWS #16611
Conversation
Force-pushed from 45aa4ba to a671f1b
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45040#018da85d-63df-4910-bef6-61b91aa6c062
Force-pushed from 889110c to 9c0957e
The create-inventory action for AWS is added. The inventory API is divided into a low-level, vendor-specific API and a high-level, vendor-agnostic API. The high-level API will be introduced in the next few commits.

The low-level API hides vendor-specific details for common inventory-related operations such as creating an inventory schedule, finding the latest inventory report, and deleting old inventory reports. These actions differ in their details across vendors. The high-level API will use a variant to call the correct vendor API depending on where Redpanda is deployed. This specific pull request focuses on the create-inventory operation for AWS, which schedules an inventory report to be generated at a given frequency (daily here) and in a given format (CSV here).

Additionally, a new interface called cloud_storage_api is created, which remote now inherits from. This interface is used in the inventory APIs mainly to assist in testing: GoogleMock can create mocks of this interface, avoiding the need to spin up an imposter to test the inventory APIs. While remote now inherits from the new interface and one of its methods is now virtual, this should not add any overhead for code paths which do not use the new interface.
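To illustrate the testing benefit described above, here is a hedged sketch. The method name, signature, and helper types below are invented for illustration (the real cloud_storage_api differs, and in practice GoogleMock would generate the fake shown here by hand):

```cpp
#include <string>
#include <string_view>

// Illustrative only: a minimal interface in the spirit of cloud_storage_api.
// Method name and signature are invented; the real interface differs.
struct cloud_storage_api {
    virtual ~cloud_storage_api() = default;
    virtual bool upload_object(std::string_view key, std::string_view body) = 0;
};

// The production implementation (remote) inherits from the interface...
struct remote_like : cloud_storage_api {
    bool upload_object(std::string_view, std::string_view) override {
        return true; // the real code performs the HTTP call with retries
    }
};

// ...while tests substitute a hand-rolled fake (or a GoogleMock mock),
// so no imposter HTTP server is needed.
struct fake_api : cloud_storage_api {
    std::string last_key;
    bool upload_object(std::string_view key, std::string_view) override {
        last_key = std::string{key};
        return true;
    }
};

// Code under test depends only on the interface, not on remote.
bool create_inventory_config(cloud_storage_api& api) {
    return api.upload_object("inventory/config", "<schedule>daily</schedule>");
}
```

Since the virtual call only happens on inventory paths, code that uses remote directly pays no extra cost.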
Force-pushed from 9c0957e to e72bf29
The high-level API which will eventually be used by the scrubber service is added here. It stores the vendor-specific API as a variant field and simply visits the variant for any operation. The low-level API now inherits from a base_ops interface. The purpose is not runtime polymorphism but rather enforcing constraints on the variant stored in the high-level API, so that any type added to the variant must provide the methods of base_ops.
For inventory config creation, a specific upload object type is created and used when uploading to make logs clearer.
Force-pushed from e72bf29 to 46a4b96
```cpp
namespace cloud_storage::inventory {

using ops_t = inv_ops_variant<aws_ops>;
```
nit: might be a bit premature, but it feels like it'd be simpler to actually use this as a base class and use polymorphism. Or perhaps added to the cloud_storage_clients/client.h interface? That would feel a bit more natural, since it's a clear division between the different platforms that can be tested without cross dependencies
> it'd be simpler to actually use this as a base class and use polymorphism
I opted to use a sum type here, as the final set of implementations is a closed set of three. Do you see issues with complexity here that could be simplified?
> Or perhaps added to the cloud_storage_clients/client.h interface? That would feel a bit more natural, since it's a clear division between the different platforms that can be tested without cross dependencies
Do you mean that the implementation should be moved down to cloud_storage_clients? I briefly mentioned this in the cover letter, but to elaborate: there are many differences between the AWS and GCP implementations of inventory operations in terms of responses, request structure, and even figuring out what the latest report is, which would require creating a GCP-specific client.
We have not needed a separate GCP client so far because the cloud_storage_clients module works on low-level HTTP calls and can therefore keep GCP and S3 under the same interface. The inventory-related operations are higher level, so it makes sense to me to implement them here rather than as more primitive HTTP calls. WDYT?
> issues with complexity here that could be simplified?
I guess it'll depend on how the rest of the implementation goes. My comment is more of a knee-jerk reaction: maybe there will be less platform-specific bits in there than I'm imagining, but without knowing more about how each platform will be configured, stronger separation feels natural since that's what we do between s3/abs.
That said, I can see how it'd be much simpler to use remote as the primitive instead of baking this deeper into something like the client, and I don't think variant vs polymorphism will change much given we're relying on such a high-level primitive.
Good point about AWS vs GCP, though even that seems like a leaky abstraction that will eventually be ironed out and separated. I don't feel strongly that this should happen now though.
These are good points, I agree that we should implement things lower down in remote/cloud_storage_clients where possible, but I think these operations belong in a higher level API.
Right now we have the layers of implementation as:
- cloud_storage_clients (s3/abs): lowest layer; methods usually map directly to HTTP verbs. S3 and GCP are assumed to be the same.
- cloud_storage::remote: methods usually map to Redpanda cloud-storage-object actions, e.g. segment operations and manifest operations, with some generic calls. Provides retries, the client pool etc.
- inventory (as seen/planned here): methods map to inventory-specific operations, will often contain sequences of multiple calls into remote, and will need to manage the state of those calls, e.g.
  - AWS download report: check if a checksum file is present, then download a CSV file from a computed path; this might first require downloading and parsing a manifest. https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory-location.html
  - GCP download report: get the UUID for the Redpanda inventory config (maybe caching it in memory), check if that UUID's checksum is present, then download a JSON manifest, parse the actual inventory file paths out of it, and download those. There may be multiple report files, as GCP divides reports into shards, so the output streams from the multiple shards must be handled. https://cloud.google.com/storage/docs/insights/using-inventory-reports#download-reports
  - Azure: a recursive listing might be required to find the latest checksum and inventory, as reports are kept in a path which contains date and time components. https://learn.microsoft.com/en-us/azure/storage/blobs/blob-inventory#inventory-output

The inventory API methods will provide an interface that hides these platform-specific details, allowing the caller to just call e.g. download_latest_report and get a stream to use, without seeing the successive calls required to do these actions.
It is certainly possible to put these into remote, but I think that it is easier to understand by keeping the inventory related methods secluded as these are (IMO) higher level operations.
> maybe there will be less platform-specific bits in there than I'm imagining, but without knowing more about how each platform will be configured, stronger separation feels natural since that's what we do between s3/abs.
There are quite a few platform/vendor-specific differences in terms of API calls. The plan is to let aws_ops, gcs_ops and azure_ops (the last two to be implemented later) implement the platform-specific details, with inv_ops just using visit to proxy the calls to the underlying variant.
So the call chain looks like:

```
// AWS
inv_ops (dl_report) -> aws_ops (dl_report) ->
    call 1 -> remote -> s3_client
    call 2 -> remote -> s3_client
    ...

// GCP
inv_ops (dl_report) -> gcs_ops (dl_report) ->
    call 1 -> remote -> s3_client
    call 2 -> remote -> s3_client
    call 3 -> remote -> s3_client
    call 4 -> remote -> s3_client
    ...

// Azure
...
```
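For concreteness, here is a toy sketch of one such multi-call sequence (the AWS report download): every path, helper, and return type below is hypothetical, and the real code would go through remote and return streams rather than strings:

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for remote: answers existence checks and downloads.
struct remote_stub {
    bool object_exists(const std::string&) { return true; }
    std::optional<std::string> download(const std::string& path) {
        return "bytes-from-" + path; // real code returns an input stream
    }
};

// Sketch of aws_ops' download step: a sequence of dependent calls whose
// state (checksum present? manifest parsed?) the inventory layer manages.
std::optional<std::string>
aws_download_latest_report(remote_stub& r, const std::string& prefix) {
    // 1) a checksum marker signals that the report is complete
    if (!r.object_exists(prefix + "/manifest.checksum")) {
        return std::nullopt;
    }
    // 2) download the manifest listing the report data files
    auto manifest = r.download(prefix + "/manifest.json");
    if (!manifest) {
        return std::nullopt;
    }
    // 3) download the report from a path parsed out of the manifest
    //    (parsing elided; a fixed path is used for illustration)
    return r.download(prefix + "/data/report.csv");
}
```

The GCP variant would add the UUID lookup and shard stitching on top of the same shape, which is exactly the state that does not fit naturally into a single-call remote method.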
> it'd be simpler to actually use this as a base class and use polymorphism

> I opted to use sum type here as the final set of implementations is a closed set of three implementations. Do you see issues with complexity here that could be simplified?
I'm not sure I understand the reasoning that a fixed set of implementations is motivation for using std::variant over polymorphism or pimpl. Isn't that always the case for a given code base? And if we add more cloud providers, that set changes. Can you elaborate on that?
I'd argue against std::variant unless the static typing is important, otherwise we end up generating extra code and also have the template programming complexity to deal with.
My reasoning is that when writing a utility such as a library, which may be freely extended by classes not known at the time the base class is implemented, runtime polymorphism (interface-based design) is a better fit. Here, we know upfront that there are only going to be three implementations, so a sum type makes more sense to me. In other words, saying that this object can only ever be one of vendor_api_a, vendor_api_b or vendor_api_c seems a better fit to me than saying that this object is some vendor_api which fulfills the expected contract.
The contract is still enforced by the vendor_ops_provider concept.
> otherwise we end up generating extra code
Do you mean extra code generated by using the variant, or because of the templates? The template definitions are only for the contracts and are fairly isolated.
I'm open to suggestions though. As I understand it, sum types have better ergonomics in some languages than others, and std::variant is a recent addition; if there are any major concerns with using std::variant here, the design is not tied to it and we can make changes. I still think the variant+visit approach looks easier to understand and more concise than PIMPL.
LGTM
The inventory API is introduced for AWS. The primary object introduced here is cloud_storage::inventory::inv_ops, which is not wired into any service yet. Eventually it will be used by the scrubber-inventory download service (yet to be created) to schedule and download reports.

This PR contains the create_inventory_configuration method. The method added here schedules reports to be created at the given frequency and in the given format by creating a report configuration; it calls PutBucketInventoryConfiguration.

The API is divided into high-level (vendor-agnostic router) and low-level (vendor-specific) objects. The low-level APIs, of which only AWS is implemented here, will hide the vendor-specific actions required for tasks such as creating an inventory schedule, finding the latest inventory report, and deleting old inventory reports.
The high-level API holds the low-level API in a variant and visits the low-level methods depending on which vendor Redpanda is deployed on. The variant will be extended with Azure and GCP as these are implemented.
Why not add methods directly to cloud_storage::remote instead of adding a new abstraction layer? remote can already call the right vendor HTTP verbs using specific clients, so the create-inventory call could potentially be added there without much overhead. However, remote generally deals with single, retryable HTTP calls (except for a couple of bulk-operation cases), and it is built around segment/manifest operations; only a couple of general-purpose methods are exposed there.
While the current inventory config creation maps to a single PUT call, the methods to be implemented in future PRs, such as finding the latest report and checking whether the latest report is ready, will span multiple HTTP calls (for example downloading a JSON manifest, parsing paths out of it, and downloading those paths) and generally require logic which is specific to the inventory API. It is easier to add a high-level API to manage these operations than to push them down into the S3 or ABS clients.

Also, the GCS and AWS APIs are compatible for regular cloud storage operations, so the same S3 client caters to both, but the two vendors differ in their inventory-management calls. If we tried to push the details down into the S3 client, it would need to be aware of the backend and call different implementations for GCS and AWS.
Backports Required
Release Notes