cloud_storage/inventory: Add inventory create-config API for AWS #16611
Conversation
Force-pushed from 45aa4ba to a671f1b
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45040#018da85d-63df-4910-bef6-61b91aa6c062
Force-pushed from 889110c to 9c0957e
The create-inventory action for AWS is added. The inventory API is divided into a low-level, vendor-specific API and a high-level, vendor-agnostic API. The high-level API will be introduced in the next few commits.

The low-level API hides vendor-specific details for common inventory-related operations such as creating an inventory schedule, finding the latest inventory report, and deleting old inventory reports. These actions differ in their details across vendors. The high-level API will use a variant to call the correct vendor API depending on where Redpanda is deployed. This specific pull request focuses on the create-inventory operation for AWS, which schedules an inventory report to be generated at a given frequency (daily here) and in a given format (CSV here).

Additionally, a new interface called cloud_storage_api is created, which remote now inherits from. This interface is used in the inventory APIs mainly to assist in testing: GoogleMock can create mocks of this interface, avoiding the need to spin up an imposter to test the inventory APIs. While remote now inherits from the new interface and one of its methods is now virtual, this should not add any overhead for code paths which do not use the new interface.
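To illustrate the testing benefit described above, here is a hedged sketch. The method name, signature, and helper types below are invented for illustration (the real cloud_storage_api differs, and in practice GoogleMock would generate the fake shown here by hand):

```cpp
#include <string>
#include <string_view>

// Illustrative only: a minimal interface in the spirit of cloud_storage_api.
// Method name and signature are invented; the real interface differs.
struct cloud_storage_api {
    virtual ~cloud_storage_api() = default;
    virtual bool upload_object(std::string_view key, std::string_view body) = 0;
};

// The production implementation (remote) inherits from the interface...
struct remote_like : cloud_storage_api {
    bool upload_object(std::string_view, std::string_view) override {
        return true; // the real code performs the HTTP call with retries
    }
};

// ...while tests substitute a hand-rolled fake (or a GoogleMock mock),
// so no imposter HTTP server is needed.
struct fake_api : cloud_storage_api {
    std::string last_key;
    bool upload_object(std::string_view key, std::string_view) override {
        last_key = std::string{key};
        return true;
    }
};

// Code under test depends only on the interface, not on remote.
bool create_inventory_config(cloud_storage_api& api) {
    return api.upload_object("inventory/config", "<schedule>daily</schedule>");
}
```

Since the virtual call only happens on inventory paths, code that uses remote directly pays no extra cost.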
Force-pushed from 9c0957e to e72bf29
The high-level API which will eventually be used by the scrubber service is added here. It stores the vendor-specific API as a variant field and simply visits the variant for any operation. The low-level API now inherits from a base_ops interface. The purpose is not runtime polymorphism but rather enforcing constraints on the variant stored in the high-level API, so that any type added to the variant must provide the methods of base_ops.
For inventory config creation, a specific upload object type is created and used when uploading to make logs clearer.
Force-pushed from e72bf29 to 46a4b96
```cpp
namespace cloud_storage::inventory {

using ops_t = inv_ops_variant<aws_ops>;
```
nit: might be a bit premature, but it feels like it'd be simpler to actually use this as a base class and use polymorphism. Or perhaps added to the cloud_storage_clients/client.h interface? That would feel a bit more natural, since it's a clear division between the different platforms that can be tested without cross dependencies
> it'd be simpler to actually use this as a base class and use polymorphism
I opted to use a sum type here, as the final set of implementations is a closed set of three. Do you see issues with complexity here that could be simplified?
> Or perhaps added to the cloud_storage_clients/client.h interface? That would feel a bit more natural, since it's a clear division between the different platforms that can be tested without cross dependencies
Do you mean that the implementation should be moved down to cloud_storage_clients? I briefly mentioned this in the cover letter, but to elaborate: there are many differences between the AWS and GCP implementations of inventory operations in terms of responses, request structure, and even figuring out what the latest report is, which would require creating a GCP-specific client.
We have not needed a separate GCP client so far because the cloud_storage_clients module works on low-level HTTP calls and can therefore keep GCP and S3 under the same interface. The inventory-related operations are higher level, so it makes sense to me to implement them here rather than as more primitive HTTP calls. WDYT?
> issues with complexity here that could be simplified?
I guess it'll depend on how the rest of the implementation goes. My comment is more of a knee-jerk reaction: maybe there will be less platform-specific bits in there than I'm imagining, but without knowing more about how each platform will be configured, stronger separation feels natural since that's what we do between s3/abs.
That said, I can see how it'd be much simpler to use remote as the primitive instead of baking this deeper into something like the client, and I don't think variant vs polymorphism will change much given we're relying on such a high-level primitive.
Good point about AWS vs GCP, though even that seems like a leaky abstraction that will eventually be ironed out and separated. I don't feel strongly that this should happen now though.
These are good points, I agree that we should implement things lower down in remote/cloud_storage_clients where possible, but I think these operations belong in a higher level API.
Right now we have the layers of implementation as:
- cloud_storage_clients (s3/abs): lowest layer; methods usually map directly to HTTP verbs. S3 and GCP are assumed to be the same.
- cloud_storage::remote: methods usually map to Redpanda cloud-storage-object actions, e.g. segment operations and manifest operations, with some generic calls. Provides retries, the client pool etc.
- inventory (as seen/planned here): methods map to inventory-specific operations, will often contain sequences of multiple calls into remote, and will need to manage the state of those calls, e.g.
  - AWS download report: check if a checksum file is present, then download a CSV file from a computed path; this might first require downloading and parsing a manifest. https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory-location.html
  - GCP download report: get the UUID for the Redpanda inventory config (maybe caching it in memory), check if that UUID's checksum is present, then download a JSON manifest, parse the actual inventory file paths out of it, and download those. There may be multiple report files, as GCP divides reports into shards, so the output streams from the multiple shards must be handled. https://cloud.google.com/storage/docs/insights/using-inventory-reports#download-reports
  - Azure: a recursive listing might be required to find the latest checksum and inventory, as reports are kept in a path which contains date and time components. https://learn.microsoft.com/en-us/azure/storage/blobs/blob-inventory#inventory-output

The inventory API methods will provide an interface that hides these platform-specific details, allowing the caller to just call e.g. download_latest_report and get a stream to use, without seeing the successive calls required to do these actions.
It is certainly possible to put these into remote, but I think that it is easier to understand by keeping the inventory related methods secluded as these are (IMO) higher level operations.
> maybe there will be less platform-specific bits in there than I'm imagining, but without knowing more about how each platform will be configured, stronger separation feels natural since that's what we do between s3/abs.
There are quite a few platform/vendor-specific differences in terms of API calls. The plan is to let aws_ops, gcs_ops and azure_ops (the last two to be implemented later) implement the platform-specific details, with inv_ops just using visit to proxy the calls to the underlying variant.
So the call chain looks like:

```
// AWS
inv_ops (dl_report) -> aws_ops (dl_report) ->
    call 1 -> remote -> s3_client
    call 2 -> remote -> s3_client
    ...

// GCP
inv_ops (dl_report) -> gcs_ops (dl_report) ->
    call 1 -> remote -> s3_client
    call 2 -> remote -> s3_client
    call 3 -> remote -> s3_client
    call 4 -> remote -> s3_client
    ...

// Azure
...
```
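For concreteness, here is a toy sketch of one such multi-call sequence (the AWS report download): every path, helper, and return type below is hypothetical, and the real code would go through remote and return streams rather than strings:

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for remote: answers existence checks and downloads.
struct remote_stub {
    bool object_exists(const std::string&) { return true; }
    std::optional<std::string> download(const std::string& path) {
        return "bytes-from-" + path; // real code returns an input stream
    }
};

// Sketch of aws_ops' download step: a sequence of dependent calls whose
// state (checksum present? manifest parsed?) the inventory layer manages.
std::optional<std::string>
aws_download_latest_report(remote_stub& r, const std::string& prefix) {
    // 1) a checksum marker signals that the report is complete
    if (!r.object_exists(prefix + "/manifest.checksum")) {
        return std::nullopt;
    }
    // 2) download the manifest listing the report data files
    auto manifest = r.download(prefix + "/manifest.json");
    if (!manifest) {
        return std::nullopt;
    }
    // 3) download the report from a path parsed out of the manifest
    //    (parsing elided; a fixed path is used for illustration)
    return r.download(prefix + "/data/report.csv");
}
```

The GCP variant would add the UUID lookup and shard stitching on top of the same shape, which is exactly the state that does not fit naturally into a single-call remote method.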
> it'd be simpler to actually use this as a base class and use polymorphism

> I opted to use sum type here as the final set of implementations is a closed set of three implementations. Do you see issues with complexity here that could be simplified?
I'm not sure I understand the reasoning that a fixed set of implementations is motivation for using std::variant over polymorphism or pimpl. Isn't that always the case for a given code base? And if we add more cloud providers, that set changes. Can you elaborate on that?
I'd argue against std::variant unless the static typing is important, otherwise we end up generating extra code and also have the template programming complexity to deal with.
My reasoning is that when writing a utility such as a library, which may be freely extended by classes not known at the time the base class is implemented, runtime polymorphism (interface-based design) is a better fit. Here, we know upfront that there are only going to be three implementations, so a sum type makes more sense to me. In other words, saying that this object can only ever be one of vendor_api_a, vendor_api_b or vendor_api_c seems a better fit to me than saying that this object is some vendor_api which fulfills the expected contract.
The contract is still enforced by the vendor_ops_provider concept.
> otherwise we end up generating extra code
Do you mean extra code generated by using the variant, or because of the templates? The template definitions are only for the contracts and are fairly isolated.
I'm open to suggestions though. As I understand it, sum types have better ergonomics in some languages than others, and std::variant is a recent addition; if there are any major concerns with using std::variant here, the design is not tied to it and we can make changes. I still think the variant+visit approach looks easier to understand and more concise than PIMPL.
LGTM
The inventory API is introduced for AWS. The primary object introduced here is cloud_storage::inventory::inv_ops, which is not wired into any service yet. Eventually it will be used by the scrubber-inventory download service (yet to be created) to schedule and download reports.

This PR contains the create_inventory_configuration method. The method added here schedules reports to be created at the given frequency and in the given format by creating a report configuration; it calls PutBucketInventoryConfiguration.

The API is divided into high-level (vendor-agnostic router) and low-level (vendor-specific) objects. The low-level APIs, of which only AWS is implemented here, will hide the vendor-specific actions required for tasks such as creating an inventory schedule, finding the latest inventory report, and deleting old inventory reports.
The high-level API holds the low-level API in a variant and visits the low-level methods depending on which vendor Redpanda is deployed on. The variant will be extended with Azure and GCP as these are implemented.
Why not add methods directly to cloud_storage::remote instead of adding a new abstraction layer? remote can already call the right vendor HTTP verbs using specific clients, so the create-inventory call could potentially be added there without much overhead. However, remote generally deals with single, retryable HTTP calls (except for a couple of bulk-operation cases), and it is built around segment/manifest operations; only a couple of general-purpose methods are exposed there.
While the current inventory config creation maps to a single PUT call, the methods to be implemented in future PRs, such as finding the latest report and checking whether the latest report is ready, will span multiple HTTP calls (for example downloading a JSON manifest, parsing paths out of it, and downloading those paths) and generally require logic which is specific to the inventory API. It is easier to add a high-level API to manage these operations than to push them down into the S3 or ABS clients.

Also, the GCS and AWS APIs are compatible for regular cloud storage operations, so the same S3 client caters to both, but the two vendors differ in their inventory-management calls. If we tried to push the details down into the S3 client, it would need to be aware of the backend and call different implementations for GCS and AWS.
Backports Required
Release Notes