Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud_storage/inventory: Add inventory create-config API for AWS #16611

Merged
merged 3 commits into from
Feb 19, 2024

Conversation

abhijat
Copy link
Contributor

@abhijat abhijat commented Feb 14, 2024

The inventory API is introduced for AWS. The primary object introduced here is cloud_storage::inventory::inv_ops which is not wired into any service yet. Eventually it will be used by the scrubber-inventory download service (yet to be created) to schedule and download reports.

This PR contains the create_inventory_configuration method. The method added here schedules reports to be created for the given frequency and format by creating a report configuration. It calls PutBucketInventoryConfiguration

The API is divided into high level (vendor agnostic router) and low level (vendor specific) objects. The low level APIs, of which only AWS is implemented here, will hide the vendor specific actions required to perform tasks such as:

  • creating an inventory config (implemented in this PR)
  • ascertaining if the latest report is ready
  • downloading the latest report
  • cleaning up old reports

The high level API contains a variant to the low level API, and visits methods in the low level API depending on which vendor redpanda is deployed for. The variant will be extended to Azure and GCP as these are implemented.

Why not add methods directly to cloud_storage::remote instead of adding a new abstraction layer: remote can already call the right vendor HTTP verbs using specific clients, so potentially the create-inventory call could be added there without much overhead.

However, remote generally deals with single, retryable HTTP calls (except for a couple of bulk operation cases), and it is built around segment/manifest operations. Only a couple of general purpose methods are exposed there.

While the current inventory config creation maps to a single PUT call, the next methods which will be implemented in future PRs such as finding the latest report, checking if latest report is ready etc. will span multiple HTTP calls, for example downloading a JSON manifest, parsing paths in it and downloading those paths, and generally require logic which is inventory API specific.

It is easier to add a high level API to manage these operations than to push these down into the s3 client or the abs client.

Also, the GCS and AWS APIs are exactly similar for cloud storage operations so the same s3 client caters to both, but these two vendors differ in the inventory management calls. If we try to push down the details into the S3 client, it will need to be aware of the backend and call different implementations for GCS and AWS.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

  • none

@vbotbuildovich
Copy link
Collaborator

@abhijat abhijat force-pushed the feat/inv-scrub/create-inv-config branch 6 times, most recently from 889110c to 9c0957e Compare February 15, 2024 04:17
The create-inventory action for AWS is added. The inventory API is
divided into a low level, vendor specific API and a high level vendor
agnostic API. The high level API will be introduced in the next few
commits.

The low level API hides vendor specific details for common inventory
related operations such as creating an inventory schedule, finding the
latest inventory report, deleting old inventory reports etc. These
actions differ in details across vendors.

The high level API will use a variant to call the correct vendor API
depending on where redpanda is deployed.

This specific pull request focusses on the create-inventory operation
for AWS. This operation schedules an inventory report to be generated on
a given frequency (daily here) and produces reports of a given format
(csv here).

Additionally, a new interface called cloud_storage_api is created which
remote now inherits from. This interface is used in the inventory APIs
mainly to assist in testing. Googlemock can create mocks of this
interface, thus avoiding the need to spin up an imposter to test the
inventory APIs.

While remote now inherits from the new interface and one of its methods
is now virtual, this should not add any overhead for code paths which do
not use the new interface.
@abhijat abhijat force-pushed the feat/inv-scrub/create-inv-config branch from 9c0957e to e72bf29 Compare February 15, 2024 05:36
@abhijat abhijat changed the title cloud_storage_inventory: AWS create inventory config (DNM) cloud_storage/inventory: Add inventory create-config API for AWS Feb 15, 2024
The high level API which will eventually be used by the scrubber service
is added here. It stores the vendor specific API as a variant field, and
simply visits the variant for any operations.

The low level API now inherits from a base_ops interface. The purpose is
not runtime polymorphism but rather enforcing constraints on the variant
stored in the high level API, such that any type added to the variant
must provide methods from base_ops.
For inventory config creation, a specific upload object type is created
and used when uploading to make logs clearer.

namespace cloud_storage::inventory {

using ops_t = inv_ops_variant<aws_ops>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: might be a bit premature, but it feels like it'd be simpler to actually use this as a base class and use polymorphism. Or perhaps added to the cloud_storage_clients/client.h interface? That would feel a bit more natural, since it's a clear division between the different platforms that can be tested without cross dependencies

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it'd be simpler to actually use this as a base class and use polymorphism

I opted to use sum type here as the final set of implementations is a closed set of three implementations. Do you see issues with complexity here that could be simplified?

Or perhaps added to the cloud_storage_clients/client.h interface? That would feel a bit more natural, since it's a clear division between the different platforms that can be tested without cross dependencies

Do you mean that the implementation should be moved down to cloud_storage_clients? I briefly mentioned this in the cover letter but to elaborate, there are many differences between AWS and GCP implementations for inventory operations in terms of responses, request structure and even figuring out what the latest report is, which will require creating a GCP specific client.

We do not need a separate GCP client so far because the cloud_storage_clients module works on low level HTTP calls and is able to keep GCP and S3 under the same interface because of that.

The inventory related operations are higher level so it makes sense to me to implement them here rather than as more primitive HTTP calls. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issues with complexity here that could be simplified?

I guess it'll depend on how the rest of the implementation goes. My comment is more of a knee-jerk reaction: maybe there will be less platform-specific bits in there than I'm imagining, but without knowing more about how each platform will be configured, stronger separation feels natural since that's what we do between s3/abs.

That said, I can see how it'd be much simpler to use remote as the primitive instead of baking this deeper into something like the client, and I don't think variant vs polymorphism will change much given we're relying on such a high level primitive.

Good point about AWS vs GCP, though even that seems like a leaky abstraction that will eventually be ironed out and separated. I don't feel strongly that this should happen now though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are good points, I agree that we should implement things lower down in remote/cloud_storage_clients where possible, but I think these operations belong in a higher level API.

Right now we have the layers of implementation as:

  • cloud_storage_clients (s3/abs) : lowest layer - usually methods map directly to HTTP verbs. S3 and GCP are assumed to be the same
  • cloud_storage::remote: usually methods map to redpanda cloud storage object related actions, with some generic calls, eg segment operations, manifest operations. Provides retry, client pool etc.
  • inventory (as seen/planned here): methods map to inventory specific operations, will often contain sequences of multiple calls into remote and will need to manage the state of calls. eg

The inventory API methods will provide an interface to hide these platform specific details and allow the caller to just call eg download_latest_report and get a stream to use, and hide the successive calls required to do these actions.

It is certainly possible to put these into remote, but I think that it is easier to understand by keeping the inventory related methods secluded as these are (IMO) higher level operations.

maybe there will be less platform-specific bits in there than I'm imagining, but without knowing more about how each platform will be configured, stronger separation feels natural since that's what we do between s3/abs.

There is quite a bit of platform/vendor specific differences in terms of API calls, the plan is to let aws_ops, gcs_ops and azure_ops (the last two to be implemented later) implement the platform specific details, and inv_ops just uses visit to proxy the calls to the underlying variant.

So the call chain looks like:

// AWS
inv_ops (dl_report) -> aws_ops (dl_report) ->
  call 1 -> remote -> s3_client
  call 2 -> remote -> s3_client
  ...
//GCP
inv_ops (dl_report) -> gcs_ops (dl_report) ->
  call 1 -> remote -> s3_client
  call 2 -> remote -> s3_client
  call 3 -> remote -> s3_client
  call 4 -> remote -> s3_client
...
// Azure
...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it'd be simpler to actually use this as a base class and use polymorphism

I opted to use sum type here as the final set of implementations is a closed set of three implementations. Do you see issues with complexity here that could be simplified?

@abhijat

I'm not sure I understand the reasoning that there being a fixed set of implementations is motivation for using std::variant over polymorphism or pimpl. Isn't that always the case for a given code base, and if we add more cloud providers, that set changes. Can you elaborate on that?

I'd argue against std::variant unless the static typing is important, otherwise we end up generating extra code and also have the template programming complexity to deal with.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning is that when writing a utility such as a library which may be freely extended by classes which are not known at the time of the implementation of the base class, runtime polymorphism (interface based design) is a better fit.

Here we know upfront that there are only going to be three implementations, a sum type makes more sense to me.

In other words, saying that this object can only ever be one of vendor_api_a, vendor_api_b or vendor_api_c seems a better fit to me than saying that this object is a vendor_api which fulfills the expected contract.

The contract is still enforced by the vendor_ops_provider concept.

otherwise we end up generating extra code

Do you mean extra code generation by using the variant, or because of the templates? The template definition is only for the contracts and is fairly isolated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to suggestions though, as I understand sum types have better ergonomics in some languages than others and std::variant is a recent addition, if there are any major concerns with using std::variant here, the design is not tied to it and we can make changes.

I still think the variant+visit based approach looks easier to understand and more concise than PIMPL.

Copy link
Contributor

@Lazin Lazin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@abhijat abhijat merged commit 62bd1a6 into redpanda-data:dev Feb 19, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants