
chore(rfcs): Add AWS S3 source RFC #4197

Merged: jamtur01 merged 5 commits into master from rfc-aws-s3-source on Oct 14, 2020
Conversation

@jszwedko
Collaborator

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>

Closes #4155

@binarylogic
Contributor

@cmmarslender, @zcapper, @szibis, @mikhno-s, @gburd since you all expressed interest in this feature feel free to comment/review. We'd love feedback on the proposed approach so we can get it right on the first implementation. Thanks!

@jszwedko mentioned this pull request Sep 29, 2020
@gburd commented Sep 30, 2020

LGTM


It will not cover parsing of events within the objects. Vector's current
approach for this is to delegate to transforms. For example, it is likely we'll
want to add a transform to extract CloudTrail events stored as objects in S3.
Contributor

Would we close this in favor of a transform? If so, we should create a transform issue to cover that future work.

Contributor

It might be a feature that still needs to exist.
I am not sure customers would want to configure their AWS services to incur additional costs in order to work around a limitation in their log router. Not that an S3 bucket of CloudTrail logs would be particularly expensive, but it's the principle of the matter. (That said, my org is already routing CloudTrail to S3.)

Collaborator Author

For CloudTrail specifically, I think the only options for ingestion by Vector are shipping to S3 or shipping to CloudWatch Logs. Are you aware of any other options? LookupEvents is limited to 2 req/s, which seems like it would make it infeasible for scraping.

I think the naive aws_cloudwatch_logs source (#3077) is still relevant apart from CloudTrail.

We may want a transform specifically to handle CloudTrail events in S3, but I'm not 100% sure. Just using the json_parser transform seems like it might be sufficient. Again, this relates back to the idea of Vector config macros that would allow users to more easily configure ingesting CloudTrail events from S3 without needing to string together a source with relevant transforms.
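
To make the "source plus transform" idea concrete, here is a minimal sketch of what such a pipeline could look like. The aws_s3 source name, the strategy option, and the sqs.queue_name option are assumptions drawn from this RFC, not a final design; json_parser is the existing Vector transform mentioned above.

```toml
# Hypothetical sketch: the aws_s3 source name and its options are assumptions
# from this RFC, not a shipped configuration.
[sources.cloudtrail_s3]
  type = "aws_s3"
  strategy = "sqs"                      # assumed: consume S3 bucket notifications via SQS
  sqs.queue_name = "cloudtrail-events"  # assumed option name

# The existing json_parser transform handles the parsing that is out of
# scope for the source itself.
[transforms.parse_cloudtrail]
  type = "json_parser"
  inputs = ["cloudtrail_s3"]
  drop_invalid = true
```

A dedicated CloudTrail transform (or a config macro) could later replace the json_parser step if plain JSON parsing turns out to be insufficient.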

Comment thread rfcs/2020-09-29-4155-aws-s3-source.md
Comment thread rfcs/2020-09-29-4155-aws-s3-source.md Outdated
Comment on lines +56 to +57
bucket = "my-bucket" # required
region = "us-east-1" # required, required when endpoint = ""
Contributor

I'm curious whether this information is encoded into the SQS messages. If so, we wouldn't need to require these options, right? Finally, could you add an SQS message example to the RFC?

Collaborator Author

Hmm, that's a good point; it is encoded in the bucket notification message (https://docs.aws.amazon.com/AmazonS3/latest/dev/notification-content-structure.html). This will require us to initialize the S3 client after message consumption, but maybe that's OK. I'll update the RFC and add an example notification.

Notably, the other "prior art" implementations using the SQS approach do require the bucket and region in the configuration.

Collaborator Author

Added an example message and updated the config to not include the bucket and region.
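
For illustration, a rough sketch of what the trimmed-down configuration might look like once the bucket and region come from the notification itself (each S3 event notification record carries awsRegion, s3.bucket.name, and s3.object.key). The option names below are assumptions, not the final design:

```toml
# Hypothetical sketch: option names are assumptions, not the final design.
[sources.s3_logs]
  type = "aws_s3"
  strategy = "sqs"                      # assumed: poll an SQS queue for bucket notifications
  sqs.queue_name = "s3-object-created"  # assumed option name
  # No bucket/region options: each notification already identifies the
  # bucket, region, and key of the object that triggered it.
```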

Comment on lines +81 to +147
All [custom S3 object metadata
key/values](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html#object-metadata)
will be set as fields on the log event.
Contributor

Is this prior art? I'm curious which other tools do this.

Collaborator Author

Fluentd does this: https://github.com/fluent/fluent-plugin-s3 (see add_object_metadata). We could make it conditional, but I don't see a downside to just doing it by default.
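
If it ever does need to be conditional, the knob could be as small as a single boolean, analogous to Fluentd's add_object_metadata. The option name below is hypothetical:

```toml
[sources.s3_logs]
  type = "aws_s3"                  # assumed source name
  include_object_metadata = true   # hypothetical toggle, analogous to Fluentd's add_object_metadata
```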

Comment thread rfcs/2020-09-29-4155-aws-s3-source.md
Comment thread rfcs/2020-09-29-4155-aws-s3-source.md Outdated

## Future work

* Perhaps allow deletion of S3 objects after they've been processed
Contributor

I'd also add reclassification of the storage class to this list, e.g. moving an object to Glacier.

Collaborator Author

Expanded future work comment to include this note.

Comment thread rfcs/2020-09-29-4155-aws-s3-source.md
It will not cover parsing of events within the objects. Vector's current
approach for this is to delegate to transforms. For example, it is likely we'll
want to add a transform to extract CloudTrail events stored as objects in S3.

Contributor

How about adding metrics to the scope?

One incredibly useful feature of this would be to see:

* The rate of log files being processed.
* The rate of new messages (logs) arriving in SQS.

That way I can quickly tell whether the Vector ingest is keeping up with the load or whether I need to introduce more replicas.

Collaborator Author

I feel like CloudWatch Metrics for SQS are the best place to get queue metrics and would let you know if the consumers are keeping up, but I'll add some notes on internal events (from which internal metrics are derived) to this RFC.

Collaborator Author

Added notes on internal events.

Member

@lukesteensen left a comment


This looks great! Definitely agree with the SQS strategy as our first implementation, since it seems the most robust. Alternative implementations can be driven by demand. I also very much agree with keeping the scope relatively limited for the initial version and pushing more complex questions out to future work.

Comment thread rfcs/2020-09-29-4155-aws-s3-source.md
Contributor

@jamtur01 left a comment


Looks great to me as an initial implementation.

Comment thread rfcs/2020-09-29-4155-aws-s3-source.md
@jamtur01 merged commit 1b54b7b into master Oct 14, 2020
@jamtur01 deleted the rfc-aws-s3-source branch October 14, 2020 16:13
mengesb pushed a commit to jacobbraaten/vector that referenced this pull request Dec 9, 2020
* chore(rfcs): Add AWS S3 source RFC

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>

* Clarifying prior art approaches

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>

* Stray ;

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>

* tabs to spaces

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>

* Feedback

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>
Signed-off-by: Brian Menges <brian.menges@anaplan.com>