
chore(rfcs): Add AWS S3 source RFC #4197

Merged: jamtur01 merged 5 commits into master from rfc-aws-s3-source on Oct 14, 2020
Conversation

@jszwedko
Collaborator

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>

Closes #4155

@binarylogic
Contributor

@cmmarslender, @zcapper, @szibis, @mikhno-s, @gburd since you all expressed interest in this feature feel free to comment/review. We'd love feedback on the proposed approach so we can get it right on the first implementation. Thanks!

@jszwedko mentioned this pull request Sep 29, 2020
@gburd commented Sep 30, 2020

LGTM


It will not cover parsing of events within the objects. Vector's current
approach for this is to delegate to transforms. For example, it is likely we'll
want to add a transform to extract CloudTrail events stored as objects in S3.
Contributor

Would we close this in favor of a transform? If so, we should create a transform issue to cover that future work.

Contributor

It might be a feature that still needs to exist.
I am not sure customers would want to configure their AWS services to incur additional costs in order to work around a limitation in their log router. Not that an S3 bucket of CloudTrail logs would be particularly expensive, but it's the principle of the matter. (That said, my org is already routing CloudTrail to S3.)

Collaborator Author

For CloudTrail specifically, I think the only options for ingestion by Vector are shipping to S3 or shipping to CloudWatch Logs. Are you aware of any other options? LookupEvents is limited to 2 req/s, which seems like it would make it infeasible for scraping.

I think the naive aws_cloudwatch_logs source (#3077) is still relevant apart from CloudTrail.

We may want a transform specifically to handle CloudTrail events in S3, but I'm not 100% sure. Just using the json_parser transform seems like it might be sufficient. Again, this relates back to the idea of Vector config macros that would allow users to more easily configure ingesting CloudTrail events from S3 without needing to string together a source with relevant transforms.
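
To make the "source plus transform" idea concrete, here is a minimal sketch of what such a pipeline could look like. The aws_s3 source name, the strategy option, and the sqs.queue_name option are assumptions drawn from this RFC, not a final design; json_parser is the existing Vector transform mentioned above.

```toml
# Hypothetical sketch: the aws_s3 source name and its options are assumptions
# from this RFC, not a shipped configuration.
[sources.cloudtrail_s3]
  type = "aws_s3"
  strategy = "sqs"                      # assumed: consume S3 bucket notifications via SQS
  sqs.queue_name = "cloudtrail-events"  # assumed option name

# The existing json_parser transform handles the parsing that is out of
# scope for the source itself.
[transforms.parse_cloudtrail]
  type = "json_parser"
  inputs = ["cloudtrail_s3"]
  drop_invalid = true
```

A dedicated CloudTrail transform (or a config macro) could later replace the json_parser step if plain JSON parsing turns out to be insufficient.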

Comment thread rfcs/2020-09-29-4155-aws-s3-source.md
Comment thread rfcs/2020-09-29-4155-aws-s3-source.md Outdated
Comment on lines +56 to +57
bucket = "my-bucket" # required
region = "us-east-1" # required, required when endpoint = ""
Contributor

I'm curious whether this information is encoded into the SQS messages. If so, we wouldn't need to require these options, right? Finally, could you add an SQS message example to the RFC?

Collaborator Author

Hmm, that's a good point; it is encoded in the bucket notification message (https://docs.aws.amazon.com/AmazonS3/latest/dev/notification-content-structure.html). This will require us to initialize the S3 client after message consumption, but maybe that's OK. I'll update the RFC and add an example notification.

Notably, the other "prior art" implementations using the SQS approach do require the bucket and region in the configuration.

Collaborator Author

Added an example message and updated the config to not include the bucket and region.
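
For illustration, a rough sketch of what the trimmed-down configuration might look like once the bucket and region come from the notification itself (each S3 event notification record carries awsRegion, s3.bucket.name, and s3.object.key). The option names below are assumptions, not the final design:

```toml
# Hypothetical sketch: option names are assumptions, not the final design.
[sources.s3_logs]
  type = "aws_s3"
  strategy = "sqs"                      # assumed: poll an SQS queue for bucket notifications
  sqs.queue_name = "s3-object-created"  # assumed option name
  # No bucket/region options: each notification already identifies the
  # bucket, region, and key of the object that triggered it.
```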

Comment on lines +81 to +147
All [custom S3 object metadata
key/values](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html#object-metadata)
will be set as fields on the log event.
Contributor

Is this prior art? I'm curious which other tools do this.

Collaborator Author

Fluentd does this: https://github.com/fluent/fluent-plugin-s3 (see add_object_metadata). We could make it conditional, but I don't see a downside to just doing it by default.
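
If it ever does need to be conditional, the knob could be as small as a single boolean, analogous to Fluentd's add_object_metadata. The option name below is hypothetical:

```toml
[sources.s3_logs]
  type = "aws_s3"                  # assumed source name
  include_object_metadata = true   # hypothetical toggle, analogous to Fluentd's add_object_metadata
```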

Comment thread rfcs/2020-09-29-4155-aws-s3-source.md
Comment thread rfcs/2020-09-29-4155-aws-s3-source.md Outdated

## Future work

* Perhaps allow deletion of S3 objects after they've been processed
Contributor

I'd also add reclassification of the storage class to this list, e.g. moving an object to Glacier.

Collaborator Author

Expanded future work comment to include this note.

Comment thread rfcs/2020-09-29-4155-aws-s3-source.md
It will not cover parsing of events within the objects. Vector's current
approach for this is to delegate to transforms. For example, it is likely we'll
want to add a transform to extract CloudTrail events stored as objects in S3.

Contributor

How about adding metrics to the scope?

One incredibly useful feature of this would be to see:

* The rate of log files being processed.
* The rate of new messages (logs) arriving in SQS.

That way I can quickly tell whether the Vector ingest is keeping up with the load or whether I need to introduce more replicas.

Collaborator Author

I feel like CloudWatch Metrics for SQS are the best place to get queue metrics and would let you know if the consumers are keeping up, but I'll add some notes on internal events (from which internal metrics are derived) to this RFC.

Collaborator Author

Added notes on internal events.

Member

@lukesteensen left a comment


This looks great! Definitely agree with the SQS strategy as our first implementation, since it seems the most robust. Alternative implementations can be driven by demand. I also very much agree with keeping the scope relatively limited for the initial version and pushing more complex questions out to future work.

Comment thread rfcs/2020-09-29-4155-aws-s3-source.md
Contributor

@jamtur01 left a comment


Looks great to me as an initial implementation.

Comment thread rfcs/2020-09-29-4155-aws-s3-source.md
@jamtur01 merged commit 1b54b7b into master Oct 14, 2020
@jamtur01 deleted the rfc-aws-s3-source branch October 14, 2020 16:13
mengesb pushed a commit to jacobbraaten/vector that referenced this pull request Dec 9, 2020
* chore(rfcs): Add AWS S3 source RFC

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>

* Clarifying prior art approaches

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>

* Stray ;

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>

* tabs to spaces

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>

* Feedback

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>
Signed-off-by: Brian Menges <brian.menges@anaplan.com>