Conversation
@cmmarslender, @zcapper, @szibis, @mikhno-s, @gburd — since you all expressed interest in this feature, feel free to comment/review. We'd love feedback on the proposed approach so we can get it right on the first implementation. Thanks!
LGTM |
> It will not cover parsing of events within the objects. Vector's current
> approach for this is to delegate to transforms. For example, it is likely we'll
> want to add a transform to extract CloudTrail events stored as objects in S3.
Would we close this in favor of a transform? If so, we should create a transform issue to cover that future work.
It might be a feature that still needs to exist.
I am not sure customers would want to configure their AWS services to incur additional costs in order to work around a limitation in their log router. Not that an S3 bucket of CloudTrail logs would be particularly expensive, but it's the principle of the matter. (That said, my org is already routing CloudTrail to S3.)
For CloudTrail specifically, I think the only options for ingestion by Vector are shipping to S3 or shipping to CloudWatch Logs. Are you aware of any other options? LookupEvents is limited to 2 req/s, which seems like it would make it infeasible for scraping.
I think the naive aws_cloudwatch_logs source (#3077) is still relevant apart from CloudTrail.
We may want a transform specifically to handle CloudTrail events in S3, but I'm not 100% sure. Just using the json_parser transform seems like it might be sufficient. Again, this relates back to the idea of Vector config macros that would allow users to more easily configure ingesting CloudTrail events from S3 without needing to string together a source with relevant transforms.
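As a rough sketch of what that chaining could look like (the `aws_s3` source and its options here are hypothetical placeholders for what this RFC proposes; `json_parser` is Vector's existing transform):

```toml
# Hypothetical sketch only: ingest CloudTrail objects from S3 via SQS
# notifications, then unpack the JSON payload with the existing
# json_parser transform. Source option names are illustrative, not final.
[sources.cloudtrail]
type = "aws_s3"
# SQS queue receiving the bucket's ObjectCreated notifications
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/cloudtrail-notifications"

[transforms.cloudtrail_parsed]
type = "json_parser"      # existing Vector transform
inputs = ["cloudtrail"]
drop_invalid = true       # discard events that fail to parse as JSON
```

A config macro could eventually collapse this source + transform pair into a single stanza for the common CloudTrail case.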
> bucket = "my-bucket" # required
> region = "us-east-1" # required, required when endpoint = ""
I'm curious if this information is encoded into the SQS messages? And if so, we wouldn't need to require these options, right? Finally, could you add an SQS message example to the RFC?
Hmm, that's a good point; it is encoded in the bucket notification message (https://docs.aws.amazon.com/AmazonS3/latest/dev/notification-content-structure.html). This will require us to initialize the S3 client after message consumption, but maybe that's OK. I'll update and add an example notification.
Notably, the other "prior art" implementations using the SQS approach require the bucket and region in the configuration.
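For reference, an abridged sketch of that notification structure (most fields omitted; values are placeholders). Note that both the region and the bucket name are present in the message:

```json
{
  "Records": [
    {
      "eventVersion": "2.1",
      "eventSource": "aws:s3",
      "awsRegion": "us-east-1",
      "eventName": "ObjectCreated:Put",
      "s3": {
        "bucket": {
          "name": "my-bucket",
          "arn": "arn:aws:s3:::my-bucket"
        },
        "object": {
          "key": "path/to/object.log.gz",
          "size": 1024
        }
      }
    }
  ]
}
```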
Added example message and updated config to not include the bucket and region.
> All [custom S3 object metadata key/values](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html#object-metadata) will be set as fields on the log event.
Is this prior art? I'm curious which other tools do this.
Fluentd does this: https://github.com/fluent/fluent-plugin-s3 (see `add_object_metadata`). We could make it conditional, but I don't see a downside to just doing it by default.
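For illustration, assuming an object is uploaded with `aws s3 cp app.log s3://my-bucket/app.log --metadata environment=production`, a resulting log event might look roughly like the following (the non-metadata field names here are hypothetical, not part of the RFC):

```json
{
  "message": "first line of app.log",
  "bucket": "my-bucket",
  "object": "app.log",
  "environment": "production"
}
```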
> ## Future work
>
> * Perhaps allow deletion of S3 objects after they've been processed
I'd also add reclassification of the storage type to this list, e.g. moving an object to Glacier.
Expanded future work comment to include this note.
> It will not cover parsing of events within the objects. Vector's current
> approach for this is to delegate to transforms. For example, it is likely we'll
> want to add a transform to extract CloudTrail events stored as objects in S3.
How about adding metrics to the scope?
One incredibly useful feature of this would be to see:
- The rate of log files being processed.
- The rate of new messages (logs) arriving in SQS.

That way I could quickly calculate whether the Vector ingest is keeping up with the load, or whether I need to introduce more replicas.
I feel like CloudWatch Metrics for SQS is the best place to get queue metrics and would let you know whether the consumers are keeping up, but I'll add some notes on internal events (from which internal metrics are derived) to this RFC.
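Purely for illustration, those internal events might feed counters along these lines (names are hypothetical, loosely following Vector's `*_total` conventions), which would cover the rates asked about above:

```
sqs_messages_received_total   # new notifications arriving from SQS
s3_objects_fetched_total      # log files being processed
processed_bytes_total         # bytes decoded from fetched objects
```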
Added notes on internal events.
lukesteensen left a comment:
This looks great! Definitely agree with the SQS strategy as our first, since it seems the most robust. Alternative implementations can be driven by demand. I also very much agree with keeping the scope relatively limited for the initial version and pushing more complex questions out to future work.
jamtur01 left a comment:
Looks great to me as an initial implementation.
* chore(rfcs): Add AWS S3 source RFC
* Clarifying prior art approaches
* Stray ;
* tabs to spaces
* Feedback

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>
Signed-off-by: Brian Menges <brian.menges@anaplan.com>
Closes #4155