New `aws_s3` source #1017

Comments
@ticon-mg, to clarify, you're referring to source archives? You can download those from the GitHub repo and releases in the interim.
Hi!
Ah, that makes more sense. Thanks!
+1

Use cases

Implementation

@zcapper
@NikitaGl I'm fond of the event notifications approach because it's stateless (in Vector) and could be scaled horizontally to work with very large buckets and inventories. FWIW, Splunk seems to have both generic and event-notification-based S3 connectors.
I think an interesting approach is to treat S3 like a file system. There are utilities that do exactly this; I'm not advocating them as a solution for Vector, but more as a mental frame to think through.
The right approach, and what fluentd is doing, is to use an S3-SQS system, which basically means reading the object key from an SQS message and then reading the corresponding data. Fluentd has a plugin along these lines: https://github.com/fluent/fluent-plugin-s3. The issue with a generic plugin is that detecting new files requires listing all files. As the number of files grows, the time taken to discover new files increases, as does the API cost of continuous listing. SQS would be way cheaper.
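To make the S3-SQS flow described above concrete, here is a minimal sketch of the pattern in Python with boto3 (illustration only, not Vector code; the queue URL, region, and processing step are assumptions): long-poll the queue, pull the bucket/key out of the S3 event notification, fetch the object, and only then delete the message.

```python
import json
import boto3

# Assumed example queue URL; replace with your own notification queue.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-events"

sqs = boto3.client("sqs", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")

def poll_once():
    # Long-poll the queue for S3 ObjectCreated notifications.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        for record in body.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            obj = s3.get_object(Bucket=bucket, Key=key)
            data = obj["Body"].read()  # decompress/split into lines here
            print(f"read {len(data)} bytes from s3://{bucket}/{key}")
        # Delete only after the object has been processed successfully.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

if __name__ == "__main__":
    poll_once()
```

Note that no bucket listing happens at all in this loop, which is why the per-poll cost stays flat regardless of how many objects the bucket holds.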
What would be great is to have the S3 sink and S3 source with the same features, so the pair could be used as a failover path to deliver things reliably.
I like the idea. In our context we are not using Kafka; instead we use Firehose delivery to convert the records to ORC and deliver them to S3. But even with fluentd, I found that when we get hit by Firehose throttling, the worker keeps taking data from the queue and our buffer keeps growing. This causes real data loss and forces us to run Firehose over-provisioned. The downside is that we get small chunks in S3. Also, in our setup we use fluentd copy to push data to Firehose along with a grep filter, so that we push only write data to Elasticsearch. We could probably use a more robust mechanism downstream to handle destination-based buffer sizing. Something like
PS: Firehose ingestion is way faster than ES ingestion, and it does buffering itself, so we can easily do
I am interested in attempting to build this source plugin. As the label says, it As input config options, I guess it's a sort of union of:

```toml
[sources.my_source_id]
type = "aws_s3" # required
bucket = "s3://my-bucket" # required - (I'm assuming we want to use the S3 protocol, and not generic http through the s3 http server)
ignore_older = 86400 # optional, no default, seconds
include = ["/var/log/nginx/*.log"] # required
start_at_beginning = false # optional, default
polling_interval_ms = 30000
encoding = "json"
folder_template = "AWSLogs/<my account id>/elasticloadbalancing/us-west-2/%Y/%m/%D/*.log.gz"
```

Edit: Note, I, personally, am more interested in a polling model than using SQS for events. Thoughts?
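For contrast with the SQS approach, here is a rough sketch (assumed names and an in-memory checkpoint, for illustration only; not Vector code) of what the polling model proposed above could look like with boto3. It also shows where the cost comes from: every poll has to page through `ListObjectsV2` results, 1000 keys at a time.

```python
import time
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

# Hypothetical parameters mirroring the proposed config above.
BUCKET = "my-bucket"
PREFIX = "AWSLogs/123456789012/elasticloadbalancing/us-west-2/"
IGNORE_OLDER_SECS = 86400
POLL_INTERVAL_SECS = 30

seen_keys = set()  # in-memory checkpoint; a real source would persist this

def poll_bucket():
    cutoff = time.time() - IGNORE_OLDER_SECS
    # Each poll pages through the full listing under the prefix,
    # which is what makes pure polling slow and costly on large buckets.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key in seen_keys or obj["LastModified"].timestamp() < cutoff:
                continue
            body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
            print(f"new object s3://{BUCKET}/{key}: {len(body)} bytes")
            seen_keys.add(key)

if __name__ == "__main__":
    while True:
        poll_bucket()
        time.sleep(POLL_INTERVAL_SECS)
```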
Hi @rrichardson,

Thanks for your interest in contributing this! We've had a number of people ask for it. Typically, for a new source like this where there are some decisions to be made, we like to do an RFC first to allow us to discuss and make those decisions before implementation; the goal being to reduce the amount of rework that might be needed if the discussion happened as part of the pull request. An example of this is the recently added

Would you be open to writing up a short RFC based on the template? I'm happy to focus on giving you support in the form of feedback and suggestions in the RFC and the PR. I think what you posted is a good start. I'd suggest thinking about:
Thanks again! Let me know if I can offer any additional guidance.
@jszwedko - Thanks. I will put together an RFC. I am trying to wrap up a couple of projects first, but I have this scheduled in my next sprint. :)
Some notes from another user:

One of the challenges with collection from S3 is that polling/scanning a bucket does not scale well; AWS only allows you to list the contents against a prefix (max 1000 objects at a time) and it's possible to see scanning times extending to hours, or even multiple days. Three documented options exist for collection from S3 without needing to re-scan the bucket:

* SQS is probably the best approach for typical collectors; it relies on a client to connect to the queue and retrieve details of new events.
* SNS is probably the best approach for SaaS-based collection; SNS can proactively notify a publicly accessible endpoint that a new object exists.
* DynamoDB is probably the best approach where more specific details need to be kept about events, or a permanent record maintained of what assets are available in a large S3 bucket (archive use case?).
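As an illustration of the SQS option from the list above (bucket, queue ARN, and region are hypothetical), wiring a bucket to publish ObjectCreated notifications to a queue might look like this with boto3; the queue's access policy must already allow `s3.amazonaws.com` to send messages to it.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Hypothetical bucket and queue for illustration only.
s3.put_bucket_notification_configuration(
    Bucket="my-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                # Every new object under the bucket produces one SQS message.
                "QueueArn": "arn:aws:sqs:us-east-1:123456789012:s3-events",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```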
Hey @rrichardson, I'm actually going to be starting on an implementation of this source using SQS for bucket notifications this week. If you were also planning on working on this soon, I'd love to collaborate on an RFC and make sure I don't step on your toes with implementation. We have a Discord (https://discord.gg/sfFzZ6) that we could use for more real-time communication. Otherwise, I can ping you when the RFC is up (hopefully tomorrow) and you can leave any feedback. I'll plan to leave a hole in the Vector configuration spec for the polling strategy that you can fill in later.
RFC: #4197. I proposed configuration specifying the SQS implementation as a "strategy" to allow room for specifying a polling strategy as well. cc/ @rrichardson
See RFC: https://github.com/timberio/vector/blob/master/rfcs/2020-09-29-4155-aws-s3-source.md

Fixes #1017

This is the initial implementation of an `aws_s3` source that relies on AWS SQS for bucket notifications to inform of new S3 objects to consume. See the RFC for discussion of other approaches and why this one was chosen. The source does have an enum `strategy` configuration to allow for additional approaches (like SNS or long polling) to be supported.

The basic flow is:

User setup:

* bucket is created
* queue is created
* bucket is configured to notify the queue for ObjectCreated events
* vector is configured with the `aws_s3` source using the queue configuration, whereupon it will process the ObjectCreated events to read each S3 object.

Example configuration:

```toml
[sources.s3]
type = "aws_s3"
region = "us-east-1"
strategy = "sqs"
compression = "gzip"
sqs.queue_name = "jesse-test-flow-logs"
sqs.delete_message = false

[sinks.console]
type = "console"
inputs = ["s3"]
encoding.codec = "json"
```

The commits can be viewed in order, but the resulting diff probably isn't too bad either (it's mostly just `sources/aws_s3`). It may be worth looking at the added cue documentation first.

The source also behaves very much like the `file` source in that it emits one event per line, and it supports the same multiline configuration that the `file` source supports. **Note** there is a rough edge here where the `multiline` config supports a `timeout_ms` option that isn't really applicable here but is applied just the same.

Future work:

1. Additional codec support (like `application/ndjson`). For now, this acts very much like the `file` source. This could be looped into the general work around codecs #4278
2. Additional compression formats (Snappy, LZ4, Zip). This was requested in the original issue. I started with just supporting the formats that were supported out-of-the-box by the `async_compression` crate we are using.
3. (potential) Multi-region support. Currently we only support reading from a queue and a bucket in the same region. I expect this will cover most cases since AWS requires the bucket to publish notifications to a queue in the same region. One could forward messages from a queue in one region to another, but this seems unlikely. I'd prefer to wait and see if anyone asks for multi-region support, especially given that fluentbit and filebeat have the same restriction.
4. Concurrent processing. Right now one message is processed at a time, which leads to predictable behavior, but we may observe some performance improvements by processing multiple objects at once. The benefit should be vetted, though; the process may be limited by incoming network bandwidth anyway.
5. Refresh the message visibility timeout. Right now, the visibility timeout is set once, when the message is retrieved, but we could refresh this timeout if we are still processing a message when it gets close to the end of the timeout, to avoid another vector instance picking it up. This would let users have the best of both worlds: short visibility timeouts to quickly reprocess messages when a `vector` instance falls over, while also avoiding concurrent processing of messages for large objects where the processing time exceeds the visibility timeout.

I'll create issues for 2 and 5. I think the others can be left until we observe their necessity.
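To illustrate the visibility-timeout refresh idea from item 5 of the future-work list (a sketch with assumed names, written against boto3 rather than the actual Rust implementation), a consumer can periodically call `ChangeMessageVisibility` while it is still working on a message, then delete it once processing finishes:

```python
import time
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

# Hypothetical values for illustration.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-events"
VISIBILITY_TIMEOUT = 60  # seconds

def process_with_heartbeat(receipt_handle, process_chunk):
    """Keep extending the message's visibility while processing runs.

    `process_chunk` is a hypothetical callback that does a slice of work
    and returns True once the S3 object has been fully consumed.
    """
    deadline = time.monotonic() + VISIBILITY_TIMEOUT
    while not process_chunk():
        # Refresh shortly before the current timeout would expire so another
        # consumer doesn't pick the same message up mid-processing.
        if time.monotonic() > deadline - 10:
            sqs.change_message_visibility(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=VISIBILITY_TIMEOUT,
            )
            deadline = time.monotonic() + VISIBILITY_TIMEOUT
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=receipt_handle)
```

This keeps the queue's configured visibility timeout short (fast recovery if a `vector` instance dies) without risking duplicate processing of large objects.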