Add an option for setting maximal flush interval to the file sink #2174
I'm curious why they would want this? It seems much more efficient to stream the data to disk and then rotate the file once it reaches a certain size. @gfrankliu, could you provide more detail on your use case? That'll help us make sure we're solving the problem optimally.
The use case is to use "vector" as a content receiver on a socket. The volume of incoming lines is unknown, and we want a new file to be created every X seconds (e.g. 30s), and a second process watches the /tmp/tag1 directory for new files.
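The setup described above could be sketched roughly as follows. This is a hypothetical illustration, not a working solution to the request: the `socket` source and `file` sink are real components, and the file sink's `path` accepts strftime specifiers, but that only yields one file per whole time unit (e.g. per minute), which is exactly why a configurable flush/rotation interval is being requested here.

```toml
# Sketch only — values and the /tmp/tag1 layout are taken from the
# use case above; the port and encoding are placeholder assumptions.
[sources.receiver]
  type    = "socket"
  mode    = "tcp"
  address = "0.0.0.0:9000"

[sinks.out]
  type     = "file"
  inputs   = ["receiver"]
  path     = "/tmp/tag1/log-%Y-%m-%d-%H-%M.log"  # one file per minute at best
  encoding = "text"
```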
Thanks for the extra info. @LucioFranco this would switch the sink from streaming to batching. I assume this is a large enough change that it would warrant an entirely new sink if we decided to do this?
How do we see this interacting with partitioning files? Would we recreate all partitions every 30 sec? I think we could most likely fit something like this into our current file sink.
In fluentd, the "append true/false" option decides whether to append or truncate when the filename already exists.
I think it is worthwhile to add "batch" support to the file sink, especially since other sinks already have it, e.g.: https://vector.dev/docs/reference/sinks/gcp_cloud_storage/
Our use case is really a local buffer/proxy layer to transfer content/logs to a cloud provider (GCS or S3 buckets).
@gfrankliu This sounds like a perfect use case for our on-disk buffer, which is designed to store the logs on disk while the WAN requests may be slow. This, combined with the gcs sink, should work quite well for your use case: https://vector.dev/docs/reference/sinks/gcp_cloud_storage/#type. Happy to answer questions around our disk buffer as well :) In general we'd like to keep our file sink decently simple, since most use cases for it are simple or are some sort of workaround where we could instead improve other parts, like the on-disk buffer.
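The suggested alternative might look something like the following sketch: a `gcp_cloud_storage` sink with a disk buffer configured in front of it. The bucket name, credentials path, and size are placeholder assumptions.

```toml
# Sketch of a gcs sink backed by an on-disk buffer. The sink type and
# buffer options (type, max_size, when_full) are real Vector settings;
# bucket/credentials/input names are placeholders.
[sinks.gcs]
  type             = "gcp_cloud_storage"
  inputs           = ["receiver"]
  bucket           = "my-bucket"
  credentials_path = "/etc/vector/gcp-creds.json"

[sinks.gcs.buffer]
  type      = "disk"
  max_size  = 104857600   # 100 MiB of on-disk buffering
  when_full = "block"     # apply backpressure instead of dropping
```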
How does "source" interact with "sink" when the disk buffer is used? Will "source" always save to the disk file, with "sink" reading from the file to send to GCS, or will "source" pass the content to "sink" directly, with the sink only putting it on disk when it can't write to GCS fast enough?
@gfrankliu think of the buffer as a "catch-all" that sits in front of sinks. Data accumulates in the buffer and the sink reads from it. So if the sink is performing well, it will keep the buffer at a minimal size; otherwise, data will accumulate up to the buffer's configured maximum size.
Sorry for bumping such an old post... I intend to stream events from an auto-scaled number of HTTP servers into some streaming application, and I want to avoid maintaining a Kafka cluster just for that. At the moment I am looking into Spark structured streaming. Spark workers would be able to read from files directly, so I was thinking of using the vector file sink.
Any suggestions on that? Happy to open a new discussion if feasible.
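The Spark use case above could be sketched along these lines, assuming the HTTP servers POST events to vector and Spark picks up the resulting files. This is a hypothetical sketch: the `http` source and `file` sink are real components, but the address, path, and encoding shown are assumptions.

```toml
# Sketch only — HTTP servers push events to vector's `http` source,
# and the file sink writes time-partitioned files for Spark workers.
[sources.events]
  type    = "http"
  address = "0.0.0.0:8080"

[sinks.spark_files]
  type     = "file"
  inputs   = ["events"]
  path     = "/data/incoming/events-%Y-%m-%d-%H.log"  # hourly files
  encoding = "ndjson"
```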
A new option which would set the maximal time period between consecutive flushes could be added to the file sink. It was requested in the chat.
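For illustration, the requested option might look like the sketch below. The option name `flush_interval_secs` is an assumption for this example, not an existing Vector setting.

```toml
# Hypothetical — `flush_interval_secs` does not exist; it illustrates
# the requested "maximal time period between consecutive flushes".
[sinks.out]
  type   = "file"
  inputs = ["receiver"]
  path   = "/tmp/tag1/output.log"
  flush_interval_secs = 30   # flush at most every 30 seconds
```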