
Add an option for setting maximal flush interval to the file sink #2174

Open
ghost opened this issue Mar 30, 2020 · 11 comments
Labels
domain: templating Anything related to templating Vector's configuration values have: should We should have this feature, but is not required. It is medium priority. meta: feedback Anything related to customer/user feedback. sink: file Anything `file` sink related type: enhancement A value-adding code change that enhances its existing functionality.

Comments

@ghost

ghost commented Mar 30, 2020

A new option could be added to the file sink to set the maximum time period between consecutive flushes. It was requested in the chat:

is there a way to define the flush interval? eg: I have "socket" as sources and "file" as sinks, can I "create the file" (flush out the source to the file) every 30 seconds?

@ghost ghost added type: enhancement A value-adding code change that enhances its existing functionality. sink: file Anything `file` sink related meta: feedback Anything related to customer/user feedback. labels Mar 30, 2020
@ghost ghost changed the title Add an option setting maximum flush interval to the file sink Add an option setting maximal flush interval to the file sink Mar 30, 2020
@binarylogic
Contributor

I'm curious why they would want this? It seems much more efficient to stream the data to disk and then rotate the file once it reaches a certain size. @gfrankliu could you provide more detail on your use case? That'll help us make sure we're solving the problem optimally.

@ghost ghost changed the title Add an option setting maximal flush interval to the file sink Add an option for setting maximal flush interval to the file sink Mar 30, 2020
@gfrankliu

The use case is to use "vector" as a content receiver on a socket. The volume of incoming lines is unknown, and we want a new file to be created every X seconds (eg: 30s).
There is a consumer process picking up those files on the same server (or a sidecar in the same pod), so once the filename shows up, we consider it flushed and don't want its content to change further.
Currently we use fluentd to flush out to files:

<match tag1.**>
  @type file
  path /tmp/tag1/${tag}
  <buffer tag,time>
    timekey 30
    timekey_wait 1s
    flush_at_shutdown true
    @type file
    path /tmp/tmpbuf/tag1
  </buffer>
  append true
</match>

and the second process watches /tmp/tag1 directory for new files.
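For comparison, something close to this could be approximated with Vector's path templating: the file sink's path accepts strftime specifiers evaluated against each event's timestamp, so a new file is started whenever the rendered path changes. This is only a sketch with hypothetical names, and strftime granularity means you get per-minute (or per-second) windows rather than an arbitrary 30s interval, so a true flush-interval option would still be needed:

```toml
# Hypothetical sketch: time-partitioned file sink paths via strftime
# templating in `path`. Once the clock rolls over, events render a new
# path, so a new file appears for the watcher process to pick up.
[sources.my_socket]
type = "socket"
mode = "tcp"
address = "0.0.0.0:9000"

[sinks.my_file]
type = "file"
inputs = ["my_socket"]
path = "/tmp/tag1/%Y-%m-%d_%H-%M.log"  # one file per minute
encoding = "text"
```

Note this only changes which file new events go to; it does not guarantee the previous file is fully flushed the moment the window rolls over.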

@binarylogic
Contributor

Thanks for the extra info. @LucioFranco this would switch the sink from streaming to batching. I assume this is a large enough change that it would warrant an entirely new sink if we decided to do this?

@LucioFranco
Contributor

How do we see this interacting with partitioning files? Would we recreate all partitions every 30sec?

I think we could most likely fit something like this into our current file sink.

@gfrankliu

In fluentd, the option "append true/false" decides whether to append or truncate when the filename already exists.

@gfrankliu

I think it is worthwhile to add "batch" support to the file sink, especially since other sinks already have it, eg: https://vector.dev/docs/reference/sinks/gcp_cloud_storage/

@gfrankliu

Our use case is really a local buffer/proxy layer to transfer content/logs to a cloud provider (GCS or S3 buckets):

  1. Our applications don't want to get blocked by talking over the WAN to cloud providers directly.
  2. If we have a local buffer/proxy layer on the LAN, our applications can send logs or other content quickly to the "proxy" without affecting their main business functions.
  3. The reason we have two processes on the "proxy" is that the first process (eg: fluentd) quickly receives and stores content on the local disk, and the second process sends it to GCS buckets. If the WAN is down, we won't lose data since the files are already on disk. If the WAN gets slow and blocks the second process, it won't affect the first process or block the "receiving" side. If we use vector to combine the two processes into one, can it address our concerns: a) it won't slow down the source socket receiving if the sink to GCS is down/slow; b) we won't lose data if the sink is down for a day; c) if we restart vector, it won't lose the un-sent data in the sink?

@LucioFranco
Contributor

@gfrankliu This sounds like a perfect use case for our on-disk buffer, which is designed to store logs on disk while WAN requests may be slow. This, combined with the gcs sink, should work quite well for your use case.

https://vector.dev/docs/reference/sinks/gcp_cloud_storage/#type

Happy to answer questions around our disk buffer as well :)

I think in general we'd like to keep our file sink decently simple, since most use cases for it are simple or are some sort of workaround where we could instead improve other parts, like the on-disk buffer.
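To make the suggested setup concrete, here is a minimal sketch (hypothetical names; the bucket and batch timeout are placeholders, not recommendations):

```toml
# Hypothetical sketch: socket source feeding GCS directly, with a
# disk buffer absorbing WAN slowness instead of an intermediate
# file sink + second process.
[sinks.to_gcs]
type = "gcp_cloud_storage"
inputs = ["my_socket"]
bucket = "my-log-bucket"

  [sinks.to_gcs.batch]
  timeout_secs = 30  # flush a batch to GCS at least every 30s

  [sinks.to_gcs.buffer]
  type = "disk"      # persist unsent events across restarts
```

With a disk buffer, events survive a vector restart and a slow or down sink drains from disk once the WAN recovers.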

@gfrankliu

How does the "source" interact with the "sink" when a disk buffer is used? Will the "source" always save to the disk file and the "sink" read from the file to send to GCS, or will the "source" pass the content to the "sink" directly, with the sink only writing to disk if it can't write fast enough to GCS?

@binarylogic
Contributor

@gfrankliu think of the buffer as a "catch-all" that sits in front of sinks. Data accumulates in the buffer and the sink reads from it. So if the sink is performing well, it will keep the buffer at a minimal size; otherwise, data will accumulate up to buffer.max_size and then the buffer.when_full behavior applies. I hope that helps.
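As a small illustration of the two knobs mentioned (hypothetical sink name; the size is a placeholder):

```toml
[sinks.to_gcs.buffer]
type = "disk"
max_size = 5368709120  # bytes of buffer before `when_full` kicks in
when_full = "block"    # back-pressure the source; or "drop_newest"
                       # to shed load instead of blocking
```

Choosing "block" versus "drop_newest" is the trade-off between never losing data and never slowing down the receiving side.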

@binarylogic binarylogic added domain: templating Anything related to templating Vector's configuration values have: should We should have this feature, but is not required. It is medium priority. labels Aug 7, 2020
@scMarkus
Contributor

Sorry for bumping such an old post...
I think I have a use case similar to the ones described, where I would like to utilize some short-interval batching configs like those in the s3 sink.

I am intending to stream events from an autoscaled number of HTTP servers into a streaming application, and I want to avoid maintaining a Kafka cluster just for that. At the moment I am looking into Spark structured streaming. Spark workers would be able to read from files directly, so I was thinking of using the vector file sink.
There is still the need for some synchronous meta information, which I am thinking of handling with Alluxio. This would enable me to either:

  1. Use Alluxio FUSE to write files into the colocated Alluxio workers and let them handle the rest, or
  2. Use Alluxio S3, which feels like more communication overhead to me (though I don't know for sure); furthermore, I am not sure how load distribution and hot-spotting would be handled then, or
  3. The perfect solution would be some implementation supporting short-circuit reads, but I have only seen this working with the Alluxio client jar.

Any suggestions on that? Happy to open a new discussion if feasible.
