
Add an option for setting maximal flush interval to the file sink #2174

Open
ghost opened this issue Mar 30, 2020 · 11 comments
Labels
domain: templating Anything related to templating Vector's configuration values have: should We should have this feature, but is not required. It is medium priority. meta: feedback Anything related to customer/user feedback. sink: file Anything `file` sink related type: enhancement A value-adding code change that enhances its existing functionality.

Comments

@ghost

ghost commented Mar 30, 2020

A new option could be added to the file sink to set the maximum time period between consecutive flushes. It was requested in the chat:

is there a way to define the flush interval? eg: I have "socket" as sources and "file" as sinks, can I "create the file" (flush out the source to the file) every 30 seconds?

@ghost ghost added type: enhancement A value-adding code change that enhances its existing functionality. sink: file Anything `file` sink related meta: feedback Anything related to customer/user feedback. labels Mar 30, 2020
@ghost ghost changed the title Add an option setting maximum flush interval to the file sink Add an option setting maximal flush interval to the file sink Mar 30, 2020
@binarylogic
Contributor

I'm curious why they would want this? It seems much more efficient to stream the data to disk and then rotate the file once it reaches a certain size. @gfrankliu could you provide more detail on your use case? That'll help us make sure we're solving the problem optimally.

@ghost ghost changed the title Add an option setting maximal flush interval to the file sink Add an option for setting maximal flush interval to the file sink Mar 30, 2020
@gfrankliu

The use case is to use "vector" as a content receiver on a socket. The volume of incoming lines is unknown, and we want a new file to be created every X seconds (eg: 30s).
There is a consumer process picking up those files on the same server (or a sidecar in the same pod), so once the filename shows up, we consider it flushed and don't want its content to change further.
Currently we use fluentd to flush out to files:

<match tag1.**>
  @type file
  path /tmp/tag1/${tag}
  <buffer tag,time>
    timekey 30
    timekey_wait 1s
    flush_at_shutdown true
    @type file
    path /tmp/tmpbuf/tag1
  </buffer>
  append true
</match>

and the second process watches /tmp/tag1 directory for new files.
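For comparison, something close to this could be approximated with Vector's path templating: the file sink's path accepts strftime specifiers evaluated against each event's timestamp, so a new file is started whenever the rendered path changes. This is only a sketch with hypothetical names, and strftime granularity means you get per-minute (or per-second) windows rather than an arbitrary 30s interval, so a true flush-interval option would still be needed:

```toml
# Hypothetical sketch: time-partitioned file sink paths via strftime
# templating in `path`. Once the clock rolls over, events render a new
# path, so a new file appears for the watcher process to pick up.
[sources.my_socket]
type = "socket"
mode = "tcp"
address = "0.0.0.0:9000"

[sinks.my_file]
type = "file"
inputs = ["my_socket"]
path = "/tmp/tag1/%Y-%m-%d_%H-%M.log"  # one file per minute
encoding = "text"
```

Note this only changes which file new events go to; it does not guarantee the previous file is fully flushed the moment the window rolls over.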

@binarylogic
Contributor

Thanks for the extra info. @LucioFranco this would switch the sink from streaming to batching. I assume this is a large enough change that it would warrant an entirely new sink if we decided to do this?

@LucioFranco
Contributor

How do we see this interacting with partitioning files? Would we recreate all partitions every 30sec?

I think we could most likely fit something like this into our current file sink.

@gfrankliu

In fluentd, the option "append true/false" decides whether to append or truncate when the filename already exists.

@gfrankliu

I think it is worthwhile to add "batch" support to the file sink, especially since other sinks already have it, eg: https://vector.dev/docs/reference/sinks/gcp_cloud_storage/

@gfrankliu

Our use case is really a local buffer/proxy layer to transfer content/logs to a cloud provider (GCS or S3 buckets):

  1. Our applications don't want to get blocked by talking over the WAN to cloud providers directly.
  2. If we have a local buffer/proxy layer on the LAN, our applications can send logs or other content quickly to the "proxy" without affecting their main business functions.
  3. The reason we have two processes on the "proxy" is that the first process (eg: fluentd) quickly receives and stores content on the local disk, and the second process sends it to GCS buckets. If the WAN is down, we won't lose data since the files are already on disk. If the WAN gets slow and blocks the second process, it won't affect the first process or block the "receiving" side. If we use vector to combine the two processes into one, can it address our concerns: a) it won't slow down the source socket receiving if the sink to GCS is down/slow; b) we won't lose data if the sink is down for a day; c) if we restart vector, it won't lose the un-sent data in the sink?

@LucioFranco
Contributor

@gfrankliu This sounds like a perfect use case for our on-disk buffer, which is designed to store logs on disk while WAN requests may be slow. This, combined with the gcs sink, should work quite well for your use case.

https://vector.dev/docs/reference/sinks/gcp_cloud_storage/#type

Happy to answer questions around our disk buffer as well :)

I think in general we'd like to keep our file sink decently simple, since most use cases for it are simple or are some sort of workaround where we could instead improve other parts, like the on-disk buffer.
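To make the suggested setup concrete, here is a minimal sketch (hypothetical names; the bucket and batch timeout are placeholders, not recommendations):

```toml
# Hypothetical sketch: socket source feeding GCS directly, with a
# disk buffer absorbing WAN slowness instead of an intermediate
# file sink + second process.
[sinks.to_gcs]
type = "gcp_cloud_storage"
inputs = ["my_socket"]
bucket = "my-log-bucket"

  [sinks.to_gcs.batch]
  timeout_secs = 30  # flush a batch to GCS at least every 30s

  [sinks.to_gcs.buffer]
  type = "disk"      # persist unsent events across restarts
```

With a disk buffer, events survive a vector restart and a slow or down sink drains from disk once the WAN recovers.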

@gfrankliu

How does the "source" interact with the "sink" when a disk buffer is used? Will the "source" always save to the disk file and the "sink" read from the file to send to GCS, or will the "source" pass the content to the "sink" directly, with the sink only writing to disk if it can't write fast enough to GCS?

@binarylogic
Contributor

@gfrankliu think of the buffer as a "catch-all" that sits in front of sinks. Data accumulates in the buffer and the sink reads from it. So if the sink is performing well, it will keep the buffer at a minimal size; otherwise, data will accumulate up to buffer.max_size and then the buffer.when_full behavior applies. I hope that helps.
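As a small illustration of the two knobs mentioned (hypothetical sink name; the size is a placeholder):

```toml
[sinks.to_gcs.buffer]
type = "disk"
max_size = 5368709120  # bytes of buffer before `when_full` kicks in
when_full = "block"    # back-pressure the source; or "drop_newest"
                       # to shed load instead of blocking
```

Choosing "block" versus "drop_newest" is the trade-off between never losing data and never slowing down the receiving side.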

@binarylogic binarylogic added domain: templating Anything related to templating Vector's configuration values have: should We should have this feature, but is not required. It is medium priority. labels Aug 7, 2020
@scMarkus
Contributor

Sorry for bumping such an old post...
I think I have a use case similar to the ones described, where I would like to utilize some short-interval batching configs like those in the s3 sink.

I am intending to stream events from an autoscaled number of HTTP servers into a streaming application, and I want to avoid maintaining a Kafka cluster just for that. At the moment I am looking into Spark structured streaming. Spark workers would be able to read from files directly, so I was thinking of using the vector file sink.
There is still the need for some synchronous meta information, which I am thinking of handling with Alluxio. This would enable me to either:

  1. Use Alluxio FUSE to write files into the colocated Alluxio workers and let them handle the rest, or
  2. Use Alluxio S3, which feels like more communication overhead to me (though I don't know for sure); furthermore, I am not sure how load distribution and hot-spotting would be handled then, or
  3. The perfect solution would be some implementation supporting short-circuit reads, but I have only seen this working with the Alluxio client jar.

Any suggestions on that? Happy to open a new discussion if feasible.
