Should vector be maintaining open file handles to ignore_older files older than the cutoff? #3567

Open
jszwedko opened this issue Aug 25, 2020 · 17 comments
Labels
source: file Anything `file` source related type: enhancement A value-adding code change that enhances its existing functionality.

Comments

@jszwedko
Member

jszwedko commented Aug 25, 2020

I'm not sure if this is a bug or expected behavior, but it looks like, when using the ignore_older config for the file source, Vector still maintains an open file handle to the files with a modified time before the cutoff.

From gitter: https://gitter.im/timberio-vector/community?at=5f456c28c3aa024ef99e4907. The user was trying to limit the number of open file handles by using the ignore_older config.

If it is expected, we should probably call it out in the docs as I was not expecting that. It seemingly limits its usefulness in avoiding resource consumption.

Vector Version

vector 0.11.0 (g8b4ff32 x86_64-unknown-linux-gnu 2020-08-25)

Vector Configuration File

data_dir = "/tmp/vector"

[sources.in]
  type = "file" # required
  ignore_older = 10  # optional, no default, seconds
  include = ["/tmp/log/*.log"] # required

[sinks.http]
  type = "console"
  inputs = ["in"]
  encoding.codec = "json"

Debug Output

Aug 25 16:57:23.252  INFO vector: Log level "debug" is enabled.
Aug 25 16:57:23.256  INFO vector: Loading configs. path=["/tmp/test.toml"]
Aug 25 16:57:23.276  INFO vector::topology: Running healthchecks.
Aug 25 16:57:23.276  INFO vector::topology: Starting source "in"
Aug 25 16:57:23.277  INFO vector::topology::builder: Healthcheck: Passed.
Aug 25 16:57:23.277  INFO vector::topology: Starting sink "http"
Aug 25 16:57:23.277  INFO vector: Vector has started. version="0.11.0" git_version="v0.9.0-573-g8b4ff32" released="Tue, 25 Aug 2020 20:48:03 +0000" arch="x86_64"
Aug 25 16:57:23.277  INFO source{name=in type=file}: vector::sources::file: Starting file server. include=["/tmp/log/*.log"] exclude=[]
Aug 25 16:57:23.279  INFO source{name=in type=file}:file_server: vector::internal_events::file: found new file to watch. path="/tmp/log/a.log"
Aug 25 16:57:23.280 DEBUG source{name=in type=file}:file_server: vector::internal_events::file: files checkpointed. count=0
Aug 25 16:57:24.315 DEBUG source{name=in type=file}:file_server: vector::internal_events::file: files checkpointed. count=0
Aug 25 16:57:25.340 DEBUG source{name=in type=file}:file_server: vector::internal_events::file: files checkpointed. count=0
Aug 25 16:57:27.391 DEBUG source{name=in type=file}:file_server: vector::internal_events::file: files checkpointed. count=0
^CAug 25 16:57:28.368  INFO vector: Vector has stopped.
Aug 25 16:57:28.370  INFO vector::topology: Shutting down... Waiting on: in, http. 59 seconds left
Aug 25 16:57:28.370 DEBUG source{name=in type=file}: vector::topology::builder: Finished
Aug 25 16:57:28.370 DEBUG sink{name=http type=console}: vector::topology::builder: Finished

Expected Behavior

Vector does not open the file

Actual Behavior

Vector opens the file

Additional Context

$ lsof -p 27049 | grep a.log
vector  27049 CORP\jesse   15r      REG              259,3    104581 178538005 /tmp/log/a.log
@jszwedko jszwedko added the type: bug A code related bug. label Aug 25, 2020
@jszwedko jszwedko changed the title Should vector be maintaining open file handles to cutoff files? Should vector be maintaining open file handles to ignore_older files older than the cutoff? Aug 25, 2020
@ktff
Contributor

ktff commented Sep 13, 2020

So the description for the ignore_older option is somewhat misleading/incomplete. Currently it is:

Ignore files with a data modification date that does not exceed this age.

which doesn't say that it will collect newer data from the file. That's why Vector is tailing these files: it only ignores the old data, not the file itself. So a clearer description would be:

Ignore existing data in files with a data modification date older than this age. Subsequent data will be collected.

This is the expected behavior, so just the documentation should be updated.

@binarylogic
Contributor

binarylogic commented Sep 14, 2020

Hm, that's not how I thought the option worked. I think the behavior should change. The idea is to ignore older files as if they don't exist, not just older data. This ensures that Vector does not open file handles for the file, etc.

@binarylogic
Contributor

binarylogic commented Sep 14, 2020

@ktff, we're working on an RFC to improve the file source in #3480. I think we should address this there.

@binarylogic binarylogic added the source: file Anything `file` source related label Sep 14, 2020
@lukesteensen
Member

The idea is to ignore older files as if they don't exist, not just older data.

What would you want to happen with a file that initially has not been modified since ignore_older but then starts getting writes? If you want to start reading only the new data, we'd need to keep track of where that starts somehow. Right now we do it with the file cursor.

@binarylogic
Contributor

That's a good point. Ideally, we'd rediscover the file and read the new data only, but in the interim we would not hold a file handle for it.

@jszwedko
Member Author

That's a good point. Ideally, we'd rediscover the file and read the new data only, but in the interim we would not hold a file handle for it.

I think I like that, but it does seem like it would make it trickier to figure out which file contents were new. I guess you could keep track of the sizes of the ignored files and then seek to that point when their mtime changes? That wouldn't handle the case that the file is overwritten, rather than appended to, though.
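The size-tracking idea above could look roughly like the following. This is a minimal sketch and not Vector's implementation: `IgnoredFile`, `remember`, and `read_new_data` are hypothetical names, and the sketch only notices an overwrite when the file has shrunk. A file that is overwritten with more data than before would be treated as an append, which is exactly the gap described above.

```python
import os
from dataclasses import dataclass


@dataclass
class IgnoredFile:
    """What we remember about a file that was too old to open (hypothetical)."""
    size: int     # size in bytes at the time the file was ignored
    mtime: float  # modification time at the time the file was ignored


def remember(path: str) -> IgnoredFile:
    """Record size and mtime with a single stat call, without opening the file."""
    st = os.stat(path)
    return IgnoredFile(size=st.st_size, mtime=st.st_mtime)


def read_new_data(path: str, seen: IgnoredFile) -> bytes:
    """Return bytes appended since `seen` was recorded.

    Raises ValueError if the file shrank, since that means it was truncated
    or overwritten and the remembered offset is no longer meaningful.
    """
    st = os.stat(path)
    if st.st_mtime == seen.mtime and st.st_size == seen.size:
        return b""  # nothing changed, so no handle is opened
    if st.st_size < seen.size:
        raise ValueError(f"{path} shrank; it was truncated or overwritten")
    with open(path, "rb") as f:
        f.seek(seen.size)  # skip the data that was being ignored
        return f.read()
```

The file is only opened once a change is detected, so an untouched old file costs one `stat` call per check rather than a file descriptor.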

@lukesteensen
Member

I guess you could keep track of the sizes of the ignored files and then seek to that point when their mtime changes? That wouldn't handle the case that the file is overwritten, rather than appended to, though.

Yeah, you'd at least need to make sure the fingerprint was the same or weird things could happen.

Another confusing case would be if Vector is started, ignores a file for an old mtime, then Vector is stopped, then the file is written to, then Vector is started again. At that point, it would (depending on the config) read the whole file from the beginning, including data it had been ignoring to that point. Right now I think we avoid that via the normal checkpointing of open files.

@ktff
Contributor

ktff commented Sep 15, 2020

I think the behavior should change

In that case, whatever solution we come up with to avoid holding file handles should be usable for other files as well, so we can reduce total file handle usage further. This way we can avoid some of the special casing and use the normal checkpointing to avoid the issues @lukesteensen mentioned.

@vbichov

vbichov commented Jan 3, 2021

Is there a workaround to that issue?

In my use-case, there are many short-lived jobs that generate roughly 100000 files per month. Now due to business requirements, we can't delete the files before that.
I'm looking for a way to limit the number of tailed files somehow (e.g., by looking at ctime). Is there any way to do that?

@ktff
Contributor

ktff commented Jan 3, 2021

@vbichov

Is there a workaround to that issue?

One way is to point Vector to a folder with symlinks to the files and then have two services/scripts that will create symlinks and delete older ones.
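A rough sketch of such a script, combining the "create" and "delete" halves into one function that could be run periodically (for example from cron). The function name `sync_symlinks` and its parameters are made up for illustration, and the sketch assumes a flat source directory on a system where unprivileged symlinks are available.

```python
import os
import time


def sync_symlinks(source_dir, link_dir, max_age_secs, now=None):
    """Make link_dir contain symlinks to exactly the recently modified files.

    Returns the sorted names of the files that are currently linked.
    """
    now = time.time() if now is None else now
    os.makedirs(link_dir, exist_ok=True)

    # Files in source_dir whose mtime is within the allowed age.
    fresh = {
        entry.name
        for entry in os.scandir(source_dir)
        if entry.is_file() and now - entry.stat().st_mtime <= max_age_secs
    }

    # Drop links that point at files which have become too old (or vanished).
    for entry in list(os.scandir(link_dir)):
        if entry.is_symlink() and entry.name not in fresh:
            os.unlink(entry.path)

    # Add links for fresh files that are not linked yet.
    for name in fresh:
        link = os.path.join(link_dir, name)
        if not os.path.lexists(link):
            os.symlink(os.path.abspath(os.path.join(source_dir, name)), link)

    return sorted(fresh)
```

Vector's `include` would then point at the link directory instead of the real one, so only the linked files are discovered and opened.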

@jszwedko jszwedko added type: enhancement A value-adding code change that enhances its existing functionality. and removed type: bug A code related bug. labels Aug 3, 2022
@AzimovZaur

AzimovZaur commented Mar 20, 2023

Hi, are you planning to change this parameter so that it forces Vector to ignore old files and not open a handle for them?

@AzimovZaur

Hi,
I made a change in /lib/file-source/src/file_watcher/mod.rs so that the FileWatcher.new() function returns a FileWatcher structure with is_dead: true when the too_old variable is true.
With this change, old files are not opened, and Vector works with directories that contain a lot of old files.
But there are 2 problems:

  1. If a record happens to be added to an old file, the file is re-read from the beginning (the read_from = "end" option is ignored)
  2. Once a file has been put on the watcher, it is not removed from the watcher during the ignore_older_secs interval (or after what interval will it be removed from the watcher?)

It would be very good if such functionality appeared in future versions, since right now Vector cannot be used with directories that contain a lot of old files because of the large number of open files

@ethack

ethack commented Aug 16, 2023

I've been using a workaround where, periodically, I add patterns for old files to my exclude config so that it doesn't consider older files.

exclude = [
    # hack to reduce start-up time and file descriptor usage
    "**/2022-*",
    "**/2023-01-*",
    "**/2023-02-*",
    "**/2023-03-*",
    "**/2023-04-*",
    "**/2023-05-*",
    "**/2023-06-*",
]

This is a pain to keep up to date and it would be great if I only had to set ignore_older_secs.
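Until that is possible, a list like the one above could be generated rather than edited by hand. The following is only a sketch: `exclude_patterns` is a hypothetical helper, it assumes file names start with a `YYYY-MM-` prefix as in the patterns above, and the output still has to be written into the Vector config by some other step.

```python
from datetime import date


def exclude_patterns(first: date, today: date, keep_months: int = 2) -> list[str]:
    """Build glob patterns for every month from `first` up to the cutoff.

    The current month and the (keep_months - 1) months before it are kept.
    A fully excluded calendar year collapses into a single pattern.
    """
    start = first.year * 12 + (first.month - 1)
    # Exclusive upper bound: the first month that must stay included.
    end = today.year * 12 + (today.month - 1) - (keep_months - 1)

    months_by_year: dict[int, list[int]] = {}
    for index in range(start, end):
        months_by_year.setdefault(index // 12, []).append(index % 12 + 1)

    patterns = []
    for year, months in months_by_year.items():
        if len(months) == 12:
            patterns.append(f"**/{year}-*")
        else:
            patterns.extend(f"**/{year}-{month:02d}-*" for month in months)
    return patterns
```

Called with `first=date(2022, 1, 1)` and `today=date(2023, 8, 16)`, this produces the same seven patterns as the hand-written list above.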

@jesseorr

It would be nice if this file sink syntax:

[sinks.my_sink_id]
type = "file"
inputs = [ "my-source-or-transform-id" ]
path = "/tmp/vector-%Y-%m-%d.log"

Would work to apply a date string variable to the file source, such as this:

[sources.my_source_id]
type = "file"
include = [ "/var/log/**/%Y-%m-%d*.log" ]

This would let Vector include files with timestamps that are generated in real time by the remote applications, while ignoring the older files. This would look like an enhanced glob match where some variables are included and must be resolved prior to the glob string being applied.

This would solve @ethack's issue, as well as a number of other applications that I've seen where datestamps are included in the active log file and where rotation is frequent, leading to many open file handles and heavy load by Vector to watch these files despite tuning.
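To illustrate the "resolve the variables, then apply the glob" idea, here is a small sketch using Python's standard library rather than anything in Vector. `resolve_include` is a hypothetical function, and the sketch assumes the paths contain no literal `%` characters other than the date specifiers.

```python
import glob
from datetime import datetime


def resolve_include(patterns, now):
    """Expand strftime specifiers in each pattern, then apply it as a glob."""
    matches = set()
    for pattern in patterns:
        resolved = now.strftime(pattern)  # e.g. %Y-%m-%d becomes 2023-08-16
        matches.update(glob.glob(resolved, recursive=True))
    return sorted(matches)
```

Because the pattern is re-resolved on every discovery pass, yesterday's files stop matching as soon as the date rolls over, without any change to the config.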

@jszwedko
Member Author

Coming back to this, I think that Vector could avoid keeping an open file handle to ignored files by:

  • On startup, only open files where there is new data by comparing the checkpoint offset with the size of the file
  • When an EOF is reached and there are no new writes for some configurable period of time, close the file and only reopen it if there are new writes

@lukesteensen curious if you have thoughts.
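The startup check in the first bullet might be sketched like this. It is an illustration under simplifying assumptions: `files_to_open` is a hypothetical function and checkpoints are keyed by path here, whereas the discussion above suggests a real implementation would also need to compare fingerprints.

```python
import os


def files_to_open(checkpoints: dict[str, int], paths: list[str]) -> list[str]:
    """Return the paths that need a file handle at startup.

    A file is skipped only when its size equals the checkpointed offset,
    meaning everything in it has already been read.
    """
    to_open = []
    for path in paths:
        offset = checkpoints.get(path)
        size = os.stat(path).st_size  # no handle is needed for this check
        if offset is None:
            to_open.append(path)  # never seen before
        elif size != offset:
            to_open.append(path)  # grew (new data) or shrank (truncated)
    return to_open
```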

@lukesteensen
Member

Yeah, I think the best solution is likely to introduce another state to the file watcher where we still checkpoint the EOF but don't hold an active file handle. Right now it's basically all or nothing: we're either actively watching with an open handle or we ignore it entirely via exclude. Some kind of passive watching state for old/idle files would allow us to retain checkpoints in case the file receives future writes, but not take up a file handle and time polling for reads.
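One way to picture the passive state is the sketch below. It is not based on Vector's actual `FileWatcher`: the `Watcher` class, its fields, and the `idle_timeout` parameter are invented for illustration, and truncation and rotation are deliberately not handled.

```python
import os
from dataclasses import dataclass
from typing import BinaryIO, Optional


@dataclass
class Watcher:
    path: str
    offset: int = 0                      # checkpointed read position
    handle: Optional[BinaryIO] = None    # None means the passive state
    idle_since: Optional[float] = None   # when the first empty read happened

    def poll(self, now: float, idle_timeout: float) -> bytes:
        """Read any new data, moving between the active and passive states."""
        if self.handle is None:
            # Passive: one stat call, no file descriptor held.
            if os.stat(self.path).st_size <= self.offset:
                return b""
            # The file grew, so become active again at the checkpoint.
            self.handle = open(self.path, "rb")
            self.handle.seek(self.offset)

        data = self.handle.read()
        if data:
            self.offset += len(data)
            self.idle_since = None
        elif self.idle_since is None:
            self.idle_since = now
        elif now - self.idle_since >= idle_timeout:
            # Idle for too long: keep the checkpoint, release the handle.
            self.handle.close()
            self.handle = None
            self.idle_since = None
        return data
```

The checkpoint (`offset`) survives the transition to the passive state, which is what allows later writes to be picked up without re-reading old data.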

@fitz123

fitz123 commented Mar 4, 2024

Our workflow involves generating numerous files daily, with only a few requiring active updates. The necessity to manually update the "exclude" list to manage resources effectively has become a significant operational burden.

A feature that allows Vector to intelligently ignore files based on their modification date, without maintaining open file handles, would greatly alleviate our current struggles. This would optimize resource usage and reduce manual overhead in our workflow. We strongly believe that such a feature would benefit many users facing similar challenges and hope to see it prioritized in Vector's development roadmap.
