Should Vector be maintaining open file handles to `ignore_older` files older than the cutoff?
#3567
Comments
So the description for `ignore_older` doesn't say that it will collect newer data from the file. That's why Vector is tailing them: it only ignores the old data, not the file itself. A clearer description would spell that out. This is the expected behavior, so just the documentation should be updated.
Hm, that's not how I thought the option worked. I think the behavior should change. The idea is to ignore older files as if they don't exist, not just their older data. This ensures that Vector does not open file handles for those files, etc.
What would you want to happen with a file that initially has not been modified since the `ignore_older` cutoff, but is written to later?
That's a good point. Ideally, we'd rediscover the file and read only the new data, but in the interim we would not hold a file handle for it.

I think I like that, but it does seem like it would make it trickier to figure out which file contents were new. I guess you could keep track of the sizes of the ignored files and then seek to that point when their mtime changes? That wouldn't handle the case where the file is overwritten, rather than appended to, though.

Yeah, you'd at least need to make sure the fingerprint was the same or weird things could happen. Another confusing case would be if Vector is started, ignores a file for an old mtime, then Vector is stopped, then the file is written to, then Vector is started again. At that point, it would (depending on the config) read the whole file from the beginning, including data it had been ignoring up to that point. Right now I think we avoid that via the normal checkpointing of open files.
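The bookkeeping discussed in the comments above — remember size, fingerprint, and mtime, then seek past the old data on rediscovery — could be sketched roughly like this. This is a hypothetical Python illustration, not Vector's actual implementation; all names and field choices are invented:

```python
import os
from dataclasses import dataclass

@dataclass
class IgnoredCheckpoint:
    fingerprint: bytes  # first bytes of the file at checkpoint time
    offset: int         # file size when it was ignored
    mtime: float        # modification time when it was ignored

def checkpoint_ignored(path: str, fp_len: int = 256) -> IgnoredCheckpoint:
    """Record where an ignored file ends, without keeping it open."""
    st = os.stat(path)
    with open(path, "rb") as f:  # opened briefly, then closed again
        fingerprint = f.read(fp_len)
    return IgnoredCheckpoint(fingerprint, st.st_size, st.st_mtime)

def read_new_data(path: str, cp: IgnoredCheckpoint) -> bytes:
    """On rediscovery: if the fingerprint still matches, return only the
    bytes appended past the checkpoint; if it doesn't (the file was
    truncated or replaced), read from the beginning."""
    st = os.stat(path)
    if st.st_mtime <= cp.mtime:
        return b""  # not modified since we ignored it; still no handle held
    with open(path, "rb") as f:
        if f.read(len(cp.fingerprint)) == cp.fingerprint and st.st_size >= cp.offset:
            f.seek(cp.offset)  # same file, appended to: skip the old data
        else:
            f.seek(0)          # different content: start over from zero
        return f.read()
```

The overwrite case lands in the `else` branch, which is exactly the "read the whole file from the beginning" behavior the comment warns about; per-file checkpoint persistence across restarts would still be needed to avoid it.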
In that case, whatever solution we come up with to avoid holding file handles should be usable for other files as well, so we can reduce total file handle usage further. That way we avoid some of the special casing and use the normal checkpointing to avoid the issues @lukesteensen mentioned.
Is there a workaround for this issue? In my use case, many short-lived jobs generate roughly 100,000 files per month, and due to business requirements we can't delete the files before then.
One way is to point Vector at a folder of symlinks to the files, and then have a service/script that creates symlinks to fresh files and deletes the older ones.
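That symlink workaround could be scripted along these lines. This is a hedged sketch only; the directory layout and the `sync_symlinks` helper are invented for illustration:

```python
import os
import time

def sync_symlinks(source_dir: str, link_dir: str, max_age_secs: float) -> None:
    """Maintain link_dir (the directory Vector watches) so it contains
    symlinks only to recently-modified files in source_dir."""
    os.makedirs(link_dir, exist_ok=True)
    cutoff = time.time() - max_age_secs
    # Prune links whose targets are gone or older than the cutoff.
    for name in os.listdir(link_dir):
        link = os.path.join(link_dir, name)
        target = os.path.realpath(link)
        if not os.path.exists(target) or os.stat(target).st_mtime < cutoff:
            os.remove(link)
    # Create links for files modified within the window.
    for name in os.listdir(source_dir):
        src = os.path.join(source_dir, name)
        link = os.path.join(link_dir, name)
        if (os.path.isfile(src)
                and os.stat(src).st_mtime >= cutoff
                and not os.path.lexists(link)):
            os.symlink(src, link)
```

Run periodically (e.g. from cron), this keeps the number of files Vector sees — and therefore the number of open handles — bounded by the recent-file count rather than the total.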
Hi, are you planning to make changes so that this parameter forces Vector to ignore old files and not open handles to them?
Hi, if such functionality appeared in future versions it would be very good. Right now the option is not usable with directories containing a large number of old files, because of the resulting number of open files.
I've been using a workaround where, periodically, I add patterns for old files to my `exclude` list:

```toml
exclude = [
  # hack to reduce start-up time and file descriptor usage
  "**/2022-*",
  "**/2023-01-*",
  "**/2023-02-*",
  "**/2023-03-*",
  "**/2023-04-*",
  "**/2023-05-*",
  "**/2023-06-*",
]
```

This is a pain to keep up to date, and it would be great if I only had to set `ignore_older`.
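Until `ignore_older` covers this, generating that exclude list can at least be automated. A hypothetical sketch that emits month-prefix globs for every month older than a retention window — the pattern shape and the January-of-last-year starting point are assumptions matching the date-stamped filenames above:

```python
from datetime import date

def stale_month_globs(today: date, keep_months: int) -> list[str]:
    """Emit '**/YYYY-MM-*' globs for every month from January of last
    year up to, but excluding, the most recent keep_months months."""
    patterns = []
    year, month = today.year - 1, 1
    # Index months as year*12 + month so the comparison is a single integer.
    cutoff_index = today.year * 12 + today.month - keep_months
    while year * 12 + month <= cutoff_index:
        patterns.append(f"**/{year:04d}-{month:02d}-*")
        month += 1
        if month > 12:
            month, year = 1, year + 1
    return patterns
```

A cron job could rewrite the `exclude` block of the config from this list and then restart or reload Vector.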
It would be nice if this file sink syntax:

```toml
[sinks.my_sink_id]
type = "file"
inputs = [ "my-source-or-transform-id" ]
path = "/tmp/vector-%Y-%m-%d.log"
```

would also work for applying a date string variable to the file source, such as:

```toml
[sources.my_source_id]
type = "file"
include = [ "/var/log/**/%Y-%m-%d*.log" ]
```

This would let Vector include files with timestamps generated in real time by the remote applications, while ignoring the older files. It would behave like an enhanced glob match where some variables are included and must be resolved prior to the glob string being applied. This would solve @ethack's issue, as well as a number of other cases I've seen where datestamps are included in the active log file name and rotation is frequent, leading to many open file handles and heavy load as Vector watches these files despite tuning.
Coming back to this, I think that Vector could avoid keeping an open file handle to ignored files by:
@lukesteensen curious if you have thoughts. |
Yeah, I think the best solution is likely to introduce another state in the file watcher where we still checkpoint the EOF but don't hold an active file handle. Right now it's basically all or nothing: we're either actively watching with an open handle, or we ignore the file entirely via `exclude`.
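That third state might look like the following in outline. This is purely illustrative pseudologic; the state names and transition rules are invented here, not Vector internals:

```python
from enum import Enum, auto

class WatchState(Enum):
    ACTIVE = auto()        # open handle, actively tailing
    CHECKPOINTED = auto()  # no handle, but offset/fingerprint remembered
    EXCLUDED = auto()      # never considered at all

def next_state(state: WatchState, matched_exclude: bool,
               older_than_cutoff: bool, mtime_advanced: bool) -> WatchState:
    """Transition rules applied on each file-discovery pass."""
    if matched_exclude:
        return WatchState.EXCLUDED
    if state is WatchState.CHECKPOINTED and mtime_advanced:
        return WatchState.ACTIVE        # rediscovered: reopen and seek
    if state is WatchState.ACTIVE and older_than_cutoff:
        return WatchState.CHECKPOINTED  # drop the handle, keep checkpoint
    return state
```

The key property is that `CHECKPOINTED` files consume no file descriptors but still participate in normal checkpointing, which is what avoids the restart problem described earlier in the thread.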
Our workflow involves generating numerous files daily, with only a few requiring active updates. The necessity to manually update the `exclude` list to manage resources effectively has become a significant operational burden. A feature that allows Vector to ignore files based on their modification date, without maintaining open file handles, would greatly alleviate our current struggles, optimize resource usage, and reduce manual overhead. We believe such a feature would benefit many users facing similar challenges and hope to see it prioritized in Vector's development roadmap.
I'm not sure if this is a bug or expected behavior, but it looks like, when using the `ignore_older` config for the `file` source, Vector still maintains an open file handle to files with a modified time before the cutoff.

From gitter: https://gitter.im/timberio-vector/community?at=5f456c28c3aa024ef99e4907 . The user was trying to limit the number of open file handles by using the `ignore_older` config.

If this is expected, we should probably call it out in the docs, as I was not expecting it. It seemingly limits the option's usefulness in avoiding resource consumption.
Vector Version
Vector Configuration File
Debug Output
Expected Behavior
Vector does not open the file
Actual Behavior
Vector opens the file
Additional Context