[receiver/filelog] Filelog receiver missing reading log lines in high throughput scenario #35137
There are some circumstances where the receiver would not be able to keep up with log rotation. Specifically, if the files are being rotated to a location where they are not found in subsequent polling intervals, or if the log files are deleted before the receiver has had a chance to consume them (due to a rotation configuration that defines the max number of backups). It sounds like you have a high throughput scenario, so it seems possible. Are the rotated files being deleted eventually, and if so, can we establish in some way the frequency? If you can provide an estimate of the max number of backups and the average log size in bytes, then I think we can work out whether or not such a limitation is in play.

The other factor I'm wondering about is whether the debug exporter is too slow to keep up, and may in fact be causing backpressure on the receiver which unnecessarily delays reading the logs. You could consider counting the logs with the count connector instead of printing them:

receivers:
  filelog:
    include: ...
connectors:
  count:
processors:
  deltatocumulative:
exporters:
  debug:
service:
  pipelines:
    logs/in:
      receivers: [ filelog ]
      exporters: [ count ]
    metrics/out:
      receivers: [ count ]
      processors: [ deltatocumulative ] # aggregate "+1"s into cumulative count
      exporters: [ debug ] # only print the counts, rather than the logs themselves
Let me share a little more about the directory and how the log rotation is being handled. Below is the directory structure.

The logs are continuously written to

Here is the snapshot that describes how quickly

For handling the case where the debug exporter was too slow

The config you shared with
I used it without the
Since the

Open questions/ideas @djaglowski
@djaglowski It seems like this issue might be related to this issue. What do you say? Reducing the polling interval can be beneficial, but only up to a certain point. @ceevaaa I think you can utilize the fileexporter in this case to gather all the logs and verify the count.
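To make that verification concrete, here is a minimal sketch of what such a setup could look like, assuming the file exporter is available in the collector build; the output path is hypothetical:

exporters:
  file:
    path: /tmp/filelog-verification.json   # hypothetical path; count the records here and compare against the source log files

service:
  pipelines:
    logs:
      receivers: [ filelog ]
      exporters: [ file ]

Comparing the number of exported records with the number of lines written to the source files gives a direct measure of how many lines the receiver actually read.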
Just to confirm, the file is definitely moved and
I'm not sure the status of the

The count maxing out at 100 is a consequence of batches being emitted by the filelog receiver once they reach 100 logs.
I agree with this. If you're not using the
Only if the receiver is able to keep up with ingestion. For example, if you set the poll interval to 10ns but it takes 100ms to read each time, then you're essentially just telling it to start a new poll as soon as the previous one finishes. All that said, it sounds like there may be a real issue here. I'd like to reiterate my request for a decent estimate of the average log size. This would help us establish what the behavior should look like. We know:
If we know the average log size, we can compute how often the file should be rotating. This may or may not lead to an explanation, but I'd appreciate it if we can start there so I can understand the scenario better.
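As a rough illustration with made-up numbers (the real average log size and write rate were still unconfirmed at this point, and the 250 MB rotation threshold is only stated later in the thread): if the average line were about 500 bytes and the application wrote about 50,000 lines per second, the file would grow at roughly 25 MB/s, so a 250 MB rotation threshold would be reached about every 10 seconds.

rotation interval ≈ 250 MB / (500 B/line × 50,000 lines/s)
                  ≈ 250 MB / 25 MB/s
                  ≈ 10 s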
@VihasMakwana I don't see the connection. That issue requires there to be multiple separate files. This issue is about a single file written to by multiple processes.
Here are the updated details about the logs (average taken over 5 different log files) @djaglowski.
I have asked the team to confirm this. For reference, we are using
I used the

Update as on Sep 16

We set the
At least the limited testing we could do today (manually picking random

I will continue to test and update you guys here.

PS: Keeping it open for some more time, till we get everything stabilized.
I see. What I meant to say is that reducing the poll interval will only help to a certain extent; if the file is large, it might not respect the poll_interval and could still cause issues.
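For reference, a minimal sketch of the setting under discussion; the values are hypothetical and only illustrate that a shorter interval removes idle time between polls, it does not make a slow poll finish faster:

receivers:
  filelog:
    include: [ /path/to/app/*.log ]   # hypothetical path
    start_at: beginning
    poll_interval: 50ms               # default is 200ms; if reading takes longer than this, the next poll simply starts as soon as the previous one finishes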
""Once the file reaches 250MB in size, it is renamed and archived (rotated), and a new log file begins."" - team Therefore no truncation takes place @djaglowski. |
Also, what do you guys think is better, especially in high throughput scenarios?

A. Reading logs from a file using the filelog receiver.

I know the answer might depend on multiple factors, but in general, what is better?
In theory it is more efficient to bypass the files, since you then don't have to write, read, and reinterpret the logs again.
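As one hedged example of what bypassing the files could look like (this assumes the application can emit OTLP directly; the endpoint value is just a placeholder):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # applications send logs here instead of writing them to disk first

service:
  pipelines:
    logs:
      receivers: [ otlp ]
      exporters: [ debug ]

Whether this is practical depends on the application; for processes that can only write to files, the filelog receiver remains the usual option.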
This issue has been closed as inactive because it has been stale for 120 days with no activity.
@djaglowski can we get this issue reopened/tagged as an issue on the container parser instead? I think I've found an issue in the way that the container parser and the filelog receiver interact.

We have a high throughput service with lots of log file rotations as well, and I've done some digging and think I've found evidence of an issue. My current theory is that the log file rotation done by Kubernetes is causing some of the last log file lines to be dropped from the collection. This leads to some missed % of logs based on the polling interval of the filelog receiver.

Normally, the filelog receiver keeps file handles open across polls, so even if the log file is renamed as part of rotation, the filelog receiver will still try to finish collecting log lines from the file. However, the container parser specifically checks the file name when it tries to parse each line, and if it fails to match the pattern of known log files then it will reject the log line and give this error:

2025-02-19T23:35:47.074Z error reader/reader.go:214 failed to process token
{
  "kind": "receiver",
  "name": "filelog",
  "data_type": "logs",
  "component": "fileconsumer",
  "path": "/var/log/pods/prod_apiserver-99b84bb6d-kv82j_1f5e5133-93e6-4742-bd6b-bf7be8f1d229/apiserver/0.log.20250219-233547",
  "error": "failed to handle attribute mappings: failed to detect a valid log path"
}

This means that any log lines that appear between the last poll on a current

With Kubernetes defaults, the file log rotation check happens every 10s and the default file log max size is 10Mb. That means that in high throughput services that produce more than 10Mb of logs every 10s, the file is being rotated as often as possible (every 10s). So the percentage of logs that land in the above interval would be directly correlated with the poll interval divided by the rotation interval (200ms / 10s ~ 0.02, or 2%). The longer a service goes between log file rotations, the lower the % of dropped logs would be.

I did some testing at various polling intervals in a minikube setup with a mock log generator that rotated files every 10s and generated 400,000 logs over 5 minutes. I found the following log losses

I also spot checked a bunch of the rotated log files off the actual host

Proposed Solution

I think that the fix for this might be as simple as updating the regex pattern that's used during the file check to include the recently rotated file log format as well, but I wanted to confirm that, even if the log file gets renamed as part of parsing, the filelog receiver won't interpret that as a brand new file and reingest all of it, and will correctly only ingest the remaining lines that it hasn't read yet.
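For context, a minimal sketch of the kind of configuration in which this interaction shows up, assuming a standard Kubernetes node layout (the include glob is the conventional one, not taken from this issue):

receivers:
  filelog:
    include: [ /var/log/pods/*/*/*.log ]
    start_at: beginning
    operators:
      - type: container   # parses the runtime log format and derives k8s attributes from the file path; this is where the path check described above happens

service:
  pipelines:
    logs:
      receivers: [ filelog ]
      exporters: [ debug ]

Once rotation renames 0.log to something like 0.log.20250219-233547, the path no longer matches the expected pattern, which produces the "failed to detect a valid log path" error shown above.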
@obs-gh-alexlew That seems like a valid problem and it should be fixed. Thanks for your detailed analysis. I'm reopening this issue and will wait for @djaglowski's thoughts.
I agree that sounds like a problem. I would expect that rotated files continue to report the original file attributes.
Okay cool! Should be a small fix then to update that regex. I can try to take this on and open a PR.
I see #38440 does this, but I think your diagnosis of the problem pointed at a root cause that would be a better solution:
I think it's worth exploring whether the original file name can be kept, rather than associating the logs with the rotated file. If we'd been a bit faster to read them, they would be, so it seems odd that this should change depending on marginal timing issues. I'm going to approve #38440 because it offers an immediate solution to this particular situation and doesn't have much downside (presumably just a minor performance hit on the regex eval). However, I'll open another issue describing the change which would resolve the root cause.
Component(s)
receiver/filelog
What happened?
Description
I am trying to read logs from a log file.
A bit about the log files
Is there a limit on how quickly the filelog receiver can read logs from a file?
A snippet from the log file
Expected Result
All the log lines are read, processed, and sent to the next hop.
Actual Result
The filelog receiver misses reading a lot of log lines. In fact, almost 70% of the log lines are missed while reading.
Collector version
v0.104.0
Environment information
Environment
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
CPU: 4vCPU
RAM: 8GB
OpenTelemetry Collector configuration
Internal Telemetry
In the image below, the filelog receiver skips tons of log lines.
