Is the `glob_minimum_cooldown_ms` default of 5 seconds correct? #7840
cc @lukesteensen @MOZGIII since you both worked on profiling to land on this number.
We should aim to resolve this before the 0.15.0 release, before we ship the new 5 second cooldown, if we anticipate it could cause performance issues. We could also just revert to the 60 s cooldown, keeping the behavior as-is, until we are able to dig into this more.
Edit: I misremembered and the following is incorrect.

To add some context here, we made the change to 60 seconds as part of a larger set of changes to help performance in the file source. At the time, searching for new files was part of the main loop along with normal file reads. We found that in some cases (e.g. many files to fingerprint), the search time was dominating, leading to relatively few reads and therefore low throughput. Increasing this value was a simple way to spend more of the loop reading from files and less time looking for new ones.
Realizing now that I misremembered and the above is not correct. It is the writing of the checkpoint file that was moved to a separate task, not the searching for new files. We made plans to extract a new task for that, but we have not yet done the work. That makes me a bit more hesitant to ship this lower default without doing the work to extract that task.
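To make the tradeoff concrete, here is a minimal sketch (in Python, with hypothetical names — not Vector's actual code) of a combined loop where discovery only runs once the cooldown has elapsed, so a lower cooldown finds new files sooner at the cost of spending more of each loop searching instead of reading:

```python
import time

def file_server_tick(last_glob, cooldown_s, discover, read_files):
    """One simplified iteration of a combined read/discover loop.

    Discovery (globbing for new files) only runs once the cooldown
    has elapsed; normal reads happen on every iteration.
    """
    now = time.monotonic()
    if now - last_glob >= cooldown_s:
        discover()   # expensive when there are many files to fingerprint
        last_glob = now
    read_files()     # regular file reads
    return last_glob

events = []
# With a long cooldown, a tick right after the last glob only reads:
file_server_tick(time.monotonic(), 60.0,
                 lambda: events.append("glob"),
                 lambda: events.append("read"))
# With a zero cooldown, every tick also pays the discovery cost:
file_server_tick(0.0, 0.0,
                 lambda: events.append("glob"),
                 lambda: events.append("read"))
print(events)  # ['read', 'glob', 'read']
```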
Possibly related to #7934
Pending investigation in #7840. This is just to make sure we don't accidentally release the new default of 5 s before we are confident in it. Signed-off-by: Jesse Szwedko <jesse@szwedko.me>
* chore(ci): Lower glob_minimum_cooldown_ms for k8s e2e tests. In #7980 we put the default back to 60000 ms until we can take a look in #7840; however, this results in the tests taking 5 hours rather than 15 minutes, so this drops the cooldown just for the tests. Signed-off-by: Jesse Szwedko <jesse@szwedko.me>
I'm not sure the value of 60 seconds is correct either. I experimented with this by creating a pod that essentially runs:
This pod seems to be creating logs that are rotated faster than every 60 seconds (running in Kind), and so entire files were being missed. Reducing this value to 5 seconds did alleviate the problem. I am, however, still dropping records early on in the sequence; I haven't yet worked out why. It is possible that the right value differs between situations, in which case it would be well worth exploring those situations to document them — or coming up with an alternative method that doesn't require this parameter at all.
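The command the pod runs was not captured in this thread; as a hypothetical stand-in, the problematic workload can be imitated by a script that rotates its log far faster than a 60 s glob cooldown (the file names and counts below are illustrative only):

```python
import os
import tempfile

# Write batches of lines and rotate the active log immediately,
# i.e. much faster than a 60 s discovery cooldown. A poll-based
# watcher that only globs every 60 s can miss whole generations.
log_dir = tempfile.mkdtemp()
active = os.path.join(log_dir, "app.log")
for gen in range(5):
    with open(active, "w") as f:
        for i in range(100):
            f.write(f"line {gen * 100 + i}\n")
    os.rename(active, f"{active}.{gen}")  # rotate: app.log -> app.log.N

rotated = sorted(os.listdir(log_dir))
print(rotated)  # ['app.log.0', 'app.log.1', 'app.log.2', 'app.log.3', 'app.log.4']
```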
We do need to document it for sure; I can do that as part of the PR I'm putting together to resolve the rotation issue as well. @StephenWakely if you want to try my fix I can send that over to you.
Agreed; a secondary reason we might want to drop it is simply to start reading logs from pods more quickly. 60 s could be quite a while to wait for logs.
Why are we polling for files? Why not inotify?
My understanding is that we went with polling to avoid all of the rough edges of inotify and for simple cross-platform support. @lukesteensen may be able to comment further on that.
Unfortunately, file-based scrapers are inherently subject to these kinds of problems, with no real way to "win" with the architectures we have in play today.
Imo, 5 s should be a good default for the average, not-so-high log-volume cluster. For clusters with high load and a lot of logs generated, we should encourage people to tweak both settings. I suggest doing performance benches similar to what we did last time after introducing this change, because the profiles might have changed. Of course, with an evented fd watcher it could be improved further, ultimately to the point where no tradeoff like this is needed. Maybe it is time to add that complexity to the file server?
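For anyone who needs to tune this per-deployment rather than wait on a new default, the knob is set on the file source in Vector's config. A minimal sketch, assuming a typical TOML config (source name and paths are illustrative):

```toml
[sources.app_logs]
type = "file"
include = ["/var/log/app/**/*.log"]
# Lower values discover new/rotated files sooner but spend more time
# globbing; raise this on clusters with very many files to fingerprint.
glob_minimum_cooldown_ms = 5000
```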
Some recent PRs (#7693, #7738) have changed this value from 60 seconds to 5 seconds.
However, discussing with @binarylogic, the value of 60 seconds was chosen very carefully, as 5 seconds is too fast and was causing performance issues.
I want to document here what those issues were, and we need to decide whether to revert this change back to 60 seconds.