Is the glob_minimum_cooldown_ms default of 5 seconds correct? #7840

Open
StephenWakely opened this issue Jun 11, 2021 · 12 comments
Labels: `source: kubernetes_logs` (anything `kubernetes_logs` source related), `type: task` (generic non-code related tasks)

Comments

@StephenWakely
Contributor

Some recent PRs (#7693, #7738) have changed this value from 60 seconds to 5 seconds.

However, in discussion with @binarylogic, it came up that the value of 60 seconds was chosen very carefully: 5 seconds is too fast and was causing performance issues.

I want to document here what those issues were, and we need to decide whether we want to revert this change back to 60 seconds.

@StephenWakely added the `type: task` and `source: kubernetes_logs` labels on Jun 11, 2021
@binarylogic
Contributor

cc @lukesteensen @MOZGIII since you both worked on profiling to land on this number.

@jszwedko added this to the Vector 0.15.0 milestone on Jun 16, 2021
@jszwedko
Member

We should aim to resolve this before the 0.15.0 release, so that we don't ship the new 5 second cooldown if we anticipate it could cause performance issues.

We could also just revert to the 60s cooldown, keeping the behavior as-is, until we are able to dig into this more.

@lukesteensen
Member

lukesteensen commented Jun 16, 2021

To add some context here, we made the change to 60 seconds as part of a larger set of changes to help performance in the file source. At the time, searching for new files was part of the main loop along with normal file reads. We found that in some cases (e.g. many files to fingerprint), the search time was dominating, leading to relatively few reads and therefore low throughput. Increasing this value was a simple way to spend more of the loop reading from files and less time looking for new ones.

Edit: I misremembered and the following is incorrect

Since that change, we made an additional change to remove searching for new files from the read loop completely. It now happens on an independent task and will not directly reduce the rate at which we read from the target files. There is still some concern that doing this more often will increase the total amount of work Vector is doing, and potentially add some contention for CPU time, IO resources, etc, but the original root problem has been solved more completely. This makes me think that reducing the value should be fine, but it's something to keep an eye on for sure.

@lukesteensen
Member

Realizing now that I misremembered and the above is not correct. It is the writing of the checkpoint file that was moved to a separate task, not the searching for new files. We made plans to extract a new task for that, but we have not yet done the work. That makes me a bit more hesitant to ship this lower default without doing the work to extract that task.

@spencergilbert
Contributor

Possibly related to #7934

jszwedko added a commit that referenced this issue Jun 22, 2021
Pending investigation in
#7840

This is just to make sure we don't accidentally release the new default
of 5s before we are confident in it.

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>
jszwedko added a commit that referenced this issue Jun 22, 2021
…7980)

Pending investigation in
#7840

This is just to make sure we don't accidentally release the new default
of 5s before we are confident in it.

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>
jszwedko added a commit that referenced this issue Jun 24, 2021
In #7980 we put the default back to 60000 ms until we can take a look in #7840; however, this results in the tests taking 5 hours rather than 15 minutes, so this drops the cooldown just for the tests.

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>
jszwedko added a commit that referenced this issue Jun 24, 2021
* chore(ci): Lower glob_minimum_cooldown_ms for k8s e2e tests

In #7980 we put the default back to 60000 ms until we can take a look in #7840; however, this results in the tests taking 5 hours rather than 15 minutes, so this drops the cooldown just for the tests.

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>
@StephenWakely
Contributor Author

I'm not sure the value of 60 seconds is correct either.

I experimented with this by creating a pod that essentially runs:

bash -c 'for ((i=0; i<=50000; i++)); do echo "COUNT $i"; done'

This pod seems to be creating logs that are rotated faster than every 60 seconds (running in Kind), and so entire files were being missed. Reducing this value to 5 seconds did alleviate this problem. I am, however, still dropping records early in the sequence; I haven't yet worked out why.
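
For reference, a throwaway pod along these lines can be created with a manifest like the one below; the pod name and image are illustrative assumptions, not the exact setup used here:

```yaml
# Hypothetical reproduction pod (name and image are assumptions): it writes
# ~50k lines as fast as possible so the container runtime rotates its log
# file quickly, which is what exercises the glob cooldown.
apiVersion: v1
kind: Pod
metadata:
  name: count-logger
spec:
  restartPolicy: Never
  containers:
    - name: counter
      image: bash:5 # any image with bash available works
      command: ["bash", "-c", "for ((i=0; i<=50000; i++)); do echo \"COUNT $i\"; done"]
```

If the runtime rotates the file more often than the source globs for files, a rotated file can appear and disappear entirely between two scans, which is consistent with the missed files described above.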

It is possible that the right value differs between situations, in which case it would be well worth exploring those situations in order to document them, or coming up with an alternative method that doesn't require this parameter at all.

@spencergilbert
Contributor

We do need to document it for sure; I can do that as part of the PR I'm putting together to resolve the rotation issue as well. @StephenWakely, if you want to try my fix, I can send it over to you.

@jszwedko
Member

Agreed; a secondary reason we might want to drop it is simply to start reading logs from pods more quickly. 60s could be quite a while to wait for logs.

@portante

Why are we polling for files? Why not inotify?

@jszwedko
Member

> Why are we polling for files? Why not inotify?

My understanding is that we went with polling to avoid all of the rough edges of inotify and for simple cross-platform support. @lukesteensen may be able to comment further on that.

@portante

Unfortunately, file based scrapers are inherently subject to these kinds of problems, with no way to really "win" with the architectures we have in play today.

@MOZGIII
Contributor

MOZGIII commented Sep 17, 2021

IMO, 5s should be a good default for the average, not-so-high-log-volume cluster. For clusters with high load and a lot of logs generated, we should encourage people to tweak both glob_minimum_cooldown_ms and the max log file size on the Docker / CRI side. People will effectively have to balance the two together: increase the file size so that files are rotated slowly enough to be reliably captured by the file watcher (adjusted via glob_minimum_cooldown_ms), while keeping glob_minimum_cooldown_ms high enough that it doesn't hurt performance.

I suggest doing performance benchmarks similar to what we did last time after introducing this change, because the profiles might have changed.

Of course, with an evented fd watcher it could be improved further, ultimately to the point where no tradeoff like this is needed. Maybe it is time to add that complexity to the file server?
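
For concreteness, here is a minimal sketch of the Vector side of the tuning described above; the source name and the value are placeholders to be balanced against the runtime's log rotation settings (for example, the kubelet's containerLogMaxSize / containerLogMaxFiles), not recommendations:

```toml
# Hypothetical vector.toml snippet; the value is illustrative only.
[sources.k8s]
type = "kubernetes_logs"
# How often Vector scans for new or rotated log files. Lower values pick up
# fast-rotating files sooner, at the cost of spending more time on discovery.
glob_minimum_cooldown_ms = 60000
```

The other half of the balance (rotating larger files less frequently) is configured outside Vector, on the kubelet or container runtime side.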
