Is the glob_minimum_cooldown_ms default of 5 seconds correct? #7840

Open
StephenWakely opened this issue Jun 11, 2021 · 12 comments
Labels: `source: kubernetes_logs` (anything `kubernetes_logs` source related), `type: task` (generic non-code related tasks)

Comments

@StephenWakely
Contributor

Some recent PRs (#7693, #7738) have changed this value from 60 seconds to 5 seconds.

However, in discussion with @binarylogic, it came up that the value of 60 seconds was chosen very carefully: 5 seconds is too fast and was causing performance issues.

I want to document here what those issues were, and we need to decide whether we want to revert this change back to 60 seconds.

@StephenWakely added the `type: task` and `source: kubernetes_logs` labels on Jun 11, 2021
@binarylogic
Contributor

cc @lukesteensen @MOZGIII since you both worked on profiling to land on this number.

@jszwedko added this to the Vector 0.15.0 milestone on Jun 16, 2021
@jszwedko
Member

We should aim to resolve this before the 0.15.0 release, so that we don't ship the new 5 second cooldown if we anticipate it could cause performance issues.

We could also just revert to the 60s cooldown, keeping the behavior as-is, until we are able to dig into this more.

@lukesteensen
Member

lukesteensen commented Jun 16, 2021

To add some context here, we made the change to 60 seconds as part of a larger set of changes to help performance in the file source. At the time, searching for new files was part of the main loop along with normal file reads. We found that in some cases (e.g. many files to fingerprint), the search time was dominating, leading to relatively few reads and therefore low throughput. Increasing this value was a simple way to spend more of the loop reading from files and less time looking for new ones.

Edit: I misremembered and the following is incorrect

Since that change, we made an additional change to remove searching for new files from the read loop completely. It now happens on an independent task and will not directly reduce the rate at which we read from the target files. There is still some concern that doing this more often will increase the total amount of work Vector is doing, and potentially add some contention for CPU time, IO resources, etc, but the original root problem has been solved more completely. This makes me think that reducing the value should be fine, but it's something to keep an eye on for sure.

@lukesteensen
Member

Realizing now that I misremembered and the above is not correct. It is the writing of the checkpoint file that was moved to a separate task, not the searching for new files. We made plans to extract a new task for that, but we have not yet done the work. That makes me a bit more hesitant to ship this lower default without doing the work to extract that task.

@spencergilbert
Contributor

Possibly related to #7934

jszwedko added a commit that referenced this issue Jun 22, 2021
Pending investigation in
#7840

This is just to make sure we don't accidentally release the new default
of 5s before we are confident in it.

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>
jszwedko added a commit that referenced this issue Jun 22, 2021
…7980)

Pending investigation in
#7840

This is just to make sure we don't accidentally release the new default
of 5s before we are confident in it.

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>
jszwedko added a commit that referenced this issue Jun 24, 2021
In #7980 we put the default back to 60000 ms until we can take a look in #7840; however, this results in the tests taking 5 hours rather than 15 minutes, so this drops the cooldown just for the tests.

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>
jszwedko added a commit that referenced this issue Jun 24, 2021
* chore(ci): Lower glob_minimum_cooldown_ms for k8s e2e tests

In #7980 we put the default back to 60000 ms until we can take a look in #7840; however, this results in the tests taking 5 hours rather than 15 minutes, so this drops the cooldown just for the tests.

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>
@StephenWakely
Contributor Author

I'm not sure the value of 60 seconds is correct either.

I experimented with this by creating a pod that essentially runs:

bash -c 'for ((i=0; i<=50000; i++)); do echo "COUNT $i"; done'

This pod seems to be creating logs that are rotated faster than every 60 seconds (running in Kind), and so entire files were being missed. Reducing this value to 5 seconds did alleviate this problem. I am, however, still dropping records early in the sequence; I haven't yet worked out why.
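
For reference, a throwaway pod along these lines can be created with a manifest like the one below; the pod name and image are illustrative assumptions, not the exact setup used here:

```yaml
# Hypothetical reproduction pod (name and image are assumptions): it writes
# ~50k lines as fast as possible so the container runtime rotates its log
# file quickly, which is what exercises the glob cooldown.
apiVersion: v1
kind: Pod
metadata:
  name: count-logger
spec:
  restartPolicy: Never
  containers:
    - name: counter
      image: bash:5 # any image with bash available works
      command: ["bash", "-c", "for ((i=0; i<=50000; i++)); do echo \"COUNT $i\"; done"]
```

If the runtime rotates the file more often than the source globs for files, a rotated file can appear and disappear entirely between two scans, which is consistent with the missed files described above.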

It is possible that the right value differs between situations, in which case it would be well worth exploring those situations in order to document them, or coming up with an alternative method that doesn't require this parameter at all.

@spencergilbert
Contributor

We do need to document it for sure; I can do that as part of the PR I'm putting together to resolve the rotation issue as well. @StephenWakely, if you want to try my fix, I can send it over to you.

@jszwedko
Member

Agreed; a secondary reason we might want to drop it is simply to start reading logs from pods more quickly. 60s could be quite a while to wait for logs.

@portante

Why are we polling for files? Why not inotify?

@jszwedko
Member

> Why are we polling for files? Why not inotify?

My understanding is that we went with polling to avoid all of the rough edges of inotify and for simple cross-platform support. @lukesteensen may be able to comment further on that.

@portante

Unfortunately, file based scrapers are inherently subject to these kinds of problems, with no way to really "win" with the architectures we have in play today.

@MOZGIII
Contributor

MOZGIII commented Sep 17, 2021

IMO, 5s should be a good default for the average, not-so-high-log-volume cluster. For clusters with high load and a lot of logs generated, we should encourage people to tweak both glob_minimum_cooldown_ms and the max log file size on the Docker / CRI side. People will effectively have to balance the two together: increase the file size so that files are rotated slowly enough to be reliably captured by the file watcher (adjusted via glob_minimum_cooldown_ms), while keeping glob_minimum_cooldown_ms high enough that it doesn't hurt performance.

I suggest doing performance benchmarks similar to what we did last time after introducing this change, because the profiles might have changed.

Of course, with an evented fd watcher it could be improved further, ultimately to the point where no tradeoff like this is needed. Maybe it is time to add that complexity to the file server?
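
For concreteness, here is a minimal sketch of the Vector side of the tuning described above; the source name and the value are placeholders to be balanced against the runtime's log rotation settings (for example, the kubelet's containerLogMaxSize / containerLogMaxFiles), not recommendations:

```toml
# Hypothetical vector.toml snippet; the value is illustrative only.
[sources.k8s]
type = "kubernetes_logs"
# How often Vector scans for new or rotated log files. Lower values pick up
# fast-rotating files sooner, at the cost of spending more time on discovery.
glob_minimum_cooldown_ms = 60000
```

The other half of the balance (rotating larger files less frequently) is configured outside Vector, on the kubelet or container runtime side.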
