
Dynamic log suppression #6132

Closed
calmh opened this issue Nov 7, 2019 · 1 comment
Labels
enhancement: New features or improvements of some kind, as opposed to a problem (bug)
frozen-due-to-age: Issues closed and untouched for a long time, together with being locked for discussion

Comments

calmh (Member) commented Nov 7, 2019

Sometimes we log a shitload of data - when we're stuck retrying a failing sync for lots of items, for example, or when there's a bug and we get stuck retrying something like socket accepts thousands of times a second. We do want to log most of this, but at a reasonable rate. Some requirements I have for the logging:

  • Failed items must be logged. At least once, and with some indication that failures are ongoing. I don't need to see every time everything fails, but a grep for something that has failed must show something.

  • Repeated log entries are fine if the volume is low. For example, if I manually pause and resume a folder I want log output for that in real time, even if I do it twice in succession.

  • We should not log repeated log entries indefinitely.

I propose that we add some sort of suppression to the logging layer. Something like the following would work for me.

  1. Assume we keep track of the number of log entries produced per minute, over at least the last three to five minutes.

  2. Once every minute we rotate the buckets and decide whether suppression is on or off by comparing the rate to some threshold. If the rate is low enough to not cause problems (say, <100 messages/minute on average over the last five minutes), we don't do any kind of suppression; if it's higher, we enable suppression. There is a log entry when we enable or disable suppression.

  3. Assume also that we keep track of the hashes of log messages we've printed (sans timestamp etc), in the same one minute buckets.

  4. When we're in suppression and a log entry matches an existing hash, we don't log it and just bump a counter somewhere instead.

  5. Once a minute we log the number of suppressed messages and clear the counter.

The result would be that under normal circumstances there isn't any suppression. Once we end up in a state with 100k failed items we'd log them once, trip the threshold, next time just say "100k previously seen log messages suppressed since $last_minute" and that'd be that.

The cost here would be keeping hashes of the last few minutes' log entries. In normal operation that's roughly zero. When there's a shitload of entries we will incur some memory cost and may want a limit. If we limit ourselves to, say, 1M log messages tracked, that might be maybe 32 megabytes of hashes and the same again in map overhead and whatnot. I think that's an OK price to pay. There could of course be an off switch.
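The estimate is straightforward arithmetic, assuming 32-byte hashes (e.g. SHA-256); the 1M cap is the hypothetical limit from the paragraph above:

```go
package main

import "fmt"

func main() {
	const (
		trackedMessages = 1_000_000 // hypothetical cap on tracked messages
		hashSize        = 32        // bytes per hash, e.g. SHA-256
	)
	// Raw hash storage only; map buckets and bookkeeping roughly double it.
	fmt.Println(trackedMessages * hashSize) // 32000000 bytes, i.e. ~32 MB
}
```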

As a special case we could always do a comparison to exactly the last message printed and not repeat that, instead saying "last message repeated 48736 times" once every minute or so. This just for when we get stuck in some buggy accept loop.
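This special case needs none of the hashing machinery, just the previous message and a counter. A minimal sketch, with illustrative names; in practice the flush would be driven by the once-a-minute timer rather than by the next distinct message:

```go
package main

import "fmt"

// repeatFilter collapses consecutive identical messages into a single
// "last message repeated N times" line. Illustrative only.
type repeatFilter struct {
	last    string
	repeats int
}

// log prints msg unless it exactly repeats the previous message, in
// which case it only bumps the counter.
func (f *repeatFilter) log(msg string) {
	if msg == f.last {
		f.repeats++
		return
	}
	f.flush()
	f.last = msg
	fmt.Println(msg)
}

// flush reports accumulated repeats and resets the counter.
func (f *repeatFilter) flush() {
	if f.repeats > 0 {
		fmt.Printf("last message repeated %d times\n", f.repeats)
		f.repeats = 0
	}
}

func main() {
	f := &repeatFilter{}
	f.log("accept failed: too many open files")
	for i := 0; i < 48736; i++ {
		f.log("accept failed: too many open files") // buggy accept loop
	}
	f.log("listener restarted")
	// prints:
	// accept failed: too many open files
	// last message repeated 48736 times
	// listener restarted
}
```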

@calmh calmh added the enhancement New features or improvements of some kind, as opposed to a problem (bug) label Nov 7, 2019
calmh added a commit to calmh/syncthing that referenced this issue Nov 7, 2019
With this change, certain identical log messages are filtered out,
reducing log volume when there are large numbers of repeats.
imsodin (Member) commented Nov 30, 2019

From #6134 (comment) by calmh:

So, I think this works, but do we need it? We have fixed the puller output, which is the large logging culprit in the normal case. We have added rotation, which avoids murdering Windows folks' systems by default. We can still get stuck in logging loops due to bugs, but the rotation will act as a fuse for that. For people with systemd/syslog setups that's unfortunate, but their log systems can do deduplication as required... I think we're good without this for now.

@imsodin imsodin closed this as completed Nov 30, 2019
@st-review st-review added the frozen-due-to-age Issues closed and untouched for a long time, together with being locked for discussion label Nov 30, 2020
@syncthing syncthing locked and limited conversation to collaborators Nov 30, 2020