Dynamic log suppression #6132
Labels: enhancement, frozen-due-to-age
Sometimes we log a shitload of data - when we're stuck retrying failing sync for lots of items, for example, or when there's a bug and we get stuck retrying something like socket accepts thousands of times a second. We do want to log most of this, but at a reasonable rate. Some requirements I have on the logging:
Failed items must be logged. At least once, and with some indication that failures are ongoing. I don't need to see every time everything fails, but a grep for something that has failed must show something.
Repeated log entries are fine if the volume is low. For example, if I manually pause and resume a folder I want log output for that in real time, even if I do it twice in succession.
We should not log repeated log entries infinitely.
I propose that we add some sort of suppression to the logging layer. Something like the following would work for me.
Assume we keep track of the number of log entries produced per minute, over at least the last three to five minutes.
Once every minute we rotate the buckets and decide whether suppression is on or off by comparing the rate to some threshold. If the rate is low enough to not cause problems (say, <100 messages/minute on average over the last five minutes) we don't do any kind of suppression; if it's higher we enable suppression. There is a log entry when we enable or disable suppression.
Assume also that we keep track of the hashes of log messages we've printed (sans timestamp etc.), in the same one-minute buckets.
When we're in suppression and a log entry matches an existing hash we don't log it and just bump a counter somewhere instead.
Once a minute we log the number of suppressed messages and clear the counter.
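The hash-based deduplication in the last two points could look something like this. It's a sketch under assumed names (`suppressor`, `Log`, `Flush`); for simplicity the seen-hash set here is a single map rather than per-minute buckets, and FNV-1a is just one cheap hash choice:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// suppressor remembers hashes of messages already printed and, while
// suppression is active, drops repeats and counts them instead.
type suppressor struct {
	seen       map[uint64]struct{}
	suppressed int
}

func newSuppressor() *suppressor {
	return &suppressor{seen: make(map[uint64]struct{})}
}

// Log prints msg unless suppression is on and the message's hash has
// been seen before, in which case it just bumps the counter.
func (s *suppressor) Log(msg string, suppressing bool) {
	h := fnv.New64a()
	h.Write([]byte(msg))
	sum := h.Sum64()
	if suppressing {
		if _, ok := s.seen[sum]; ok {
			s.suppressed++
			return
		}
	}
	s.seen[sum] = struct{}{}
	fmt.Println(msg)
}

// Flush runs once a minute: report and reset the suppressed count.
func (s *suppressor) Flush() {
	if s.suppressed > 0 {
		fmt.Printf("%d previously seen log messages suppressed since last minute\n", s.suppressed)
		s.suppressed = 0
	}
}

func main() {
	s := newSuppressor()
	s.Log("pull failed: item1", true) // first time: printed
	s.Log("pull failed: item1", true) // repeat while suppressing: counted
	s.Log("pull failed: item1", true)
	s.Flush() // prints "2 previously seen log messages suppressed since last minute"
}
```

Note that each failed item still gets printed once even under suppression, which satisfies the "a grep must show something" requirement.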
The result would be that under normal circumstances there isn't any suppression. Once we end up in a state with 100k failed items we'd log them once, trip the threshold, and from then on just say "100k previously seen log messages suppressed since $last_minute" and that'd be that.
The cost here would be keeping hashes of the last few minutes' log entries. In normal operation that's roughly zero. When there's a shitload of entries we will incur some memory cost and may want a limit. If we limit ourselves to say 1M log messages tracked that might be around 32 megabytes of hashes and the same again in map overhead and whatnot. I think that's an OK price to pay. There could of course be an off switch.
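One minimal way to enforce that limit, as a sketch: once the set holds the cap's worth of entries, stop remembering new hashes, so messages beyond that point are simply always printed rather than tracked. The function and limit here are illustrative only:

```go
package main

import "fmt"

// remember adds sum to the seen set unless the set is already at its
// size limit. It reports whether sum was already tracked, i.e. whether
// a matching message could be suppressed.
func remember(seen map[uint64]struct{}, sum uint64, limit int) bool {
	if _, ok := seen[sum]; ok {
		return true // already tracked: candidate for suppression
	}
	if len(seen) < limit {
		seen[sum] = struct{}{} // still under the cap: start tracking
	}
	return false
}

func main() {
	seen := make(map[uint64]struct{})
	for i := uint64(0); i < 10; i++ {
		remember(seen, i, 4)
	}
	fmt.Println(len(seen)) // prints 4: the set never grows past the limit
}
```

With a 1M-entry cap and 32-byte hashes this bounds the hash storage at the ~32 MB estimated above, trading away deduplication for anything past the cap.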
As a special case we could always do a comparison to exactly the last message printed and not repeat that, instead saying "last message repeated 48736 times" once every minute or so. This is just for when we get stuck in some buggy accept loop.
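This special case needs no hash set at all, just the previous message and a counter. A sketch with made-up names (`repeatSquasher` and friends); the periodic tick that would also call `flush` once a minute is omitted:

```go
package main

import "fmt"

// repeatSquasher collapses immediate repeats of the last printed
// message into a single "last message repeated N times" line.
type repeatSquasher struct {
	last    string
	repeats int
}

// Log prints msg unless it is identical to the previous message, in
// which case it only increments the repeat counter.
func (r *repeatSquasher) Log(msg string) {
	if msg == r.last {
		r.repeats++
		return
	}
	r.flush()
	r.last = msg
	fmt.Println(msg)
}

// flush emits the pending repeat count; it runs when the message
// changes, and would also run on the once-a-minute tick.
func (r *repeatSquasher) flush() {
	if r.repeats > 0 {
		fmt.Printf("last message repeated %d times\n", r.repeats)
		r.repeats = 0
	}
}

func main() {
	r := &repeatSquasher{}
	r.Log("accept failed: EMFILE") // printed
	r.Log("accept failed: EMFILE") // squashed
	r.Log("accept failed: EMFILE") // squashed
	r.Log("listener restarted")    // prints "last message repeated 2 times", then the new message
}
```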