[Feature Request]: Better ILogger sampling and filtering #82465

Closed
@noahfalk

Description

Problem:

Large-scale .NET customers constantly report that logging is too expensive to turn on in production, perf, or stress scenarios. Their external log storage supports sampling of logs to limit storage costs, but that doesn't address the perf impact of collecting and exporting the data from the app.

Each log level adds an order of magnitude to the number of records produced, making collection prohibitively expensive, but not collecting them reduces the chances of detecting problems early or finding further performance optimizations.

Existing partial solutions:

  • When log statements are created in code controlled by the application developer, they could add logic to make the log conditional. For example they might write if (Activity.Current?.Recorded == true) { _logger.LogInformation(...); }. This solution doesn't work for log statements in other libraries, and it feels awkward to couple logging code from all over the app to a specific log sampling strategy.
  • When logging with the console sink, developers can use QueueFullMode to drop log messages that can't be flushed fast enough. This avoids some of the performance cost of blocking on console output, but with little control over which messages are preserved, the log can fill with low-value messages while important content gets dropped.
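The QueueFullMode trade-off can be illustrated in a framework-neutral way: a bounded queue whose enqueue never blocks avoids the perf problem, but the drop decision is blind to message importance. A minimal Python sketch (the class and its names are invented for illustration, not part of any .NET API):

```python
from collections import deque

class BoundedLogQueue:
    """Illustrates drop-on-full behavior similar in spirit to the console
    sink's QueueFullMode: enqueue never blocks, but any message arriving
    while the queue is full is silently lost, regardless of importance."""

    def __init__(self, max_length: int):
        self.queue = deque()
        self.max_length = max_length
        self.dropped = 0  # count of messages lost to back-pressure

    def try_enqueue(self, message: str) -> bool:
        if len(self.queue) >= self.max_length:
            self.dropped += 1  # important or not, the message is gone
            return False
        self.queue.append(message)
        return True
```

The sketch makes the limitation concrete: a critical error arriving during a burst of chatty messages is dropped just as readily as the chatter.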

Better potential solutions:

UPDATE 1/12: I have added a more specific proposed solution in the discussion below

There are several things we might want to look into and they aren't mutually exclusive:

  • filtering by event ID - this would allow users to turn specific messages from a library on or off without necessarily seeing all other messages at the same logging level.
  • random sampling - the app developer selects some percentage of the total messages to keep, and each message uses an RNG to decide whether it will be filtered out.
  • request-based sampling - rather than deciding on a message-by-message basis what should be preserved, this approach decides for an entire request whether it will be logged. This can be an improvement over random sampling because it ensures that when you do see a message in the log, such as an error or some interesting state, you can also find all the other log messages for the same request that led up to it.
  • rate limiting - rather than trying to define a specific fraction or type of message to filter in advance, we could conditionally filter messages only when load spikes. This can be a useful safety backstop to ensure that a sudden failure or load increase doesn't spill over into other problems because the logging is doing too much IO or using too much CPU.
    • combining rate limiting and event ID to send only N messages per time window for each ID

Ideally, any or all of the above options would be configurable in code or via reloadable configuration, and they would be generic enough to work with any log source and sink. SREs may want to use config settings to mitigate an active issue where a production service is logging too much, or they may want to proactively configure these options to curate their log output. Being able to change configuration on the fly enables rapid exploration of problem hypotheses without requiring the service to be taken offline for an update.
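To make the reloadable-configuration idea concrete, a hypothetical appsettings.json fragment might look like the following. The Sampling section and every key in it are invented for illustration; no such schema exists in Microsoft.Extensions.Logging today:

```json
{
  "Logging": {
    "Sampling": {
      "Probability": 0.1,
      "PerRequest": true,
      "RateLimit": {
        "MaxPerSecond": 100,
        "PerEventId": true
      }
    }
  }
}
```

With a file-based configuration source registered with reloadOnChange enabled, edits to a section like this would take effect without restarting the service, which is what makes the on-the-fly mitigation scenario above possible.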
