Description
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Use Cases
We have some services that can generate an enormous amount of log events in case of failures, and we are using dedupe to protect our log aggregators from being overwhelmed.
However, in cases where a message is repeated continuously, e.g. an application keeps logging the same error, dedupe will only let the message through once, and never again. If you miss the initial message and look at the logs at a later date, you will not find any log events and may erroneously believe that nothing of importance is happening, when in reality the application may be desperately spewing out thousands of error messages.
Consider the following situation:
sources:
  log_generator:
    type: demo_logs
    format: shuffle
    lines:
      - 'greetings'
      - 'welcome'
      - 'hello'
    interval: 0.2
transforms:
  dedupe:
    type: dedupe
    inputs:
      - log_generator
    fields:
      match:
        - message
sinks:
  output:
    type: console
    inputs:
      - dedupe
    encoding:
      codec: json

This will let each event through only once, and never again:
{"host":"localhost","message":"hello","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:33:57.977945319Z"}
{"host":"localhost","message":"greetings","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:33:58.179360057Z"}
{"host":"localhost","message":"welcome","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:33:58.778797868Z"}Attempted Solutions
We worked around it by adding a truncated timestamp to the log events so that they are output at least once a second:
sources:
  log_generator:
    type: demo_logs
    format: shuffle
    lines:
      - 'greetings'
      - 'welcome'
      - 'hello'
    interval: 0.2
transforms:
  add_secs_timestamp:
    type: remap
    inputs:
      - log_generator
    source: |
      .secs_timestamp = to_string!(to_unix_timestamp!(.timestamp, "seconds"))
  dedupe:
    type: dedupe
    inputs:
      - add_secs_timestamp
    fields:
      match:
        - message
        - secs_timestamp
sinks:
  output:
    type: console
    inputs:
      - dedupe
    encoding:
      codec: json

This will make sure that we see the event at least once a second:
{"host":"localhost","message":"hello","secs_timestamp":"1744188513","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:48:33.446753593Z"}
{"host":"localhost","message":"welcome","secs_timestamp":"1744188513","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:48:33.647865211Z"}
{"host":"localhost","message":"hello","secs_timestamp":"1744188514","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:48:34.047107454Z"}
{"host":"localhost","message":"welcome","secs_timestamp":"1744188514","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:48:34.246730722Z"}
{"host":"localhost","message":"greetings","secs_timestamp":"1744188515","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:48:35.046117586Z"}
{"host":"localhost","message":"hello","secs_timestamp":"1744188515","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:48:35.246558373Z"}
{"host":"localhost","message":"greetings","secs_timestamp":"1744188516","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:48:36.046720858Z"}
{"host":"localhost","message":"welcome","secs_timestamp":"1744188516","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:48:36.246820751Z"}
{"host":"localhost","message":"hello","secs_timestamp":"1744188517","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:48:37.046530422Z"}We also considered the reduce transformation, but found it hard to use in a generic manner, as aggregation changes the log message format, which we do not want.
Proposal
It would be useful to have a way to limit how long events stay in the deduplication queue before duplicates are let through again.
Two options spring to mind:
- Let an event through every max_duplicates number of events.
- Have a maximum time expire_after_ms that an event is kept in the dedupe queue.
transforms:
  dedupe:
    type: dedupe
    inputs:
      - logs
    max_duplicates: 100
    expire_after_ms: 1000
    fields:
      match:
        - message

References
No response
Version
No response