
dedupe hides events that occur continuously #22847

@ilinas

Description

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

We have some services that can generate an enormous amount of log events in case of failures, and we are using dedupe to protect our log aggregators from being overwhelmed.

However, in cases where a message is repeated continuously, e.g. an application keeps logging the same error, dedupe will only let the message through once, and never again. If you miss the initial message and look at the logs at a later date, you will not find any log events and may erroneously believe that nothing of importance is happening, when in reality the application may be desperately spewing out thousands of error messages.

Consider the following situation:

sources:
  log_generator:
    type: demo_logs
    format: shuffle
    lines:
      - 'greetings'
      - 'welcome'
      - 'hello'
    interval: 0.2

transforms:
  dedupe:
    type: dedupe
    inputs:
      - log_generator
    fields:
      match:
        - message

sinks:
  output:
    type: console
    inputs:
      - dedupe
    encoding:
      codec: json

This will let each event through only once, and never again:

{"host":"localhost","message":"hello","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:33:57.977945319Z"}
{"host":"localhost","message":"greetings","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:33:58.179360057Z"}
{"host":"localhost","message":"welcome","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:33:58.778797868Z"}

Attempted Solutions

We worked around this by adding a truncated timestamp to log events so that they are output at least once a second:

sources:
  log_generator:
    type: demo_logs
    format: shuffle
    lines:
      - 'greetings'
      - 'welcome'
      - 'hello'
    interval: 0.2

transforms:
  add_secs_timestamp:
    type: remap
    inputs:
      - log_generator
    source: |
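      # Truncate the timestamp to whole seconds; including this field in the
      # dedupe match key makes a repeated message "unique" again once per second.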
      .secs_timestamp = to_string!(to_unix_timestamp!(.timestamp, "seconds"))

  dedupe:
    type: dedupe
    inputs:
      - add_secs_timestamp
    fields:
      match:
        - message
        - secs_timestamp

sinks:
  output:
    type: console
    inputs:
      - dedupe
    encoding:
      codec: json

This makes sure that we see each message at least once a second:

{"host":"localhost","message":"hello","secs_timestamp":"1744188513","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:48:33.446753593Z"}
{"host":"localhost","message":"welcome","secs_timestamp":"1744188513","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:48:33.647865211Z"}
{"host":"localhost","message":"hello","secs_timestamp":"1744188514","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:48:34.047107454Z"}
{"host":"localhost","message":"welcome","secs_timestamp":"1744188514","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:48:34.246730722Z"}
{"host":"localhost","message":"greetings","secs_timestamp":"1744188515","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:48:35.046117586Z"}
{"host":"localhost","message":"hello","secs_timestamp":"1744188515","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:48:35.246558373Z"}
{"host":"localhost","message":"greetings","secs_timestamp":"1744188516","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:48:36.046720858Z"}
{"host":"localhost","message":"welcome","secs_timestamp":"1744188516","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:48:36.246820751Z"}
{"host":"localhost","message":"hello","secs_timestamp":"1744188517","service":"vector","source_type":"demo_logs","timestamp":"2025-04-09T08:48:37.046530422Z"}

We also considered the reduce transform, but found it hard to use in a generic manner, as aggregation changes the log event format, which we do not want; see the sketch below.
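
For reference, a minimal sketch of the reduce-based approach we considered (the transform name is illustrative and the values mirror the example above):

transforms:
  collapse_repeats:
    type: reduce
    inputs:
      - log_generator
    group_by:
      - message
    expire_after_ms: 1000

The flushed event is an aggregate of the whole group rather than the individual log events, so downstream consumers see a different schema than the original messages, which is exactly what we wanted to avoid.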

Proposal

It would be useful to have a way to limit how long events spend in the deduplication cache before letting them through again.

Two options spring to mind:

  1. Let an event through after every max_duplicates duplicate events.
  2. Have a maximum time expire_after_ms that an event is kept in the cache, just like the expire_after_ms option in the reduce transform.

For example, combining both options:
transforms:
  dedupe:
    type: dedupe
    inputs:
      - logs
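    # Option 1: let one duplicate through after every max_duplicates suppressed events.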
    max_duplicates: 100
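    # Option 2: expire the cache entry after expire_after_ms so the next matching event passes.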
    expire_after_ms: 1000
    fields:
      match:
        - message

References

No response

Version

No response
