Uptick in dropped events from disk buffer InvalidProtobufPayload errors #18130

Open
sbalmos opened this issue Aug 1, 2023 · 11 comments
Labels: domain: buffers (Anything related to Vector's memory/disk buffers), sink: splunk_hec (Anything `splunk_hec` sink related), type: bug (A code related bug)

Comments

@sbalmos
Contributor

sbalmos commented Aug 1, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Since upgrading to 0.31 (from 0.29 in the case of this instance), there has been a marked uptick in dropped events to my splunk_hec_logs sink, which is backed by a disk buffer. The error indicates the events are dropped due to an InvalidProtobufPayload error reading from disk.

2023-08-01T13:18:46.116488Z ERROR sink{component_kind="sink" component_id=staging_splunk_hec component_type=splunk_hec_logs component_name=staging_splunk_hec}: vector_buffers::internal_events: Error encountered during buffer read. error=failed to decoded record: InvalidProtobufPayload error_code="decode_failed" error_type="reader_failed" stage="processing" internal_log_rate_limit=true

I can't find it at the moment, but I seem to remember another issue or discussion where the underlying protobuf library now implements a 4MB size limit and potentially truncates messages larger than that. Maybe that is also related?

Configuration

data_dir: /vector-data-dir
acknowledgements:
  enabled: true
api:
  enabled: true
  address: 127.0.0.1:8686
  playground: false
sources:
  kafka_in:
    type: kafka
    bootstrap_servers: kafka-kafka-bootstrap.kafka:9093
    group_id: '${KAFKA_CONSUMER_GROUP_ID}'
    topics:
      - ^[^_].+
    librdkafka_options:
      "topic.blacklist": "^strimzi.+"
    decoding:
      codec: json
    sasl:
      enabled: true
      mechanism: SCRAM-SHA-512
      username: '${KAFKA_CONSUMER_USERNAME}'
      password: '${KAFKA_CONSUMER_PASSWORD}'
transforms:
  msg_router:
    type: route
    inputs:
      - kafka_in
    route:
      staging: includes(array!(.destinations), "staging")
      # a few other routes
  staging_filter:
    type: filter
    inputs:
      - msg_router.staging
    condition: .vector_metadata.exclude != true
  staging_throttler:
    type: sample
    inputs:
      - staging_filter
    rate: 20 # 5%
  staging_metadata:
    type: remap
    inputs:
      - staging_throttler
    source: |-
      .host = .vector_metadata.node
      if exists(.vector_metadata.host) {
        .host = .vector_metadata.host
      }
      .splunk.metadata.index = .vector_metadata.index
      .splunk.metadata.source = .vector_metadata.source
      .splunk.metadata.sourcetype = .vector_metadata.sourcetype
sinks:
  staging_splunk_hec:
    type: splunk_hec_logs
    inputs:
      - staging_metadata
    endpoint: https://hec.splunk.staging:8088
    default_token: '${STAGING_HEC_TOKEN}'
    encoding:
      codec: text
    index: '{{ splunk.metadata.index }}'
    source: '{{ splunk.metadata.source }}'
    sourcetype: '{{ splunk.metadata.sourcetype }}'
    acknowledgements:
      query_interval: 30
      retry_limit: 60
    request:
      timeout_secs: 1200
      retry_max_duration_secs: 300
      concurrency: adaptive
    buffer:
      type: disk
      max_size: 5368709120 # 5Gi

Version

vector 0.31.0 (x86_64-unknown-linux-gnu 0f13b22 2023-07-06 13:52:34.591204470)

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

@sbalmos sbalmos added the type: bug A code related bug. label Aug 1, 2023
@neuronull
Contributor

Hi @sbalmos, that could be related to protobuf request size limits.
Would you be able to share your config to help get a sense of what might be going on?

@sbalmos
Contributor Author

sbalmos commented Aug 2, 2023

For this instance, which is mainly a message router to different destinations, it's not that interesting. I've updated the Configuration section of the original post.

@sbalmos
Contributor Author

sbalmos commented Aug 3, 2023

Found the original issue referencing the tonic lib change that introduced a 4MB limit: #17926. Just linking it here for reference; not sure if it's related.

@pront
Contributor

pront commented Aug 8, 2023

Can you share a sample input that I can use in my local kafka producer in order to trigger this error?

@sbalmos
Contributor Author

sbalmos commented Aug 8, 2023

I do not have one, since I can't backtrack which input causes the error and thus ends up being dropped by the buffer / sink. My best guess is that you could trigger it by creating a massive input event of some sort, something that will definitely end up being over 4MB in size whether encoded in protobuf or JSON.
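
If it helps with reproduction, something along these lines should push an oversized message through the same pipeline (just a sketch; the broker address and test-topic are placeholders, and the broker's message.max.bytes would likely need raising for a message this large):

  # Generate one ~5MB single-line JSON document and produce it as a single Kafka message via kcat.
  python3 -c 'import json; print(json.dumps({"destinations": ["staging"], "message": "x" * (5 * 1024 * 1024)}))' \
    | kcat -P -b localhost:9092 -t test-topic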

@jszwedko
Member

I think that 4 MB limit was only added for the tonic gRPC server for decoding incoming requests, not for protobuf decoding generally 🤔

@sbalmos it sounds like you are seeing more of these errors in 0.31.0 vs 0.29.0? If so, I'm wondering if we could try to bisect down to identify a specific commit that causes the issue.
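
For reference, a rough bisect flow would look something like this (a sketch; it assumes a local checkout of vectordotdev/vector and that repro.yaml is a config that reproduces the error):

  git clone https://github.com/vectordotdev/vector && cd vector
  git bisect start
  git bisect bad v0.31.0
  git bisect good v0.29.0
  # At each commit git bisect checks out, build and run against the repro config,
  # then mark it with `git bisect good` or `git bisect bad` depending on whether
  # the InvalidProtobufPayload errors appear.
  cargo build --release
  target/release/vector --config ../repro.yaml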

@pront pront added the sink: splunk_hec Anything `splunk_hec` sink related label Aug 11, 2023
@jszwedko
Member

Looking at this again, it sounds like it could be the case that events written to a buffer by v0.29.0 couldn't be read by v0.31.0. It'd be worth trying that as a stand-alone test case.
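
A rough way to set that up (a sketch; it assumes the 0.29.0 and 0.31.0 release binaries are available locally as ./vector-0.29.0 and ./vector-0.31.0, and uses a deliberately unreachable HEC endpoint so events accumulate in the disk buffer) would be to save something like this as repro.yaml:

data_dir: /tmp/vector-buffer-test
sources:
  gen:
    type: demo_logs
    format: json
    interval: 0.001
sinks:
  hec:
    type: splunk_hec_logs
    inputs:
      - gen
    endpoint: https://127.0.0.1:1 # unreachable on purpose
    default_token: dummy
    healthcheck:
      enabled: false
    encoding:
      codec: text
    buffer:
      type: disk
      max_size: 1073741824 # 1Gi

and then run the two versions back to back against the same data_dir:

  mkdir -p /tmp/vector-buffer-test
  ./vector-0.29.0 --config repro.yaml   # let it buffer for a minute, then Ctrl-C
  ./vector-0.31.0 --config repro.yaml   # watch for InvalidProtobufPayload on buffer read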

@sbalmos
Contributor Author

sbalmos commented Sep 14, 2023

Gah, haven't really had time yet to get back to this, totally my fault. We've since moved everything to 0.32.1 and the issue is still present, so it's not a disk buffer format backwards incompatibility. I've got it on my todo list to trace back the usages of InvalidProtobufPayload to see if anything might make sense.

@tobz
Contributor

tobz commented Sep 14, 2023

If you're still seeing the problem on fresh installs of 0.32, with fresh disk buffers, then I would agree that it's not related to compatibility issues of the disk buffer files between versions.

Technically speaking, the error is most commonly expected due to something like a compatibility issue (a mismatch between the data and the Protocol Buffers definition, etc.), but it can also be triggered purely from the perspective of "is this even valid Protocol Buffers data at all?" It sounds like you're experiencing the latter.

If there's any more information you can share at all, it would be appreciated/helpful. Stuff like the number of events it reports as dropping in a single go (which would be a proxy for the record size in the buffer itself), the size of input events if you're able to roughly calculate that, etc.
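
On the event-size question, one rough way to capture it would be an extra remap stage in front of the sink that records an approximate encoded size per event (a sketch; staging_sizer and the event_size field are just illustrative names):

transforms:
  staging_sizer:
    type: remap
    inputs:
      - staging_metadata
    source: |-
      # Approximate per-event size: length of the JSON-encoded event.
      .vector_metadata.event_size = length(encode_json(.))

with the sink's inputs pointed at staging_sizer instead of staging_metadata. For the dropped-event counts, the internal_metrics source should expose per-component discarded-event counters (e.g. component_discarded_events_total) that can be tracked over time.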

@fpytloun
Contributor

fpytloun commented May 24, 2024

I just hit this error on multiple of my Vector instances after a restart, which then fail to start:

2024-05-24T09:09:52.916943Z ERROR vector::topology::builder: Configuration error. error=Sink "out_vector_high": error occurred when building buffer: failed to build individual stage 0: failed to seek to position where reader left off: failed to decoded record: InvalidProtobufPayload

Is there at least a way to work around this? E.g. to detect the broken buffer and delete it. I tried vector validate but it passes fine.

The best thing I've got is this:

# Run vector briefly: if it exits on its own within 10 seconds (so `timeout` does not
# return 124), assume the disk buffer is corrupted and wipe it.
timeout 10 vector --graceful-shutdown-limit-secs 1 -t 1; [ $? -eq 124 ] || rm -rf /var/lib/vector/buffer

Boot getting stuck due to buffer corruption is the worst-case scenario. In my environment it is better to wipe out a broken buffer than to keep failing.

@fpytloun
Contributor

fpytloun commented May 31, 2024

Another workaround is to add this to the entrypoint before starting the vector process:

  # Compare buffer.db's mtime against the time of the previous startup; if it has
  # not been touched since then, assume the buffer is stuck/corrupted and remove it.
  now=$(date +%s)
  last_startup=$(cat /var/lib/vector/startup 2>/dev/null || echo 0)
  last_startup_age=$(( (now - last_startup) / 60 ))

  if find /var/lib/vector/buffer -type f -name buffer.db -mmin +$last_startup_age | grep buffer.db >/dev/null; then
    log_error "Cleaning vector buffer as it was not updated since last startup ($last_startup_age minutes) to fix startup in case of buffer corruption"
    rm -rf /var/lib/vector/buffer
  fi

  # Record this startup time for the next run.
  echo "$now" > /var/lib/vector/startup

It simply deletes the buffer directory if buffer.db was not modified since the last startup.
