Uptick in dropped events from disk buffer InvalidProtobufPayload errors #18130

Open
sbalmos opened this issue Aug 1, 2023 · 11 comments
Labels: domain: buffers (Anything related to Vector's memory/disk buffers), sink: splunk_hec (Anything `splunk_hec` sink related), type: bug (A code related bug)

Comments

@sbalmos
Contributor

sbalmos commented Aug 1, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Since upgrading to 0.31 (from 0.29 in the case of this instance), there has been a marked uptick in dropped events to my splunk_hec_logs sink, which is backed by a disk buffer. The error indicates the events are dropped due to an InvalidProtobufPayload error reading from disk.

2023-08-01T13:18:46.116488Z ERROR sink{component_kind="sink" component_id=staging_splunk_hec component_type=splunk_hec_logs component_name=staging_splunk_hec}: vector_buffers::internal_events: Error encountered during buffer read. error=failed to decoded record: InvalidProtobufPayload error_code="decode_failed" error_type="reader_failed" stage="processing" internal_log_rate_limit=true

I can't find it at the moment, but I seem to remember another issue or discussion where the underlying protobuf library now implements a 4MB size limit and potentially truncates messages larger than that. Maybe that is also related?

Configuration

data_dir: /vector-data-dir
acknowledgements:
  enabled: true
api:
  enabled: true
  address: 127.0.0.1:8686
  playground: false
sources:
  kafka_in:
    type: kafka
    bootstrap_servers: kafka-kafka-bootstrap.kafka:9093
    group_id: '${KAFKA_CONSUMER_GROUP_ID}'
    topics:
      - ^[^_].+
    librdkafka_options:
      "topic.blacklist": "^strimzi.+"
    decoding:
      codec: json
    sasl:
      enabled: true
      mechanism: SCRAM-SHA-512
      username: '${KAFKA_CONSUMER_USERNAME}'
      password: '${KAFKA_CONSUMER_PASSWORD}'
transforms:
  msg_router:
    type: route
    inputs:
      - kafka_in
    route:
      staging: includes(array!(.destinations), "staging")
      # a few other routes
  staging_filter:
    type: filter
    inputs:
      - msg_router.staging
    condition: .vector_metadata.exclude != true
  staging_throttler:
    type: sample
    inputs:
      - staging_filter
    rate: 20 # 5%
  staging_metadata:
    type: remap
    inputs:
      - staging_throttler
    source: |-
      .host = .vector_metadata.node
      if exists(.vector_metadata.host) {
        .host = .vector_metadata.host
      }
      .splunk.metadata.index = .vector_metadata.index
      .splunk.metadata.source = .vector_metadata.source
      .splunk.metadata.sourcetype = .vector_metadata.sourcetype
sinks:
  staging_splunk_hec:
    type: splunk_hec_logs
    inputs:
      - staging_metadata
    endpoint: https://hec.splunk.staging:8088
    default_token: '${STAGING_HEC_TOKEN}'
    encoding:
      codec: text
    index: '{{ splunk.metadata.index }}'
    source: '{{ splunk.metadata.source }}'
    sourcetype: '{{ splunk.metadata.sourcetype }}'
    acknowledgements:
      query_interval: 30
      retry_limit: 60
    request:
      timeout_secs: 1200
      retry_max_duration_secs: 300
      concurrency: adaptive
    buffer:
      type: disk
      max_size: 5368709120 # 5Gi

Version

vector 0.31.0 (x86_64-unknown-linux-gnu 0f13b22 2023-07-06 13:52:34.591204470)

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

@sbalmos sbalmos added the type: bug A code related bug. label Aug 1, 2023
@neuronull
Contributor

Hi @sbalmos, that could be related to protobuf request size limits.
Would you be able to share your config to help get a sense of what might be going on?

@sbalmos
Contributor Author

sbalmos commented Aug 2, 2023

For this instance, which is mainly a message router to different destinations, it's not that interesting. I've updated the Configuration section of the original post.

@sbalmos
Contributor Author

sbalmos commented Aug 3, 2023

Found the original issue referencing the tonic lib change that introduced a 4MB limit: #17926. Just linking it here for reference; not sure if it's related.

@pront
Contributor

pront commented Aug 8, 2023

Can you share a sample input that I can use in my local kafka producer in order to trigger this error?

@sbalmos
Contributor Author

sbalmos commented Aug 8, 2023

I do not have one, since I can't backtrack which input causes the error and thus ends up being dropped by the buffer / sink. My best guess is that you could trigger it by creating a massive input event of some sort, something that will definitely end up being over 4MB in size whether encoded in protobuf or JSON.
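
If it helps with reproduction, something along these lines should push an oversized message through the same pipeline (just a sketch; the broker address and test-topic are placeholders, and the broker's message.max.bytes would likely need raising for a message this large):

  # Generate one ~5MB single-line JSON document and produce it as a single Kafka message via kcat.
  python3 -c 'import json; print(json.dumps({"destinations": ["staging"], "message": "x" * (5 * 1024 * 1024)}))' \
    | kcat -P -b localhost:9092 -t test-topic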

@jszwedko
Member

I think that 4 MB limit was only added for the tonic gRPC server for decoding incoming requests, not for protobuf decoding generally 🤔

@sbalmos it sounds like you are seeing more of these errors in 0.31.0 vs 0.29.0? If so, I'm wondering if we could try to bisect down to identify a specific commit that causes the issue.
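
For reference, a rough bisect flow would look something like this (a sketch; it assumes a local checkout of vectordotdev/vector and that repro.yaml is a config that reproduces the error):

  git clone https://github.com/vectordotdev/vector && cd vector
  git bisect start
  git bisect bad v0.31.0
  git bisect good v0.29.0
  # At each commit git bisect checks out, build and run against the repro config,
  # then mark it with `git bisect good` or `git bisect bad` depending on whether
  # the InvalidProtobufPayload errors appear.
  cargo build --release
  target/release/vector --config ../repro.yaml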

@pront pront added the sink: splunk_hec Anything `splunk_hec` sink related label Aug 11, 2023
@jszwedko
Member

Looking at this again, it sounds like it could be the case that events written to a buffer by v0.29.0 couldn't be read by v0.31.0. It'd be worth trying that as a stand-alone test case.
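
A rough way to set that up (a sketch; it assumes the 0.29.0 and 0.31.0 release binaries are available locally as ./vector-0.29.0 and ./vector-0.31.0, and uses a deliberately unreachable HEC endpoint so events accumulate in the disk buffer) would be to save something like this as repro.yaml:

data_dir: /tmp/vector-buffer-test
sources:
  gen:
    type: demo_logs
    format: json
    interval: 0.001
sinks:
  hec:
    type: splunk_hec_logs
    inputs:
      - gen
    endpoint: https://127.0.0.1:1 # unreachable on purpose
    default_token: dummy
    healthcheck:
      enabled: false
    encoding:
      codec: text
    buffer:
      type: disk
      max_size: 1073741824 # 1Gi

and then run the two versions back to back against the same data_dir:

  mkdir -p /tmp/vector-buffer-test
  ./vector-0.29.0 --config repro.yaml   # let it buffer for a minute, then Ctrl-C
  ./vector-0.31.0 --config repro.yaml   # watch for InvalidProtobufPayload on buffer read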

@sbalmos
Contributor Author

sbalmos commented Sep 14, 2023

Gah, haven't really had time yet to get back to this, totally my fault. We've since moved everything to 0.32.1 and the issue is still present, so it's not a disk buffer format backwards incompatibility. I've got it on my todo list to trace back the usages of InvalidProtobufPayload to see if anything might make sense.

@tobz
Contributor

tobz commented Sep 14, 2023

If you're still seeing the problem on fresh installs of 0.32, with fresh disk buffers, then I would agree that it's not related to compatibility issues of the disk buffer files between versions.

Technically speaking, the error is most commonly expected due to something like a compatibility issue (a mismatch between the data and the Protocol Buffers definition, etc.), but it can also be triggered purely from the perspective of "is this even valid Protocol Buffers data at all?" It sounds like you're experiencing the latter.

If there's any more information you can share at all, it would be appreciated/helpful. Stuff like the number of events it reports as dropping in a single go (which would be a proxy for the record size in the buffer itself), the size of input events if you're able to roughly calculate that, etc.
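
On the event-size question, one rough way to capture it would be an extra remap stage in front of the sink that records an approximate encoded size per event (a sketch; staging_sizer and the event_size field are just illustrative names):

transforms:
  staging_sizer:
    type: remap
    inputs:
      - staging_metadata
    source: |-
      # Approximate per-event size: length of the JSON-encoded event.
      .vector_metadata.event_size = length(encode_json(.))

with the sink's inputs pointed at staging_sizer instead of staging_metadata. For the dropped-event counts, the internal_metrics source should expose per-component discarded-event counters (e.g. component_discarded_events_total) that can be tracked over time.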

@fpytloun
Contributor

fpytloun commented May 24, 2024

I just hit this error on multiple of my Vector instances after a restart, which then fail to start:

2024-05-24T09:09:52.916943Z ERROR vector::topology::builder: Configuration error. error=Sink "out_vector_high": error occurred when building buffer: failed to build individual stage 0: failed to seek to position where reader left off: failed to decoded record: InvalidProtobufPayload

Is there at least a way to work around this? E.g. to detect the broken buffer and delete it. I tried vector validate but it passes fine.

The best thing I've got is this:

# Run vector briefly: if it exits on its own within 10 seconds (so `timeout` does not
# return 124), assume the disk buffer is corrupted and wipe it.
timeout 10 vector --graceful-shutdown-limit-secs 1 -t 1; [ $? -eq 124 ] || rm -rf /var/lib/vector/buffer

Boot getting stuck due to buffer corruption is the worst-case scenario. In my environment it is better to wipe out a broken buffer than to keep failing.

@fpytloun
Contributor

fpytloun commented May 31, 2024

Another workaround is to add this to the entrypoint before starting the vector process:

  # Compare buffer.db's mtime against the time of the previous startup; if it has
  # not been touched since then, assume the buffer is stuck/corrupted and remove it.
  now=$(date +%s)
  last_startup=$(cat /var/lib/vector/startup 2>/dev/null || echo 0)
  last_startup_age=$(( (now - last_startup) / 60 ))

  if find /var/lib/vector/buffer -type f -name buffer.db -mmin +$last_startup_age | grep buffer.db >/dev/null; then
    log_error "Cleaning vector buffer as it was not updated since last startup ($last_startup_age minutes) to fix startup in case of buffer corruption"
    rm -rf /var/lib/vector/buffer
  fi

  # Record this startup time for the next run.
  echo "$now" > /var/lib/vector/startup

It simply deletes the buffer directory if buffer.db was not modified since the last startup.
