
Vector Kafka sink losing log events after a full Kafka cluster restart #10398

Open
marcojck opened this issue Dec 10, 2021 · 5 comments
Labels: sink: kafka (Anything `kafka` sink related) · type: bug (A code related bug.)

@marcojck
marcojck commented Dec 10, 2021

Summary

The Vector Kafka sink loses log events after a full Kafka cluster restart.

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Vector Version

vector 0.18.1 (x86_64-unknown-linux-gnu c4adb60 2021-11-30)

Vector Configuration File

data_dir: ./data

api:
  address: 127.0.0.1:8686
  enabled: true
  playground: true

sources:
  vector_metrics:
    type: internal_metrics
    namespace: vector
    scrape_interval_secs: 5

  logs:
    type: file
    acknowledgements:
      enabled: true
    include:
      - ./*.log

sinks:
  vector_metrics_prometheus_exporter:
    type: prometheus_exporter
    address: 0.0.0.0:9598
    inputs:
      - vector_metrics

  kafka:
    type: kafka
    inputs: [logs]
    bootstrap_servers: broker1:9092,broker2:9092,broker3:9092
    topic: logs
    compression: zstd
    encoding:
      codec: json
    healthcheck:
      enabled: true
    buffer:
      type: disk
      max_size: 262144000 # 250 MiB
    tls:
      enabled: true
      ca_file: ./ca.pem
      verify_certificate: true
    sasl:
      enabled: true
      mechanism: SCRAM-SHA-512
      username: vector
      password: ******
    librdkafka_options:
      client.id: "log-agent"

Debug Output

  • A gist with just the highlights of the Vector log can be viewed here.

  • Full log file can be downloaded here.

Expected Behavior

After a full Kafka cluster restart, the Vector Kafka sink successfully sends all queued log events.

Actual Behavior

After a full Kafka cluster restart, the Vector Kafka sink reconnects to the cluster, but (apparently) all previously queued events are lost.

Additional Context

I'm testing the Vector Kafka sink's "at least once" delivery guarantees under critical failure scenarios. My tests show that the Vector Kafka sink, even when the disk buffer type is used, loses some events after a full Kafka cluster restart.

It seems all buffered events (as shown by the vector_kafka_queue_messages metric) are lost when the Kafka cluster comes back online after the restart.

IMPORTANT: I've observed losses only when the destination topic has more than one replica. Topics with a single replica (even with more than one partition) behave as expected, and no messages are lost.
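
For context, the producer acknowledgement settings that I'd expect to matter in this replicated-topic scenario are librdkafka's request.required.acks and enable.idempotence. Below is only a sketch of how they could be passed through the sink's librdkafka_options; the values are illustrative and were not part of the configuration used in these tests:

  kafka:
    type: kafka
    # ... same sink configuration as above ...
    librdkafka_options:
      client.id: "log-agent"
      # Wait for acknowledgement from all in-sync replicas before
      # librdkafka considers a message delivered (-1 means "all").
      request.required.acks: "-1"
      # Idempotent producer, to avoid duplicates from internal retries.
      enable.idempotence: "true"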

My test environment is:

  • Kafka cluster with 3 (three) brokers.

  • Destination topic settings:

    • Partitions: 1

    • Replication factor: 3

    • min.insync.replicas: 2

  • Single source log file with 500,000 lines.

These are the steps my tests followed:

  1. A few seconds after starting Vector, I gracefully shut down all cluster brokers.

  2. Vector shows Kafka connection errors in the log and halts the event processing pipeline. The metric vector_kafka_queue_messages shows that there are 100,000 messages (the default value for the queue.buffering.max.messages librdkafka property) waiting to be sent; see the configuration sketch after these steps for the settings that bound this queue.

  3. I start all cluster brokers again at the same time.

  4. Vector reconnects to the Kafka cluster, logs many producer errors, and after a few seconds starts sending events to the Kafka topic again.

  5. When all 500,000 source log file lines are processed, the vector top command shows (coincidentally or not) that the Kafka sink received 500,000 events in but sent only 400,000 events out. The total number of messages in the Kafka topic is also 400,000.

I've repeated the steps above many times with consistent results.
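
As referenced in step 2, the number of events that can wait in the producer queue while the brokers are down, and how long they are allowed to wait, are bounded by librdkafka defaults. Below is a minimal sketch of how those bounds could be raised through the sink's librdkafka_options; the values are illustrative and were not used in the tests above:

  kafka:
    type: kafka
    # ... same sink configuration as above ...
    librdkafka_options:
      # Default is 100000; raising it lets more events queue locally
      # while the cluster is unreachable.
      queue.buffering.max.messages: "500000"
      # Local per-message delivery timeout; messages still queued when it
      # expires are failed by librdkafka. Default is 300000 (5 minutes).
      message.timeout.ms: "1800000" # 30 minutes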

Is there anything I can do to fix this? Or is this the expected behavior of the Kafka sink?

Thanks in advance.

Marco.

marcojck added the type: bug label on Dec 10, 2021
@spencergilbert (Contributor)

@bruceg is this perhaps related (similarly) to the rebalancing/acking issue you're looking at? A new cluster would imply a rebalance/repartitioning, I think.

spencergilbert added the sink: kafka label on Dec 10, 2021
@bruceg (Member)

bruceg commented Dec 11, 2021

Yes, this is almost certainly the same issue as #9587.

@marcojck (Author)

marcojck commented Jan 17, 2022

@bruceg, you and @spencergilbert mentioned issue #10434 as related to this one, but I'm having a really hard time trying to understand that relation.

If I got it right, issue #10434 has to do with the Vector kafka source (i.e., a Kafka consumer) committing offsets to the wrong partitions due to a cluster rebalance. On the other hand, this issue seems to be caused by a Kafka producer sending messages to the wrong partition after a cluster rebalance.

Considering that the consumer and producer functions/abstractions provided by the Kafka client (librdkafka in our case) are essentially distinct, would the fix for both issues be to "flush (drop) all the outstanding finalizers and start over" in the Vector code, as mentioned in #10434?

Thanks!

@jszwedko (Member)

Yeah, agreed, this could be related to #10434, but it will require a distinct fix, since this issue is about the kafka sink losing events, not the kafka source.

@StephenWakely (Contributor)

I am unable to reproduce this issue.

What I'm finding is that once the buffers fill up, backpressure is applied to the file source, so the source stops reading from the file. Once Kafka is restarted, the file source starts reading again. I'm not seeing any data loss.

A lot has changed since this issue was originally raised; in particular, we have a new implementation of the disk buffers, which may be what has resolved the issue.
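
For anyone re-running this test, the sketch below shows roughly the sink buffer settings I'd expect to exercise that backpressure path. The max_size value is illustrative, and when_full is spelled out only for clarity (blocking should already be the default behavior):

    buffer:
      type: disk
      max_size: 1073741824 # 1 GiB of on-disk buffer before backpressure kicks in
      when_full: block     # apply backpressure upstream instead of dropping events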

@marcojck would you be able to check again and let us know if this issue is still a problem for you?
