
Vector Kafka sink losing log events after a full Kafka cluster restart #10398

Open
marcojck opened this issue Dec 10, 2021 · 5 comments
Labels: sink: kafka (Anything `kafka` sink related) · type: bug (A code related bug.)

@marcojck
marcojck commented Dec 10, 2021

Summary

The Vector Kafka sink loses log events after a full Kafka cluster restart.

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Vector Version

vector 0.18.1 (x86_64-unknown-linux-gnu c4adb60 2021-11-30)

Vector Configuration File

data_dir: ./data

api:
  address: 127.0.0.1:8686
  enabled: true
  playground: true

sources:
  vector_metrics:
    type: internal_metrics
    namespace: vector
    scrape_interval_secs: 5

  logs:
    type: file
    acknowledgements:
      enabled: true
    include:
      - ./*.log

sinks:
  vector_metrics_prometheus_exporter:
    type: prometheus_exporter
    address: 0.0.0.0:9598
    inputs:
      - vector_metrics

  kafka:
    type: kafka
    inputs: [logs]
    bootstrap_servers: broker1:9092,broker2:9092,broker3:9092
    topic: logs
    compression: zstd
    encoding:
      codec: json
    healthcheck:
      enabled: true
    buffer:
      type: disk
      max_size: 262144000 # 250 MiB
    tls:
      enabled: true
      ca_file: ./ca.pem
      verify_certificate: true
    sasl:
      enabled: true
      mechanism: SCRAM-SHA-512
      username: vector
      password: ******
    librdkafka_options:
      client.id: "log-agent"

Debug Output

  • A gist with just the highlights of the Vector log can be viewed here.

  • Full log file can be downloaded here.

Expected Behavior

After a full Kafka cluster restart, the Vector Kafka sink successfully sends all queued log events.

Actual Behavior

After a full Kafka cluster restart, the Vector Kafka sink reconnects to the cluster, but (apparently) all previously queued events are lost.

Additional Context

I'm testing the Vector Kafka sink's "at least once" delivery guarantees under critical failure scenarios. My tests show that the Vector Kafka sink, even when the disk buffer type is used, loses some events after a full Kafka cluster restart.

It seems all buffered events (as shown by the vector_kafka_queue_messages metric) are lost when the Kafka cluster comes back online after the restart.

IMPORTANT: I've observed losses only when the destination topic has more than one replica. Topics with a single replica (even with more than one partition) behave as expected, and no messages are lost.
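
For context, the producer acknowledgement settings that I'd expect to matter in this replicated-topic scenario are librdkafka's request.required.acks and enable.idempotence. Below is only a sketch of how they could be passed through the sink's librdkafka_options; the values are illustrative and were not part of the configuration used in these tests:

  kafka:
    type: kafka
    # ... same sink configuration as above ...
    librdkafka_options:
      client.id: "log-agent"
      # Wait for acknowledgement from all in-sync replicas before
      # librdkafka considers a message delivered (-1 means "all").
      request.required.acks: "-1"
      # Idempotent producer, to avoid duplicates from internal retries.
      enable.idempotence: "true"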

My test environment is:

  • Kafka cluster with 3 (three) brokers.

  • Destination topic settings:

    • Partitions: 1

    • Replication factor: 3

    • min.insync.replicas: 2

  • Single source log file with 500,000 lines.

These are the steps my tests followed:

  1. A few seconds after starting Vector, I gracefully shut down all cluster brokers.

  2. Vector shows Kafka connection errors in the log and halts the event processing pipeline. The metric vector_kafka_queue_messages shows that there are 100,000 messages (the default value for the queue.buffering.max.messages librdkafka property) waiting to be sent; see the configuration sketch after these steps for the settings that bound this queue.

  3. I start all cluster brokers again at the same time.

  4. Vector reconnects to the Kafka cluster, logs many producer errors, and after a few seconds starts sending events to the Kafka topic again.

  5. When all 500,000 source log file lines are processed, the vector top command shows (coincidentally or not) that the Kafka sink received 500,000 events in but sent only 400,000 events out. The total number of messages in the Kafka topic is also 400,000.

I've repeated the steps above many times with consistent results.
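
As referenced in step 2, the number of events that can wait in the producer queue while the brokers are down, and how long they are allowed to wait, are bounded by librdkafka defaults. Below is a minimal sketch of how those bounds could be raised through the sink's librdkafka_options; the values are illustrative and were not used in the tests above:

  kafka:
    type: kafka
    # ... same sink configuration as above ...
    librdkafka_options:
      # Default is 100000; raising it lets more events queue locally
      # while the cluster is unreachable.
      queue.buffering.max.messages: "500000"
      # Local per-message delivery timeout; messages still queued when it
      # expires are failed by librdkafka. Default is 300000 (5 minutes).
      message.timeout.ms: "1800000" # 30 minutes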

Is there anything I can do to fix this? Or is this the expected behavior of the Kafka sink?

Thanks in advance.

Marco.

marcojck added the type: bug label on Dec 10, 2021
@spencergilbert (Contributor)

@bruceg is this perhaps related (similarly) to the rebalancing/acking issue you're looking at? A new cluster would imply a rebalance/repartitioning, I think.

spencergilbert added the sink: kafka label on Dec 10, 2021
@bruceg (Member)

bruceg commented Dec 11, 2021

Yes, this is almost certainly the same issue as #9587.

@marcojck (Author)

marcojck commented Jan 17, 2022

@bruceg, you and @spencergilbert mentioned issue #10434 as related to this one, but I'm having a really hard time trying to understand that relation.

If I got it right, issue #10434 has to do with the Vector kafka source (i.e., a Kafka consumer) committing offsets to the wrong partitions due to a cluster rebalance. On the other hand, this issue seems to be caused by a Kafka producer sending messages to the wrong partition after a cluster rebalance.

Considering that the consumer and producer functions/abstractions provided by the Kafka client (librdkafka in our case) are essentially distinct, would the fix for both issues be to "flush (drop) all the outstanding finalizers and start over" in the Vector code, as mentioned in #10434?

Thanks!

@jszwedko (Member)

Yeah, agreed, this could be related to #10434, but it will require a distinct fix, since this issue is about the kafka sink losing events, not the kafka source.

@StephenWakely (Contributor)

I am unable to reproduce this issue.

What I'm finding is that once the buffers fill up, backpressure is applied to the file source, so the source stops reading from the file. Once Kafka is restarted, the file source starts reading again. I'm not seeing any data loss.

A lot has changed since this issue was originally raised; in particular, we have a new implementation of the disk buffers, which may be what has resolved the issue.
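
For anyone re-running this test, the sketch below shows roughly the sink buffer settings I'd expect to exercise that backpressure path. The max_size value is illustrative, and when_full is spelled out only for clarity (blocking should already be the default behavior):

    buffer:
      type: disk
      max_size: 1073741824 # 1 GiB of on-disk buffer before backpressure kicks in
      when_full: block     # apply backpressure upstream instead of dropping events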

@marcojck would you be able to check again and let us know if this issue is still a problem for you?
