Handle rdkafka Message consumption error: Fatal (Local: Fatal error) #13665

Closed
fpytloun opened this issue Jul 21, 2022 · 2 comments · Fixed by #10302
Labels: source: kafka, type: bug

Comments

@fpytloun (Contributor) commented Jul 21, 2022

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

When Kafka has under-replicated partitions, Vector crashes after a while.

With rack awareness disabled, the error looks like this:

2022-07-21T16:47:10.352136Z ERROR source{component_kind="source" component_id=in_kafka_access component_type=kafka component_name=in_kafka_access}: vector::internal_events::kafka: Failed to read message. error=Message consumption error: Fatal (Local: Fatal error) error_code="reading_message" error_type="reader_failed" stage="receiving"
thread 'vector-worker' panicked at 'called `Option::unwrap()` on a `None` value', /cargo/registry/src/github.com-1ecc6299db9ec823/rdkafka-0.27.0/src/topic_partition_list.rs:201:43

https://github.com/fede1024/rust-rdkafka/blob/master/src/topic_partition_list.rs#L201

With rack awareness enabled:

*** /cargo/registry/src/github.com-1ecc6299db9ec823/rdkafka-sys-4.0.0+1.6.1/librdkafka/src/rdkafka_partition.c:346:rd_kafka_toppar_set_fetch_state: assert: thrd_is_current(rktp->rktp_rkt->rkt_rk->rk_thread) ***

which is probably this issue: confluentinc/librdkafka#3569 (vector: #8750)

For the first issue, Vector should probably handle this kind of error and re-initialize the client without panicking; a rough sketch of what that could look like is below.
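This is only an illustrative sketch, not Vector's actual source code: the `build_consumer` and `run` functions and the overall loop structure are hypothetical. The only part taken from rust-rdkafka itself is the error variant being matched, `KafkaError::MessageConsumption(RDKafkaErrorCode::Fatal)`, which is what produces the "Message consumption error: Fatal (Local: Fatal error)" log shown above. The idea is that a fatal error marks the underlying client as unusable, so the loop should rebuild the consumer instead of continuing to poll it (which is what eventually triggers the unwrap panic).

    use rdkafka::config::ClientConfig;
    use rdkafka::consumer::{Consumer, StreamConsumer};
    use rdkafka::error::KafkaError;
    use rdkafka::types::RDKafkaErrorCode;

    // Hypothetical helper: build and subscribe a fresh consumer from the source config.
    fn build_consumer(config: &ClientConfig, topics: &[&str]) -> Result<StreamConsumer, KafkaError> {
        let consumer: StreamConsumer = config.create()?;
        consumer.subscribe(topics)?;
        Ok(consumer)
    }

    // Hypothetical consume loop: rebuild the consumer on fatal errors instead of
    // continuing to poll a client that librdkafka has already marked as broken.
    async fn run(config: ClientConfig, topics: Vec<String>) {
        let topic_refs: Vec<&str> = topics.iter().map(String::as_str).collect();
        loop {
            let consumer = build_consumer(&config, &topic_refs)
                .expect("failed to create Kafka consumer");
            loop {
                match consumer.recv().await {
                    Ok(_message) => {
                        // ...hand the message off to the rest of the pipeline...
                    }
                    // Fatal error: the client cannot recover, so break out and let
                    // the outer loop create a new consumer (ideally with backoff).
                    Err(KafkaError::MessageConsumption(RDKafkaErrorCode::Fatal)) => break,
                    // Non-fatal errors: log and keep polling the same consumer.
                    Err(err) => eprintln!("Failed to read message: {}", err),
                }
            }
        }
    }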

Configuration

## Kafka source
    [sources.in_kafka_access]
      type = "kafka"
      bootstrap_servers = "kafka-0.sre-mydomain.pa4-par-gc.compass.tenant.int.ves.io:9093,kafka-1.sre-ves.pa4-par-gc.compass.tenant.int.ves.io:9093,kafka-2.sre-ves.pa4-par-gc.compass.tenant.int.ves.io:9093"
      group_id = "vector_access"
      topics = ["^fluentd.svcfw.(apiaccess|secevent)(_.*)?$"]
      librdkafka_options."client.id" = "${HOSTNAME}.pa4-par-gc-int-mydomain-io"
      librdkafka_options."client.rack" = "pa4-par-gc-int-mydomain-io"
      librdkafka_options."group.instance.id" = "${HOSTNAME}.pa4-par-gc-int-mydomain-io"
      # Prefer roundrobin balancing to spread load more evenly
      librdkafka_options."partition.assignment.strategy" = "roundrobin,range"
      librdkafka_options."session.timeout.ms" = "60000"
      topic_key = "_topic"
      partition_key = "_partition"
      offset_key = "_offset"

      tls.enabled = true
      tls.ca_file = "/domain/secrets/identity/server_ca_with_compass.crt"
      tls.crt_file = "/domain/secrets/identity/client.crt"
      tls.key_file = "/domain/secrets/identity/client.key"

Version

vector 0.21.1 (x86_64-unknown-linux-gnu)

Debug Output

No response

Example Data

No response

Additional Context

No response

References

fede1024/rust-rdkafka#279

@jszwedko (Member) commented:

Hi @fpytloun! Thanks for this report.

I think this will be fixed by #10302 since the issue was fixed in rdkafka in 0.28.0 (https://github.com/fede1024/rust-rdkafka/blob/master/changelog.md).

@fpytloun (Contributor, Author) commented Aug 3, 2022

FYI: this issue also occurs on the producer side. When there's some glitch or issue with the upstream Kafka, I start to see this message:

2022-08-03T06:52:00.031666Z ERROR sink{component_kind="sink" component_id=out_fluent_kafka component_type=kafka component_name=out_fluent_kafka}:request{request_id=3794715}: vector_core::stream::driver: Service call failed. error=KafkaError (Message production error: Fatal (Local: Fatal error)) request_id=3794715

And I see no output to Kafka until I restart Vector (messages seem to be simply lost).

At the very least, a liveness probe should probably kill such an instance; unfortunately, vector_events_discarded_total was not populated (might be a bug, right?). But if I compare vector_events_in_total with vector_events_out_total, I can clearly see that there was an issue:

[screenshot: graph comparing vector_events_in_total and vector_events_out_total, showing the gap]

So either I'll extend the liveness probe, or ideally Vector's health endpoint should reflect this kind of issue.

Here is a Prometheus query giving, per sink, the percentage of "in-flight" events, i.e. the rate of events that entered the sink but never left it:

(
  sum(rate(vector_events_in_total{component_kind="sink", component_name!="prometheus_exporter"}[5m])) by (site, pod, component_name)
  -
  sum(rate(vector_events_out_total{component_kind="sink", component_name!="prometheus_exporter"}[5m])) by (site, pod, component_name)
) * 100
/
sum(rate(vector_events_in_total{component_kind="sink", component_name!="prometheus_exporter"}[5m])) by (site, pod, component_name)
> 0
