
Change ruby-kafka to rdkafka-ruby #97
Merged: 43 commits, Aug 27, 2019

Conversation

@JHK (Contributor) commented Oct 16, 2018

This PR will change the underlying library to the C-based rdkafka-ruby. It is still a work in progress, as some things are not updated yet:

Review thread on README.md (outdated, resolved):
@@ -237,8 +237,6 @@ end
The consumers will checkpoint their positions from time to time in order to be able to recover from failures. This is called _committing offsets_, since it's done by tracking the offset reached in each partition being processed, and committing those offset numbers to the Kafka offset storage API. If you can tolerate more double-processing after a failure, you can increase the interval between commits in order to improve performance. You can also do the opposite if you prefer less chance of double-processing.

* `offset_commit_interval` – How often to save the consumer's position in Kafka. Default is every 10 seconds.
* `offset_commit_threshold` – How many messages to process before forcing a checkpoint. Default is 0, which means there's no limit. Setting this to e.g. 100 makes the consumer stop every 100 messages to checkpoint its position.
* `offset_retention_time` - How long committed offsets will be retained. Defaults to the broker setting.
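For context, a hypothetical sketch of how these ruby-kafka-era options are set in a Racecar config block (the option names come from the README excerpt above; the values are arbitrary examples, and as the thread notes, not all of these options carry over to rdkafka):

```ruby
# Illustrative sketch only: option names are from the README excerpt above;
# the values are arbitrary examples, not recommendations.
Racecar.configure do |config|
  # Commit consumer positions every 5 seconds instead of the 10-second default.
  config.offset_commit_interval = 5

  # Force a checkpoint after every 100 processed messages (0 = no limit).
  config.offset_commit_threshold = 100

  # Keep committed offsets for one week (defaults to the broker setting).
  config.offset_retention_time = 7 * 24 * 60 * 60
end
```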
Contributor Author:

Those configuration values are not supported by rdkafka. Or did I just miss them?

Collaborator:

offset_commit_threshold is a ruby-kafka thing, but offset_retention_time should be supported – it's part of the protocol.

Contributor Author:

Okay, I tried to dig a bit deeper there. It seems like the underlying C library, librdkafka, does not expose offset retention as a configuration option. Looking at the current Apache Kafka documentation, I can only find it as a broker configuration value.

Personally, I don't think we need that option to be settable from the consumer side. Therefore I would vote for ditching it in order to make progress.

Collaborator:

It's actually something I use in production :/ It's nice for knowing how long a consumer group can be suspended before it loses its offsets...

Contributor Author:

Okay, I've put it on my TODO list for going from alpha to beta. This one might be tricky to implement through all the layers involved, though.

Contributor Author:

I found this issue to be related to that topic. They once considered implementing it per commit, until it was removed from the protocol definition.

Collaborator:

Hmm. It's still supported by ruby-kafka...

Contributor:

Looks like it is part of the latest OffsetCommit request (version 4), but not part of TxnOffsetCommit. http://kafka.apache.org/protocol.html#The_Messages_OffsetCommit

@FlorianDoublet (Contributor):

Good initiative 👍 !

Two review threads on lib/racecar/config.rb (outdated, resolved).
@consumers = @config.subscriptions.map do |subscription|
  consumer = Rdkafka::Config.new(rdkafka_config(subscription)).consumer
  consumer.subscribe(subscription.topic)
  consumer
end
Collaborator:

Does an Rdkafka consumer not support multiple subscriptions?

Collaborator:

Ah, is it because rdkafka-ruby doesn't support per-subscription config? That's supported by the protocol though – does librdkafka not support that?

Contributor Author:

It does, but not with different configuration options (like auto.offset.reset, which is quite handy to define in the handler for different topics).

Collaborator:

👍

Collaborator:

We'll see if we run into performance issues with many subscriptions then.


That's correct, there can be only one config per client. This is a limitation imposed by the C library.
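Given that limitation, the workaround visible in this PR is one client per subscription. A rough, self-contained sketch of how per-subscription overrides could be merged into a base config before creating each client (`Subscription` and `build_configs` are hypothetical names for illustration, not Racecar's actual API):

```ruby
# Sketch: build one rdkafka-style config hash per subscription, since a single
# client cannot carry per-topic options such as auto.offset.reset.
# Subscription and build_configs are hypothetical names for illustration.
Subscription = Struct.new(:topic, :options)

def build_configs(base_config, subscriptions)
  subscriptions.map do |sub|
    # Per-subscription options override the shared base settings.
    [sub.topic, base_config.merge(sub.options)]
  end.to_h
end

base = { "bootstrap.servers" => "localhost:9092", "group.id" => "racecar" }
subs = [
  Subscription.new("events",  { "auto.offset.reset" => "earliest" }),
  Subscription.new("metrics", { "auto.offset.reset" => "latest" }),
]

configs = build_configs(base, subs)
# Each resulting hash could then be passed to its own
# Rdkafka::Config.new(...).consumer, one client per subscription.
```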

@thijsc commented Nov 14, 2018

Awesome that you're making this move. I'm a maintainer of rdkafka-ruby and happy to help out if anything comes up.

@mensfeld

@thijsc from my perspective the only thing missing in rdkafka-ruby is batch support. It can be emulated, as @JHK did, but having it in rdkafka-ruby would help a lot.

@thijsc commented Nov 14, 2018

Took a quick look at d00dbc7, but I'm not sure how this works exactly. What are you achieving with these batches?

@JHK (Contributor Author) commented Nov 14, 2018

@thijsc I tried to explain my use case in an issue on rdkafka-ruby: karafka/rdkafka-ruby#57

@dasch (Collaborator) commented Nov 14, 2018

@thijsc we've got a ruby-kafka + racecar + delivery boy Slack channel; do you want to join so that we can discuss things there?

https://ruby-kafka-slack.herokuapp.com/

breunigs and others added 6 commits June 17, 2019 14:35
* master:
  Add note to changelog
  Add current dir to LOAD_PATH only when --require option is used
  Update version to v0.5.0.beta2
  Remove alpha warning for producing messages
  Add `command` to the k8s example
  Add example of configuring consumer in k8s
  Describe how to deploy with Kubernetes
  RSpec: enable zero-monkey patch mode
  Allow using Bundler 2
  Allow configuring `sasl_over_ssl`.
  Replace #present? with #nil? to work outside of Rails
  Fix group_id/group_id_prefix documentation
  v0.5.0.beta1
  Describe manual heartbeats in the README
  Allow triggering heartbeats manually from a consumer
  Mention message headers
* wip: initial implementation

* wip: add basic tests & fix bugs

* bump rdkafka version to ensure pause is available

* refactor pause class, use rdkafka's assignment over subscription -- latter doesn't list partitions at all

* document exponential backoff; fix disabling pause

* seek back in order not to skip over broken batches

* explain how heartbeats/long running processing works with librdkafka

* improve long-running task docs
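One of the commits above documents exponential backoff for paused partitions. The actual implementation lives in this PR's pause class; as a rough self-contained sketch of the idea only (class name, method names, and defaults here are made up for illustration), the pause duration doubles per failed attempt up to a cap:

```ruby
# Illustrative sketch of exponential backoff for pausing a partition after
# repeated processing failures. Names and defaults are invented for this
# example and do not match the PR's actual Pause class.
class BackoffPause
  def initialize(base_timeout: 10, max_timeout: 300)
    @base_timeout = base_timeout
    @max_timeout = max_timeout
    @attempts = 0
  end

  # Each failed attempt doubles the pause, capped at max_timeout.
  def next_timeout
    timeout = @base_timeout * (2**@attempts)
    @attempts += 1
    [timeout, @max_timeout].min
  end

  # A successfully processed batch resets the backoff.
  def reset!
    @attempts = 0
  end
end

pause = BackoffPause.new
pause.next_timeout # => 10
pause.next_timeout # => 20
pause.next_timeout # => 40
```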
@gaurish commented Jul 20, 2019

What is remaining here, or what are the blockers? Is there anything we could do to push this along? Using native librdkafka would have immense benefits.

@JHK (Contributor Author) commented Jul 23, 2019

Features

I consider this PR almost complete:

Stability

We have been running this PR in production for some weeks now using Kafka 2.2, and rdkafka-ruby hasn't been the source of any problems we've faced. I plan to publish a blog post soon about the use case that led us to work on this.

Breaking Changes

This change requires a major release, as it introduces some breaking changes. Those need to be documented carefully and sorted out with the core maintainers of this gem. See also https://github.com/zendesk/racecar/pull/97/files#diff-4ac32a78649ca5bdd8e0ba38b7006a1e

Documentation

The breaking changes require a partial rewrite of the README. Those texts also need to be approved by the core maintainers of this gem. See also https://github.com/zendesk/racecar/pull/97/files#diff-04c6e90faac2675aa89e2176d2eec7d8


As we already run this code in production, we consider it good enough for our use case. Obviously we would also benefit from this code rejoining the main branch, and we would happily accept help to make that happen. @dasch, did I conclude correctly here, or have I missed something from your point of view?

@thijsc commented Jul 23, 2019

I just released rdkafka-ruby with all the latest additions (pause was already in 0.5.0, by the way). I haven't made any progress on the instrumentation support; it's not something we need ourselves, so I haven't felt a lot of urgency.

If it's a blocker to getting this pull request shipped, we should get it done though. I'll take a look at it this week.

@mensfeld

@thijsc I have work in progress on something similar for Waterdrop and Karafka; the instrumentation hooks would be really helpful.

@JHK (Contributor Author) commented Jul 25, 2019

FYI: here we have described the use case that led to this PR: https://tech.xing.com/asynchronous-state-replication-9e504aeac0de

@dasch (Collaborator) commented Aug 1, 2019

@JHK I am returning to work next week and will give this a review. Thanks for all your hard work!

Review thread on lib/racecar/consumer_set.rb (outdated, resolved):
@last_poll_read_nil_message = true if msg.nil?
end

# XXX: messages are not guaranteed to be from the same partition
Collaborator:

I think that could break existing consumers.

Contributor:

Oh, it certainly did. However, ConsumerSet is an entirely new class, and we fixed this issue in Racecar::Runner in this commit: c8902cf#diff-2b2ca276543198a74d7942069ba0b48bR45

On master, Racecar::Runner#consumer is public, but in this branch it's private, so I think it would already break existing custom users. Additionally, the methods are named each_batch (master) and batch_poll (this branch).

I'm aware the behaviour change is not ideal, especially because it potentially degrades performance (see the linked commit message). If someone decides to use ConsumerSet#batch_poll directly, they might be in for a surprise, because the change is not necessarily visible in testing. Naming the method something like batch_poll_from_subscribed_partitions would make it clearer, but that's not exactly a nice name. I'm fine with renaming it, since it's not used often, but ideally this would be solved directly within librdkafka.
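For callers that do need per-partition batches, the polled messages can be regrouped before processing. A self-contained sketch of the idea (the `Message` struct here is a stand-in for illustration, not rdkafka-ruby's actual message class):

```ruby
# Sketch: batch_poll may return messages from several partitions; callers that
# need per-partition batches (as ruby-kafka's each_batch provided) can regroup
# them first. Message is a stand-in Struct for this example only.
Message = Struct.new(:topic, :partition, :offset, :payload)

def group_by_partition(messages)
  # group_by preserves the original order of messages within each group,
  # so per-partition offset order is retained.
  messages.group_by { |m| [m.topic, m.partition] }
end

batch = [
  Message.new("events", 0, 11, "a"),
  Message.new("events", 1, 7,  "b"),
  Message.new("events", 0, 12, "c"),
]

group_by_partition(batch).each do |(topic, partition), msgs|
  # Each msgs array now contains messages from a single partition, in order.
  puts "#{topic}/#{partition}: #{msgs.map(&:offset).inspect}"
end
```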

Collaborator:

@thijsc would this be something that librdkafka could handle?

Contributor:

I asked in librdkafka's channel if there is a recommended approach to handle this in general (https://gitter.im/edenhill/librdkafka?at=5d4d71d21db76d0ffea12263). Let's see what they recommend, maybe I'm just missing some obvious batch poll method that supports this more or less out of the box.

Contributor:

from the chat:

@breunigs what you can do is extract the per-partition queue, see queue_get_partition() and forward it to NULL and poll each partition queue explicitly.

Header for queue_get_partition is here: https://github.com/edenhill/librdkafka/blob/675e764b0fb013835c3ead74415bdb62b0aa7903/src/rdkafka.h#L2859-L2872

I took a brief look and this seems to allow a very fine-grained setup of how messages are consumed. Since there's a rebalance callback, old queues can probably be destroyed through that, but maybe doing queue_get_partition, poll, and queue_destroy in the main loop is easier.
librdkafka also works on all assigned partitions semi-simultaneously, so we'd need something similar to switch queues if we stall on a single partition, or introduce a maximum time to ensure partitions are worked off more or less evenly. There'd also need to be a call to determine how many partitions a topic potentially has, or at least how many we're currently subscribed to.

Apart from handling the details, exposing this function through rdkafka-ruby sounds reasonable, but reimplementing the polling/queue switching handling is probably on racecar side.

Judging from gut feeling only, this seems like a lot of effort and not a good trade-off. It took me a couple of iterations to work out all the bugs in another feature that uses the lower-level APIs, because doing so requires a very good understanding of Kafka and librdkafka. However, for us this non-optimal batching incurs only a small performance penalty, so it was much easier to just start an additional process.

@dasch (Collaborator) commented Aug 5, 2019

Can you rebase and resolve the conflicts?

I think the best plan for rolling this out is to cut a v1.0.0 release from the current master and create a stable branch, i.e. v1-stable. Then we merge this PR into master and cut v2.0.0.alpha1.

@JHK will you be available for helping solve the many issues that will undoubtedly crop up? I unfortunately have very limited bandwidth.

@breunigs (Contributor) commented Aug 9, 2019

+1 on the rollout/branch naming approach

Regarding availability: yes, we have some time to support that. We don't have dedicated time for this, so I can offer best effort, but I can't guarantee a fixed number of hours per week. Also, most of the support would come from me, not @JHK.

@dasch (Collaborator) commented Aug 12, 2019

OK, I will probably start that work next week.

@dasch (Collaborator) commented Aug 23, 2019

Can you rebase and move the changelog entries under v2.0.0? Then I'll merge this and release an alpha.

@dasch merged commit ef99563 into zendesk:master on Aug 27, 2019
@Elyahou commented Aug 27, 2019

Is there a reason for this switch?

@dasch (Collaborator) commented Aug 27, 2019

Yup – keeping ruby-kafka up to date with all the Kafka changes while avoiding regressions is too much work, and that's not even counting features such as background heartbeats that we're still missing. By building on rdkafka we can focus on ergonomic features rather than low-level details, and we should gain a lot of performance at the same time.

9 participants