This repository has been archived by the owner on May 13, 2019. It is now read-only.

[WIP] Reimplementation #72

Closed
wants to merge 24 commits into from

Conversation

wvanbergen
Owner

This reimplements the Kafka high-level consumer. It addresses a couple of limitations of the old implementation, and also makes the code, and especially the multithreading model, easier to grok (I was young and inexperienced when I wrote the initial implementation ;).

Changes

  1. It now uses Zookeeper to discover all topic/partition metadata, instead of Kafka. The primary reason is that this allows us to set watches, so we can detect changes.
  2. Using 1., it now automatically starts to consume new partitions when they become available.
  3. When claiming a partition that is still claimed by another instance, it uses a Zookeeper watch to wait until the partition becomes available.
  4. It is smarter during a rebalance operation. Instead of stopping and restarting everything, it now only stops partition consumers it no longer needs, and only starts partition consumers that are not already running. Because it does all of this in parallel, rebalance operations are a lot faster.
  5. Many operations are more resilient because they implement retries.
  6. It uses a Subscription interface to describe what topics to consume. This would allow us to implement a regular-expression-based blacklist or whitelist approach, as well as a static list of topics.
  7. It uses an interface for the main Consumer type, so unit testing apps that use this library is now possible using dependency injection. (A rough sketch of both interfaces follows this list.)
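
To make 6. and 7. a bit more concrete, here is a rough sketch of what those two interfaces could look like. The method names, signatures, and the package name are assumptions for illustration, not the final API:

package consumergroup

import (
  "github.com/Shopify/sarama"
  "github.com/wvanbergen/kazoo-go"
)

// Subscription describes what topics the group should consume.
// The method shape here is hypothetical; the real interface may differ.
type Subscription interface {
  // WatchTopics returns the currently subscribed topics plus a channel that
  // signals when the subscription changes (e.g. a new topic that matches a
  // whitelist regexp appears in Zookeeper).
  WatchTopics(kz *kazoo.Kazoo) (topics []string, changed <-chan struct{}, err error)
}

// Consumer is the interface implemented by the main consumer type, so
// applications can inject a mock in their unit tests.
type Consumer interface {
  Messages() <-chan *sarama.ConsumerMessage
  Errors() <-chan error
  Close() error
}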

Implementation notes

  • I have implemented it as a separate package, because the API has changed somewhat, and I want to keep the old one around for now.
  • We have three main types (a rough skeleton sketch follows this list):
    1. consumerManager: runs a goroutine that figures out what partitions to consume, and starts/stops partition managers for them. Afterwards, it waits for changes in the subscription or in the list of running instances, and then does it all again. Implements the Consumer interface.
    2. partitionManager: runs a goroutine that manages a single sarama.PartitionConsumer, claiming the partition in Zookeeper and managing offsets.
    3. Subscription: describes what partitions the entire group should be consuming, and watches Zookeeper for potential changes.
  • This depends on some Kazoo changes: Consumergoup additions kazoo-go#10
  • This depends on sarama's offset manager, which has not yet landed in master: OffsetManager Implementation IBM/sarama#461
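
An illustrative skeleton of how the two manager types might relate. All field names below are assumptions based on the description above, not the actual implementation:

// Illustrative skeleton only; the real structs in this PR may differ.
type consumerManager struct {
  client       sarama.Client
  consumer     sarama.Consumer
  kz           *kazoo.Kazoo
  group        *kazoo.Consumergroup
  subscription Subscription

  // partition managers currently running, keyed by "topic/partition"
  partitionManagers map[string]*partitionManager
}

type partitionManager struct {
  parent    *consumerManager
  topic     string
  partition int32

  // sarama's (WIP) offset manager for this partition
  offsetManager sarama.PartitionOffsetManager

  // closed once every consumed offset has been marked as processed
  processingDone chan struct{}
}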

TODO

  • Kafka offset management.
  • Zookeeper offset management?
  • Add whitelist / blacklist Subscription types. Maybe move the Subscription type to Kazoo?
  • Unit tests - feasible now that sarama has mock types.
  • Functional tests

@wvanbergen
Owner Author

@horkhe @nemothekid @aaronkavlie-wf @kvs: your input on this is very welcome!

@wvanbergen
Owner Author

Also pinging @eapache as always :)

}

instances, instancesChanged, err := cm.group.WatchInstances()
if err != nil {
Owner Author

TODO: retry this on error.
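
A minimal sketch of what that retry could look like; maxRetries and retryBackoff are hypothetical names, and the WatchInstances call is taken from the diff above:

// Hypothetical retry loop around kazoo's WatchInstances.
instances, instancesChanged, err := cm.group.WatchInstances()
for attempt := 1; err != nil && attempt <= maxRetries; attempt++ {
  log.Printf("Failed to watch consumergroup instances (attempt %d): %s. Retrying...", attempt, err)
  time.Sleep(retryBackoff)
  instances, instancesChanged, err = cm.group.WatchInstances()
}
if err != nil {
  return err // still failing after maxRetries attempts; give up
}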

@eapache
Collaborator

eapache commented Aug 16, 2015

I am not sure whether to use Zookeeper for offset management, or use Kafka instead

Dunno how similar the ZK version is, but this seems like an ideal place for a swappable interface.
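
One possible shape for such a swappable interface; this is purely illustrative and not sarama's actual OffsetManager API:

// Hypothetical pluggable offset store, so a Zookeeper-backed and a
// Kafka-backed (sarama OffsetManager) implementation could be swapped.
type OffsetStore interface {
  // InitialOffset returns the last committed offset for the partition,
  // or a negative value if nothing has been committed yet.
  InitialOffset(topic string, partition int32) (int64, error)

  // CommitOffset records the given offset for the partition.
  CommitOffset(topic string, partition int32, offset int64) error

  // Close flushes any pending commits and releases resources.
  Close() error
}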

@wvanbergen
Owner Author

I will go ahead and try to implement it with sarama's OffsetManager. I can rework the ZK version to match the sarama interface.

@wvanbergen
Owner Author

I implemented offset management using sarama's WIP offset managers. It appears to work well.

@wvanbergen
Owner Author

I also added Whitelist and Blacklist subscription types. This means you can subscribe to topics (not) matching a regular expression. Any new topics that are created in the cluster will automatically be consumed.
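
As a usage sketch (the constructor names below are hypothetical placeholders, not necessarily the API in this branch):

// Subscribe to every topic matching the regexp, including topics created later.
whitelist := NewRegexpWhitelistSubscription(regexp.MustCompile(`^events\..*`))

// Subscribe to every topic except the ones matching the regexp.
blacklist := NewRegexpBlacklistSubscription(regexp.MustCompile(`^internal\..*`))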

}

eventCount += 1
if offsets[message.Topic][message.Partition] != 0 && offsets[message.Topic][message.Partition] != message.Offset-1 {
Collaborator

this should be unnecessary (sarama does these checks) and it's wrong anyways for compacted topics where the offsets might not be monotonic

Owner Author

This was copy-pasted from the previous example app and is shitty. It also fails when you start consuming a partition, then another instance takes it over, and later the partition gets assigned back to you.

Will remove.

@eapache
Collaborator

eapache commented Aug 17, 2015

Quick skim looks good 👍

When this is fully ready I'll do a deep-dive review.

@wvanbergen
Owner Author

This is getting pretty close to being feature complete. Let's work to get IBM/sarama#461 merged so we can pick this up.

@horkhe

horkhe commented Aug 18, 2015

@wvanbergen good job! I only wish you guys had all of this production-ready a couple of months ago, then I would not have needed to implement it myself :)

}

// Initialize sarama consumer
if consumer, err := sarama.NewConsumerFromClient(cm.client); err != nil {

Consumer and OffsetManager share a client instance, but consumer requests can stay blocked doing long polling. Maybe it makes sense to use separate clients, so that the OffsetManager would never be affected by long polls?
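
A sketch of that suggestion, assuming the OffsetManager constructors from IBM/sarama#461 as they eventually landed; brokers, config, and groupName are placeholders:

// Use separate clients so long-polling fetch requests from the consumer can
// never hold up offset commit traffic from the offset manager.
consumerClient, err := sarama.NewClient(brokers, config)
if err != nil {
  return err
}
offsetClient, err := sarama.NewClient(brokers, config)
if err != nil {
  return err
}

consumer, err := sarama.NewConsumerFromClient(consumerClient)
if err != nil {
  return err
}
offsetManager, err := sarama.NewOffsetManagerFromClient(groupName, offsetClient)
if err != nil {
  return err
}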

// partition consumer resumes consuming from this partition later.
func (pm *partitionManager) waitForProcessing() {
nextOffset, _ := pm.offsetManager.NextOffset()
lastProcessedOffset := nextOffset - 1
Owner Author

Not really super happy about this implementation, but not sure how else to do this.

We only want to wait for offsets if a) we actually consumed any messages at all, and b) we haven't already processed all consumed offsets. If either of those conditions doesn't hold, pm.processingDone will never be closed.

Collaborator

seems reasonable to me... early returns would make it much prettier though, e.g.

if lastConsumedOffset == -1 {
  return
}
// ...
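
Applying that suggestion, the whole method could read roughly like this; lastConsumedOffset, processingDone, and the timeout field are assumed names based on this thread, not necessarily the real ones:

func (pm *partitionManager) waitForProcessing() {
  if pm.lastConsumedOffset == -1 {
    return // nothing was ever consumed on this partition
  }

  nextOffset, _ := pm.offsetManager.NextOffset()
  if nextOffset-1 >= pm.lastConsumedOffset {
    return // everything we consumed has already been processed
  }

  // Otherwise, wait until processing catches up, or until the timeout expires.
  select {
  case <-pm.processingDone:
  case <-time.After(pm.processingTimeout):
  }
}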

}
ts.offsetTotal += offset - 1

request.AddBlock(topic, partition, offset-1, 0, "")

Timestamp 0 means the beginning of the Unix epoch. As a result, all committed offsets expire immediately (in my setup Kafka kept them around for 1 minute). So you need to use ReceiveTime (-1) instead.
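
Applied to the AddBlock call above, the fix would look like this; sarama exposes a ReceiveTime constant (-1) for this:

// Use ReceiveTime (-1) so the broker stamps the commit with its own receive
// time instead of treating the offset as expired at the Unix epoch.
request.AddBlock(topic, partition, offset-1, sarama.ReceiveTime, "")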

@F21

F21 commented Nov 9, 2015

Any update on this one?

@wvanbergen
Owner Author

It's mostly done; I need to get the functional tests to work on Travis (they work fine on my machine). Any input on that is welcome.

@s7anley

s7anley commented Jan 11, 2016

@wvanbergen Hi, any news about this PR? Do you need any help? The last commit is from Oct 7, so I'm not sure whether I should wait or rather choose a different approach for consumer groups.

@wvanbergen
Owner Author

Hey. It looks like I will not have a lot of time available to maintain this library, or finish this PR. If anybody is interested in taking it over, I will be glad to help out and get you started.

@s7anley

s7anley commented Jan 11, 2016

And does it still make sense once IBM/sarama@66d77e1 is merged? Or only as support for clusters still running 0.8.x?

@wvanbergen
Owner Author

Yeah, this will be primarily for people that are stuck on 0.8 for the time being.

caihua-yin pushed a commit to caihua-yin/kafka that referenced this pull request Apr 19, 2016
This is a complementary fix for
wvanbergen#68
(issue: wvanbergen#62), before the
re-implementation (wvanbergen#72) is ready.

In my use case, the message-consuming logic is sometimes time consuming;
even with the 3 retries from the fix in pull #68, it is still easy to run
into issue #62. Further checking the current logic in
consumer_group.go:partitionConsumer(), it may take
as long as cg.config.Offsets.ProcessingTimeout to ReleasePartition
so that the partition can be claimed by a new consumer during a rebalance.
So simply set the max retry time to the same value as
cg.config.Offsets.ProcessingTimeout, which is 60s by default.

Verified the system including this fix with frequent rebalance
operations; the issue does not occur again.
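
For reference, the timeout the commit message ties the retry budget to lives on the consumer group configuration; a minimal sketch of setting it explicitly (the value shown is just the documented default):

// Configure the old consumergroup package's processing timeout explicitly.
config := consumergroup.NewConfig()
config.Offsets.ProcessingTimeout = 60 * time.Second // the library's default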
@wvanbergen closed this Aug 28, 2017