
Wait for leadership transfer to finish and exclude new topics #163

Merged

ferbncode merged 4 commits into master from old-topics on Oct 22, 2020
Conversation

@ferbncode (Contributor) commented Oct 21, 2020

ARUHA-3130: Bubuku may hang up on start

Bubuku may hang on start when new topics are still being created. For example:

  WARNING:bubuku.broker:Leadership is not transferred for 769c2845-4aee-4b38-8b80-087cc08817e4 2 ({"controller_epoch": 61, "leader": -1, "version": 1, "leader_epoch": 110, "isr": [67123066]}, brokers: ['67123027', '67123028', '67123033', '67123034', '67123035', '67123036', '67123037', '67123038', '67123039', '67123040', '67123041', '67123044', '67123046', '67123047', '67123048', '67123049', '67123050', '67123051', '67123052', '67123053', '67123055', '67123056', '67123057', '67123058', '67123059', '67123060', '67123061', '67123062', '67123063', '67123064', '67123065', '67123066', '67123427'])

In this pull request, I change the code to wait indefinitely for the leadership transfer to complete, since we start brokers sequentially during a rolling restart and none of the brokers would start while any leadership transfer is still in progress.
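
For illustration only, here is a minimal sketch of that waiting behaviour, not the actual Bubuku code; has_transfer_in_progress is a hypothetical callable standing in for the existing leadership check:

  import logging
  import time

  _LOG = logging.getLogger('bubuku.broker')

  def wait_for_leadership_transfer(has_transfer_in_progress, sleep_seconds=5):
      # has_transfer_in_progress is a hypothetical callable that returns True
      # while any checked partition still reports leader == -1.
      while has_transfer_in_progress():
          _LOG.warning('Leadership is not transferred yet, waiting %s seconds', sleep_seconds)
          time.sleep(sleep_seconds)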

Since the leadership transfer check is done both when starting and when stopping brokers, I also added a check that excludes topics below a certain age. This is implemented by querying ZooKeeper for the topic znode's creation time.
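
Roughly, the age check could look like the sketch below, assuming direct access through kazoo (Bubuku goes through its own ZooKeeper wrapper, so names and call paths will differ, and the threshold is just an argument here):

  import time
  from kazoo.client import KazooClient

  def topics_older_than(zk: KazooClient, min_age_seconds: float):
      # Keep only topics whose /brokers/topics/<topic> znode was created more
      # than min_age_seconds ago; newer topics may still be electing a leader.
      now = time.time()
      old_topics = []
      for topic in zk.get_children('/brokers/topics'):
          _, stat = zk.get('/brokers/topics/{}'.format(topic))
          if now - stat.created >= min_age_seconds:  # stat.created is the znode ctime in seconds
              old_topics.append(topic)
      return old_topics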

@ferbncode (Contributor, Author)

I was thinking about when such a situation would arise with a controller restart, i.e. when the topic-partition leader stays at -1 for a newly created topic. Even with restarts or a switch of the controller, a leader should eventually be elected successfully.

In the case presented in the ticket, I think the issue was actually a topic being deleted. A deleted topic is still listed among the /brokers/topics children and apparently is not cleaned up from there until the log cleaner deletes the data from the brokers. Here is an example reproduced locally:

[zk: localhost:2181(CONNECTED) 11] ls /admin/delete_topics                    
[test]
[zk: localhost:2181(CONNECTED) 12] ls /brokers/topics                         
[party1, party2, party3, test, party4]
[zk: localhost:2181(CONNECTED) 13] get /brokers/topics/test/partitions/0/state
{"controller_epoch":14,"leader":-1,"version":1,"leader_epoch":17,"isr":[67108865]}
cZxid = 0xc5
ctime = Wed Oct 21 00:29:11 UTC 2020
mZxid = 0x21ad
mtime = Wed Oct 21 22:05:12 UTC 2020
pZxid = 0xc5
cversion = 0
dataVersion = 18
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 82
numChildren = 0
[zk: localhost:2181(CONNECTED) 14] 

In the bug that happened on live, Kafka was taking a long time (~30 minutes), possibly due to broker restarts, to delete the topic partition logs; the log trail can be found here.

Initially, I added code to filter out topics by creation time and age when checking leader election. However, I've now modified it to instead exclude deleted topics when checking leader election.
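
A rough sketch of that exclusion, again assuming plain kazoo access rather than the exact code in this PR:

  from kazoo.client import KazooClient
  from kazoo.exceptions import NoNodeError

  def topics_to_check(zk: KazooClient):
      # Topics queued for deletion still show up under /brokers/topics, so drop
      # anything listed in /admin/delete_topics before checking partition leaders.
      try:
          pending_deletion = set(zk.get_children('/admin/delete_topics'))
      except NoNodeError:
          pending_deletion = set()
      return [t for t in zk.get_children('/brokers/topics') if t not in pending_deletion]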

@thomasabraham

👍

@ferbncode (Contributor, Author)

👍

@ferbncode merged commit 0c9c943 into master on Oct 22, 2020
@ferbncode deleted the old-topics branch on October 22, 2020 08:28