This repository has been archived by the owner on May 13, 2019. It is now read-only.

Consumergroup stops consuming after ZK connection lost/timeout #76

Closed
nemosupremo opened this issue Sep 15, 2015 · 7 comments

Comments

@nemosupremo
Contributor

I'm trying to debug an issue where it looks like

1.) The ZooKeeper connection dies or times out

2015/09/14 10:39:45 read tcp 10.129.196.49:2181: i/o timeout
2015/09/14 10:39:53 read tcp 10.129.196.11:2181: i/o timeout
2015/09/14 10:40:01 read tcp 10.129.196.55:2181: i/o timeout
2015/09/14 10:40:09 read tcp 10.129.196.49:2181: i/o timeout
2015/09/14 10:49:12 read tcp 10.129.196.11:2181: i/o timeout
2015/09/14 10:49:13 read tcp 10.129.196.55:2181: i/o timeout
2015/09/14 10:49:13 Failed to set previous watches: zk: connection closed
2015/09/14 10:49:14 read tcp 10.129.196.49:2181: i/o timeout
2015/09/14 10:49:14 Failed to set previous watches: zk: connection closed
2015/09/14 10:49:14 read tcp 10.129.196.11:2181: i/o timeout
2015/09/14 10:49:14 Failed to set previous watches: zk: connection closed

2.) A rebalance does not occur (on any of the other nodes), so the partitions are left without a consumer.

I'm not sure if this is an issue with ZK or with consumergroup. Also, I'm using consumergroup from e236a65 with my PRs added (meaning I don't have any of the other fixes applied; I'm not sure whether those fixes would solve my problem).

@nemosupremo
Contributor Author

From looking at the connection loop in zk, it seems that if there is an issue that causes a disconnect, the library attempts to reconnect and reset the watches. However, if resetting the watches fails, you can end up with a consumer that is no longer aware of changes in the consumer group's membership.

What I think should happen is that the node goes offline and the rest of the consumers carry on as planned, but that isn't what I see: I frequently get partitions with no consumers attached. Maybe the connection is succeeding, but the watch reset isn't?

@nemosupremo
Contributor Author

Some more logging:

2015/09/17 21:22:41 read tcp 10.129.196.49:2181: i/o timeout
[2015-09-17T21:22:48.071Z] [GEARD] [kafka.go:21] [NOTICE] [SARAMA]consumer/broker/36973 closed dead subscription to geard-user/18
[2015-09-17T21:22:48.072Z] [GEARD] [kafka.go:21] [NOTICE] [SARAMA]consumer/broker/36973 closed dead subscription to geard-user/22
[2015-09-17T21:22:48.072Z] [GEARD] [kafka.go:21] [NOTICE] [SARAMA]consumer/broker/36973 closed dead subscription to geard-user/20
[2015-09-17T21:22:48.073Z] [GEARD] [kafka.go:21] [NOTICE] [SARAMA]consumer/broker/36973 closed dead subscription to geard-user/16
[2015-09-17T21:22:48.116Z] [GEARD] [kafka.go:21] [NOTICE] [SARAMA][geard/f66fbb31b15d] geard-user :: Stopped topic consumer
[2015-09-17T21:22:48.149Z] [GEARD] [kafka.go:21] [NOTICE] [SARAMA][geard/f66fbb31b15d] Currently registered consumers: 0
[2015-09-17T21:22:48.15Z] [GEARD] [kafka.go:21] [NOTICE] [SARAMA][geard/f66fbb31b15d] geard-user :: Started topic consumer
[2015-09-17T21:22:48.288Z] [GEARD] [kafka.go:21] [NOTICE] [SARAMA][geard/f66fbb31b15d] geard-user :: Claiming 0 of 32 partitions
[2015-09-17T21:22:48.288Z] [GEARD] [kafka.go:21] [NOTICE] [SARAMA][geard/f66fbb31b15d] geard-user :: Stopped topic consumer

So it looks like (1) communication with ZooKeeper fails, then (2) we see 0 registered consumers, (3) then we don't do anything, and (4) samuel-zk silently fails to set the watches, and we are left in the dark about updates.

It seems (guessing) like the appropriate solution here is to close all watcher channels if we fail to reset the watches (silently re-arming them seems like a bad idea anyway: what if the node we are watching changed while we were disconnected?).

@nemosupremo
Contributor Author

Created a PR @ samuel/go-zookeeper#84 that should hopefully solve this issue.

@wvanbergen
Owner

Nice find! Let me know once you know more about this.

@allenbo

allenbo commented Oct 31, 2015

I don't know if this is relevant, but when you say "I frequently get partitions with no consumers attached", do you mean that none of the partitions get claimed by any consumer, or that just some partitions stay orphaned for a long time? I checked the consumer group code and kazoo, and found that when the ZooKeeper session expires, the consumer group instance (an ephemeral znode) does not re-register itself. That could potentially lead to a situation where all consumer group instances lose their claims in ZooKeeper, as if no consumer ever existed.

@nemosupremo
Contributor Author

I should probably close this issue, since I haven't seen it again since samuel/go-zookeeper#87 landed. I never got any feedback on samuel/go-zookeeper#84, but it looks like samuel/go-zookeeper#87 may have solved the issue on the ZooKeeper side. Rebalances in wvanbergen/kafka have been working as advertised.

@yejingx
Contributor

yejingx commented Dec 21, 2015

Hi @nemothekid, I encountered the same problem in the last few days. Have you solved it? I tried your PR at samuel/go-zookeeper#84, but it doesn't work.
