Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zkfarmer wrongly thinks it's successfully connected to zookeeper #42

Closed
klasy opened this issue Aug 20, 2018 · 7 comments
Closed

zkfarmer wrongly thinks it's successfully connected to zookeeper #42

klasy opened this issue Aug 20, 2018 · 7 comments

Comments

@klasy
Copy link
Contributor

klasy commented Aug 20, 2018

Using zkfarmer for third-party software registration for our project. It's working great except for one strange thing. Hit this issue a couple of times. The behavior is as following:

zkfarmer connects to zookeeper successfully and runs for some without a problem. Then (according to the logs from one of the zookeeper nodes) one node goes unreachable. Which means the quorum is broken since we have only 3 nodes (5 ideally advised by zookeeper at the moment). zookeeper stays reachable overall - using load balancer link to access it. The third node quickly gets back (maybe 10 minutes overall). And zkfarmer does not report any connection drop or any problem nowhere in the logs (I'm watching over zkfarmer process). So it's there and it thinks it is still connected to zookeeper while in fact it's not. If you go to zookeeper you won't be able to find an ephemeral node, which is an obvious indication that zkfarmer is not connected.

@klasy
Copy link
Contributor Author

klasy commented Aug 22, 2018

Ok, a little more insights:

2018-08-22T21:28:37 [INFO]: Connecting to 10.240.82.12:2185
2018-08-22T21:28:37 [ DBG]: Sending request(xid=None): Connect(protocol_version=0, last_zxid_seen=0, time_out=10000, session_id=0, passwd='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', read_only=None)
2018-08-22T21:28:37 [INFO]: Zookeeper connection established, state: CONNECTED
2018-08-22T21:28:37 [ DBG]: Sending request(xid=1): Exists(path='/test/my_mqtt_test', watcher=None)
2018-08-22T21:28:37 [ DBG]: Received response(xid=1): ZnodeStat(czxid=4294967309, mzxid=4294967310, ctime=1534965854727, mtime=1534965854734, version=1, cversion=4, aversion=0, ephemeralOwner=0, dataLength=11, numChildren=0, pzxid=12884901893)
2018-08-22T21:28:37 [ DBG]: Sending request(xid=2): GetChildren(path='/test/my_mqtt_test', watcher=None)
2018-08-22T21:28:37 [ DBG]: Received response(xid=2): []
2018-08-22T21:28:37 [ DBG]: Sending request(xid=3): GetData(path='/test/my_mqtt_test', watcher=None)
2018-08-22T21:28:37 [ DBG]: Received response(xid=3): ('{"size": 1}', ZnodeStat(czxid=4294967309, mzxid=4294967310, ctime=1534965854727, mtime=1534965854734, version=1, cversion=4, aversion=0, ephemeralOwner=0, dataLength=11, numChildren=0, pzxid=12884901893))
2018-08-22T21:28:37 [ DBG]: Transition from 'initial' to 'observer ready' next to event 'initial setup'
2018-08-22T21:28:37 [ DBG]: And execute the appropriate action <bound method ZkFarmJoiner.exec_initial_setup of <zkfarmer.watcher.ZkFarmJoiner object at 0x7f7a41aba7d0>>
2018-08-22T21:28:37 [ DBG]: Transition from 'observer ready' to 'idle' next to event 'initial znode setup'
2018-08-22T21:28:37 [ DBG]: And execute the appropriate action <bound method ZkFarmJoiner.exec_initial_znode_setup of <zkfarmer.watcher.ZkFarmJoiner object at 0x7f7a41aba7d0>>
2018-08-22T21:28:37 [ DBG]: Sending request(xid=4): Exists(path='/test/my_mqtt_test', watcher=None)
2018-08-22T21:28:37 [ DBG]: Received response(xid=4): ZnodeStat(czxid=4294967309, mzxid=4294967310, ctime=1534965854727, mtime=1534965854734, version=1, cversion=4, aversion=0, ephemeralOwner=0, dataLength=11, numChildren=0, pzxid=12884901893)
2018-08-22T21:28:37 [ DBG]: Sending request(xid=5): Create(path='/test/my_mqtt_test/10.240.93.16', data='{"hostname": "ip-10-240-93-16"}', acl=[ACL(perms=31, acl_list=['ALL'], id=Id(scheme='world', id='anyone'))], flags=1)
2018-08-22T21:28:37 [ DBG]: Received response(xid=5): u'/test/my_mqtt_test/10.240.93.16'
2018-08-22T21:28:37 [ DBG]: Sending request(xid=6): GetData(path='/test/my_mqtt_test/10.240.93.16', watcher=<bound method ZkFarmJoiner.watch_node of <zkfarmer.watcher.ZkFarmJoiner object at 0x7f7a41aba7d0>>)
2018-08-22T21:28:37 [ DBG]: Received response(xid=6): ('{"hostname": "ip-10-240-93-16"}', ZnodeStat(czxid=12884901895, mzxid=12884901895, ctime=1534973317598, mtime=1534973317598, version=0, cversion=0, aversion=0, ephemeralOwner=388826195785613312, dataLength=31, numChildren=0, pzxid=12884901895))
2018-08-22T21:29:14 [WARN]: Connection dropped: socket connection broken
2018-08-22T21:29:14 [WARN]: Transition to CONNECTING
2018-08-22T21:29:14 [INFO]: Zookeeper connection lost
2018-08-22T21:29:14 [WARN]: Connection to Zookeeper suspended
2018-08-22T21:29:14 [ DBG]: Connection is considered as lost
2018-08-22T21:29:14 [ DBG]: Transition from 'idle' to 'lost' next to event 'connection lost'
2018-08-22T21:29:14 [ DBG]: Transition from 'lost' to 'lost' next to event 'znode modified'
2018-08-22T21:29:14 [ DBG]: And execute the appropriate action <bound method ZkFarmJoiner.exec_znode_modified of <zkfarmer.watcher.ZkFarmJoiner object at 0x7f7a41aba7d0>>
2018-08-22T21:29:15 [INFO]: Connecting to 10.240.82.12:2185
2018-08-22T21:29:15 [WARN]: Connection dropped: socket connection error: Connection refused
2018-08-22T21:29:16 [INFO]: Connecting to 10.240.82.12:2185
2018-08-22T21:29:16 [WARN]: Connection dropped: socket connection error: Connection refused
2018-08-22T21:29:17 [INFO]: Connecting to 10.240.82.12:2185
2018-08-22T21:29:17 [WARN]: Connection dropped: socket connection error: Connection refused
2018-08-22T21:29:18 [INFO]: Connecting to 10.240.82.12:2185
2018-08-22T21:29:18 [WARN]: Connection dropped: socket connection error: Connection refused
2018-08-22T21:29:20 [INFO]: Connecting to 10.240.82.12:2185
2018-08-22T21:29:20 [WARN]: Connection dropped: socket connection error: Connection refused
2018-08-22T21:29:20 [WARN]: Failed connecting to Zookeeper within the connection retry policy.
2018-08-22T21:29:20 [INFO]: Zookeeper session lost, state: CLOSED
2018-08-22T21:29:20 [WARN]: Connection to Zookeeper lost
2018-08-22T21:29:20 [ DBG]: Transition from 'lost' to 'lost' next to event 'connection lost'

So here is the problem: the max_retries limit is reached and session is being closed already, but zkfarmer is doing nothing about it. It just hangs in there. I see no point in doing that while it will never attempt to connect again. I would think it should just exit with a failure or just exit to be able to initiate another join transaction.

Also had it tested against 3 nodes vs 5 nodes cluster. In 5-nodes cluster kazoo tolerates one node death while in 3-nodes cluster it does not tolerate it at all. Seems strange to me since zookeeper says the cluster should stay fully operational while the majority is there. Anyway, it's just a zookeeper changing that state - I guess can do nothing here.

@klasy
Copy link
Contributor Author

klasy commented Aug 22, 2018

zkfarmer does not account for the transition from SUSPENDED to LOST:

Valid State Transitions

LOST -> CONNECTED

New connection, or previously lost one becoming connected.

CONNECTED -> SUSPENDED

Connection loss to server occurred on a connection.

CONNECTED -> LOST

Only occurs if invalid authentication credentials are provided after the connection was established.

SUSPENDED -> LOST

Connection resumed to server, but then lost as the session was expired.

SUSPENDED -> CONNECTED

Connection that was lost has been restored.

Need to account for that.

@klasy
Copy link
Contributor Author

klasy commented Aug 23, 2018

Resolved the problem with the following diff - might be kinda ugly... but it's perfect for my current needs.
connection_lost_handler_diff.txt

@rs
Copy link
Owner

rs commented Aug 23, 2018

Can you please send a PR?

@klasy
Copy link
Contributor Author

klasy commented Aug 23, 2018

@rs , sure, how do I do that? It seems that I need to give you my key or something because otherwise it's giving me 403 :)

@rs
Copy link
Owner

rs commented Aug 23, 2018

You need to fork the project, then create a branch and then send a PR thru github. You can find an article that describes this process in more details here: https://github.com/susam/gitpr.

@rs
Copy link
Owner

rs commented Aug 23, 2018

Fixed by #43

@rs rs closed this as completed Aug 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants