-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
connection to dead zookeeper ensemble #6
Comments
Maxim, today's version of the client does not have a retry loop inside the init call. So, if all ensemble members are down (remeber - you can specify more than one in the 'connect' argument) the init call fails. |
Sorry, Yuri, I can't detect failure in my code. var ZK = require ("zookeeper").ZooKeeper; zk_session = function() { zk_session();I have recompiled zookeeper c client library and node-zookeeper and it still loops on dead connect: 2011-04-27 09:35:52,221:7216:ZOO_INFO@log_env@658: Client environment:zookeeper.version=zookeeper C client 3.3.3 2011-04-27 09:35:52,221:7216:ZOO_WARN@yield@228: yield:zookeeper_interest returned -1 descriptor 2011-04-27 09:35:52,848:7216:ZOO_DEBUG@zk_timer_cb@260: delaying ping timer by 0.039405 2011-04-27 09:35:54,898:7216:ZOO_DEBUG@zk_timer_cb@251: ping timer went off 2011-04-27 09:35:55,548:7216:ZOO_DEBUG@zk_timer_cb@260: delaying ping timer by 0.015979 2011-04-27 09:35:57,558:7216:ZOO_DEBUG@zk_timer_cb@251: ping timer went off |
i think the problem he's describing is that the init call does loop internally in a failed state. this happens ove and over: this stems from the fact that the zookeeper c client behaves BADLY when it cannot connect to the first server it tries. e.g. if i run 3 zk hosts, and i start his example code (after adding the missing code to call zk_session to begin with..), then it connects and startgs printing: data some value then if i shutdown the zk host i was connected to i get this: 2011-04-27 20:08:35,337:13945:ZOO_ERROR@handle_socket_error_msg@1603: Socket [127.0.0.1:5010] zk retcode=-4, errno=64(Host is down): failed while receiving a server response 2011-04-27 20:08:35,338:13945:ZOO_ERROR@handle_socket_error_msg@1579: Socket [::1:5010] zk retcode=-4, errno=61(Connection refused): server refused to accept the client 2011-04-27 20:08:38,004:13945:ZOO_ERROR@handle_socket_error_msg@1528: Socket [fe80::1:5010] zk retcode=-7, errno=60(Operation timed out): connection timed out (exceeded timeout by 1ms) 2011-04-27 20:08:38,005:13945:ZOO_ERROR@yield@213: yield:zookeeper_interest returned error: -7 - operation timeout 2011-04-27 20:08:43,005:13945:ZOO_INFO@log_env@658: Client environment:zookeeper.version=zookeeper C client 3.3.2 and the program exits b/c closed is actually emitted due to zookeeper_init() returning a null handle inside node-zk.cpp#realIinit() in the case where none of the hosts are up to begin with, zookeeper_init() returns a handle, but doesn't actually connect (ever), but it yield() never uses this information to emit on_closed or error or anything, it simply loops internally waiting to connect over and over. however, it doesn't re-init, such that IF the host does start, that it would successfully connect, so it kind of bones itself. sometimes this will succeed (if it tries b or c first), BUT if it tries port a first, it fails rather unexpectedly. so, this is definitely a bug worth fixing, but i think it's with the zookeeper c client code. perhaps we could take it up with them for comment. |
oh, and you can recreate this with any code, not just his. do the 3 node setup and run: cd test ; node zk_test_buffer.js host:a,host:b,host:c same issue (though you have to change the debug level to see it..) |
so i've tracked things this far (int zookeeper c code) if i don't close the zh->fd when the socket fails to connect to the first server, then the second connection works immediately. i'm going to try to get some help for the zookeeper mailing list |
Any update on this? I, too, need to deal with the possibility for a failed connection, preferably without exiting my program. |
Unfortunately I didn't find a solution and suspended my work with 2011/6/23 jayjanssen
Best regards, |
I think I have a viable work-around for this issue, and the issue of losing a connection to zookeeper, and then reconnecting when it's available again. My zookeeper interaction is wrapped in a single object anyway, so this isn't too painful.
All of the above will handle it when you have a good connection, and when you have a no connection. It will also handle when you have a bad connection that turns good. But it won't catch when a good connection goes bad. To address that, all of my calls to zookeeper (a_get, etc.) check the rc return code, and if it isn't what I expect (especially if it is ZCONNECTIONLOSS), I close my zk handle and set it to undefined, which will put the class back in an "unconnected" state. The key for me was not blocking my application http requests for more than a brief timeout if I wasn't connected to zookeeper. I'd rather return a failure quickly than block all my clients indefinitely. |
This sounds like a decent workaround. Can you post the actual code so that people could use it while we're trying to fix the native driver please? From: jayjanssen reply@reply.github.com I think I have a viable work-around for this issue, and the issue of losing a connection to zookeeper, and then reconnecting when it's available again. My zookeeper interaction is wrapped in a single object anyway, so this isn't too painful.
All of the above will handle it when you have a good connection, and when you have a no connection. It will also handle when you have a bad connection that turns good. But it won't catch when a good connection goes bad. To address that, all of my calls to zookeeper (a_get, etc.) check the rc return code, and if it isn't what I expect (especially if it is ZCONNECTIONLOSS), I close my zk handle and set it to undefined, which will put the class back in an "unconnected" state. The key for me was not blocking my application http requests for more than a brief timeout if I wasn't connected to zookeeper. I'd rather return a failure quickly than block all my clients indefinitely. Reply to this email directly or view it on GitHub: |
https://issues.apache.org/jira/browse/ZOOKEEPER-1057 is the ticket i created on the zookeeper project that relates to this issue. I haven't seen any movement on that front. |
This is what I've got embedded in my zk class: https://gist.github.com/1048955 It seems to work correctly, but there certainly could be some issues. |
I fixed this with my current pull request: #19 The issue was with the libev code within this module. |
Closing, if this issue should be open let me know. |
When trying to connect to dead ensemble connection fails, but I haven't figured how to reconnect later in my code.
Even if zookeeper server is up again the connection still loops in failed state.
How to reproduce:
put zookeeper node down
start zk.js
put zookeeper node up
zk.js won't reconnect and still loops in failed state.
zk.js attached: http://pastebin.com/KjD3kXaV
2011-04-26 15:35:10,725:12625:ZOO_INFO@log_env@658: Client environment:zookeeper.version=zookeeper C client 3.3.3
2011-04-26 15:35:10,725:12625:ZOO_INFO@log_env@662: Client environment:host.name=node.test.we
2011-04-26 15:35:10,725:12625:ZOO_INFO@log_env@669: Client environment:os.name=Linux
2011-04-26 15:35:10,725:12625:ZOO_INFO@log_env@670: Client environment:os.arch=2.6.35-28-server
2011-04-26 15:35:10,725:12625:ZOO_INFO@log_env@671: Client environment:os.version=#49-Ubuntu SMP Tue Mar 1 14:55:37 UTC 2011
2011-04-26 15:35:10,725:12625:ZOO_INFO@log_env@679: Client environment:user.name=root
2011-04-26 15:35:10,725:12625:ZOO_INFO@log_env@687: Client environment:user.home=/root
2011-04-26 15:35:10,725:12625:ZOO_INFO@log_env@699: Client environment:user.dir=/tmp
2011-04-26 15:35:10,725:12625:ZOO_INFO@zookeeper_init@727: Initiating client connection, host=192.168.7.33:2181 sessionTimeout=2000 watcher=0x7f9e1390b24f sessionId=0 sessionPasswd= context=0x2228c90 flags=0
2011-04-26 15:35:10,726:12625:ZOO_DEBUG@yield@225: zk time to wait: 0.666000
2011-04-26 15:35:10,726:12625:ZOO_ERROR@handle_socket_error_msg@1579: Socket [192.168.7.33:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2011-04-26 15:35:10,726:12625:ZOO_ERROR@zk_io_cb@237: yield:zookeeper_process returned error: -4 - connection loss
2011-04-26 15:35:10,726:12625:ZOO_WARN@yield@228: yield:zookeeper_interest returned -1 descriptor
2011-04-26 15:35:11,339:12625:ZOO_DEBUG@zk_timer_cb@260: delaying ping timer by 0.052756
2011-04-26 15:35:11,399:12625:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-26 15:35:11,399:12625:ZOO_DEBUG@yield@225: zk time to wait: 0.666000
2011-04-26 15:35:12,069:12625:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-26 15:35:12,069:12625:ZOO_DEBUG@yield@225: zk time to wait: 0.662000
2011-04-26 15:35:12,739:12625:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-26 15:35:12,739:12625:ZOO_ERROR@handle_socket_error_msg@1528: Socket [192.168.7.33:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 7ms)
2011-04-26 15:35:12,739:12625:ZOO_ERROR@yield@213: yield:zookeeper_interest returned error: -7 - operation timeout
2011-04-26 15:35:13,399:12625:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-26 15:35:13,399:12625:ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 668ms
2011-04-26 15:35:13,399:12625:ZOO_WARN@yield@228: yield:zookeeper_interest returned -1 descriptor
2011-04-26 15:35:14,059:12625:ZOO_DEBUG@zk_timer_cb@260: delaying ping timer by 0.006009
2011-04-26 15:35:14,069:12625:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-26 15:35:14,069:12625:ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 1338ms
2011-04-26 15:35:14,069:12625:ZOO_DEBUG@yield@225: zk time to wait: 0.666000
2011-04-26 15:35:14,748:12625:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-26 15:35:14,748:12625:ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 12ms
2011-04-26 15:35:14,748:12625:ZOO_DEBUG@yield@225: zk time to wait: 0.655000
2011-04-26 15:35:15,408:12625:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-26 15:35:15,408:12625:ZOO_ERROR@handle_socket_error_msg@1528: Socket [192.168.7.33:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 5ms)
2011-04-26 15:35:15,408:12625:ZOO_ERROR@yield@213: yield:zookeeper_interest returned error: -7 - operation timeout
The text was updated successfully, but these errors were encountered: