connection to dead zookeeper ensemble #6

Closed
mmua opened this Issue Apr 26, 2011 · 13 comments

Projects

None yet

6 participants

@mmua
mmua commented Apr 26, 2011

When trying to connect to dead ensemble connection fails, but I haven't figured how to reconnect later in my code.
Even if zookeeper server is up again the connection still loops in failed state.

How to reproduce:

put zookeeper node down
start zk.js
put zookeeper node up

zk.js won't reconnect and still loops in failed state.
zk.js attached: http://pastebin.com/KjD3kXaV

2011-04-26 15:35:10,725:12625:ZOO_INFO@log_env@658: Client environment:zookeeper.version=zookeeper C client 3.3.3
2011-04-26 15:35:10,725:12625:ZOO_INFO@log_env@662: Client environment:host.name=node.test.we
2011-04-26 15:35:10,725:12625:ZOO_INFO@log_env@669: Client environment:os.name=Linux
2011-04-26 15:35:10,725:12625:ZOO_INFO@log_env@670: Client environment:os.arch=2.6.35-28-server
2011-04-26 15:35:10,725:12625:ZOO_INFO@log_env@671: Client environment:os.version=#49-Ubuntu SMP Tue Mar 1 14:55:37 UTC 2011
2011-04-26 15:35:10,725:12625:ZOO_INFO@log_env@679: Client environment:user.name=root
2011-04-26 15:35:10,725:12625:ZOO_INFO@log_env@687: Client environment:user.home=/root
2011-04-26 15:35:10,725:12625:ZOO_INFO@log_env@699: Client environment:user.dir=/tmp
2011-04-26 15:35:10,725:12625:ZOO_INFO@zookeeper_init@727: Initiating client connection, host=192.168.7.33:2181 sessionTimeout=2000 watcher=0x7f9e1390b24f sessionId=0 sessionPasswd= context=0x2228c90 flags=0
2011-04-26 15:35:10,726:12625:ZOO_DEBUG@yield@225: zk time to wait: 0.666000
2011-04-26 15:35:10,726:12625:ZOO_ERROR@handle_socket_error_msg@1579: Socket [192.168.7.33:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2011-04-26 15:35:10,726:12625:ZOO_ERROR@zk_io_cb@237: yield:zookeeper_process returned error: -4 - connection loss

2011-04-26 15:35:10,726:12625:ZOO_WARN@yield@228: yield:zookeeper_interest returned -1 descriptor

2011-04-26 15:35:11,339:12625:ZOO_DEBUG@zk_timer_cb@260: delaying ping timer by 0.052756
2011-04-26 15:35:11,399:12625:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-26 15:35:11,399:12625:ZOO_DEBUG@yield@225: zk time to wait: 0.666000
2011-04-26 15:35:12,069:12625:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-26 15:35:12,069:12625:ZOO_DEBUG@yield@225: zk time to wait: 0.662000
2011-04-26 15:35:12,739:12625:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-26 15:35:12,739:12625:ZOO_ERROR@handle_socket_error_msg@1528: Socket [192.168.7.33:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 7ms)
2011-04-26 15:35:12,739:12625:ZOO_ERROR@yield@213: yield:zookeeper_interest returned error: -7 - operation timeout

2011-04-26 15:35:13,399:12625:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-26 15:35:13,399:12625:ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 668ms
2011-04-26 15:35:13,399:12625:ZOO_WARN@yield@228: yield:zookeeper_interest returned -1 descriptor

2011-04-26 15:35:14,059:12625:ZOO_DEBUG@zk_timer_cb@260: delaying ping timer by 0.006009
2011-04-26 15:35:14,069:12625:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-26 15:35:14,069:12625:ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 1338ms
2011-04-26 15:35:14,069:12625:ZOO_DEBUG@yield@225: zk time to wait: 0.666000
2011-04-26 15:35:14,748:12625:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-26 15:35:14,748:12625:ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 12ms
2011-04-26 15:35:14,748:12625:ZOO_DEBUG@yield@225: zk time to wait: 0.655000
2011-04-26 15:35:15,408:12625:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-26 15:35:15,408:12625:ZOO_ERROR@handle_socket_error_msg@1528: Socket [192.168.7.33:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 5ms)
2011-04-26 15:35:15,408:12625:ZOO_ERROR@yield@213: yield:zookeeper_interest returned error: -7 - operation timeout

@yfinkelstein
Owner

Maxim, today's version of the client does not have a retry loop inside the init call. So, if all ensemble members are down (remeber - you can specify more than one in the 'connect' argument) the init call fails.
I would recommend to implement init retry loop yourself - similarly to how you implemented a poll loop around a_get call.

@mmua
mmua commented Apr 27, 2011

Sorry, Yuri, I can't detect failure in my code.
Currently I have following script to test:

var ZK = require ("zookeeper").ZooKeeper;

zk_session = function() {
var zk = new ZK();
zk.init ({connect:"192.168.7.33:2181", timeout:2000, debug_level:ZK.ZOO_LOG_LEVEL_DEBUG, host_order_deterministic:false});
zk.on (ZK.on_connected, function (zkk) {
console.log ("zk session established, id=%s", zkk.client_id);
});
};

zk_session();

I have recompiled zookeeper c client library and node-zookeeper and it still loops on dead connect:

2011-04-27 09:35:52,221:7216:ZOO_INFO@log_env@658: Client environment:zookeeper.version=zookeeper C client 3.3.3
2011-04-27 09:35:52,221:7216:ZOO_INFO@log_env@662: Client environment:host.name=node.test.we
2011-04-27 09:35:52,221:7216:ZOO_INFO@log_env@669: Client environment:os.name=Linux
2011-04-27 09:35:52,221:7216:ZOO_INFO@log_env@670: Client environment:os.arch=2.6.35-28-server
2011-04-27 09:35:52,221:7216:ZOO_INFO@log_env@671: Client environment:os.version=#49-Ubuntu SMP Tue Mar 1 14:55:37 UTC 2011
2011-04-27 09:35:52,221:7216:ZOO_INFO@log_env@679: Client environment:user.name=root
2011-04-27 09:35:52,221:7216:ZOO_INFO@log_env@687: Client environment:user.home=/root
2011-04-27 09:35:52,221:7216:ZOO_INFO@log_env@699: Client environment:user.dir=/tmp
2011-04-27 09:35:52,221:7216:ZOO_INFO@zookeeper_init@727: Initiating client connection, host=192.168.7.33:2181 sessionTimeout=2000 watcher=0x7f6952f8924f sessionId=0 sessionPasswd= context=0xd14cb0 flags=0
2011-04-27 09:35:52,221:7216:ZOO_DEBUG@yield@225: zk time to wait: 0.666000
2011-04-27 09:35:52,221:7216:ZOO_ERROR@handle_socket_error_msg@1579: Socket [192.168.7.33:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2011-04-27 09:35:52,221:7216:ZOO_ERROR@zk_io_cb@237: yield:zookeeper_process returned error: -4 - connection loss

2011-04-27 09:35:52,221:7216:ZOO_WARN@yield@228: yield:zookeeper_interest returned -1 descriptor

2011-04-27 09:35:52,848:7216:ZOO_DEBUG@zk_timer_cb@260: delaying ping timer by 0.039405
2011-04-27 09:35:52,888:7216:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-27 09:35:52,888:7216:ZOO_DEBUG@yield@225: zk time to wait: 0.666000
2011-04-27 09:35:53,558:7216:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-27 09:35:53,558:7216:ZOO_DEBUG@yield@225: zk time to wait: 0.662000
2011-04-27 09:35:54,228:7216:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-27 09:35:54,228:7216:ZOO_ERROR@handle_socket_error_msg@1528: Socket [192.168.7.33:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 8ms)
2011-04-27 09:35:54,228:7216:ZOO_ERROR@yield@213: yield:zookeeper_interest returned error: -7 - operation timeout

2011-04-27 09:35:54,898:7216:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-27 09:35:54,898:7216:ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 677ms
2011-04-27 09:35:54,898:7216:ZOO_WARN@yield@228: yield:zookeeper_interest returned -1 descriptor

2011-04-27 09:35:55,548:7216:ZOO_DEBUG@zk_timer_cb@260: delaying ping timer by 0.015979
2011-04-27 09:35:55,568:7216:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-27 09:35:55,568:7216:ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 1347ms
2011-04-27 09:35:55,568:7216:ZOO_DEBUG@yield@225: zk time to wait: 0.666000
2011-04-27 09:35:56,248:7216:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-27 09:35:56,248:7216:ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 13ms
2011-04-27 09:35:56,248:7216:ZOO_DEBUG@yield@225: zk time to wait: 0.653000
2011-04-27 09:35:56,908:7216:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-27 09:35:56,908:7216:ZOO_ERROR@handle_socket_error_msg@1528: Socket [192.168.7.33:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 7ms)
2011-04-27 09:35:56,908:7216:ZOO_ERROR@yield@213: yield:zookeeper_interest returned error: -7 - operation timeout

2011-04-27 09:35:57,558:7216:ZOO_DEBUG@zk_timer_cb@251: ping timer went off
2011-04-27 09:35:57,558:7216:ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 658ms
2011-04-27 09:35:57,558:7216:ZOO_WARN@yield@228: yield:zookeeper_interest returned -1 descriptor

@mmua mmua closed this Apr 27, 2011
@mmua mmua reopened this Apr 27, 2011
@Woodya
Collaborator
Woodya commented Apr 28, 2011

i think the problem he's describing is that the init call does loop internally in a failed state.

this happens ove and over:
2011-04-27 19:54:39,312:13644:ZOO_ERROR@yield@213: yield:zookeeper_interest returned error: -7 - operation timeout

this stems from the fact that the zookeeper c client behaves BADLY when it cannot connect to the first server it tries.

e.g. if i run 3 zk hosts, and i start his example code (after adding the missing code to call zk_session to begin with..), then it connects and startgs printing:
...

data some value
data some value
..

then if i shutdown the zk host i was connected to i get this:

2011-04-27 20:08:35,337:13945:ZOO_ERROR@handle_socket_error_msg@1603: Socket [127.0.0.1:5010] zk retcode=-4, errno=64(Host is down): failed while receiving a server response
2011-04-27 20:08:35,338:13945:ZOO_ERROR@zk_io_cb@237: yield:zookeeper_process returned error: -4 - connection loss

2011-04-27 20:08:35,338:13945:ZOO_ERROR@handle_socket_error_msg@1579: Socket [::1:5010] zk retcode=-4, errno=61(Connection refused): server refused to accept the client
2011-04-27 20:08:35,338:13945:ZOO_ERROR@zk_io_cb@237: yield:zookeeper_process returned error: -4 - connection loss

2011-04-27 20:08:38,004:13945:ZOO_ERROR@handle_socket_error_msg@1528: Socket [fe80::1:5010] zk retcode=-7, errno=60(Operation timed out): connection timed out (exceeded timeout by 1ms)
zoo error -7
ZK CLOSED
2011-04-27 20:08:38,005:13945:ZOO_INFO@zookeeper_close@2321: Freeing zookeeper resources for sessionId=0x12f9a13d5e80000

2011-04-27 20:08:38,005:13945:ZOO_ERROR@yield@213: yield:zookeeper_interest returned error: -7 - operation timeout

2011-04-27 20:08:43,005:13945:ZOO_INFO@log_env@658: Client environment:zookeeper.version=zookeeper C client 3.3.2
2011-04-27 20:08:43,005:13945:ZOO_INFO@log_env@662: Client environment:host.name=tintin.local
2011-04-27 20:08:43,005:13945:ZOO_INFO@log_env@669: Client environment:os.name=Darwin
2011-04-27 20:08:43,005:13945:ZOO_INFO@log_env@670: Client environment:os.arch=10.7.0
2011-04-27 20:08:43,005:13945:ZOO_INFO@log_env@671: Client environment:os.version=Darwin Kernel Version 10.7.0: Sat Jan 29 15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386
2011-04-27 20:08:43,005:13945:ZOO_INFO@log_env@679: Client environment:user.name=woody
2011-04-27 20:08:43,005:13945:ZOO_INFO@log_env@687: Client environment:user.home=/Users/woody
2011-04-27 20:08:43,006:13945:ZOO_INFO@log_env@699: Client environment:user.dir=/Users/woody/dev/node/test
2011-04-27 20:08:43,006:13945:ZOO_INFO@zookeeper_init@727: Initiating client connection, host=undefined sessionTimeout=2000 watcher=0x125889420 sessionId=0 sessionPasswd= context=0x100a0e250 flags=0
2011-04-27 20:08:43,006:13945:ZOO_ERROR@getaddrs@457: no port in undefined
2011-04-27 20:08:43,006:13945:ZOO_ERROR@realInit@267: zookeeper_init returned 0!

and the program exits b/c closed is actually emitted due to zookeeper_init() returning a null handle inside node-zk.cpp#realIinit()

in the case where none of the hosts are up to begin with, zookeeper_init() returns a handle, but doesn't actually connect (ever), but it yield() never uses this information to emit on_closed or error or anything, it simply loops internally waiting to connect over and over.

however, it doesn't re-init, such that IF the host does start, that it would successfully connect, so it kind of bones itself.
this behaves similarly if you have a 3 host zk system. ports a,b,c. shutdown the first host (port a).
now try to run zk code with connect string localhost:a,localhost:b,localhost:c

sometimes this will succeed (if it tries b or c first), BUT if it tries port a first, it fails rather unexpectedly.
it gets a connection refused to port a (expected), followed by a timeout while connecting to port b (very unexpected)

so, this is definitely a bug worth fixing, but i think it's with the zookeeper c client code. perhaps we could take it up with them for comment.

@Woodya
Collaborator
Woodya commented Apr 28, 2011

oh, and you can recreate this with any code, not just his.

do the 3 node setup and run: cd test ; node zk_test_buffer.js host:a,host:b,host:c

same issue (though you have to change the debug level to see it..)

@Woodya
Collaborator
Woodya commented Apr 28, 2011

so i've tracked things this far (int zookeeper c code)

if i don't close the zh->fd when the socket fails to connect to the first server, then the second connection works immediately.
if i do (which one would think is a good idea), then it fails seemingly forever.

i'm going to try to get some help for the zookeeper mailing list

@jayjanssen

Any update on this? I, too, need to deal with the possibility for a failed connection, preferably without exiting my program.

@mmua
mmua commented Jun 24, 2011

Unfortunately I didn't find a solution and suspended my work with
zookeeper and node.js. Probably I'll return to it soon.

2011/6/23 jayjanssen
reply@reply.github.com:

Any update on this?  I, too, need to deal with the possibility for a failed connection, preferably without exiting my program.

Reply to this email directly or view it on GitHub:
#6 (comment)

Best regards,
Maxim Moroz

@jayjanssen

I think I have a viable work-around for this issue, and the issue of losing a connection to zookeeper, and then reconnecting when it's available again. My zookeeper interaction is wrapped in a single object anyway, so this isn't too painful.

  • A local variable that stores the zk handle
  • A get_handle function that returns the zk handle if it is defined, otherwise it executes a connection function
  • The connection function takes a callback and re-initializes the zkclass (new ZK()) and does zkclass.init (unless somebody else already started it). Re-initializing ZK was the key bit for me, otherwise bad things seemed to happen.
  • That function then sets both an on_connected and setTimeout event.
  • If zk connects, it stops the timeout, sets the local zk variable and returns it in the callback.
  • If it gets the timeout first, it removes the on_connected listener and returns undefined.

All of the above will handle it when you have a good connection, and when you have a no connection. It will also handle when you have a bad connection that turns good. But it won't catch when a good connection goes bad. To address that, all of my calls to zookeeper (a_get, etc.) check the rc return code, and if it isn't what I expect (especially if it is ZCONNECTIONLOSS), I close my zk handle and set it to undefined, which will put the class back in an "unconnected" state.

The key for me was not blocking my application http requests for more than a brief timeout if I wasn't connected to zookeeper. I'd rather return a failure quickly than block all my clients indefinitely.

@yfinkelstein
Owner

This sounds like a decent workaround. Can you post the actual code so that people could use it while we're trying to fix the native driver please?
Thanks,
Yuri


From: jayjanssen reply@reply.github.com
To: yurif2003@yahoo.com
Sent: Friday, June 24, 2011 7:40 AM
Subject: Re: [node-zookeeper] connection to dead zookeeper ensemble (#6)

I think I have a viable work-around for this issue, and the issue of losing a connection to zookeeper, and then reconnecting when it's available again. My zookeeper interaction is wrapped in a single object anyway, so this isn't too painful.

  • A local variable that stores the zk handle
  • A get_handle function that returns the zk handle if it is defined, otherwise it executes a connection function
  • The connection function takes a callback and re-initializes the zkclass (new ZK()) and does zkclass.init (unless somebody else already started it). Re-initializing ZK was the key bit for me, otherwise bad things seemed to happen.
  • That function then sets both an on_connected and setTimeout event.
    • If zk connects, it stops the timeout, sets the local zk variable and returns it in the callback.
    • If it gets the timeout first, it removes the on_connected listener and returns undefined.

All of the above will handle it when you have a good connection, and when you have a no connection. It will also handle when you have a bad connection that turns good. But it won't catch when a good connection goes bad. To address that, all of my calls to zookeeper (a_get, etc.) check the rc return code, and if it isn't what I expect (especially if it is ZCONNECTIONLOSS), I close my zk handle and set it to undefined, which will put the class back in an "unconnected" state.

The key for me was not blocking my application http requests for more than a brief timeout if I wasn't connected to zookeeper. I'd rather return a failure quickly than block all my clients indefinitely.

Reply to this email directly or view it on GitHub:
#6 (comment)

@Woodya
Collaborator
Woodya commented Jun 25, 2011

https://issues.apache.org/jira/browse/ZOOKEEPER-1057 is the ticket i created on the zookeeper project that relates to this issue.

I haven't seen any movement on that front.

@jayjanssen

This is what I've got embedded in my zk class:

https://gist.github.com/1048955

It seems to work correctly, but there certainly could be some issues.

@rphillips
Contributor

I fixed this with my current pull request: #19

The issue was with the libev code within this module.

@kuebk
Collaborator
kuebk commented Feb 2, 2013

Closing, if this issue should be open let me know.

@kuebk kuebk closed this Feb 2, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment