Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

Harden against connection and server failures #50

Merged
merged 30 commits into from Sep 9, 2013

Conversation

Projects
None yet
2 participants
Member

eric commented Sep 8, 2013

I created a test harness using a zk-server cluster that sent alternating SIGSTOP and SIGCONT signals every 30 seconds to each zookeeper server in sequence to simulate random network failures.

The changes from this included:

  • Make sure zookeeper_process() is called even if rb_thread_select() times out
  • Fixing a bug in the apache zookeeper library (reported as ZOOKEEPER-1756)
  • Upgrade to zookeeper 3.4.5 client library
  • Only submit commands when the connection state is connected
  • Prevent new commands from being submitted if the connection has expired
Contributor

slyphon commented Sep 8, 2013

just noticed this line should have the condition reversed:

        until @reg.in_flight.empty? or @_closed or is_unrecoverable
         # stuff
        end

is_unrecoverable will barf with a RuntimeError if @_closed is true because the state of @_closed is flipped by the C layer when the zk handle has been cleaned up.

Contributor

slyphon commented Sep 8, 2013

should remove the old zkc-3.3.5 tarball

We should also try to upgrade the java driver, to keep them on the same version (probably make that a separate commit before the next release).

@slyphon slyphon commented on the diff Sep 8, 2013

ext/zkrb.c
@@ -526,7 +527,7 @@ static VALUE method_get(VALUE self, VALUE reqid, VALUE path, VALUE async, VALUE
char * data = NULL;
if (IS_SYNC(call_type)) {
data = malloc(MAX_ZNODE_SIZE); /* ugh */
- memset(data, 0, sizeof(data));
+ memset(data, 0, MAX_ZNODE_SIZE);
@slyphon

slyphon Sep 8, 2013

Contributor

You know more about this than I do, is sizeof(data) incorrect?

I understand that MAX_ZNODE_SIZE is a defined constant, but, shouldn't these two be equivalent?

(i'm not objecting to the change, I just want to know if I screwed up somehow)

@eric

eric Sep 8, 2013

Member

It is incorrect.

Because data is defined as:

   char * data = NULL;

The sizeof(data) will return 4 (the size of a pointer).

If you had this:

   char data[MAX_ZNODE_SIZE];

The result of sizeof(data) would have been MAX_ZNODE_SIZE.

@slyphon slyphon commented on the diff Sep 8, 2013

ext/zkrb.c
@@ -788,10 +789,10 @@ static VALUE method_zkrb_iterate_event_loop(VALUE self) {
fd_set rfds, wfds, efds;
FD_ZERO(&rfds); FD_ZERO(&wfds); FD_ZERO(&efds);
- int fd=0, interest=0, events=0, rc=0, maxfd=0;
+ int fd = 0, interest = 0, events = 0, rc = 0, maxfd = 0, irc = 0, prc = 0;
@slyphon

slyphon Sep 8, 2013

Contributor

ahhh, breathing room.

@slyphon slyphon and 1 other commented on an outdated diff Sep 8, 2013

@@ -94,6 +94,7 @@ end
task :build do
cd 'ext' do
+ ENV['ZK_DEV'] = 'true'
@slyphon

slyphon Sep 8, 2013

Contributor

yeah?

@eric

eric Sep 8, 2013

Member

If you're building via the rake task, why not leave around the source to zkc?

Contributor

slyphon commented Sep 8, 2013

It looks like the build is failing because the GIL-release-wrapping-codegen script can't find zookeeper.h, not sure why though. It's doing a recursive glob in the ext dir here.

Member

eric commented Sep 8, 2013

The problem with the build failing is a travis-ci bug, you can see the passing builds here.

I have sent them an email asking about the situation.

I've addressed your comments, removed the old 3.3.5 tar and updated java to 3.4.5.

The travis build with all of this should be at: https://travis-ci.org/zk-ruby/zookeeper/builds/11128784

Member

eric commented Sep 8, 2013

Looks like I'm going to have to upgrade zk-server to be able to upgrade the java jar...

https://travis-ci.org/zk-ruby/zookeeper/jobs/11128789

Member

eric commented Sep 8, 2013

I've given up on including the java client upgrade as part of this change. There are too many dependencies to sort out for me to take on as part of this.

Member

eric commented Sep 8, 2013

If there aren't any more comments, this should be good to merge.

eric added a commit that referenced this pull request Sep 9, 2013

Merge pull request #50 from zk-ruby/zoomonkey
Harden against connection and server failures

@eric eric merged commit ebdfa78 into master Sep 9, 2013

@eric eric deleted the zoomonkey branch Sep 9, 2013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment