New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random failures in ContinuationQueue#poll #462

Closed
jhtwong opened this Issue Feb 19, 2017 · 17 comments

Comments

Projects
None yet
4 participants
@jhtwong

jhtwong commented Feb 19, 2017

I have been debugging random failures on Queue#subscribe and Session#create_channel for days now, where they seem to die with the top of the stack like this:

Timeout::Error: Timeout::Error
/Users/joe/.rvm/gems/ruby-2.1.3/gems/bunny-2.6.3/lib/bunny/concurrent/continuation_queue.rb:33:in `block in poll'
/Users/joe/.rvm/gems/ruby-2.1.3/gems/bunny-2.6.3/lib/bunny/concurrent/continuation_queue.rb:30:in `synchronize'
/Users/joe/.rvm/gems/ruby-2.1.3/gems/bunny-2.6.3/lib/bunny/concurrent/continuation_queue.rb:30:in `poll'
/Users/joe/.rvm/gems/ruby-2.1.3/gems/bunny-2.6.3/lib/bunny/session.rb:1261:in `wait_on_continuations'
/Users/joe/.rvm/gems/ruby-2.1.3/gems/bunny-2.6.3/lib/bunny/session.rb:503:in `block in open_channel'
/Users/joe/.rvm/rubies/ruby-2.1.3/lib/ruby/2.1.0/monitor.rb:211:in `mon_synchronize'
/Users/joe/.rvm/gems/ruby-2.1.3/gems/bunny-2.6.3/lib/bunny/session.rb:496:in `open_channel'
/Users/joe/.rvm/gems/ruby-2.1.3/gems/bunny-2.6.3/lib/bunny/channel.rb:223:in `open'
/Users/joe/.rvm/gems/ruby-2.1.3/gems/bunny-2.6.3/lib/bunny/session.rb:345:in `block in create_channel'
/Users/joe/.rvm/rubies/ruby-2.1.3/lib/ruby/2.1.0/monitor.rb:211:in `mon_synchronize'
/Users/joe/.rvm/gems/ruby-2.1.3/gems/bunny-2.6.3/lib/bunny/session.rb:340:in `create_channel'
...

Per documentation I only allow one thread to access a particular channel at any given time. But still I run into random race conditions between an attempt to subscribe and a concurrent attempt by bunny to deliver a message on another subscription (on the same channel).

After digging into things, I found that ContinuationQueue#poll has a bug in it: it shouldn't do a @cond.signal after picking off the top of the @q array, if the contract of the @cond ConditionVariable is that it is signaled if and only if something is added to @q.

See:
b978942#diff-0066bdaf813d7be3eec58dfc0e476ec3R36

After deleting this one line the random failures seem to have gone away completely.

@michaelklishin

This comment has been minimized.

Show comment
Hide comment
@michaelklishin

michaelklishin Feb 19, 2017

Member

Thanks for spending your time on an investigation. Can you please submit a pull request?

Member

michaelklishin commented Feb 19, 2017

Thanks for spending your time on an investigation. Can you please submit a pull request?

@michaelklishin

This comment has been minimized.

Show comment
Hide comment
@michaelklishin

michaelklishin Feb 19, 2017

Member

Yes, that @cond.signal in Bunny::ContinuationQueue#poll looks strange. Maybe @jemc remembers why it works the way it does?

Member

michaelklishin commented Feb 19, 2017

Yes, that @cond.signal in Bunny::ContinuationQueue#poll looks strange. Maybe @jemc remembers why it works the way it does?

@jhtwong

This comment has been minimized.

Show comment
Hide comment
@jhtwong

jhtwong Feb 20, 2017

Actually even deleting the @cond.signal still sometimes result in a failure... which made me wonder how Ruby's memory model handles making changes visible across threads. This worked out better for me:

  def poll(timeout_in_ms = nil)
    timeout = timeout_in_ms ? timeout_in_ms / 1000.0 : nil

    @lock.synchronize do
      expiry = timeout.try(:+, Time.now.utc)
      while @q.empty?
        wait = expiry.try(:-, Time.now.utc)
        @cond.wait(@lock, wait)
        raise ::Timeout::Error if expiry && Time.now.utc >= expiry
      end
      item = @q.shift
      item
    end
  end

(EDIT: by tracing it I've seen this loop body execute a second time, perhaps a little surprising given that the only remaining @cond.signal is in #push. Also, I'm on MRI Ruby 2.1.3.)

At this point this is starting to look like the complexity of ::Queue's implementation itself. (Too bad it doesn't do timeouts natively.)

jhtwong commented Feb 20, 2017

Actually even deleting the @cond.signal still sometimes result in a failure... which made me wonder how Ruby's memory model handles making changes visible across threads. This worked out better for me:

  def poll(timeout_in_ms = nil)
    timeout = timeout_in_ms ? timeout_in_ms / 1000.0 : nil

    @lock.synchronize do
      expiry = timeout.try(:+, Time.now.utc)
      while @q.empty?
        wait = expiry.try(:-, Time.now.utc)
        @cond.wait(@lock, wait)
        raise ::Timeout::Error if expiry && Time.now.utc >= expiry
      end
      item = @q.shift
      item
    end
  end

(EDIT: by tracing it I've seen this loop body execute a second time, perhaps a little surprising given that the only remaining @cond.signal is in #push. Also, I'm on MRI Ruby 2.1.3.)

At this point this is starting to look like the complexity of ::Queue's implementation itself. (Too bad it doesn't do timeouts natively.)

@michaelklishin

This comment has been minimized.

Show comment
Hide comment
@michaelklishin

michaelklishin Feb 20, 2017

Member
Member

michaelklishin commented Feb 20, 2017

@michaelklishin

This comment has been minimized.

Show comment
Hide comment
@michaelklishin

michaelklishin Feb 20, 2017

Member

Your version doesn't handle nil timeouts. I've massaged it a little bit to account for nil values, running tests in a loop now before committing.

Member

michaelklishin commented Feb 20, 2017

Your version doesn't handle nil timeouts. I've massaged it a little bit to account for nil values, running tests in a loop now before committing.

@michaelklishin

This comment has been minimized.

Show comment
Hide comment
@michaelklishin

michaelklishin Feb 20, 2017

Member

@jhtwong please give master a try and thank you for the original patch. If this works well or at least better for you for a few days I can backport it to 2.6.x.

Member

michaelklishin commented Feb 20, 2017

@jhtwong please give master a try and thank you for the original patch. If this works well or at least better for you for a few days I can backport it to 2.6.x.

@jhtwong

This comment has been minimized.

Show comment
Hide comment
@jhtwong

jhtwong Feb 20, 2017

Thanks @michaelklishin! I just tried master @ 1f1c109 and so far so good for me. It'd be great to have it backported to 2.6.x!

(p.s. I had snuck in the try()s to handle the nil timeout case but I guess ninja-edits don't show up in email. ;) )

jhtwong commented Feb 20, 2017

Thanks @michaelklishin! I just tried master @ 1f1c109 and so far so good for me. It'd be great to have it backported to 2.6.x!

(p.s. I had snuck in the try()s to handle the nil timeout case but I guess ninja-edits don't show up in email. ;) )

@michaelklishin

This comment has been minimized.

Show comment
Hide comment
@michaelklishin

michaelklishin Feb 20, 2017

Member

@jhtwong thank you for digging out an almost 2 year old concurrency primitive issue. Let's wait for a few days and then backport.

Member

michaelklishin commented Feb 20, 2017

@jhtwong thank you for digging out an almost 2 year old concurrency primitive issue. Let's wait for a few days and then backport.

michaelklishin added a commit that referenced this issue Feb 20, 2017

Rework Bunny::ContinuationQueue#poll with feedback from Joseph Wong
Less susceptible to race conditions.

Fixes #462 (or so we hope).
@jhtwong

This comment has been minimized.

Show comment
Hide comment
@jhtwong

jhtwong Mar 3, 2017

@michaelklishin will a version 2.6.4 be cut with this fix? Thanks for backporting to 2.6.x!

jhtwong commented Mar 3, 2017

@michaelklishin will a version 2.6.4 be cut with this fix? Thanks for backporting to 2.6.x!

@michaelklishin

This comment has been minimized.

Show comment
Hide comment
@michaelklishin

michaelklishin Mar 3, 2017

Member
Member

michaelklishin commented Mar 3, 2017

@michaelklishin

This comment has been minimized.

Show comment
Hide comment
@jhtwong

This comment has been minimized.

Show comment
Hide comment
@jhtwong

jhtwong Mar 4, 2017

It's be working well for me, and thanks for releasing 2.6.4! 👍

jhtwong commented Mar 4, 2017

It's be working well for me, and thanks for releasing 2.6.4! 👍

@igorwwwwwwwwwwwwwwwwwwww

This comment has been minimized.

Show comment
Hide comment
@igorwwwwwwwwwwwwwwwwwwww

igorwwwwwwwwwwwwwwwwwwww Mar 29, 2017

@michaelklishin is it possible that you didn't push the 2.6.4 tag to github?

@michaelklishin is it possible that you didn't push the 2.6.4 tag to github?

@michaelklishin

This comment has been minimized.

Show comment
Hide comment
@michaelklishin

michaelklishin Mar 29, 2017

Member

@igorwwwwwwwwwwwwwwwwwwww oh, indeed. That's for spotting.

Member

michaelklishin commented Mar 29, 2017

@igorwwwwwwwwwwwwwwwwwwww oh, indeed. That's for spotting.

@igorwwwwwwwwwwwwwwwwwwww

This comment has been minimized.

Show comment
Hide comment
@SmrutiSuman

This comment has been minimized.

Show comment
Hide comment
@SmrutiSuman

SmrutiSuman Nov 28, 2017

@michaelklishin I am getting time out error even for bunny 2.7.2. Below I have given the error :

/home/user/.rvm/gems/ruby-2.3.1/gems/bunny-2.7.2/lib/bunny/concurrent/continuation_queue.rb:39:in `block in poll': Timeout::Error (Timeout::Error)

@michaelklishin I am getting time out error even for bunny 2.7.2. Below I have given the error :

/home/user/.rvm/gems/ruby-2.3.1/gems/bunny-2.7.2/lib/bunny/concurrent/continuation_queue.rb:39:in `block in poll': Timeout::Error (Timeout::Error)

@michaelklishin

This comment has been minimized.

Show comment
Hide comment
@michaelklishin

michaelklishin Nov 28, 2017

Member

@SmrutiSuman this is not a support forum. Consider providing more information when you ask others for help.

Member

michaelklishin commented Nov 28, 2017

@SmrutiSuman this is not a support forum. Consider providing more information when you ask others for help.

@ruby-amqp ruby-amqp locked and limited conversation to collaborators Nov 28, 2017

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.