Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random failures in ContinuationQueue#poll #462

Closed
jhtwong opened this issue Feb 19, 2017 · 17 comments
Closed

Random failures in ContinuationQueue#poll #462

jhtwong opened this issue Feb 19, 2017 · 17 comments

Comments

@jhtwong
Copy link

@jhtwong jhtwong commented Feb 19, 2017

I have been debugging random failures on Queue#subscribe and Session#create_channel for days now, where they seem to die with the top of the stack like this:

Timeout::Error: Timeout::Error
/Users/joe/.rvm/gems/ruby-2.1.3/gems/bunny-2.6.3/lib/bunny/concurrent/continuation_queue.rb:33:in `block in poll'
/Users/joe/.rvm/gems/ruby-2.1.3/gems/bunny-2.6.3/lib/bunny/concurrent/continuation_queue.rb:30:in `synchronize'
/Users/joe/.rvm/gems/ruby-2.1.3/gems/bunny-2.6.3/lib/bunny/concurrent/continuation_queue.rb:30:in `poll'
/Users/joe/.rvm/gems/ruby-2.1.3/gems/bunny-2.6.3/lib/bunny/session.rb:1261:in `wait_on_continuations'
/Users/joe/.rvm/gems/ruby-2.1.3/gems/bunny-2.6.3/lib/bunny/session.rb:503:in `block in open_channel'
/Users/joe/.rvm/rubies/ruby-2.1.3/lib/ruby/2.1.0/monitor.rb:211:in `mon_synchronize'
/Users/joe/.rvm/gems/ruby-2.1.3/gems/bunny-2.6.3/lib/bunny/session.rb:496:in `open_channel'
/Users/joe/.rvm/gems/ruby-2.1.3/gems/bunny-2.6.3/lib/bunny/channel.rb:223:in `open'
/Users/joe/.rvm/gems/ruby-2.1.3/gems/bunny-2.6.3/lib/bunny/session.rb:345:in `block in create_channel'
/Users/joe/.rvm/rubies/ruby-2.1.3/lib/ruby/2.1.0/monitor.rb:211:in `mon_synchronize'
/Users/joe/.rvm/gems/ruby-2.1.3/gems/bunny-2.6.3/lib/bunny/session.rb:340:in `create_channel'
...

Per documentation I only allow one thread to access a particular channel at any given time. But still I run into random race conditions between an attempt to subscribe and a concurrent attempt by bunny to deliver a message on another subscription (on the same channel).

After digging into things, I found that ContinuationQueue#poll has a bug in it: it shouldn't do a @cond.signal after picking off the top of the @q array, if the contract of the @cond ConditionVariable is that it is signaled if and only if something is added to @q.

See:
b978942#diff-0066bdaf813d7be3eec58dfc0e476ec3R36

After deleting this one line the random failures seem to have gone away completely.

@michaelklishin
Copy link
Member

@michaelklishin michaelklishin commented Feb 19, 2017

Thanks for spending your time on an investigation. Can you please submit a pull request?

@michaelklishin
Copy link
Member

@michaelklishin michaelklishin commented Feb 19, 2017

Yes, that @cond.signal in Bunny::ContinuationQueue#poll looks strange. Maybe @jemc remembers why it works the way it does?

@jhtwong
Copy link
Author

@jhtwong jhtwong commented Feb 20, 2017

Actually even deleting the @cond.signal still sometimes result in a failure... which made me wonder how Ruby's memory model handles making changes visible across threads. This worked out better for me:

  def poll(timeout_in_ms = nil)
    timeout = timeout_in_ms ? timeout_in_ms / 1000.0 : nil

    @lock.synchronize do
      expiry = timeout.try(:+, Time.now.utc)
      while @q.empty?
        wait = expiry.try(:-, Time.now.utc)
        @cond.wait(@lock, wait)
        raise ::Timeout::Error if expiry && Time.now.utc >= expiry
      end
      item = @q.shift
      item
    end
  end

(EDIT: by tracing it I've seen this loop body execute a second time, perhaps a little surprising given that the only remaining @cond.signal is in #push. Also, I'm on MRI Ruby 2.1.3.)

At this point this is starting to look like the complexity of ::Queue's implementation itself. (Too bad it doesn't do timeouts natively.)

@michaelklishin
Copy link
Member

@michaelklishin michaelklishin commented Feb 20, 2017

@michaelklishin
Copy link
Member

@michaelklishin michaelklishin commented Feb 20, 2017

Your version doesn't handle nil timeouts. I've massaged it a little bit to account for nil values, running tests in a loop now before committing.

@michaelklishin
Copy link
Member

@michaelklishin michaelklishin commented Feb 20, 2017

@jhtwong please give master a try and thank you for the original patch. If this works well or at least better for you for a few days I can backport it to 2.6.x.

@jhtwong
Copy link
Author

@jhtwong jhtwong commented Feb 20, 2017

Thanks @michaelklishin! I just tried master @ 1f1c109 and so far so good for me. It'd be great to have it backported to 2.6.x!

(p.s. I had snuck in the try()s to handle the nil timeout case but I guess ninja-edits don't show up in email. ;) )

@michaelklishin
Copy link
Member

@michaelklishin michaelklishin commented Feb 20, 2017

@jhtwong thank you for digging out an almost 2 year old concurrency primitive issue. Let's wait for a few days and then backport.

michaelklishin added a commit that referenced this issue Feb 20, 2017
Less susceptible to race conditions.

Fixes #462 (or so we hope).
@jhtwong
Copy link
Author

@jhtwong jhtwong commented Mar 3, 2017

@michaelklishin will a version 2.6.4 be cut with this fix? Thanks for backporting to 2.6.x!

@michaelklishin
Copy link
Member

@michaelklishin michaelklishin commented Mar 3, 2017

@michaelklishin
Copy link
Member

@michaelklishin michaelklishin commented Mar 4, 2017

@jhtwong
Copy link
Author

@jhtwong jhtwong commented Mar 4, 2017

It's be working well for me, and thanks for releasing 2.6.4! 👍

@igorwwwwwwwwwwwwwwwwwwww
Copy link

@igorwwwwwwwwwwwwwwwwwwww igorwwwwwwwwwwwwwwwwwwww commented Mar 29, 2017

@michaelklishin is it possible that you didn't push the 2.6.4 tag to github?

@michaelklishin
Copy link
Member

@michaelklishin michaelklishin commented Mar 29, 2017

@igorwwwwwwwwwwwwwwwwwwww oh, indeed. That's for spotting.

@igorwwwwwwwwwwwwwwwwwwww
Copy link

@igorwwwwwwwwwwwwwwwwwwww igorwwwwwwwwwwwwwwwwwwww commented Mar 29, 2017

@michaelklishin thanks! ❤️

@SmrutiSuman
Copy link

@SmrutiSuman SmrutiSuman commented Nov 28, 2017

@michaelklishin I am getting time out error even for bunny 2.7.2. Below I have given the error :

/home/user/.rvm/gems/ruby-2.3.1/gems/bunny-2.7.2/lib/bunny/concurrent/continuation_queue.rb:39:in `block in poll': Timeout::Error (Timeout::Error)

@michaelklishin
Copy link
Member

@michaelklishin michaelklishin commented Nov 28, 2017

@SmrutiSuman this is not a support forum. Consider providing more information when you ask others for help.

@ruby-amqp ruby-amqp locked and limited conversation to collaborators Nov 28, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants
You can’t perform that action at this time.