Rubinius loses a block and throws LocalJumpError errors #1940

Closed
thedarkone opened this Issue Oct 3, 2012 · 7 comments

Projects

None yet

2 participants

@thedarkone
Contributor

Rubinius version: rubinius 2.0.0dev (1.8.7 32c35a4f yyyy-mm-dd JI) [x86_64-apple-darwin10.2.0] (1.9 mode doesn't help).

Rubinius throws an invalid LocalJumpError during a run of the test_cache_loops.rb stress test of the thread_safe gem.

The backtrace looks like this: https://gist.github.com/3827440.

The test code always calls compute_if_absent with a block like this while compute_if_absent correctly passes on the block to Table#try_to_cas_in_computed here, therefore I maintain my code is correct and shouldn't be throwing LocalJumpError errors.

I run into this while refactoring a method, so in end this might end to be a bug in my code. I can make this go away by manually inlining a single iteration of the AtomiceReferenceCacheBackend#find_value_in_node_list into the compute_if_absent method like this. The "workaround" can be tested by commenting out the && false here. JRuby runs the same code without any issues though, so I'm thinking this might be an rbx bug after all.

To reproduce:
git clone git://github.com/thedarkone/thread_safe.git
cd thread_safe
git checkout rbx-block-lost
ruby -rubygems -Ilib ./test/test_cache_loops.rb

For a speedier run the THREAD_COUNT const can be somewhat bumped down, I'm still getting full reproducibility with THREAD_COUNT = 8 on a dual core machine.

This is semi-related to: #1888.

@thedarkone
Contributor

@dbussink it is kinda important that 1888 is fixed first (unless its bugs stem from this ticket) or else you fix this JIT issue and the bugs exposed in 1888 go back into hiding...

@dbussink
Member
dbussink commented Oct 4, 2012

There are multiple things probably going on. It can also be that the deadlocks are not because of threading issues itself but of other issues. If there are issues running Ruby code in certain situations that could lead to a state where it actually deadlocks.

If there is erratic behavior in certain cases, that basically makes all reasoning about what happens afterwards at least dubious.

@dbussink dbussink added a commit that referenced this issue Oct 5, 2012
@dbussink dbussink Fix race condition when concurrently reading and modifying header
This fixes a potential number of race conditions. The most problematic
one is the following. If we are in the middle of inflating a header and
concurrently reading another flag there's a potential problem. If we
check the header status first and it's not inflated but then read the
flags after it's inflated, the flags contain bogus information.

We fix this by making sure we always first copy the header, then check
inflation status and return this copy if it is not inflated. This means
that the code can see slightly outdated information, but it can already
handle that properly because this type of race could happen in other
places too.

We also change every modification of changing the flags to be atomic.
This is necessary to prevent certain information from being lost. If we
for example freeze an object while concurrently inflating the header, it
could be that we set the frozen bit in the header after we already
copied the flags into the inflated header. In that case setting the
frozen status would be lost.

This is not a huge problem, but the side effect is more problematic.
Since we reuse the header space as the pointer to the inflated header,
modifying a bit there would mean the inflated header would point at a
totally different inflated header. This would mean the object would
suddenly get a very different status.

These changes are related to issue #1940. In this issue the behavior
exposed itself as follows. The code resulted in a concurrent scenario
where an object header was inflated while concurrently retrieving an
instance variable. Because of this, the returned obj_type was the value
of the bits inside the inflated header pointer:

$2 = {
  <rubinius::ObjectHeader> = {
    header = {
      f = {
        inflated = 1,
        meaning = 0,
        obj_type = 217,
        zone = rubinius::UnspecifiedZone,
        age = 2,
        Forwarded = 1,
        Remember = 1,
        Marked = 1,
        InImmix = 0,
        Pinned = 0,
        Frozen = 0,
        Tainted = 1,
        Untrusted = 0,
        LockContended = 0,
        unused = 0,
        aux_word = 1
      },
      flags64 = 4312680137,
      all_flags = 0x1010e46c9
    },
    klass_ = 0x104a3ff90,
    ivars_ = 0x1a,
    __body__ = {0x0}
  }, <No data fields>}

The value here is obj_type 217, which is actually an invalid value. What
happens because of this is that the code in Object#get_ivar would see
that obj_type. This resulted in it not recognizing it as a packed object
and this not running the proper code to retrieve the instance variable.

This resulted in a nil being returned as the instance variable instead
of the actual instance variable and then resulted in a NoMethodError.
c188bd6
@thedarkone
Contributor

@dbussink I see you've run into an illusive nil bug, that I've seen sometimes but could never reproduce. Awesome commit message btw.

@dbussink
Member
dbussink commented Oct 5, 2012

That nil bug should be gone now yeah. I've been running that test/test_cache_loops.rb script a bunch of times now with the JIT disabled and then it runs fine each time.

I suspect the remaining problem(s) to be in the JIT. They could also result in deadlocking etc. because if it jumps to wrong places in the code it might not lock or unlock objects etc. So basically anything can still then go wrong.

@thedarkone
Contributor

Just tested and there is no need for concurrency to get the local jump error (need to bump :loop_count to trigger the JIT threshold and thats it), see the latest commit on rbx-block-lost. Might be easier to debug it this way.

@dbussink dbussink added a commit that closed this issue Oct 10, 2012
@dbussink dbussink Always emit code for pushing a block in the JIT
There are situations where we create incorrect code before this fix.

Imagine the following code:

def x
  y do
    true
  end
end

def y
  z { yield }
end

def z
  yield
end

Now if in this case we decide to compile 'x', we can inline y into x and
z into y. The problem occurs when we try to inline z but we can't due to
it being too complex, or when it blows the inline budget. In that case
the block hasn't been pushed so z will raise a NoJumpError because it
can't find the block.

In the future we need to make the JIT logic here smarter. We need to
keep track of whether we end up using the block in the whole chain down
when we compile a method and if we don't end up using the block, we can
remove the code for pushing it.

Fixes #1940
897570e
@dbussink dbussink closed this in 897570e Oct 10, 2012
@thedarkone thedarkone added a commit to thedarkone/thread_safe that referenced this issue Oct 11, 2012
@thedarkone thedarkone Tweak the code a bit, since rubinius/rubinius#1940 has been fixed. bd2b925
@thedarkone
Contributor

@dbussink: do you know why a change like this seemed to make the bug go away? Was it because the method became slightly bigger and JIT stopped compiling/inlining stuff?

@dbussink
Member

That could be it, or due to not having to call an additional it could end up being actually less code in total compared to inlining and there could compile it properly. I didn't investigate the exact cause of that further.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment