Bogus data being marked for GC sweeps under Rubinius #1047

Closed
YorickPeterse opened this Issue Feb 12, 2014 · 25 comments

Projects

None yet

4 participants

@YorickPeterse

Problem: in certain cases it seems that Nokogiri is marking (using the mark function found at https://github.com/sparklemotion/nokogiri/blob/master/ext/nokogiri/xml_node.c#L15) bogus data to be sweeped by the GC. Under MRI this seems to magically work but under Rubinius this triggers a segfault.

A script that reproduces this is as following: https://gist.github.com/YorickPeterse/29bdbab31c0cabdc66b2. When running this make sure you're using Rbx 2.2.5 and Nokogiri from Git source.

Observations so far:

Samples from gdb:

(gdb) frame
#0  0x00007ffff63248cc in mark (node=0x25247e0) at xml_node.c:18
18        if(doc->type == XML_DOCUMENT_NODE || doc->type == XML_HTML_DOCUMENT_NODE) {
(gdb) info locals
doc = 0x0
(gdb) info args
node = 0x25247e0
(gdb) p node
$5 = (xmlNodePtr) 0x25247e0
(gdb) p doc
$6 = (xmlNodePtr) 0x0
(gdb) p *node
$7 = {_private = 0x0, type = 0, name = 0x1f1f360 "", children = 0x0, last = 0x2524830, parent = 0x0, next = 0x0, prev = 0x0, doc = 0x0, ns = 0x21, 
  content = 0x79726f6765746163 <error: Cannot access memory at address 0x79726f6765746163>, properties = 0x7ffff66d8600 <__memalign_hook>, nsDef = 0x0, psvi = 0x81, line = 0, 
  extra = 0}
(gdb) p *doc
Cannot access memory at address 0x0
(gdb) 

The corresponding backtrace:

(gdb) bt
#0  0x00007ffff63248cc in mark (node=0x25247e0) at xml_node.c:18
#1  0x00000000006b2151 in rubinius::Data::Info::mark (this=<optimized out>, t=0x7fff689c8d48, mark=...) at vm/builtin/data.cpp:155
#2  0x000000000071ddb2 in rubinius::GarbageCollector::scan_object (this=0x1554a30, obj=0x7fff689c8d48) at vm/gc/gc.cpp:108
#3  0x000000000071fb0d in walk_pointers (this=<optimized out>, addr=..., mark=...) at /home/yorickpeterse/Private/Projects/ruby/rubinius/vm/gc/immix.hpp:78
#4  process_mark_stack (this=<optimized out>, alloc=..., count=<optimized out>) at /home/yorickpeterse/Private/Projects/ruby/rubinius/vm/util/immix.hpp:1095
#5  rubinius::ImmixGC::process_mark_stack (this=<optimized out>, count=<optimized out>, this=<optimized out>, count=<optimized out>) at vm/gc/immix.cpp:382
#6  0x00000000007212ce in rubinius::ImmixMarker::perform (this=0x1a96210, state=0x7fffefffeec0) at vm/gc/immix_marker.cpp:159
#7  0x0000000000721193 in rubinius::immix_marker_tramp (state=0x7fffefffeec0) at vm/gc/immix_marker.cpp:18
#8  0x00000000006f90fd in rubinius::Thread::in_new_thread (ptr=0x1a732d0) at vm/builtin/thread.cpp:259
#9  0x00007ffff72830a2 in start_thread () from /usr/lib/libpthread.so.0
#10 0x00007ffff641bd1d in clone () from /usr/lib/libc.so.6

I doubt this particular issue is caused due to threading, it's more likely that this is an issue similar to the one described in #939. The use of threading simply increases the amount of GC activity which in turn causes this issue to surface much quicker.

@YorickPeterse

Extra info: Rubinius has two GC systems in place, Immix and Baker. Baker is used for the young generation, Immix for the mature generation (if I'm not mistaken). In a production application the above issue mainly occurs in the Baker system whereas in the script mentioned above the issue primarily occurs in the Immix system.

@YorickPeterse

http://hastebin.com/raw/rolenakale This pastie contains a sample of the backtraces of all threads when this issue would occur in a production application. The relevant bit is the following:

Thread 10 (Thread 0x7fffd8fec700 (LWP 18605)):
#0  rubinius::BakerGC::saw_object (this=0x1554880, obj=0x0) at vm/gc/baker.cpp:86
#1  0x000000000070af22 in rb_gc_mark (ptr=140736689064640) at vm/capi/gc.cpp:44
#2  0x00007ffff5e11932 in mark (node=0x7fffd04f4ca0) at xml_node.c:23
#3  0x00000000006b2151 in rubinius::Data::Info::mark (this=<optimized out>, t=0x7fffda3a7dd8, mark=...) at vm/builtin/data.cpp:155
#4  0x000000000071ddb2 in rubinius::GarbageCollector::scan_object (this=0x1554880, obj=0x7fffda3a7dd8) at vm/gc/gc.cpp:108
#5  0x000000000071a1f7 in rubinius::BakerGC::collect (this=0x1554880, data=0x7fffd8fccc48, stats=0x7fffd8fccc28) at vm/gc/baker.cpp:215
#6  0x0000000000656c31 in rubinius::ObjectMemory::collect_young (this=0x15543b0, state=0x7fffd8fcccf8, data=0x7fffd8fccc48, stats=0x7fffd8fccc28) at vm/object_memory.cpp:574
#7  0x00000000006569ad in rubinius::ObjectMemory::collect_maybe (this=0x15543b0, state=0x7fffd8fcccf8, gct=..., call_frame=<optimized out>) at vm/object_memory.cpp:531
#8  0x00000000006a1be5 in rubinius::VM::collect_maybe (this=0x1554880, gct=..., call_frame=0x719c90 <rubinius::BakerGC::saw_object(rubinius::Object*)>) at vm/vm.cpp:310
#9  0x00000000006add63 in collect_and_allocate (state=<optimized out>, gct=..., self=<optimized out>, self=<optimized out>, self=<optimized out>, self=<optimized out>, 
    self=<optimized out>, calling_environment=<optimized out>, state=<optimized out>, gct=..., self=<optimized out>, calling_environment=<optimized out>)
    at vm/builtin/class.cpp:66
#10 rubinius::(anonymous namespace)::allocate_packed (state=0x7fffc40008c0, gct=..., self=0x7fffd8fccd58, calling_environment=0x7fffd8fcd650) at vm/builtin/class.cpp:114

This shows that it occurs both in the young and mature generation.

Corresponding Rubinius issue: rubinius/rubinius#2908

@YorickPeterse

Backtraces of the repro script: http://hastebin.com/raw/pexacaladi

@brixen brixen added a commit to rubinius/rubinius that referenced this issue Feb 19, 2014
@brixen brixen Fixed reference_p() to return false for NULL.
This indirectly fixes one of the issues exposed in the following:

sparklemotion/nokogiri#1047
#2844

This is not the root cause of the Nokogiri-relate segvs, which
appear to be Nokogiri (libxml2) releasing data for a node while
a Data object wrapping it is still reachable and hence is having
mark() called on it.
eba5d40
@brixen
brixen commented Feb 21, 2014

I ran the repro script that @YorickPeterse provided, with one change: I upped the threads to 100.

I ran this on an Ubuntu 12.04 vbox machine on rubinius/rubinius@56d2edb with rubinius/rubinius@eba5d40 and rubinius/rubinius@6da3885 applied.

I build Rubinius and Nokogiri without optimizations and with debugging symbols enabled.

I instrumented Nokogiri (https://gist.github.com/brixen/9138421#file-nokogiri_1047-diff) to track where libxml2 pointers were being wrapped and where xmlFree* was being called. I also added instrumentation around the malloc, realloc, and free functions that Nokogiri is configuring libxml2 with.

I instrumented Rubinius (https://gist.github.com/brixen/9138421#file-rbx-diff) to track Data objects being finalized.

I've run multiple times with the same results:

  1. A pointer is allocated
  2. The pointer is wrapped by Nokogiri
  3. The pointer is marked
  4. The pointer is free'd, with every indication being it is free'd by libxml2
  5. The Ruby object wrapping the pointer is still reachable, and an attempt to mark the pointer causes a SEGV

This gist is the output of running ack on the log file for the pointer that causes the fault: https://gist.github.com/brixen/9106890. I built the instrumentation up over time, but the same pattern is visible. This run is the last: https://gist.github.com/brixen/9106890#file-gistfile7-txt

The pointer causing the fault is a member of a NodeSet and is wrapped at this point in Nokogiri code: https://github.com/sparklemotion/nokogiri/blob/master/ext/nokogiri/xml_node_set.c#L268

I attempted to use xmlCopyNode, passing 1 for the extended parameter, but that function does not completely copy the node. In particular, the node->doc was 0x0 in the copy, as were parent, next, previous, IIRC.

The fundamental issue appears to be that Nokogiri is assuming the lifetime of the libxml2 object that it is wrapping bounds the lifetime of the Ruby object in which the pointer is wrapped. This assumption appears to be invalid in this particular case, which causes me to wonder how many other cases are invalid as well, but by some happy coincidence, usually work. If this is the case, Nokogiri is fundamentally unsound.

@brixen
brixen commented Feb 21, 2014

Also, I have all those logs if they are useful, but they are really big (each ~300mb gzip'd).

@brixen brixen added a commit to rubinius/rubinius that referenced this issue Feb 22, 2014
@brixen brixen Added a C-API lock for Nokogiri.
This makes whatever the issue is in

sparklemotion/nokogiri#1047
#2844

to not manifest during more than 1 hour of 100 threads running in
the repro script in the Nokogiri ticket.

If a bug never appears, does it exist?
5f2e61b
@brixen
brixen commented Feb 22, 2014

There are two scenarios that seem plausible.

Rubinius has a concurrent mark thread for the mature generation. It's possible that Nokogiri calls a libxml2 function that releases a tree, ultimately calling free on a node that the mark thread reaches after libxml2 calls free, but before Nokogiri finishes making the Ruby Data objects unreachable.

The other scenario is that Nokogiri is not thread safe in some respect.

Or it could be a combination of the two. Note that here, free is called 4 times on an address with no intervening allocs, and here, it appears alloc is returning the same address twice with no intervening free, then free is called twice. The latter could be the result of interleaving of output, but the former is hard to explain.

Running Rubinius with the -Xgc.immix.concurrent=no, the issue does not manifest after more than 30 min of processing.

Ultimately, I added a C-API lock specifically for Nokogiri. This also appears to prevent the issue from manifesting in over 60 min of processing.

@headius
Contributor
headius commented Feb 22, 2014

VMs like the JVM that have concurrently-executable garbage collectors generally do not allow those collectors to run concurrently with C calls that may be accessing managed objects. In fact, under most circumstances JVMs do not allow GC to run at all while unmanaged code is executing, unless you know the proper VM-specific incantations to allow that to happen.

If Rubinius is actually running GC concurrent to unmanaged code, this will not be the only issue to come up. Part of the implicit contract of MRI's C API is that unless GC is explicitly invoked, objects in hand will live until execution returns to the VM. The JVM's JNI API makes the same guarantee for the same reasons: you can't GC objects that native code might still be using.

I would recommend not allowing Rubinius's GC to run during the lifetime of a C extension downcall for all C extensions.

@brixen
brixen commented Feb 23, 2014

Rubinius was not collecting an object. It was marking it, per the mark function set on the Data object.

I'd argue it's a bug to wrap a structure that is being free'd independent of the available free function on the Data object.

@headius
Contributor
headius commented Feb 23, 2014

@brixen Perhaps I'm confused about your GC... isn't marking part of the GC cycle? Or is there some other "marking" going on that's unrelated to GC?

@headius
Contributor
headius commented Feb 23, 2014

Perhaps I can clarify things a bit. What I meant was that within a given C downcall, Rubinius should probably guarantee that no concurrent interaction of GC mechanisms (marking, sweeping, compacting) can occur while that C call is active.

Further, the MRI C API guarantees (implicitly, I'll grant) that VM-level mechanisms like GC will not interact with any objects (excessive, perhaps, but illlustrative when you are attempting to emulate that API) while a C extension is executing. I think Rubinius will continue to run into problems with C extensions as long as it allows VM-level processes to run concurrently with C extension downcalls.

@headius
Contributor
headius commented Sep 16, 2014

Was the Rubinius team able to sort out exactly what's causing this? I have heard vague claims that "libml isn't thread-safe" or "nokogiri isn't thread-safe" but nobody seems to have recorded exactly what makes them thread-safe. That would be very useful information to give to either the Nokogiri team or the libxml maintainers.

If this is fixable in Nokogiri, it should be fixed. Need more information on what to fix.

If it's not fixable in Nokogiri but exposes a problem in libxml, the libxml maintainers should be notified and this bug should be closed (and libxml should be fixed, but that's somewhat out of our control).

If this is not fixable in Nokogiri or libxml, then it would indicate something's wrong in Rubinius, and this bug should be closed.

@headius
Contributor
headius commented Sep 16, 2014

I would like some clarification about @brixen's "same result" comment above. Specifically:

A pointer is allocated
The pointer is wrapped by Nokogiri
The pointer is marked

What does this mean? A pointer is a memory address. Do you mean the object that wraps the pointer? And marked by whom?

The pointer is free'd, with every indication being it is free'd by libxml2

What are these indications? Do you have evidence to support this claim?

The Ruby object wrapping the pointer is still reachable, and an attempt to mark the pointer causes a SEGV

Again I'm unclear what you mean by "mark the pointer". If the pointer is allocated by libxml, then marking does not apply; it's a pointer to opaque memory libxml2 controls. I think you mean that the pointer gets freed independently of the DATA-related free function associated with the object? And you believe it's libxml doing it?

@headius
Contributor
headius commented Sep 16, 2014

I've tried to consume as much of this bug as possible, and I'm still confused why you think that the problem is in nokogiri or libxml.

In #1047 (comment) @brixen said that adding a lock around nokogiri native calls and turning off the concurrent GC fixes the problem. That could indicate a concurrency issue in Nokogiri or libxml, true. But it could also point at Rubinius's concurrent GC or a problem with the test (especially if the threads running in Rubinius are not using isolated objects from Nokogiri).

What's the status? Do the changes that fixed the issue for @brixen fix it in general?

@flavorjones
Member

I've been running the nonogiri.rb script linked to above for the better part of the day on my 4-core laptop and haven't been able to reproduce it yet.

I've tried:

  • Rubinius 2.4.1 / Nokogiri 1.6.2.1 / libxml 2.8.0 with patches
  • Rubinius 2.4.1 / Nokogiri 1.6.5 / libxml 2.9.2 with patches

Any suggestions on how I can more reliably reproduce it? I've tried increasing and decreasing the number of threads; what else should I be trying?

Possibly-notably, I don't seem to be able to saturate my CPUs. Not having much experience using Rubinius, I wonder if I'm missing something obvious?

Any help or advice would be greatly appreciated. Thanks.

-m

@brixen
brixen commented Jan 10, 2015

@flavorjones I added a C-API lock specifically for Nokogiri, so you'll get no parallelization in Nokogiri code when running threads on Rubinius https://github.com/rubinius/rubinius/blob/a7a6cb0052e3a3c19a3559f43cb1eb973251bfb5/vm/shared_state.cpp#L432

Perhaps @YorickPeterse has a better repro.

@YorickPeterse

@brixen I tried running the script with and without the lock, neither crashed for me on Rbx master. I'll try with my actual apps in the coming week to see what happens there.

@brixen
brixen commented Jan 10, 2015

@YorickPeterse when you say, "with and without the lock", what do you mean? Nokogiri gets its own artisanal, free-range, environmentally friendly special lock now. How are you running, "without the lock"?

@YorickPeterse

@brixen As in, I removed the CAPI_BLACK_LIST(nokogiri); bit then recompiled Rubinius from source and ran the script.

@brixen
brixen commented Jan 10, 2015

@YorickPeterse ah, that is good to know!

@flavorjones
Member

@YorickPeterse, any luck seeing this happen in the wild in the last 10 days?

@flavorjones flavorjones self-assigned this Jan 20, 2015
@YorickPeterse

@flavorjones Sorry, I haven't gotten to it yet. We're right in the middle of a big database migration this week, I'll try to take a look at it this Friday/next week.

@flavorjones
Member

@YorickPeterse - just bumping to keep this on your radar. Would love more info if you have it.

@YorickPeterse

I tried my app again and it did crash once with a backtrace similar to the ones discussed above. Sadly I didn't set up logging of crashes properly making copy-pasting the error a total nightmare. I'm currently running the application under GDB but of course now it refuses to crash.

@sodabrew sodabrew referenced this issue in rubinius/rubinius Jul 14, 2015
Open

mysql2 related crashes when running specs #2662

@flavorjones
Member

Running modern nokogiri under rbx-3, these crashes don't seem to occur (at least not in CI). I'd like to propose closing this, unless there's new and compelling information that I can use to investigate.

@flavorjones
Member

Closing. If anyone comes across this later, rbx3 builds with both system and vendored libraries are here, and look green:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment