New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation faults and memory corruption from using nokogiri with libxml-ruby #1426

Closed
bbergstrom opened this Issue Feb 16, 2016 · 19 comments

Comments

Projects
None yet
5 participants
@bbergstrom
Copy link

bbergstrom commented Feb 16, 2016

We have been experiencing memory corruption in our Rails application which depends on nokogiri (1.6.7.2) and libxml-ruby (2.8.0). This memory corruption manifested itself in seemingly random segmentation faults with stack traces to nearly every part of code in our application and its dependencies.

After upgrading or removing nearly every gem with a C extension we were able to verify it would go away when we removed a gem that dependent on libxml-ruby. Upon investigation into libxml-ruby segmentation faults we came across similar issues of #895 and #881 and #1364 . That issue was patched some versions ago, but it appears that a similar issue still exists.

I am able to reproduce on Amazon Linux (RHEL/CentOS based distro) but not on OSX with this script 3/4 of the time it executes.

$ cat nokogiri-libxml-segfault.rb 
#!/usr/bin/env ruby
gem 'nokogiri', '1.6.7.2'
gem 'libxml-ruby', '2.8.0'
require 'nokogiri'
require 'libxml'

message = "<h2>BOOM!</h2>"
20_000.times do
  node = Nokogiri::HTML::DocumentFragment.parse(message)
  node.add_previous_sibling(Nokogiri::XML::Text.new(' ', node.document))
  node.add_next_sibling(Nokogiri::XML::Text.new(' ', node.document))
end

Here is a sample of the stack traces that result.

$ ./nokogiri-libxml-segfault.rb
/usr/local/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/html/document_fragment.rb:10: [BUG] Segmentation fault at 0x00000000000070
ruby 2.2.4p230 (2015-12-16 revision 53155) [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0006 p:---- s:0021 e:000020 CFUNC  :encoding=
c:0005 p:0062 s:0017 e:000016 METHOD /usr/local/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/html/document_fragment.rb:10
c:0004 p:0019 s:0011 e:000010 BLOCK  ./nokogiri-libxml-segfault.rb:12 [FINISH]
c:0003 p:---- s:0008 e:000007 CFUNC  :times
c:0002 p:0048 s:0005 E:002568 EVAL   ./nokogiri-libxml-segfault.rb:11 [FINISH]
c:0001 p:0000 s:0002 E:001ba0 TOP    [FINISH]

-- Ruby level backtrace information ----------------------------------------
./nokogiri-libxml-segfault.rb:11:in `<main>'
./nokogiri-libxml-segfault.rb:11:in `times'
./nokogiri-libxml-segfault.rb:12:in `block in <main>'
/usr/local/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/html/document_fragment.rb:10:in `parse'
/usr/local/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/html/document_fragment.rb:10:in `encoding='
$ ./nokogiri-libxml-segfault.rb
/usr/local/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node.rb:107: [BUG] Segmentation fault at 0x00000000000040
ruby 2.2.4p230 (2015-12-16 revision 53155) [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0009 p:---- s:0033 e:000032 CFUNC  :document
c:0008 p:0007 s:0030 e:000029 METHOD /usr/local/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node.rb:107 [FINISH]
c:0007 p:---- s:0027 e:000026 CFUNC  :add_next_sibling_node
c:0006 p:0142 s:0023 e:000022 METHOD /usr/local/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node.rb:780
c:0005 p:0048 s:0015 e:000014 METHOD /usr/local/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node.rb:197
c:0004 p:0072 s:0011 e:000010 BLOCK  ./nokogiri-libxml-segfault.rb:14 [FINISH]
c:0003 p:---- s:0008 e:000007 CFUNC  :times
c:0002 p:0048 s:0005 E:000b78 EVAL   ./nokogiri-libxml-segfault.rb:11 [FINISH]
c:0001 p:0000 s:0002 E:000160 TOP    [FINISH]

-- Ruby level backtrace information ----------------------------------------
./nokogiri-libxml-segfault.rb:11:in `<main>'
./nokogiri-libxml-segfault.rb:11:in `times'
./nokogiri-libxml-segfault.rb:14:in `block in <main>'
/usr/local/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node.rb:197:in `add_next_sibling'
/usr/local/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node.rb:780:in `add_sibling'
/usr/local/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node.rb:780:in `add_next_sibling_node'
/usr/local/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node.rb:107:in `decorate!'
/usr/local/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node.rb:107:in `document'
$ ./nokogiri-libxml-segfault.rb
/usr/local/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node.rb:780: [BUG] Segmentation fault at 0x00000000000040
ruby 2.2.4p230 (2015-12-16 revision 53155) [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0007 p:---- s:0027 e:000026 CFUNC  :add_next_sibling_node
c:0006 p:0142 s:0023 e:000022 METHOD /usr/local/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node.rb:780
c:0005 p:0048 s:0015 e:000014 METHOD /usr/local/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node.rb:197
c:0004 p:0072 s:0011 e:000010 BLOCK  ./nokogiri-libxml-segfault.rb:14 [FINISH]
c:0003 p:---- s:0008 e:000007 CFUNC  :times
c:0002 p:0048 s:0005 E:0014c8 EVAL   ./nokogiri-libxml-segfault.rb:11 [FINISH]
c:0001 p:0000 s:0002 E:0009c0 TOP    [FINISH]

-- Ruby level backtrace information ----------------------------------------
./nokogiri-libxml-segfault.rb:11:in `<main>'
./nokogiri-libxml-segfault.rb:11:in `times'
./nokogiri-libxml-segfault.rb:14:in `block in <main>'
/usr/local/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node.rb:197:in `add_next_sibling'
/usr/local/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node.rb:780:in `add_sibling'
/usr/local/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node.rb:780:in `add_next_sibling_node'

We hope to eventually remove our dependency on libxml-ruby as we use nokogiri in our codebase, but a required dependency currently forces libxml-ruby into our project as well. A patch would be great for compatibility and for anyone else that may encounter this convoluted issue. We had to spend a lot of time troubleshooting this issue as none of the segmentation faults that happened in our systems pointed to nokogiri or libxml-ruby.

TIA

@bbergstrom

This comment has been minimized.

Copy link

bbergstrom commented Feb 16, 2016

Related libxml-ruby issue: xml4r/libxml-ruby#116

@flavorjones

This comment has been minimized.

Copy link
Member

flavorjones commented Feb 16, 2016

Trying right now to reproduce given your information above.

@flavorjones

This comment has been minimized.

Copy link
Member

flavorjones commented Feb 16, 2016

Aha, I'm able to reproduce this. Thanks so much for putting in the time to isolate this with a test case. As you can probably see from some of the other tickets, reproduction isn't always easy.

I'll try to diagnose root causes; but no promises. libxml-ruby uses a very different memory management model within libxml2, and some things may just not be possible to work around.

Will let you know what I find.

@flavorjones

This comment has been minimized.

Copy link
Member

flavorjones commented Feb 16, 2016

Possibly related to #1270

@flavorjones

This comment has been minimized.

Copy link
Member

flavorjones commented Feb 16, 2016

valgrind says:

==29039== Memcheck, a memory error detector
==29039== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==29039== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==29039== Command: ruby ./foo.rb
==29039== 
==29039== Invalid read of size 8
==29039==    at 0x7938AC2: new (xml_xpath_context.c:279)
==29039==    by 0x5014805: vm_call_cfunc_with_frame (vm_insnhelper.c:1638)
==29039==    by 0x5014805: vm_call_cfunc (vm_insnhelper.c:1733)
==29039==    by 0x501F518: vm_exec_core (insns.def:995)
==29039==    by 0x5023090: vm_exec (vm.c:1645)
==29039==    by 0x50245AE: invoke_block_from_c_0 (vm.c:966)
==29039==    by 0x50245AE: invoke_block_from_c_splattable (vm.c:983)
==29039==    by 0x50245AE: vm_yield (vm.c:1018)
==29039==    by 0x50245AE: rb_yield_0 (vm_eval.c:1008)
==29039==    by 0x50245AE: rb_yield_1 (vm_eval.c:1014)
==29039==    by 0x50245AE: rb_yield (vm_eval.c:1024)
==29039==    by 0x4E6F19C: rb_ary_collect (array.c:2733)
==29039==    by 0x5014805: vm_call_cfunc_with_frame (vm_insnhelper.c:1638)
==29039==    by 0x5014805: vm_call_cfunc (vm_insnhelper.c:1733)
==29039==    by 0x501F445: vm_exec_core (insns.def:964)
==29039==    by 0x5023090: vm_exec (vm.c:1645)
==29039==    by 0x5026453: vm_call0_body.constprop.116 (vm_eval.c:182)
==29039==    by 0x502702D: vm_call0 (vm_eval.c:61)
==29039==    by 0x502702D: rb_call0 (vm_eval.c:351)
==29039==    by 0x793651B: new (xml_document_fragment.c:26)
==29039==    by 0x5014805: vm_call_cfunc_with_frame (vm_insnhelper.c:1638)
==29039==    by 0x5014805: vm_call_cfunc (vm_insnhelper.c:1733)
==29039==    by 0x501F518: vm_exec_core (insns.def:995)
==29039==    by 0x5023090: vm_exec (vm.c:1645)
==29039==    by 0x5024776: invoke_block_from_c_0 (vm.c:966)
==29039==    by 0x5024776: invoke_block_from_c_splattable (vm.c:983)
==29039==    by 0x5024776: vm_yield (vm.c:1018)
==29039==    by 0x5024776: rb_yield_0 (vm_eval.c:1008)
==29039==    by 0x5024776: rb_yield_1 (vm_eval.c:1014)
==29039==    by 0x4F1373D: int_dotimes (numeric.c:3953)
==29039==    by 0x5014805: vm_call_cfunc_with_frame (vm_insnhelper.c:1638)
==29039==    by 0x5014805: vm_call_cfunc (vm_insnhelper.c:1733)
==29039==    by 0x502BA62: vm_call_method (vm_insnhelper.c:2172)
==29039==    by 0x501F445: vm_exec_core (insns.def:964)
==29039==    by 0x5023090: vm_exec (vm.c:1645)
==29039==    by 0x4EBC5B3: ruby_exec_internal (eval.c:244)
==29039==    by 0x4EBE4AC: ruby_exec_node (eval.c:309)
==29039==    by 0x4EC0EDB: ruby_run_node (eval.c:301)
==29039==    by 0x40089A: main (main.c:36)
==29039==  Address 0x40 is not stack'd, malloc'd or (recently) free'd
==29039== 
/home/flavorjones/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/searchable.rb:157: [BUG] Segmentation fault at 0x00000000000040
@flavorjones

This comment has been minimized.

Copy link
Member

flavorjones commented Feb 16, 2016

Alternatively, finds this:

==29128== Invalid read of size 8
==29128==    at 0x793EA33: document (xml_node.c:310)
==29128==    by 0x5014805: vm_call_cfunc_with_frame (vm_insnhelper.c:1638)
==29128==    by 0x5014805: vm_call_cfunc (vm_insnhelper.c:1733)
==29128==    by 0x501F518: vm_exec_core (insns.def:995)
==29128==    by 0x5023090: vm_exec (vm.c:1645)
==29128==    by 0x5024776: invoke_block_from_c_0 (vm.c:966)
==29128==    by 0x5024776: invoke_block_from_c_splattable (vm.c:983)
==29128==    by 0x5024776: vm_yield (vm.c:1018)
==29128==    by 0x5024776: rb_yield_0 (vm_eval.c:1008)
==29128==    by 0x5024776: rb_yield_1 (vm_eval.c:1014)
==29128==    by 0x4F1373D: int_dotimes (numeric.c:3953)
==29128==    by 0x5014805: vm_call_cfunc_with_frame (vm_insnhelper.c:1638)
==29128==    by 0x5014805: vm_call_cfunc (vm_insnhelper.c:1733)
==29128==    by 0x502BA62: vm_call_method (vm_insnhelper.c:2172)
==29128==    by 0x501F445: vm_exec_core (insns.def:964)
==29128==    by 0x5023090: vm_exec (vm.c:1645)
==29128==    by 0x4EBC5B3: ruby_exec_internal (eval.c:244)
==29128==    by 0x4EBE4AC: ruby_exec_node (eval.c:309)
==29128==    by 0x4EC0EDB: ruby_run_node (eval.c:301)
==29128==    by 0x40089A: main (main.c:36)
==29128==  Address 0x40 is not stack'd, malloc'd or (recently) free'd
==29128== 
./foo.rb:13: [BUG] Segmentation fault at 0x00000000000040
ruby 2.3.0p0 (2015-12-25 revision 53290) [x86_64-linux]
@flavorjones

This comment has been minimized.

Copy link
Member

flavorjones commented Feb 16, 2016

and another, this time with Nokogiri 1.6.8.rc2:

==17511== Invalid read of size 4
==17511==    at 0xCA7C171: reify_node_set_namespaces (xml_node_set.c:390)
==17511==    by 0xCA7C171: Nokogiri_wrap_xml_node_set (xml_node_set.c:414)
==17511==    by 0x5014805: vm_call_cfunc_with_frame (vm_insnhelper.c:1638)
==17511==    by 0x5014805: vm_call_cfunc (vm_insnhelper.c:1733)
==17511==    by 0x501F518: vm_exec_core (insns.def:995)
==17511==    by 0x5023090: vm_exec (vm.c:1645)
==17511==    by 0x5026453: vm_call0_body.constprop.116 (vm_eval.c:182)
==17511==    by 0x502702D: vm_call0 (vm_eval.c:61)
==17511==    by 0x502702D: rb_call0 (vm_eval.c:351)
==17511==    by 0xCA7465B: new (xml_document_fragment.c:26)
==17511==    by 0x5014805: vm_call_cfunc_with_frame (vm_insnhelper.c:1638)
==17511==    by 0x5014805: vm_call_cfunc (vm_insnhelper.c:1733)
==17511==    by 0x501F518: vm_exec_core (insns.def:995)
==17511==    by 0x5023090: vm_exec (vm.c:1645)
==17511==    by 0x5024776: invoke_block_from_c_0 (vm.c:966)
==17511==    by 0x5024776: invoke_block_from_c_splattable (vm.c:983)
==17511==    by 0x5024776: vm_yield (vm.c:1018)
==17511==    by 0x5024776: rb_yield_0 (vm_eval.c:1008)
==17511==    by 0x5024776: rb_yield_1 (vm_eval.c:1014)
==17511==    by 0x4F1373D: int_dotimes (numeric.c:3953)
==17511==    by 0x5014805: vm_call_cfunc_with_frame (vm_insnhelper.c:1638)
==17511==    by 0x5014805: vm_call_cfunc (vm_insnhelper.c:1733)
==17511==    by 0x502BA62: vm_call_method (vm_insnhelper.c:2172)
==17511==    by 0x501F445: vm_exec_core (insns.def:964)
==17511==    by 0x5023090: vm_exec (vm.c:1645)
==17511==    by 0x4EBC5B3: ruby_exec_internal (eval.c:244)
==17511==    by 0x4EBE4AC: ruby_exec_node (eval.c:309)
==17511==    by 0x4EC0EDB: ruby_run_node (eval.c:301)
==17511==    by 0x40089A: main (main.c:36)
==17511==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
@flavorjones

This comment has been minimized.

Copy link
Member

flavorjones commented Feb 16, 2016

Perhaps interesting is the fact that I can't reproduce this segfault with Ruby 1.9.3, but can easily reproduce with Ruby 2.0, 2.1, 2.2, and 2.3.

@flavorjones

This comment has been minimized.

Copy link
Member

flavorjones commented Feb 16, 2016

Hmm. This definitely has to do with the fact that libxml-ruby registers a global callback, invoked on all nodes (even nokogiri's) when the node is GCed.

Still trying to nail down the exact circumstances when memory gets clobbered.

@flavorjones

This comment has been minimized.

Copy link
Member

flavorjones commented Feb 16, 2016

I want to repeat that above statement, in case anyone missed the importance of it: libxml-ruby registers a global callback. That is, libxml-ruby makes the assumption that no other users of libxml2 exist in a process, which is IMHO bad behavior.

Historically, Nokogiri has tried to work around this behavior, with mixed success (dating back to as early as 2009 -- see #33). We're doing our best, even today, but I want to make it clear that we're attempting to hack around another library's bad behavior.

I wonder if the libxml-ruby maintainers care about this? A quick peek at the commit log doesn't reveal any indications that they've ever tried to work with Nokogiri.

But like I said, still looking at this with the hope of once again working around libxml-ruby's global callback behavior.

@bbergstrom

This comment has been minimized.

Copy link

bbergstrom commented Feb 16, 2016

Glad you were able to reproduce the issue as well. Your steadfast positivity deserves a 🏆 in the face of recurring one-sided incompatibilities. Ideally, we wouldn't use libxml-ruby, but in pragmatic practice you can only control your transitive dependencies so much.

flavorjones added a commit that referenced this issue Feb 17, 2016

add test coverage for libxml-ruby conflicts
related to #1426 which has a nice test case I'd like to keep as part of
our suite.
@flavorjones

This comment has been minimized.

Copy link
Member

flavorjones commented Feb 17, 2016

@bbergstrom I'm going to cut a release candidate for you to try out. Will update when it's available, sometime tonight.

@flavorjones

This comment has been minimized.

Copy link
Member

flavorjones commented Feb 17, 2016

v1.6.8.rc3 is up on rubygems.org. Let me know how you get on.

@bbergstrom

This comment has been minimized.

Copy link

bbergstrom commented Feb 17, 2016

Great! Trying it out today and will report back.

@bbergstrom

This comment has been minimized.

Copy link

bbergstrom commented Feb 17, 2016

We tried out 1.6.8.rc3 on our Ruby 2.2 Rails application for about 5 hours and have seen a 99% reduction in segfaults. :D

The 2 that happened have similar segfault traces that we were seeing before, but it is hard to say with any certainty if they at all related. I think you fixed the main cause of the issue we were experiencing. If there are any further issues, I will file new tickets for those.

Thanks again for your quick work on this issue.

@flavorjones

This comment has been minimized.

Copy link
Member

flavorjones commented Feb 17, 2016

Great to hear.

If you can figure out a reproduction test case for the remaining segfaults,
please do open a new issue. It was great having your script to quickly
analyze and debug, and hopefully we've built a little trust here. ;)

On Wed, Feb 17, 2016 at 4:33 PM, Brian Bergstrom notifications@github.com
wrote:

We tried out 1.6.8.rc3 on our Ruby 2.2 Rails application for about 5 hours
and have seen a 99% reduction in segfaults. :D

The 2 that happened have similar segfault traces that we were seeing
before, but it is hard to say with any certainty if they at all related. I
think you fixed the main cause of the issue we were experiencing. If there
are any further issues, I will file new tickets for those.

Thanks again for your quick work on this issue.


Reply to this email directly or view it on GitHub
#1426 (comment)
.

@stoivo

This comment has been minimized.

Copy link

stoivo commented Apr 11, 2016

I had the same issue but I went through a journey to find solution.

It started with our Passenger worker jamming and not responding. After a lot of logging and testing we discovered that Passenger jams while we are parsing XML and returns 502 status code and leaves a segmentation fault in logs. We contacted Passenger Development Team, and with their help we concluded that libxml-ruby was the issue. We rewrote our code only to use nokogiri and now it works.

Special thanks to Passenger Development Team.

Note: Wrote this to make it easier for other people with similar issue to find this solution faster.

Keywords:

libxml-ruby, nokogiri, passenger worker jams, xml parsing

@johnsyweb

This comment has been minimized.

Copy link

johnsyweb commented May 18, 2016

We have had the same issue. Nokogiri 1.6.8.rc3 (Ruby 2.1.8) halved the number of segfaults. xml4r/libxml-ruby#118 has (finally) erradicated the segfaults.

Keywords:

  • ubuntu
  • libxml-ruby
  • nokogiri
  • rails
  • unicorn
  • segmentation fault
  • random
  • heap corruption
@cfis

This comment has been minimized.

Copy link

cfis commented May 19, 2016

Nice work all. What's the global handler on the libxml side causing issues? Happy to look at removing it - or is that already done in xml4r/libxml-ruby#118?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment