Please describe the bug
A fragment that is parsed "in context" and contains recoverable errors is silently parsed "out of context".
The code at nokogiri/lib/nokogiri/xml/node.rb (lines 824 to 829 in d852d97):
error_count = document.errors.length
node_set = in_context(contents, options.to_i)
if node_set.empty? and document.errors.length > error_count and options.recover?
  fragment = Nokogiri::HTML::DocumentFragment.parse contents
  node_set = fragment.children
end
is responsible for this behavior, dating back to a 2010 fix for #313.
The root cause is that libxml2 does not pay attention to the "recover" option when parsing fragments in context via xmlParseInNodeContext.
Help us reproduce what you're seeing
#! /usr/bin/env ruby
require 'nokogiri'

context_xml = "<root xmlns:n='https://example.com/foo'></root>"
context_doc = Nokogiri::XML::Document.parse(context_xml)

valid_xml_fragment = "<n:a><b/></n:a>"
invalid_xml_fragment = "<n:a><b></n:a>" # note missing closing tag for `b`

# valid fragment parses fine
context_doc.root.parse(valid_xml_fragment).tap do |fragment|
  fragment.to_xml # => "<n:a>\n <b/>\n</n:a>"
  fragment.first.name # => "a"
  fragment.first.namespace # => #<Nokogiri::XML::Namespace:0x3c prefix="n" href="https://example.com/foo">
end

# invalid fragment parses with errors, cannot recover, and is silently parsed out of context,
# leading to namespaces not being properly referenced
context_doc.root.parse(invalid_xml_fragment).tap do |fragment|
  fragment.to_xml # => "<a>\n <b/>\n</a>"
  fragment.first.name # => "a"
  fragment.first.namespace # => nil
end
run with:
# Nokogiri (1.10.10)
---
warnings: []
nokogiri: 1.10.10
ruby:
  version: 2.7.0
  platform: x86_64-linux
  description: ruby 2.7.0p0 (2019-12-25 revision 647ee6f091) [x86_64-linux]
  engine: ruby
libxml:
  binding: extension
  source: packaged
  libxml2_path: "/home/flavorjones/.rvm/gems/ruby-2.7.0/gems/nokogiri-1.10.10/ports/x86_64-pc-linux-gnu/libxml2/2.9.10"
  libxslt_path: "/home/flavorjones/.rvm/gems/ruby-2.7.0/gems/nokogiri-1.10.10/ports/x86_64-pc-linux-gnu/libxslt/1.1.34"
  libxml2_patches:
  - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
  - 0002-Remove-script-macro-support.patch
  - 0003-Update-entities-to-remove-handling-of-ssi.patch
  - 0004-libxml2.la-is-in-top_builddir.patch
  - 0005-Fix-infinite-loop-in-xmlStringLenDecodeEntities.patch
  libxslt_patches: []
  compiled: 2.9.10
  loaded: 2.9.10
Expected behavior
I think the preferable behavior would be to choose one of:
- add an error noting that we "fell back" and pointing the user to turning off the recover option
- don't recover, but raise a sensible exception
- fix libxml2
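
To make the first two options concrete, here is a rough, self-contained sketch of how the fallback branch could either record a note or raise. It is not Nokogiri's code: `parse_fragment`, `ContextParseError`, `FALLBACK_NOTE`, and the `on_failure` parameter are all hypothetical names, and the two lambdas merely stand in for the in-context and out-of-context parsers.

```ruby
# Hypothetical sketch: none of these names exist in Nokogiri.
class ContextParseError < StandardError; end

FALLBACK_NOTE = "fragment could not be parsed in context; " \
                "reparsed out of context, so inherited namespaces may be lost " \
                "(disable the recover option to avoid this fallback)"

# errors:         mutable list standing in for document.errors
# recover:        stands in for options.recover?
# in_context:     callable standing in for the in-context parse
# out_of_context: callable standing in for the out-of-context fallback parse
# on_failure:     :note appends an error and falls back, :raise raises instead
def parse_fragment(errors:, recover:, in_context:, out_of_context:, on_failure: :note)
  error_count = errors.length
  node_set = in_context.call
  failed = node_set.empty? && errors.length > error_count
  return node_set unless failed && recover

  # Option 2: refuse to fall back and surface the last parse error.
  raise ContextParseError, errors.last.to_s if on_failure == :raise

  # Option 1: fall back as today, but leave a visible trace of it.
  node_set = out_of_context.call
  errors << FALLBACK_NOTE
  node_set
end
```

With `on_failure: :note` the caller still gets the out-of-context nodes, but the error list ends with an entry explaining why the namespace is missing; with `on_failure: :raise` the failure is no longer silent.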
Additional context
The behavior described here was introduced in #313.