[bug] "in context" fragment parsing silently degrades functionality when XML errors are encountered #2092

@flavorjones

Please describe the bug

A fragment that is parsed "in context" and contains recoverable errors is silently parsed "out of context".

The following code in `Nokogiri::XML::Node#parse`

    error_count = document.errors.length
    node_set = in_context(contents, options.to_i)
    if node_set.empty? and document.errors.length > error_count and options.recover?
      fragment = Nokogiri::HTML::DocumentFragment.parse contents
      node_set = fragment.children
    end

is responsible for this behavior, dating back to a 2010 fix for #313.

The root cause is that libxml2 does not pay attention to the "recover" option when parsing fragments in context via xmlParseInNodeContext.

Help us reproduce what you're seeing

#! /usr/bin/env ruby

require 'nokogiri'

context_xml = "<root xmlns:n='https://example.com/foo'></root>"
context_doc = Nokogiri::XML::Document.parse(context_xml)

valid_xml_fragment = "<n:a><b/></n:a>"
invalid_xml_fragment = "<n:a><b></n:a>" # note missing closing tag for `b`

# valid fragment parses fine
context_doc.root.parse(valid_xml_fragment).tap do |fragment|
  fragment.to_xml # => "<n:a>\n  <b/>\n</n:a>"
  fragment.first.name # => "a"
  fragment.first.namespace # => #<Nokogiri::XML::Namespace:0x3c prefix="n" href="https://example.com/foo">
end

# invalid fragment parses with errors, cannot recover, and is silently re-parsed out of context,
# so the `n:` namespace prefix is no longer resolved against the context node
context_doc.root.parse(invalid_xml_fragment).tap do |fragment|
  fragment.to_xml # => "<a>\n  <b/>\n</a>"
  fragment.first.name # => "a"
  fragment.first.namespace # => nil
end

run with:

# Nokogiri (1.10.10)
    ---
    warnings: []
    nokogiri: 1.10.10
    ruby:
      version: 2.7.0
      platform: x86_64-linux
      description: ruby 2.7.0p0 (2019-12-25 revision 647ee6f091) [x86_64-linux]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/home/flavorjones/.rvm/gems/ruby-2.7.0/gems/nokogiri-1.10.10/ports/x86_64-pc-linux-gnu/libxml2/2.9.10"
      libxslt_path: "/home/flavorjones/.rvm/gems/ruby-2.7.0/gems/nokogiri-1.10.10/ports/x86_64-pc-linux-gnu/libxslt/1.1.34"
      libxml2_patches:
      - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
      - 0002-Remove-script-macro-support.patch
      - 0003-Update-entities-to-remove-handling-of-ssi.patch
      - 0004-libxml2.la-is-in-top_builddir.patch
      - 0005-Fix-infinite-loop-in-xmlStringLenDecodeEntities.patch
      libxslt_patches: []
      compiled: 2.9.10
      loaded: 2.9.10

Expected behavior

I think preferable behavior would be to choose one of:

  1. add an error noting that we "fell back" and pointing the user to turn off the recover option
  2. don't recover, but raise a sensible exception
  3. fix libxml2
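
Until one of these lands, a caller can approximate option 2 on their own side. The sketch below is only an illustration, not a proposed patch: `parse_in_context_or_raise` and `InContextParseError` are hypothetical names that do not exist in Nokogiri. It relies on the same signal the snippet above uses, namely that the in-context parse appends to `document.errors`, and it raises whenever that list grows during the call.

```ruby
# Hypothetical error class for this sketch; it is not part of Nokogiri.
class InContextParseError < StandardError
  attr_reader :errors

  def initialize(errors)
    @errors = errors
    super("in-context fragment parse reported #{errors.length} error(s): " \
          "#{errors.map(&:to_s).join('; ')}")
  end
end

# Hypothetical helper: parse `xml` in the context of `node` and raise if the
# parse added anything to the document's error list, so that a silent
# fallback to out-of-context parsing cannot go unnoticed.
#
# `node` is assumed to behave like a Nokogiri::XML::Node: it responds to
# #parse(string) and #document, and the document responds to #errors.
def parse_in_context_or_raise(node, xml)
  errors_before = node.document.errors.length
  node_set = node.parse(xml)

  # Only errors appended during this call count; earlier ones are ignored.
  new_errors = node.document.errors.drop(errors_before)
  raise InContextParseError.new(new_errors) unless new_errors.empty?

  node_set
end

# Usage with the objects from the reproduction script (needs the nokogiri gem):
#   parse_in_context_or_raise(context_doc.root, valid_xml_fragment)
#   parse_in_context_or_raise(context_doc.root, invalid_xml_fragment)
```

With the reproduction script above, the second call would be expected to raise rather than return nodes with a `nil` namespace. The helper is stricter than the fallback condition itself: it raises on any new error, not only when the in-context node set came back empty.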

Additional context

The behavior described here was introduced in #313.
