SAX::Parser errors when it encounters non-predefined entities. #1926

@searls

Describe the bug

When an XML document contains non-predefined entities, Nokogiri's SAX parser reports an error while parsing it, even if the document's DTD declares those entities up front.

Note that this warning from libxml2's docs seems to hint that getting this right is hard:

WARNING: handling entities on top of the libxml2 SAX interface is difficult!!! If you plan to use non-predefined entities in your documents, then the learning curve to handle then using the SAX API may be long. If you plan to use complex documents, I strongly suggest you consider using the DOM interface instead and let libxml deal with the complexity rather than trying to do it yourself.

To Reproduce

#! /usr/bin/env ruby

xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE Stuff [
  <!ELEMENT stuff (#PCDATA)>
  <!ENTITY THING "a thing">
  ]>
  <stuff>&THING;</stuff>
XML

require "nokogiri"
require "pp"

puts "----> parsing with DOM parser"
doc = Nokogiri::XML.parse(xml)
pp doc

puts "----> parsing with SAX parser"
class StuffDoc < Nokogiri::XML::SAX::Document
  def error(s)
    raise s
  end
end

Nokogiri::XML::SAX::Parser.new(StuffDoc.new).parse(xml)

When run, this will output:

----> parsing with DOM parser
#(Document:0x3fd9cdca51ac {
  name = "document",
  children = [
    #(DTD:0x3fd9cdca96a8 {
      name = "Stuff",
      children = [
        #(ElementDecl:0x3fd9cdca862c { name = "stuff" }),
        #(EntityDecl:0x3fd9cdcad8ac {
          name = "THING",
          children = [ #(Text "a thing")]
          })]
      }),
    #(Element:0x3fd9cdcac86c {
      name = "stuff",
      children = [ #(EntityReference:0x3fd9cdcb1f9c { name = "THING" })]
      })]
  })
----> parsing with SAX parser
Traceback (most recent call last):
	4: from demo.rb:24:in `<main>'
	3: from /Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/lib/nokogiri/xml/sax/parser.rb:83:in `parse'
	2: from /Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/lib/nokogiri/xml/sax/parser.rb:110:in `parse_memory'
	1: from /Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/lib/nokogiri/xml/sax/parser.rb:110:in `parse_with'
demo.rb:20:in `error': Entity 'THING' not defined (RuntimeError)

Expected behavior

I honestly just don't want this to explode. I'd prefer to get a literal string of the entity (e.g. "&THING;" in this case).
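
In the meantime, one possible workaround is to escape custom entity references before the XML reaches the parser, so that they arrive as literal text. The sketch below is only an illustration under some assumptions: `escape_custom_entities` and the two constants are hypothetical names (not part of Nokogiri), the input is an in-memory string, and the document has no CDATA sections or entity declarations that reference other entities (both would be rewritten incorrectly).

```ruby
# Hypothetical helper, not part of Nokogiri: rewrites references to
# non-predefined entities so an XML parser sees them as literal text.
PREDEFINED_ENTITIES = %w[amp lt gt apos quot].freeze

# Matches &NAME; where NAME is not one of the five predefined XML entities.
# Numeric character references such as &#169; are not matched, because "#"
# cannot start a name.
CUSTOM_ENTITY_REF =
  /&(?!(?:#{PREDEFINED_ENTITIES.join('|')});)([A-Za-z_:][\w.\-:]*);/

# Turns "&THING;" into "&amp;THING;" and leaves predefined entities alone.
def escape_custom_entities(xml)
  xml.gsub(CUSTOM_ENTITY_REF) { "&amp;#{Regexp.last_match(1)};" }
end

puts escape_custom_entities("<stuff>&THING; &amp; &lt;</stuff>")
# prints: <stuff>&amp;THING; &amp; &lt;</stuff>
```

Because `&amp;` is a predefined entity, a parser given the escaped string should deliver the text `&THING;` for the element content instead of reporting an undefined entity. For a file as large as the one described under "Additional context", the same substitution would probably need to be applied while streaming rather than on one big string.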

Environment

# Nokogiri (1.10.4)
    ---
    warnings: []
    nokogiri: 1.10.4
    ruby:
      version: 2.6.3
      platform: x86_64-darwin18
      description: ruby 2.6.3p62 (2019-04-16 revision 67580) [x86_64-darwin18]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/ports/x86_64-apple-darwin18.6.0/libxml2/2.9.9"
      libxslt_path: "/Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/ports/x86_64-apple-darwin18.6.0/libxslt/1.1.33"
      libxml2_patches:
      - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
      - 0002-Remove-script-macro-support.patch
      - 0003-Update-entities-to-remove-handling-of-ssi.patch
      libxslt_patches:
      - 0001-Fix-security-framework-bypass.patch
      compiled: 2.9.9
      loaded: 2.9.9

Additional context

This is a real problem for one important document, the JMDict XML file, which is a daily export of the most prominent community-maintained Japanese-English dictionary on the Internet. JMDict uses dozens of custom entities for tagging entries with various metadata. Because the file is over 100MB, SAX parsing is the more appropriate way to read it, which is how folks might run into this problem. (One example)
