Description
Describe the bug
When an XML document contains non-predefined entities, even if the document defines those entities up front, Nokogiri's SAX parser reports an error while parsing it.
Note that this warning from libxml2's docs seems to hint that getting this right is hard:
WARNING: handling entities on top of the libxml2 SAX interface is difficult!!! If you plan to use non-predefined entities in your documents, then the learning curve to handle then using the SAX API may be long. If you plan to use complex documents, I strongly suggest you consider using the DOM interface instead and let libxml deal with the complexity rather than trying to do it yourself.
To Reproduce
#! /usr/bin/env ruby
xml = <<~XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Stuff [
<!ELEMENT stuff (#PCDATA)>
<!ENTITY THING "a thing">
]>
<stuff>&THING;</stuff>
XML
require "nokogiri"
require "pp"
puts "----> parsing with DOM parser"
doc = Nokogiri::XML.parse(xml)
pp doc
puts "----> parsing with SAX parser"
class StuffDoc < Nokogiri::XML::SAX::Document
  def error(s)
    raise s
  end
end
Nokogiri::XML::SAX::Parser.new(StuffDoc.new).parse(xml)
When run, this will output:
----> parsing with DOM parser
#(Document:0x3fd9cdca51ac {
  name = "document",
  children = [
    #(DTD:0x3fd9cdca96a8 {
      name = "Stuff",
      children = [
        #(ElementDecl:0x3fd9cdca862c { name = "stuff" }),
        #(EntityDecl:0x3fd9cdcad8ac {
          name = "THING",
          children = [ #(Text "a thing")]
          })]
      }),
    #(Element:0x3fd9cdcac86c {
      name = "stuff",
      children = [ #(EntityReference:0x3fd9cdcb1f9c { name = "THING" })]
      })]
  })
----> parsing with SAX parser
Traceback (most recent call last):
4: from demo.rb:24:in `<main>'
3: from /Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/lib/nokogiri/xml/sax/parser.rb:83:in `parse'
2: from /Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/lib/nokogiri/xml/sax/parser.rb:110:in `parse_memory'
1: from /Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/lib/nokogiri/xml/sax/parser.rb:110:in `parse_with'
demo.rb:20:in `error': Entity 'THING' not defined (RuntimeError)
Expected behavior
I honestly just don't want this to explode. I'd prefer to get a literal string of the entity (e.g. "&THING;" in this case).
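A minimal sketch of one possible workaround that yields this behavior, written in plain Ruby with no Nokogiri calls. The helper `escape_custom_entities` and both constants are hypothetical names, not part of Nokogiri. It assumes that rewriting the input before parsing is acceptable and that no entity declaration value itself contains an entity reference. Each reference to a non-predefined entity is turned into escaped text, so the parser sees ordinary character data instead of an entity it has to resolve.

```ruby
# Hypothetical helper (not part of Nokogiri): rewrites references to
# non-predefined entities so they survive parsing as literal text.
PREDEFINED_ENTITIES = %w[amp lt gt quot apos].freeze

# Matches a general entity reference such as "&THING;". Numeric character
# references ("&#38;", "&#x26;") start with "#" and therefore never match.
ENTITY_REFERENCE = /&([A-Za-z_:][\w.:-]*);/

def escape_custom_entities(xml)
  xml.gsub(ENTITY_REFERENCE) do |reference|
    name = Regexp.last_match(1)
    if PREDEFINED_ENTITIES.include?(name)
      reference                  # leave &amp; &lt; &gt; &quot; &apos; alone
    else
      "&amp;" + name + ";"       # "&THING;" becomes "&amp;THING;"
    end
  end
end
```

With the reproduction script above, this would be used as `Nokogiri::XML::SAX::Parser.new(StuffDoc.new).parse(escape_custom_entities(xml))`. The `characters` callback should then receive the text `&THING;`, possibly split across more than one call. For a very large file, the same substitution would need to be applied to the input in chunks rather than to one in-memory string.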
Environment
# Nokogiri (1.10.4)
---
warnings: []
nokogiri: 1.10.4
ruby:
  version: 2.6.3
  platform: x86_64-darwin18
  description: ruby 2.6.3p62 (2019-04-16 revision 67580) [x86_64-darwin18]
  engine: ruby
libxml:
  binding: extension
  source: packaged
  libxml2_path: "/Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/ports/x86_64-apple-darwin18.6.0/libxml2/2.9.9"
  libxslt_path: "/Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/ports/x86_64-apple-darwin18.6.0/libxslt/1.1.33"
  libxml2_patches:
  - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
  - 0002-Remove-script-macro-support.patch
  - 0003-Update-entities-to-remove-handling-of-ssi.patch
  libxslt_patches:
  - 0001-Fix-security-framework-bypass.patch
  compiled: 2.9.9
  loaded: 2.9.9
Additional context
This is a real problem for one important document, the JMDict XML file, which is a daily export of the most prominent community-maintained Japanese-English dictionary on the Internet. JMDict uses dozens of custom entities for tagging entries with various metadata. However, because the file is over 100MB, it's more appropriate for SAX parsing, which is how folks might run into this problem. (One example)
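For a document that uses entities as metadata tags, a handler that receives a literal entity name may still want its declared expansion. The following is a rough sketch in plain Ruby, not a Nokogiri feature. The helper `entity_table` and the constant `ENTITY_DECLARATION` are hypothetical names. It assumes the declarations are internal and use the simple `<!ENTITY name "value">` form shown in the reproduction script. Parameter entities and external (`SYSTEM`/`PUBLIC`) entities are deliberately not matched.

```ruby
# Hypothetical helper (not part of Nokogiri): collects internal general
# entity declarations of the form <!ENTITY name "value"> into a Hash.
ENTITY_DECLARATION =
  /<!ENTITY\s+([A-Za-z_:][\w.:-]*)\s+(?:"([^"]*)"|'([^']*)')\s*>/

def entity_table(xml)
  table = {}
  xml.scan(ENTITY_DECLARATION) do |name, double_quoted, single_quoted|
    # Exactly one of the two quoted captures is non-nil for each match.
    table[name] = double_quoted || single_quoted
  end
  table
end
```

For the reproduction document, `entity_table(xml)` would map `"THING"` to `"a thing"`. For a file of this size it would be enough to pass only the leading portion that contains the DOCTYPE rather than the whole document.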