Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SAX::Parser errors when it encounters non-predefined entities. #1926

Open
searls opened this issue Sep 22, 2019 · 10 comments
Open

SAX::Parser errors when it encounters non-predefined entities. #1926

searls opened this issue Sep 22, 2019 · 10 comments

Comments

@searls
Copy link

searls commented Sep 22, 2019

Describe the bug

When an XML document contains non-predefined entities—even if the document defines those entities up-front—it will error when parsing with nokogiri's SAX parser.

Note that this warning from libxml2's docs seem to hint that getting this right is hard:

WARNING: handling entities on top of the libxml2 SAX interface is difficult!!!* If you plan to use non-predefined entities in your documents, then the learning curve to handle then using the SAX API may be long. If you plan to use complex documents, I strongly suggest you consider using the DOM interface instead and let libxml deal with the complexity rather than trying to do it yourself.

To Reproduce

#! /usr/bin/env ruby

xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE Stuff [
  <!ELEMENT stuff (#PCDATA)>
  <!ENTITY THING "a thing">
  ]>
  <stuff>&THING;</stuff>
XML

require "nokogiri"
require "pp"

puts "----> parsing with DOM parser"
doc = Nokogiri::XML.parse(xml)
pp doc

puts "----> parsing with SAX parser"
class StuffDoc < Nokogiri::XML::SAX::Document
  def error(s)
    raise s
  end
end

Nokogiri::XML::SAX::Parser.new(StuffDoc.new).parse(xml)

When run, this will output:

----> parsing with DOM parser
#(Document:0x3fd9cdca51ac {
  name = "document",
  children = [
    #(DTD:0x3fd9cdca96a8 {
      name = "Stuff",
      children = [
        #(ElementDecl:0x3fd9cdca862c { name = "stuff" }),
        #(EntityDecl:0x3fd9cdcad8ac {
          name = "THING",
          children = [ #(Text "a thing")]
          })]
      }),
    #(Element:0x3fd9cdcac86c {
      name = "stuff",
      children = [ #(EntityReference:0x3fd9cdcb1f9c { name = "THING" })]
      })]
  })
----> parsing with SAX parser
Traceback (most recent call last):
	4: from demo.rb:24:in `<main>'
	3: from /Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/lib/nokogiri/xml/sax/parser.rb:83:in `parse'
	2: from /Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/lib/nokogiri/xml/sax/parser.rb:110:in `parse_memory'
	1: from /Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/lib/nokogiri/xml/sax/parser.rb:110:in `parse_with'
demo.rb:20:in `error': Entity 'THING' not defined (RuntimeError)

Expected behavior

I honestly just don't want this to explode. I'd prefer to get a literal string of the entity (e.g. "&THING;" in this case.

Environment

# Nokogiri (1.10.4)
    ---
    warnings: []
    nokogiri: 1.10.4
    ruby:
      version: 2.6.3
      platform: x86_64-darwin18
      description: ruby 2.6.3p62 (2019-04-16 revision 67580) [x86_64-darwin18]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/ports/x86_64-apple-darwin18.6.0/libxml2/2.9.9"
      libxslt_path: "/Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/ports/x86_64-apple-darwin18.6.0/libxslt/1.1.33"
      libxml2_patches:
      - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
      - 0002-Remove-script-macro-support.patch
      - 0003-Update-entities-to-remove-handling-of-ssi.patch
      libxslt_patches:
      - 0001-Fix-security-framework-bypass.patch
      compiled: 2.9.9
      loaded: 2.9.9

Additional context

This is a real problem for one important document, the JMDict XML file, which is a daily export of the most prominent community-maintained Japanese-English dictionary on the Internet. JMDict uses dozens of custom entities for tagging entries with various metadata. However, because the file is over 100MB, it's more appropriate for SAX parsing, which is how folks might run into this problem. (One example)

@flavorjones
Copy link
Member

Hey @searls, thanks for the clear bug report. I wanted to acknowledge that I saw this and let you know I will likely not be able to dig into it immediately, but I agree with your take that the described behavior is problematic.

@tenderlove
Copy link
Member

I did a quick look in to this. I knew I had looked in to this at one point, and now the code is jogging my memory. The SAX parser will actually build a DOM object in order to support entity substitution. I'm not sure if it actually builds the full DOM, but it definitely keeps the entity references on the document structs. In order to support this we need to make sure the myDoc part of the struct is filled out.

I thought the easiest approach would be to initialize the SAX parser with all the default callbacks, but that doesn't seem to work because the context pointer we're using is a nokogiriSAXTuple * where I think all the default callbacks are expecting a xmlParserCtxtPtr.

I think there are two approaches we could take to fix this:

  1. Embed xmlParserCtxt at the head of our nokogiriSAXTuple struct
  2. Implement all the default SAX callbacks and unwrap the nokogiriSAXTuple

I'm not sure if the first approach is possible, and the second approach sounds like a bigger patch.

@tenderlove
Copy link
Member

I forgot to mention, there is a callback for when an entity is encountered. But the huge bummer is that the callback is expected to return a xmlEntityPtr. We can't simply return NULL, otherwise you'll get the same error (but with a callback executed!).

With this patch:

diff --git a/ext/nokogiri/xml_sax_parser.c b/ext/nokogiri/xml_sax_parser.c
index 1a5f6c5f..7cc25524 100644
--- a/ext/nokogiri/xml_sax_parser.c
+++ b/ext/nokogiri/xml_sax_parser.c
@@ -7,7 +7,7 @@ static ID id_start_document, id_end_document, id_start_element, id_end_element;
 static ID id_start_element_namespace, id_end_element_namespace;
 static ID id_comment, id_characters, id_xmldecl, id_error, id_warning;
 static ID id_cdata_block, id_cAttribute;
-static ID id_processing_instruction;
+static ID id_processing_instruction, id_get_entity;
 
 static void start_document(void * ctx)
 {
@@ -251,6 +251,21 @@ static void processing_instruction(void * ctx, const xmlChar * name, const xmlCh
   );
 }
 
+static xmlEntityPtr get_entity(void * ctx, const xmlChar *name)
+{
+  VALUE rb_content;
+  VALUE self = NOKOGIRI_SAX_SELF(ctx);
+  VALUE doc = rb_iv_get(self, "@document");
+
+  rb_funcall( doc,
+              id_get_entity,
+              1,
+              NOKOGIRI_STR_NEW2(name)
+  );
+
+  return NULL;
+}
+
 static void deallocate(xmlSAXHandlerPtr handler)
 {
   NOKOGIRI_DEBUG_START(handler);
@@ -276,6 +291,7 @@ static VALUE allocate(VALUE klass)
   handler->error = error_func;
   handler->cdataBlock = cdata_block;
   handler->processingInstruction = processing_instruction;
+  handler->getEntity = get_entity;
   handler->initialized = XML_SAX2_MAGIC;
 
   return Data_Wrap_Struct(klass, NULL, deallocate, handler);
@@ -303,6 +319,7 @@ void init_xml_sax_parser()
   id_error          = rb_intern("error");
   id_warning        = rb_intern("warning");
   id_cdata_block    = rb_intern("cdata_block");
+  id_get_entity     = rb_intern("get_entity");
   id_cAttribute     = rb_intern("Attribute");
   id_start_element_namespace = rb_intern("start_element_namespace");
   id_end_element_namespace = rb_intern("end_element_namespace");

And this script:

xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE Stuff [
  <!ELEMENT stuff (#PCDATA)>
  <!ENTITY THING "a thing">
  ]>
  <stuff>&THING;</stuff>
XML

require "nokogiri"
require "pp"

puts "----> parsing with SAX parser"
class StuffDoc < Nokogiri::XML::SAX::Document
  def get_entity name
    p [__method__, name]
  end

  def error(s)
    p [__method__, s]
  end
end

parser = Nokogiri::XML::SAX::Parser.new(StuffDoc.new)
parser.parse(xml)

Output is:

----> parsing with SAX parser
[:get_entity, "THING"]
[:get_entity, "THING"]
[:error, "Entity 'THING' not defined\n"]

@flavorjones
Copy link
Member

@tenderlove Thanks for looking into this. I've got performance concerns about option 2 you outlined above. If you recall the Fairy Wing Throwdown from 2011, I suspected that the poor performance of the SAX parser was due to callbacks. Maybe we should actually benchmark what happens if we implement all the defaults so we know for sure if this is reasonable to do?

@nwellnhof
Copy link

You have to implement the getEntity callback in libxml2's SAX interface. This should return a pointer to an xmlEntity struct which can be created with xmlNewEntity(). Make sure to set the doc argument to NULL, so the document's entity table won't be polluted. etype should be XML_INTERNAL_PREDEFINED_ENTITY for literal string or XML_INTERNAL_GENERAL_ENTITY for XML data to be parsed. To simplify memory management, it should be safe to free the old struct if getEntity is invoked again.

@flavorjones
Copy link
Member

OK, so I have a branch where I've implemented a default getEntity callback that invokes libxml2's xmlSAX2GetEntity to return the correct entity from the document:

diff --git a/ext/nokogiri/xml_sax_parser.c b/ext/nokogiri/xml_sax_parser.c
index 989ad9eb..622e8159 100644
--- a/ext/nokogiri/xml_sax_parser.c
+++ b/ext/nokogiri/xml_sax_parser.c
@@ -265,6 +265,12 @@ processing_instruction(void *ctx, const xmlChar *name, const xmlChar *content)
             );
 }
 
+static xmlEntityPtr
+get_entity(void *ctx, const xmlChar *name)
+{
+  return xmlSAX2GetEntity(NOKOGIRI_SAX_CTXT(ctx), name);
+}
+
 static size_t
 memsize(const void *data)
 {
@@ -300,6 +306,7 @@ allocate(VALUE klass)
   handler->cdataBlock = cdata_block;
   handler->processingInstruction = processing_instruction;
   handler->initialized = XML_SAX2_MAGIC;
+  handler->getEntity = get_entity;
 
   return self;
 }

The good news is that with this change, errors are no longer reported for entities that are properly declared in the DTD.

The less-good news is that the expansion of the entity is passed to the characters callback, meaning that detecting it as an entity as such is not easy. We could add a callback Nokogiri::XML::SAX::Document#get_entity which could be passed the name and the expansion, but there doesn't seem to be a way to prevent libxml2 from invoking the characters callback.

This change would have some implications for the design of eiwa which uses the Eiwa::Tag::Entity class.

@searls what are your thoughts here? Using Nokogiri with above patch, I can make all eiwa tests pass by applying this patch:

diff --git a/lib/eiwa/jmdict/doc.rb b/lib/eiwa/jmdict/doc.rb
index 8d2d4155..eeb91880 100644
--- a/lib/eiwa/jmdict/doc.rb
+++ b/lib/eiwa/jmdict/doc.rb
@@ -62,15 +62,7 @@ def characters(s)
       # end
 
       def error(msg)
-        if (matches = msg.match(/Entity '(\S+)' not defined/))
-          # See: http://github.com/sparklemotion/nokogiri/issues/1926
-          code = matches[1]
-          @current.set_entity(code, ENTITIES[code])
-        elsif msg == "Detected an entity reference loop\n"
-          # Do nothing and hope this does not matter.
-        else
-          raise Eiwa::Error.new("Parsing error: #{msg}")
-        end
+        raise Eiwa::Error.new("Parsing error: #{msg}")
       end
 
       # def cdata_block string
diff --git a/lib/eiwa/jmdict/entities.rb b/lib/eiwa/jmdict/entities.rb
index cf218d60..553e952c 100644
--- a/lib/eiwa/jmdict/entities.rb
+++ b/lib/eiwa/jmdict/entities.rb
@@ -13,7 +13,7 @@ module Jmdict
       "adj-ku" => "`ku' adjective (archaic)",
       "adj-na" => "adjectival nouns or quasi-adjectives (keiyodoshi)",
       "adj-nari" => "archaic/formal form of na-adjective",
-      "adj-no" => "nouns which may take the genitive case particle `no'",
+      "adj-no" => "nouns which may take the genitive case particle 'no'",
       "adj-pn" => "pre-noun adjectival (rentaishi)",
       "adj-shiku" => "`shiku' adjective (archaic)",
       "adj-t" => "`taru' adjective",
@@ -34,7 +34,7 @@ module Jmdict
       "chem" => "chemistry term",
       "chn" => "children's language",
       "col" => "colloquialism",
-      "comp" => "computer terminology",
+      "comp" => "computing",
       "conj" => "conjunction",
       "cop" => "copula",
       "cop-da" => "copula",
@@ -101,6 +101,7 @@ module Jmdict
       "quote" => "quotation",
       "rare" => "rare",
       "rkb" => "Ryuukyuu-ben",
+      "rK" => "rarely used kanji form",
       "sens" => "sensitive",
       "shogi" => "shogi term",
       "sl" => "slang",
diff --git a/lib/eiwa/tag/entity.rb b/lib/eiwa/tag/entity.rb
index 12f4f8c7..a75263a9 100644
--- a/lib/eiwa/tag/entity.rb
+++ b/lib/eiwa/tag/entity.rb
@@ -4,10 +4,13 @@ class Entity < Any
       attr_reader :code, :text
 
       def initialize(code: nil, text: nil)
-        @code = code
         @text = text
       end
 
+      def add_characters(s)
+        @text = s
+      end
+
       def set_entity(code, text)
         @code = code
         @text = text

@nwellnhof
Copy link

IIRC, custom SAX parsers can only work in entity replacement mode (XML_PARSE_NOENT). Without this option, the callback sequence is a bit nonsensical.

@flavorjones
Copy link
Member

@nwellnhof Just to make sure I understand your meaning -- are you saying that there's no way to avoid the characters callback being invoked for entities?

@nwellnhof
Copy link

are you saying that there's no way to avoid the characters callback being invoked for entities?

Yes, but when substituting entities, this should be what you want.

@searls
Copy link
Author

searls commented Mar 19, 2024

@flavorjones this seems about right? As long as I'm able to retrieve the code (i.e. "uk"), I'm happy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants