SAX::Parser errors when it encounters non-predefined entities. #1926
Comments
Hey @searls, thanks for the clear bug report. I wanted to acknowledge that I saw this and let you know that I likely won't be able to dig into it immediately, but I agree with your take that the described behavior is problematic.
I took a quick look into this. I knew I had dug into it at one point, and the code is jogging my memory. The SAX parser will actually build a DOM object in order to support entity substitution. I'm not sure it builds the full DOM, but it definitely keeps the entity references on the document structs, so we need to make sure those structures actually get populated during a SAX parse. I thought the easiest approach would be to initialize the SAX parser with all the default callbacks, but that doesn't seem to work because the context pointer we're using is our Ruby parser object, not the libxml2 parser context the default callbacks expect. I think there are two approaches we could take to fix this: either find a way to let libxml2's default SAX machinery run alongside our callbacks, or implement all of the default callbacks ourselves.
I'm not sure if the first approach is possible, and the second approach sounds like a bigger patch.
I forgot to mention: there is a callback for when an entity is encountered. But the huge bummer is that the callback is expected to return an xmlEntityPtr. With this patch:

diff --git a/ext/nokogiri/xml_sax_parser.c b/ext/nokogiri/xml_sax_parser.c
index 1a5f6c5f..7cc25524 100644
--- a/ext/nokogiri/xml_sax_parser.c
+++ b/ext/nokogiri/xml_sax_parser.c
@@ -7,7 +7,7 @@ static ID id_start_document, id_end_document, id_start_element, id_end_element;
static ID id_start_element_namespace, id_end_element_namespace;
static ID id_comment, id_characters, id_xmldecl, id_error, id_warning;
static ID id_cdata_block, id_cAttribute;
-static ID id_processing_instruction;
+static ID id_processing_instruction, id_get_entity;
static void start_document(void * ctx)
{
@@ -251,6 +251,21 @@ static void processing_instruction(void * ctx, const xmlChar * name, const xmlCh
);
}
+static xmlEntityPtr get_entity(void * ctx, const xmlChar *name)
+{
+ VALUE rb_content;
+ VALUE self = NOKOGIRI_SAX_SELF(ctx);
+ VALUE doc = rb_iv_get(self, "@document");
+
+ rb_funcall( doc,
+ id_get_entity,
+ 1,
+ NOKOGIRI_STR_NEW2(name)
+ );
+
+ return NULL;
+}
+
static void deallocate(xmlSAXHandlerPtr handler)
{
NOKOGIRI_DEBUG_START(handler);
@@ -276,6 +291,7 @@ static VALUE allocate(VALUE klass)
handler->error = error_func;
handler->cdataBlock = cdata_block;
handler->processingInstruction = processing_instruction;
+ handler->getEntity = get_entity;
handler->initialized = XML_SAX2_MAGIC;
return Data_Wrap_Struct(klass, NULL, deallocate, handler);
@@ -303,6 +319,7 @@ void init_xml_sax_parser()
id_error = rb_intern("error");
id_warning = rb_intern("warning");
id_cdata_block = rb_intern("cdata_block");
+ id_get_entity = rb_intern("get_entity");
id_cAttribute = rb_intern("Attribute");
id_start_element_namespace = rb_intern("start_element_namespace");
  id_end_element_namespace = rb_intern("end_element_namespace");

And this script:

xml = <<~XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Stuff [
<!ELEMENT stuff (#PCDATA)>
<!ENTITY THING "a thing">
]>
<stuff>&THING;</stuff>
XML
require "nokogiri"
require "pp"
puts "----> parsing with SAX parser"
class StuffDoc < Nokogiri::XML::SAX::Document
def get_entity name
p [__method__, name]
end
def error(s)
p [__method__, s]
end
end
parser = Nokogiri::XML::SAX::Parser.new(StuffDoc.new)
parser.parse(xml)

Output is:
@tenderlove Thanks for looking into this. I've got performance concerns about option 2 you outlined above. If you recall the Fairy Wing Throwdown from 2011, I suspected that the poor performance of the SAX parser was due to callbacks. Maybe we should actually benchmark what happens if we implement all the defaults, so we know for sure whether this is reasonable to do?
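The benchmarking idea above can be sketched with the stdlib's REXML stream parser. This only measures Ruby-side callback overhead, not Nokogiri's C-level dispatch, and the listener names are illustrative, but the shape of the measurement would be similar:

```ruby
require "benchmark"
require "rexml/parsers/streamparser"
require "rexml/streamlistener"

# A no-op listener: every callback falls through to StreamListener's
# default empty implementations.
class NoopListener
  include REXML::StreamListener
end

# A listener that does a little work in each text callback it receives.
class CountingListener
  include REXML::StreamListener
  attr_reader :texts

  def initialize
    @texts = 0
  end

  def text(_s)
    @texts += 1
  end
end

xml = "<root>#{'<a>text</a>' * 2_000}</root>"

Benchmark.bm(18) do |x|
  x.report("no-op callbacks")  { REXML::Parsers::StreamParser.new(xml, NoopListener.new).parse }
  x.report("active callbacks") { REXML::Parsers::StreamParser.new(xml, CountingListener.new).parse }
end
```

Comparing the two rows gives a rough sense of how much per-callback work costs relative to the parse itself.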
You have to implement the getEntity callback.
OK, so I have a branch where I've implemented a default getEntity callback:

diff --git a/ext/nokogiri/xml_sax_parser.c b/ext/nokogiri/xml_sax_parser.c
index 989ad9eb..622e8159 100644
--- a/ext/nokogiri/xml_sax_parser.c
+++ b/ext/nokogiri/xml_sax_parser.c
@@ -265,6 +265,12 @@ processing_instruction(void *ctx, const xmlChar *name, const xmlChar *content)
);
}
+static xmlEntityPtr
+get_entity(void *ctx, const xmlChar *name)
+{
+ return xmlSAX2GetEntity(NOKOGIRI_SAX_CTXT(ctx), name);
+}
+
static size_t
memsize(const void *data)
{
@@ -300,6 +306,7 @@ allocate(VALUE klass)
handler->cdataBlock = cdata_block;
handler->processingInstruction = processing_instruction;
handler->initialized = XML_SAX2_MAGIC;
+ handler->getEntity = get_entity;
return self;
 }

The good news is that with this change, errors are no longer reported for entities that are properly declared in the DTD. The less-good news is that the expansion of the entity is passed to the characters callback, rather than the entity reference itself.

This change would have some implications for the design of downstream handlers like eiwa's. @searls, what are your thoughts here? Using Nokogiri with the above patch, I can make all eiwa tests pass by applying this patch:

diff --git a/lib/eiwa/jmdict/doc.rb b/lib/eiwa/jmdict/doc.rb
index 8d2d4155..eeb91880 100644
--- a/lib/eiwa/jmdict/doc.rb
+++ b/lib/eiwa/jmdict/doc.rb
@@ -62,15 +62,7 @@ def characters(s)
# end
def error(msg)
- if (matches = msg.match(/Entity '(\S+)' not defined/))
- # See: http://github.com/sparklemotion/nokogiri/issues/1926
- code = matches[1]
- @current.set_entity(code, ENTITIES[code])
- elsif msg == "Detected an entity reference loop\n"
- # Do nothing and hope this does not matter.
- else
- raise Eiwa::Error.new("Parsing error: #{msg}")
- end
+ raise Eiwa::Error.new("Parsing error: #{msg}")
end
# def cdata_block string
diff --git a/lib/eiwa/jmdict/entities.rb b/lib/eiwa/jmdict/entities.rb
index cf218d60..553e952c 100644
--- a/lib/eiwa/jmdict/entities.rb
+++ b/lib/eiwa/jmdict/entities.rb
@@ -13,7 +13,7 @@ module Jmdict
"adj-ku" => "`ku' adjective (archaic)",
"adj-na" => "adjectival nouns or quasi-adjectives (keiyodoshi)",
"adj-nari" => "archaic/formal form of na-adjective",
- "adj-no" => "nouns which may take the genitive case particle `no'",
+ "adj-no" => "nouns which may take the genitive case particle 'no'",
"adj-pn" => "pre-noun adjectival (rentaishi)",
"adj-shiku" => "`shiku' adjective (archaic)",
"adj-t" => "`taru' adjective",
@@ -34,7 +34,7 @@ module Jmdict
"chem" => "chemistry term",
"chn" => "children's language",
"col" => "colloquialism",
- "comp" => "computer terminology",
+ "comp" => "computing",
"conj" => "conjunction",
"cop" => "copula",
"cop-da" => "copula",
@@ -101,6 +101,7 @@ module Jmdict
"quote" => "quotation",
"rare" => "rare",
"rkb" => "Ryuukyuu-ben",
+ "rK" => "rarely used kanji form",
"sens" => "sensitive",
"shogi" => "shogi term",
"sl" => "slang",
diff --git a/lib/eiwa/tag/entity.rb b/lib/eiwa/tag/entity.rb
index 12f4f8c7..a75263a9 100644
--- a/lib/eiwa/tag/entity.rb
+++ b/lib/eiwa/tag/entity.rb
@@ -4,10 +4,13 @@ class Entity < Any
attr_reader :code, :text
def initialize(code: nil, text: nil)
- @code = code
@text = text
end
+ def add_characters(s)
+ @text = s
+ end
+
def set_entity(code, text)
@code = code
    @text = text
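The shape of the eiwa change above reduces to a small sketch: once entity expansions arrive through the characters callback, a tag object only needs to accumulate text rather than intercept error messages. The names here are illustrative, not eiwa's actual classes:

```ruby
# Minimal sketch of the accumulate-characters pattern.
class EntityTag
  attr_reader :text

  def initialize
    @text = +""   # mutable buffer
  end

  # Called from the SAX document's characters callback; libxml2 may
  # deliver text in multiple chunks, so we append rather than assign.
  def add_characters(s)
    @text << s
  end
end

tag = EntityTag.new
tag.add_characters("a ")
tag.add_characters("thing")
tag.text # => "a thing"
```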
IIRC, custom SAX parsers can only work in entity replacement mode (XML_PARSE_NOENT). Without this option, the callback sequence is a bit nonsensical.
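For a concrete picture of what entity replacement means, here is the substitution behavior shown with the stdlib's REXML, which resolves entities declared in the internal DTD. This illustrates the concept only; it does not exercise libxml2's XML_PARSE_NOENT flag:

```ruby
require "rexml/document"

xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE stuff [
    <!ENTITY THING "a thing">
  ]>
  <stuff>&THING;</stuff>
XML

doc = REXML::Document.new(xml)
# With substitution, the reference is replaced by its declared value,
# so consumers see the expanded text rather than "&THING;".
doc.root.text # => "a thing"
```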
@nwellnhof Just to make sure I understand your meaning -- are you saying that there's no way to avoid having the entity's expansion passed to the characters callback?
Yes, but when substituting entities, this should be what you want.
@flavorjones this seems about right? As long as I'm able to retrieve the code (e.g. "uk"), I'm happy.
Describe the bug
When an XML document contains non-predefined entities, even if the document declares those entities up front, parsing with Nokogiri's SAX parser will error.
Note that this warning from libxml2's docs seems to hint that getting this right is hard:
To Reproduce
When run, this will output:
Expected behavior
I honestly just don't want this to explode. I'd prefer to get a literal string of the entity (e.g.
"&THING;"
in this case.Environment
Additional context
This is a real problem for one important document, the JMDict XML file, which is a daily export of the most prominent community-maintained Japanese-English dictionary on the Internet. JMDict uses dozens of custom entities for tagging entries with various metadata. However, because the file is over 100MB, it's more appropriate for SAX parsing, which is how folks might run into this problem. (One example)