-
-
Notifications
You must be signed in to change notification settings - Fork 901
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add failing test case to keep entities in sax parsers #1500
Conversation
Hi @tenderlove, I think I understand what this issue is all about. libxml2 provides the getEntity callback but Nokogiri doesn't expose it. If it exposed I could override the default behavior to simply return the raw data so that it would be passed to the characters callback. Here's what Nokogiri currently supports: https://github.com/sparklemotion/nokogiri/blob/master/ext/nokogiri/xml_sax_parser.c#L266-L278 Here's the missing callback function: https://github.com/GNOME/libxml2/blob/master/include/libxml/parser.h#L725 This article explains how getEntity works (see "The getEntity Callback" section): http://www.jamesh.id.au/articles/libxml-sax/libxml-sax.html#entities
The default implementation seems to check for either Maybe it would be simpler if Nokogiri could simply add the getEntity callback available to the SAX parser so that we could override the default behavior to simply return the raw string. I've only searched through the sources and documentation, but I don't actually know how to implement this, so maybe what I'm saying is non-sense? |
see #1500 for background libxml (MRI): * the callback gets invoked properly, * but cannot be controlled by `replace_entities`. * need to look into why replace_entities isn't working right. libxml's docs are pretty thin. xerces (JRuby): * the callback gets invoked _sometimes_, * and cannot be controlled by `replace_entities`. * need to look into how to control it; and why the callback isn't invoked the same way that libxml2 invokes its handler.
@rosenfeld I've made an attempt to get this to work; unfortunately I can't quite get it to work correctly with either libxml2 or xerces. feature branch is Any thoughts? If you've got bandwidth, I'd love for you to look into finishing this work. |
Hi @flavorjones. I do have bandwidth, this is not the problem. Time is the problem. I currently don't have time to work on my off hours since I have two little girls that demand all my attention when I'm not working. I'd love to spend some time during my work investigating this issue and trying to get it working into Nokogiri, but to be honest, there are currently so many items in our backlogs that I suspect it will be a long time before I'm able to work on this. All of that is new to me, which means it will take me several days at best to understand the pieces. I have never used libxml2 before, I have little knowledge about XML in general, I have never worked on C-extension gems before and after checking the documentation for libxml2 it seems there's not much there regarding entities substitutions in SAX parsers either. That means, I'll need some time to read through libxml2 source to understand how it works, try to get some simple C-only program to try that replaceEntities option (which may force me to understand libxml2 internal source), learn how Nokogiri uses libxml2, then finally debug what is needed to get it to work as expected. All of this sounds like very interesting work actually to be honest and I'd love to do that. But I don't think I'll have that much free time during my work time for a few months at least... I spent the past few hours having a glance over libxml2 docs and source, Nokogiri sources, and some other xml sax parsers for Ruby and I think I have now a better idea on how much time will be required for me to focus on this. I just wanted you to know that I really appreciate your effort on this feature and that I do intend to help with that, but I just wanted to warn you that it can take a long time. It doesn't mean I forgot about this issue, this is all I wanted you to understand. Thank you very much for this branch, I hope I can find some time to finish this work some day. |
see #1500 for background libxml (MRI): * the callback gets invoked properly, * but cannot be controlled by `replace_entities`. * need to look into why replace_entities isn't working right. libxml's docs are pretty thin. xerces (JRuby): * the callback gets invoked _sometimes_, * and cannot be controlled by `replace_entities`. * need to look into how to control it; and why the callback isn't invoked the same way that libxml2 invokes its handler.
I've rebased my branch and re-pushed it to |
see #1500 for background libxml (MRI): * the callback gets invoked properly, * but cannot be controlled by `replace_entities`. * need to look into why replace_entities isn't working right. libxml's docs are pretty thin. xerces (JRuby): * the callback gets invoked _sometimes_, * and cannot be controlled by `replace_entities`. * need to look into how to control it; and why the callback isn't invoked the same way that libxml2 invokes its handler.
see #1500 for background libxml (MRI): * the callback gets invoked properly, * but cannot be controlled by `replace_entities`. * need to look into why replace_entities isn't working right. libxml's docs are pretty thin. xerces (JRuby): * the callback gets invoked _sometimes_, * and cannot be controlled by `replace_entities`. * need to look into how to control it; and why the callback isn't invoked the same way that libxml2 invokes its handler.
See related #1926 |
On CRuby, this fixes the fact that the parser was registering errors when encountering general (non-predefined) entities. Now these entities are resolved properly and converted into `#characters` callbacks. Fixes #1926. On JRuby, the SAX parser now respects the `#replace_entities` attribute, which was previously ignored AND defaulted incorrectly to `true`. The default now matches CRuby -- `false` -- and the parser behavior matches CRuby with respect to entities. Fixes #614. This commit also includes some granular tests of how the sax parser handles different entities under different circumstances, which should be clarifying for user reports like #1284 and #1500 that expect predefined entities and character references to be treated like parsed entities (which they aren't).
As part of my work in #1926, I investigated entity handling and documented its behavior in #3265. Here's a screenshot of the docs: As you can see, character references and predefined entities will always result in a callback to Sorry it took so long to explain this, and I hope this is helpful. |
On CRuby, this fixes the fact that the parser was registering errors when encountering general (non-predefined) entities. Now these entities are resolved properly and converted into `#characters` callbacks. Fixes #1926. On JRuby, the SAX parser now respects the `#replace_entities` attribute, which was previously ignored AND defaulted incorrectly to `true`. The default now matches CRuby -- `false` -- and the parser behavior matches CRuby with respect to entities. Fixes #614. This commit also includes some granular tests of how the sax parser handles different entities under different circumstances, which should be clarifying for user reports like #1284 and #1500 that expect predefined entities and character references to be treated like parsed entities (which they aren't).
**What problem is this PR intended to solve?** #1926 described an issue wherein the SAX parser was not correctly resolving and replacing internal entities, and was instead reporting an error for each entity reference. This PR includes a fix for that problem. I've removed the unnecessary "SAX tuple" from the SAX implementation, replacing it with the `_private` struct member that libxml2 makes available. Then I set up the parser context structs so that we can use libxml2's standard SAX callbacks where they're useful (which is how I addressed the above issue). This PR also introduces a new feature, a SAX handler callback `Document#reference` which allows callers to get entity-specific name and replacement text information (rather than relying on the `Document#characters` callback). This can be used to solve the original issue in #1926 with this code: searls/eiwa#11 The behavior of the SAX parser with respect to entities is complex enough that I wrote up a short doc in the `XML::SAX::Document` docstring with a table and explanation. I've also added warnings to remind users that `#replace_entities` is not safe to set when parsing untrusted documents. In the Java implementation, I've fixed the `#replace_entities` option in the SAX parser context and set it to the proper default (`false`), fixing #614. I've also corrected the value of the URI argument to `Document#start_element_namespace` which was a blank string when it should have been `nil`. I've added quite a bit of testing around the SAX parser's handling of entities. I added and clarified quite a bit of documentation around SAX parsing generally. Exception messages have been clarified in a couple of places, and made consistent between the C and Java implementations. This should address questions asked in issues #1500 and #1284. Finally, I cleaned up some of the C code that implements SAX parsing, naming functions more explicitly (and moving towards some kind of standard naming convention). Closes #1926. Closes #614. **Have you included adequate test coverage?** Yes! **Does this change affect the behavior of either the C or the Java implementations?** Yes, but the implementations are much more consistent with each other now.
Could you please help me getting this test to pass? cc/ @tenderlove