Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add failing test case to keep entities in sax parsers #1500

Closed
wants to merge 1 commit into from

Conversation

rosenfeld
Copy link
Contributor

Could you please help me getting this test to pass? cc/ @tenderlove

@rosenfeld
Copy link
Contributor Author

Hi @tenderlove, I think I understand what this issue is all about. libxml2 provides the getEntity callback but Nokogiri doesn't expose it. If it exposed I could override the default behavior to simply return the raw data so that it would be passed to the characters callback. Here's what Nokogiri currently supports:

https://github.com/sparklemotion/nokogiri/blob/master/ext/nokogiri/xml_sax_parser.c#L266-L278

Here's the missing callback function:

https://github.com/GNOME/libxml2/blob/master/include/libxml/parser.h#L725

This article explains how getEntity works (see "The getEntity Callback" section):

http://www.jamesh.id.au/articles/libxml-sax/libxml-sax.html#entities

You may have been wondering how entities (eg <, etc) are handled by the SAX interface. This is done by the getEntity callback... After calling getEntity, the expansion of the entity is passed to the characters callback. This way, you do not need to worry about decoding entities anywhere else in your callback routines.

The default implementation seems to check for either ctxt->validate or ctxt->replaceEntities, so maybe setting replaceEntities to true is not enough if ctxt->validate is true.

Maybe it would be simpler if Nokogiri could simply add the getEntity callback available to the SAX parser so that we could override the default behavior to simply return the raw string. I've only searched through the sources and documentation, but I don't actually know how to implement this, so maybe what I'm saying is non-sense?

flavorjones added a commit that referenced this pull request Jan 13, 2017
see #1500 for background

libxml (MRI):

* the callback gets invoked properly,
* but cannot be controlled by `replace_entities`.
* need to look into why replace_entities isn't working right. libxml's
  docs are pretty thin.

xerces (JRuby):

* the callback gets invoked _sometimes_,
* and cannot be controlled by `replace_entities`.
* need to look into how to control it; and why the callback isn't
  invoked the same way that libxml2 invokes its handler.
@flavorjones
Copy link
Member

flavorjones commented Jan 13, 2017

@rosenfeld I've made an attempt to get this to work; unfortunately I can't quite get it to work correctly with either libxml2 or xerces.

feature branch is flavorjones-sax-parser-replace-entities

Any thoughts? If you've got bandwidth, I'd love for you to look into finishing this work.

@rosenfeld
Copy link
Contributor Author

Hi @flavorjones. I do have bandwidth, this is not the problem. Time is the problem. I currently don't have time to work on my off hours since I have two little girls that demand all my attention when I'm not working. I'd love to spend some time during my work investigating this issue and trying to get it working into Nokogiri, but to be honest, there are currently so many items in our backlogs that I suspect it will be a long time before I'm able to work on this.

All of that is new to me, which means it will take me several days at best to understand the pieces. I have never used libxml2 before, I have little knowledge about XML in general, I have never worked on C-extension gems before and after checking the documentation for libxml2 it seems there's not much there regarding entities substitutions in SAX parsers either. That means, I'll need some time to read through libxml2 source to understand how it works, try to get some simple C-only program to try that replaceEntities option (which may force me to understand libxml2 internal source), learn how Nokogiri uses libxml2, then finally debug what is needed to get it to work as expected. All of this sounds like very interesting work actually to be honest and I'd love to do that. But I don't think I'll have that much free time during my work time for a few months at least... I spent the past few hours having a glance over libxml2 docs and source, Nokogiri sources, and some other xml sax parsers for Ruby and I think I have now a better idea on how much time will be required for me to focus on this.

I just wanted you to know that I really appreciate your effort on this feature and that I do intend to help with that, but I just wanted to warn you that it can take a long time. It doesn't mean I forgot about this issue, this is all I wanted you to understand.

Thank you very much for this branch, I hope I can find some time to finish this work some day.

Base automatically changed from master to main January 17, 2021 21:52
flavorjones added a commit that referenced this pull request Aug 20, 2021
see #1500 for background

libxml (MRI):

* the callback gets invoked properly,
* but cannot be controlled by `replace_entities`.
* need to look into why replace_entities isn't working right. libxml's
  docs are pretty thin.

xerces (JRuby):

* the callback gets invoked _sometimes_,
* and cannot be controlled by `replace_entities`.
* need to look into how to control it; and why the callback isn't
  invoked the same way that libxml2 invokes its handler.
@flavorjones
Copy link
Member

I've rebased my branch and re-pushed it to 1500-sax-parser-replace-entities for potential further investigation

flavorjones added a commit that referenced this pull request Sep 24, 2021
see #1500 for background

libxml (MRI):

* the callback gets invoked properly,
* but cannot be controlled by `replace_entities`.
* need to look into why replace_entities isn't working right. libxml's
  docs are pretty thin.

xerces (JRuby):

* the callback gets invoked _sometimes_,
* and cannot be controlled by `replace_entities`.
* need to look into how to control it; and why the callback isn't
  invoked the same way that libxml2 invokes its handler.
flavorjones added a commit that referenced this pull request Sep 24, 2021
see #1500 for background

libxml (MRI):

* the callback gets invoked properly,
* but cannot be controlled by `replace_entities`.
* need to look into why replace_entities isn't working right. libxml's
  docs are pretty thin.

xerces (JRuby):

* the callback gets invoked _sometimes_,
* and cannot be controlled by `replace_entities`.
* need to look into how to control it; and why the callback isn't
  invoked the same way that libxml2 invokes its handler.
@flavorjones
Copy link
Member

See related #1926

@flavorjones flavorjones added this to the v1.17.0 milestone Jun 30, 2024
flavorjones added a commit that referenced this pull request Jul 1, 2024
On CRuby, this fixes the fact that the parser was registering errors
when encountering general (non-predefined) entities. Now these
entities are resolved properly and converted into `#characters`
callbacks. Fixes #1926.

On JRuby, the SAX parser now respects the `#replace_entities`
attribute, which was previously ignored AND defaulted incorrectly to
`true`. The default now matches CRuby -- `false` -- and the parser
behavior matches CRuby with respect to entities. Fixes #614.

This commit also includes some granular tests of how the sax parser
handles different entities under different circumstances, which should
be clarifying for user reports like #1284 and #1500 that expect
predefined entities and character references to be treated like parsed
entities (which they aren't).
@flavorjones
Copy link
Member

As part of my work in #1926, I investigated entity handling and documented its behavior in #3265. Here's a screenshot of the docs:

image

As you can see, character references and predefined entities will always result in a callback to #characters with the replacement text. This is just how the underlying parsers (libxml2 and xerces) behave.

Sorry it took so long to explain this, and I hope this is helpful.

@flavorjones flavorjones closed this Jul 1, 2024
@flavorjones flavorjones removed this from the v1.17.0 milestone Jul 1, 2024
flavorjones added a commit that referenced this pull request Jul 2, 2024
On CRuby, this fixes the fact that the parser was registering errors
when encountering general (non-predefined) entities. Now these
entities are resolved properly and converted into `#characters`
callbacks. Fixes #1926.

On JRuby, the SAX parser now respects the `#replace_entities`
attribute, which was previously ignored AND defaulted incorrectly to
`true`. The default now matches CRuby -- `false` -- and the parser
behavior matches CRuby with respect to entities. Fixes #614.

This commit also includes some granular tests of how the sax parser
handles different entities under different circumstances, which should
be clarifying for user reports like #1284 and #1500 that expect
predefined entities and character references to be treated like parsed
entities (which they aren't).
flavorjones added a commit that referenced this pull request Jul 2, 2024
**What problem is this PR intended to solve?**

#1926 described an issue wherein the SAX parser was not correctly
resolving and replacing internal entities, and was instead reporting an
error for each entity reference. This PR includes a fix for that
problem.

I've removed the unnecessary "SAX tuple" from the SAX implementation,
replacing it with the `_private` struct member that libxml2 makes
available. Then I set up the parser context structs so that we can use
libxml2's standard SAX callbacks where they're useful (which is how I
addressed the above issue).

This PR also introduces a new feature, a SAX handler callback
`Document#reference` which allows callers to get entity-specific name
and replacement text information (rather than relying on the
`Document#characters` callback). This can be used to solve the original
issue in #1926 with this code: searls/eiwa#11

The behavior of the SAX parser with respect to entities is complex
enough that I wrote up a short doc in the `XML::SAX::Document` docstring
with a table and explanation. I've also added warnings to remind users
that `#replace_entities` is not safe to set when parsing untrusted
documents.

In the Java implementation, I've fixed the `#replace_entities` option in
the SAX parser context and set it to the proper default (`false`),
fixing #614. I've also corrected the value of the URI argument to
`Document#start_element_namespace` which was a blank string when it
should have been `nil`.

I've added quite a bit of testing around the SAX parser's handling of
entities.

I added and clarified quite a bit of documentation around SAX parsing
generally. Exception messages have been clarified in a couple of places,
and made consistent between the C and Java implementations. This should
address questions asked in issues #1500 and #1284.

Finally, I cleaned up some of the C code that implements SAX parsing,
naming functions more explicitly (and moving towards some kind of
standard naming convention).

Closes #1926.
Closes #614.


**Have you included adequate test coverage?**

Yes!


**Does this change affect the behavior of either the C or the Java
implementations?**

Yes, but the implementations are much more consistent with each other
now.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants