Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nokogiri on JRuby adds tbody tag to parsed document, resulting in inconsistent query results #1803

Closed
mojavelinux opened this issue Oct 14, 2018 · 2 comments

Comments

@mojavelinux
Copy link

Starting with Nokogiri 1.8.3, JRuby introduces a tbody tag in the parsed document where there isn't one in the source. This causes an inconsistently in the XPath (and CSS) query results between MRI and JRuby.

Consider the following HTML fragment:

<table>
<tr>
<td>cell</td>
</tr>
</table>

Now consider the XPath query:

/table/tbody/tr

When using Nokogiri 1.8.5 on MRI (or Nokogiri <= 1.8.2 on JRuby), the size of the result set of this query is:

0

However, when using Nokogiri 1.8.5 on JRuby, the size of the result set changes to:

1

The following program can be used to demonstrate this problem:

require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse <<-EOS
<table>
<tr>
<td>cell</td>
</tr>
</table>
EOS
result = doc.xpath '/table/tbody/tr'
puts result.size

Note that the equivalent CSS query also exhibits the same problem:

result = doc.css 'table > tbody > tr'

The problem can be seen by converting the parsed document to a string, which outputs:

<table>
<tbody><tr>
<td>cell</td>
</tr>
</tbody></table>

Nokogiri should not be introducing a tbody tag in the parsed HTML tree where there isn't one in the source (or MRI should do the same).


Nokogiri version: 1.8.5
Ruby version: jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 OpenJDK 64-Bit Server VM 25.181-b15 on 1.8.0_181-b15 +jit
OS: linux-x86_64

@flavorjones
Copy link
Member

Hello! And thank you for raising this issue.

What you're seeing is due to the change in 2e89378, whose commit message I'll reproduce here to help drive the conversation:

remove monkey patch introduced in #1251

this is an ugly change whose only purpose is to mask the difference between
libxml and nekohtml. we agreed to stop doing that a while ago and just accept
that different libraries will behave different. furthermore, it caused a stack
overflow while parding documents with a TD element that doesn't have any
parents in #1501

There are many differences between the parsers used in the JRuby and the MRI/YARV implementations. Xerces/NekoHTML is used by JRuby and libxml2 is used by MRI/YARV.

Nokogiri is intended to be a Ruby wrapper around these parsers, and so making them behave identically is an anti-goal. In particular, "fixing up markup" has shown itself to be a rich vein of frustration for users who expect the two implementations to behave the same (or who expect either of them to behave the same as Browser X). And I'm sorry about that.

I hope I've explained what's going on (underlying parsers behaving differently, and a broken monkeypatch was removed which attempted to make them behave the same in this case) in such a way that it seems defensible that Nokogiri allows this difference in behavior to exist.

@mojavelinux
Copy link
Author

@flavorjones, thank you for the explanation. What you wrote make sense, and I can understand the challenge at hand, though I can't say I'm very comfortable with the differences. For me, it's very important that libraries behave the same across various Ruby implementations. That's one of the key benefits of using a library over maintaining code inside the application. I'll just have to keep in mind that nokogiri doesn't subscribe to that model when choosing to use it.

I think what caught me off guard the most is that this change came in a patch release (1.8.2 to 1.8.3). I was definitely not expecting the behavior of nokogiri to fundamentally change by making that upgrade. Had it been a major or even a minor version, I might have consulted the changelog carefully. In the future, I'd ask that you consider reserving such changes for major (ideally) or minor releases so that users like myself aren't surprised when upgrading to address a security bulletin.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants