Nokogiri on JRuby adds tbody tag to parsed document, resulting in inconsistent query results #1803

mojavelinux · 2018-10-14T06:37:16Z

Starting with Nokogiri 1.8.3, JRuby introduces a tbody tag in the parsed document where there isn't one in the source. This causes an inconsistently in the XPath (and CSS) query results between MRI and JRuby.

Consider the following HTML fragment:

<table>
<tr>
<td>cell</td>
</tr>
</table>

Now consider the XPath query:

/table/tbody/tr

When using Nokogiri 1.8.5 on MRI (or Nokogiri <= 1.8.2 on JRuby), the size of the result set of this query is:

However, when using Nokogiri 1.8.5 on JRuby, the size of the result set changes to:

The following program can be used to demonstrate this problem:

require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse <<-EOS
<table>
<tr>
<td>cell</td>
</tr>
</table>
EOS
result = doc.xpath '/table/tbody/tr'
puts result.size

Note that the equivalent CSS query also exhibits the same problem:

result = doc.css 'table > tbody > tr'

The problem can be seen by converting the parsed document to a string, which outputs:

<table>
<tbody><tr>
<td>cell</td>
</tr>
</tbody></table>

Nokogiri should not be introducing a tbody tag in the parsed HTML tree where there isn't one in the source (or MRI should do the same).

Nokogiri version: 1.8.5
Ruby version: jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 OpenJDK 64-Bit Server VM 25.181-b15 on 1.8.0_181-b15 +jit
OS: linux-x86_64

The text was updated successfully, but these errors were encountered:

flavorjones · 2018-10-15T14:42:20Z

Hello! And thank you for raising this issue.

What you're seeing is due to the change in 2e89378, whose commit message I'll reproduce here to help drive the conversation:

remove monkey patch introduced in #1251

this is an ugly change whose only purpose is to mask the difference between
libxml and nekohtml. we agreed to stop doing that a while ago and just accept
that different libraries will behave different. furthermore, it caused a stack
overflow while parding documents with a TD element that doesn't have any
parents in #1501

There are many differences between the parsers used in the JRuby and the MRI/YARV implementations. Xerces/NekoHTML is used by JRuby and libxml2 is used by MRI/YARV.

Nokogiri is intended to be a Ruby wrapper around these parsers, and so making them behave identically is an anti-goal. In particular, "fixing up markup" has shown itself to be a rich vein of frustration for users who expect the two implementations to behave the same (or who expect either of them to behave the same as Browser X). And I'm sorry about that.

I hope I've explained what's going on (underlying parsers behaving differently, and a broken monkeypatch was removed which attempted to make them behave the same in this case) in such a way that it seems defensible that Nokogiri allows this difference in behavior to exist.

mojavelinux · 2018-10-16T05:17:27Z

@flavorjones, thank you for the explanation. What you wrote make sense, and I can understand the challenge at hand, though I can't say I'm very comfortable with the differences. For me, it's very important that libraries behave the same across various Ruby implementations. That's one of the key benefits of using a library over maintaining code inside the application. I'll just have to keep in mind that nokogiri doesn't subscribe to that model when choosing to use it.

I think what caught me off guard the most is that this change came in a patch release (1.8.2 to 1.8.3). I was definitely not expecting the behavior of nokogiri to fundamentally change by making that upgrade. Had it been a major or even a minor version, I might have consulted the changelog carefully. In the future, I'd ask that you consider reserving such changes for major (ideally) or minor releases so that users like myself aren't surprised when upgrading to address a security bulletin.

flavorjones closed this as completed Oct 15, 2018

flavorjones mentioned this issue Nov 6, 2021

Note in the documentation that broken markup may be "fixed" differently, or docs may be serialized differently, by libxml than by your browser or by xerces sparklemotion/nokogiri.org#41

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Nokogiri on JRuby adds tbody tag to parsed document, resulting in inconsistent query results #1803

Nokogiri on JRuby adds tbody tag to parsed document, resulting in inconsistent query results #1803

mojavelinux commented Oct 14, 2018

flavorjones commented Oct 15, 2018

mojavelinux commented Oct 16, 2018

Nokogiri on JRuby adds tbody tag to parsed document, resulting in inconsistent query results #1803

Nokogiri on JRuby adds tbody tag to parsed document, resulting in inconsistent query results #1803

Comments

mojavelinux commented Oct 14, 2018

flavorjones commented Oct 15, 2018

mojavelinux commented Oct 16, 2018