Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] Regression in 1.13.0 in DocumentFragment#css? #2419

Closed
CvX opened this issue Jan 10, 2022 · 13 comments
Closed

[bug] Regression in 1.13.0 in DocumentFragment#css? #2419

CvX opened this issue Jan 10, 2022 · 13 comments

Comments

@CvX
Copy link

@CvX CvX commented Jan 10, 2022

The following code used to work before 1.13.0:

fragments = Nokogiri::HTML5::fragment(html)
fragments.css("a/@href", "img/@src", "source/@src", "track/@src", "video/@poster")

Now it raises:

Nokogiri::XML::XPath::SyntaxError: ERROR: Invalid expression: .//*:a/*:@href | self::*:a/*:@href

Stack trace:

lib/nokogiri/xml/searchable.rb:222:in `evaluate'
lib/nokogiri/xml/searchable.rb:222:in `xpath_impl'
lib/nokogiri/xml/searchable.rb:208:in `block (2 levels) in xpath_internal'
lib/nokogiri/xml/searchable.rb:207:in `each'
lib/nokogiri/xml/searchable.rb:207:in `block in xpath_internal'
lib/nokogiri/xml/node_set.rb:23:in `initialize'
lib/nokogiri/xml/searchable.rb:206:in `new'
lib/nokogiri/xml/searchable.rb:206:in `xpath_internal'
lib/nokogiri/xml/node_set.rb:86:in `block in css'
lib/nokogiri/xml/node_set.rb:233:in `block in each'
lib/nokogiri/xml/node_set.rb:232:in `upto'
lib/nokogiri/xml/node_set.rb:232:in `each'
lib/nokogiri/xml/node_set.rb:85:in `inject'
lib/nokogiri/xml/node_set.rb:85:in `css'
lib/nokogiri/xml/document_fragment.rb:107:in `css'

Additional context

You can see the failure in the wild here: https://github.com/discourse/discourse/runs/4768400926?check_suite_focus=true#step:25:23

I don't know if it's related to #2418

@CvX CvX added the state/needs-triage label Jan 10, 2022
@flavorjones
Copy link
Member

@flavorjones flavorjones commented Jan 11, 2022

@CvX Thanks for reporting this! It's unrelated to #2418.

This is an interesting failure -- and is most likely happening because nokogiri's ruby code is running against something besides the packaged/vendored libxml2 (with the patch 0009-allow-wildcard-namespaces.patch applied).

Is it possible that you're pulling in a libxml2 library from somewhere else (an earlier version of nokogiri or libxml-ruby or the system libxml2)? Can you please reply with:

  • the output of nokogiri -v
  • the output of bundle show and bundle show --paths if you're using bundler

I can't tell much from the output of that actions run, unfortunately.

@flavorjones
Copy link
Member

@flavorjones flavorjones commented Jan 11, 2022

Hey! Good news, I can reproduce this in the discourse/discourse_test:slim docker container. Give me a bit.

@flavorjones
Copy link
Member

@flavorjones flavorjones commented Jan 11, 2022

@CvX Ah, ok, so I apologize for my confusion above -- this is pretty easily reproducible once I got my local system into a good state.

Some background context

For context, Nokogiri implements CSS selector searches by translating them into the equivalent XPath query. So, for example:

#! /usr/bin/env ruby

require 'nokogiri'

css_parser = Nokogiri::CSS::Parser.new
css_parser.parse("a").first.to_xpath("//", Nokogiri::CSS::XPathVisitor.new)
# => "//a"

When you write a CSS selector that looks like img/@href, it is converted in a way that looks sensible:

css_parser.parse("a/@href").first.to_xpath("//", Nokogiri::CSS::XPathVisitor.new)
# => "//a/@href"

What changed in v1.13.0

Nokogiri v1.13.0 improved HTML5 CSS selector query performance (by about 10X!) by introducing limited "wildcard namespace" functionality into libxml2, and converting CSS queries to use it. (See #2376 and #2403 for details if you're interested in the details.)

With that new HTML5 CSS selector translation, the CSS query is converted into a different XPath query:

visitor = Nokogiri::CSS::XPathVisitor.new(
  builtins: Nokogiri::CSS::XPathVisitor::BuiltinsConfig::ALWAYS,
  doctype: Nokogiri::CSS::XPathVisitor::DoctypeConfig::HTML5,
)
css_parser.parse("a").first.to_xpath("//", visitor)
# => "//*:a"
css_parser.parse("a/@href").first.to_xpath("//", visitor)
# => "//*:a/*:@href"

You probably recognize that last XPath query string from your application's error message.

The catch

OK, here's the catch: the CSS selectors your app is using are invalid, but happens to work. They've always been invalid, but Nokogiri just happened to not catch the invalid syntax and did the right thing in the past.

The XPath query you want to generate is //a/@href, meaning "find me all href attributes on an a element". In older versions of Nokogiri, asking for document.css("a/@href") did indeed search using this XPath, as demonstrated above.

But a/@href is not valid CSS, and this is only working for you because:

  1. Nokogiri supports / as an XPath-y extension to standard CSS (i.e., this is not standard CSS)
  2. Nokogiri doesn't flag @href as an invalid type selector (i.e., element name)

An immediate workaround (and, actually, my recommendation regardless) is to update this code:

fragments.css("a/@href", "img/@src", "source/@src", "track/@src", "video/@poster")

to simply use XPath:

fragments.xpath("//a/@href", "//img/@src", "//source/@src", "//track/@src", "//video/@poster")

You'll avoid the CSS translation step that way, and work around this particular issue until I can ship a backwards-compatible fix.

The fix

This use case is interesting! Although I now understand why it worked in previous versions, we had nothing in our test suite like it and to be honest this mash-up of XPath and CSS wasn't part of my mental model. That said, I really like this syntax, and there's nothing stopping us from formally supporting it going forward.

So, my punchlist looks like:

  • add test coverage!
  • update documentation to indicate that using @attr syntax in CSS queries is supported
  • modify the CSS translation layer to not use the namespace wildcard with attributes.

There's a bit of work here, so it may take me a day or two to ship a fix.

@flavorjones flavorjones added this to the v1.13.x patch releases milestone Jan 11, 2022
@flavorjones flavorjones removed the state/needs-triage label Jan 11, 2022
@CvX
Copy link
Author

@CvX CvX commented Jan 11, 2022

Thank you for the detailed write-up! 👏 I will take your recommendation. It was a bit weird in the first place to use this xpath-but-not-really syntax with the css method. 😃

@flavorjones
Copy link
Member

@flavorjones flavorjones commented Jan 11, 2022

@CvX Happy to help. Also, I want to point out a mistake I made above, the XPath version of that fragment search is

fragments.xpath(".//a/@href", ".//img/@src", ".//source/@src", ".//track/@src", ".//video/@poster")

That is, the XPath query prefix for "anywhere in this fragment" should be .// not // for unobvious reasons dating back to original implementation of DocumentFragment.

@flavorjones
Copy link
Member

@flavorjones flavorjones commented Jan 11, 2022

Oh, fixing this is going to involve a healthy and needed refactoring of the CSS parser and the XPath AST visitor. I'll get to delete some code! 👏

@CvX
Copy link
Author

@CvX CvX commented Jan 11, 2022

A small curveball regarding the css -> xpath transition:

doc = Nokogiri::HTML5.fragment('<img src="/test.gif">')
fragment = doc.css('img[src]')

fragment.css("img/@src").length == 1
fragment.xpath(".//img/@src").length == 0
fragment.xpath("//img/@src").length == 0

…but as I was typing this, I think I just worked it out:

fragment.xpath(".//descendant-or-self::img/@src")

The fascinating world of XPath 😂


btw. I also found another almost-xpath css call in our codebase: css("text()") So keep in mind that all sorts of xpath-like is being passed to that method in the wild! 😃

@flavorjones
Copy link
Member

@flavorjones flavorjones commented Jan 11, 2022

@CvX Yeah, searching within fragments is wonky and has edge cases with the search path like this. Sorry for your troubles. I've been tracking some of these issues at https://nokogiri.org/ROADMAP.html#documentfragment but also starting to work on a re-implementation at #2184

Thanks for the note about text(), I'll make sure that's got test coverage as well.

@CvX
Copy link
Author

@CvX CvX commented Jan 11, 2022

Huh, yet another difference!

doc = Nokogiri::HTML5.fragment(<<~HTML)
  <p><img src="/test.gif"></p>
  <p><a href="/test">x</a></p>
HTML

doc.css("a/@href", "img/@src").map(&:value)
# => ["/test.gif", "/test"]
doc.xpath(".//a/@href", ".//img/@src").map(&:value)
# => ["/test", "/test.gif"]

But that can also be worked around!

doc.xpath(".//a/@href|.//img/@src").map(&:value)
# => ["/test.gif", "/test"]

So my original query will look like this:

# Instead of:
# fragments.css("a/@href", "img/@src", "source/@src", "track/@src", "video/@poster")

fragments.xpath(".//descendant-or-self::a/@href|.//descendant-or-self::img/@src|.//descendant-or-self::source/@src|.//descendant-or-self::track/@src|.//descendant-or-self::video/@poster")

I'm starting to really appreciate the mixed syntax! 😅

@flavorjones
Copy link
Member

@flavorjones flavorjones commented Jan 11, 2022

Yeah, fragment searching is pretty bad, sorry. If you prefer, you could hold off on upgrading for a day and I should be able to get a patch release out.

flavorjones added a commit that referenced this issue Jan 12, 2022
This commit removes "@" from the IDENT token so that we can create a
new grammar rule in the parser for XPath attributes.

Fixes #2419
flavorjones added a commit that referenced this issue Jan 12, 2022
And officially document the XPath attribute extensions to CSS selector
syntax that we support.

See #2419 for context
@flavorjones
Copy link
Member

@flavorjones flavorjones commented Jan 12, 2022

PR with a proposed fix is at #2422

flavorjones added a commit that referenced this issue Jan 12, 2022
And officially document the XPath attribute extensions to CSS selector
syntax that we support.

See #2419 for context
flavorjones added a commit that referenced this issue Jan 13, 2022
This commit removes "@" from the IDENT token so that we can create a
new grammar rule in the parser for XPath attributes.

Fixes #2419
flavorjones added a commit that referenced this issue Jan 13, 2022
And officially document the XPath attribute extensions to CSS selector
syntax that we support.

See #2419 for context
flavorjones added a commit that referenced this issue Jan 13, 2022
@flavorjones
Copy link
Member

@flavorjones flavorjones commented Jan 13, 2022

waiting for the backport to go green at #2423 and then I'll cut v1.13.1, probably tomorrow morning.

@flavorjones
Copy link
Member

@flavorjones flavorjones commented Jan 13, 2022

v1.13.1 is out now: https://github.com/sparklemotion/nokogiri/releases/tag/v1.13.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants