Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

speed up CSS class queries #2137

Merged
merged 10 commits into from Dec 18, 2020
Merged

speed up CSS class queries #2137

merged 10 commits into from Dec 18, 2020

Conversation

flavorjones
Copy link
Member

@flavorjones flavorjones commented Dec 18, 2020

What problem is this PR intended to solve?

This PR implements native C and Java XPath functions to look for CSS class names in a string in order to speed up CSS class selectors.

Currently (v1.10.10), Nokogiri turns the CSS query .red into the XPath query:

"//*[contains(concat(' ', normalize-space(@class), ' '), ' red ')]"

which is doing a lot of string manipulation under the hood:

  • normalize-space creates a new string buffer and assembles it one byte at a time, cleaning up whitespace along the way (xpath.c:xmlXPathNormalizeFunction), then strdups that string
  • concat is pretty expensive, allocating new strings and repeatedly calling strlen and strdup

The native implementations gave mixed results. On CRuby, it's about 2x faster; but on JRuby it's slower. Here's a benchmark script:

#! /usr/bin/env ruby

require "nokogiri"
require "benchmark/ips"
require "securerandom"

root = File.expand_path(File.join(File.dirname(__FILE__), ".."))

puts RUBY_DESCRIPTION

Benchmark.ips do |x|
  x.time = 10

  doc = Nokogiri::HTML::Document.parse(File.read(File.join(root, "test/files/tlm.html")))

  [
    [:xpath, "//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]"],
    [:xpath, "//*[nokogiri-builtin:css-class(@class, 'xxxx')]"],
  ].each do |method, query|
    
    x.report("#{method}(\"#{query}\")") do
      cache_buster = query.gsub("xxxx", "x" + SecureRandom.alphanumeric(4))
      doc.public_send(method, cache_buster)
    end

  end

  x.compare!
end

CRuby/libxml2:

ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]
Warming up --------------------------------------
xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]")
                        71.000  i/100ms
xpath("//*[nokogiri-builtin:css-class(@class, 'xxxx')]")
                       135.000  i/100ms
Calculating -------------------------------------
xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]")
                        681.312  (± 5.9%) i/s -      6.816k in  10.041631s
xpath("//*[nokogiri-builtin:css-class(@class, 'xxxx')]")
                          1.343k (± 5.9%) i/s -     13.500k in  10.090504s

Comparison:
xpath("//*[nokogiri-builtin:css-class(@class, 'xxxx')]"):     1343.0 i/s
xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]"):      681.3 i/s - 1.97x  (± 0.00) slower

versus JRuby/Xerces:

jruby 9.2.9.0 (2.5.7) 2019-10-30 458ad3e OpenJDK 64-Bit Server VM 11.0.9.1+1-Ubuntu-0ubuntu1.20.04 on 11.0.9.1+1-Ubuntu-0ubuntu1.20.04 [linux-x86_64]
Warming up --------------------------------------
xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]")
                        74.000  i/100ms
xpath("//*[nokogiri-builtin:css-class(@class, 'xxxx')]")
                        41.000  i/100ms
Calculating -------------------------------------
xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]")
                        814.536  (± 9.6%) i/s -      8.066k in  10.022432s
xpath("//*[nokogiri-builtin:css-class(@class, 'xxxx')]")
                        443.781  (± 6.8%) i/s -      4.428k in  10.029857s

Comparison:
xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]"):      814.5 i/s
xpath("//*[nokogiri-builtin:css-class(@class, 'xxxx')]"):      443.8 i/s - 1.84x  (± 0.00) slower

It's unclear to me why this is; the benchmark tries to bust any caching that might getting done. If anybody can tell me why that is, I'd appreciate the help. But in the meantime the Xerces XPath implementation is compellingly faster than the native Java version, and so we should prefer it to the native builtin function.

To support having variations on the XPath generated from the same CSS query, this PR also extracts the optimization decision into a single method, XPathVisitor#css_class, and introduces new subclasses XPathVisitorAlwaysUseBuiltins (which will prefer the builtin to the xpath function) and XPathVisitorOptimallyUseBuiltins (which will generate the XPath query that will run the fastest on the user's platform).

This PR also makes some non-functional changes:

  • the class selector . and the attribute selector ~= now share the same logic and generate the same XPath (which is an improvement for ~= which was doing some unnecessary string concatenation)
  • considerable cleanup of XPathVisitor tests for readability and maintainability
  • remove unnecessary whitespace from the generated XPath queries

Finally, this PR also fixes a long-standing bug with the ~= operator (related to #854) which was incorrectly limiting its consideration of whitespace to 0x20 when it should have been also treating \r, \r, and \n as whitespace.

Have you included adequate test coverage?

You bet.

Does this change affect the behavior of either the C or the Java implementations?

It's notable that the XPath generated by default will differ because of the performance optimizations being made, however there are no functional changes outside of this performance improvement on CRuby/libxml2.

available as `nokogiri-builtin:css-class`

Part of #2135
and consolidated the XPathVisitor tests from css/test_parser.rb into
css/test_xpath_visitor.rb so we're testing xpath visitor behavior in
one place.
This is consistent with the class selector ".", and should be faster
because it's doing less string manipulation in the XPath query.
This allows us to continue to generate XPath that uses standard XPath
functions, but also allow us to inject the preference to use the
optimized builtin implementation.

This also leaves open a path to extracting Nokogiri's CSS parser into
a separate gem, and allowing Nokogiri to inject builtin via a custom
XPathVisitor class.
Notably, on libxml2 CSS class queries are now ~2x faster using the
`nokogiri-builtin:css-class` function.

Closes #2135
@codeclimate
Copy link

codeclimate bot commented Dec 18, 2020

Code Climate has analyzed commit 5d0b7fe and detected 2 issues on this pull request.

Here's the issue category breakdown:

Category Count
Complexity 2

The test coverage on the diff in this pull request is 92.0% (80% is the threshold).

This pull request will bring the total coverage in the repository to 94.2% (0.0% change).

View more on Code Climate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant