Skip to content

Commit

Permalink
fix: CSS queries with xpath-incompatible pseudo-classes raise an exce…
Browse files Browse the repository at this point in the history
…ption at query parse time (#3197)

**What problem is this PR intended to solve?**

fix: raise CSS::SyntaxError if a pseudo-class is not an XPath Name

Some pseudo-classes cannot be converted into an XPath function name, and
libxml2 will raise an Nokogiri::XML::XPath::SyntaxError at query-time:

```
nokogiri/xml/searchable.rb:238:in `evaluate': ERROR: Invalid expression: //*:div[nokogiri:-moz-drag-over(.)] (Nokogiri::XML::XPath::SyntaxError)
```

This change moves the error from query-time to parse-time, in the hopes
that this is more rescuable (and the error is more descriptive):

```
nokogiri/css/parser_extras.rb:86:in `on_error': unexpected '-' after ':' (Nokogiri::CSS::SyntaxError)
```

Closes #3193


**Have you included adequate test coverage?**

Yes.


**Does this change affect the behavior of either the C or the Java
implementations?**

N/A
  • Loading branch information
flavorjones committed May 24, 2024
2 parents fab511c + d3a60cb commit 1c50ff8
Show file tree
Hide file tree
Showing 7 changed files with 278 additions and 248 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,8 +15,9 @@ Nokogiri follows [Semantic Versioning](https://semver.org/), please see the [REA

### Fixed

* [CRuby] libgumbo (the HTML5 parser) treats reaching max-depth as EOF. This addresses a class of issues when the parser is interrupted in this way. [#3121] @stevecheckoway
* `Node#clone`, `NodeSet#clone`, and `*::Document#clone` all properly copy the metaclass of the original as expected. Previously, `#clone` had been aliased to `#dup` for these classes (since v1.3.0 in 2009). [#316, #3117] @flavorjones
* CSS queries for pseudo-selectors that cannot be transpiled into XPath queries now raise a more descriptive `Nokogiri::CSS::SyntaxError` when they are parsed. Previously, an invalid XPath query was created and a hard-to-understand XPath error was being raised by the query engine. [#3197] @flavorjones
* [CRuby] libgumbo (the HTML5 parser) treats reaching max-depth as EOF. This addresses a class of issues when the parser is interrupted in this way. [#3121] @stevecheckoway
* [CRuby] Update node GC lifecycle to avoid a potential memory leak with fragments in libxml 2.13.0 caused by changes in `xmlAddChild`. [#3156] @flavorjones


Expand Down
469 changes: 242 additions & 227 deletions lib/nokogiri/css/parser.rb

Large diffs are not rendered by default.

14 changes: 9 additions & 5 deletions lib/nokogiri/css/parser.y
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
class Nokogiri::CSS::Parser

token FUNCTION INCLUDES DASHMATCH LBRACE HASH PLUS GREATER S STRING IDENT
token FUNCTION INCLUDES DASHMATCH LBRACE HASH PLUS MINUS GREATER S STRING IDENT
token COMMA NUMBER PREFIXMATCH SUFFIXMATCH SUBSTRINGMATCH TILDE NOT_EQUAL
token SLASH DOUBLESLASH NOT EQUAL RPAREN LSQUARE RSQUARE HAS

Expand Down Expand Up @@ -143,13 +143,17 @@ rule
raise Racc::ParseError, "parse error on IDENT '#{val[1]}'"
end
}
| IDENT PLUS NUMBER { # n+3, -n+3
| IDENT PLUS NUMBER { # n+3
if val[0] == 'n'
val.unshift("1")
result = Node.new(:NTH, val)
elsif val[0] == '-n'
val[0] = 'n'
val.unshift("-1")
else
raise Racc::ParseError, "parse error on IDENT '#{val[0]}'"
end
}
| MINUS IDENT PLUS NUMBER { # -n+3
if val[1] == 'n'
val[0] = '-1'
result = Node.new(:NTH, val)
else
raise Racc::ParseError, "parse error on IDENT '#{val[1]}'"
Expand Down
11 changes: 7 additions & 4 deletions lib/nokogiri/css/tokenizer.rb
Original file line number Diff line number Diff line change
Expand Up @@ -63,13 +63,13 @@ def _next_token
when (text = @ss.scan(/has\([\s]*/))
action { [:HAS, text] }

when (text = @ss.scan(/-?([_A-Za-z]|[^\0-\177]|\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f])([_A-Za-z0-9-]|[^\0-\177]|\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f])*\([\s]*/))
when (text = @ss.scan(/([_A-Za-z]|[^\0-\177]|(\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f]))([_A-Za-z0-9-]|[^\0-\177]|(\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f]))*\([\s]*/))
action { [:FUNCTION, text] }

when (text = @ss.scan(/-?([_A-Za-z]|[^\0-\177]|\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f])([_A-Za-z0-9-]|[^\0-\177]|\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f])*/))
when (text = @ss.scan(/([_A-Za-z]|[^\0-\177]|(\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f]))([_A-Za-z0-9-]|[^\0-\177]|(\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f]))*/))
action { [:IDENT, text] }

when (text = @ss.scan(/\#([_A-Za-z0-9-]|[^\0-\177]|\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f])+/))
when (text = @ss.scan(/\#([_A-Za-z0-9-]|[^\0-\177]|(\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f]))+/))
action { [:HASH, text] }

when (text = @ss.scan(/[\s]*~=[\s]*/))
Expand Down Expand Up @@ -120,6 +120,9 @@ def _next_token
when (text = @ss.scan(/-?([0-9]+|[0-9]*\.[0-9]+)/))
action { [:NUMBER, text] }

when (text = @ss.scan(/[\s]*\-[\s]*/))
action { [:MINUS, text] }

when (text = @ss.scan(/[\s]*\/\/[\s]*/))
action { [:DOUBLESLASH, text] }

Expand All @@ -132,7 +135,7 @@ def _next_token
when (text = @ss.scan(/[\s]+/))
action { [:S, text] }

when (text = @ss.scan(/"([^\n\r\f"]|\n|\r\n|\r|\f|[^\0-\177]|\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f])*(?<!\\)(?:\\{2})*"|'([^\n\r\f']|\n|\r\n|\r|\f|[^\0-\177]|\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f])*(?<!\\)(?:\\{2})*'/))
when (text = @ss.scan(/("([^\n\r\f"]|(\n|\r\n|\r|\f)|[^\0-\177]|(\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f]))*(?<!\\)(?:\\{2})*"|'([^\n\r\f']|(\n|\r\n|\r|\f)|[^\0-\177]|(\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f]))*(?<!\\)(?:\\{2})*')/))
action { [:STRING, text] }

when (text = @ss.scan(/./))
Expand Down
21 changes: 11 additions & 10 deletions lib/nokogiri/css/tokenizer.rex
Original file line number Diff line number Diff line change
Expand Up @@ -4,29 +4,29 @@ module CSS
class Tokenizer

macro
nl \n|\r\n|\r|\f
nl (\n|\r\n|\r|\f)
w [\s]*
nonascii [^\0-\177]
num -?([0-9]+|[0-9]*\.[0-9]+)
unicode \\[0-9A-Fa-f]{1,6}(\r\n|[\s])?

escape {unicode}|\\[^\n\r\f0-9A-Fa-f]
nmchar [_A-Za-z0-9-]|{nonascii}|{escape}
nmstart [_A-Za-z]|{nonascii}|{escape}
ident -?({nmstart})({nmchar})*
name ({nmchar})+
escape ({unicode}|\\[^\n\r\f0-9A-Fa-f])
nmchar ([_A-Za-z0-9-]|{nonascii}|{escape})
nmstart ([_A-Za-z]|{nonascii}|{escape})
name {nmstart}{nmchar}*
charref {nmchar}+
string1 "([^\n\r\f"]|{nl}|{nonascii}|{escape})*(?<!\\)(?:\\{2})*"
string2 '([^\n\r\f']|{nl}|{nonascii}|{escape})*(?<!\\)(?:\\{2})*'
string {string1}|{string2}
string ({string1}|{string2})
rule
# [:state] pattern [actions]
has\({w} { [:HAS, text] }
{ident}\({w} { [:FUNCTION, text] }
{ident} { [:IDENT, text] }
\#{name} { [:HASH, text] }
{name}\({w} { [:FUNCTION, text] }
{name} { [:IDENT, text] }
\#{charref} { [:HASH, text] }
{w}~={w} { [:INCLUDES, text] }
{w}\|={w} { [:DASHMATCH, text] }
{w}\^={w} { [:PREFIXMATCH, text] }
Expand All @@ -43,6 +43,7 @@ rule
{w}~{w} { [:TILDE, text] }
\:not\({w} { [:NOT, text] }
{num} { [:NUMBER, text] }
{w}\-{w} { [:MINUS, text] }
{w}\/\/{w} { [:DOUBLESLASH, text] }
{w}\/{w} { [:SLASH, text] }
Expand Down
3 changes: 2 additions & 1 deletion test/css/test_tokenizer.rb
Original file line number Diff line number Diff line change
Expand Up @@ -262,7 +262,8 @@ def test_scan_nth
[:IDENT, "x"],
[":", ":"],
[:FUNCTION, "nth-child("],
[:IDENT, "-n"],
[:MINUS, "-"],
[:IDENT, "n"],
[:PLUS, "+"],
[:NUMBER, "3"],
[:RPAREN, ")"],
Expand Down
5 changes: 5 additions & 0 deletions test/css/test_xpath_visitor.rb
Original file line number Diff line number Diff line change
Expand Up @@ -369,6 +369,11 @@ def assert_xpath(expecteds, asts)
assert_xpath("//*[not(@id='foo')]", parser.parse(":not(#foo)"))
assert_xpath("//*[count(preceding-sibling::*)=0]", parser.parse(":first-child"))
end

it "raises an exception for pseudo-classes that are not XPath Names" do
# see https://github.com/sparklemotion/nokogiri/issues/3193
assert_raises(Nokogiri::CSS::SyntaxError) { parser.parse("div:-moz-drag-over") }
end
end

describe "combinators" do
Expand Down

0 comments on commit 1c50ff8

Please sign in to comment.