Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: CSS queries with xpath-incompatible pseudo-classes raise an exception at query parse time #3197

Merged
merged 3 commits into from
May 24, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,8 +15,9 @@ Nokogiri follows [Semantic Versioning](https://semver.org/), please see the [REA

### Fixed

* [CRuby] libgumbo (the HTML5 parser) treats reaching max-depth as EOF. This addresses a class of issues when the parser is interrupted in this way. [#3121] @stevecheckoway
* `Node#clone`, `NodeSet#clone`, and `*::Document#clone` all properly copy the metaclass of the original as expected. Previously, `#clone` had been aliased to `#dup` for these classes (since v1.3.0 in 2009). [#316, #3117] @flavorjones
* CSS queries for pseudo-selectors that cannot be transpiled into XPath queries now raise a more descriptive `Nokogiri::CSS::SyntaxError` when they are parsed. Previously, an invalid XPath query was created and a hard-to-understand XPath error was being raised by the query engine. [#3197] @flavorjones
* [CRuby] libgumbo (the HTML5 parser) treats reaching max-depth as EOF. This addresses a class of issues when the parser is interrupted in this way. [#3121] @stevecheckoway
* [CRuby] Update node GC lifecycle to avoid a potential memory leak with fragments in libxml 2.13.0 caused by changes in `xmlAddChild`. [#3156] @flavorjones


Expand Down
469 changes: 242 additions & 227 deletions lib/nokogiri/css/parser.rb

Large diffs are not rendered by default.

14 changes: 9 additions & 5 deletions lib/nokogiri/css/parser.y
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
class Nokogiri::CSS::Parser

token FUNCTION INCLUDES DASHMATCH LBRACE HASH PLUS GREATER S STRING IDENT
token FUNCTION INCLUDES DASHMATCH LBRACE HASH PLUS MINUS GREATER S STRING IDENT
token COMMA NUMBER PREFIXMATCH SUFFIXMATCH SUBSTRINGMATCH TILDE NOT_EQUAL
token SLASH DOUBLESLASH NOT EQUAL RPAREN LSQUARE RSQUARE HAS

Expand Down Expand Up @@ -143,13 +143,17 @@ rule
raise Racc::ParseError, "parse error on IDENT '#{val[1]}'"
end
}
| IDENT PLUS NUMBER { # n+3, -n+3
| IDENT PLUS NUMBER { # n+3
if val[0] == 'n'
val.unshift("1")
result = Node.new(:NTH, val)
elsif val[0] == '-n'
val[0] = 'n'
val.unshift("-1")
else
raise Racc::ParseError, "parse error on IDENT '#{val[0]}'"
end
}
| MINUS IDENT PLUS NUMBER { # -n+3
if val[1] == 'n'
val[0] = '-1'
result = Node.new(:NTH, val)
else
raise Racc::ParseError, "parse error on IDENT '#{val[1]}'"
Expand Down
11 changes: 7 additions & 4 deletions lib/nokogiri/css/tokenizer.rb
Original file line number Diff line number Diff line change
Expand Up @@ -63,13 +63,13 @@ def _next_token
when (text = @ss.scan(/has\([\s]*/))
action { [:HAS, text] }

when (text = @ss.scan(/-?([_A-Za-z]|[^\0-\177]|\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f])([_A-Za-z0-9-]|[^\0-\177]|\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f])*\([\s]*/))
when (text = @ss.scan(/([_A-Za-z]|[^\0-\177]|(\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f]))([_A-Za-z0-9-]|[^\0-\177]|(\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f]))*\([\s]*/))
Dismissed Show dismissed Hide dismissed
action { [:FUNCTION, text] }

when (text = @ss.scan(/-?([_A-Za-z]|[^\0-\177]|\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f])([_A-Za-z0-9-]|[^\0-\177]|\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f])*/))
when (text = @ss.scan(/([_A-Za-z]|[^\0-\177]|(\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f]))([_A-Za-z0-9-]|[^\0-\177]|(\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f]))*/))
action { [:IDENT, text] }

when (text = @ss.scan(/\#([_A-Za-z0-9-]|[^\0-\177]|\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f])+/))
when (text = @ss.scan(/\#([_A-Za-z0-9-]|[^\0-\177]|(\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f]))+/))
action { [:HASH, text] }

when (text = @ss.scan(/[\s]*~=[\s]*/))
Expand Down Expand Up @@ -120,6 +120,9 @@ def _next_token
when (text = @ss.scan(/-?([0-9]+|[0-9]*\.[0-9]+)/))
action { [:NUMBER, text] }

when (text = @ss.scan(/[\s]*\-[\s]*/))
action { [:MINUS, text] }

when (text = @ss.scan(/[\s]*\/\/[\s]*/))
action { [:DOUBLESLASH, text] }

Expand All @@ -132,7 +135,7 @@ def _next_token
when (text = @ss.scan(/[\s]+/))
action { [:S, text] }

when (text = @ss.scan(/"([^\n\r\f"]|\n|\r\n|\r|\f|[^\0-\177]|\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f])*(?<!\\)(?:\\{2})*"|'([^\n\r\f']|\n|\r\n|\r|\f|[^\0-\177]|\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f])*(?<!\\)(?:\\{2})*'/))
when (text = @ss.scan(/("([^\n\r\f"]|(\n|\r\n|\r|\f)|[^\0-\177]|(\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f]))*(?<!\\)(?:\\{2})*"|'([^\n\r\f']|(\n|\r\n|\r|\f)|[^\0-\177]|(\\[0-9A-Fa-f]{1,6}(\r\n|[\s])?|\\[^\n\r\f0-9A-Fa-f]))*(?<!\\)(?:\\{2})*')/))
Dismissed Show dismissed Hide dismissed
Dismissed Show dismissed Hide dismissed
action { [:STRING, text] }

when (text = @ss.scan(/./))
Expand Down
21 changes: 11 additions & 10 deletions lib/nokogiri/css/tokenizer.rex
Original file line number Diff line number Diff line change
Expand Up @@ -4,29 +4,29 @@ module CSS
class Tokenizer

macro
nl \n|\r\n|\r|\f
nl (\n|\r\n|\r|\f)
w [\s]*
nonascii [^\0-\177]
num -?([0-9]+|[0-9]*\.[0-9]+)
unicode \\[0-9A-Fa-f]{1,6}(\r\n|[\s])?

escape {unicode}|\\[^\n\r\f0-9A-Fa-f]
nmchar [_A-Za-z0-9-]|{nonascii}|{escape}
nmstart [_A-Za-z]|{nonascii}|{escape}
ident -?({nmstart})({nmchar})*
name ({nmchar})+
escape ({unicode}|\\[^\n\r\f0-9A-Fa-f])
nmchar ([_A-Za-z0-9-]|{nonascii}|{escape})
nmstart ([_A-Za-z]|{nonascii}|{escape})
name {nmstart}{nmchar}*
charref {nmchar}+
string1 "([^\n\r\f"]|{nl}|{nonascii}|{escape})*(?<!\\)(?:\\{2})*"
string2 '([^\n\r\f']|{nl}|{nonascii}|{escape})*(?<!\\)(?:\\{2})*'
string {string1}|{string2}
string ({string1}|{string2})

rule

# [:state] pattern [actions]

has\({w} { [:HAS, text] }
{ident}\({w} { [:FUNCTION, text] }
{ident} { [:IDENT, text] }
\#{name} { [:HASH, text] }
{name}\({w} { [:FUNCTION, text] }
{name} { [:IDENT, text] }
\#{charref} { [:HASH, text] }
{w}~={w} { [:INCLUDES, text] }
{w}\|={w} { [:DASHMATCH, text] }
{w}\^={w} { [:PREFIXMATCH, text] }
Expand All @@ -43,6 +43,7 @@ rule
{w}~{w} { [:TILDE, text] }
\:not\({w} { [:NOT, text] }
{num} { [:NUMBER, text] }
{w}\-{w} { [:MINUS, text] }
{w}\/\/{w} { [:DOUBLESLASH, text] }
{w}\/{w} { [:SLASH, text] }

Expand Down
3 changes: 2 additions & 1 deletion test/css/test_tokenizer.rb
Original file line number Diff line number Diff line change
Expand Up @@ -262,7 +262,8 @@ def test_scan_nth
[:IDENT, "x"],
[":", ":"],
[:FUNCTION, "nth-child("],
[:IDENT, "-n"],
[:MINUS, "-"],
[:IDENT, "n"],
[:PLUS, "+"],
[:NUMBER, "3"],
[:RPAREN, ")"],
Expand Down
5 changes: 5 additions & 0 deletions test/css/test_xpath_visitor.rb
Original file line number Diff line number Diff line change
Expand Up @@ -369,6 +369,11 @@ def assert_xpath(expecteds, asts)
assert_xpath("//*[not(@id='foo')]", parser.parse(":not(#foo)"))
assert_xpath("//*[count(preceding-sibling::*)=0]", parser.parse(":first-child"))
end

it "raises an exception for pseudo-classes that are not XPath Names" do
# see https://github.com/sparklemotion/nokogiri/issues/3193
assert_raises(Nokogiri::CSS::SyntaxError) { parser.parse("div:-moz-drag-over") }
end
end

describe "combinators" do
Expand Down