-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JS: Add ECMAScript 2024 v
Flag Operators for Regex Parsing
#18899
base: main
Are you sure you want to change the base?
Conversation
84fddf1
to
94adaf8
Compare
605456f
to
f93419e
Compare
v
Flag Operators for Regex Parsing
6fe7753
to
430514b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR Overview
This pull request introduces support for ECMAScript 2024 regex constructs under the new "v" flag. Key changes include:
- New AST node classes for character class operations (Subtraction, QuotedString, Intersection, Union)
- Enhancements to RegExpParser to conditionally enable nested character classes, new operators, and quoted string parsing with a fallback mechanism when errors are encountered
- New test inputs covering quoted strings, unions, intersections, subtractions, and nested character classes
Reviewed Changes
File | Description |
---|---|
javascript/extractor/src/com/semmle/js/ast/regexp/CharacterClassSubtraction.java | New AST node for subtraction operator in character classes |
javascript/extractor/src/com/semmle/js/ast/regexp/CharacterClassQuotedString.java | New AST node for handling quoted string escapes |
javascript/extractor/src/com/semmle/js/ast/regexp/CharacterClassIntersection.java | New AST node for intersection operator in character classes |
javascript/extractor/src/com/semmle/js/ast/regexp/CharacterClassUnion.java | New AST node for union operator in character classes |
javascript/extractor/src/com/semmle/js/parser/RegExpParser.java | Extended parser functionality to support the new "v" flag and corresponding regex operations |
javascript/extractor/src/com/semmle/js/extractor/ASTExtractor.java and RegExpExtractor.java | Updated extraction logic to accommodate new AST node types and conditional flag handling |
Copilot reviewed 31 out of 31 changed files in this pull request and generated 2 comments.
Tip: Leave feedback on Copilot's review comments with the 👎 and 👍 buttons to help improve review quality. Learn more
javascript/extractor/src/com/semmle/js/parser/RegExpParser.java
Outdated
Show resolved
Hide resolved
javascript/extractor/src/com/semmle/js/parser/RegExpParser.java
Outdated
Show resolved
Hide resolved
78aa5dc
to
9e1f050
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work! I have a couple of comments to keep you busy during the week 😄
import java.util.List; | ||
|
||
public class CharacterClassIntersection extends RegExpTerm { | ||
private final List<RegExpTerm> intersections; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to elements
and likewise for the other lists in the new AST classes
// If the regular expression was created using RegExp(), the flags might be unknown. | ||
// In this case, we will also attempt to parse it using the "v" (Unicode sets) flag. | ||
String flagsStr = isRegExprCall ? null : source.substring(source.lastIndexOf('/') + 1); | ||
regexpExtractor.extract(source.substring(1, source.lastIndexOf('/')), sourceMap, nd, false, flagsStr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things here:
Normally the argument to new RegExp()
will be a string, e.g. new RegExp("...")
, but here's you're checking for the pattern where it is a regexp literal, e.g. new RegExp(/.../)
. This is an unusual pattern since the regexp literal already evalautes to a RegExp
object and the call to new RegExp
is redundant. It's more important to focus on the case where the argument is a string (which is handled in the if
clause immediately below this one).
Second, the matching in isRegExpCall
is a little too specific to the case where the enclosing statement has the shape var v = new RegExp(...)
. Information from parent nodes can be passed down via the Context
class, which all the visit methods receive as the second parameter. This can be cumbersome I'll admit, but that's the way to go if we're going to do this. I'd say either add enough information to Context
or leave this feature out of the PR. Given that the flags can in general be unknown I'd say it's fine to just limit the scope of this PR to focus on regexp literals.
@@ -82,6 +88,23 @@ public Result parse(String src) { | |||
return new Result(root, errors); | |||
} | |||
|
|||
public Result parse(String src) { | |||
Result res = tryParse(src); | |||
if(flags == null && !res.getErrors().isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible this will add too much overhead, since all string literals are parsed as regexps (in what we call "speculative parsing") and a lot of these string literals will have parse errors. This means many string literals will then be parsed twice in practice, even for projects that don't use v
regexps at all. It would be nice if the first parsing attempt remembered if it has seen syntax that indicates likely v
-syntax in the regexp, so we only have to retry when it makes sense.
In any case we should keep a close eye on extraction times from DCA.
sb.append(c); | ||
} | ||
|
||
String literal = sb.toString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of a StringBuilder
could you implement this as src.substring(startPos, endPos)
like in parseAtom()
?
@@ -421,6 +483,12 @@ private RegExpTerm parseAtomEscape(SourceLocation loc, boolean inCharClass) { | |||
return this.finishTerm(new NamedBackReference(loc, name, "\\k<" + name + ">")); | |||
} | |||
|
|||
if (this.match("q{")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we check for the v
flag here? It seems like the regexp /\q{4}/
will be parsed incorrectly now.
Same for the p{
case below, actually.
@@ -493,6 +561,7 @@ private RegExpTerm parseAtomEscape(SourceLocation loc, boolean inCharClass) { | |||
} | |||
|
|||
private RegExpTerm parseCharacterClass() { | |||
if (flags != null && flags.contains("v")) return parseNestedCharacterClass(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we'll have to check for the v
flag multiple times during parsing, I'd say it's probably a good idea to store it in a boolean
field.
else if (lookahead("&&")) { | ||
this.match("&&"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
else if (lookahead("&&")) { | |
this.match("&&"); | |
else if (match("&&")) { |
else if (lookahead("--")) { | ||
this.match("--"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
else if (lookahead("--")) { | |
this.match("--"); | |
else if (match("--")) { |
import com.semmle.js.ast.SourceLocation; | ||
import java.util.List; | ||
|
||
public class CharacterClassUnion extends RegExpTerm { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we should merge this with the standard CharacterClass
node since they're both unions.
// 2. We have more than one element | ||
// 3. We have at least one complex element (i.e. a nested character class or a UnicodePropertyEscape) | ||
if (containsComplex && classType == CharacterClassType.STANDARD && elements.size() > 1) { | ||
classType = CharacterClassType.UNION; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As suggested earlier, let's merge the union and standard character classes. It would simplify the AST and we wouldn't need this check.
This pull request adds support for parsing ECMAScript 2024
v
flag operators, including:Example:
/[[abc][cz]]/v
&&
): Matches characters common to both sets.Example:
/[[abc]&&[cz]]/v
--
): Removes characters from a set.Example:
/[[abc]--[cz]]/v
Mixing operations at the same level is not allowed:
/[[abc]&&[cz]--[zz]]/v
/[[abc]&&[[cz]--[zz]]]/v
Example:
/[[abc][cz]]/v
\q{}
): Allows matching exact sequences.Example:
/[\q{ab|cb|db}]/v
Commit by commit review encouraged.
If the regex flags are unknown and the parser encounters errors while parsing, I added a fallback method that attempts to parse it using a synthetic
v
flag 1854bf9.Useful links:
With correct parsing, this no longer produces an false positive in Closes #18854.