Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JS: Add ECMAScript 2024 v Flag Operators for Regex Parsing #18899

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

Napalys
Copy link
Contributor

@Napalys Napalys commented Feb 28, 2025

This pull request adds support for parsing ECMAScript 2024 v flag operators, including:

  • Nested Classes: Enables using nested character classes in regexes.
    Example: /[[abc][cz]]/v
  • Intersection (&&): Matches characters common to both sets.
    Example: /[[abc]&&[cz]]/v
  • Subtraction (--): Removes characters from a set.
    Example: /[[abc]--[cz]]/v
    Mixing operations at the same level is not allowed:
    • Invalid: /[[abc]&&[cz]--[zz]]/v
    • Valid: /[[abc]&&[[cz]--[zz]]]/v
  • Union: Combines multiple sets.
    Example: /[[abc][cz]]/v
  • Quoted Strings (\q{}): Allows matching exact sequences.
    Example: /[\q{ab|cb|db}]/v

Commit by commit review encouraged.

If the regex flags are unknown and the parser encounters errors while parsing, I added a fallback method that attempts to parse it using a synthetic v flag 1854bf9.

Useful links:

With correct parsing, this no longer produces an false positive in Closes #18854.

@github-actions github-actions bot added the JS label Feb 28, 2025
@Napalys Napalys force-pushed the js/ecma-2024-regex branch from 84fddf1 to 94adaf8 Compare March 2, 2025 15:56
@Napalys Napalys force-pushed the js/ecma-2024-regex branch 2 times, most recently from 605456f to f93419e Compare March 2, 2025 18:24
@Napalys Napalys changed the title JS: WIP: Ecma 2024 regex JS: Add ECMAScript 2024 v Flag Operators for Regex Parsing Mar 3, 2025
@Napalys Napalys force-pushed the js/ecma-2024-regex branch from 6fe7753 to 430514b Compare March 3, 2025 12:00
@Napalys Napalys marked this pull request as ready for review March 3, 2025 13:17
@Copilot Copilot bot review requested due to automatic review settings March 3, 2025 13:17
@Napalys Napalys requested a review from a team as a code owner March 3, 2025 13:17
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Overview

This pull request introduces support for ECMAScript 2024 regex constructs under the new "v" flag. Key changes include:

  • New AST node classes for character class operations (Subtraction, QuotedString, Intersection, Union)
  • Enhancements to RegExpParser to conditionally enable nested character classes, new operators, and quoted string parsing with a fallback mechanism when errors are encountered
  • New test inputs covering quoted strings, unions, intersections, subtractions, and nested character classes

Reviewed Changes

File Description
javascript/extractor/src/com/semmle/js/ast/regexp/CharacterClassSubtraction.java New AST node for subtraction operator in character classes
javascript/extractor/src/com/semmle/js/ast/regexp/CharacterClassQuotedString.java New AST node for handling quoted string escapes
javascript/extractor/src/com/semmle/js/ast/regexp/CharacterClassIntersection.java New AST node for intersection operator in character classes
javascript/extractor/src/com/semmle/js/ast/regexp/CharacterClassUnion.java New AST node for union operator in character classes
javascript/extractor/src/com/semmle/js/parser/RegExpParser.java Extended parser functionality to support the new "v" flag and corresponding regex operations
javascript/extractor/src/com/semmle/js/extractor/ASTExtractor.java and RegExpExtractor.java Updated extraction logic to accommodate new AST node types and conditional flag handling

Copilot reviewed 31 out of 31 changed files in this pull request and generated 2 comments.

Tip: Leave feedback on Copilot's review comments with the 👎 and 👍 buttons to help improve review quality. Learn more

@Napalys Napalys force-pushed the js/ecma-2024-regex branch from 78aa5dc to 9e1f050 Compare March 3, 2025 13:38
Copy link
Contributor

@asgerf asgerf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work! I have a couple of comments to keep you busy during the week 😄

import java.util.List;

public class CharacterClassIntersection extends RegExpTerm {
private final List<RegExpTerm> intersections;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to elements and likewise for the other lists in the new AST classes

// If the regular expression was created using RegExp(), the flags might be unknown.
// In this case, we will also attempt to parse it using the "v" (Unicode sets) flag.
String flagsStr = isRegExprCall ? null : source.substring(source.lastIndexOf('/') + 1);
regexpExtractor.extract(source.substring(1, source.lastIndexOf('/')), sourceMap, nd, false, flagsStr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things here:

Normally the argument to new RegExp() will be a string, e.g. new RegExp("..."), but here's you're checking for the pattern where it is a regexp literal, e.g. new RegExp(/.../). This is an unusual pattern since the regexp literal already evalautes to a RegExp object and the call to new RegExp is redundant. It's more important to focus on the case where the argument is a string (which is handled in the if clause immediately below this one).

Second, the matching in isRegExpCall is a little too specific to the case where the enclosing statement has the shape var v = new RegExp(...). Information from parent nodes can be passed down via the Context class, which all the visit methods receive as the second parameter. This can be cumbersome I'll admit, but that's the way to go if we're going to do this. I'd say either add enough information to Context or leave this feature out of the PR. Given that the flags can in general be unknown I'd say it's fine to just limit the scope of this PR to focus on regexp literals.

@@ -82,6 +88,23 @@ public Result parse(String src) {
return new Result(root, errors);
}

public Result parse(String src) {
Result res = tryParse(src);
if(flags == null && !res.getErrors().isEmpty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible this will add too much overhead, since all string literals are parsed as regexps (in what we call "speculative parsing") and a lot of these string literals will have parse errors. This means many string literals will then be parsed twice in practice, even for projects that don't use v regexps at all. It would be nice if the first parsing attempt remembered if it has seen syntax that indicates likely v-syntax in the regexp, so we only have to retry when it makes sense.

In any case we should keep a close eye on extraction times from DCA.

sb.append(c);
}

String literal = sb.toString();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a StringBuilder could you implement this as src.substring(startPos, endPos) like in parseAtom()?

@@ -421,6 +483,12 @@ private RegExpTerm parseAtomEscape(SourceLocation loc, boolean inCharClass) {
return this.finishTerm(new NamedBackReference(loc, name, "\\k<" + name + ">"));
}

if (this.match("q{")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we check for the v flag here? It seems like the regexp /\q{4}/ will be parsed incorrectly now.

Same for the p{ case below, actually.

@@ -493,6 +561,7 @@ private RegExpTerm parseAtomEscape(SourceLocation loc, boolean inCharClass) {
}

private RegExpTerm parseCharacterClass() {
if (flags != null && flags.contains("v")) return parseNestedCharacterClass();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we'll have to check for the v flag multiple times during parsing, I'd say it's probably a good idea to store it in a boolean field.

Comment on lines +603 to +604
else if (lookahead("&&")) {
this.match("&&");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
else if (lookahead("&&")) {
this.match("&&");
else if (match("&&")) {

Comment on lines +607 to +608
else if (lookahead("--")) {
this.match("--");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
else if (lookahead("--")) {
this.match("--");
else if (match("--")) {

import com.semmle.js.ast.SourceLocation;
import java.util.List;

public class CharacterClassUnion extends RegExpTerm {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should merge this with the standard CharacterClass node since they're both unions.

// 2. We have more than one element
// 3. We have at least one complex element (i.e. a nested character class or a UnicodePropertyEscape)
if (containsComplex && classType == CharacterClassType.STANDARD && elements.size() > 1) {
classType = CharacterClassType.UNION;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As suggested earlier, let's merge the union and standard character classes. It would simplify the AST and we wouldn't need this check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JavaScript: false positive with unicode sets for character classes that contain brackets
2 participants