JS: Add ECMAScript 2024 v Flag Operators for Regex Parsing #18899

Merged (27 commits), Mar 11, 2025

@Napalys Napalys commented Feb 28, 2025

This pull request adds support for parsing ECMAScript 2024 v flag operators, including:

  • Nested Classes: Enables using nested character classes in regexes.
    Example: /[[abc][cz]]/v
  • Intersection (&&): Matches characters common to both sets.
    Example: /[[abc]&&[cz]]/v
  • Subtraction (--): Removes characters from a set.
    Example: /[[abc]--[cz]]/v
    Mixing operations at the same level is not allowed:
    • Invalid: /[[abc]&&[cz]--[zz]]/v
    • Valid: /[[abc]&&[[cz]--[zz]]]/v
  • Union: Combines multiple sets.
    Example: /[[abc][cz]]/v
  • Quoted Strings (\q{}): Allows matching exact sequences.
    Example: /[\q{ab|cb|db}]/v

Commit-by-commit review is encouraged.

Useful links:

With correct parsing, this no longer produces a false positive. Closes #18854.

@github-actions github-actions bot added the JS label Feb 28, 2025
@Napalys Napalys force-pushed the js/ecma-2024-regex branch from 84fddf1 to 94adaf8 Compare March 2, 2025 15:56
@Napalys Napalys force-pushed the js/ecma-2024-regex branch 2 times, most recently from 605456f to f93419e Compare March 2, 2025 18:24
@Napalys Napalys changed the title JS: WIP: Ecma 2024 regex JS: Add ECMAScript 2024 v Flag Operators for Regex Parsing Mar 3, 2025
@Napalys Napalys force-pushed the js/ecma-2024-regex branch from 6fe7753 to 430514b Compare March 3, 2025 12:00
@Napalys Napalys marked this pull request as ready for review March 3, 2025 13:17
@Copilot Copilot AI review requested due to automatic review settings March 3, 2025 13:17
@Napalys Napalys requested a review from a team as a code owner March 3, 2025 13:17

@Copilot Copilot AI left a comment

PR Overview

This pull request introduces support for ECMAScript 2024 regex constructs under the new "v" flag. Key changes include:

  • New AST node classes for character class operations (Subtraction, QuotedString, Intersection, Union)
  • Enhancements to RegExpParser to conditionally enable nested character classes, new operators, and quoted string parsing with a fallback mechanism when errors are encountered
  • New test inputs covering quoted strings, unions, intersections, subtractions, and nested character classes
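
To illustrate the kind of tree such a parser might build, here is a rough sketch in JavaScript. It is not the extractor's Java code: the node type names mirror the class names listed in this PR, but the node shapes, the parseClass function, and the parsing strategy are all assumptions made for illustration. It handles only single characters, nested classes, \q{...}, and the && and -- operators; ranges and escapes are deliberately left out.

```javascript
// Hypothetical sketch: type names echo the PR's Java classes, everything else is assumed.
const OPERATOR_NODE_TYPES = {
  "&&": "CharacterClassIntersection",
  "--": "CharacterClassSubtraction",
};

// Parses one bracketed class starting at src[pos]; returns [node, nextPos].
function parseClass(src, pos = 0) {
  if (src[pos] !== "[") throw new SyntaxError("expected '[' at " + pos);
  pos++;
  const operands = [];
  let operator = null; // null means plain juxtaposition, i.e. union
  while (pos < src.length && src[pos] !== "]") {
    if (operands.length > 0) {
      const two = src.slice(pos, pos + 2);
      const found = two === "&&" || two === "--" ? two : null;
      // The first separator fixes the operator; later ones must agree with it.
      if (operands.length === 1) operator = found;
      else if (found !== operator) throw new SyntaxError("mixed operators at " + pos);
      if (found !== null) {
        pos += 2;
        if (pos >= src.length || src[pos] === "]") {
          throw new SyntaxError("missing operand at " + pos);
        }
      }
    }
    let node;
    if (src[pos] === "[") {
      [node, pos] = parseClass(src, pos); // nested class
    } else if (src.startsWith("\\q{", pos)) {
      const end = src.indexOf("}", pos);
      if (end === -1) throw new SyntaxError("unterminated \\q{ at " + pos);
      node = {
        type: "CharacterClassQuotedString",
        alternatives: src.slice(pos + 3, end).split("|"),
      };
      pos = end + 1;
    } else {
      node = { type: "Char", value: src[pos] };
      pos++;
    }
    operands.push(node);
  }
  if (src[pos] !== "]") throw new SyntaxError("unterminated class");
  const type = operator === null ? "CharacterClassUnion" : OPERATOR_NODE_TYPES[operator];
  return [{ type, operands }, pos + 1];
}

const [tree] = parseClass("[[abc]&&[[cz]--[zz]]]");
console.log(tree.type);             // CharacterClassIntersection
console.log(tree.operands[1].type); // CharacterClassSubtraction
```

Tracking a single operator per bracket level is what makes /[[abc]&&[cz]--[zz]]/v an error in this sketch while the nested form parses.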

Reviewed Changes

File: Description
  • javascript/extractor/src/com/semmle/js/ast/regexp/CharacterClassSubtraction.java: New AST node for subtraction operator in character classes
  • javascript/extractor/src/com/semmle/js/ast/regexp/CharacterClassQuotedString.java: New AST node for handling quoted string escapes
  • javascript/extractor/src/com/semmle/js/ast/regexp/CharacterClassIntersection.java: New AST node for intersection operator in character classes
  • javascript/extractor/src/com/semmle/js/ast/regexp/CharacterClassUnion.java: New AST node for union operator in character classes
  • javascript/extractor/src/com/semmle/js/parser/RegExpParser.java: Extended parser functionality to support the new "v" flag and corresponding regex operations
  • javascript/extractor/src/com/semmle/js/extractor/ASTExtractor.java and RegExpExtractor.java: Updated extraction logic to accommodate new AST node types and conditional flag handling

Copilot reviewed 31 out of 31 changed files in this pull request and generated 2 comments.

@Napalys Napalys force-pushed the js/ecma-2024-regex branch from 78aa5dc to 9e1f050 Compare March 3, 2025 13:38

@asgerf asgerf left a comment

Excellent work! I have a couple of comments to keep you busy during the week 😄

@Napalys Napalys force-pushed the js/ecma-2024-regex branch from d6df34e to 8558ead Compare March 5, 2025 08:33
@Napalys Napalys requested a review from asgerf March 5, 2025 11:10
@Napalys Napalys force-pushed the js/ecma-2024-regex branch from 6380ec8 to d40ff96 Compare March 10, 2025 10:17
@Napalys Napalys force-pushed the js/ecma-2024-regex branch from d40ff96 to f48eab9 Compare March 10, 2025 10:18
Co-authored-by: Asgerf <asgerf@github.com>
@Napalys Napalys force-pushed the js/ecma-2024-regex branch from a337863 to 9c8e0a5 Compare March 10, 2025 12:29
@Napalys Napalys requested review from asgerf and erik-krogh March 10, 2025 12:58

@erik-krogh erik-krogh left a comment

Nice work 👍

I didn't look through it thoroughly, I assume Asger did that.

Did you run database creation on the latest main of https://github.com/babel/babel and https://github.com/tc39/test262?
Those projects contain all kinds of valid and invalid syntax, so it's a nice test of whether something is horribly wrong.

Napalys commented Mar 11, 2025

For Babel, the extraction failed catastrophically on this Invalid Syntax File. However, after deleting it, the database was successfully created. I assume this is expected since the file contains invalid syntax?

The test262 database was created successfully without any issues.

@erik-krogh

No, that is not expected.
Database creation should succeed, but with some extracted syntax errors.

However, that seems to be unrelated to this PR.
But maybe you could look into fixing that crash later? (And bump the SHA for babel/babel in DCA in the process).

Co-authored-by: Erik Krogh Kristensen <erik-krogh@github.com>
erik-krogh previously approved these changes Mar 11, 2025
asgerf previously approved these changes Mar 11, 2025

@asgerf asgerf left a comment

🎉

Co-authored-by: Asger F <asgerf@github.com>
@Napalys Napalys dismissed stale reviews from asgerf and erik-krogh via a900f2c March 11, 2025 10:57
@Napalys Napalys merged commit a4f2264 into github:main Mar 11, 2025
14 checks passed
@Napalys Napalys deleted the js/ecma-2024-regex branch March 12, 2025 16:09
Successfully merging this pull request may close these issues.

JavaScript: false positive with unicode sets for character classes that contain brackets
3 participants