JS: Add ECMAScript 2024 v Flag Operators for Regex Parsing #18899

Merged
merged 27 commits into from
Mar 11, 2025
cb448db
Exposed flags to the regex parser
Napalys Mar 2, 2025
d162acf
Added quoted string \q parser test cases
Napalys Mar 2, 2025
ed418be
Add support for '\q{}' escape sequence in regular expressions.
Napalys Feb 28, 2025
ab7e08f
Added test cases for nested character class.
Napalys Feb 28, 2025
de6f3b1
Add additional test cases.
Napalys Feb 28, 2025
2333c53
Added ability to parse nested character classes while using `v` flag.
Napalys Mar 2, 2025
fa5093f
Added test cases for intersection
Napalys Mar 2, 2025
381b5eb
Added intersection support
Napalys Mar 2, 2025
ee83c42
Added test cases for subtraction `--`.
Napalys Mar 2, 2025
3664d50
Added support for `--` subtraction operator.
Napalys Mar 2, 2025
1e05f32
Added test cases for union.
Napalys Mar 2, 2025
fe6de2f
Added support for character class union in regex processing
Napalys Mar 3, 2025
c0202f6
Updated dbscheme
Napalys Mar 3, 2025
c7f03df
Added change note
Napalys Mar 3, 2025
9ea89cd
Added a test case from #18854
Napalys Mar 4, 2025
8099423
Renamed character class operators lists to `elements`.
Napalys Mar 5, 2025
8086c25
Removed `Union` as standard character class is already a union.
Napalys Mar 5, 2025
95d05ce
Now store `vFlagEnabled` instead of each time searching for it.
Napalys Mar 5, 2025
d884e5f
Upgraded `javascript` database schema
Napalys Mar 5, 2025
9cc2620
Add test cases for `v` flag operators in RegExp library-tests.
Napalys Mar 6, 2025
e0f20b2
Add RegExpIntersection class to support intersection terms in regex
Napalys Mar 7, 2025
8cbc0ae
Add `RegExpQuotedString` class to support quoted string escapes in regex
Napalys Mar 7, 2025
f48eab9
Add `RegExpSubtraction` class to support subtraction terms in regex
Napalys Mar 9, 2025
9c8e0a5
Applied changes from comments.
Napalys Mar 10, 2025
08c07f8
Improved documentation, removed union from change note.
Napalys Mar 11, 2025
3191b2c
Update javascript/extractor/src/com/semmle/js/parser/RegExpParser.java
Napalys Mar 11, 2025
a900f2c
Update javascript/ql/lib/change-notes/2025-03-03-regex-v.md
Napalys Mar 11, 2025
1,193 changes: 1,193 additions & 0 deletions javascript/downgrades/5b5db607d20c7b449cef2d1c926b24d77c69bebb/old.dbscheme

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
description: Add support for quoted string, intersection and subtraction
compatibility: backwards
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
package com.semmle.js.ast.regexp;

import com.semmle.js.ast.SourceLocation;
import java.util.List;

/**
* A character class intersection in a regular expression available only with the `v` flag.
* Example: [[abc]&&[ab]&&[b]] matches character `b` only.
*/
public class CharacterClassIntersection extends RegExpTerm {
private final List<RegExpTerm> elements;

public CharacterClassIntersection(SourceLocation loc, List<RegExpTerm> elements) {
super(loc, "CharacterClassIntersection");
this.elements = elements;
}

@Override
public void accept(Visitor v) {
v.visit(this);
}

public List<RegExpTerm> getElements() {
return elements;
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
package com.semmle.js.ast.regexp;

import com.semmle.js.ast.SourceLocation;

/**
* A quoted string escape sequence '\q{}' in a regular expression.
* This feature is part of ECMAScript 2024 and requires the 'v' flag.
*
* Example: [\q{abc|def}] creates a character class that matches either the string
* "abc" or "def". Within the quoted string, only the alternation operator '|' is supported.
*/
public class CharacterClassQuotedString extends RegExpTerm {
private final RegExpTerm term;

public CharacterClassQuotedString(SourceLocation loc, RegExpTerm term) {
super(loc, "CharacterClassQuotedString");
this.term = term;
}

public RegExpTerm getTerm() {
return term;
}

@Override
public void accept(Visitor v) {
v.visit(this);
}
}
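Note on the quoted string term: the Javadoc above says that only `|` is meaningful inside `\q{...}`. A minimal sketch of that splitting rule follows, assuming the body between the braces has already been extracted. `QuotedStringSketch` and `splitAlternatives` are hypothetical names used for illustration and are not part of the extractor.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of the extractor. */
final class QuotedStringSketch {
  /** Splits the body of a \q{...} escape on every '|' that is not preceded by an active backslash. */
  static List<String> splitAlternatives(String body) {
    List<String> parts = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean escaped = false;
    for (int i = 0; i < body.length(); i++) {
      char c = body.charAt(i);
      if (!escaped && c == '|') {
        // An unescaped '|' ends the current alternative.
        parts.add(current.toString());
        current.setLength(0);
        continue;
      }
      current.append(c);
      // A backslash escapes the next character unless it was itself escaped.
      escaped = !escaped && c == '\\';
    }
    parts.add(current.toString());
    return parts;
  }
}
```

The escape sequences are kept verbatim in each alternative, so `cc|\}a|cc` yields three alternatives with `\}a` in the middle.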
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
package com.semmle.js.ast.regexp;

import com.semmle.js.ast.SourceLocation;
import java.util.List;

/**
* A character class subtraction in a regular expression available only with the `v` flag.
* Example: [[abc]--[a]--[b]] matches character `c` only.
*/
public class CharacterClassSubtraction extends RegExpTerm {
private final List<RegExpTerm> elements;

public CharacterClassSubtraction(SourceLocation loc, List<RegExpTerm> elements) {
super(loc, "CharacterClassSubtraction");
this.elements = elements;
}

@Override
public void accept(Visitor v) {
v.visit(this);
}

public List<RegExpTerm> getElements() {
return elements;
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -61,4 +61,10 @@ public interface Visitor {
public void visit(ZeroWidthNegativeLookbehind nd);

public void visit(UnicodePropertyEscape nd);

public void visit(CharacterClassQuotedString nd);

public void visit(CharacterClassIntersection nd);

public void visit(CharacterClassSubtraction nd);
}
Original file line number Diff line number Diff line change
Expand Up @@ -600,7 +600,7 @@ public Label visit(Literal nd, Context c) {
SourceMap sourceMap =
SourceMap.legacyWithStartPos(
SourceMap.fromString(nd.getRaw()).offsetBy(0, offsets), startPos);
regexpExtractor.extract(source.substring(1, source.lastIndexOf('/')), sourceMap, nd, false);
regexpExtractor.extract(source.substring(1, source.lastIndexOf('/')), sourceMap, nd, false, source.substring(source.lastIndexOf('/'), source.length()));
} else if (nd.isStringLiteral()
&& !c.isInsideType()
&& nd.getRaw().length() < 1000
Expand Down
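Note on the flags argument: the changed call passes everything from the last `/` onward, so the flags string starts with the slash itself. That looks harmless here because the parser only checks whether the string contains `v` or `u`. A minimal sketch of splitting a literal into pattern and flags without the slash is shown below; `RegexLiteralSketch` and `split` are hypothetical names, not extractor code.

```java
/** Hypothetical helper for illustration only; not part of the extractor. */
final class RegexLiteralSketch {
  /** Returns {pattern, flags} for the raw text of a regex literal such as "/a+/gv". */
  static String[] split(String literal) {
    // Flags cannot contain '/', so the last slash is always the closing delimiter.
    int lastSlash = literal.lastIndexOf('/');
    if (literal.isEmpty() || literal.charAt(0) != '/' || lastSlash <= 0) {
      throw new IllegalArgumentException("not a regex literal: " + literal);
    }
    String pattern = literal.substring(1, lastSlash);
    String flags = literal.substring(lastSlash + 1);
    return new String[] {pattern, flags};
  }
}
```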
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,9 @@
import com.semmle.js.ast.regexp.Caret;
import com.semmle.js.ast.regexp.CharacterClass;
import com.semmle.js.ast.regexp.CharacterClassEscape;
import com.semmle.js.ast.regexp.CharacterClassQuotedString;
import com.semmle.js.ast.regexp.CharacterClassRange;
import com.semmle.js.ast.regexp.CharacterClassSubtraction;
import com.semmle.js.ast.regexp.Constant;
import com.semmle.js.ast.regexp.ControlEscape;
import com.semmle.js.ast.regexp.ControlLetter;
Expand All @@ -22,6 +24,7 @@
import com.semmle.js.ast.regexp.Group;
import com.semmle.js.ast.regexp.HexEscapeSequence;
import com.semmle.js.ast.regexp.IdentityEscape;
import com.semmle.js.ast.regexp.CharacterClassIntersection;
import com.semmle.js.ast.regexp.Literal;
import com.semmle.js.ast.regexp.NamedBackReference;
import com.semmle.js.ast.regexp.NonWordBoundary;
Expand Down Expand Up @@ -92,6 +95,9 @@ public RegExpExtractor(TrapWriter trapwriter, LocationManager locationManager) {
termkinds.put("ZeroWidthPositiveLookbehind", 25);
termkinds.put("ZeroWidthNegativeLookbehind", 26);
termkinds.put("UnicodePropertyEscape", 27);
termkinds.put("CharacterClassQuotedString", 28);
termkinds.put("CharacterClassIntersection", 29);
termkinds.put("CharacterClassSubtraction", 30);
}

private static final String[] errmsgs =
Expand Down Expand Up @@ -344,10 +350,32 @@ public void visit(CharacterClassRange nd) {
visit(nd.getLeft(), lbl, 0);
visit(nd.getRight(), lbl, 1);
}

@Override
public void visit(CharacterClassQuotedString nd) {
Label lbl = extractTerm(nd, parent, idx);
visit(nd.getTerm(), lbl, 0);
}

@Override
public void visit(CharacterClassIntersection nd) {
Label lbl = extractTerm(nd, parent, idx);
int i = 0;
for (RegExpTerm element : nd.getElements())
visit(element, lbl, i++);
}

@Override
public void visit(CharacterClassSubtraction nd) {
Label lbl = extractTerm(nd, parent, idx);
int i = 0;
for (RegExpTerm element : nd.getElements())
visit(element, lbl, i++);
}
}

public void extract(String src, SourceMap sourceMap, Node parent, boolean isSpeculativeParsing) {
Result res = parser.parse(src);
public void extract(String src, SourceMap sourceMap, Node parent, boolean isSpeculativeParsing, String flags) {
Result res = parser.parse(src, flags);
if (isSpeculativeParsing && res.getErrors().size() > 0) {
return;
}
Expand All @@ -364,4 +392,8 @@ public void extract(String src, SourceMap sourceMap, Node parent, boolean isSpec
this.emitLocation(err, lbl);
}
}

public void extract(String src, SourceMap sourceMap, Node parent, boolean isSpeculativeParsing) {
extract(src, sourceMap, parent, isSpeculativeParsing, "");
}
}
97 changes: 95 additions & 2 deletions javascript/extractor/src/com/semmle/js/parser/RegExpParser.java
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,9 @@
import com.semmle.js.ast.regexp.Caret;
import com.semmle.js.ast.regexp.CharacterClass;
import com.semmle.js.ast.regexp.CharacterClassEscape;
import com.semmle.js.ast.regexp.CharacterClassQuotedString;
import com.semmle.js.ast.regexp.CharacterClassRange;
import com.semmle.js.ast.regexp.CharacterClassSubtraction;
import com.semmle.js.ast.regexp.Constant;
import com.semmle.js.ast.regexp.ControlEscape;
import com.semmle.js.ast.regexp.ControlLetter;
Expand All @@ -18,6 +20,7 @@
import com.semmle.js.ast.regexp.Group;
import com.semmle.js.ast.regexp.HexEscapeSequence;
import com.semmle.js.ast.regexp.IdentityEscape;
import com.semmle.js.ast.regexp.CharacterClassIntersection;
import com.semmle.js.ast.regexp.NamedBackReference;
import com.semmle.js.ast.regexp.NonWordBoundary;
import com.semmle.js.ast.regexp.OctalEscape;
Expand All @@ -36,6 +39,7 @@
import com.semmle.js.ast.regexp.ZeroWidthPositiveLookbehind;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** A parser for ECMAScript 2018 regular expressions. */
Expand Down Expand Up @@ -67,6 +71,8 @@ public List<Error> getErrors() {
private List<Error> errors;
private List<BackReference> backrefs;
private int maxbackref;
private boolean vFlagEnabled = false;
private boolean uFlagEnabled = false;

/** Parse the given string as a regular expression. */
public Result parse(String src) {
Expand All @@ -82,6 +88,12 @@ public Result parse(String src) {
return new Result(root, errors);
}

public Result parse(String src, String flags) {
vFlagEnabled = flags != null && flags.contains("v");
uFlagEnabled = flags != null && flags.contains("u");
return parse(src);
}

private static String fromCodePoint(int codepoint) {
if (Character.isValidCodePoint(codepoint)) return new String(Character.toChars(codepoint));
// replacement character
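Note on flag gating: across this diff, `\q{...}` and nested character classes are enabled only with `v`, while `\p{...}` and `\P{...}` are enabled with either `u` or `v`. The sketch below restates that gating in one place, assuming the flags arrive as a plain string. `RegexFlagsSketch` and its method names are hypothetical and do not exist in the extractor.

```java
/** Hypothetical summary of the flag checks in this change; not part of the extractor. */
final class RegexFlagsSketch {
  private final boolean vFlag;
  private final boolean uFlag;

  RegexFlagsSketch(String flags) {
    // A null flags string is treated as "no flags", as in parse(src, flags).
    this.vFlag = flags != null && flags.indexOf('v') >= 0;
    this.uFlag = flags != null && flags.indexOf('u') >= 0;
  }

  /** \q{...} is recognised only with the v flag. */
  boolean allowsQuotedString() {
    return vFlag;
  }

  /** Nested classes and the && and -- operators are recognised only with the v flag. */
  boolean allowsNestedClasses() {
    return vFlag;
  }

  /** \p{...} and \P{...} are recognised with either the u or the v flag. */
  boolean allowsPropertyEscape() {
    return vFlag || uFlag;
  }
}
```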
Expand Down Expand Up @@ -277,6 +289,43 @@ private RegExpTerm parseTerm() {
return this.finishTerm(this.parseQuantifierOpt(loc, this.parseAtom()));
}

private RegExpTerm parseDisjunctionInsideQuotedString() {
SourceLocation loc = new SourceLocation(pos());
List<RegExpTerm> disjuncts = new ArrayList<>();
disjuncts.add(this.parseAlternativeInsideQuotedString());
while (this.match("|")) {
disjuncts.add(this.parseAlternativeInsideQuotedString());
}
if (disjuncts.size() == 1) return disjuncts.get(0);
return this.finishTerm(new Disjunction(loc, disjuncts));
}

private RegExpTerm parseAlternativeInsideQuotedString() {
SourceLocation loc = new SourceLocation(pos());
int startPos = this.pos;
boolean escaped = false;
while (true) {
// If we're at the end of the string, something went wrong.
if (this.atEOS()) {
this.error(Error.UNEXPECTED_EOS);
break;
}
// We can end parsing if we're not escaped and we see a `|` which would mean Alternation
// or `}` which would mean the end of the Quoted String.
if (!escaped && this.lookahead(null, "|", "}")) {
break;
}
char c = this.nextChar();
// Track whether the character is an escape character.
escaped = !escaped && (c == '\\');
}
String literal = src.substring(startPos, pos);
loc.setEnd(pos());
loc.setSource(literal);

return new Constant(loc, literal);
}

private RegExpTerm parseQuantifierOpt(SourceLocation loc, RegExpTerm atom) {
if (this.match("*")) return this.finishTerm(new Star(loc, atom, !this.match("?")));
if (this.match("+")) return this.finishTerm(new Plus(loc, atom, !this.match("?")));
Expand Down Expand Up @@ -421,7 +470,13 @@ private RegExpTerm parseAtomEscape(SourceLocation loc, boolean inCharClass) {
return this.finishTerm(new NamedBackReference(loc, name, "\\k<" + name + ">"));
}

if (this.match("p{", "P{")) {
if (vFlagEnabled && this.match("q{")) {
RegExpTerm term = parseDisjunctionInsideQuotedString();
this.expectRBrace();
return this.finishTerm(new CharacterClassQuotedString(loc, term));
}

if ((vFlagEnabled || uFlagEnabled) && this.match("p{", "P{")) {
String name = this.readIdentifier();
if (this.match("=")) {
value = this.readIdentifier();
Expand Down Expand Up @@ -493,6 +548,7 @@ private RegExpTerm parseAtomEscape(SourceLocation loc, boolean inCharClass) {
}

private RegExpTerm parseCharacterClass() {
if (vFlagEnabled) return parseNestedCharacterClass();
SourceLocation loc = new SourceLocation(pos());
List<RegExpTerm> elements = new ArrayList<>();

Expand All @@ -508,6 +564,43 @@ private RegExpTerm parseCharacterClass() {
return this.finishTerm(new CharacterClass(loc, elements, inverted));
}

private enum CharacterClassType {
STANDARD,
INTERSECTION,
SUBTRACTION
}

// The ECMAScript 2024 `v` flag allows nested character classes.
private RegExpTerm parseNestedCharacterClass() {
SourceLocation loc = new SourceLocation(pos());
List<RegExpTerm> elements = new ArrayList<>();
CharacterClassType classType = CharacterClassType.STANDARD;

this.match("[");
boolean inverted = this.match("^");
while (!this.match("]")) {
if (this.atEOS()) {
this.error(Error.EXPECTED_RBRACKET);
break;
}
if (lookahead("[")) elements.add(parseNestedCharacterClass());
else if (this.match("&&")) classType = CharacterClassType.INTERSECTION;
else if (this.match("--")) classType = CharacterClassType.SUBTRACTION;
else elements.add(this.parseCharacterClassElement());
}

// Create appropriate RegExpTerm based on the detected class type
switch (classType) {
case INTERSECTION:
return this.finishTerm(new CharacterClass(loc, Collections.singletonList(new CharacterClassIntersection(loc, elements)), inverted));
case SUBTRACTION:
return this.finishTerm(new CharacterClass(loc, Collections.singletonList(new CharacterClassSubtraction(loc, elements)), inverted));
case STANDARD:
default:
return this.finishTerm(new CharacterClass(loc, elements, inverted));
}
}
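Note on semantics: `parseNestedCharacterClass` only records which operator was seen and wraps the collected elements in a term. To make the intended meaning of the test inputs concrete, here is a minimal sketch that evaluates a nested class to the set of characters it matches. It assumes plain single characters only (no ranges, escapes, or `^` negation), and `NestedClassSketch` is a hypothetical name, not extractor code.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical evaluator for illustration only; handles literals, nesting, && and --. */
final class NestedClassSketch {
  private final String src;
  private int pos = 0;

  private NestedClassSketch(String src) {
    this.src = src;
  }

  /** Evaluates a class such as "[[abc]&&[bcd]]" to the characters it matches. */
  static Set<Character> evaluate(String src) {
    NestedClassSketch parser = new NestedClassSketch(src);
    Set<Character> result = parser.parseClass();
    if (parser.pos != src.length()) {
      throw new IllegalArgumentException("trailing input at " + parser.pos);
    }
    return result;
  }

  private Set<Character> parseClass() {
    if (pos >= src.length() || src.charAt(pos) != '[') {
      throw new IllegalArgumentException("expected '[' at " + pos);
    }
    pos++;
    Set<Character> result = new TreeSet<>();
    boolean first = true;
    char op = '|'; // union until an operator is seen
    while (true) {
      if (pos >= src.length()) throw new IllegalArgumentException("expected ']'");
      if (src.charAt(pos) == ']') {
        pos++;
        return result;
      }
      if (src.startsWith("&&", pos)) { op = '&'; pos += 2; continue; }
      if (src.startsWith("--", pos)) { op = '-'; pos += 2; continue; }
      Set<Character> operand = new TreeSet<>();
      if (src.charAt(pos) == '[') operand = parseClass();
      else operand.add(src.charAt(pos++));
      // Operands are combined left to right with the most recent operator.
      if (first || op == '|') result.addAll(operand);
      else if (op == '&') result.retainAll(operand);
      else result.removeAll(operand);
      first = false;
    }
  }
}
```

Under these assumptions, `[[abc]&&[bcd]&&[c]]` evaluates to `c` and `[[abc]--[cbd]--[bde]]` evaluates to `a`, which agrees with the comments in the intersection test inputs.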

private static final List<String> escapeClasses = Arrays.asList("d", "D", "s", "S", "w", "W");

private RegExpTerm parseCharacterClassElement() {
Expand All @@ -519,7 +612,7 @@ private RegExpTerm parseCharacterClassElement() {
return atom;
}
}
if (!this.lookahead("-]") && this.match("-") && !(atom instanceof CharacterClassEscape))
if (!this.lookahead("-]") && !this.lookahead("--") && this.match("-") && !(atom instanceof CharacterClassEscape))
return this.finishTerm(new CharacterClassRange(loc, atom, this.parseCharacterClassAtom()));
return atom;
}
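Note on the extra lookahead: with the `v` flag, a `-` after an atom may start a range (`a-z`), be a trailing literal (`a-]`), or begin the `--` operator. The changed condition excludes the last two cases before consuming the dash. A minimal sketch of that check on a plain string follows; `ClassDashSketch` and `startsRange` are hypothetical names, and the `CharacterClassEscape` condition from the original line is left out.

```java
/** Hypothetical helper for illustration only; not part of the extractor. */
final class ClassDashSketch {
  /** True if the text at pos is a '-' that starts a range rather than "-]" or "--". */
  static boolean startsRange(String src, int pos) {
    return src.startsWith("-", pos)
        && !src.startsWith("-]", pos) // trailing dash is a literal
        && !src.startsWith("--", pos); // double dash is the subtraction operator
  }
}
```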
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
/^p(ost)?[ |\.]*o(ffice)?[ |\.]*(box)?[ 0-9]*[^[a-z ]]*/g;
/([ ]*[a-z0-9&#*=?@\\><:,()$[\]_.{}!+%^-]+)+X/;
7 changes: 7 additions & 0 deletions javascript/extractor/tests/es2024/input/intersection.js
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
/[[abc]&&[bcd]]/v; // Valid use of intersection operator, matches b or c
/abc&&bcd/v; // Valid regex, but no intersection operation: matches the literal string "abc&&bcd"
/[abc]&&[bcd]/v; // Valid regex, but incorrect intersection operation:
// - Matches a single character from [abc]
// - Then the literal "&&"
// - Then a single character from [bcd]
/[[abc]&&[bcd]&&[c]]/v; // Valid use of intersection operator, matches c
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
/[[]]/v; // Nested character classes were previously not allowed; they are valid with the v flag.
/[[a]]/v;
/[ [] [ [] [] ] ]/v;
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
/[\q{abc}]/v;
/[\q{abc|cbd|dcb}]/v;
/[\q{\}}]/v;
/[\q{\{}]/v;
/[\q{cc|\}a|cc}]/v;
3 changes: 3 additions & 0 deletions javascript/extractor/tests/es2024/input/subtraction.js
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
/[\p{Script_Extensions=Greek}--\p{Letter}]/v;
/[[abc]--[cbd]]/v;
/[[abc]--[cbd]--[bde]]/v;
1 change: 1 addition & 0 deletions javascript/extractor/tests/es2024/input/test.js
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
const regex = /\b(?:https?:\/\/|mailto:|www\.)(?:[\S--[\p{P}<>]]|\/|[\S--[\[\]]]+[\S--[\p{P}<>]])+|\b[\S--[@\p{Ps}\p{Pe}<>]]+@([\S--[\p{P}<>]]+(?:\.[\S--[\p{P}<>]]+)+)/gmv;
6 changes: 6 additions & 0 deletions javascript/extractor/tests/es2024/input/union.js
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
/[\p{Script_Extensions=Greek}\p{RGI_Emoji}]/v;
/[[abc][cbd]]/v;
/[\p{Emoji}\q{a&}byz]/v;
/[\q{\\\}a&}byz]/v;
/[\q{\\}]/v;
/[\q{abc|cbd|\}}]/v;
3 changes: 3 additions & 0 deletions javascript/extractor/tests/es2024/options.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
{
"experimental": true
}