| 1 | +--- |
| 2 | +layout: doc |
| 3 | +title: Regular expressions |
| 4 | +--- |
| 5 | + |
| 6 | +[JavaScript regular expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions) are different from [Java regular expressions](https://docs.oracle.com/en/java/javase/15/docs/api/java.base/java/util/regex/Pattern.html). |
| 7 | +For `java.util.regex.Pattern` (and its derivatives like `scala.util.matching.Regex` and the `.r` method), Scala.js implements the semantics of Java regular expressions, although with some limitations. |
| 8 | +The semantics and feature set of JavaScript regular expressions are available through `js.RegExp`, like any other JavaScript API. |
| 9 | + |
| 10 | +## Support |
| 11 | + |
| 12 | +The set of supported features for `Pattern` depends on the target ECMAScript version, specified in `ESFeatures.esVersion`. |
| 13 | +By default, Scala.js targets ECMAScript 2015. |
| 14 | +It is possible to change that target with the following setting: |
| 15 | + |
| 16 | +{% highlight scala %} |
| 17 | +scalaJSLinkerConfig ~= (_.withESFeatures(_.withESVersion(ESVersion.ES2018))) |
| 18 | +{% endhighlight %} |
| 19 | + |
| 20 | +**Attention!** While this enables more features of regular expressions, it restricts your application to environments that support recent JavaScript features. |
| 21 | +If you maintain a library, this restriction applies to all downstream libraries and applications. |
| 22 | +We therefore recommend avoiding the additional features where possible, and implementing the extra logic in code instead. |
| 23 | + |
| 24 | +In particular, we recommend avoiding the `MULTILINE` flag, aka `(?m)`, which requires ES2018. |
| 25 | +We give some hints on how to avoid it below. |
| 26 | + |
| 27 | +### Not supported |
| 28 | + |
| 29 | +The following features are never supported: |
| 30 | + |
| 31 | +* the `CANON_EQ` flag, |
| 32 | +* the `\X`, `\b{g}` and `\N{...}` expressions, |
| 33 | +* `\p{In𝘯𝘢𝘮𝘦}` character classes representing Unicode *blocks*, |
| 34 | +* the `\G` boundary matcher, *except* if it appears at the very beginning of the regex (e.g., `\Gfoo`), |
| 35 | +* embedded flag expressions with inner groups, i.e., constructs of the form `(?idmsuxU-idmsuxU:𝑋)`, |
| 36 | +* embedded flag expressions without inner groups, i.e., constructs of the form `(?idmsuxU-idmsuxU)`, *except* if they appear at the very beginning of the regex (e.g., `(?i)abc` is accepted, but `ab(?i)c` is not), and |
| 37 | +* numeric "back" references to groups that are defined later in the pattern (note that even Java does not support *named* back references like that). |
| 38 | + |
| 39 | +### Conditionally supported |
| 40 | + |
| 41 | +The following features require `esVersion >= ESVersion.ES2015` (which is true by default): |
| 42 | + |
| 43 | +* the `UNICODE_CASE` flag. |
| 44 | + |
| 45 | +The following features require `esVersion >= ESVersion.ES2018` (which is false by default): |
| 46 | + |
| 47 | +* the `MULTILINE` and `UNICODE_CHARACTER_CLASS` flags, |
| 48 | +* look-behind assertions `(?<=𝑋)` and `(?<!𝑋)`, |
| 49 | +* the `\b` and `\B` expressions used together with the `UNICODE_CASE` flag, |
| 50 | +* `\p{𝘯𝘢𝘮𝘦}` expressions where `𝘯𝘢𝘮𝘦` is not one of the [POSIX character classes](https://docs.oracle.com/en/java/javase/15/docs/api/java.base/java/util/regex/Pattern.html#posix). |
| 51 | + |
| 52 | +### Always supported |
| 53 | + |
| 54 | +It is worth noting that, among others, the following features *are* supported in all cases, even when no equivalent feature exists in ECMAScript at all, or in the target version of ECMAScript: |
| 55 | + |
| 56 | +* correct handling of surrogate pairs (natively supported in ES 2015+), |
| 57 | +* the `\G` boundary matcher when it is at the beginning of the pattern (corresponding to the 'y' flag, natively supported in ES 2015+), |
| 58 | +* named groups and named back references (natively supported in ES 2018+), |
| 59 | +* the `DOTALL` flag (natively supported in ES 2018+), |
| 60 | +* ASCII case-insensitive matching (`CASE_INSENSITIVE` on but `UNICODE_CASE` off), |
| 61 | +* comments with the `COMMENTS` flag, |
| 62 | +* POSIX character classes in ASCII mode, or their Unicode variant with `UNICODE_CHARACTER_CLASS` (if the latter is itself supported, see above), |
| 63 | +* complex character classes with unions and intersections (e.g., `[a-z&&[^g-p]]`), |
| 64 | +* atomic groups `(?>𝑋)`, |
| 65 | +* possessive quantifiers `𝑋*+`, `𝑋++` and `𝑋?+`, |
| 66 | +* the `\A`, `\Z` and `\z` boundary matchers, |
| 67 | +* the `\R` expression, |
| 68 | +* embedded quotations with `\Q` and `\E`, both outside and inside character classes. |
| 69 | + |
| 70 | +All the supported features have the correct semantics from Java. |
| 71 | +This is true even for features that exist in JavaScript but with different semantics, among which: |
| 72 | + |
| 73 | +* the `^` and `$` boundary matchers with the `MULTILINE` flag (when the latter is supported), |
| 74 | +* the predefined character classes `\h`, `\s`, `\v`, `\w` and their negated variants, respecting the `UNICODE_CHARACTER_CLASS` flag, |
| 75 | +* the `\b` and `\B` boundary matchers, respecting the `UNICODE_CHARACTER_CLASS` flag, |
| 76 | +* the internal format of `\p{𝘯𝘢𝘮𝘦}` character classes, including the `\p{java𝘔𝘦𝘵𝘩𝘰𝘥𝘕𝘢𝘮𝘦}` classes, |
| 77 | +* octal escapes and control escapes. |
| 78 | + |
| 79 | +## Guarantees |
| 80 | + |
| 81 | +If a feature is not supported, a `PatternSyntaxException` is thrown at the time of `Pattern.compile()`. |
| 82 | + |
| 83 | +If `Pattern.compile()` succeeds, the regex is guaranteed to behave exactly like on the JVM, *except* for capturing groups within repeated segments (both for their back references and subsequent calls to `group`, `start` and `end`): |
| 84 | + |
| 85 | +* on the JVM, a capturing group always captures whatever substring was successfully matched last by that group during the processing of the regex: |
| 86 | + - even if it was in a previous iteration of a repeated segment and the last iteration did not have a match for that group, or |
| 87 | + - if it was during a later iteration of a repeated segment that was subsequently backtracked; |
| 88 | +* in JS and hence in Scala.js, capturing groups within repeated segments always capture what was matched (or not) during the last iteration that was eventually kept. |
| 89 | + |
| 90 | +The behavior of JavaScript is more "functional", whereas that of the JVM is more "imperative". |
| 91 | +This imperative nature is also reflected in the `hitEnd()` and `requireEnd()` methods of `Matcher`, which are not supported (they do not link). |
| 92 | + |
| 93 | +The behavior of the JVM does not appear to be specified, and is questionable. |
| 94 | +There are several open issues that argue it is buggy: |
| 95 | + |
| 96 | +* [JDK-8027747](https://bugs.openjdk.java.net/browse/JDK-8027747) |
| 97 | +* [JDK-8187083](https://bugs.openjdk.java.net/browse/JDK-8187083) |
| 98 | +* [JDK-8187080](https://bugs.openjdk.java.net/browse/JDK-8187080) |
| 99 | +* [JDK-8187082](https://bugs.openjdk.java.net/browse/JDK-8187082) |
| 100 | + |
| 101 | +Scala.js keeps the JavaScript behavior, and does not try to replicate the JVM behavior, which could come at great cost. |
| 102 | + |
| 103 | +## Avoiding the `MULTILINE` flag, aka `(?m)` |
| 104 | + |
| 105 | +The 'm' flag of JavaScript's `RegExp` is subtly different from that of Java's `Pattern`. |
| 106 | +It considers that the position in the middle of a `\r\n` sequence is both the beginning and end of a line, whereas `Pattern` considers that neither is true. |
| 107 | +The semantics of `Pattern` correspond to Unicode recommendations. |
| 108 | + |
| 109 | +In general, we cannot implement the `Pattern` behavior without look-behind assertions (`(?<=𝑋)`), which are only available in ECMAScript 2018+. |
| 110 | +However, in most concrete cases, it is possible to replace the usage of the 'm' flag with a combination of a) more complicated patterns and b) some ad hoc logic in the code using the regex. |
| 111 | + |
| 112 | +Consider the following simple example, which finds every line that consists of exactly `foo`, `bar` or the empty string, and prints it: |
| 113 | + |
| 114 | +{% highlight scala %} |
| 115 | +val regex = """(?m)^(foo|bar|)$""".r |
| 116 | +for (m <- regex.findAllMatchIn(input)) |
| 117 | + println(m.matched) |
| 118 | +{% endhighlight %} |
| 119 | + |
| 120 | +Assuming that, in the particular use case we are facing, only UNIX newlines can appear in the `input` string, we can rewrite the regex without the `(?m)` flag: |
| 121 | + |
| 122 | +{% highlight scala %} |
| 123 | +val regex2 = """(?:^|\n)(foo|bar|)(?=\n|$)""".r |
| 124 | +{% endhighlight %} |
| 125 | + |
| 126 | +`regex2` has exactly one match for each match of `regex`, and can therefore be used instead. |
| 127 | +However, the specific string being matched changes, since the newline characters are included in the matched substrings. |
| 128 | +The surrounding code can compensate for that discrepancy, using the capturing group in the middle: |
| 129 | + |
| 130 | +{% highlight scala %} |
| 131 | +for (m <- regex2.findAllMatchIn(input)) |
| 132 | + println(m.group(1)) // `group(1)` instead of `matched` |
| 133 | +{% endhighlight %} |
| 134 | + |
| 135 | +If other newline characters must be recognized, a more complicated pattern needs to be used. |
| 136 | +If it is acceptable to consider the position in the middle of `\r\n` as the start and end of a line (like JavaScript's `RegExp` does), the following regex works: |
| 137 | + |
| 138 | +{% highlight scala %} |
| 139 | +val regex3 = """(?:^|[\n\r\u0085\u2028\u2029])(foo|bar|)(?=[\n\r\u0085\u2028\u2029]|$)""".r |
| 140 | +for (m <- regex3.findAllMatchIn(input)) |
| 141 | + println(m.group(1)) |
| 142 | +{% endhighlight %} |
| 143 | + |
| 144 | +If not, invalid matches must be rejected a posteriori using ad hoc logic: |
| 145 | + |
| 146 | +{% highlight scala %} |
| 147 | +def isBetweenCRAndNL(i: Int): Boolean = |
| 148 | + i > 0 && i < input.length() && input.charAt(i - 1) == '\r' && input.charAt(i) == '\n' |
| 149 | + |
| 150 | +for { |
| 151 | + m <- regex3.findAllMatchIn(input) |
| 152 | + if !isBetweenCRAndNL(m.start(1)) && !isBetweenCRAndNL(m.end(1)) |
| 153 | +} { |
| 154 | + println(m.group(1)) |
| 155 | +} |
| 156 | +{% endhighlight %} |
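
The last two snippets in the diff above can be combined into one self-contained sketch, which may help when checking the behavior on sample inputs. The object name `LineMatches` and the method `matchedLines` are hypothetical names introduced here for illustration; the pattern is `regex3` from the document, and `input` is passed as a parameter instead of being captured from the enclosing scope.

```scala
import scala.util.matching.Regex

// Hypothetical wrapper around the document's `regex3` and CR-LF filter.
object LineMatches {
  // Same pattern as `regex3`: no MULTILINE flag, explicit line terminators.
  private val regex3: Regex =
    """(?:^|[\n\r\u0085\u2028\u2029])(foo|bar|)(?=[\n\r\u0085\u2028\u2029]|$)""".r

  // True if position `i` lies between the '\r' and the '\n' of a "\r\n" pair.
  private def isBetweenCRAndNL(input: String, i: Int): Boolean =
    i > 0 && i < input.length && input.charAt(i - 1) == '\r' && input.charAt(i) == '\n'

  // Returns the content of group 1 for every match that does not start or
  // end in the middle of a "\r\n" sequence.
  def matchedLines(input: String): List[String] =
    regex3.findAllMatchIn(input)
      .filter { m =>
        !isBetweenCRAndNL(input, m.start(1)) && !isBetweenCRAndNL(input, m.end(1))
      }
      .map(_.group(1))
      .toList
}
```

Under these assumptions, `LineMatches.matchedLines("foo\r\nbar")` yields `List("foo", "bar")`: the empty match that `regex3` alone would report between `\r` and `\n` is filtered out, while a genuinely empty line, as in `"foo\n\nbar"`, is still reported.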