Fix term prefix/postfix regular expressions #715
Merged
spencermountain merged 1 commit into spencermountain:dev on Feb 21, 2020
If compromise gets bundled/packed/minified/transpiled into an aggregate JavaScript file, it's possible for the term prefix/postfix regular expressions to change semantically, because they aren't strictly correct.
In particular these two lines include the escapes `\&` and `\•`, which aren't necessary. The first passes through just fine, but `\•` gets converted (in my case) to `\\u2022`. Since this is in the context of a charset (`[]`), it now means that the prefix/postfix will include `\`, `u`, `2`, and `0`. This is **not** what is expected. The minimal fix is simply to remove the unneeded `\` non-escapes; this seems to be enough in my case.
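To make the failure mode concrete, here is a minimal sketch. The three expressions are simplified stand-ins, not the actual compromise regexes, and `mangled` is hand-written to mimic the rewrite described above rather than produced by any particular bundler.

```javascript
// Simplified stand-ins for the real expressions (assumption: the real
// charsets list many more characters than these two).

// Original style: \& and \• are unnecessary escapes. Without the `u` flag
// they are treated as the literal characters & and •.
const original = /^[\&\•]+/;

// What the charset effectively becomes if a tool rewrites \• as \\u2022:
// it now contains &, a backslash, and the characters u, 2 and 0.
const mangled = /^[\&\\u2022]+/;

// The minimal fix: drop the backslashes so there is nothing to rewrite.
const fixed = /^[&•]+/;

// The bullet is only recognised by the original and fixed versions.
console.log(original.test('•item')); // true
console.log(mangled.test('•item'));  // false
console.log(fixed.test('•item'));    // true

// An ordinary word starting with "u" is wrongly treated as having a prefix.
console.log(original.test('under')); // false
console.log(mangled.test('under'));  // true
console.log(fixed.test('under'));    // false
```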
An open question is whether the expressions are what they _should_ be, but that's a more difficult question. The code indicates that the list of punctuation comes from Wikipedia, but I don't feel that's a normative reference for punctuation. I think a more robust expression might be something like:
```javascript
/(\s|\p{General_Category=Punctuation}|\p{General_Category=Symbol}|º|ª)+/u
```
... which is "all whitespace, all punctuation, and all symbols, plus the male and female ordinal characters (which are categorized as letters, sadly)". This seems to cover all of the characters from the original expressions, plus several more. It also avoids having different-but-hard-to-notice differences between the `startings` and `endings` expressions (once they are anchored as needed). If you really _want_ different starting/ending punctuation, the open-/close-specific Unicode categories can be explicitly listed in the appropriate expression.
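As a rough sketch of how the anchored versions might look: the names `startings` and `endings` come from the existing code, while `splitTerm`, `openers`, and `closers` are hypothetical helpers invented here for illustration. It assumes a runtime with Unicode property escapes (ES2018 or later).

```javascript
// Shared body: whitespace, any punctuation, any symbol, plus the two
// ordinal indicators (which Unicode categorizes as letters).
const body = String.raw`(?:\s|\p{General_Category=Punctuation}|\p{General_Category=Symbol}|º|ª)+`;

// Anchor the same body at either end of the string.
const startings = new RegExp('^' + body, 'u');
const endings = new RegExp(body + '$', 'u');

// Hypothetical helper: split a raw token into prefix, text, and postfix.
function splitTerm(str) {
  const pre = (str.match(startings) || [''])[0];
  const rest = str.slice(pre.length);
  const post = (rest.match(endings) || [''])[0];
  const text = rest.slice(0, rest.length - post.length);
  return { pre, text, post };
}

console.log(splitTerm('"hello!"')); // { pre: '"', text: 'hello', post: '!"' }
console.log(splitTerm('•item'));    // { pre: '•', text: 'item', post: '' }
console.log(splitTerm('1º'));       // { pre: '', text: '1', post: 'º' }

// If starts and ends should differ, the open/close categories can be listed
// explicitly: Ps/Pi are opening and initial-quote punctuation, Pe/Pf are
// closing and final-quote punctuation.
const openers = /^[\p{Ps}\p{Pi}]+/u;
const closers = /[\p{Pe}\p{Pf}]+$/u;
console.log(openers.test('(note)')); // true
console.log(closers.test('(note)')); // true
```

Note that this body also treats currency signs such as `$` as prefix characters, since they fall under the Symbol category; whether that is desirable for terms like `$5` is a separate decision.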
Just a thought for a future fix/improvement. :-)
Owner
whoa! thanks Jared - I always forget what needs escaping in and out of the ya- what a mess those regexes are. If I can remember, there were some subtle differences between what is allowed in a starting and an ending. I'd love for some help making this stronger and altogether more sensible. Please give it a spin! change is welcome.