Fix term prefix/postfix regular expressions #715
Merged
If compromise gets bundled, packed, minified, or transpiled into an aggregate JavaScript file, the term prefix/postfix regular expressions can change semantically, because they aren't strictly correct.

In particular, these two lines include the escapes \& and \•, which aren't necessary. The first passes through just fine, but \• gets converted (in my case) to \\u2022. Since this is inside a character class ([]), the prefix/postfix now matches \, u, 2, and 0. This is not what is expected. The minimal fix is simply to remove the unneeded \ non-escapes; this seems to be enough in my case.

An open question is whether the expressions are what they should be, but that's a more difficult question. The code indicates that the list of punctuation comes from Wikipedia, but I don't feel that's a normative reference for punctuation. I think a more robust expression might be something like:
/(\s|\p{General_Category=Punctuation}|\p{General_Category=Symbol}|º|ª)+/u
... which is "all whitespace, all punctuation (as per Unicode), and all symbols (as per Unicode), plus the male and female ordinal characters (which are categorized as letters, sadly)". This seems to cover all of the characters from the original expressions, plus several more. It also avoids hard-to-notice differences between the startings and endings expressions (once they are anchored as needed). If you really want different starting/ending punctuation, the open-/close-specific Unicode categories can be explicitly listed in the appropriate expression.

Just a thought for a future fix/improvement. :-)
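For anyone who wants to reproduce the behaviour described above, here is a minimal sketch. The two short character classes are hypothetical stand-ins (the real compromise expressions list many more characters), and the anchored startings/endings forms of the proposed expression are my assumption about how it would be used, not code from this change. It needs a runtime that supports Unicode property escapes with the u flag.

```javascript
// Hypothetical, shortened stand-ins for the character classes discussed above.
const intended = /^[&•]+/;        // the class once the needless backslashes are removed
const mangled = /^[\&\\u2022]+/;  // what a minifier can emit for [\&\•]: the class is now & \ u 2 0

// The proposed expression, anchored for prefix and postfix use (assumed usage).
const startings = /^(?:\s|\p{General_Category=Punctuation}|\p{General_Category=Symbol}|º|ª)+/u;
const endings = /(?:\s|\p{General_Category=Punctuation}|\p{General_Category=Symbol}|º|ª)+$/u;

// Return the matched prefix/postfix, or an empty string when nothing matches.
function affix(re, text) {
  const m = text.match(re);
  return m ? m[0] : "";
}

console.log(JSON.stringify(affix(intended, "2020 vision"))); // ""
console.log(JSON.stringify(affix(mangled, "2020 vision")));  // "2020" (digits wrongly treated as punctuation)
console.log(JSON.stringify(affix(mangled, "•item")));        // "" (the bullet itself no longer matches)
console.log(JSON.stringify(affix(startings, "«hello»")));    // "«"
console.log(JSON.stringify(affix(endings, "«hello»")));      // "»"
console.log(JSON.stringify(affix(endings, "1ª")));           // "ª"
```

The mangled class both gains characters it should not have (\, u, 2, 0) and loses the bullet it was meant to contain, which is why the change is easy to miss until a term starting with a digit shows up.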