Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix term prefix/postfix regular expressions #715

Merged

Conversation

JaredReisinger
Copy link
Contributor

If compromise gets bundled/packed/minified/transpiled into an aggregate Javascript file, it's possible for the term prefix/postfix regular expressions to change semantically, because they aren't strictly correct.

In particular these two lines include the escapes \& and \•, which aren't necessary. The first passes through just fine, but \• gets converted (in my case) to \\u2022. Since this is in the context of a charset ([]), it now means that the prefix/postfix will include \, u, 2, and 0. This is not what is expected. The minimal fix is simply to remove the unneeded \ non-escapes; this seems to be enough in my case.

Just for reference, I noticed this with nlp('today at 2:30pm').text('normal'). In development, this returns today at 2:30pm as expected. In production, I was getting today at 30pm because the 2: was treated as a starting prefix.


An open question is whether the expressions are what they should be, but that's a more difficult question. The code indicates that the list of punctuation comes from Wikipedia, but I don't feel that's a normative reference for punctuation. I think a more robust expression might be something like:

/(\s|\p{General_Category=Punctuation}|\p{General_Category=Symbol}|º|ª)+/u

... which is "all whitespace, all punctuation (as per Unicode), and all symbols (as per Unicode), plus the male and female ordinal characters (which are categorized as letters, sadly)". This seems to cover all of the characters from the original expressions, plus several more. It also avoids having different-but-hard-to-notice differences between the startings and endings expressions (once they are anchored as needed). If you really want different starting/ending punctuation, the open-/close-specific unicode categories can be explicitly listed in the appropriate expression.

Just a thought for a future fix/improvement. :-)

If compromise gets bundled/packed/minified/transpiled into an aggregate Javascript file, it's possible for the term prefix/postfix regular expressions to change semantically, because they aren't strictly correct.

In particular these two lines include the escapes `\&` and `\•`, which aren't necessary.  The first passes through just fine, but `\•` gets converted (in my case) to `\\u2022`.  Since this is in the context of a charset (`[]`), it now means that the prefix/postfix will include `\`, `u`, `2`, and `0`.  This is **not** what is expected.  The minimal fix is simply to remove the unneeded `\` non-escapes; this seems to be enough in my case.

An open question is whether the expressions are what they _should_ be, but that's a more difficult question.  The code indicates that the list of punctuation comes from Wikipedia, but I don't feel that's a normative reference for punctuation.  I think a more robust expression might be something like:

```javascript
/(\s|\p{General_Category=Punctuation}|\p{General_Category=Symbol}|º|ª)+/u
```

... which is "all whitespace, all punctuation, and all symbols, plus the male and female ordinal characters (which are categorized as letters, sadly)".  This seems to cover all of the characters from the original expressions, plus several more.  It also avoids having different-but-hard-to-notice differences between the `startings` and `endings` expressions (once they are anchored as needed).  If you really _want_ different starting/ending punctuation, the open-/close-specific unicode categories can be explicitly listed in the appropriate expression.

Just a thought for a future fix/improvement.  :-)
@spencermountain
Copy link
Owner

whoa! thanks Jared - I always forget what needs escaping in and out of the [] sections.
I didn't know that babel touches regex, either.

ya- what a mess those regexes are. If I can remember, there were some subtle differences between what is allowed in a starting and an ending. I'd love for some help making this stronger and altogether more sensible. Please give it a spin! change is welcome.
thanks!

@spencermountain spencermountain merged commit 3932c55 into spencermountain:dev Feb 21, 2020
@JaredReisinger JaredReisinger deleted the term-pre-post-regexps branch August 7, 2020 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants