New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify Unicode build scripts #220

Merged
merged 6 commits into from Feb 17, 2018

Conversation

Projects
None yet
3 participants
@mathiasbynens
Collaborator

mathiasbynens commented Feb 14, 2018

This uses node-unicode-data instead of my old Python scripts, reducing the need to maintain the Unicode data parser as part of this repository.

@@ -42,6 +43,8 @@
"browserify": "^12.0.1",
"eslint": "^3.19.0",
"jasmine": "^2.5.3",
"jsesc": "^2.5.1",
"unicode-9.0.0": "^0.7.4",

This comment has been minimized.

@mathiasbynens

mathiasbynens Feb 14, 2018

Collaborator

I’ve used unicode-9.0.0 (and not unicode-10.0.0) here to more clearly show the actual diff. We can easily update to unicode-10.0.0 later by just swapping out the dependency (i.e. changing this line) and then running npm run build-unicode-data again. No more manually updating Unicode data files and Python scripts!

This comment has been minimized.

@josephfrazier

josephfrazier Feb 14, 2018

Collaborator

Yeah, I think this is the way to go! I may also look into using a modified version of prettier-diff to ignore differences in the object key ordering.

This comment has been minimized.

@mathiasbynens

mathiasbynens Feb 14, 2018

Collaborator

I tried to make sure the order remains the same. AFAICT the only difference is that the property names are now quoted (jsesc does this), i.e. name: 'Foo' becomes 'name': 'Foo', and that the escaping for various characters is different.

bmp: '\u31C0-\u31EF'
},
{
name: 'InCJK_Symbols_and_Punctuation',

This comment has been minimized.

@mathiasbynens

mathiasbynens Feb 14, 2018

Collaborator

Turns out the canonical name is InCJK_Symbols_And_Punctuation, with a capital letter A. This is technically a breaking change.

This comment has been minimized.

@josephfrazier

josephfrazier Feb 14, 2018

Collaborator

Can we make it preserve the pre-existing names, so as to avoid the breaking change for now? Of course, we should probably open a separate issue about using the canonical names.

This comment has been minimized.

@slevithan

slevithan Feb 14, 2018

Owner

It's not a breaking change to change casing. Feel free to update it to the canonical name! Per Unicode's recommendations, XRegExp already treats Unicode property names as case insensitive, with any spaces, hyphens, and underscores ignored. See

// Generates a token lookup name: lowercase, with hyphens, spaces, and underscores removed
function normalize(name) {
return name.replace(/[- _]+/g, '').toLowerCase();
}

This comment has been minimized.

@mathiasbynens

mathiasbynens Feb 15, 2018

Collaborator

Oh, right! Woohoo :) (I was thinking about native Unicode property escapes, where I explicitly decided not to normalize.)

Patch updated.

This comment has been minimized.

@josephfrazier

josephfrazier Feb 15, 2018

Collaborator

Ooh, I wasn't aware of that. Nice!

@josephfrazier

This comment has been minimized.

Collaborator

josephfrazier commented Feb 14, 2018

Thanks for working on this! I had been thinking about it, but I hadn't yet worked out how to go about the port. For example, using jsesc and unicode-property-value-aliases would have taken me a while to figure out. I'll comment further inline.

josephfrazier added a commit to mathiasbynens/xregexp that referenced this pull request Feb 14, 2018

josephfrazier added a commit to mathiasbynens/xregexp that referenced this pull request Feb 14, 2018

@@ -132,7 +132,7 @@ const createBmpRange = (r, { addBrackets } = { addBrackets: true }) => {
const assemble = ({ name, alias, codePoints }) => {
const { bmp, astral, isBmpLast } = createRange(codePoints);
const result = { name };
const result = { name: name.replace('_And_', '_and_').replace('_For_', '_for_') };

This comment has been minimized.

@mathiasbynens

mathiasbynens Feb 14, 2018

Collaborator

FWIW, I’m against this change. IMHO we should use the canonical names, not an arbitrarily-cased variation of them. I understand the desire to merge this with the workaround but I’m hoping it can be removed later.

This comment has been minimized.

@josephfrazier

josephfrazier Feb 14, 2018

Collaborator

Absolutely, we should ultimately use the canonical names :) I just want to keep the refactoring separate from behavioral changes.

josephfrazier added a commit to mathiasbynens/xregexp that referenced this pull request Feb 14, 2018

josephfrazier added a commit to mathiasbynens/xregexp that referenced this pull request Feb 14, 2018

Sort Unicode categories by name
This makes it easier to diff against the Python output.
See slevithan#220 (comment)
Simplify Unicode build scripts
This uses node-unicode-data instead of Python scripts, reducing the
need to maintain the Unicode data parser as part of this repository.
@mathiasbynens

This comment has been minimized.

Collaborator

mathiasbynens commented Feb 15, 2018

@josephfrazier Thanks for the help here! I’ve squashed all commits to remove the casing changes that are not needed per @slevithan’s latest comment.

@josephfrazier

This comment has been minimized.

Collaborator

josephfrazier commented Feb 15, 2018

Thanks @mathiasbynens ! Now that we're using JS for the build scripts, I went ahead and added linting. We can always squash upon merge, instead of having to repeatedly force-push the branch. Let me know what you think.

@mathiasbynens

This comment has been minimized.

Collaborator

mathiasbynens commented Feb 15, 2018

I disagree with the linter on some stylistic choices 😅 But your changes LGTM. Thanks!

@josephfrazier

This comment has been minimized.

Collaborator

josephfrazier commented Feb 15, 2018

FWIW, the lint rules were only set up to enforce the pre-existing project style, and it sounds like @slevithan is open to changing the style (see #184), but we can do that separately.

Thanks for the review! I think I'm comfortable with the changes here, but would like @slevithan to sign off as well. If it helps, I found that using prettier-diff makes it easier to ignore the stylistic changes to tools/output/*js. The semantic changes appear to be:

  • Adding Cased_Letter to categories.js
  • Adding a lot of values to properties.js
  • Unescaping some characters in bmp values throughout
@slevithan

This comment has been minimized.

Owner

slevithan commented Feb 17, 2018

This looks awesome!

Cased_Letter is nice to add.

I'm not confident about adding all of the binary properties in unicode-properties.js. Many of them do not seem very useful (of course this is subjective), and a few of those that seem more useful already have similar equivalents in unicode-categories.js, or can be gotten close to using multiple categories there. The list so far has intentionally been only those 9 binary properties needed to meet the UTS 18 Level 1 RL1.2 requirements for Unicode regex support. See http://unicode.org/reports/tr18/#RL1.2, and note that Assigned is already supported through the special handling at

unicodeData.push({
name: 'Assigned',
// Since this is defined as the inverse of Unicode category Cn (Unassigned), the Unicode
// Categories addon is required to use this property
inverseOf: 'Cn'
});
.

But I'm fine with deferring to @mathiasbynens on this. Mathias, if you give the confirmation here that you think it's the right call to add them all, then we can move forward with that. I guess people who are worried about filesize have the option to leave out unicode-properties.js.

Either way, we should account for or update the handling for Assigned that I highlighted above. Do you think it's the right thing to do to continue using the special inverseOf handling for Assigned rather than explicitly including all the codepoints for it in unicode-properties.js?

Happy to use more global lint rules (I don't think there's benefit from significant differences for .eslintrc.js between root, tests, and tools/scripts), and to change lint rules in whatever ways you guys think is appropriate based on what enables us to get there and based on modern ES5+ conventions.

I'm including the list of binary properties being considered for addition below, for posterity. None of these have been supported to this point.

ASCII_Hex_Digit
Bidi_Control
Bidi_Mirrored
Case_Ignorable
Cased
Changes_When_Casefolded
Changes_When_Casemapped
Changes_When_Lowercased
Changes_When_NFKC_Casefolded
Changes_When_Titlecased
Changes_When_Uppercased
Composition_Exclusion
Dash
Deprecated
Diacritic
Expands_On_NFC
Expands_On_NFD
Expands_On_NFKC
Expands_On_NFKD
Extender
Full_Composition_Exclusion
Grapheme_Base
Grapheme_Extend
Grapheme_Link
Hex_Digit
Hyphen
IDS_Binary_Operator
IDS_Trinary_Operator
ID_Continue
ID_Start
Ideographic
Join_Control
Logical_Order_Exception
Math
Other_Alphabetic
Other_Default_Ignorable_Code_Point
Other_Grapheme_Extend
Other_ID_Continue
Other_ID_Start
Other_Lowercase
Other_Math
Other_Uppercase
Pattern_Syntax
Pattern_White_Space
Prepended_Concatenation_Mark
Quotation_Mark
Radical
Sentence_Terminal
Soft_Dotted
Terminal_Punctuation
Unified_Ideograph
Variation_Selector
XID_Continue
XID_Start
Provide only binary properties for UTS 18 Level 1 RL1.2 requirements …
…for Unicode regex support

See #220 (comment)

The list was made by running the following on the master branch:

    require('./tools/output/properties').map(o => o.name)
@josephfrazier

This comment has been minimized.

Collaborator

josephfrazier commented Feb 17, 2018

In the interest of making this change more of a refactor than a functional change, I pushed b8ebf6c to revert to the pre-existing list of properties. Regarding file sizes, here's some before/after this commit:

xregexp-all.js:

  • Before: 340K
  • After: 236K

xregexp-4.0.0-next.tgz (what npm install xregexp downloads):

  • Before: 148K
  • After: 132K

You can see these for yourself by running the following command (NOTE: this will delete any files untracked by git):

git clean -fxd && npm install && npm pack && du -sh xregexp*tgz && du -sh xregexp-all.js

Obviously, feel free to revert, @mathiasbynens :)

@slevithan slevithan merged commit 23cd568 into slevithan:master Feb 17, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@slevithan

This comment has been minimized.

Owner

slevithan commented Feb 17, 2018

That works, and thanks for the file size data.

Merging this, but @mathiasbynens feel free to follow up with a PR for the other properties if you'd like.

@mathiasbynens mathiasbynens deleted the mathiasbynens:update-scripts branch Feb 18, 2018

@mathiasbynens

This comment has been minimized.

Collaborator

mathiasbynens commented Feb 18, 2018

For the binary properties, I agree we don’t need to support all of them.

Here’s the diff between the list of properties that are supported in native JS \p{…} regular expressions and those in XRegExp: https://tc39.github.io/ecma262/#table-binary-unicode-properties

-ASCII_Hex_Digit
-Bidi_Control
-Bidi_Mirrored
-Case_Ignorable
-Cased
-Changes_When_Casefolded
-Changes_When_Casemapped
-Changes_When_Lowercased
-Changes_When_NFKC_Casefolded
-Changes_When_Titlecased
-Changes_When_Uppercased
-Dash
-Deprecated
-Diacritic
-Emoji
-Emoji_Component
-Emoji_Modifier
-Emoji_Modifier_Base
-Emoji_Presentation
-Extender
-Grapheme_Base
-Grapheme_Extend
-Hex_Digit
-IDS_Binary_Operator
-IDS_Trinary_Operator
-ID_Continue
-ID_Start
-Ideographic
-Join_Control
-Logical_Order_Exception
-Math
-Pattern_Syntax
-Pattern_White_Space
-Quotation_Mark
-Radical
-Regional_Indicator
-Sentence_Terminal
-Soft_Dotted
-Terminal_Punctuation
-Unified_Ideograph
-Variation_Selector
-White_Space
-XID_Continue
-XID_Start

Maybe we should document this difference somewhere, and refer developers to native \p{…} in case they need any of these properties?

@slevithan

This comment has been minimized.

Owner

slevithan commented Feb 18, 2018

@mathiasbynens, for reference, where do the differences in our lists come from? E.g. your list does not include the Expands_On_* properties, and mine didn't include the Emoji_* properties.

It's also interesting to see the aliases included for binary properties and scripts. XRegExp already uses its alias feature in unicode-categories.js, and could easily support e.g. Alpha, DI, Lower, NChar, Upper, and space, if these are generatable from unicode-property-value-aliases.

In the future, it would be great for XRegExp to defer to native \p{...} support when available, rewriting the property names and prefixes as necessary to support that. I'm also very open to breaking changes that make XRegExp stricter by matching ES2019 support/rules for \p{...} and or making XRegExp's support for \p{...} a subset of what is supported by ES2019. I'd also be fine with removing XRegExp's support for Unicode blocks, to match ES2019.

@mathiasbynens

This comment has been minimized.

Collaborator

mathiasbynens commented Feb 18, 2018

The list of properties supported in native \p{…} changed several times as the proposal evolved. tc39/proposal-regexp-unicode-property-escapes#27 provides some background on a subset of these changes. Expands_On_* are deprecated properties, so we decided not to support them for this new (to JS RegExp) API.

I'm also very open to breaking changes that make XRegExp stricter by matching ES2018 support/rules for \p{...} and or making XRegExp's support for \p{...} a subset of what is supported by ES2018.

+1 I was hoping we could do this!

(Nit: s/ES2018/ES2019/, not that the version number really matters…)

I'd also be fine with removing XRegExp's support for Unicode blocks, to match ES2018.

I was gonna suggest this in particular. Blocks are not a very useful grouping anyway, so it makes sense to get rid of them. Also see tc39/proposal-regexp-unicode-property-escapes#18 where this was discussed for native \p{…}.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment