Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add opt-in astral support to Unicode addons, without separate files #29

Closed
slevithan opened this issue Jun 12, 2012 · 26 comments
Closed

Comments

@slevithan
Copy link
Owner

@mathiasbynens, @walling, this issue picks up from #25, since that's now a closed/merged pull request.


Prior to merging the opt-in astral support from the Unicode Categories Astral addon into the (default) Unicode Categories addon (which is automatically included in xregexp-all.js, and therefore in the npm package), following are the changes I think would be beneficial:

Prerequisite:

Data for base categories like \p{L} needs to be added, not just \p{Ll}, \p{Lu}, etc.

Changes:

The XRegExp.addUnicodePackage function in Unicode Base should change from accepting an object with BMP data and an object with optional aliases to instead accept the following array:

[
    {
        name: 'Ll',
        alias: 'Lowercase_Letter', // optional; used to support full category names
        bmp: '0000-FFFF', // compressed BMP data or null
        astral: '010000-10FFFF' // compressed astral data or null
    },
    
]

The above data will be stored in the private unicode object, without any preprocessing. Two new private lookup objects will be added: bmp and astral. These won't be populated automatically, but instead augmented on first use of each Unicode name in a regex. In other words, these are used to cache generated data.

Astral ranges with surrogate pairs will be built and cached in JavaScript code, on first use.

For scripts and blocks that exist only within astral planes, the bmp property of the objects accepted by addUnicodePackage should be set to null. For addons that include astral support, the astral property should always be included, with null as the value for properties that have no astral code points. The astral property shouldn't be included by addons that don't yet include astral support.

The \p{…} (etc.) syntax token handler in Unicode Base should be updated to check XRegExp.isInstalled("astral") in its handler (main) function. If true, combine data from the bmp and astral objects, and throw a SyntaxError if the match scope is "class". The trigger function currently used in unicode-categories-astral.js will no longer be necessary.

Since the BMP and astral data will be split up, these changes shouldn't inflate unicode-categories.js too much. At least, BMP data will not be included twice.

With these changes in place, separate BMP and all-plane addons won't be needed, and users can opt in or out of astral support at any time.

@mathiasbynens
Copy link
Collaborator

I like it!

I’m just not sure using compressed data even for the astral ranges (and decompressing it in JavaScript) is the way to go. It just seems more efficient to simply include the “raw” ranges as strings, and then build the regular expressions as needed (possibly lazily/on-demand, as you suggested). I would be interested to see how much of a difference in file size the compressed ranges make after gzip. I assume the savings gained by compressing the data would be minimal, and wouldn’t be worth the overhead of having to decompress the data in JavaScript — but maybe that’s just me.

@slevithan
Copy link
Owner Author

I just did some tests that suggest you're probably right. These are very limited tests, but here's what I've got:

Unicode Categories 1.2.0 (BMP only):

Gzipped, unminified:
10,387 bytes.

Gzipped, unminified, with "\u" prefixes added and "\xhh" instead of "\u00hh" (where possible):
11,678 bytes (12.4% or 1.3KB larger)

Numbers for astral data are harder to generate, at the moment. Here is the astral portion of the So category, using the lightly compressed format I suggested earlier: http://pastebin.com/Lx1RahPi

After gzipping, that comes out at 229 bytes.

Here is the same data, as a pregenerated regex pattern with surrogate pairs: http://pastebin.com/C8LHSnMq

After gzipping, that's 250 bytes, or 9.2% larger.

I think I'm willing to live with these numbers. Lets go without compression, at least for astral data.

What do you think about BMP data, though? Should that also not be compressed? Are the modest savings not worth it? Keep in mind that the difference stacks up when loading all of the Unicode addon files. But then, not using any compression would mean avoiding any need for lazy caching. I'm leaning toward removing all compression, unless you or others disagree.

@mathiasbynens
Copy link
Collaborator

1.3 KB is something, but what really matters (IMHO) is the difference between gzipping and not gzipping after minification. Any numbers on that? I’m still leaning towards “not worth it”, but that’s just me :)

Either way, for BMP data it’s less work to decompress the data, so in that case run-time performance might not be much of an issue. But I definitely wouldn’t do it for the astral data.

Just my €.02.

@slevithan
Copy link
Owner Author

My rough guess is that removing BMP compression will add 4 or 5 KB to xregexp-all-min.js after gzipping. That's definitely something, but I think I can live with it, especially since "XRegExp All" is most useful on the server, where file size doesn't matter in the same way.

Adding minification to the numbers I posted earlier probably wouldn't make much difference, since unicode-categories.js includes far more Unicode data (which is not minifiable) than whitespace, etc.

OK, so astral compression is out. BMP compression could go either way, but since we're both leaning against it, let's leave that out, too.

What would be incredibly helpful is if you could update your XRegExp-specific category and script generators (and perhaps add a block generator, too) to match the following format. Separate BMP and all-plane generators are no longer needed. Since your unicode-data project doesn't yet include a generator for most of the properties in unicode-properties.js, never mind that one. That can stay BMP-only, for now.

If you're able to put up a web page somewhere that accepts a list of 21-bit code points and spits out a range with surrogate pairs (or if someone is able to point me to an existing page that does this), I can manually update the data in unicode-properties.js. There are only nine properties in that file, so manual updates without a generator would be fine.

Here's a format that I think will work well:

[
    {
        name: 'Name',
        alias: 'Alias_Name', // included only for categories
        noMerge: true, // included only when needed
        bmp: '\0-\uFFFF',
        astral: '[\0-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF]'
    },
    
]

Notes:

  • \xHH should be used instead of \u00HH.
  • The literal characters a-z, A-Z, and 0-9 can be used instead of \xHH. \0 can be used instead of \x00. These are nice-to-haves, but not required.
  • The bmp and astral properties should always be included. If either of them have no code points to list, their value should be null. This will allow supporting (third-party?) addons that add BMP-only or astral-only properties. When the bmp or astral property is absent, I plan to error on that name when in the absent mode.
  • The alias property should only be included for categories.
  • The noMerge property indicates that the astral value should be used exclusively when in astral mode, rather than merging the bmp and astral values. This should only be included for items that include orphaned high surrogate code points in the bmp value, since the astral version will need to push the high surrogates to the end of the alternation list. In other words, noMerge is only included when its value is true. When noMerge is present, the bmp code points should be duplicated in the astral value.
  • If you add a generator for blocks, their names should be prefixed with 'In'.
  • A nice-to-have for astral-only blocks would be to include readable ranges in comments after the regex pattern. E.g., astral: '[\uD800-\uDBFF][\uDC00-\uDFFF]' // \u{10000}-\u{10FFFF}.
  • Categories don't have to continue using the not-quite-alphabetical ordering that I currently use. They can be fully alphabetical (i.e., the C category can come first), if you prefer.
  • Use single quotes. XRegExp currently uses double quotes in code but single quotes in comments, which is needlessly inconsistent. I plan to switch to all single quotes in the next release.

No rush at all, for any of this. If you have suggestions for changing the generated output, I'm certainly interested.

You're welcome to update unicode-base.js, etc., to work with this new data, if you want. If you'd prefer, though, I'd be happy to take care of the addon files myself. It would be much harder for me to update the Python-based Unicode range generators on my own.

Thanks again for all the work you've done to make XRegExp's Unicode support kick serious internationalized booty. :-)

@mathiasbynens
Copy link
Collaborator

No worries, I’ll gladly update the XRegExp-specific range generators. Thanks for the detailed instructions.

@slevithan
Copy link
Owner Author

Much appreciated, good sir!

FYI, I've edited my last comment to add more details, including the noMerge property.

@mathiasbynens
Copy link
Collaborator

Some feedback:


The noMerge property indicates that the astral value should be used exclusively when in astral mode, rather than merging the bmp and astral values. This should only be included for items that include orphaned high surrogate code points in the bmp value, since the astral version will need to push the high surrogates to the end of the alternation list. In other words, noMerge is only included when its value is true. When noMerge is present, the bmp code points should be duplicated in the astral value.

If we’re gonna use an extra property anyway, why not split it up like this?

[
    {
        name: 'Name',
        alias: 'Alias_Name', // included only for categories
        bmp: '[BMP-only range]',
        astral: '[astral-only range, based on surrogate pairs]',
        orphanedHighSurrogates: '[range that matches the standalone high surrogates]'
    },
    
]

(I’m sure you can come up with a more concise name for orphanedHighSurrogates.)


\xHH should be used instead of \u00HH.
The literal characters a-z, A-Z, and 0-9 can be used instead of \xHH. \0 can be used instead of \x00. These are nice-to-haves, but not required.

http://git.io/unicode already takes care of all that. :) It also uses the U+0020 space symbol unescaped. I don’t unescape (or simplify the escapes of) other symbols, as the hexadecimal code point values are much easier to read in regular expression ranges IMHO. It looks like you agree.


The bmp and astral properties should always be included. If either of them have no code points to list, their value should be null. This will allow supporting (third-party?) addons that add BMP-only or astral-only properties. When the bmp or astral property is absent, I plan to error on that name when in the absent mode.

Why not just an empty string instead of null? It’s falsy too, can easily be detected (just like null), so you’d still be able to do all the things you mentioned — but it allows for easier regex building, because it’s just a matter of string concatenation then. If you’d use null you’d have to explicitly check for that, as String(null) == 'null' and not an empty string. In pseudo-code:

// Using `null`:
all = (bmp === null ? '' : bmp) + '|' + (astral === null ? '' : astral)  + '|' + (orphanedHighSurrogates === null ? '' : orphanedHighSurrogates) 
// Using empty strings:
all = bmp + '|' + astral + '|' + orphanedHighSurrogates

Pending your feedback on my above suggestions, I have the scripts ready to generate output for categories, blocks and scripts. Here are some numbers:

$ gz categories.js
orig size (bytes): 52063
gzipped size (bytes): 14208
$ gz scripts.js
orig size (bytes): 15930
gzipped size (bytes): 4667
$ gz blocks.js
orig size (bytes): 20117
gzipped size (bytes): 3328

If you're able to put up a web page somewhere that accepts a list of 21-bit code points and spits out a range with surrogate pairs (or if someone is able to point me to an existing page that does this), I can manually update the data in unicode-properties.js. There are only nine properties in that file, so manual updates without a generator would be fine.

If you can provide me with the code points as an array (e.g. [0x0, 0x1]) I could very easily run it through my script to generate a range based on that.

Some properties require parsing PropList.txt and UnicodeData.txt which is why I haven’t gotten to writing a script for this yet. Someday, though; someday…

@slevithan
Copy link
Owner Author

Thanks for all the feedback. Definitely want to improve this where possible.

The noMerge property indicates that the astral value should be used exclusively when in astral mode, rather than merging the bmp and astral values. [...] When noMerge is present, the bmp code points should be duplicated in the astralvalue.

If we’re gonna use an extra property anyway, why not split it up like this?

Three reasons. In order of least to most important:

  1. In BMP-only mode, the output would be, e.g., \0-\uD7FF\uDC00-\uFFFF\uD800-\uDBFF, instead of the cleaner \0-\uFFFF. No biggie, though. This doesn't really matter.
  2. If one addon adds BMP only support for a particular token name (leaving astral absent) and then another addon adds an astral version of the same name (leaving bmp absent), the two addons would both need to include the orphanedHighSurrogates property (rather than only including it in the addon that provides astral data), so that the two addons can be used either independently or together. This isn't a big deal, but it does add some opportunity for error / nonmatching values.
  3. This is the big one: The invert function in Unicode Base (needed for negated Unicode tokens that are used within character classes) would have to become more complicated, to be able to produce the inversion of the combined non-high-surrogate and high-surrogate ranges. They can't be inverted separately, or you'd come up with the wrong value. E.g., merging the inversion of the combined ranges \x01-\x10 and \x05-\x15 should result in \x00\x16-\uFFFF, not \x00\x11-\uFFFF\x00-\x04\x16-\uFFFF.

I know the noMerge thing is a hack, but since it would only be present for a few items (C, Any....any others?), it's probably the easiest way to go. The alternative is to create a new private function that's able to combine Unicode BMP ranges, which would be called before the inversion.

There is a significant side benefit to creating a function that can combine ranges, though. Hypothetically, you could significantly reduce the file size of unicode-categories.js by not including BMP data for the base categories, and instead generating them by merging the ranges of their subcategories.

If you're able to tackle the merge function (I think it would be nontrivial to do it efficiently), then I'm totally down with dropping noMerge and adding orphanedHighSurrogates (possible alternative name: orphans) instead.

http://git.io/unicode already takes care of all that. :) It also uses the U+0020 space symbol unescaped. I don’t unescape (or simplify the escapes of) other symbols, as the hexadecimal code point values are much easier to read in regular expression ranges IMHO. It looks like you agree.

Yeah, I noticed those in your generator, which is why I mentioned them. :) I don't personally agree about literal space rather than \x20, though. I think that can be hard to spot in ranges, especially when text wrapping is involved. But if you keep literal spaces in your unicode-data generator, there's no reason to make an exception for XRegExp. I'd rather have XRegExp match the unicode-data project's output.

Why not just an empty string instead of null? It’s falsy too, can easily be detected (just like null), so you’d still be able to do all the things you mentioned — but it allows for easier regex building, because it’s just a matter of string concatenation then. If you’d use null you’d have to explicitly check for that, as String(null) == 'null' and not an empty string. In pseudo-code:

// Using `null`:
all = (bmp === null ? '' : bmp) + '|' + (astral === null ? '' : astral)  + '|' + (orphanedHighSurrogates === null ? '' : orphanedHighSurrogates) 
// Using empty strings:
all = bmp + '|' + astral + '|' + orphanedHighSurrogates

Empty string instead of null is fine with me. But I don't think it would make anything easier. You can't do simple string concatenation like that because it might destroy the regex's accuracy by including a leading or trailing |, or a midpoint ||.

Here are some numbers:

$ gz categories.js
orig size (bytes): 52063
gzipped size (bytes): 14208
$ gz scripts.js
orig size (bytes): 15930
gzipped size (bytes): 4667
$ gz blocks.js
orig size (bytes): 20117
gzipped size (bytes): 3328

(I take it those are numbers for unminified files.)

Thanks for posting the numbers. But that's actually worse than I expected, and is kind of a tough sell considering that there is no benefit by default (unless you opt-in to astral mode and your data includes astral code points). Oh well, I suppose it's still better than including separate files for each addon.

Gzipped numbers for the current BMP-only versions of the files:

  • unicode-categories.js (excluding L, which is in base): orig: 25205, gzipped: 10387 (37% larger after gzipping)
  • unicode-scripts.js: orig: 6642, gzipped: 3348 (39% larger after gzipping)
  • unicode-blocks.js: orig: 7205, gzipped: 2447 (36% larger after gzipping)

An alternative approach might be to leave unicode-base.js, unicode-categories.js, unicode-scripts.js, unicode-blocks.js, and unicode-properties.js with BMP-only data (i.e., they use the new data format, but they leave out the astral property), and then add astral support for all of them in a single file called, e.g., unicode-astral.js (which leaves out the bmp properties). The new unicode-astral.js file would be included in xregexp-all.js, so astral support would always be available on the server when installed via npm. Browser users could then easily choose to add astral support in one shot for any of the other Unicode addons they included. And because the files would leave out bmp or astral rather than including them with the value null or '', everything would work fine if, e.g., you chose to include just unicode-base.js, unicode-categories.js, and unicode-astral.js. You'd get both BMP and astral modes for categories, and astral-only support for scripts, blocks, and properties (i.e., if you tried to use scripts, blocks, or properties when not in astral mode, you'd get an error).

Your thoughts? This alternative approach might require, e.g., changing the generators to accept arguments for whether they're outputting astral, BMP, or combined data (or just including separate astral and BMP generators, as is currently done for categories).

If you can provide me with the code points as an array (e.g. [0x0, 0x1]) I could very easily run it through my script to generate a range based on that.

That would be awesome--thanks! That format is totally doable. I can send you a pastebin link or some such for the property data, after the other details about the generators are worked out.

@mathiasbynens
Copy link
Collaborator

How would you feel about renaming noMerge into hasOrphans? (I personally dislike variable names that already negate something.)

Btw, according to my script, this property is only needed for the C and Cs categories, as well as the InHigh_Private_Use_Surrogates and InHigh_Surrogates blocks.

I also noticed a lot of blocks are missing from the current unicode-blocks.js addon (possibly intentionally). Want me to blacklist some of these, so the scripts won’t generate output for them?


You can preview the changes here: https://github.com/mathiasbynens/XRegExp/compare/add-astral-support

Let me know if more tweaks are needed!

Re: the alternative approach, I don’t know. I think it makes the most sense to match astral code points by default, as AFAIK that’s what e.g. \p{L} does in any other regex dialect that natively supports Unicode categories.

@slevithan
Copy link
Owner Author

How would you feel about renaming noMerge into hasOrphans? (I personally dislike variable names that already negate something.)

I appreciate that you're trying to hold me to a good name that both defaults to false and has a non-negated name.

The point of noMerge (possibly better name: hasIndependentSets) was that the astral pattern should include BMP code points and be used without merging the range with the bmp property's value. That would allow pushing orphaned high surrogates (or any subset of them) to the end of the astral range, where they would be included as the last alternation option. This is necessary because the high surrogates cannot be removed from the bmp range nor concatenated with a high-surrogate-free list of BMP code points (when in BMP mode); otherwise, the invert function in Unicode Base (needed when negated BMP Unicode tokens are used within character classes) will not work correctly (unless we add a merge or union function for BMP ranges, but that would probably be nontrivial to implement efficiently). And since the bmp property needs to include high surrogates, it can't be concatenated with astral when in astral mode, or else high surrogates will always be matched on their own, and never as part of a surrogate pair.

If the what or why of any of this is not fully clear, it might be easier to discuss it on Skype or similar. If you think that would help, email me at steves_list@hotmail.com for my Skype name.

Btw, according to my script, this property is only needed for the C and Cs categories, as well as the InHigh_Private_Use_Surrogates and InHigh_Surrogates blocks.

Thanks for this list. Based on these, though, the only one that needs special handling (and therefore the hasIndependentSets property, or whatever mechanism we come up with) is C, since the other three (Cs, InHigh_Private_Use_Surrogates, and InHigh_Surrogates) include BMP code points only. If no astral code points are included, no special handling is needed since high surrogates don't need to be pushed to the end in order to allow matching orphaned high surrogates while still preferring surrogate pairs.

There are two items in unicode-properties.js that will also need hasIndependentSets: Any and Assigned. E.g., Any could be set up with something like this:

{
    name: 'Any',
    hasIndependentSets: true,
    bmp: '\0-\uFFFF',
    astral: '[\0-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF]'
    // Alternatively, BMP code points (when included in astral) can come last:
    // astral: '[\uD800-\uDBFF][\uDC00-\uDFFF]|[\0-\uFFFF]'
}

I also noticed a lot of blocks are missing from the current unicode-blocks.js addon (possibly intentionally). Want me to blacklist some of these, so the scripts won’t generate output for them?

I imagine you're talking about the astral-only blocks that include code points 0x10000 and higher. I haven't included them so far since, up to now, XRegExp has only supported the BMP. For such blocks (as well as astral-only scripts), the bmp property should be absent, so that the name errors when in BMP-only mode.

You can preview the changes here: https://github.com/mathiasbynens/XRegExp/compare/astral-support
Let me know if more tweaks are needed!

Looks cool. Still do need a few tweaks, though. Sorry for all the hassle and for not making things clear enough the first time around. The main changes needed are for the surrogate-related stuff, described above. There's one other thing, though: I see that you've wrapped the bmp data in [...]. In order to work with that in BMP mode, I'd have to strip off the character class brackets when using BMP Unicode properties inside character classes. It might be easier or more efficient to just add the character class brackets in code when they're needed.

Re: the alternative approach, I don’t know. I think it makes the most sense to match astral code points by default, as AFAIK that’s what e.g. \p{L} does in any other regex dialect that natively supports Unicode categories.

Only a select few regex flavors support astral code points. Java only started supporting them via, e.g., \p{L} in JDK7. In versions 4 through 6, Java's \p{L}, etc. supported only BMP code points. I believe .NET still supports only BMP code points with \p{L}, etc. JGsoft software like RegexBuddy doesn't support astral code points, either, and has no near-term plans to add support for them. Very few people care. Performance and backward compatibility concerns tend to outweigh the benefits of astral support. In XRegExp, IMO support for character classes by default is more important than astral support. But the generator probably shouldn't be changed because of this. If I do end up splitting things out into a new unicode-astral.js file, I can do this manually with some searches and replacements on the generator's output.

@mathiasbynens
Copy link
Collaborator

The point of noMerge […]

Ok, I made bmp output the BMP-only range, and astral the complete range (i.e. BMP + astral code points). One last question regarding this: mathiasbynens@fd8a185#commitcomment-1471951 Once you answer this, I’ll push the updates to GitHub so you can take another look.

I imagine you're talking about the astral-only blocks that include code points 0x10000 and higher. I haven't included them so far since, up to now, XRegExp has only supported the BMP. For such blocks (as well as astral-only scripts), the bmp property should be absent […]

Done.

There's one other thing, though: I see that you've wrapped the bmp data in [...]. In order to work with that in BMP mode, I'd have to strip off the character class brackets when using BMP Unicode properties inside character classes. It might be easier or more efficient to just add the character class brackets in code when they're needed.

No problem, I got rid of the brackets.

Only a select few regex flavors support astral code points.

Oh, in that case I fully agree with your new proposal!

But the generator probably shouldn't be changed because of this. If I do end up splitting things out into a new unicode-astral.js file, I can do this manually with some searches and replacements on the generator's output.

I really don’t mind tweaking the generator — if you just let me know what exactly needs to be changed, I’ll be happy to change it so you get the output you need.

@slevithan
Copy link
Owner Author

One last question regarding this: mathiasbynens@fd8a185#commitcomment-1471951 Once you answer this, I’ll push the updates to GitHub so you can take another look.

Answered: #commitcomment-1472579. Unfortunately, the answer is complicated and leads to more questions. But I think we're almost there.

@mathiasbynens
Copy link
Collaborator

Okay, I’ve applied the suggested changes (I think!): https://github.com/mathiasbynens/XRegExp/compare/add-astral-support

Let me know if I missed anything, or if more tweaks are needed.

@slevithan
Copy link
Owner Author

Wo0t! Both the output and the generator scripts are looking lovely.

blocks.js looks perfect, to me.

A couple last issues for scripts.js and categories.js:

  • When both bmp and astral properties are included, the bmp value should not be repeated in astral, unless hasIndependentSets is present and true. Here's an example of the current output, from scripts.js:
    {
        name: 'Bamum',
        bmp: '\uA6A0-\uA6F7',
        astral: '\uA6A0-\uA6F7|\uD81A[\uDC00-\uDE38]'
    },

That should be:

    {
        name: 'Bamum',
        bmp: '\uA6A0-\uA6F7',
        astral: '\uD81A[\uDC00-\uDE38]'
    },
  • When hasIndependentSets is true and therefore the bmp value is repeated in astral, the BMP part of the astral value should be wrapped in square brackets. Following is a clipped copy of the current C output, from categories.js:
    {
        name: 'C',
        alias: 'Other',
        hasIndependentSets: true,
        bmp: '\0-\x1F...\uFFFF',
        astral: '\0-\x1F...\uFFFF|\uD808[\uDF6F-\uDFFF]|...|\uD82C[\uDC02-\uDFFF]'
    },

That should be:

    {
        name: 'C',
        alias: 'Other',
        hasIndependentSets: true,
        bmp: '\0-\x1F...\uFFFF',
        astral: '[\0-\x1F...\uFFFF]|\uD808[\uDF6F-\uDFFF]|...|\uD82C[\uDC02-\uDFFF]'
    },

In your defense, I never mentioned anything about this rule. :)


As we discussed on Twitter, I'm thinking of adding A as a per-regex astral-support-opt-in flag.

@mathiasbynens
Copy link
Collaborator

Thanks, the example output helps a lot!

Updated: https://github.com/mathiasbynens/XRegExp/compare/add-astral-support

@slevithan
Copy link
Owner Author

Schweeeeeet!

But, ack! One last thing....sorry I didn't catch this last time!!

Here's the current output for C:

    {
        name: 'C',
        alias: 'Other',
        hasIndependentSets: true,
        bmp: '\0-\x1F...\uD7FC-\uF8FF...\uFFFF',
        astral: '[\0-\x1F...\uD7FC-\uF8FF...\uFFFF]|\uD808[\uDF6F-\uDFFF]|...|\uD82C[\uDC02-\uDFFF]'
    },

That includes high surrogates in the first alternation option in the astral value, and therefore will never match a high surrogate as part of a complete surrogate pair. On the other hand, the bmp value is perfect and must stay the same.

The easiest and probably best way to fix this is to simply put the BMP value last (for items that include hasIndependentSets), like this:

    {
        name: 'C',
        alias: 'Other',
        hasIndependentSets: true,
        bmp: '\0-\x1F...\uD7FC-\uF8FF...\uFFFF',
        astral: '\uD808[\uDF6F-\uDFFF]|...|\uD82C[\uDC02-\uDFFF]|[\0-\x1F...\uD7FC-\uF8FF...\uFFFF]'
    },

Another more complicated way to fix this would be to put only the high surrogates last, like this:

    {
        name: 'C',
        alias: 'Other',
        hasIndependentSets: true,
        bmp: '\0-\x1F...\uD7FC-\uF8FF...\uFFFF',
        astral: '[\0-\x1F...\uD7FC-\uD7FF\uDC00-\uF8FF...\uFFFF]|\uD808[\uDF6F-\uDFFF]|...|\uD82C[\uDC02-\uDFFF]|[\uD800-\uDBFF]'
    },

This alternative approach is probably not worth it.

@mathiasbynens
Copy link
Collaborator

Done: https://github.com/mathiasbynens/XRegExp/compare/add-astral-support

If this looks alright, I’ll submit a pull request.

@slevithan
Copy link
Owner Author

If this looks alright, I’ll submit a pull request.

Do eeet! Looks good to me. Thanks for your longsuffering work and collaboration on this and the extensive revisions. I'll start work on Unicode Base so it can work with this data.

After this is merged and the Unicode addons are updated to work with the new data, I'll submit the code point data for unicode-properties.js in the format you requested. That should be the very last piece of this.

@slevithan
Copy link
Owner Author

OK, I've largely rewritten Unicode Base to work with the new data. I've also added some tests, and added the newly generated data to all of the Unicode addons except Unicode Properties (which doesn't yet have the new-style data).

When plugging the data into the addons, I realized that, since you're now simply pushing the entire BMP range (rather than just high surrogates) to the end of astral, we never need to duplicate the BMP range--we just need a property that tells us whether to concat the BMP range at the end instead of the beginning. I've therefore renamed hasIndependentSets as isBmpLast. I've already updated the generators and result data to match this, so no further changes are needed for this. (If you dislike the new variable name, I'd be happy to improve it.)


There is one additional change I'm hoping you can take care of, since I'm not sure how you'd prefer to handle it:

Because the generator outputs string escapes that end up being passed to XRegExp as literal characters (in other words, the output is '\0\x2D-\x0100' instead of '\\0\\x2D-\\u0100'), the 13 regex metacharacters matched by /[-[{()*+?.\\^$|]/ might or definitely will (depending on the specific character) cause bugs unless they are escaped. In other words, the output for the range I just mentioned should be either '\0\\\x2D-\x0100' or '\0\\--\x0100'. (I'm thinking the latter might be preferable, since if the characters already need special handling, they might as well be made shorter, too.)

The hex codes for each of the 13 metacharacters are as follows:

  • $ is 0x24
  • ( is 0x28
  • ) is 0x29
  • * is 0x2A
  • + is 0x2B
  • - is 0x2D
  • . is 0x2E
  • ? is 0x3F
  • [ is 0x5B
  • \ is 0x5C
  • _Edit:_ ] is 0x5D
  • ^ is 0x5E
  • { is 0x7B
  • | is 0x7C

I'll be posting the data for unicode-properties.js in the format you requested as soon as I have some more free time to deal with this.

@mathiasbynens
Copy link
Collaborator

In other words, the output for the range I just mentioned should be either '\0\\\x2D-\x0100' or '\0\\--\x0100'. (I'm thinking the latter might be preferable, since if the characters already need special handling, they might as well be made shorter, too.)

See #34. I’ve made a commit implementing the first option, then another one that switched to the latter option (which you seemed to prefer), so you can always switch back to the other approach if you want. IMHO the first option is more readable, and I doubt the byte savings of using the other option would make a difference after gzip compression, but it’s your call :)


Turns out the “compact” option results in a larger file after gzip compression.

Using hexadecimal escapes (e.g. \\x2A):

$ curl -s https://raw.github.com/mathiasbynens/XRegExp/caec8eaa3366b35a32b40e29821a045951f78200/tools/output/categories.js | gzip | wc -c
14187

Using simple backslash escapes (e.g. \\*):

$ curl -s https://raw.github.com/mathiasbynens/XRegExp/c9cd440fcb6bd20d69414ed25b60533ea5987b0c/tools/output/categories.js | gzip | wc -c
14200

@slevithan
Copy link
Owner Author

Merged. I reverted the second commit, so that the first option (hexadecimal escapes, e.g. \\x2A) is used. I've also updated Unicode Base's invertBmp function to work with the modified data.

Apart from unicode-properties.js (which I still need to delay, due to being very busy at the moment), the new Unicode addons are 100% done. It's all working wonderfully, and I couldn't have done it without you, so thanks again! 😄

FYI, all planned changes for the XRegExp 2.1.0 package are now done, with two exceptions:

  1. Need to add the new-style data to unicode-properties.js.
  2. Need to finish some basic code cleanup (e.g., changing double quotes to single quotes).

The v2.1.0 changes are listed here.

@mathiasbynens
Copy link
Collaborator

It's all working wonderfully, and I couldn't have done it without you, so thanks again!

Right back at you! ⁵

@slevithan
Copy link
Owner Author

OK, here be the 21-bit code points for the Unicode Level-1 properties used in unicode-properties.js:

Here's the Default_Ignorable_Code_Point data, as an example:

[0x00AD, 0x034F, 0x115F, 0x1160, 0x17B4, 0x17B5, '180B-180D', '200B-200F', '202A-202E', '2060-206F', 0x3164, 'FE00-FE0F', 0xFEFF, 0xFFA0, 'FFF0-FFF8', '1D173-1D17A', 'E0000-E0FFF']

Note that, for ranges, I've included them as hyphen-delimited strings, rather than listing out each code point. If you'd like me to change the data to list out each code point as a number, just let me know.

@mathiasbynens
Copy link
Collaborator

Having each code point as a number (rather than the ranges) would be much easier for me, as it would allow me to re-use the existing createRange() method in utils.py. If it’s not too much work for you to generate the list / array that way, I’d prefer it. If not, just let me know and I’ll write a new script for it.

I already have the data for these three properties:

@mathiasbynens
Copy link
Collaborator

No response for four days, so I’ll assume it’s not straightforward to generate the data in the format I requested. No worries, I just wrote a script that converts your data into a format createRange can handle.

You can sanity-check the output here: #36

By the way, may I ask how you generated this data manually? I’m stilll looking in to how it could be done efficiently in unicode-data.

@slevithan
Copy link
Owner Author

slevithan commented Jun 28, 2012

Thank you!!

I've been travelling for the last few days, which was the main reason I hadn't gotten back to you. But the solution you came up with (using a new Python generator script to process the data I provided earlier) is easier and better than what I likely would have done.

To generate the data manually, I used UnicodeSet from unicode.org. E.g., for \p{Alphabetic}. Using UnicodeSet's output, it just takes a few regex search-and-replace operations in a text editor to produce the data I provided. UnicodeSet doesn't support \p{Assigned}, but you can use \P{Cn} to get the same result.

I suppose it's entirely possible to scrape UnicodeSet output for unicode-data, but then (AFAIK) you could only work with the latest version of Unicode. Is it possible that PyICU might give you some tools or data to work with that would be helpful toward this?


At long last, this issue is closedededed. 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants