Who else is exploring exposing sequences in regular expressions? #12

nathanhammond · 2018-06-15T18:41:37Z

Is TC39 the only group exploring directly exposing emoji sequences in regular expressions?

If so, why, and who can we work with to expand the number of people thinking about it?
If not, who else is doing it, and what have they discovered?

mathiasbynens · 2018-06-16T12:38:15Z

I haven’t seen any similar proposals. AFAIK this is entirely new ground.

I’ve made the Unicode Consortium aware of this proposal. If the proposal eventually makes it to Stage 4, it would certainly be nice to have the Unicode Consortium’s blessing by means of updating UTS #18 with an optional recommendation to implement this functionality.

All that is out of scope for this particular repository/proposal, though.

gibson042 · 2018-09-26T15:45:23Z

There's a limited form in Perl: https://ideone.com/iO8kBg

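#!/usr/bin/perl
use utf8;
binmode STDOUT, ":utf8";

# \N{...} accepts a named sequence, so this pattern matches U+0101 followed by U+0300.
my $re = qr/\N{LATIN SMALL LETTER A WITH MACRON AND GRAVE}/u;
# Each half of the sequence on its own, then the full sequence.
my @cases = ("\x{0101}", "\x{0300}", "\x{0101}\x{0300}");

for my $str (@cases) {
  if ( $str =~ $re ) {
    print ("matches $re: $str\n");
  } else {
    print ("does not match $re: $str\n");
  }
}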

does not match (?^u:\N{U+101.300}): ā
does not match (?^u:\N{U+101.300}): ̀
matches (?^u:\N{U+101.300}): ā̀
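
For comparison, JavaScript regular expressions have no `\N{…}` escape, so a rough equivalent of the Perl snippet above has to spell out the code points of the named sequence by hand. A minimal sketch (the variable names are only for illustration):

```javascript
// LATIN SMALL LETTER A WITH MACRON AND GRAVE is a named sequence:
// U+0101 followed by U+0300. It is written out explicitly here because
// JavaScript has no \N{...} escape.
const re = /\u{101}\u{300}/u;

// Each half of the sequence on its own, then the full sequence.
const cases = ["\u{101}", "\u{300}", "\u{101}\u{300}"];

const results = cases.map((str) => re.test(str));
for (const [i, str] of cases.entries()) {
  console.log(`${results[i] ? "matches" : "does not match"} ${re}: ${str}`);
}
```

Only the third case matches, as in the Perl output.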

Alhadis · 2018-10-15T12:57:20Z

Are there any sequence properties which aren't pictograph-related? I couldn't help noticing every property under "Proposed Solution" was related to Emoji.

I'm curious if there are any other potential properties which are or might one day be used.

Could this syntax be extended to cover Unicode decomposition? Or is that something for another proposal? Something like this:

const before = /[1₁➊①]/; // Huge list of anything that decomposes to U+0031
const after  = /1/d;      // “Decompose” flag

after.test("¹"); // ⇒ true
after.test("1"); // ⇒ true

Basically, it'd replicate how most contemporary browsers use “smart” matching when searching for "1" on a page (which matches Unicode sequences which have an equivalent decomposition mapping).
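
Without any new syntax, something close to this can be approximated today by normalizing the input before matching. The sketch below is only an illustration of the idea: `testDecomposed` is a hypothetical helper, not part of this or any other proposal, and it relies on `String.prototype.normalize` with the NFKD form.

```javascript
// Hypothetical helper: approximates the suggested "decompose" flag by
// applying NFKD (compatibility decomposition) to the input before matching.
function testDecomposed(re, str) {
  return re.test(str.normalize("NFKD"));
}

const one = /^1$/;
console.log(testDecomposed(one, "\u00B9")); // SUPERSCRIPT ONE
console.log(testDecomposed(one, "\u2460")); // CIRCLED DIGIT ONE
console.log(testDecomposed(one, "2"));      // DIGIT TWO
```

One limitation of this approach: any match positions refer to the normalized string, not to the original input.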

mathiasbynens · 2019-01-04T13:33:24Z

@Alhadis That sounds like a separate proposal to me.

@nathanhammond To your original question, I've been discussing this proposal with Unicode folks at Google, and more recently have created a formal proposal at the Unicode level, which has now been officially submitted. Mark Davis will present the proposal during the January UTC meeting at Google MTV from January 22–25. If the proposal gets accepted, we'll know what the official term is for sequence properties.

One of the things the proposal addresses is the idea of this functionality not being JavaScript-only. If the proposal ends up getting accepted, it would explicitly mention sequence properties in UTS#18, enabling us to proceed with this proposal knowing we have the Unicode Consortium's blessing, and enabling other languages to consider adopting this functionality.

nathanhammond · 2019-01-10T18:21:54Z

@mathiasbynens I've just read the proposal and I really appreciate you doing the upstream legwork in Unicode land. If those changes land upstream I would be in favor of advancing this proposal.

In my head I'm still generally uneasy with the idea of heading down a path for standardizing this for JS without the PCRE folks. However, I also feel that TC39 (you) have been paying close attention to this space recently and JS is therefore well-suited to be the first one to deliver.

markusicu · 2021-06-24T17:32:10Z

Update: Unicode has agreed to formalize “properties of strings”: https://www.unicode.org/reports/tr18/#domain_of_properties

The recommendation for which properties to support in regular expressions has been updated to include the 7 binary properties of strings: https://www.unicode.org/reports/tr18/#Full_Properties → “Properties marked with * are properties of strings, not just single code points.”

These are the 7 emoji properties of strings defined in UTS #51 and its data files.
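
As an illustration of what a property of strings looks like in practice, here is a minimal sketch that assumes a JavaScript engine implementing the `v` (unicodeSets) flag, which is not discussed in this thread. `isRgiEmoji` is a hypothetical helper name; `RGI_Emoji` is one of the emoji properties of strings from UTS #51.

```javascript
// Returns true or false, or null when the runtime does not support the
// `v` flag (an assumption of this sketch; older engines throw on it).
function isRgiEmoji(str) {
  let re;
  try {
    re = new RegExp("^\\p{RGI_Emoji}$", "v");
  } catch {
    return null;
  }
  return re.test(str);
}

console.log(isRgiEmoji("\u{1F1F3}\u{1F1F1}")); // flag sequence, two code points
console.log(isRgiEmoji("1\uFE0F\u20E3"));      // keycap sequence, three code points
console.log(isRgiEmoji("a"));                  // plain letter
```

The point of the sketch is that a single `\p{…}` can match a string of several code points, which a property of code points cannot do.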

markusicu · 2021-06-24T17:47:13Z

> Could this syntax be extended to cover Unicode decomposition? Or is that something for another proposal? Something like this:
>
> const before = /[1₁➊①]/; // Huge list of anything that decomposes to U+0031
> const after  = /1/d;      // “Decompose” flag
>
> after.test("¹"); // ⇒ true
> after.test("1"); // ⇒ true

Those kinds of things are possible, and are discussed in UTS 18: https://www.unicode.org/reports/tr18/#Wildcard_Properties

They are actually typically fancy ways of creating sets of characters, not of strings. You can see that in the UnicodeSet demo: [:toNFKD=1:]

But other functions like this could yield sets of strings.
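
The set-of-characters view can be sketched in JavaScript by brute force. This is only a rough approximation of what `[:toNFKD=1:]` expresses, not the UnicodeSet implementation; `codePointsWithNfkd` is a hypothetical helper, and the result depends on the Unicode version built into the runtime.

```javascript
// Collect every code point whose NFKD form is exactly `target`.
function codePointsWithNfkd(target) {
  const found = [];
  for (let cp = 0; cp <= 0x10ffff; cp++) {
    if (cp >= 0xd800 && cp <= 0xdfff) continue; // skip lone surrogates
    if (String.fromCodePoint(cp).normalize("NFKD") === target) {
      found.push(cp);
    }
  }
  return found;
}

const ones = codePointsWithNfkd("1");
console.log(ones.map((cp) => String.fromCodePoint(cp)).join(" "));
```

Every element of the result is a single code point, which is why this kind of function produces a set of characters and not a set of strings.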

And as Richard pointed out, \N{character name} can yield a multi-character string. See UTS 18 “The data in NamedSequences.txt is also used in \N{…}.” → https://www.unicode.org/Public/UCD/latest/ucd/NamedSequences.txt

All of these are out of scope for the current proposal.

> Basically, it'd replicate how most contemporary browsers use “smart” matching when searching for "1" on a page (which matches Unicode sequences which have an equivalent decomposition mapping).

The browsers I am aware of implement “ctrl+F” in-page search via collation (e.g., using ICU's class StringSearch), which is a very different kind of algorithm (UCA=UTS 10, and CLDR extensions).
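
To give a feel for how collation-based matching differs from regular-expression matching, here is a deliberately naive sketch using the standard `Intl.Collator` API. It is not how ICU's StringSearch works; `collationIncludes` is a hypothetical helper that simply tries every substring, and the `"en"` locale and `"base"` sensitivity are assumptions chosen for the example.

```javascript
// Naive sketch: reports whether any substring of `haystack` compares equal
// to `needle` under a search collator. Quadratic, for illustration only.
function collationIncludes(haystack, needle, locale = "en") {
  const collator = new Intl.Collator(locale, {
    usage: "search",
    sensitivity: "base", // ignore case and accent differences
  });
  if (needle === "") return true;
  for (let start = 0; start < haystack.length; start++) {
    for (let end = start + 1; end <= haystack.length; end++) {
      if (collator.compare(haystack.slice(start, end), needle) === 0) {
        return true;
      }
    }
  }
  return false;
}

console.log(collationIncludes("Café au lait", "cafe"));
console.log(collationIncludes("Café au lait", "tea"));
```

Equality here is decided by comparing collation strength levels for a locale, not by decomposition mappings or pattern syntax.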
