This repository has been archived by the owner on May 20, 2022. It is now read-only.

Who else is exploring exposing sequences in regular expressions? #12

Open
nathanhammond opened this issue Jun 15, 2018 · 7 comments

Comments

@nathanhammond
Member

Is TC39 the only group exploring directly exposing emoji sequences in regular expressions?

  • If so, why, and who can we work with to expand the number of people thinking about it?
  • If not, who else is doing it, and what have they discovered?
@mathiasbynens
Member

I haven’t seen any similar proposals. AFAIK this is entirely new ground.

I’ve made the Unicode Consortium aware of this proposal. If the proposal eventually makes it to Stage 4, it would certainly be nice to have the Unicode Consortium’s blessing in the form of an update to UTS18 with an optional recommendation to implement this functionality.

All that is out of scope for this particular repository/proposal, though.

@gibson042

There's a limited form in Perl: https://ideone.com/iO8kBg

#!/usr/bin/perl
use utf8;
binmode STDOUT, ":utf8";

my $re = qr/\N{LATIN SMALL LETTER A WITH MACRON AND GRAVE}/u;
my @cases = ("\x{0101}", "\x{0300}", "\x{0101}\x{0300}");

for my $str (@cases) {
  if ( $str =~ $re ) {
    print ("matches $re: $str\n");
  } else {
    print ("does not match $re: $str\n");
  }
}

Output:

does not match (?^u:\N{U+101.300}): ā
does not match (?^u:\N{U+101.300}): ̀
matches (?^u:\N{U+101.300}): ā̀
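
For comparison, JavaScript has no \N{…} escape, so the same check has to spell out the code points by hand. A minimal sketch, assuming only standard `u`-flag regular expressions; the variable names are illustrative:

```javascript
// U+0101 (a with macron) followed by U+0300 (combining grave accent),
// written out explicitly because JavaScript has no \N{...} named-sequence escape.
const aMacronGrave = /\u{101}\u{300}/u;

// Same three cases as the Perl script: each part alone, then the full sequence.
const cases = ["\u{101}", "\u{300}", "\u{101}\u{300}"];
const results = cases.map((str) => aMacronGrave.test(str));

console.log(results); // [ false, false, true ]
```

Only the full two-code-point sequence matches, which mirrors the Perl output above.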

@Alhadis

Alhadis commented Oct 15, 2018

Are there any sequence properties which aren't pictograph-related? I couldn't help noticing every property under "Proposed Solution" was related to Emoji.

I'm curious if there are any other potential properties which are or might one day be used.

Could this syntax be extended to cover Unicode decomposition? Or is that something for another proposal? Something like this:

const before = /[1₁➊①]/; // Huge list of anything that decomposes to U+0031
const after  = /1/d;      // “Decompose” flag

after.test("¹"); // ⇒ true
after.test("1"); // ⇒ true

Basically, it'd replicate how most contemporary browsers use “smart” matching when searching for "1" on a page (which matches Unicode sequences which have an equivalent decomposition mapping).
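
Something close to this can already be approximated in userland by normalizing the input before matching. A rough sketch, assuming NFKD is the intended decomposition; `matchesDecomposed` is a made-up helper name and not part of any proposal:

```javascript
// Hypothetical helper: apply NFKD compatibility decomposition to the input,
// then run an ordinary regular expression against the decomposed text.
function matchesDecomposed(pattern, text) {
  return pattern.test(text.normalize("NFKD"));
}

const one = /^1$/;
console.log(matchesDecomposed(one, "\u00B9")); // superscript one
console.log(matchesDecomposed(one, "\u2081")); // subscript one
console.log(matchesDecomposed(one, "\u2460")); // circled digit one
console.log(matchesDecomposed(one, "2"));      // plain digit two
```

This only covers characters that actually have a decomposition mapping to U+0031, which is narrower than "anything that looks like a 1".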

@mathiasbynens
Member

@Alhadis That sounds like a separate proposal to me.

@nathanhammond To your original question, I've been discussing this proposal with Unicode folks at Google, and more recently have created a formal proposal at the Unicode level, which has now been officially submitted. Mark Davis will present the proposal during the January UTC meeting at Google MTV from January 22–25. If the proposal gets accepted, we'll know what the official term is for sequence properties.

One of the things the proposal addresses is the idea of this functionality not being JavaScript-only. If the proposal ends up getting accepted, it would explicitly mention sequence properties in UTS#18, enabling us to proceed with this proposal knowing we have the Unicode Consortium's blessing, and enabling other languages to consider adopting this functionality.

@nathanhammond
Member Author

@mathiasbynens I've just read the proposal and I really appreciate you doing the upstream legwork in Unicode land. If those changes land upstream I would be in favor of advancing this proposal.

In my head I'm still generally uneasy with the idea of heading down a path for standardizing this for JS without the PCRE folks. However, I also feel that TC39 (you) have been paying close attention to this space recently and JS is therefore well-suited to be the first one to deliver.

@markusicu
Collaborator

Update: Unicode has agreed to formalize “properties of strings”: https://www.unicode.org/reports/tr18/#domain_of_properties

The recommendation for which properties to support in regular expressions has been updated to include the 7 binary properties of strings: https://www.unicode.org/reports/tr18/#Full_Properties → “Properties marked with * are properties of strings, not just single code points.”

These are the 7 emoji properties of strings defined in UTS #51 and its data files.
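
In JavaScript engines that support the `v` flag, properties of strings can be tried directly. A small sketch under that assumption (the helper returns undefined where the flag or the property name is rejected); `stringPropertyRegExp` and `hasStringProperty` are illustrative names:

```javascript
// Build an anchored v-flag regular expression for a property of strings.
// Returns null when the runtime rejects the v flag or the property name.
function stringPropertyRegExp(name) {
  try {
    return new RegExp(`^\\p{${name}}$`, "v");
  } catch {
    return null;
  }
}

// true/false when the property is supported, undefined otherwise.
function hasStringProperty(name, text) {
  const re = stringPropertyRegExp(name);
  return re === null ? undefined : re.test(text);
}

const flag = "\u{1F1FA}\u{1F1F8}"; // regional indicators U + S
const keycap = "1\uFE0F\u20E3";    // digit one, VS16, combining enclosing keycap

console.log(hasStringProperty("RGI_Emoji_Flag_Sequence", flag));
console.log(hasStringProperty("Emoji_Keycap_Sequence", keycap));
console.log(hasStringProperty("RGI_Emoji", "a"));
```

Both sample inputs are multi-code-point strings, which is exactly what a property of a single code point cannot describe.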

@markusicu
Collaborator

markusicu commented Jun 24, 2021

> Could this syntax be extended to cover Unicode decomposition? Or is that something for another proposal? Something like this:
>
> const before = /[1₁➊①]/; // Huge list of anything that decomposes to U+0031
> const after  = /1/d;      // “Decompose” flag
>
> after.test("¹"); // ⇒ true
> after.test("1"); // ⇒ true

Those kinds of things are possible, and are discussed in UTS 18: https://www.unicode.org/reports/tr18/#Wildcard_Properties

They are actually typically fancy ways of creating sets of characters, not of strings. You can see that in the UnicodeSet demo: [:toNFKD=1:]

But other functions like this could yield sets of strings.
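
A brute-force approximation of that character set in JavaScript, assuming `toNFKD=1` means "the NFKD form equals the string 1". The result depends on the Unicode version of the runtime, and `codePointsWithNFKD` is an illustrative name:

```javascript
// Collect every code point whose NFKD form is exactly the given string.
function codePointsWithNFKD(target) {
  const matches = [];
  for (let cp = 0; cp <= 0x10ffff; cp++) {
    if (cp >= 0xd800 && cp <= 0xdfff) continue; // skip lone surrogates
    const ch = String.fromCodePoint(cp);
    if (ch.normalize("NFKD") === target) matches.push(ch);
  }
  return matches;
}

const ones = codePointsWithNFKD("1");
console.log(ones.length, ones.join(" "));
```

Every entry in the result is a single code point, which illustrates the point that this kind of query produces a set of characters rather than a set of strings.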

And as Richard pointed out, \N{character name} can yield a multi-character string. See UTS 18 “The data in NamedSequences.txt is also used in \N{…}.” → https://www.unicode.org/Public/UCD/latest/ucd/NamedSequences.txt
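
To make the named-sequence case concrete, here is a sketch that turns one line in the NamedSequences.txt format (name, semicolon, hex code points) into a regular expression. The helper names are hypothetical and the sample line is hard-coded rather than read from the file:

```javascript
// Parse one "NAME;XXXX XXXX ..." data line into a name and a list of code points.
function parseNamedSequence(line) {
  const [name, hex] = line.split(";");
  const codePoints = hex.trim().split(/\s+/).map((h) => parseInt(h, 16));
  return { name: name.trim(), codePoints };
}

// Build a u-flag regular expression that matches exactly that sequence.
function sequenceToRegExp(codePoints) {
  const source = codePoints.map((cp) => `\\u{${cp.toString(16)}}`).join("");
  return new RegExp(source, "u");
}

// Sample line in the NamedSequences.txt format.
const entry = parseNamedSequence("LATIN SMALL LETTER A WITH MACRON AND GRAVE;0101 0300");
const sequenceRe = sequenceToRegExp(entry.codePoints);

console.log(entry.name, sequenceRe.source);
console.log(sequenceRe.test("\u0101\u0300"));
```

The resulting pattern matches a two-code-point string, so a \N{…} lookup backed by this data yields strings, not just single characters.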

All of these are out of scope for the current proposal.

> Basically, it'd replicate how most contemporary browsers use “smart” matching when searching for "1" on a page (which matches Unicode sequences which have an equivalent decomposition mapping).

The browsers I am aware of implement “ctrl+F” in-page search via collation (e.g., using ICU's class StringSearch), which is a very different kind of algorithm (UCA=UTS 10, and CLDR extensions).
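
JavaScript does not expose ICU's StringSearch, but the flavor of collation-based matching can be imitated with `Intl.Collator`. A deliberately naive sketch, assuming that comparing fixed-length windows is good enough for illustration (a real collation search also handles matches of different length, contractions, and ignorable characters); `collationIndexOf` is a made-up name:

```javascript
// Primary-strength collator: differences in case and accents are ignored.
const collator = new Intl.Collator("en", { usage: "search", sensitivity: "base" });

// Naive stand-in for a collation-based search: slide a window of the needle's
// length over the haystack and compare each window with the collator.
function collationIndexOf(haystack, needle) {
  for (let i = 0; i + needle.length <= haystack.length; i++) {
    if (collator.compare(haystack.slice(i, i + needle.length), needle) === 0) {
      return i;
    }
  }
  return -1;
}

console.log(collationIndexOf("My R\u00E9sum\u00E9", "resume"));
```

The match comes from comparing collation keys at primary strength, not from decomposing the text and running a regular expression over it.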
