Skip to content
This repository has been archived by the owner on Feb 16, 2024. It is now read-only.

complement syntax/semantics #7

Closed
markusicu opened this issue Feb 5, 2021 · 5 comments
Closed

complement syntax/semantics #7

markusicu opened this issue Feb 5, 2021 · 5 comments

Comments

@markusicu
Copy link
Collaborator

markusicu commented Feb 5, 2021

We are hoping to advance both this proposal as well as the properties-of-strings proposal. Therefore, we need to define the semantics of ^ when there are multi-character strings.

We have used a brainstorm doc, discussed many options, and arrived at a consensus. The following is a somewhat abbreviated version of that doc, for the record.

Note that in ICU class UnicodeSet pattern strings and in the Unicode regex spec (UTS #18), a set/class can contain multi-character strings.

When only code points (= single-character strings) are involved, then the complement is simply all other code points (out of the 1.1M possible Unicode code points). Changing this is neither necessary nor desirable.

In discussions among Andy Heninger, Mark Davis, Markus Scherer, Mathias Bynens, and Shane Carr, we had consensus to propose the strictest option below (1.5.b Error if strings may be present) for ECMAScript.

1 Simple complement

1.1 Existing syntax & semantics

[^a-z] all Unicode characters except ASCII a-z
\S all Unicode characters except White_Space
\P{gc=L} all Unicode characters except Letters
[:^gc=L:] all Unicode characters except Letters

We should not change these.

1.2 Simple complement vs. strings

What should the complement operator do when the input set also contains multi-character strings?

1.3 Complement via symmetric difference

ICU (since 2002) has implemented UnicodeSet.complement() as complementing the code points but leaving the other strings unchanged.

Example: [^a-z{ch}] = [[^a-z]{ch}] (almost all Unicode characters, and the two-character string “ch”)

Mathematically: [^A] = \p{any}⊖A (\p{any} = all code points; = symmetric difference)

From the API docs: “This is equivalent to complement(MIN_VALUE, MAX_VALUE).”

With this, we get [^[^A]] = A as one would expect. However, when A contains strings, then we do not get A&&[^B] == A--B, as one would also expect.

Consequence: \P{Basic_Emoji} would be the set of multi-character strings in Basic_Emoji, and all Unicode code points except for the single-character Basic_Emoji.

1.4 Complement via subtraction

The simple complement could remove all non-single-character strings. This seems less surprising.

This is also the standard math interpretation, except: In set theory, you get the same result for subtracting from the Universe vs. performing the symmetrical difference with the Universe. The problem here is that “all code points” is not truly the Universe.

Example: [^a-z{ch}] = [^a-z] (almost all Unicode characters, and nothing else)

Mathematically: [^A] = \p{any}∖A ( = set difference; this is not an ASCII backslash)

With this, we get [^[^A]] = A only when A contains only code points. When A does contain multi-character strings, [^[^A]] ≠ A ([^[^A]] ⊂ A)

Consequence: \P{Basic_Emoji} would be the set of all Unicode code points except for the single-character Basic_Emoji, without any multi-character strings.

[a\q{👧🏿}] contains a and 👧🏿
[^a\q{👧🏿}] contains all single code points except a, and no strings

The current draft update of UTS #18 defines the character class complement ^ like this.

1.5 Error if strings are or could be present

The simple complement could throw an exception if strings are present or could be present.

This is based on the assumption that there is no real use case for writing [^a-z{ch}] or \P{Basic_Emoji}. If we find a good use case later, we could then propose allowing this combination and define its behavior.

As a workaround, someone could write [^a-z] and [\p{any}--\p{Basic_Emoji}].

1.5.a Error if strings are in fact present

A regex parser needs access to the UCD to resolve property expressions and build a functioning pattern matcher object. Therefore, it could reject the complement based on the actual contents of a property expression or nested character class.
[^a-z{ch}] // error
\P{Basic_Emoji} // error
[^\p{Basic_Emoji}--\p{RGI_Emoji}] // ok
[^\p{Basic_Emoji}--\p{Foo}] // currently resolves to zero strings, but could change in the future

1.5.b Error if strings may be present

We could support syntactic regex validation with access only to metadata including whether the domain of a property is single code points or includes strings. This would also ensure that a new Unicode version that changes which strings have which properties cannot cause an error for a regex that used to work. (Requirement: A property of code points will never become a property of strings.)
[^a-z{ch}] // error
\P{Basic_Emoji} // error
[^\p{Basic_Emoji}--\p{RGI_Emoji}] // error

It would be possible to allow the following, because set-theoretically it cannot contain strings, but we don’t want to have implementations do the match.
[^\p{Basic_Emoji}--[\p{Basic_Emoji}--\p{any}]]

Note: Any regex validation would want to check already whether the property expressions are well-formed and supported. Checking whether a property name is valid and supported can easily be expanded to also returning information about the domain of the property.

Validation Algorithm

Prerequisites:

  1. A validator needs to know which properties are supported, according to a list in the ECMAScript standard. (Regardless of whether properties of strings are supported.)
  2. For supported properties, a validator needs to know which ones are properties of strings, according to a list in the ECMAScript standard.
  3. If \N{...name...} is supported, then a validator needs to know the names in the Unicode character name space, and it needs to know which names represent multi-character strings.

Algorithm:

Note: Not every regex spec or implementation supports all of the following features. In particular, those prefixed with a * are relevant for UTS #18 but not supported (neither currently nor in pending proposals) in ECMAScript.

  1. \P would not be allowed for a property of strings
  2. ^ would not be allowed if the remainder of the character class may contain strings
  3. * \q{...} would be checked for whether it contains exactly one code point
  4. * \N{...name...} is a string if the name is in NamedSequences.txt
  5. A property of strings may contain strings.
  6. The result of union may contain strings if either operand may contain strings
  7. The result of subtraction may contain strings if the left operand may contain strings
  8. The result of intersection may contain strings if both operands may contain strings
  9. * The result of symmetric difference may contain strings if either operand may contain strings

1.6 Consensus

Proposal based on 1.5.b.
If we choose 1.5.b now, we could later “upgrade” to 1.5.a.
If we choose either 1.5 now, we could later “upgrade” to 1.4.

2 Full complement

One could implement a full complement as “all possible strings except for those in the input set/class”, and define syntax for it, so that it is possible to do real set operations where multi-character strings are useful.

However, we are confident that there is no use case for full complement that isn’t handled by subtraction.

@markusicu
Copy link
Collaborator Author

We have advanced to stage 2 with the draft spec text based on the consensus noted above.

@RunDevelopment
Copy link

Before I start: Is it ok to keep commenting on this issue or should I open a new issue?

I would like to ask whether to following solution for set operations for property of strings (POS) has been considered. It wasn't mentioned in the brainstorm doc, POS proposal

The idea:

The idea is to not only keep track of accepted strings (sequences of character with length >= 2) but also of rejected strings. Each (nested) character (class/ranges/set) will evaluate to a tuple (c, a, r) where c is the set of characters accepted, a is the set of strings accepted, and r is the set of strings rejected. a and r are disjoint of course.

Example: [a-z{ch}] evaluates to c=[a-z], a=["ch"], and r=[]

Negation is as simple as swapping accepted and rejected strings, as well as the usual set negation for characters.

Example: [^a-z{ch}] evaluates to c=[^a-z], a=[], and r=["ch"]

Union is defined as follows: (c1, a1, r1) ∪ (c2, a2, r2) := (c1 ∪ c2, a1 ∪ a2, r1 ∩ r2). Accpeted strings and characters as use a set union, rejected strings use a set intersection.

Intersection is defined as follows: (c1, a1, r1) ∩ (c2, a2, r2) := (c1 ∩ c2, a1 ∩ a2, r1 ∪ r2). Accpeted strings and characters as use a set intersection, rejected strings use a set union.

Set difference follows from the above operations: (c1, a1, r1) \ (c2, a2, r2) := (c1 \ c2, a1 ∩ r2, r1 ∪ a2).

What it solves:

The above idea solves the problem of set operations and property of strings. It defines set operations for POS that are compatible the "old" character set operations and consistent with itself.

What it doesn't solve:

I have no idea how e.g. [^a-z{ch}] = (c = [^a-z], a = [], r = ["ch"]) is to be converted into a regex.

One idea would be to ignore all rejected strings. This is pretty much the solution "1.4 Complement via subtraction" from above. The difference is that the subtraction of rejected strings only happens at top-level character class/set.

Consensus:

This idea is compatible with the reached consensus as mentioned in 1.6.

@markusicu
Copy link
Collaborator Author

Before I start: Is it ok to keep commenting on this issue or should I open a new issue?

We have spent a lot of time on this issue, and have considered it as settled.

What it doesn't solve:

I have no idea how e.g. [^a-z{ch}] = (c = [^a-z], a = [], r = ["ch"]) is to be converted into a regex.

This... would be deal-breaker, wouldn't it? :-)
Also, while this idea is interesting, I think it's even less intuitive than the other options that we considered.

@macchiati
Copy link
Collaborator

macchiati commented Jul 25, 2021 via email

@RunDevelopment
Copy link

I have no idea how e.g. [^a-z{ch}] = (c = [^a-z], a = [], r = ["ch"]) is to be converted into a regex.

This... would be deal-breaker, wouldn't it? :-)

Same thing for all the other approaches 1.3 to 1.5, so I did see this as a deal breaker.

I think it's even less intuitive than the other options that we considered.

I get that. Rejected strings are in this strange limbo where it's not really clear how they relate to the final regex. However, it is necessary to keep track of rejected strings in some form or another to achieve well-defined set operations.

I don't see how this solves anything that we hadn't solved before.

It solves set operations. The previously presented approaches 1.3-1.5 were unable to provide consistent set operations for strings (union, intersection, negation, difference). The idea I presented provides well-defined set operations for character sets and string sets.

This means that set operations within character classes can always be evaluated. No restrictions at all.
E.g. [\p{Basic_Emoji}--[{foo}{bar}]] and [^\P{Basic_Emoji}] can be evaluated without problems.

Is insensitivity where you think your approach advances the ball?

No, my approach isn't related to case insensitivity.

We have spent a lot of time on this issue, and have considered it as settled.

Ok. :)

In that case, I won't push further.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants