Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

\pC is not accepted #466

Closed
CAD97 opened this issue Apr 16, 2018 · 3 comments
Closed

\pC is not accepted #466

CAD97 opened this issue Apr 16, 2018 · 3 comments
Labels

Comments

@CAD97
Copy link
Contributor

CAD97 commented Apr 16, 2018

Long form \p{Other} is accepted.

C is not what I think of for General_Category=Other, but it's the official short value. Since regex claims to support the short values for the General_Category sets, \pC should probably also be supported.

@BurntSushi
Copy link
Member

Interesting! There's probably a silly bug somewhere, since that alias is in the Unicode alias table.

Other aliases, like \pZ, work: https://play.rust-lang.org/?gist=ab7ba8ea46f953129e8c926f7449f846&version=stable

@BurntSushi BurntSushi added the bug label Apr 16, 2018
BurntSushi added a commit to BurntSushi/ucd-generate that referenced this issue Apr 28, 2018
This commit fixes a bug where 'isc' was canonicalized to 'c'. 'isc' is an
alias for 'ISO_Comment', but the 'is' prefix was being dropped since
canonicalization permits ignoring 'is' prefixes when designating property
names.

This is the root cause of a bug in the regex library:
rust-lang/regex#466
@CAD97
Copy link
Contributor Author

CAD97 commented Apr 28, 2018

....wow. That was, indeed, a silly bug.

UAX44-LM3. Ignore case, whitespace, underscore ('_'), hyphens, and any initial prefix string "is".

  • "linebreak" is equivalent to "Line_Break" or "Line-break"
  • "lb=BA" is equivalent to "lb=ba" or "LB=BA"
  • "Script=Greek" is equivalent to "Script=isGreek" or "Script=Is_Greek"

[...]

Existing and future property aliases and property value aliases are guaranteed to be unique within their relevant namespaces, even if an initial prefix string "is" is ignored. The existing cases of note for aliases that do start with "is" are: dt=Iso (Decomposition_Type=Isolated) and lb=IS. The Decomposition_Type value alias does not cause any problem, because there is no contrasting value alias dt=o (Decomposition_Type=olated). For lb=IS, note that the "IS" is the entire property value alias, and is not a prefix. There is no null value for the Line_Break property for it to contrast with, but implementations of loose matching should be careful of this edge case, so that "lb=IS" is not misinterpreted as matching a null value.

Should we write the Unicode Consortium and propose a revision to this text to mention ISO_COMMENT (isc)? Or is this technically a non-issue for a rigorous application of UTS18 Unicode Regular Expressions because ISO_COMMENT is a non-binary property?

UTS18-1.2 Properties

[...]

Some properties are binary: they are either true or false for a given code point. In that case, only the property name is required. Others have multiple values, so for uniqueness both the property name and the property value need to be included.

For example, Alphabetic is a binary property, but it is also a value of the enumerated Line_Break property. So \p{Alphabetic} would refer to the binary property, whereas \p{Line Break:Alphabetic} or \p{Line_Break=Alphabetic} would refer to the enumerated Line_Break property.

There are two exceptions to the general rule that expressions involving properties with multiple value should include both the property name and property value. The Script and General_Category properties commonly have their property name omitted. Thus \p{Unassigned} is equivalent to \p{General_Category = Unassigned}, and \p{Greek} is equivalent to \p{Script:Greek}.

According to my reading of the spec, the regex \p{isc} should resolve to p{General_Category=Other}

First, resolve isc. By looking at PropertyAliases.txt, we see that isc is equivalent to the property ISO_COMMENT. ISO_COMMENT is not a binary property; ISO_COMMENT is a Miscellaneous Property. Therefore, \p{isc} does not mean \p{ISO_COMMENT=true}. Thus, we check to see if \p{Script=isc} or \p{General_Category=isc} is accepted. By looking at PropertyValueAliases.txt, and ignoring the prefix is as per UAX44-LM3, we see that c is an alias for the value gc=Other. Thus, \p{isc} resolves to the canonical(ish) form of \p{General_Category=Other}.

That the alias c for ISO_COMMENT stopped the check advancing to General_Category=c is its own bug. If \p{isc} doesn't resolve to \p{General_Category=Other}, I'm submitting another bug on principle. Note that \p{gc=isc} currently does resolve to \p{General_Category=Other}, and \p{isp} resolves to \p{General_Category=Punctuation}. Note also that \p{ISO_COMMENT} is correctly rejected.

[TANGENT]
.... And while testing these I noticed that you handle \p{XID_START}, \p{XID_CONTINUE}, and \p{WHITE_SPACE}. Given that I'm currently assembling sets manually for those, I should probably stop doing that and just use the property reference. Sure, it's nice to control what version of Unicode my grammar rules are against, but I'm not adjusting UAX31 and am already using GENERAL_CATEGORY properties from regex, so I'm already just using "a Unicode standard" rather than a specific one.

Actually, what's your expected policy for tracking new Unicode standards as they release? Things like XID_START are bound by stability guarantees, but WHITE_SPACE isn't, and I think that leads to some "fun" C# programs that work differently depending on your compiler & system Unicode settings.
[/TANGENT]

@BurntSushi
Copy link
Member

I'm not sure what the right way to interpret the Unicode standard is for this particular case is. I certainly do not have the bandwidth to follow up with Unicode proper. :-)

Actually, what's your expected policy for tracking new Unicode standards as they release?

My plan is to just update all of the tables. I don't think we've generally considered updates to Unicode to be a breaking change in the Rust ecosystem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants