
Editorial: Fix incorrect use of UnicodeMatchPropertyValue #3587


Open: wants to merge 12 commits into base: main

Conversation

gibson042
Contributor

Fixes #3586

Also includes commits with incidental fixes in nearby algorithms and steps.

michaelficarra and others added 2 commits May 5, 2025 16:41
@Jack-Works
Member

I have another question. If you search for "scx" in https://unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt, you will find that there is no record for it. And according to the spec,

UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue

It is a Syntax Error if the source text matched by UnicodePropertyValue is not a property value or property value alias for the Unicode property or property alias given by the source text matched by UnicodePropertyName listed in PropertyValueAliases.txt.

So every RegExp of the form /\p{Script_Extensions=Anything}/u (e.g., /\p{Script_Extensions=Zanabazar_Square}/u) would be an early error, but that is not how implementations behave.
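The mismatch can be checked directly in an engine. The snippet below is a minimal sketch, assuming a runtime with Unicode property escape support (for example a recent Node.js); `compiles` is a hypothetical helper name, not something from the spec or the PR.

```javascript
// Hypothetical helper: true when the pattern compiles with the `u` flag,
// false when the engine rejects it with a SyntaxError.
function compiles(source) {
  try {
    new RegExp(source, "u");
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected
  }
}

// Long and short forms of the property name and value from the comment above.
console.log(compiles("\\p{Script_Extensions=Zanabazar_Square}"));
console.log(compiles("\\p{scx=Zanb}"));
// A value that is not a Script value or alias, for contrast.
console.log(compiles("\\p{Script_Extensions=NotAScript}"));
```

Engines resolve the value against the aliases for Script, which is why the first two patterns are accepted even though PropertyValueAliases.txt has no scx rows.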

@gibson042
Contributor Author

Nice observation, @Jack-Works! The Unicode property Script_Extensions (scx) is unusual in being set-valued rather than scalar-valued, and as such needs special consideration in our spec. I have added editorial corrections to this PR, and opened #3590 for a potential followup.
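To make "set-valued" concrete: U+30FC (KATAKANA-HIRAGANA PROLONGED SOUND MARK) has the single Script value Common, while its Script_Extensions value is the set {Hiragana, Katakana}. A small sketch, where `scriptsMatching` is a hypothetical helper introduced only for illustration:

```javascript
// Hypothetical helper: which of the candidate script names match `ch`
// under the given property ("Script" or "Script_Extensions")?
function scriptsMatching(property, ch, candidates) {
  return candidates.filter((name) =>
    new RegExp(`\\p{${property}=${name}}`, "u").test(ch)
  );
}

const mark = "\u30FC"; // KATAKANA-HIRAGANA PROLONGED SOUND MARK
const candidates = ["Common", "Hiragana", "Katakana"];

console.log(scriptsMatching("Script", mark, candidates));
console.log(scriptsMatching("Script_Extensions", mark, candidates));
```

Script yields exactly one match per code point, whereas Script_Extensions can yield several, so a value given in the pattern has to be tested for membership rather than equality.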

1. If _p_ is `Script_Extensions`, then
  1. Assert: _vs_ is a property value or property value alias for property “Script” listed in <a href="https://unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt"><code>PropertyValueAliases.txt</code></a>.
  1. Let _v_ be the Set containing the “short name”, “long name”, and any other aliases corresponding with value _vs_ for property “Script” in <a href="https://unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt"><code>PropertyValueAliases.txt</code></a>.
  1. Return the CharSet containing all Unicode code points whose character database definition includes the property “Script_Extensions” with value having a non-empty intersection with _v_.
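The steps quoted above can be sketched in JavaScript as follows. This is an illustration under stated assumptions, not the spec algorithm itself: the two tables are tiny hand-written stand-ins for the Unicode data files, and `scriptExtensionsCharSet` is a hypothetical name.

```javascript
// Assumption: truncated stand-in for the "Script" rows of
// PropertyValueAliases.txt, as [short name, long name].
const SCRIPT_VALUE_ALIASES = [
  ["Hira", "Hiragana"],
  ["Kana", "Katakana"],
  ["Latn", "Latin"],
];

// Assumption: truncated stand-in for per-code-point Script_Extensions
// values, each one a set of script short names.
const SCRIPT_EXTENSIONS = new Map([
  [0x0041, ["Latn"]],         // LATIN CAPITAL LETTER A
  [0x3042, ["Hira"]],         // HIRAGANA LETTER A
  [0x30a2, ["Kana"]],         // KATAKANA LETTER A
  [0x30fc, ["Hira", "Kana"]], // KATAKANA-HIRAGANA PROLONGED SOUND MARK
]);

function scriptExtensionsCharSet(vs) {
  // "Assert: vs is a property value or alias for Script."
  // The sketch throws instead of asserting so that misuse is visible.
  const row = SCRIPT_VALUE_ALIASES.find((aliases) => aliases.includes(vs));
  if (row === undefined) throw new Error(`not a Script value: ${vs}`);

  // "Let v be the Set containing the short name, long name, and aliases."
  const v = new Set(row);

  // "Return the CharSet of code points whose Script_Extensions value
  // has a non-empty intersection with v."
  const charSet = new Set();
  for (const [codePoint, scx] of SCRIPT_EXTENSIONS) {
    if (scx.some((name) => v.has(name))) charSet.add(codePoint);
  }
  return charSet;
}

// Both spellings select the same code points, including U+30FC.
console.log([...scriptExtensionsCharSet("Hiragana")].map((cp) => cp.toString(16)));
console.log([...scriptExtensionsCharSet("Hira")].map((cp) => cp.toString(16)));
```

Collecting every alias into _v_ first is what lets the intersection test succeed regardless of which spelling the data source or the pattern uses.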
Member


Do you need to call MaybeSimpleCaseFolding here?

Contributor Author


Hmm, that would affect any code point that case-folds across script (or to/from Common). I don't know if there are any, but it's easy enough to accommodate. Done.
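One candidate for such a code point is U+00B5 (MICRO SIGN): its Script is Common, and its simple case folding is U+03BC (GREEK SMALL LETTER MU), whose Script is Greek. The sketch below probes how an engine treats it with and without the `i` flag. `matchesGreek` is a hypothetical helper, and the `u` flag is used here only as an assumption for portability.

```javascript
// Hypothetical helper: does `ch` match \p{Script_Extensions=Greek}
// under the given RegExp flags?
function matchesGreek(ch, flags) {
  return new RegExp("\\p{Script_Extensions=Greek}", flags).test(ch);
}

const micro = "\u00B5"; // MICRO SIGN, Script=Common
const mu = "\u03BC";    // GREEK SMALL LETTER MU, Script=Greek

console.log(matchesGreek(mu, "u"));     // the Greek letter itself
console.log(matchesGreek(micro, "u"));  // case-sensitive: Common, not Greek
console.log(matchesGreek(micro, "iu")); // case-insensitive: folds to U+03BC
```

Under the specified case-insensitive matching, the last call is expected to differ from the second, which is the kind of cross-script folding the question is about.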

…cript_Extensions
Development

Successfully merging this pull request may close these issues.

UnicodeMatchPropertyValue assertion fails by UnicodePropertyValueExpression
5 participants