unicode: add CategoryAliases, LC, Cn #70780
I implemented this, and there are a few additions. The proposal is now:
Change https://go.dev/cl/641395 mentions this issue: |
Change https://go.dev/cl/641376 mentions this issue: |
Change https://go.dev/cl/641377 mentions this issue: |
This proposal has been added to the active column of the proposals project |
Could there be any compatibility issues with new Unicode versions? Dropped or renamed or changed aliases? Will regexp then use the map? Edit: The changes to regexp are at #70781. |
In general, Unicode data is subject to change as Unicode changes. That said, I don't expect aliases to be deleted from the list. (We've seen them change the category of an individual code point in the past, but even that is rare.) |
Have all remaining concerns about this proposal been addressed? The proposal is to add:
Based on the discussion above, this proposal seems like a likely accept. The proposal is to add:
The C table is expanded to include unassigned code points (as it should have had from the start). |
No change in consensus, so accepted. 🎉 The proposal is to add:
The C table is expanded to include unassigned code points (as it should have had from the start). |
CategoryAliases is for regexp to use, for things like \p{Letter} as an alias for \p{L}. Cn and LC are special-case categories that were never implemented but should have been.

For golang/go#70780.

Change-Id: I1401c1be42106a0ebecabb085c25e97485c363cf
Reviewed-on: https://go-review.googlesource.com/c/text/+/641395
Auto-Submit: Russ Cox <rsc@golang.org>
Reviewed-by: Marcel van Lohuizen <mpvl@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
The Unicode specification defines aliases for some of the general category names. For example the category "L" has alias "Letter". The regexp package supports \p{L} but not \p{Letter}, because there was nothing in the Unicode tables that lets regexp know about Letter. Now that package unicode provides CategoryAliases (see #70780), we can use it to provide \p{Letter} as well.

This is the only feature missing from making package regexp suitable for use in a JSON-API Schema implementation. (The official test suite includes usage of aliases like \p{Letter} instead of \p{L}.)

For better conformity with Unicode TR18, also accept case-insensitive matches for names and ignore underscores, hyphens, and spaces; and add Any, ASCII, and Assigned.

Fixes #70781.

Change-Id: I50ff024d99255338fa8d92663881acb47f1e92a5
Reviewed-on: https://go-review.googlesource.com/c/go/+/641377
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Alan Donovan <adonovan@google.com>
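The loose name matching described in the commit message above (case-insensitive, ignoring underscores, hyphens, and spaces) can be illustrated with a small sketch. The helper `normalizeName` is a hypothetical name chosen for this example; it is not taken from the regexp package, whose actual implementation may differ.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalizeName is a hypothetical helper (not the regexp package's code).
// It lowercases a property name and drops '_', '-', and ' ' so that
// loosely equivalent spellings compare equal.
func normalizeName(name string) string {
	var b strings.Builder
	for _, r := range name {
		if r == '_' || r == '-' || r == ' ' {
			continue // separators are ignored under loose matching
		}
		b.WriteRune(unicode.ToLower(r))
	}
	return b.String()
}

func main() {
	spellings := []string{"Uppercase_Letter", "uppercase letter", "UPPERCASE-LETTER"}
	for _, s := range spellings {
		fmt.Printf("%q -> %q\n", s, normalizeName(s))
	}
}
```

Under this sketch, all three spellings reduce to the same key, so a lookup table keyed by normalized names would treat them as one property name.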
The Unicode specification defines aliases for some of the general category names. For example the category "L" has alias "Letter".
The regexp package supports \p{L} but not \p{Letter}, because there is nothing in the Unicode tables that lets regexp know about Letter.
In order to support \p{Letter}, I propose to add a new, small table to unicode,
This would be auto-generated from the Unicode database like all our other tables. For Unicode 15, the table would have only 38 entries, listed below.