Add Unicode Annex 31 methods to `char` #2693

Open
wants to merge 7 commits into
base: master
from

6 participants
@notriddle
Contributor

commented Apr 24, 2019

Rendered

@notriddle notriddle changed the title Create 0000-char-uax-31.md Add Unicode Annex 31 methods to `char Apr 24, 2019

@notriddle notriddle changed the title Add Unicode Annex 31 methods to `char Add Unicode Annex 31 methods to `char` Apr 24, 2019

@clarfon

Contributor

commented Apr 25, 2019

I think this could probably be done as a PR to Rust directly rather than an RFC.


mibac138 and others added some commits Apr 25, 2019

Update text/0000-char-uax-31.md
Co-Authored-By: notriddle <michael@notriddle.com>
Update text/0000-char-uax-31.md
Co-Authored-By: notriddle <michael@notriddle.com>
@Manishearth
Member

left a comment

I'm very iffy on this. I don't see a strong enough motivation to include it in the stdlib, and I see very clear reasons to keep things that change with every Unicode version out of the stdlib. We moved unicode_segmentation out of tree for similar reasons, despite it being very useful in Unicode-aware string handling.

a standardized set of code point categories for defining computer language syntax.

This is being used in production Rust code already.
Rust's own compiler already has functions to check against Annex 31 code point categories in the lexer,

@Manishearth

Manishearth May 1, 2019

Member

This is for the old, unstable non_ascii_idents feature, which we have already RFC'd to change. The change will still need some function like this, but it may need tailoring for bidi characters. This is true for other implementors too: the XID functions often need tailoring.

They're not super hard to tailor, though, so we could still expose these functions to match the spec and let those who need tailoring just suffix the calls with `|| ch == ... && ch != ...`.
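
A rough sketch of what such tailoring might look like. Everything here is an assumption made for illustration: `xid_continue_base` is a stand-in for a spec-exact XID_Continue check (it only approximates it with `char::is_alphanumeric`), and the added `$` and excluded `µ` are a made-up profile, not the tailoring any real grammar uses. Note that `&&` binds tighter than `||` in Rust, so the additions need parentheses before the exclusions are applied.

```rust
/// Stand-in for a spec-exact XID_Continue check.
/// ASSUMPTION: this is only an approximation so the sketch compiles on
/// its own; it is not the real XID_Continue set.
fn xid_continue_base(ch: char) -> bool {
    ch.is_alphanumeric() || ch == '_'
}

/// Hypothetical tailored profile built on top of the base predicate:
/// additionally accept `$`, and reject U+00B5 MICRO SIGN.
fn is_ident_continue(ch: char) -> bool {
    // Parenthesize the additions first; without the parentheses the
    // exclusion would apply only to the `ch == '$'` arm.
    (xid_continue_base(ch) || ch == '$') && ch != '\u{00B5}'
}
```

With this profile, `is_ident_continue('$')` is accepted even though the base predicate rejects it, and `is_ident_continue('µ')` is rejected even though the base predicate accepts it.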

@notriddle

notriddle May 3, 2019

Author Contributor

The XID part is currently unstable. But stable Rust does respect Pattern_White_Space, so there's already committed-to Annex 31-based syntax. https://internals.rust-lang.org/t/do-we-need-unicode-whitespace/9876
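
For reference, a minimal sketch of a Pattern_White_Space check, with the code points written out by hand from the Unicode property list. The function name `is_pattern_white_space` is hypothetical and is not an existing std API.

```rust
/// Returns true for code points with the Unicode `Pattern_White_Space`
/// property. The name is hypothetical; the set is listed by hand.
fn is_pattern_white_space(ch: char) -> bool {
    matches!(
        ch,
        '\u{0009}'..='\u{000D}' // tab, LF, VT, FF, CR
            | '\u{0020}'        // space
            | '\u{0085}'        // next line
            | '\u{200E}'        // left-to-right mark
            | '\u{200F}'        // right-to-left mark
            | '\u{2028}'        // line separator
            | '\u{2029}'        // paragraph separator
    )
}
```

This set differs from what `char::is_whitespace` accepts: U+00A0 NO-BREAK SPACE is whitespace for `is_whitespace` but not Pattern_White_Space, while U+200E is Pattern_White_Space but not whitespace for `is_whitespace`.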


# Drawbacks
[drawbacks]: #drawbacks

@Manishearth

Manishearth May 1, 2019

Member

Matching unicode versions is also a big issue here. This function will be inaccurate half the time as we don't always update our data files immediately (it's not always straightforward). Even if we do, the behavior of this function will change every year, and while we don't have a guarantee on stdlib behavior stability, this does mean that older compilers will lead to different results on code that compiles. This further makes me feel like this should be a versioned crate.

is_whitespace already opened the doors to this issue, but I don't want to make it worse. is_whitespace is a small relatively stable list whereas XID expands all the time.
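
To make the versioning concern concrete, here is a small sketch of how a caller could at least detect which Unicode version the standard library's tables were built from. It assumes a toolchain recent enough to expose `std::char::UNICODE_VERSION` as a `(u8, u8, u8)` tuple; the helper names are hypothetical.

```rust
/// Unicode version backing the `char` methods of the toolchain this was
/// compiled with, as (major, minor, patch).
fn std_unicode_version() -> (u8, u8, u8) {
    std::char::UNICODE_VERSION
}

/// Hypothetical guard: does std's Unicode data reach at least the given
/// major.minor version? The answer depends on the compiler version,
/// which is exactly the instability described above.
fn std_data_at_least(major: u8, minor: u8) -> bool {
    let (std_major, std_minor, _) = std_unicode_version();
    // Tuples compare lexicographically: major first, then minor.
    (std_major, std_minor) >= (major, minor)
}
```

A versioned crate sidesteps this check entirely, since the Unicode version is then pinned by the crate version rather than by whichever compiler happens to build the code.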

@notriddle

notriddle May 3, 2019

Author Contributor

Agreed. Added it.

# Motivation
[motivation]: #motivation

As a systems language, Rust is heavily used for parsing.

@Manishearth

Manishearth May 1, 2019

Member

To me this reads as motivation for why Rust needs such functions somewhere, but inclusion in the stdlib is a much higher bar, especially when this gets us tangled up with unicode versioning.

Yes, we already have is_whitespace, but that's an old API that was grandfathered in, and it has fewer unicode stability issues than XID.

@Manishearth

Manishearth May 1, 2019

Member

One argument for it being in std is discoverability, but if you're parsing some grammar, that grammar will likely tell you about XID.


mzji and others added some commits May 3, 2019

Update text/0000-char-uax-31.md
Co-Authored-By: notriddle <michael@notriddle.com>