Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add u16::is_utf16_surrogate #94713

Merged
merged 1 commit into from
Mar 23, 2022
Merged

Conversation

clarfonthey
Copy link
Contributor

@clarfonthey clarfonthey commented Mar 7, 2022

Right now, there are methods in the standard library for encoding and decoding UTF-16, but at least for the moment, there aren't any methods specifically for u16 to help work with UTF-16 data. Since the full logic already exists, this wouldn't really add any code, just expose what's already there.

This method in particular is useful for working with the data returned by Windows OsStrExt::encode_wide. Initially, I was planning to also offer a TryFrom<u16> for char, but decided against it for now. There is plenty of code in rustc that could be rewritten to use this method, but I only checked within the standard library to replace them.

I think that offering more UTF-16-related methods to u16 would be useful, but I think this one is a good start. For example, one useful method might be u16::is_pattern_whitespace, which would check if something is the Unicode Pattern_Whitespace category. We can get away with this because all of the Pattern_Whitespace characters are in the basic multilingual plane, and hence we don't need to check for surrogates.

@rust-highfive
Copy link
Collaborator

r? @scottmcm

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Mar 7, 2022
@clarfonthey clarfonthey force-pushed the is_char_surrogate branch 2 times, most recently from 39c0ae4 to 3b3117b Compare March 7, 2022 21:53
@rust-log-analyzer

This comment has been minimized.

@ChrisDenton
Copy link
Member

We can get away with this because all of the whitespace characters are in the basic multilingual plane, and hence we don't need to check for surrogates

Is that guaranteed or could that change in the future?

@scottmcm
Copy link
Member

scottmcm commented Mar 8, 2022

We can get away with this because all of the whitespace characters are in the basic multilingual plane, and hence we don't need to check for surrogates

I agree with @ChrisDenton here -- this doesn't seem like a safe assumption to bake into an API. People can convert it to a char themselves if they want to encode that assumption.

But it doesn't need to block this PR.

#[unstable(feature = "is_char_surrogate", issue = "none")]
#[rustc_const_unstable(feature = "is_char_surrogate", issue = "none")]
#[inline]
pub const fn is_char_surrogate(self) -> bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given all the _ascii_ methods on u8, this seems like a reasonable thing to add.

However, I think it needs a different name. A char can never be a surrogate, so mentioning it seems wrong.

I think the _ascii_ names and from_utf8 and such mean that it should mention the encoding it's using, not a datatype.

So how about this?

Suggested change
pub const fn is_char_surrogate(self) -> bool {
pub const fn is_utf16_surrogate(self) -> bool {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right; I was originally thinking of going with is_unicode_surrogate but decided it was too long.

My main apprehension with calling it utf16 surrogate is it implies that it's specific to UTF-16, which it actually isn't; it's specific to Unicode, even though its inclusion in Unicode is specific for UTF-16. I will agree that the name is better than the original, though.

@@ -91,7 +91,7 @@ impl<I: Iterator<Item = u16>> Iterator for DecodeUtf16<I> {
None => self.iter.next()?,
};

if u < 0xD800 || 0xDFFF < u {
if !u.is_char_surrogate() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this makes me think that the API might want to be -> Option<bool> or something to distinguish high/low surrogates.

This code looks like it would be better written with exhaustive range patterns today (they almost certainly didn't exist when it was written, though). So the nice code change, to me, would be one that could match c.tell_me_about_surrogate_ness() instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I was debating on how exactly to introduce these APIs, but what I was thinking is a good logical step is some form of check whether a surrogate is a low or high surrogate, or a way to combine surrogates into a code point. But at least that last point is mostly covered by decode_utf16, so, I wasn't sure what extent was useful.

Either way, there is definitely room for expansion.

@clarfonthey
Copy link
Contributor Author

clarfonthey commented Mar 8, 2022

We can get away with this because all of the whitespace characters are in the basic multilingual plane, and hence we don't need to check for surrogates

Is that guaranteed or could that change in the future?

So, I was partially right; although the White_Space property is open to expansion despite how unlikely this is, Pattern_White_Space (which is larger than just ASCII whitespace, but smaller than general White_Space) is guaranteed by the standard to never include future characters, and is strictly within the BMP. So, this could be a reasonable addition in the future.

@Dylan-DPC Dylan-DPC added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 8, 2022
@clarfonthey clarfonthey changed the title Add u16::is_char_surrogate Add u16::is_utf16_surrogate Mar 13, 2022
@clarfonthey
Copy link
Contributor Author

Renamed, rebased, fixed build errors (hopefully), and opened a tracking issue since this seems desired.

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@clarfonthey
Copy link
Contributor Author

@rustbot ready

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Mar 22, 2022
@scottmcm
Copy link
Member

This looks good for nightly to me!

@bors r+

@bors
Copy link
Contributor

bors commented Mar 22, 2022

📌 Commit d580367 has been approved by scottmcm

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 22, 2022
bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 23, 2022
Rollup of 6 pull requests

Successful merges:

 - rust-lang#91608 (Fold aarch64 feature +fp into +neon)
 - rust-lang#92955 (add perf side effect docs to `Iterator::cloned()`)
 - rust-lang#94713 (Add u16::is_utf16_surrogate)
 - rust-lang#95212 (Replace `this.clone()` with `this.create_snapshot_for_diagnostic()`)
 - rust-lang#95219 (Modernize `alloc-no-oom-handling` test)
 - rust-lang#95222 (interpret/validity: improve clarity)

Failed merges:

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit 25acd93 into rust-lang:master Mar 23, 2022
@rustbot rustbot added this to the 1.61.0 milestone Mar 23, 2022
@clarfonthey clarfonthey deleted the is_char_surrogate branch April 16, 2022 00:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants