Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

to_ascii_uppercase and to_ascii_lowercase operate on non-ASCII characters #31203

Closed
DanielKeep opened this issue Jan 26, 2016 · 12 comments
Closed

Comments

@DanielKeep
Copy link
Contributor

Behold!

#[test]
fn liar_liar_pants_on_fire() {
    use std::ascii::AsciiExt;
    assert_eq!("café".to_ascii_uppercase(), "CAFÉ");
    assert_eq!("café".to_ascii_uppercase(), "CAFé");
    assert_eq!("CAFÉ".to_ascii_lowercase(), "café");
    assert_eq!("CAFÉ".to_ascii_lowercase(), "cafÉ");
}

This is obviously silly. The problem is that this is running an ASCII-only operation on Unicode strings without actually dealing with their Unicode-ness.

These functions should either correctly deal with grapheme clusters (by ignoring them since they're not in ASCII), or document that it does not correctly handle grapheme clusters, preferably with an example (like the above).

(Actually having a standard Ascii type would be even more muchly preferable, but I suspect that's way out of scope.)

@steveklabnik
Copy link
Member

This is the documented behavior: http://doc.rust-lang.org/std/ascii/trait.AsciiExt.html#tymethod.to_ascii_uppercase

/cc @rust-lang/libs

@frewsxcv
Copy link
Member

In case anyone is curious why this works: café is actually represented cafe\u{301}

https://en.wikipedia.org/wiki/Combining_character

@tbu-
Copy link
Contributor

tbu- commented Jan 27, 2016

@steveklabnik It would be nice if the documentation could mention that it operates on codepoints, not grapheme clusters.

@frewsxcv
Copy link
Member

Also, worth considering adding one of the examples above to go with it

@steveklabnik
Copy link
Member

@tbu- nothing in the standard library operates on grapheme clusters, though. I'm not sure what the right level of repeating this is.

Maybe @frewsxcv is right here, and an example is the best way for this particular function.

@tbu-
Copy link
Contributor

tbu- commented Jan 27, 2016

I think it's confusing that it talks about characters when it exhibits the weird behaviour mentioned above. Maybe it could just explicitly codepoints?

@steveklabnik
Copy link
Member

char is a unicode scalar value. So 'character' is always referring to a codepoint, not a grapheme.

@tbu-
Copy link
Contributor

tbu- commented Jan 27, 2016

Yes, it's confusing naming unfortunately. Maybe we could add these unexpected test cases to the documentation?

@DanielKeep
Copy link
Contributor Author

For what it's worth, although I'd prefer this method do "the right thing", simply having an example to make the behaviour with respect to combining code points clear would be a reasonable solution.

@bluss
Copy link
Member

bluss commented Jan 28, 2016

@steveklabnik Note that this function may transform "é" to "É", which is the "ascii violating" behavior

It needs pointing out in the doc. However, we don't need to mention grapheme clusters. It is application dependent which algorithms you use over your unicode data. Implying that grapheme clusters is always the right thing to do is misleading. Rust strings are brilliant as they are in providing minimal unicode consistency with low overhead.

@steveklabnik
Copy link
Member

Yes, I would also think that examples of this are a good idea.

@frewsxcv
Copy link
Member

frewsxcv commented Feb 4, 2016

I took a stab at this: #31401

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants