`to_ascii_uppercase` and `to_ascii_lowercase` operate on non-ASCII characters #31203

DanielKeep · 2016-01-26T04:05:16Z

Behold!

#[test]
fn liar_liar_pants_on_fire() {
    use std::ascii::AsciiExt;
    assert_eq!("café".to_ascii_uppercase(), "CAFÉ");
    assert_eq!("café".to_ascii_uppercase(), "CAFé");
    assert_eq!("CAFÉ".to_ascii_lowercase(), "café");
    assert_eq!("CAFÉ".to_ascii_lowercase(), "cafÉ");
}

This is obviously silly. The problem is that this is running an ASCII-only operation on Unicode strings without actually dealing with their Unicode-ness.

These functions should either correctly deal with grapheme clusters (by ignoring them, since they're not in ASCII), or document that they do not correctly handle grapheme clusters, preferably with an example (like the above).

(Actually having a standard Ascii type would be even more muchly preferable, but I suspect that's way out of scope.)

steveklabnik · 2016-01-26T14:13:49Z

This is the documented behavior: http://doc.rust-lang.org/std/ascii/trait.AsciiExt.html#tymethod.to_ascii_uppercase

/cc @rust-lang/libs

frewsxcv · 2016-01-26T15:59:10Z

In case anyone is curious why this works: café is actually represented as cafe\u{301}

https://en.wikipedia.org/wiki/Combining_character
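To make the two encodings visible, here is a minimal sketch that spells them out with explicit escapes instead of relying on how the editor renders them. The helper `code_points` is a hypothetical name introduced only for this illustration, and the example assumes a Rust version where `to_ascii_uppercase` is an inherent method on `str` (older versions need `use std::ascii::AsciiExt;`).

```rust
/// Hypothetical helper (not part of std): list the code points of `s`
/// as `U+XXXX` strings so combining characters become visible.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    // Decomposed form: ASCII 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    let decomposed = "cafe\u{301}";
    // Precomposed form: the single code point U+00E9.
    let precomposed = "caf\u{e9}";

    assert_eq!(
        code_points(decomposed),
        ["U+0063", "U+0061", "U+0066", "U+0065", "U+0301"]
    );
    assert_eq!(
        code_points(precomposed),
        ["U+0063", "U+0061", "U+0066", "U+00E9"]
    );

    // The ASCII 'e' in the decomposed form is uppercased; the combining
    // accent that follows it is left alone, so the result renders as "CAFÉ".
    assert_eq!(decomposed.to_ascii_uppercase(), "CAFE\u{301}");
    // The precomposed U+00E9 is not ASCII, so it is left unchanged.
    assert_eq!(precomposed.to_ascii_uppercase(), "CAF\u{e9}");
}
```

Both strings render identically, which is why the two assertions in the original test look contradictory.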

tbu- · 2016-01-27T14:17:48Z

@steveklabnik It would be nice if the documentation could mention that it operates on codepoints, not grapheme clusters.

frewsxcv · 2016-01-27T14:19:07Z

Also, it's worth considering adding one of the examples above to go with it.

steveklabnik · 2016-01-27T14:24:10Z

@tbu- nothing in the standard library operates on grapheme clusters, though. I'm not sure what the right level of repeating this is.

Maybe @frewsxcv is right here, and an example is the best way for this particular function.

tbu- · 2016-01-27T14:49:13Z

I think it's confusing that it talks about characters when it exhibits the weird behaviour mentioned above. Maybe it could just explicitly say codepoints?

steveklabnik · 2016-01-27T14:55:57Z

`char` is a Unicode scalar value. So 'character' always refers to a codepoint, not a grapheme.

tbu- · 2016-01-27T15:03:07Z

Yes, it's confusing naming unfortunately. Maybe we could add these unexpected test cases to the documentation?

DanielKeep · 2016-01-28T04:44:29Z

For what it's worth, although I'd prefer this method do "the right thing", simply having an example to make the behaviour with respect to combining code points clear would be a reasonable solution.

bluss · 2016-01-28T09:51:59Z

@steveklabnik Note that this function may transform "é" to "É", which is the "ascii violating" behavior

It needs pointing out in the doc. However, we don't need to mention grapheme clusters. Which algorithms you use over your Unicode data is application dependent. Implying that operating on grapheme clusters is always the right thing to do is misleading. Rust strings are brilliant as they are in providing minimal Unicode consistency with low overhead.
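As a rough illustration of how the choice of algorithm changes the result, the sketch below compares the ASCII-only method with the Unicode-aware `str::to_uppercase` on both encodings of "café". The helper `both_uppercases` is a hypothetical name used only here; neither result involves grapheme clusters.

```rust
/// Hypothetical helper (not part of std): returns the ASCII-only
/// uppercase and the Unicode-aware uppercase of `s`, in that order.
fn both_uppercases(s: &str) -> (String, String) {
    (s.to_ascii_uppercase(), s.to_uppercase())
}

fn main() {
    // Precomposed U+00E9: only the Unicode-aware mapping changes it.
    let (ascii, unicode) = both_uppercases("caf\u{e9}");
    assert_eq!(ascii, "CAF\u{e9}");
    assert_eq!(unicode, "CAF\u{c9}");

    // Decomposed 'e' + U+0301: both methods uppercase the ASCII 'e'
    // and leave the combining accent as it is.
    let (ascii, unicode) = both_uppercases("cafe\u{301}");
    assert_eq!(ascii, "CAFE\u{301}");
    assert_eq!(unicode, "CAFE\u{301}");
}
```

So the ASCII-only method gives visually different results for the two encodings, while the Unicode-aware one gives results that render the same but are still different code point sequences.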

steveklabnik · 2016-01-28T15:17:17Z

Yes, I would also think that examples of this are a good idea.

frewsxcv · 2016-02-04T05:12:49Z

I took a stab at this: #31401

Fixes rust-lang#31203

steveklabnik added the A-libs label Jan 26, 2016

frewsxcv mentioned this issue Feb 4, 2016

Clarify scenario where AsciiExt appears to operate on non-ASCII #31401

Merged

frewsxcv added a commit to frewsxcv/rust that referenced this issue Feb 4, 2016

Clarify scenario where AsciiExt appears to operate on non-ASCII

93d6425

Fixes rust-lang#31203

steveklabnik added a commit to steveklabnik/rust that referenced this issue Feb 4, 2016

Rollup merge of rust-lang#31401 - frewsxcv:clarify-ascii, r=steveklabnik

73db842

Fixes rust-lang#31203

bors closed this as completed in #31401 Feb 5, 2016
