Corrected very minor documentation detail about Unicode and Japanese #40499

ghost · 2017-03-14T01:33:29Z

Japanese half-width and full-width romaji characters do have upper and lowercase according to Unicode (but other Japanese characters do not). For example,
assert_eq!('\u{FF21}'.to_lowercase().collect::<String>(),"\u{FF41}");

r? @steveklabnik
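
A slightly fuller, runnable sketch of the same check follows. The helper name `lower` is made up for this illustration and is not part of the standard library; only `char::to_lowercase` and `char::is_uppercase` are real `std` methods.

```rust
/// Hypothetical helper (not part of std): lowercase one `char` into a `String`.
fn lower(c: char) -> String {
    c.to_lowercase().collect()
}

fn main() {
    // U+FF21 FULLWIDTH LATIN CAPITAL LETTER A lowercases to
    // U+FF41 FULLWIDTH LATIN SMALL LETTER A.
    assert_eq!(lower('\u{FF21}'), "\u{FF41}");
    assert!('\u{FF21}'.is_uppercase());

    // Half-width romaji are the ordinary ASCII letters.
    assert_eq!(lower('A'), "a");

    // A kanji has no case mapping and comes back unchanged.
    assert_eq!(lower('山'), "山");
}
```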

rust-highfive · 2017-03-14T01:33:35Z

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @steveklabnik (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

steveklabnik · 2017-03-14T02:02:56Z

Interesting, I've never heard of this.

I'm gonna look into it tomorrow but if anyone else wants to r+ before then, feel free.

ghost · 2017-03-14T03:46:18Z

The wiki page Unicode Equivalence under the subtitle 'Typographic Conventions' has some more details.

nagisa · 2017-03-14T07:45:02Z

FULL WIDTH LATIN {SMALL,CAPITAL} LETTER A is still a Latin letter from the Latin script. One can attribute exactly 2 scripts to the Japanese writing system: kanji and kana. Neither of those has case, and therefore the previous statement is just fine.

Now, I'm totally fine with making a change like this, but attributing logographs used in the whole CJK to Japanese seems... Unfair I guess?

How about we just use a kana (これ) instead of the current kanji for the example?
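
To make the kana and kanji point concrete, here is a minimal sketch. The helper `unchanged_by_case_conversion` is a hypothetical name introduced only for this example; the checks themselves use the real `char::to_lowercase`, `char::to_uppercase`, `char::is_lowercase`, and `char::is_uppercase`.

```rust
/// Hypothetical helper: true if neither case conversion changes `c`.
fn unchanged_by_case_conversion(c: char) -> bool {
    let s = c.to_string();
    c.to_lowercase().to_string() == s && c.to_uppercase().to_string() == s
}

fn main() {
    // Two hiragana and one kanji: none of them is cased.
    for c in "これ山".chars() {
        assert!(!c.is_lowercase() && !c.is_uppercase());
        assert!(unchanged_by_case_conversion(c));
    }

    // The full-width Latin capital A, by contrast, does change.
    assert!(!unchanged_by_case_conversion('\u{FF21}'));
}
```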

steveklabnik · 2017-03-14T15:05:03Z

How about we just use a kana (これ) instead of the current kanji for the example?

Sounds good to me.

ghost · 2017-03-14T15:36:13Z

One can attribute exactly 2 scripts to the Japanese writing system: kanji and kana.

It's not that cut and dried. Unicode is hard because we are dealing with human languages in all their complexity. By changing the documentation from 'Japanese' to 'Japanese kanji' we can avoid that complexity.

How about we just use a kana (これ) instead of the current kanji for the example?

I can't see the value in changing from kanji to hiragana, it doesn't change anything. Anyway, 山 is a nice character.

mzji · 2017-03-14T18:32:12Z

A small suggestion: how about using "CJK characters" (or "CJKV characters"?) instead of "Japanese kanji characters"? These characters are widely used in Chinese, Japanese, and Korean (and Vietnamese), not only Japanese.

ghost · 2017-03-14T23:50:30Z

How about

/// // Characters that do not have both uppercase and lowercase
/// // convert into themselves.
/// assert_eq!('山'.to_lowercase().to_string(), "山");

?

mzji · 2017-03-15T00:11:20Z

How about

/// // Characters that do not have both uppercase and lowercase
/// // convert into themselves.
/// assert_eq!('山'.to_lowercase().to_string(), "山");

?

Looks good.

steveklabnik · 2017-03-15T14:20:37Z

@bors: r+ rollup

Thanks!

bors · 2017-03-15T14:20:38Z

📌 Commit 18a8494 has been approved by steveklabnik

Rollup of 23 pull requests - Successful merges: #40387, #40433, #40452, #40456, #40457, #40458, #40463, #40466, #40467, #40495, #40496, #40497, #40499, #40500, #40503, #40505, #40512, #40514, #40517, #40520, #40536, #40545, #40586 - Failed merges:

nodakai · 2017-05-12T06:35:56Z

Late to the party, but is this a valid explanation of 'ﬀ'.to_uppercase() yielding "FF"?

FB00;LATIN SMALL LIGATURE FF;Ll;0;L;<compat> 0066 0066;;;;N;;;;;

FB00; FB00; 0046 0066; 0046 0046; # LATIN SMALL LIGATURE FF

That is, there's no uppercase ligature FF in Unicode (to be clear, I'm concerned about the wording "do not have both uppercase and lowercase".)

Almost the same applies to 'ß'.to_uppercase() yielding "SS":

00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;
...
1E9E;LATIN CAPITAL LETTER SHARP S;Lu;0;L;;;;;N;;;;00DF;

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S

(Note the asymmetry here --- the uppercase eszett ẞ is non-orthographic in modern German)
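
For reference, the behaviour described above can be checked with a short sketch like the following. The helper `upper` is a made-up name for this example; the assertions rely only on `char::to_uppercase` and `char::to_lowercase` from `std`.

```rust
/// Hypothetical helper (not part of std): uppercase one `char` into a `String`.
fn upper(c: char) -> String {
    c.to_uppercase().collect()
}

fn main() {
    // U+FB00 LATIN SMALL LIGATURE FF expands to two code points.
    assert_eq!(upper('\u{FB00}'), "FF");
    assert_eq!(upper('\u{FB00}').chars().count(), 2);

    // U+00DF LATIN SMALL LETTER SHARP S also expands to two code points.
    assert_eq!(upper('ß'), "SS");

    // The asymmetry: U+1E9E lowercases to U+00DF, but U+00DF does not
    // uppercase back to U+1E9E.
    assert_eq!('\u{1E9E}'.to_lowercase().to_string(), "ß");
    assert_ne!(upper('ß'), "\u{1E9E}");
}
```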

nagisa · 2017-05-12T06:47:12Z

Well, this explanation is discussing the case-less characters. Both of these ligatures are caseful; it is just the case of unicode having no assigned codepoint for the uppercase variant of the ligatures you’ve given as an example.

nodakai · 2017-05-14T06:33:39Z

@nagisa

Well, this explanation is discussing the case-less characters.

First, that assumption isn't evident from the text. Second, it isn't a good idea to focus on the "caseful/caseless" dichotomy, because the input being caseful is only a necessary condition for any of the casing conversions to be defined. E.g. ȷ is a Lowercase Letter (Ll) without an uppercase version:

0237;LATIN SMALL LETTER DOTLESS J;Ll;0;L;;;;;N;;;;;

I think all we can say is

When uppercase conversion isn't defined for the input character in Unicode, it is returned as-is.

Wdyt?
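
A small sketch of that proposed wording in action, assuming a hypothetical helper `uppercase_is_identity` (not a `std` function) that reports whether `char::to_uppercase` returns the input unchanged:

```rust
/// Hypothetical helper: true if uppercasing `c` yields `c` itself.
fn uppercase_is_identity(c: char) -> bool {
    c.to_uppercase().to_string() == c.to_string()
}

fn main() {
    // U+0237 LATIN SMALL LETTER DOTLESS J is a lowercase letter...
    assert!('\u{0237}'.is_lowercase());
    // ...but has no uppercase mapping, so it is returned as-is.
    assert!(uppercase_is_identity('\u{0237}'));

    // A caseless character is returned as-is too.
    assert!(uppercase_is_identity('山'));

    // Where a conversion is defined, the result differs from the input.
    assert!(!uppercase_is_identity('ß'));
    assert!(!uppercase_is_identity('\u{FB00}'));
}
```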

it is just the case of unicode having no assigned codepoint for the uppercase variant of the ligatures you’ve given as an example.

So... you're actually supporting my claim, right? They "do not have both uppercase and lowercase" and yet don't "convert into themselves."

nagisa · 2017-05-14T07:45:31Z

So... you're actually supporting my claim, right? They "do not have both uppercase and lowercase" and yet don't "convert into themselves."

In my comment I’ve very purposefully used “character”(1) to mean a real character used in a language out there somewhere and “code point”(2) to mean an assigned code point in Unicode.

That is, what I’m really saying is that this text should be (and, I think, currently is, due to its use of the word “character”) discussing real-world characters. I’m very open to improving the wording and/or making it more obvious.

As per the Unicode glossary:

(1): The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding.
(2): Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. Not all code points are assigned to encoded characters.

Corrected very minor documentation detail about Unicode and Japanese

5b7f330

rust-highfive assigned steveklabnik Mar 14, 2017

Ammended minor documentation detail abour Unicode cases.

18a8494

frewsxcv mentioned this pull request Mar 17, 2017

Rollup of 26 pull requests #40591

Closed

frewsxcv mentioned this pull request Mar 17, 2017

Rollup of 26 pull requests #40592

Closed

frewsxcv mentioned this pull request Mar 17, 2017

Rollup of 26 pull requests #40595

Closed

frewsxcv mentioned this pull request Mar 17, 2017

Rollup of 26 pull requests #40596

Closed

frewsxcv mentioned this pull request Mar 17, 2017

Rollup of 23 pull requests #40598

Merged

bors merged commit 18a8494 into rust-lang:master Mar 17, 2017
