Update Unicode tables to 9.0 #34599

cuviper · 2016-07-01T18:07:52Z

I just updated unicode.py's generated copyright year, then ran it.

rust-highfive · 2016-07-01T18:08:00Z

r? @eddyb

(rust_highfive has picked a reviewer for you, use r? to override)

eddyb · 2016-07-01T18:38:36Z

r? @alexcrichton

alexcrichton · 2016-07-01T19:03:55Z

cc @SimonSapin, @rust-lang/libs

Thoughts on the backwards compatibility implications of a change like this? This seems like something we'd want to do although if it has bad implications we may just want to think through it.

cuviper · 2016-07-01T19:15:59Z

For reference, here are the official Unicode 9.0.0 changes, and the Migration section in particular should help evaluate compatibility. I don't know enough about how these properties are used to answer myself.

brson · 2016-07-01T20:14:13Z

When we've discussed unicode and compatibility in the past, I recall we've leaned toward giving ourselves leeway to upgrade. @cuviper based on the unicode changelog do you know what impacts this has on specific rust functions? If this makes e.g. changes to Unicode identifiers (which it looks like it does) that impacts the Rust language definition.

cuviper · 2016-07-01T21:01:04Z

@brson Yes, there are changes to the XID tables. I believe these are mostly additions for the new scripts, but I'm not sure of that. The UAX #31 Migration talks about changing the formal definitions of ID/XID, which isn't clear to me either, but I think it's just changing emphasis.
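
One way to check the "mostly additions" guess would be to diff the old and new range tables. The following is only a minimal sketch: `expand` and `table_diff` are hypothetical helpers, not part of unicode.py, and the two tables are toy values rather than the real XID data.

```python
def expand(ranges):
    """Flatten inclusive (lo, hi) ranges into a set of code points."""
    return {cp for lo, hi in ranges for cp in range(lo, hi + 1)}


def table_diff(old_ranges, new_ranges):
    """Return (added, removed) code point sets between two range tables."""
    old, new = expand(old_ranges), expand(new_ranges)
    return new - old, old - new


# Toy tables for illustration only; these are NOT the real XID_Start ranges.
old_table = [(0x41, 0x5A), (0x61, 0x7A)]
new_table = [(0x41, 0x5A), (0x61, 0x7A), (0x1E900, 0x1E943)]

added, removed = table_diff(old_table, new_table)

# An update that only adds code points leaves `removed` empty.
additions_only = not removed
```

If `removed` came back non-empty for the real tables, some previously valid identifier characters would no longer be accepted, which is the case that would matter for compatibility.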

nagisa · 2016-07-01T22:12:58Z

To the best of my knowledge, we have no public Unicode functionality in the libraries shipped with rustc, or in the compiler itself, that would be impacted by a move to 9.0.

We do not do NFKC normalisation for our identifiers, and AFAIR our XID tables were already explicit.
Adding new scripts is not breaking unless somebody relied on previously unassigned symbols becoming a replacement character in certain cases; that would no longer happen for newly assigned codepoints.

The only thing we might want to do is check our “easily confused symbols” table thing and see if it needs adjustment for the new codepoints (doubtful about it).

nagisa · 2016-07-01T22:15:44Z

To the PR author: you might need to adjust the script further for new properties and similar changes.

To the future reviewers: make sure the tables related to new properties are indeed correct and exhaustive.

cuviper · 2016-07-02T00:30:03Z

@nagisa There is one new property, Prepended_Concatenation_Mark, but it looks like unicode.py is already not exhaustive, only loading a few "interesting" properties:

props = load_properties("PropList.txt",
        ["White_Space", "Join_Control", "Noncharacter_Code_Point", "Pattern_White_Space"])

And then only White_Space and Pattern_White_Space are actually written to tables.rs.

So it seems rustc_unicode is already not trying to represent the entire Unicode standard. Is there anything in particular that you think needs to be added explicitly?
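
To illustrate the selective loading described above, here is a minimal sketch of a parser that keeps only a chosen subset of properties from text in the PropList.txt line format. It is not the actual `load_properties` from unicode.py; the name `load_selected_properties`, the regex, and the abridged sample excerpt are all assumptions made for this example.

```python
import re

# Assumed line shape for UCD property files such as PropList.txt:
#   "0009..000D    ; White_Space # ..."   (a range)
#   "0020          ; White_Space # ..."   (a single code point)
_LINE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)")


def load_selected_properties(text, interesting):
    """Return {property: [(lo, hi), ...]} for the requested properties only."""
    props = {}
    for line in text.splitlines():
        m = _LINE.match(line)
        if not m:
            continue  # skip comments and blank lines
        lo, hi, name = m.groups()
        if name not in interesting:
            continue  # property exists in the file but is not loaded
        lo = int(lo, 16)
        hi = int(hi, 16) if hi else lo
        props.setdefault(name, []).append((lo, hi))
    return props


# Abridged, illustrative excerpt in PropList.txt format.
SAMPLE = """\
# comment line
0009..000D    ; White_Space # control characters
0020          ; White_Space # SPACE
200C..200D    ; Join_Control # ZWNJ..ZWJ
0600..0605    ; Prepended_Concatenation_Mark # Arabic number signs
"""

props = load_selected_properties(SAMPLE, ["White_Space", "Join_Control"])
```

With a filter like this, a property that is new in the data file, such as Prepended_Concatenation_Mark, is simply skipped unless it is added to the list of interesting names.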

nagisa · 2016-07-02T00:34:52Z

I was more worried about this change:

The constraints on standardized variation sequences have been relaxed slightly, to allow a spacing combining mark (General_Category = Spacing_Mark) as the initial character of a variation sequence. Nonspacing combining marks and canonical decomposable characters are still disallowed in variation sequences. Implementations should be checked for any assumptions regarding the allowed General_Category property values for the initial characters in variation sequences.

but it seems fine too, since we do not load that either.

cuviper · 2016-07-11T16:11:56Z

Is there anything waiting on me here? Or is this just waiting for a review decision?

alexcrichton · 2016-07-12T18:21:37Z

@cuviper ah no it's all on our end, the libs team just needs to discuss this basically. (would love to get @SimonSapin's thoughts as well)

SimonSapin · 2016-07-12T20:55:34Z

In general I’m in favor of keeping up to date with Unicode. Hard-coding a Unicode version was one of the big issues of IDNA 2003. And I believe the Unicode Consortium to be mindful of backward compatibility when making changes. http://www.unicode.org/policies/policies.html talks about stability.

And for what it’s worth, a second-hand story from The Olden Days:

https://tools.ietf.org/html/rfc3629#section-5

ISO/IEC 10646 is updated from time to time by publication of amendments and additional parts; similarly, new versions of the Unicode standard are published over time. Each new version obsoletes and replaces the previous one, but implementations, and more significantly data, are not updated instantly.

In general, the changes amount to adding new characters, which does not pose particular problems with old data. In 1996, Amendment 5 to the 1993 edition of ISO/IEC 10646 and Unicode 2.0 moved and expanded the Korean Hangul block, thereby making any previous data containing Hangul characters invalid under the new version. Unicode 2.0 has the same difference from Unicode 1.1. The justification for allowing such an incompatible change was that there were no major implementations and no significant amounts of data containing Hangul. The incident has been dubbed the "Korean mess", and the relevant committees have pledged to never, ever again make such an incompatible change (see Unicode Consortium Policies [1]).

That said I haven’t looked at all at what changed in 9.0. http://unicode.org/versions/Unicode9.0.0/#Migration would be the thing to review.

alexcrichton · 2016-07-12T23:15:27Z

Ok, cool, thanks @SimonSapin! I suspect that @rust-lang/libs will probably all respond with "lgtm" @cuviper

brson · 2016-07-13T00:16:12Z

lgtm

aturon · 2016-07-13T02:43:11Z

I'm happy to delegate to the experts here, so lgtm :)

alexcrichton · 2016-07-14T23:11:31Z

@bors: r+ 452e4ed

bors · 2016-07-15T00:29:16Z

⌛ Testing commit 452e4ed with merge 3e15fcc...

Update Unicode tables to 9.0

I just updated `unicode.py`'s generated copyright year, then ran it.

bors · 2016-07-15T03:29:53Z

Update Unicode tables to 9.0

452e4ed

rust-highfive assigned eddyb Jul 1, 2016

rust-highfive assigned alexcrichton and unassigned eddyb Jul 1, 2016

alexcrichton added the T-libs-api label Jul 1, 2016

brson added the relnotes label Jul 1, 2016

bors added a commit that referenced this pull request Jul 15, 2016

Auto merge of #34599 - cuviper:unicode-9.0, r=alexcrichton

3e15fcc

Update Unicode tables to 9.0

I just updated `unicode.py`'s generated copyright year, then ran it.

bors merged commit 452e4ed into rust-lang:master Jul 15, 2016

cuviper deleted the unicode-9.0 branch September 26, 2017 06:38

kennytm mentioned this pull request Nov 1, 2017

regenerate libcore/char_private.rs #45571

Merged
