Update Unicode tables to 9.0 #34599

Merged
merged 1 commit on Jul 15, 2016

@cuviper
Member

cuviper commented Jul 1, 2016

I just updated unicode.py's generated copyright year, then ran it.

@rust-highfive

Collaborator

rust-highfive commented Jul 1, 2016

r? @eddyb

(rust_highfive has picked a reviewer for you, use r? to override)

@eddyb

Member

eddyb commented Jul 1, 2016

@rust-highfive rust-highfive assigned alexcrichton and unassigned eddyb Jul 1, 2016

@alexcrichton

Member

alexcrichton commented Jul 1, 2016

cc @SimonSapin, @rust-lang/libs

Thoughts on the backwards-compatibility implications of a change like this? This seems like something we'd want to do, although if it has bad implications we may want to think it through first.

@alexcrichton alexcrichton added the T-libs label Jul 1, 2016

@cuviper

Member

cuviper commented Jul 1, 2016

For reference, here are the official Unicode 9.0.0 changes, and the Migration section in particular should help evaluate compatibility. I don't know enough about how these properties are used to answer myself.

@brson

Contributor

brson commented Jul 1, 2016

When we've discussed Unicode and compatibility in the past, I recall we've leaned toward giving ourselves leeway to upgrade. @cuviper, based on the Unicode changelog, do you know what impact this has on specific Rust functions? If this changes, e.g., Unicode identifiers (which it looks like it does), that impacts the Rust language definition.

@brson brson added the relnotes label Jul 1, 2016

@cuviper

Member

cuviper commented Jul 1, 2016

@brson Yes, there are changes to the XID tables. I believe these are mostly additions for the new scripts, but I'm not sure of that. The UAX #31 Migration talks about changing the formal definitions of ID/XID, which isn't clear to me either, but I think it's just changing emphasis.
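One way to check whether a table update is "mostly additions" is to diff the old and new range tables directly. The sketch below is only an illustration of that idea: the helper names (`expand`, `collapse`, `diff_tables`) and the sample ranges are made up for this example and are not taken from unicode.py or from the real XID data.

```python
from typing import Iterable, List, Set, Tuple

Range = Tuple[int, int]  # inclusive (low, high) code point range


def expand(ranges: Iterable[Range]) -> Set[int]:
    """Flatten inclusive ranges into a set of individual code points."""
    points: Set[int] = set()
    for low, high in ranges:
        points.update(range(low, high + 1))
    return points


def collapse(points: Iterable[int]) -> List[Range]:
    """Group code points back into sorted, inclusive ranges."""
    ranges: List[Range] = []
    for cp in sorted(points):
        if ranges and cp == ranges[-1][1] + 1:
            # Extend the current range by one code point.
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges


def diff_tables(old: Iterable[Range], new: Iterable[Range]) -> Tuple[List[Range], List[Range]]:
    """Return (added, removed) ranges when moving from `old` to `new`."""
    old_points, new_points = expand(old), expand(new)
    return collapse(new_points - old_points), collapse(old_points - new_points)


# Hypothetical tables with made-up values, NOT real XID_Start data.
old_table = [(0x41, 0x5A), (0x61, 0x7A)]
new_table = [(0x41, 0x5A), (0x61, 0x7A), (0x100, 0x105)]

added, removed = diff_tables(old_table, new_table)
print("added:", [(hex(lo), hex(hi)) for lo, hi in added])
print("removed:", [(hex(lo), hex(hi)) for lo, hi in removed])
```

An empty `removed` list would mean every code point accepted under the old table is still accepted under the new one, which is the property that matters for identifier compatibility.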

@nagisa

Contributor

nagisa commented Jul 1, 2016

To the best of my knowledge, we have no public Unicode functionality in the libraries shipped with rustc, or in the compiler itself, that would be impacted by the move to 9.0.

  • We do not do NFKC normalisation for our identifiers, and AFAIR our XID tables were already explicit.
  • Adding new scripts is not breaking unless somebody relied on previously unassigned symbols becoming a replacement character in certain cases; that would no longer happen with newly assigned codepoints.

The only thing we might want to do is check our “easily confused symbols” table and see whether it needs adjustment for the new codepoints (I doubt it).

@nagisa

Contributor

nagisa commented Jul 1, 2016

To the PR author: you might need to adjust the script further for new properties and similar changes.

To future reviewers: make sure the tables related to new properties are indeed correct and exhaustive.

@cuviper

Member

cuviper commented Jul 2, 2016

@nagisa There is one new property, Prepended_Concatenation_Mark, but it looks like unicode.py is already not exhaustive, only loading a few "interesting" properties:

props = load_properties("PropList.txt",
        ["White_Space", "Join_Control", "Noncharacter_Code_Point", "Pattern_White_Space"])

And then only White_Space and Pattern_White_Space are actually written to tables.rs.

So it seems rustc_unicode is already not trying to represent the entire Unicode standard. Is there anything in particular that you think needs to be added explicitly?
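To illustrate the "only a few interesting properties" point, here is a minimal sketch of a filtering loader for data laid out like PropList.txt. It is not the actual `load_properties` from unicode.py: the function name `parse_properties`, the regular expression, and the embedded sample lines are assumptions made for this example.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Matches "0009..000D    ; White_Space # ..." as well as single code points
# such as "0020          ; White_Space # ...".
LINE_RE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)")


def parse_properties(
    lines: Iterable[str], interesting: Iterable[str]
) -> Dict[str, List[Tuple[int, int]]]:
    """Collect inclusive code point ranges for the requested properties only."""
    wanted = set(interesting)
    props: Dict[str, List[Tuple[int, int]]] = {}
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # blank line or comment
        low, high, name = match.groups()
        if name not in wanted:
            continue  # property exists in the data but is not loaded
        start = int(low, 16)
        end = int(high, 16) if high else start
        props.setdefault(name, []).append((start, end))
    return props


# Illustrative excerpt written in the PropList.txt layout (assumed sample).
SAMPLE = """\
# Illustrative excerpt
0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
200C..200D    ; Join_Control # Cf   [2] ZERO WIDTH NON-JOINER..ZERO WIDTH JOINER
0600..0605    ; Prepended_Concatenation_Mark # Cf   [6] ARABIC NUMBER SIGN..ARABIC NUMBER MARK ABOVE
"""

props = parse_properties(SAMPLE.splitlines(), ["White_Space", "Join_Control"])
for name, ranges in props.items():
    print(name, ranges)
```

Because the loader filters by name, a property that is new in the data file (here `Prepended_Concatenation_Mark`) is simply skipped unless it is added to the list of requested names.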

@nagisa

Contributor

nagisa commented Jul 2, 2016

I was more worried about:

The constraints on standardized variation sequences have been relaxed slightly, to allow a spacing combining mark (General_Category = Spacing_Mark) as the initial character of a variation sequence. Nonspacing combining marks and canonical decomposable characters are still disallowed in variation sequences. Implementations should be checked for any assumptions regarding the allowed General_Category property values for the initial characters in variation sequences.

but it seems fine too, since we do not load that either.

@cuviper

Member

cuviper commented Jul 11, 2016

Is there anything waiting on me here? Or is this just waiting for a review decision?

@alexcrichton

Member

alexcrichton commented Jul 12, 2016

@cuviper ah no it's all on our end, the libs team just needs to discuss this basically. (would love to get @SimonSapin's thoughts as well)

@SimonSapin

Contributor

SimonSapin commented Jul 12, 2016

In general I’m in favor of keeping up to date with Unicode. Hard-coding a Unicode version was one of the big issues of IDNA 2003. And I believe the Unicode Consortium to be mindful of backward compatibility when making changes. http://www.unicode.org/policies/policies.html talks about stability.

And for what it’s worth, a second-hand story from The Olden Days:

https://tools.ietf.org/html/rfc3629#section-5

ISO/IEC 10646 is updated from time to time by publication of amendments and additional parts; similarly, new versions of the Unicode standard are published over time. Each new version obsoletes and replaces the previous one, but implementations, and more significantly data, are not updated instantly.

In general, the changes amount to adding new characters, which does not pose particular problems with old data. In 1996, Amendment 5 to the 1993 edition of ISO/IEC 10646 and Unicode 2.0 moved and expanded the Korean Hangul block, thereby making any previous data containing Hangul characters invalid under the new version. Unicode 2.0 has the same difference from Unicode 1.1. The justification for allowing such an incompatible change was that there were no major implementations and no significant amounts of data containing Hangul. The incident has been dubbed the "Korean mess", and the relevant committees have pledged to never, ever again make such an incompatible change (see Unicode Consortium Policies [1]).


That said, I haven’t looked at all at what changed in 9.0. http://unicode.org/versions/Unicode9.0.0/#Migration would be the thing to review.

@alexcrichton

Member

alexcrichton commented Jul 12, 2016

Ok, cool, thanks @SimonSapin! I suspect that @rust-lang/libs will probably all respond with "lgtm" @cuviper

@brson

Contributor

brson commented Jul 13, 2016

lgtm

@aturon

Member

aturon commented Jul 13, 2016

I'm happy to delegate to the experts here, so lgtm :)

@alexcrichton

Member

alexcrichton commented Jul 14, 2016

@bors: r+ 452e4ed

@bors

Contributor

bors commented Jul 15, 2016

⌛️ Testing commit 452e4ed with merge 3e15fcc...

bors added a commit that referenced this pull request Jul 15, 2016

Auto merge of #34599 - cuviper:unicode-9.0, r=alexcrichton
Update Unicode tables to 9.0

I just updated `unicode.py`'s generated copyright year, then ran it.

@bors bors merged commit 452e4ed into rust-lang:master Jul 15, 2016

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
homu Test successful

@cuviper cuviper deleted the cuviper:unicode-9.0 branch Sep 26, 2017
