Prepare for Unicode 14 #40

Manishearth · 2021-02-11T06:12:50Z

This PR:

Adds Unicode 14 to all the requisite enums, etc.
Copies Unicode 13's ucd and emoji folders.
Applies updated version/date headers to these files.

(No other changes are made to the data files)

Some things missing from the documentation:

The .bat files are no longer generated and I couldn't figure out how to do it. Instead I followed the instructions here with the Python script
The extra and cldr folders do not exist under unicodetools/data/ucd/14.0.0, but they do under Generated. I only updated files which existed
Blocks.txt is under unicodetools/data/ucd/14.0.0 in unicodetools, and Generated/UCD/d1/extra in Generated. Copying it out is not enough; the Generated Blocks.txt uses short names for blocks, whereas the unicodetools one uses long ones, and UCD asks for an additional file for the mapping. For now I just patched the header on Blocks.txt, but it seems like we should be generating d1/Blocks.txt as well with the long codes.

I can upload the d1 files if you tell me when and where.

cc @markusicu @Ken-Whistler

macchiati

What I usually do is diff them against the previous version, to check that the data looks sensible. I can do that, but probably later during the week.
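A minimal sketch of that kind of version-to-version diff, using Python's standard difflib. The function name, the folder labels, and the two sample file bodies are illustrative assumptions, not taken from the unicodetools repository:

```python
import difflib


def diff_data_files(old_text, new_text, name="data.txt"):
    """Return unified-diff lines between two versions of a data file."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=f"13.0.0/{name}",  # label only; no file is read
            tofile=f"14.0.0/{name}",
            lineterm="",
            n=0,  # no context lines, so only real changes show up
        )
    )


# Toy stand-ins for an old and a new data file.
old_sample = "# Blocks-13.0.0.txt\n0000..007F; Basic Latin\n"
new_sample = (
    "# Blocks-14.0.0.txt\n"
    "0000..007F; Basic Latin\n"
    "0080..00FF; Latin-1 Supplement\n"
)

diff_lines = diff_data_files(old_sample, new_sample, "Blocks.txt")
for line in diff_lines:
    print(line)
```

Reading the output by eye is still the point; the script only narrows attention to the lines that changed.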

One quick note is that the emoji files are showing version 13.0, which will need fixing. Not your problem since you were going by the instructions (and they were clearly incomplete)
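A stale version string like this can be caught mechanically. The sketch below scans the first lines of a file for "Version X.Y" or "Version: X.Y" and reports anything that differs from the expected version. The regular expression, the function name, and the 30-line window are assumptions for illustration; the sample text mirrors the ReadMe.txt header quoted later in this thread:

```python
import re

# Matches "Version 13.0" as well as "Version: 13.0".
VERSION_RE = re.compile(r"Version:? (\d+\.\d+)")


def stale_version_lines(text, expected, max_lines=30):
    """Return (line_number, version) pairs in the header that differ from `expected`."""
    stale = []
    for number, line in enumerate(text.splitlines()[:max_lines], start=1):
        match = VERSION_RE.search(line)
        if match and match.group(1) != expected:
            stale.append((number, match.group(1)))
    return stale


readme_sample = (
    "# Unicode Emoji\n"
    "# © 2020 Unicode®, Inc.\n"
    "\n"
    "This directory contains data files for Unicode Emoji, Version 13.0\n"
)

stale = stale_version_lines(readme_sample, expected="14.0")
print(stale)
```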

Manishearth · 2021-02-15T03:07:46Z

One quick note is that the emoji files are showing version 13.0, which will need fixing. Not your problem since you were going by the instructions (and they were clearly incomplete)

Should be easy enough to fix.

markusicu

Thanks, looks plausible for a start.

I always do a combination of following the notes on the unicodetools site plus “forensic programming”. I don't remember the details from one year to the next :-}

I see that you mostly copied Unicode 13 files as-is, but also put some Unicode 14 files in here. I think I used to start with a commit of just the old files in the new folders, for a more visible file history. Not sure if it makes much of a difference though. It might help git with the file history when the initial diffs for the new version would otherwise be large enough to exceed git's similarity threshold, beyond which it no longer treats the new file as a "copy" of the old one. Probably less important generally than in svn.

We will also need to copy the version-13 files for idna, security, and UCA, for a clean start there. Can be a separate PR.

As for generated files, they don't go directly back into the repo. We post the generated files to a location where KenW grabs them, looks them over, and if they are ok, he posts them in the Unicode Public folder. Then we grab all of the files from there again, put them into the repo's data folders, and run the tools again. Etc.

For example, here is where I posted files that I generated for Unicode 13: https://corp.unicode.org/~book/incoming/markus/u13/ plus (once UCD is settled) https://corp.unicode.org/~book/incoming/markus/uca13/

unicodetools/org/unicode/text/UCD/MakeUnicodeFiles.txt

unicodetools/org/unicode/text/utility/Utility.java

markusicu · 2021-02-17T21:48:09Z

The extra and cldr folders do not exist under unicodetools/data/ucd/14.0.0, but they do under Generated. I only updated files which existed

That's right. They are files for diagnosis or for use in CLDR and ICU, but don't go into the UCD/UCA/...

To be clear, I assume that what you put into the data input folders is straight from https://www.unicode.org/Public/14.0.0/ right? Do not check in things from the Generated output folder. The generated files first go for review, and we only pick them up after KenW has looked them over and copied them to /Public.

Blocks.txt is under unicodetools/data/ucd/14.0.0 in unicodetools, and Generated/UCD/d1/extra in Generated. Copying it out is not enough; the Generated Blocks.txt uses short names for blocks, whereas the unicodetools one uses long ones, and UCD asks for an additional file for the mapping. For now I just patched the header on Blocks.txt, but it seems like we should be generating d1/Blocks.txt as well with the long codes.

I think we update Blocks.txt manually, post for Ken, wait for his approval and copy to /Public, and only then copy into the data folder.
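For the short-versus-long block names mentioned above, here is a rough sketch of reading a "Short ; Long" mapping. It assumes the line format of the ShortBlockNames.txt excerpt quoted later in this thread; treating "#" as a comment marker and the function name are assumptions, and this is not how the Java tools load the file:

```python
def parse_short_block_names(text):
    """Parse 'Short ; Long' lines into a dict of short name -> long name."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line:
            continue
        short, sep, long_name = line.partition(";")
        if not sep:
            continue  # skip lines without a separator
        mapping[short.strip()] = long_name.strip()
    return mapping


sample = (
    "Yijing ; Yijing_Hexagram_Symbols\n"
    "Zanabazar_Square ; Zanabazar_Square\n"
    "Znamenny_Music ; Znamenny_Musical_Notation\n"
)

short_to_long = parse_short_block_names(sample)
long_to_short = {long_name: short for short, long_name in short_to_long.items()}
print(long_to_short["Yijing_Hexagram_Symbols"])
```

Inverting the dict gives the long-to-short direction, which is the one needed to turn long names from the unicodetools Blocks.txt into the short names used in Generated.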

Manishearth · 2021-02-17T22:19:24Z

To be clear, I assume that what you put into the data input folders is straight from https://www.unicode.org/Public/14.0.0/ right? Do not check in things from the Generated output folder. The generated files first go for review, and we only pick them up after KenW has looked them over and copied them to /Public.

No, I copied them from Unicode 13, as listed in the docs. I can copy those files instead.

Manishearth · 2021-02-17T22:23:31Z

(once this is done I hope to update the docs with everything I've learned)

markusicu · 2021-02-17T22:34:55Z

To be clear, I assume that what you put into the data input folders is straight from https://www.unicode.org/Public/14.0.0/ right? Do not check in things from the Generated output folder. The generated files first go for review, and we only pick them up after KenW has looked them over and copied them to /Public.

No, I copied them from Unicode 13, as listed in the docs. I can copy those files instead.

You copied the ucd and emoji files from the data/.../13.0.0 folders to the data/.../14.0.0 folders, but you also then overwrote some files like unicodetools/data/ucd/14.0.0-Update/PropertyAliases.txt which has a version 14 header. What I am saying is that these files should only ever come from the /Public folder, never directly from the Generated output.

Manishearth · 2021-02-17T23:08:28Z

Ah. I'll make an additional commit copying in 14 files from Public

Manishearth · 2021-02-18T00:19:31Z

but you also then overwrote some files like unicodetools/data/ucd/14.0.0-Update/PropertyAliases.txt which has a version 14 header

Oh, uh, not quite, this is the Unicode 13 file, I just added the Unicode 14 header to make the tools stop complaining about diffs 😄

Manishearth · 2021-02-18T01:08:59Z

Copied in the generated files
Addressed inline comments
Added new scripts and blocks to the code. I did not add them to the data files (e.g. PropertyValueAliases.txt for short codes), they are appearing in Generated but as you mentioned they don't need to be copied

Manishearth · 2021-02-18T01:14:08Z

It seems to be generating again, after working through all the errors. Note for when I document stuff: we need to document "new scripts" and "new blocks".

Manishearth · 2021-02-18T01:24:23Z

Oh, another thing:

The emoji/, auxiliary/, and extracted/ folders on /Public are empty. I didn't touch them
Unihan.zip does not contain files that overlap with Unihan/, so I didn't update that

The build throws these non-blocking errors:

***Rules missing from SAMPLES for Line: [12.3, 13.04, 21.2, 27.01]
...
java.lang.IllegalArgumentException: Internal error: cjkAccountingNumeric doesn't contain : []
java.lang.IllegalArgumentException: Internal error: cjkOtherNumeric doesn't contain : []
java.lang.IllegalArgumentException: Internal error: cjkPrimaryNumeric doesn't contain : []

markusicu · 2021-02-18T03:22:57Z

Oh, uh, not quite, this is the Unicode 13 file, I just added the Unicode 14 header to make the tools stop complaining about diffs 😄

I wouldn't make this kind of edit. The data files should be the old versions, or come from the /Public online folder, or have real edits for new data.

I believe that I used to make one commit (in svn) with just copies within the repo from the old-version data/ files to the new-version data/ files, and subsequent commits where I copied in /Public files and/or added new data.

Copied in the generated files

This sounds wrong, as discussed...

The emoji/, auxiliary/, and extracted/ folders on /Public are empty. I didn't touch them

They are empty until the files we generated get reviewed and posted.

Unihan.zip does not contain files that overlap with Unihan/, so I didn't update that

The files in Unihan.zip are hard to work with. Hard to diff, and properties can move between files.
Instead, we re-munge them into one file per property.
Instructions are in the "Unihan" section of https://sites.google.com/site/unicodetools/inputdata
(We will need to change that slightly to talk about git instead of svn.)
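To illustrate the "one file per property" idea only: the sketch below regroups Unihan-style lines (code point, property tag, value, separated by tabs) by property. The authoritative steps are the ones on the linked page; the function name and the sample lines are illustrative assumptions:

```python
from collections import defaultdict


def group_by_property(lines):
    """Group tab-separated Unihan-style lines into {property: [(code_point, value), ...]}."""
    grouped = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code_point, prop, value = line.split("\t", 2)
        grouped[prop].append((code_point, value))
    return dict(grouped)


# Illustrative input lines, not copied from a specific Unihan release.
sample_lines = [
    "# sample header comment",
    "U+3400\tkCantonese\tjau1",
    "U+3400\tkDefinition\thillock or mound",
    "U+3401\tkCantonese\ttim2",
]

grouped = group_by_property(sample_lines)
for prop, entries in sorted(grouped.items()):
    print(prop, entries)
```

Each key could then be written to its own file, so a property that moves between the files inside Unihan.zip still diffs cleanly from one version to the next.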

FYI I expect to post the last two ISO script codes tomorrow that we need for Unicode 14.
(I am the ISO 15924 registrar. There is a whole process for that.)

Manishearth · 2021-02-18T20:45:52Z

I wouldn't make this kind of edit. The data files should be the old versions, or come from the /Public online folder, or have real edits for new data.

Yep, the files are now from /Public

Manishearth · 2021-02-18T20:46:37Z

Well, except for the emoji files, which don't exist and cause errors until added

The UCD data will be overwritten with data from Public/, this is simply so that we have useful diffs in the "Add updated Unicode 14 data" commit

Manishearth · 2021-02-19T00:03:13Z

Okay, the current state is that there are no more "copied from Unicode 13" files in 14.0.0-Update; it is a clean copy of Public/. There is still a step in the git history that copies the relevant files from Unicode 13, but that is purely so that the "Add updated Unicode 14 data" commit can have useful diffs.

There are still files under emoji/14.0 that are copied from Unicode 13, because the program fails without them. What should my next steps be here: should I be getting Ken to upload my generated files? Or getting this PR merged?

markusicu

FYI Mark has a competing PR now adding emoji 13.1. Let's see who gets in first vs. who gets to merge :-)

FYI Next PR could update to https://www.unicode.org/Public/14.0.0/ucd/PropertyValueAliases-14.0.0d2.txt which adds new block and script names.

unicodetools/data/emoji/14.0/ReadMe.txt

unicodetools/org/unicode/text/UCD/MakeUnicodeFiles.txt

unicodetools/org/unicode/text/UCD/ShortBlockNames.txt

macchiati · 2021-03-01T21:33:04Z

Manish started first; the 13.1 update isn't as important... Mark

On Mon, Mar 1, 2021 at 1:11 PM Markus Scherer wrote:

FYI Mark has a competing PR now adding emoji 13.1. Let's see who gets in first vs. who gets to merge :-)

FYI Next PR could update to https://www.unicode.org/Public/14.0.0/ucd/PropertyValueAliases-14.0.0d2.txt which adds new block and script names.

In unicodetools/data/emoji/14.0/ReadMe.txt:

@@ -0,0 +1,21 @@
+# Unicode Emoji
+# © 2020 Unicode®, Inc.
+# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
+# For terms of use, see http://www.unicode.org/terms_of_use.html
+
+This directory contains data files for Unicode Emoji, Version 13.0
+
+Public/emoji/13.0/

The sequences files should probably start with 13.1 not 13.0: https://www.unicode.org/Public/emoji/13.1/

In unicodetools/org/unicode/text/UCD/MakeUnicodeFiles.txt:

@@ -1,6 +1,6 @@
 Generate: .
-CopyrightYear: 2020
-DeltaVersion: 17
+CopyrightYear: 2021
+DeltaVersion: 1

We usually set this one higher than the highest dsomething that is published, so that there is no collision. Looks like we have a d7 now, so this wants to be 8 so that your generated files say d8.

In unicodetools/org/unicode/text/UCD/ShortBlockNames.txt:

 Wancho ; Wancho
 Warang_Citi ; Warang_Citi
 Yezidi ; Yezidi
 Yi_Radicals ; Yi_Radicals
 Yi_Syllables ; Yi_Syllables
 Yijing ; Yijing_Hexagram_Symbols
 Zanabazar_Square ; Zanabazar_Square
+Znamenny_Musical ; Znamenny_Musical_Notation

Suggested change:
-Znamenny_Musical ; Znamenny_Musical_Notation
+Znamenny_Music ; Znamenny_Musical_Notation

see https://www.unicode.org/Public/14.0.0/ucd/PropertyValueAliases-14.0.0d2.txt
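The DeltaVersion rule from the review comment above (one higher than the highest published "d" number) can be sketched roughly as follows. The file-name pattern is inferred from PropertyValueAliases-14.0.0d2.txt; the function name and the other sample names are hypothetical:

```python
import re

# Assumed naming pattern: <Name>-<major>.<minor>.<update>d<N>.txt
DELTA_RE = re.compile(r"-\d+\.\d+\.\d+d(\d+)\.txt$")


def next_delta_version(published_names):
    """Return one more than the highest dN found among the given file names."""
    highest = 0
    for name in published_names:
        match = DELTA_RE.search(name)
        if match:
            highest = max(highest, int(match.group(1)))
    return highest + 1


published = [
    "PropertyValueAliases-14.0.0d2.txt",
    "Blocks-14.0.0d7.txt",  # hypothetical name standing in for "a d7"
    "ReadMe.txt",  # no delta suffix, so it is ignored
]

delta = next_delta_version(published)
print(f"DeltaVersion: {delta}")
```

With a d7 already published, the sketch yields 8, matching the value requested in the review.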

Manishearth · 2021-03-01T21:35:13Z

Should I still pull in 13.1?

Also, from the ucd-dev list it seems like a lot of data files have been updated: Should I be pulling in the latest files before we can merge this?

macchiati · 2021-03-01T21:39:08Z

13.1 was an emoji-sequences-only release, and the UCD files are identical to 13.0. I just copied in 13.0 files because it makes it easier to build for the UnicodeJsps (the tooling isn't set up quite right for having a delta like that), so I think you can simply proceed with 14.0 as you were. Mark


markusicu · 2021-03-01T21:39:40Z

Should I still pull in 13.1?

I think yes. Just the four files from https://www.unicode.org/Public/emoji/13.1/

Also, from the ucd-dev list it seems like a lot of data files have been updated: Should I be pulling in the latest files before we can merge this?

Possible, but I suggest doing that in the next PR.

Manishearth · 2021-03-01T21:43:30Z

I think yes. Just the four files from https://www.unicode.org/Public/emoji/13.1/

Done (and addressed the other comments).

Possible, but I suggest doing that in the next PR.

Will do, and then I suppose I should send the updated generated files to Ken?

markusicu · 2021-03-01T21:56:20Z

Possible, but I suggest doing that in the next PR.

Will do, and then I suppose I should send the updated generated files to Ken?

Sure. Best is to post the generated files somewhere and send to Ken with cc to ucd-dev.

We usually post to https://corp.unicode.org/~book/incoming/username -- do you have access to that? (Should be fine posting wherever is convenient.)

markusicu

looks plausible :-)

please squash & merge

Manishearth · 2021-03-01T22:52:23Z

Done! Going to pull in the latest data soon

We usually post to https://corp.unicode.org/~book/incoming/username -- do you have access to that? (Should be fine posting wherever is convenient.)

I have the password for book, but I don't know how to upload to it: I'm able to log in to the web with book and get a dir listing, but not to FTP as listed in https://corp.unicode.org/~book/bookurls.txt . Do I need to get an account?

Manishearth · 2021-03-02T01:56:36Z

Posted #45 and https://corp.unicode.org/~book/incoming/manishearth/u14/d8/

Manishearth · 2021-03-03T01:37:23Z

I made some doc changes to update https://sites.google.com/site/unicodetools/home to the latest set of steps.

macchiati · 2021-03-03T02:05:54Z

Thanks! Mark


Manishearth changed the title ~~Create directories for Unicode 14~~ Prepare for Unicode 14 Feb 11, 2021

macchiati reviewed Feb 14, 2021

markusicu reviewed Feb 17, 2021

Prepare enums for Unicode 14

a92487f

Manishearth force-pushed the unicode-14 branch from 569eb3d to 3a5ea30 Compare February 18, 2021 01:05

Manishearth force-pushed the unicode-14 branch from 3a5ea30 to 4796ac1 Compare February 18, 2021 01:12

Manishearth force-pushed the unicode-14 branch from 4796ac1 to c447969 Compare February 18, 2021 23:59

Manishearth added 5 commits February 18, 2021 16:01

Add UCD and Emoji folders for Unicode 14

a6f8b2d

The UCD data will be overwritten with data from Public/, this is simply so that we have useful diffs in the "Add updated Unicode 14 data" commit

Add updated Unicode 14 data

612cb03

Add Toto enum values

d9349a7

Add short block names to ShortBlockNames.txt

3d66901

Improve diagnostics for ToolUnicodeProperty._getValue()

b7d80c0

markusicu reviewed Mar 1, 2021

Update short block name for Znamenny

78e3b96

Use d8 as the deltaversion

5a355f1

Manishearth force-pushed the unicode-14 branch from c447969 to 5a355f1 Compare March 1, 2021 21:39

Update emoji data for 14.0 to 13.1

df0c663

markusicu approved these changes Mar 1, 2021

Manishearth merged commit 05b47f2 into master Mar 1, 2021

Manishearth deleted the unicode-14 branch March 1, 2021 22:55
