Prepare for Unicode 14 #40
What I usually do is diff them against the previous version, to check that the data looks sensible. I can do that, but probably later during the week.
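A minimal sketch of that kind of sanity check, assuming the old and new data files are available as text. The helper name `summarize_diff` is made up for illustration and is not part of unicodetools; the sample lines are taken from the ShortBlockNames.txt hunk discussed later in this thread.

```python
import difflib


def summarize_diff(old_text: str, new_text: str) -> dict:
    """Count lines added and removed between two versions of a data file."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    added = removed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed += i2 - i1  # lines present only in the old file
        if tag in ("replace", "insert"):
            added += j2 - j1  # lines present only in the new file
    return {"added": added, "removed": removed}


if __name__ == "__main__":
    old = "Yijing ; Yijing_Hexagram_Symbols\nZanabazar_Square ; Zanabazar_Square\n"
    new = old + "Znamenny_Music ; Znamenny_Musical_Notation\n"
    print(summarize_diff(old, new))
```

A new Unicode version should mostly add lines, so a large "removed" count is a hint that something went wrong.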
One quick note is that the emoji files are showing version 13.0, which will need fixing. Not your problem since you were going by the instructions (and they were clearly incomplete)
Should be easy enough to fix.
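For spotting leftover version strings like this, a rough sketch follows. It assumes the files mention their version as "Version 13.0" or "Version: 13.0"; the function name `stale_version_lines` and the regular expression are illustrative assumptions, not something the tools provide.

```python
import re

# Assumed header wording: "Version 13.0" or "Version: 13.0".
VERSION_PATTERN = re.compile(r"Version:?\s+(\d+\.\d+(?:\.\d+)?)")


def stale_version_lines(text: str, expected_version: str) -> list:
    """Return (line_number, version) pairs where the version is not the expected one."""
    stale = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = VERSION_PATTERN.search(line)
        if match and match.group(1) != expected_version:
            stale.append((number, match.group(1)))
    return stale


if __name__ == "__main__":
    readme = (
        "# Unicode Emoji\n"
        "# © 2020 Unicode®, Inc.\n"
        "\n"
        "This directory contains data files for Unicode Emoji, Version 13.0\n"
    )
    print(stale_version_lines(readme, "14.0"))
```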
Thanks, looks plausible for a start.
I always do a combination of following the notes on the unicodetools site plus “forensic programming”. I don't remember the details from one year to the next :-}
I see that you mostly copied Unicode 13 files as-is, but also put some Unicode 14 files in here. I think I used to start with a commit of just the old files in the new folders, for a more visible file history. Not sure if it makes much of a difference though. It might help git with the file history if the initial diffs for the new version are large enough to exceed git's threshold for whether the new version looks similar to the old one and record a "copy". Probably less important generally than in svn.
We will also need to copy the version-13 files for idna, security, and UCA, for a clean start there. Can be a separate PR.
As for generated files, they don't go directly back into the repo. We post the generated files to a location where KenW grabs them, looks them over, and if they are ok, he posts them in the Unicode Public folder. Then we grab all of the files from there again, put them into the repo's data folders, and run the tools again. Etc.
For example, here is where I posted files that I generated for Unicode 13: https://corp.unicode.org/~book/incoming/markus/u13/ plus (once UCD is settled) https://corp.unicode.org/~book/incoming/markus/uca13/
That's right. They are files for diagnosis or for use in CLDR and ICU, but don't go into the UCD/UCA/... To be clear, I assume that what you put into the data input folders is straight from https://www.unicode.org/Public/14.0.0/ right? Do not check in things from the Generated output folder. The generated files first go for review, and we only pick them up after KenW has looked them over and copied them to /Public.
I think we update Blocks.txt manually, post for Ken, wait for his approval and copy to /Public, and only then copy into the data folder.
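A hedged sketch of pulling one reviewed file from /Public into a data folder. The URL layout follows the links quoted in this thread; the helper names, and the assumption that the repo stores files without the "-14.0.0d2" style suffix (as with PropertyAliases.txt), are illustrative only.

```python
import re
from pathlib import Path
from urllib.request import urlretrieve

PUBLIC_BASE = "https://www.unicode.org/Public/"


def public_url(version: str, subfolder: str, filename: str) -> str:
    """Build the /Public URL for a file, e.g. under 14.0.0/ucd/."""
    return f"{PUBLIC_BASE}{version}/{subfolder}/{filename}"


def local_name(filename: str) -> str:
    """Drop a version suffix such as '-14.0.0d2' (assumed repo naming)."""
    return re.sub(r"-\d+\.\d+\.\d+(?:d\d+)?(?=\.[A-Za-z]+$)", "", filename)


def fetch_public_file(version: str, subfolder: str, filename: str, target_dir: str) -> Path:
    """Download one file from /Public into target_dir and return its path."""
    target = Path(target_dir) / local_name(filename)
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(public_url(version, subfolder, filename), str(target))
    return target


if __name__ == "__main__":
    print(public_url("14.0.0", "ucd", "PropertyValueAliases-14.0.0d2.txt"))
    print(local_name("PropertyValueAliases-14.0.0d2.txt"))
```

Nothing is downloaded unless `fetch_public_file` is called explicitly, so the sketch is safe to run as-is.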
No, I copied them from Unicode 13, as listed in the docs. I can copy those files instead.
(once this is done I hope to update the docs with everything I've learned)
You copied the ucd and emoji files from the data/.../13.0.0 folders to the data/.../14.0.0 folders, but you also then overwrote some files like unicodetools/data/ucd/14.0.0-Update/PropertyAliases.txt which has a version 14 header. What I am saying is that these files should only ever come from the /Public folder, never directly from the Generated output.
Ah. I'll make an additional commit copying in 14 files from Public.
Oh, uh, not quite, this is the Unicode 13 file, I just added the Unicode 14 header to make the tools stop complaining about diffs 😄
569eb3d to 3a5ea30
3a5ea30 to 4796ac1
It seems to be generating again, after working through all the errors. Note for when I document stuff: we need to document "new scripts" and "new blocks".
Oh, another thing:
The build throws these non-blocking errors:
I wouldn't make this kind of edit. The data files should be the old versions, or come from the /Public online folder, or have real edits for new data. I believe that I used to make one commit (in svn) with just copies within the repo from the old-version data/ files to the new-version data/ files, and subsequent commits where I copied in /Public files and/or added new data.
This sounds wrong, as discussed...
They are empty until the files we generated get reviewed and posted.
The files in Unihan.zip are hard to work with. Hard to diff, and properties can move between files. FYI I expect to post the last two ISO script codes tomorrow that we need for Unicode 14.
Yep, the files are now from /Public.
Well, except for the emoji files, which don't exist and cause errors until added.
4796ac1 to c447969
The UCD data will be overwritten with data from Public/, this is simply so that we have useful diffs in the "Add updated Unicode 14 data" commit
Okay, the current state is that there are no more "copied from Unicode 13" files in There are still files under |
FYI Mark has a competing PR now adding emoji 13.1. Let's see who gets in first vs. who gets to merge :-)
FYI Next PR could update to https://www.unicode.org/Public/14.0.0/ucd/PropertyValueAliases-14.0.0d2.txt which adds new block and script names.
Manish started first; the 13.1 update isn't as important...
Mark
…On Mon, Mar 1, 2021 at 1:11 PM Markus Scherer ***@***.***> wrote:
***@***.**** commented on this pull request.
FYI Mark has a competing PR now adding emoji 13.1. Let's see who gets in
first vs. who gets to merge :-)
FYI Next PR could update to
https://www.unicode.org/Public/14.0.0/ucd/PropertyValueAliases-14.0.0d2.txt
which adds new block and script names.
------------------------------
In unicodetools/data/emoji/14.0/ReadMe.txt:
> @@ -0,0 +1,21 @@
+# Unicode Emoji
+# © 2020 Unicode®, Inc.
+# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
+# For terms of use, see http://www.unicode.org/terms_of_use.html
+
+This directory contains data files for Unicode Emoji, Version 13.0
+
+Public/emoji/13.0/
The sequences files should probably start with 13.1 not 13.0:
https://www.unicode.org/Public/emoji/13.1/
------------------------------
In unicodetools/org/unicode/text/UCD/MakeUnicodeFiles.txt:
> @@ -1,6 +1,6 @@
Generate: .
-CopyrightYear: 2020
-DeltaVersion: 17
+CopyrightYear: 2021
+DeltaVersion: 1
We usually set this one higher than the highest dsomething that is
published, so that there is no collision. Looks like we have a d7 now, so
this wants to be 8 so that your generated files say d8.
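That rule could be sketched roughly as follows, assuming published draft files carry a "-14.0.0dN" suffix as in the PropertyValueAliases link above. The helper `next_delta_version` and the sample file list are hypothetical.

```python
import re

# Assumed naming: <Name>-<major>.<minor>.<update>d<N>.txt
DELTA_PATTERN = re.compile(r"-\d+\.\d+\.\d+d(\d+)\.txt$")


def next_delta_version(published_names) -> int:
    """Return one more than the highest published dN, or 1 if none is found."""
    deltas = []
    for name in published_names:
        match = DELTA_PATTERN.search(name)
        if match:
            deltas.append(int(match.group(1)))
    return max(deltas, default=0) + 1


if __name__ == "__main__":
    # Illustrative names only; the d7 file stands in for "we have a d7 now".
    names = ["PropertyValueAliases-14.0.0d2.txt", "Blocks-14.0.0d7.txt", "ReadMe.txt"]
    print(next_delta_version(names))
```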
------------------------------
In unicodetools/org/unicode/text/UCD/ShortBlockNames.txt:
> Wancho ; Wancho
Warang_Citi ; Warang_Citi
Yezidi ; Yezidi
Yi_Radicals ; Yi_Radicals
Yi_Syllables ; Yi_Syllables
Yijing ; Yijing_Hexagram_Symbols
Zanabazar_Square ; Zanabazar_Square
+Znamenny_Musical ; Znamenny_Musical_Notation
⬇️ Suggested change
-Znamenny_Musical ; Znamenny_Musical_Notation
+Znamenny_Music ; Znamenny_Musical_Notation
see
https://www.unicode.org/Public/14.0.0/ucd/PropertyValueAliases-14.0.0d2.txt
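A mismatch like this could be caught mechanically. The sketch below assumes ShortBlockNames.txt uses "Short ; Long" lines (as in the hunk above) and that PropertyValueAliases lists blocks as "blk; Short ; Long"; the function names are made up for illustration.

```python
def parse_short_block_names(text: str) -> dict:
    """Parse 'Short ; Long' lines into {long_name: short_name}."""
    mapping = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("#", 1)[0].split(";")]
        if len(fields) == 2 and all(fields):
            mapping[fields[1]] = fields[0]
    return mapping


def parse_block_aliases(text: str) -> dict:
    """Parse 'blk; Short ; Long' lines into {long_name: short_name}."""
    mapping = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("#", 1)[0].split(";")]
        if len(fields) >= 3 and fields[0] == "blk":
            mapping[fields[2]] = fields[1]
    return mapping


def mismatched_short_names(short_names_text: str, aliases_text: str) -> dict:
    """Return {long_name: (ours, official)} where the short names differ."""
    ours = parse_short_block_names(short_names_text)
    official = parse_block_aliases(aliases_text)
    return {
        long_name: (short, official[long_name])
        for long_name, short in ours.items()
        if long_name in official and short != official[long_name]
    }


if __name__ == "__main__":
    short_names = (
        "Zanabazar_Square ; Zanabazar_Square\n"
        "Znamenny_Musical ; Znamenny_Musical_Notation\n"
    )
    aliases = (
        "blk; Zanabazar_Square ; Zanabazar_Square\n"
        "blk; Znamenny_Music ; Znamenny_Musical_Notation\n"
    )
    print(mismatched_short_names(short_names, aliases))
```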
Should I still pull in 13.1? Also, from the ucd-dev list it seems like a lot of data files have been updated. Should I be pulling in the latest files before we can merge this?
13.1 was an emoji-sequences-only release, and the UCD files are identical
to 13.0.
I just copied in 13.0 files because it makes it easier to build for the
UnicodeJsps (the tooling isn't set up quite right for having a delta like
that), so I think you can simply proceed with 14.0 as you were.
Mark
…On Mon, Mar 1, 2021 at 1:35 PM Manish Goregaokar ***@***.***> wrote:
Should I still pull in 13.1?
Also, from the ucd-dev list it seems like a lot of data files have been
updated: Should I be pulling in the latest files before we can merge this?
I think yes. Just the four files from https://www.unicode.org/Public/emoji/13.1/
Possible, but I suggest doing that in the next PR.
Done (and addressed the other comments).
Will do, and then I suppose I should send the updated generated files to Ken?
Sure. Best is to post the generated files somewhere and send to Ken with cc to ucd-dev. We usually post to https://corp.unicode.org/~book/incoming/username -- do you have access to that? (Should be fine posting wherever is convenient.)
looks plausible :-)
please squash & merge
Done! Going to pull in the latest data soon
I have the password for |
I made some doc changes to update https://sites.google.com/site/unicodetools/home to the latest set of steps.
Thanks!
Mark
…On Tue, Mar 2, 2021 at 5:37 PM Manish Goregaokar ***@***.***> wrote:
I made some doc changes to update
https://sites.google.com/site/unicodetools/home to the latest set of
steps.
This PR:
ucd and emoji folders. (No other changes are made to the data files)
Some things missing from the documentation:
.bat files are no longer generated and I couldn't figure out how to do it. Instead I followed the instructions here with the Python script
extra and cldr folders do not exist under unicodetools/data/ucd/14.0.0, but they do under Generated. I only updated files which existed
unicodetools/data/ucd/14.0.0 in unicodetools, and Generated/UCD/d1/extra in Generated. Copying it out is not enough; the Generated Blocks.txt uses short names for blocks, whereas the unicodetools one uses long ones, and UCD asks for an additional file for the mapping. For now I just patched the header on Blocks.txt, but it seems like we should be generating d1/Blocks.txt as well with the long codes.
I can upload the d1 files if you tell me when and where.
cc @markusicu @Ken-Whistler