Prepare for Unicode 14 #40

Merged 9 commits on Mar 1, 2021

Manishearth
Member

This PR:

  • Adds Unicode 14 to all the requisite enums, etc
  • Copies Unicode 13's ucd and emoji folders.
  • Applies updated version/date headers to these files

(No other changes are made to the data files)

Some things missing from the documentation:

  • The .bat files are no longer generated and I couldn't figure out how to do it. Instead I followed the instructions here with the Python script
  • The extra and cldr folders do not exist under unicodetools/data/ucd/14.0.0, but they do under Generated. I only updated files which existed
  • Blocks.txt is under unicodetools/data/ucd/14.0.0 in unicodetools, and Generated/UCD/d1/extra in Generated. Copying it out is not enough; the Generated Blocks.txt uses short names for blocks, whereas the unicodetools one uses long ones, and UCD asks for an additional file for the mapping. For now I just patched the header on Blocks.txt, but it seems like we should be generating d1/Blocks.txt as well with the long codes.

I can upload the d1 files if you tell me when and where.

cc @markusicu @Ken-Whistler

@Manishearth changed the title from "Create directories for Unicode 14" to "Prepare for Unicode 14" on Feb 11, 2021
Member

@macchiati macchiati left a comment

What I usually do is diff them against the previous version, to check that the data looks sensible. I can do that, but probably later during the week.
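
A minimal sketch of that kind of check, assuming the old and new data sit in two sibling folders of `.txt` files. The function name `diff_data_dirs` and the folder layout are hypothetical and not part of unicodetools; the sketch only uses Python's standard `difflib`.

```python
import difflib
from pathlib import Path


def diff_data_dirs(old_dir, new_dir):
    """Return unified-diff lines for every .txt file under new_dir.

    Files that have no counterpart in old_dir are reported instead of diffed.
    """
    old_dir, new_dir = Path(old_dir), Path(new_dir)
    report = []
    for new_file in sorted(new_dir.rglob("*.txt")):
        rel = new_file.relative_to(new_dir)
        old_file = old_dir / rel
        if not old_file.is_file():
            report.append(f"only in new version: {rel}")
            continue
        old_lines = old_file.read_text(encoding="utf-8").splitlines()
        new_lines = new_file.read_text(encoding="utf-8").splitlines()
        # unified_diff yields nothing when the two files are identical.
        report.extend(
            difflib.unified_diff(
                old_lines,
                new_lines,
                fromfile=f"old/{rel}",
                tofile=f"new/{rel}",
                lineterm="",
            )
        )
    return report
```

Printing the returned lines gives a quick overview of which files changed and how, which is usually enough to spot data that does not look sensible.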

One quick note is that the emoji files are showing version 13.0, which will need fixing. Not your problem since you were going by the instructions (and they were clearly incomplete)

@Manishearth
Member Author

One quick note is that the emoji files are showing version 13.0, which will need fixing. Not your problem since you were going by the instructions (and they were clearly incomplete)

Should be easy enough to fix.
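
One way to catch stale version headers like this before review is a small scan over the data folder. This is only a sketch: it assumes each file carries a `# Version: X.Y` comment line near the top, and the function name `find_stale_versions` is hypothetical rather than anything in unicodetools.

```python
import re
from pathlib import Path

# Assumed header shape: a comment line such as "# Version: 13.0".
VERSION_LINE = re.compile(r"^#\s*Version:\s*(\d+(?:\.\d+)*)")


def find_stale_versions(folder, expected, max_header_lines=20):
    """Map file name -> version found, for files whose header differs from expected."""
    stale = {}
    for path in sorted(Path(folder).glob("*.txt")):
        with path.open(encoding="utf-8") as handle:
            for index, line in enumerate(handle):
                if index >= max_header_lines:
                    break  # no version line in the header; skip this file
                match = VERSION_LINE.match(line)
                if match:
                    if match.group(1) != expected:
                        stale[path.name] = match.group(1)
                    break
    return stale
```

Files without a recognizable version line are silently skipped, so an empty result means "no mismatch found", not "every file is correct".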

Member

@markusicu markusicu left a comment

Thanks, looks plausible for a start.

I always do a combination of following the notes on the unicodetools site plus “forensic programming”. I don't remember the details from one year to the next :-}

I see that you mostly copied Unicode 13 files as-is, but also put some Unicode 14 files in here. I think I used to start with a commit of just the old files in the new folders, for a more visible file history. Not sure if it makes much of a difference though. It might help git with the file history if the initial diffs for the new version are large enough to exceed git's threshold for whether the new version looks similar to the old one and record a "copy". Probably less important generally than in svn.

We will also need to copy the version-13 files for idna, security, and UCA, for a clean start there. Can be a separate PR.

As for generated files, they don't go directly back into the repo. We post the generated files to a location where KenW grabs them, looks them over, and if they are ok, he posts them in the Unicode Public folder. Then we grab all of the files from there again, put them into the repo's data folders, and run the tools again. Etc.

For example, here is where I posted files that I generated for Unicode 13: https://corp.unicode.org/~book/incoming/markus/u13/ plus (once UCD is settled) https://corp.unicode.org/~book/incoming/markus/uca13/

unicodetools/org/unicode/text/UCD/MakeUnicodeFiles.txt (outdated)
unicodetools/org/unicode/text/UCD/MakeUnicodeFiles.txt (outdated)
unicodetools/org/unicode/text/utility/Utility.java (outdated)
@markusicu
Member

  • The extra and cldr folders do not exist under unicodetools/data/ucd/14.0.0, but they do under Generated. I only updated files which existed

That's right. They are files for diagnosis or for use in CLDR and ICU, but don't go into the UCD/UCA/...

To be clear, I assume that what you put into the data input folders is straight from https://www.unicode.org/Public/14.0.0/ right? Do not check in things from the Generated output folder. The generated files first go for review, and we only pick them up after KenW has looked them over and copied them to /Public.

  • Blocks.txt is under unicodetools/data/ucd/14.0.0 in unicodetools, and Generated/UCD/d1/extra in Generated. Copying it out is not enough; the Generated Blocks.txt uses short names for blocks, whereas the unicodetools one uses long ones, and UCD asks for an additional file for the mapping. For now I just patched the header on Blocks.txt, but it seems like we should be generating d1/Blocks.txt as well with the long codes.

I think we update Blocks.txt manually, post it for Ken, wait for his approval and his copy to /Public, and only then copy it into the data folder.

@Manishearth
Member Author

To be clear, I assume that what you put into the data input folders is straight from https://www.unicode.org/Public/14.0.0/ right? Do not check in things from the Generated output folder. The generated files first go for review, and we only pick them up after KenW has looked them over and copied them to /Public.

No, I copied them from Unicode 13, as listed in the docs. I can copy those files instead.

@Manishearth
Member Author

(once this is done I hope to update the docs with everything I've learned)

@markusicu
Member

To be clear, I assume that what you put into the data input folders is straight from https://www.unicode.org/Public/14.0.0/ right? Do not check in things from the Generated output folder. The generated files first go for review, and we only pick them up after KenW has looked them over and copied them to /Public.

No, I copied them from Unicode 13, as listed in the docs. I can copy those files instead.

You copied the ucd and emoji files from the data/.../13.0.0 folders to the data/.../14.0.0 folders, but you also then overwrote some files like unicodetools/data/ucd/14.0.0-Update/PropertyAliases.txt which has a version 14 header. What I am saying is that these files should only ever come from the /Public folder, never directly from the Generated output.

@Manishearth
Member Author

Ah. I'll make an additional commit copying in 14 files from Public

@Manishearth
Member Author

but you also then overwrote some files like unicodetools/data/ucd/14.0.0-Update/PropertyAliases.txt which has a version 14 header

Oh, uh, not quite, this is the Unicode 13 file, I just added the Unicode 14 header to make the tools stop complaining about diffs 😄

@Manishearth
Member Author

Manishearth commented Feb 18, 2021

  • Copied in the generated files
  • Addressed inline comments
  • Added new scripts and blocks to the code. I did not add them to the data files (e.g. PropertyValueAliases.txt for short codes), they are appearing in Generated but as you mentioned they don't need to be copied

@Manishearth
Member Author

It seems to be generating again, after working through all the errors. Note for when I document stuff: we need to document "new scripts" and "new blocks".

@Manishearth
Member Author

Oh, another thing:

  • The emoji/, auxiliary/, and extracted/ folders on /Public are empty. I didn't touch them
  • Unihan.zip does not contain files that overlap with Unihan/, so I didn't update that

The build throws these non-blocking errors:

***Rules missing from SAMPLES for Line: [12.3, 13.04, 21.2, 27.01]
...
java.lang.IllegalArgumentException: Internal error: cjkAccountingNumeric doesn't contain : []
java.lang.IllegalArgumentException: Internal error: cjkOtherNumeric doesn't contain : []
java.lang.IllegalArgumentException: Internal error: cjkPrimaryNumeric doesn't contain : []

@markusicu
Member

Oh, uh, not quite, this is the Unicode 13 file, I just added the Unicode 14 header to make the tools stop complaining about diffs 😄

I wouldn't make this kind of edit. The data files should be the old versions, or come from the /Public online folder, or have real edits for new data.

I believe that I used to make one commit (in svn) with just copies within the repo from the old-version data/ files to the new-version data/ files, and subsequent commits where I copied in /Public files and/or added new data.
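
The first of those commits can be prepared with a plain folder copy. The sketch below is illustrative only: `seed_new_version` is a hypothetical helper, and the folder pairs passed to it are assumptions about the repo layout, not a documented procedure.

```python
import shutil
from pathlib import Path


def seed_new_version(data_root, folder_pairs):
    """Copy each old-version folder to its new-version name under data_root.

    folder_pairs is a list of (old_relative_path, new_relative_path).
    Missing source folders are skipped; an existing destination raises
    FileExistsError, so nothing is ever overwritten.
    """
    created = []
    for old_rel, new_rel in folder_pairs:
        src = Path(data_root) / old_rel
        dst = Path(data_root) / new_rel
        if not src.is_dir():
            continue
        shutil.copytree(src, dst)
        created.append(dst)
    return created
```

Committing the result unchanged, and only afterwards copying in files from /Public or adding new data, keeps the later diffs small and readable.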

Copied in the generated files

This sounds wrong, as discussed...

The emoji/, auxiliary/, and extracted/ folders on /Public are empty. I didn't touch them

They are empty until the files we generated get reviewed and posted.

Unihan.zip does not contain files that overlap with Unihan/, so I didn't update that

The files in Unihan.zip are hard to work with. Hard to diff, and properties can move between files.
Instead, we re-munge them into one file per property.
Instructions are in the "Unihan" section of https://sites.google.com/site/unicodetools/inputdata
(We will need to change that slightly to talk about git instead of svn.)
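
To illustrate the idea of one file per property (this is not the procedure from the linked instructions), here is a rough sketch. It assumes the usual Unihan line shape of three tab-separated fields (code point, property name, value) with `#` comment lines; `split_by_property` is a hypothetical name.

```python
from collections import defaultdict


def split_by_property(lines):
    """Group Unihan data lines by property name.

    Returns a dict mapping property name -> list of data lines, in input order.
    Blank lines and '#' comment lines are ignored.
    """
    by_property = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        # Split on the first two tabs only, so tabs inside a value survive.
        code_point, prop, value = line.split("\t", 2)
        by_property[prop].append(f"{code_point}\t{prop}\t{value}")
    return dict(by_property)
```

Writing each list to its own file named after the property would give files that diff cleanly between versions even when a property moves from one Unihan source file to another.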


FYI I expect to post the last two ISO script codes tomorrow that we need for Unicode 14.
(I am the ISO 15924 registrar. There is a whole process for that.)

@Manishearth
Member Author

I wouldn't make this kind of edit. The data files should be the old versions, or come from the /Public online folder, or have real edits for new data.

Yep, the files are now from /Public

@Manishearth
Member Author

Manishearth commented Feb 18, 2021

Well, except for the emoji files, which don't exist and cause errors until added

@Manishearth
Member Author

Manishearth commented Feb 19, 2021

Okay, the current state is that there are no more "copied from Unicode 13" files in 14.0.0-Update; it is a clean copy of Public/. There is still a step in the git history that copies the relevant files from Unicode 13, but that is purely so that the "Add updated Unicode 14 data" commit can have useful diffs.

There are still files under emoji/14.0 that are copied from Unicode 13, because the program fails without them. What should my next steps be here: should I be getting Ken to upload my generated files? Or getting this PR merged?

Member

@markusicu markusicu left a comment

FYI Mark has a competing PR now adding emoji 13.1. Let's see who gets in first vs. who gets to merge :-)

FYI Next PR could update to https://www.unicode.org/Public/14.0.0/ucd/PropertyValueAliases-14.0.0d2.txt which adds new block and script names.

unicodetools/data/emoji/14.0/ReadMe.txt
unicodetools/org/unicode/text/UCD/MakeUnicodeFiles.txt (outdated)
unicodetools/org/unicode/text/UCD/ShortBlockNames.txt (outdated)
@macchiati
Member

macchiati commented Mar 1, 2021 via email

@Manishearth
Member Author

Should I still pull in 13.1?

Also, from the ucd-dev list it seems like a lot of data files have been updated: Should I be pulling in the latest files before we can merge this?

@macchiati
Member

macchiati commented Mar 1, 2021 via email

@markusicu
Member

Should I still pull in 13.1?

I think yes. Just the four files from https://www.unicode.org/Public/emoji/13.1/

Also, from the ucd-dev list it seems like a lot of data files have been updated: Should I be pulling in the latest files before we can merge this?

Possible, but I suggest to do that in the next PR.

@Manishearth
Member Author

I think yes. Just the four files from https://www.unicode.org/Public/emoji/13.1/

Done (and addressed the other comments).

Possible, but I suggest to do that in the next PR.

Will do, and then I suppose I should send the updated generated files to Ken?

@markusicu
Member

Possible, but I suggest to do that in the next PR.

Will do, and then I suppose I should send the updated generated files to Ken?

Sure. Best is to post the generated files somewhere and send to Ken with cc to ucd-dev.

We usually post to https://corp.unicode.org/~book/incoming/username -- do you have access to that? (Should be fine posting wherever is convenient.)

Member

@markusicu markusicu left a comment

looks plausible :-)

please squash & merge

@Manishearth Manishearth merged commit 05b47f2 into master Mar 1, 2021
@Manishearth
Member Author

Done! Going to pull in the latest data soon

We usually post to https://corp.unicode.org/~book/incoming/username -- do you have access to that? (Should be fine posting wherever is convenient.)

I have the password for book, but I don't know how to upload to it: I'm able to log in to the web with book and get a dir listing, but not to FTP as listed in https://corp.unicode.org/~book/bookurls.txt. Do I need to get an account?

@Manishearth Manishearth deleted the unicode-14 branch March 1, 2021 22:55
@Manishearth
Member Author

I made some doc changes to update https://sites.google.com/site/unicodetools/home to the latest set of steps.

@macchiati
Member

macchiati commented Mar 3, 2021 via email
