Unicode 15 initial data files #171

Merged
markusicu merged 10 commits into unicode-org:main from unihan-15-2021dec01
Dec 9, 2021

Conversation

markusicu
Member

@markusicu markusicu commented Dec 1, 2021

  • Unihan data files from @Tsengtsz dec01
  • new hardcoded CJK range for extension H
  • initial UCD files from @Ken-Whistler dec02..03
  • short & long block names in ShortBlockNames.txt, modified from Blocks.txt
  • script codes Kawi+Nagm; run GenerateEnums for UcdPropertyValues.java
  • generated files
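
As a rough illustration of what a hardcoded range for Extension H involves, here is a minimal Python sketch. The bounds 31350..323AF are taken from Ken's PropList notes in this PR; the names `CJK_EXT_H` and `is_cjk_ext_h` are hypothetical and are not the identifiers used in the actual tools.

```python
# Hypothetical sketch: a hardcoded code point range for CJK Extension H.
# The bounds come from the PropList notes in this PR (31350..323AF).
CJK_EXT_H = range(0x31350, 0x323AF + 1)  # +1 because the UCD range is inclusive


def is_cjk_ext_h(cp: int) -> bool:
    """Return True if cp falls inside the hardcoded Extension H range."""
    return cp in CJK_EXT_H
```

A hardcoded range like this lets code classify the new ideographs before UnicodeData.txt has entries for them, at the cost of having to be updated by hand if the range changes.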

Notes from Ken

The repertoire here covers ALL of the 15.0.0 additions, synched to the Pipeline page (and matching Michel's CDAM ballot draft as of 10/31, but not yet incorporating any CDAM ballot comment dispositions from later).

Note that there is one significant departure, to deal with the name collision for U+1DF27 LATIN SMALL LETTER N WITH LEFT HOOK. I've anticipated the most likely outcome and added "RAISED" into the names of 1DF25..1DF2A.

Notes from Ken about the initial drop for PropList.txt

This includes the non-automatic new property assignments:

  1. Added Ideographic and Unified_Ideograph for Extension H (31350..323AF)

  2. Added Other_Alphabetic for one Kannada mark (0CF3), one Khojki vowel
    sign (11241), and various Kawi signs and vowel signs (11F00..11F01,
    11F03, 11F34..11F3A, 11F3E..11F40).

  3. Added Diacritic for 3 Arabic word signs (10EFD..10EFF) and for the
    Cyrillic modifier letters (1E030..1E06C).

  4. Also added Other_Lowercase for the Cyrillic modifier letters.

  5. Added Terminal_Punctuation and Sentence_Terminal for the two Kawi
    dandas (11F43..11F44), for general consistency with the way the danda
    and double danda are treated in related scripts. The rest of the Kawi
    punctuation is really murky, with no real analysis presented in the
    proposal, so I didn't make any assumptions that it would play in
    sentence break or even be terminal in position. (Kawi is one of the SE
    Asian scripts with no word spaces, so it ends up as lb=SA and requires
    special handling for paragraph formatting, anyway.)
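
The assignments listed above can be written as `range ; property` lines and expanded mechanically. The following Python sketch does that. The sample lines are transcribed from the notes, not copied from the real PropList.txt, and the helper names `parse_prop_lines` and `has_property` are hypothetical.

```python
from collections import defaultdict

# Lines in "range ; property" style, transcribed from the notes above.
# Illustrative input only, not an excerpt of the real PropList.txt.
SAMPLE = """\
31350..323AF  ; Ideographic
31350..323AF  ; Unified_Ideograph
0CF3          ; Other_Alphabetic
11241         ; Other_Alphabetic
11F00..11F01  ; Other_Alphabetic
11F03         ; Other_Alphabetic
11F34..11F3A  ; Other_Alphabetic
11F3E..11F40  ; Other_Alphabetic
10EFD..10EFF  ; Diacritic
1E030..1E06C  ; Diacritic
1E030..1E06C  ; Other_Lowercase
11F43..11F44  ; Terminal_Punctuation
11F43..11F44  ; Sentence_Terminal
"""


def parse_prop_lines(text):
    """Map each property name to a sorted list of inclusive (start, end) ranges."""
    props = defaultdict(list)
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        rng, prop = (field.strip() for field in line.split(";")[:2])
        start, _, end = rng.partition("..")
        # A single code point such as "0CF3" has no "..", so end is empty.
        props[prop].append((int(start, 16), int(end or start, 16)))
    return {name: sorted(ranges) for name, ranges in props.items()}


def has_property(props, prop, cp):
    """Return True if code point cp lies in any range recorded for prop."""
    return any(start <= cp <= end for start, end in props.get(prop, []))


PROPS = parse_prop_lines(SAMPLE)
```

With this table, `has_property(PROPS, "Sentence_Terminal", 0x11F44)` is true for the Kawi double danda position, while the rest of the Kawi punctuation stays unassigned, matching the cautious approach described in item 5.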

Notes from Markus

2 new sets of decimal digits

11F50..11F59  ; Decimal # Nd  [10] KAWI DIGIT ZERO..KAWI DIGIT NINE
1E4F0..1E4F9  ; Decimal # Nd  [10] NAG MUNDARI DIGIT ZERO..NAG MUNDARI DIGIT NINE
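
To make the two lines above concrete, here is a small Python sketch that parses them and derives the digit value of each code point. It assumes each range is a contiguous run from ZERO to NINE, as the line comments state; the function name `decimal_digit_values` is hypothetical.

```python
import re

LINES = """\
11F50..11F59  ; Decimal # Nd  [10] KAWI DIGIT ZERO..KAWI DIGIT NINE
1E4F0..1E4F9  ; Decimal # Nd  [10] NAG MUNDARI DIGIT ZERO..NAG MUNDARI DIGIT NINE
"""

# Matches "START..END ; Decimal" at the beginning of a line.
LINE_RE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+)\s*;\s*Decimal\b")


def decimal_digit_values(text):
    """Map each code point in a ten-digit Decimal range to its value 0..9."""
    values = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        start, end = int(match.group(1), 16), int(match.group(2), 16)
        if end - start != 9:
            raise ValueError(f"expected a run of 10 digits: {line!r}")
        for offset in range(10):
            values[start + offset] = offset
    return values


DIGITS = decimal_digit_values(LINES)
```

For example, `DIGITS[0x11F55]` is 5 (KAWI DIGIT FIVE under the stated assumption), and the table holds 20 entries in total.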

@markusicu
Member Author

I regenerated the UCD files. No changes; in particular, no changes in DerivedNumericTypes/Values.

@markusicu
Member Author

TestInvariants fails. I sent an email discussing how to deal with incomplete data drops, such as Unihan data for new characters before even UnicodeData.txt has entries for new characters.
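
The kind of inconsistency described here can be sketched in Python as a simple cross-check. This is not what TestInvariants actually does; the function name and the input data are made up for illustration.

```python
def unihan_without_unicode_data(unihan_cps, unicode_data_ranges):
    """Return, sorted, the Unihan code points that no UnicodeData range covers."""
    def covered(cp):
        return any(start <= cp <= end for start, end in unicode_data_ranges)

    return sorted(cp for cp in set(unihan_cps) if not covered(cp))


# Made-up inputs for illustration only.
UNICODE_DATA_RANGES = [(0x4E00, 0x9FFF)]  # ranges in a hypothetical UnicodeData.txt
UNIHAN_CPS = {0x4E00, 0x31350, 0x31351}   # code points with Unihan entries

MISSING = unihan_without_unicode_data(UNIHAN_CPS, UNICODE_DATA_RANGES)
```

Here `MISSING` holds the two Extension H code points, which have Unihan data but no UnicodeData entry in the made-up input. A check along these lines could report such gaps as expected during an incomplete data drop instead of failing outright.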

Manishearth previously approved these changes Dec 2, 2021
Tsengtsz previously approved these changes Dec 2, 2021
@markusicu markusicu dismissed stale reviews from Tsengtsz and Manishearth via 78bf520 December 2, 2021 18:11
@markusicu markusicu changed the title from "Unihan 15 data 20211201" to "Unicode 15 initial data files" Dec 2, 2021
Manishearth previously approved these changes Dec 3, 2021
@markusicu markusicu merged commit 9c9ef12 into unicode-org:main Dec 9, 2021
@markusicu markusicu deleted the unihan-15-2021dec01 branch December 9, 2021 16:48
3 participants