Unicode 15 initial data files #171

markusicu · 2021-12-01T21:21:26Z

Unihan data files from @Tsengtsz dec01
new hardcoded CJK range for extension H
initial UCD files from @Ken-Whistler dec02..03
short & long block names in ShortBlockNames.txt, modified from Blocks.txt
script codes Kawi+Nagm; run GenerateEnums for UcdPropertyValues.java
generated files

Notes from Ken

The repertoire here covers ALL of the 15.0.0 additions, synched to the Pipeline page (and matching Michel's CDAM ballot draft as of 10/31, but not yet incorporating any CDAM ballot comment dispositions from later).

Note that there is one significant departure, to deal with the name collision for U+1DF27 LATIN SMALL LETTER N WITH LEFT HOOK. I've anticipated the most likely outcome and added "RAISED" into the names of 1DF25..1DF2A.

Notes from Ken about the initial drop for PropList.txt

This includes the non-automatic new property assignments:

Added Ideographic and Unified_Ideograph for Extension H (31350..323AF)
Added Other_Alphabetic for one Kannada mark (0CF3), one Khojki vowel
sign (11241), and various Kawi signs and vowel signs (11F00..11F01,
11F03, 11F34..11F3A, 11F3E..11F40).
Added Diacritic for 3 Arabic word signs (10EFD..10EFF) and for the
Cyrillic modifier letters (1E030..1E06C).
Also added Other_Lowercase for the Cyrillic modifier letters.
Added Terminal_Punctuation and Sentence_Terminal for the two Kawi
dandas (11F43..11F44), for general consistency with the way the danda
and double danda are treated in related scripts. The rest of the Kawi
punctuation is really murky, with no real analysis presented in the
proposal, so I didn't make any assumptions that it would play in
sentence break or even be terminal in position. (Kawi is one of the SE
Asian scripts with no word spaces, so it ends up as lb=SA and requires
special handling for paragraph formatting, anyway.)
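
For reviewers who want to spot-check the assignments listed above, here is a minimal sketch in Python that records them as code point ranges and looks up which of these properties a given code point received. It is not part of the unicodetools code; the names `ASSIGNMENTS`, `properties_of` and `format_range` are made up for illustration. Two assumptions: the first Kawi range is taken as 11F00..11F01, and Other_Lowercase is given the same range as the Diacritic entry for the Cyrillic modifier letters, since the note does not repeat it.

```python
# Hypothetical summary of the non-automatic PropList.txt additions from Ken's note.
# Ranges are inclusive (start, end) code point pairs.
ASSIGNMENTS = {
    "Ideographic": [(0x31350, 0x323AF)],
    "Unified_Ideograph": [(0x31350, 0x323AF)],
    "Other_Alphabetic": [
        (0x0CF3, 0x0CF3),    # one Kannada mark
        (0x11241, 0x11241),  # one Khojki vowel sign
        (0x11F00, 0x11F01),  # Kawi signs (range assumed, see lead-in)
        (0x11F03, 0x11F03),
        (0x11F34, 0x11F3A),
        (0x11F3E, 0x11F40),
    ],
    "Diacritic": [(0x10EFD, 0x10EFF), (0x1E030, 0x1E06C)],
    # Assumption: same range as the Diacritic entry for the Cyrillic modifier letters.
    "Other_Lowercase": [(0x1E030, 0x1E06C)],
    "Terminal_Punctuation": [(0x11F43, 0x11F44)],
    "Sentence_Terminal": [(0x11F43, 0x11F44)],
}


def properties_of(cp):
    """Return the sorted names of the listed properties that include cp."""
    return sorted(
        name
        for name, ranges in ASSIGNMENTS.items()
        if any(start <= cp <= end for start, end in ranges)
    )


def format_range(start, end):
    """Format a range the way UCD files write it: 'XXXX' or 'XXXX..YYYY'."""
    if start == end:
        return f"{start:04X}"
    return f"{start:04X}..{end:04X}"


if __name__ == "__main__":
    for name, ranges in ASSIGNMENTS.items():
        for start, end in ranges:
            print(f"{format_range(start, end):<14}; {name}")
    print(properties_of(0x11F43))
```

With this table, the two Kawi dandas come back with both Terminal_Punctuation and Sentence_Terminal, and the withdrawn U+11F3C falls outside every listed range.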

Notes from Markus

2 new sets of decimal digits

11F50..11F59  ; Decimal # Nd  [10] KAWI DIGIT ZERO..KAWI DIGIT NINE
1E4F0..1E4F9  ; Decimal # Nd  [10] NAG MUNDARI DIGIT ZERO..NAG MUNDARI DIGIT NINE
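
As a rough illustration of how the two lines above can be read, the following Python sketch parses them into ranges and derives a digit value from the offset within each range. The function names are hypothetical, and the offset rule is an assumption that only holds because each range is a contiguous run of ten characters named DIGIT ZERO through DIGIT NINE.

```python
import re

# The two lines quoted above, verbatim.
LINES = """\
11F50..11F59  ; Decimal # Nd  [10] KAWI DIGIT ZERO..KAWI DIGIT NINE
1E4F0..1E4F9  ; Decimal # Nd  [10] NAG MUNDARI DIGIT ZERO..NAG MUNDARI DIGIT NINE
"""

# A code point or code point range, then ';', then the property value.
RANGE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_ranges(text):
    """Parse UCD-style range lines into (start, end, value) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        match = RANGE_RE.match(line)
        if match is None:
            raise ValueError(f"unrecognized line: {line!r}")
        start = int(match.group(1), 16)
        end = int(match.group(2) or match.group(1), 16)
        ranges.append((start, end, match.group(3)))
    return ranges


def digit_value(cp, ranges):
    """Return 0..9 for a code point in a Decimal range, else None.

    Assumes each Decimal range starts at the script's DIGIT ZERO.
    """
    for start, end, value in ranges:
        if value == "Decimal" and start <= cp <= end:
            return cp - start
    return None


if __name__ == "__main__":
    ranges = parse_ranges(LINES)
    print(digit_value(0x11F57, ranges))  # 7 (offset from U+11F50)
```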

markusicu · 2021-12-01T21:28:22Z

I regenerated the UCD files. No changes; in particular, no changes in DerivedNumericTypes/Values.

markusicu · 2021-12-01T22:18:39Z

TestInvariants fails. I sent an email discussing how to deal with incomplete data drops, such as Unihan data for new characters before even UnicodeData.txt has entries for new characters.
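
To make the incomplete-drop problem concrete, here is a small Python sketch of one check that would flag it: report code points that appear in Unihan data but are not covered by any UnicodeData.txt entry. This is not what TestInvariants does, and all function names are hypothetical. The sample inputs are placeholders in the shape of the real files, not actual data from this drop; the Unihan field values in particular are invented.

```python
# Placeholder UnicodeData.txt excerpt: Extension G is present, Extension H is not yet.
UCD_SAMPLE = """\
30000;<CJK Ideograph Extension G, First>;Lo;0;L;;;;;N;;;;;
3134A;<CJK Ideograph Extension G, Last>;Lo;0;L;;;;;N;;;;;
"""

# Placeholder Unihan excerpt (tab-separated); the values are illustrative only.
UNIHAN_SAMPLE = "U+30000\tkTotalStrokes\t1\nU+31350\tkTotalStrokes\t1\n"


def unihan_code_points(text):
    """Collect the code points named in the first column of Unihan lines."""
    cps = set()
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        cps.add(int(line.split("\t", 1)[0][2:], 16))  # strip the "U+" prefix
    return cps


def unicode_data_ranges(text):
    """Return inclusive (start, end) ranges, pairing First/Last rows."""
    ranges = []
    first = None
    for line in text.splitlines():
        if not line:
            continue
        fields = line.split(";")
        cp, name = int(fields[0], 16), fields[1]
        if name.endswith(", First>"):
            first = cp
        elif name.endswith(", Last>"):
            ranges.append((first, cp))
            first = None
        else:
            ranges.append((cp, cp))
    return ranges


def missing_from_ucd(unihan_text, ucd_text):
    """List Unihan code points that no UnicodeData.txt entry covers."""
    ranges = unicode_data_ranges(ucd_text)
    return sorted(
        cp
        for cp in unihan_code_points(unihan_text)
        if not any(start <= cp <= end for start, end in ranges)
    )


if __name__ == "__main__":
    print([f"U+{cp:04X}" for cp in missing_from_ucd(UNIHAN_SAMPLE, UCD_SAMPLE)])
```

Whether such a mismatch should fail the build or only warn during an early drop is the open question raised in the email.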

…regen UCD

Unihan 15 data 20211201

ef9de93

markusicu requested review from Manishearth, pedberg-icu and Ken-Whistler December 1, 2021 21:21

markusicu requested a review from Tsengtsz December 1, 2021 22:17

Manishearth previously approved these changes Dec 2, 2021

Tsengtsz previously approved these changes Dec 2, 2021

UCD 15 initial data files

78bf520

markusicu dismissed stale reviews from Tsengtsz and Manishearth via 78bf520 December 2, 2021 18:11

markusicu changed the title ~~Unihan 15 data 20211201~~ Unicode 15 initial data files Dec 2, 2021

markusicu added 3 commits December 2, 2021 10:44

short & long block property names, and generated files

8d8df8e

new hardcoded CJK range for extension H

5c1bf66

U+11F3C KAWI VOWEL SIGN VOCALIC L has been withdrawn

299d685

Manishearth previously approved these changes Dec 3, 2021

Scripts for new characters

59090b8

markusicu dismissed Manishearth’s stale review via 59090b8 December 3, 2021 18:45

markusicu added 4 commits December 3, 2021 11:57

add script codes Kawi+Nagm; run GenerateEnums for UcdPropertyValues; …

ab42374

…regen UCD

Ken: 6 characters should have sc=Latin

27cbf03

from Ken: initial drop for PropList.txt

3b0408f

generated files after PropList.txt update

072931a

Manishearth approved these changes Dec 4, 2021

srl295 assigned markusicu Dec 6, 2021

markusicu merged commit 9c9ef12 into unicode-org:main Dec 9, 2021

markusicu deleted the unihan-15-2021dec01 branch December 9, 2021 16:48
