upgrade to Unicode 5.2 #52272
Comments
Is there any benefit to upgrading the UCD in trunk?
Excerpt of the release note:
The Unicode Standard, Version 5.2, adds 6,648 characters and significantly improves the documentation of conformance requirements for the specification of normalization forms, canonical ordering, and the status of types of properties. Version 5.2 brings improved clarity of presentation in many Unicode Standard Annexes.
Seven new contemporary scripts have been added in Version 5.2: Bamum, Javanese, Lisu, Meetei Mayek, Samaritan, Tai Tham, and Tai Viet. New character additions to existing scripts now provide greater support for Abkhaz, Canadian Aboriginal Syllabics, Coptic, Devanagari, Khamti Shan, Malayalam, and Myanmar. Of particular note are Devanagari additions in support of Vedic Sanskrit. Encoding Vedic is significant because Sanskrit is one of the principal languages for the religious heritage of India, and because Vedic represents the earliest attested phase of the language. The seven contemporary scripts and newly encoded individual characters expand support of language and orthographic communities in Africa, India, China, Central Asia, Southeast Asia, and the Middle East.
Other character additions include important modern use symbols and historic characters. With Unicode Version 5.2, scholars will now have access to the Gardiner set of Egyptian Hieroglyphs as well as other important historic scripts: Imperial Aramaic, Avestan, Kaithi, Old South Arabian, and Old Turkic. Several key symbol sets were added or expanded: the ARIB set of Japanese broadcasting symbols, additional number forms used in India, and currency symbols.
Current version is 5.1 in Python 2.6.
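For readers who want to check which UCD version their own interpreter was built against, the standard `unicodedata` module exposes it as `unidata_version`. A minimal sketch, written for Python 3; the helper name `ucd_version_tuple` is made up for illustration and is not part of Python:

```python
import unicodedata


def ucd_version_tuple(version=unicodedata.unidata_version):
    """Turn a version string such as '5.2.0' into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


if __name__ == "__main__":
    # unidata_version is the UCD version compiled into this interpreter.
    print("UCD version:", unicodedata.unidata_version)
    print("At least 5.2:", ucd_version_tuple() >= (5, 2))
```

Comparing tuples rather than raw strings avoids misordering versions such as "10.0.0" and "5.2.0".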
Have you checked how big the structural changes are between 5.1 and 5.2? If we only have to rerun the makeunicodedata.py script, then I'd be +1 on going with 5.2. Otherwise, I think it's better to wait another release before upgrading to the then-latest Unicode version.
It is just a matter of running "makeunicodedata" after changing "5.1" -> "5.2". It generates the 3 db files: Then you adjust the "expectedchecksum" in "Lib/test/test_unicodedata.py". I have been using UCD 5.2 since January, and everything works fine.
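The reason "expectedchecksum" has to change after regenerating the database can be illustrated with a simplified sketch. This is not the code in "Lib/test/test_unicodedata.py"; it only assumes that the test hashes per-character properties, and the function name `property_checksum` and the choice of properties are hypothetical:

```python
import hashlib
import unicodedata


def property_checksum(start=0, stop=0x10000):
    """Hash a few per-character properties for code points in [start, stop).

    Illustrative only: the real test suite uses its own set of properties.
    """
    digest = hashlib.sha1()
    for code_point in range(start, stop):
        ch = chr(code_point)
        fields = (
            unicodedata.category(ch),
            unicodedata.bidirectional(ch),
            str(unicodedata.combining(ch)),
            unicodedata.name(ch, ""),  # empty string for unnamed code points
        )
        digest.update("|".join(fields).encode("ascii"))
    return digest.hexdigest()


if __name__ == "__main__":
    # Any newly assigned character in the range changes the digest,
    # so a stored expected value must be refreshed after a UCD upgrade.
    print(property_checksum())
```

Because every newly assigned code point changes at least its category and name, a digest computed against UCD 5.1 cannot match one computed against 5.2.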
Florent Xicluna wrote:
So the Unicode database format itself has not changed?
No. The changes listed below have no impact, as far as I have tested.
---------
The detailed listing of all changes to the contributory data files of the Unicode Character Database for Version 5.2.0 can be found in UAX #44, Unicode Character Database. The most significant changes include:
---------
See also:
Florent Xicluna wrote:
Ok, so +1 for updating to 5.2. The files that have changed are not used by Python (yet), so there's
Thanks for checking.
Done with r79059 and r79062.
Reverted in 3.x: it triggers some failures. Symptoms:
Florent Xicluna wrote:
repr() for Unicode doesn't use the Unicode database. Are you sure that
Looking closer at the patch, you also changed the unicodetype mappings
If that's the case, please also revert the Python 2.7 checkin.
Thanks, Marc-Andre Lemburg
The bug was a side-effect of the update. Code point "\uAAAA" is now assigned to a printable character:
AAAA;TAI VIET LETTER LOW VO;Lo;0;L;;;;;N;;;;;
And test_bigmem relies on this code point being non-printable. The regression test suite passes flawlessly. I will do further tests before merging back in 3.x.
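The record quoted above can be cross-checked against a running interpreter. A minimal sketch, assuming Python 3 built against Unicode 5.2 or later; `parse_record` is a hypothetical helper that reads only a few of the semicolon-separated UnicodeData.txt fields:

```python
import unicodedata

# Record quoted in the comment above, in UnicodeData.txt format.
RECORD = "AAAA;TAI VIET LETTER LOW VO;Lo;0;L;;;;;N;;;;;"


def parse_record(line):
    """Extract code point, name, general category and bidi class."""
    fields = line.split(";")
    return {
        "code_point": int(fields[0], 16),
        "name": fields[1],
        "category": fields[2],
        "bidi": fields[4],
    }


if __name__ == "__main__":
    record = parse_record(RECORD)
    ch = chr(record["code_point"])
    # With UCD 5.1 this code point was unassigned; with 5.2+ it is a letter.
    print(unicodedata.name(ch, "<unassigned>"), unicodedata.category(ch))
    # str.isprintable() drives whether repr() escapes the character in 3.x.
    print("printable:", ch.isprintable())
    print("ascii():", ascii(ch))
```

A test that expects repr() to escape this code point will therefore behave differently once the character becomes assigned and printable.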
Does it? On the contrary, it seems to me that with r79059, unicodetype_db.h grew by 200 lines.
Florent Xicluna wrote:
That's better. You wrote about '\uaaa' (3 'a's) in your previous post
Please also check what happened to all those code points that were
Thanks.
Amaury Forgeot d'Arc wrote:
Ooops :-) I now realize that I was looking at the patch reverting
Sorry about that.
Merged with r79093.