bugs in unicodedata.normalize: u1176, u11a7 and u11c3 #73642
Comments
unicodedata can't NFC-normalize Hangul strings that contain \u1176 (HANGUL JUNGSEONG A-O).
>>> from unicodedata import normalize
>>> normalize("NFC", "\u1100\u1176\u11a8")
'깍'
The result should be "\u1100\u1176\u11a8", not '깍' (\uae4d).
I attached a patch for this issue (fixing the boundary of modern medial vowels).
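The reported output is consistent with U+1176 being accepted as a medial vowel with index 21, one past the last valid index (VCount is 21, so valid indices are 0..20). A minimal sketch of that arithmetic, using the Hangul composition constants from the Unicode Standard; `compose_unchecked` is a hypothetical helper for illustration only and is not CPython's code:

```python
# Hangul composition constants from the Unicode Standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose_unchecked(l, v, t):
    """Apply the composition arithmetic with no range checks (illustration only)."""
    l_index = ord(l) - L_BASE
    v_index = ord(v) - V_BASE
    t_index = ord(t) - T_BASE
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


# U+1176 lies just outside the modern vowels U+1161..U+1175.
v_index = ord("\u1176") - V_BASE
result = compose_unchecked("\u1100", "\u1176", "\u11a8")
print(v_index, hex(ord(result)))  # 21 0xae4d
```

An index of 21 overflows into the next leading consonant's block of syllables, which is why the output starts with a different consonant sound than the input.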
How about the third character's range? The code seems to assume it's [11a7..11c3], while the spec says [11a8..11c2]:
>>> unicodedata.normalize("NFC", "\u1100\u1175\u11a7")
'기'
Shouldn't it be '기ᆧ'?
I think you are right. The range of modern final consonants is [11a8..11c2].
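The ranges agreed on above can be sketched as a range-checked composition step. This is an illustration in pure Python, not the code in the patch; `compose_hangul` is a hypothetical name, and it only handles sequences of conjoining jamo (full NFC also covers, among other things, a precomposed LV syllable followed by a trailing consonant).

```python
# Hangul composition constants from the Unicode Standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose_hangul(text):
    """Compose L+V(+T) jamo sequences into precomposed syllables (sketch)."""
    out = []
    i = 0
    while i < len(text):
        l_index = ord(text[i]) - L_BASE
        if 0 <= l_index < L_COUNT and i + 1 < len(text):
            v_index = ord(text[i + 1]) - V_BASE
            if 0 <= v_index < V_COUNT:  # medial vowels U+1161..U+1175 only
                code = S_BASE + (l_index * V_COUNT + v_index) * T_COUNT
                i += 2
                if i < len(text):
                    t_index = ord(text[i]) - T_BASE
                    if 0 < t_index < T_COUNT:  # finals U+11A8..U+11C2 only
                        code += t_index
                        i += 1
                out.append(chr(code))
                continue
        # Not the start of a composable sequence: copy through unchanged.
        out.append(text[i])
        i += 1
    return "".join(out)
```

With these bounds, `compose_hangul("\u1100\u1176\u11a8")` returns its input unchanged, and `compose_hangul("\u1100\u1175\u11a7")` returns `"\uae30\u11a7"`, keeping U+11A7 as a separate character.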
Is there anything more needed?
We have moved our code hosting to GitHub. Would you mind turning your patch into a GitHub PR first, Wonsup?
Ok, I'll do it.
Any updates? I need this fix for my project.
I added some test cases for this issue. Please, someone check this.
I think it can be merged. Is there anything I need to do?
Hi Wonsup, sorry for the delay. I've been really busy with work these days. If no one else gets involved, I'll try to find time to review your patch this week.
This patch accounts for a change made in Unicode 4.1.0. @animalize says:
Before Unicode 4.1.0 (draft), the condition was: TBase <= code <= TBase+TCount
Since Unicode 4.1.0, it is: TBase < code < TBase+TCount, which is in line with the latest version (Unicode 10.0).
This change happened in 2005.
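To see which code points the two conditions disagree on, one can enumerate a small range around the trailing consonants. `old_check` and `new_check` are hypothetical names for the two conditions quoted above, with TBase = 0x11A7 and TCount = 28 taken from the Unicode Standard:

```python
T_BASE, T_COUNT = 0x11A7, 28


def old_check(code):
    """Inclusive bounds, as quoted for the pre-4.1.0 draft."""
    return T_BASE <= code <= T_BASE + T_COUNT


def new_check(code):
    """Strict bounds, as quoted for Unicode 4.1.0 and later."""
    return T_BASE < code < T_BASE + T_COUNT


# Code points that the two conditions classify differently.
differing = [hex(c) for c in range(0x11A0, 0x11D0) if old_check(c) != new_check(c)]
print(differing)  # ['0x11a7', '0x11c3']
```

The two endpoints are exactly the trailing-consonant code points named in the issue title.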
Hello?
ping, this was forgotten.
Hello!
Sorry for the absence and late response. I just reviewed it and think it's ready. I think the change in the Unicode standard is more like a fix for a bug in the implementation than an intentional change. Unicode 3.0 already states that the third character is out of bounds when TIndex <= 0 or TIndex >= TCount. We have a ucd_3_2_0 in unicodedata. I'll merge it after resolving the CI bot.
This 3.2 unicodedata is probably used for IDNA2003. We have now changed the Hangul composition code to follow Unicode Standard 4.1+, which also fixes the bug for versions before 4.1.
As I said, I checked Unicode 3.0 for the Hangul composition algorithm. It looks consistent with Unicode 4.1+. Unicode 3.0 has only a description, with no sample implementation. So I think the changed code also applies to Unicode 3.0+.
You are right. I found a Normalization Test Suite for Unicode 3.2: \u1176 is not in the range of the second character.
Thanks for your confirmation, Ma Lin. Also thanks to Wonsup!