Use Py_UCS4 instead of Py_UNICODE in unicodectype.c #49377
Comments
>>> license
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python30\lib\site.py", line 372, in __repr__
self.__setup()
File "C:\Python30\lib\site.py", line 359, in __setup
data = fp.read()
File "C:\Python30\lib\io.py", line 1724, in read
decoder.decode(self.buffer.read(), final=True))
File "C:\Python30\lib\io.py", line 1295, in decode
output = self.decoder.decode(input, final=final)
UnicodeDecodeError: 'cp949' codec can't decode bytes in position 15164-15165: illegal multibyte sequence
>>> chr(0x10000)
'\U00010000'
>>> chr(0x11000)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python30\lib\io.py", line 1491, in write
b = encoder.encode(s)
UnicodeEncodeError: 'cp949' codec can't encode character '\ud804' in position 1: illegal multibyte sequence
I also can't understand why chr(0x10000) and chr(0x11000) behave differently. |
Here (winxpsp2, Py3, cp850-terminal) the license works fine:
>>> license
Type license() to see the full license text
and license() works as well. I get this output for the chr()s:
>>> chr(0x10000)
'\U00010000'
>>> chr(0x11000)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Programs\Python30\lib\io.py", line 1491, in write
b = encoder.encode(s)
File "C:\Programs\Python30\lib\encodings\cp850.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1-2: character maps to <undefined>
I believe that chr(0x10000) and chr(0x11000) should have the opposite behavior. On Linux with Py3 and a UTF-8 terminal, chr(0x10000) prints '\U00010000'. Also note that with cp850 the error message is 'character maps to <undefined>'. |
There were non-ascii characters in the Windows license file; this was the cause of the UnicodeDecodeError above.
This other problem is because, on a narrow unicode build, the code point is effectively truncated to 16 bits:
>>> unicodedata.category(chr(0x10000 % 65536))
'Cc'
>>> unicodedata.category(chr(0x11000 % 65536))
'Lo' |
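As a minimal, self-contained C sketch of that 16-bit truncation (the typedef names are stand-ins, not CPython's own):

#include <stdio.h>
#include <stdint.h>

/* Stand-in types for illustration only (not the CPython typedefs). */
typedef uint16_t narrow_unit;  /* like Py_UNICODE on a narrow build */
typedef uint32_t code_point;   /* like Py_UCS4 */

int main(void)
{
    code_point cp = 0x10000;                  /* U+10000, LINEAR B SYLLABLE B008 A */
    narrow_unit truncated = (narrow_unit)cp;  /* what a 16-bit interface ends up seeing */

    printf("full code point:  U+%05X\n", (unsigned)cp);        /* U+10000 */
    printf("after truncation: U+%04X\n", (unsigned)truncated); /* U+0000  */
    return 0;
}

U+0000 is a control character (category Cc), which matches the 'Cc' result above, while 0x11000 truncates to U+1000, a letter (category Lo).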
I don't understand the behaviour of unichr():
Python 2.7a0 (trunk:68963M, Jan 30 2009, 00:49:28)
>>> import unicodedata
>>> unicodedata.category(u"\U00010000")
'Lo'
>>> unicodedata.category(u"\U00011000")
'Cn'
>>> unicodedata.category(unichr(0x10000))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
Why does unichr() fail whereas \Uxxxxxxxx works?
>>> len(u"\U00010000")
2
>>> ord(u"\U00010000")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found |
FWIW, on Python3 it seems to work:
>>> import unicodedata
>>> unicodedata.category("\U00010000")
'Lo'
>>> unicodedata.category("\U00011000")
'Cn'
>>> unicodedata.category(chr(0x10000))
'Lo'
>>> unicodedata.category(chr(0x11000))
'Cn'
>>> ord(chr(0x10000)), 0x10000
(65536, 65536)
>>> ord(chr(0x11000)), 0x11000
(69632, 69632)
I'm using a narrow build too:
>>> import sys
>>> sys.maxunicode
65535
>>> len('\U00010000')
2
>>> ord('\U00010000')
65536
On Python2, unichr() is supposed to raise a ValueError on a narrow build. Maybe we should open a new issue for this if it's not present already. |
Since r56395, ord() and chr() accept and return surrogate pairs even in narrow builds. The goal is to remove most differences between narrow and wide unicode builds. To address this problem, I suggest changing all functions in unicodectype.c to take and return Py_UCS4 instead of Py_UNICODE. |
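As a rough sketch of the kind of prototype change being proposed (the names below are invented stand-ins; the real helpers follow the _PyUnicode_* naming in unicodectype.c):

#include <stdint.h>

/* Stand-in typedefs so the fragment compiles on its own; in CPython these
 * would be Py_UNICODE (16 bits on a narrow build) and Py_UCS4 (32 bits). */
typedef uint16_t py_unicode16;
typedef uint32_t py_ucs4;

/* Before: the type database can only be asked about 16-bit values, so a
 * non-BMP character (stored as a surrogate pair) can never be classified
 * correctly on a narrow build. */
int          old_IsAlpha(py_unicode16 ch);
py_unicode16 old_ToLowercase(py_unicode16 ch);

/* After (the proposal): any code point, including one reassembled from a
 * surrogate pair, can be passed in and returned. */
int     new_IsAlpha(py_ucs4 ch);
py_ucs4 new_ToLowercase(py_ucs4 ch);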
On 2009-02-03 13:39, Amaury Forgeot d'Arc wrote:
-1. That would cause major breakage in the C API and is not in line with the design of the existing Py_UNICODE APIs. Users who are interested in UCS4 builds should simply use UCS4 builds.
--with-wctype-functions was scheduled for removal many releases ago. It's not useful in any way, and causes compatibility problems. |
amaury> Since r56395, ord() and chr() accept and return surrogate pairs
Note: my examples are made with Python 2.x.
It would be nice to get the same behaviour in Python 2.x and 3.x to help porting. unichr() (in Python 2.x) documentation is correct, but I would appreciate it if unichr() and ord() also handled non-BMP code points on narrow builds.
Why? Using surrogates, you can use 16-bit Py_UNICODE units to store non-BMP characters.
I can open a new issue if you agree that we can change unichr() / ord(). |
On 2009-02-03 14:14, STINNER Victor wrote:
This is not possible for unichr() in Python 2.x, since applications rely on the current behaviour. Changing ord() would be possible, and in Python 2.x it is easier.
|
Not if you recompile. I don't see how this breaks the API at the C level.
Py_UNICODE is still used as the allocation unit for unicode strings. To get correct results, we need a way to access the whole unicode scalar value, not just the individual UTF-16 code units. My motivation for the change is this post: |
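A self-contained sketch (not CPython source) of the standard UTF-16 recombination that yields the whole code point:

#include <stdio.h>
#include <stdint.h>

/* Combine a UTF-16 surrogate pair into the code point it encodes.
 * high must be in 0xD800..0xDBFF and low in 0xDC00..0xDFFF. */
static uint32_t join_surrogates(uint16_t high, uint16_t low)
{
    return 0x10000u + (((uint32_t)(high - 0xD800u)) << 10) + (low - 0xDC00u);
}

int main(void)
{
    /* On a narrow build, chr(0x10000) is stored as the pair D800 DC00
     * and chr(0x11000) as D804 DC00. */
    printf("U+%05X\n", (unsigned)join_surrogates(0xD800, 0xDC00)); /* U+10000 */
    printf("U+%05X\n", (unsigned)join_surrogates(0xD804, 0xDC00)); /* U+11000 */
    return 0;
}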
lemburg> This is not possible for unichr() in Python 2.x, since applications rely on the current behaviour.
Oh, ok.
lemburg> Changing ord() would be possible, and in Python 2.x it is easier.
ord() of Python3 (narrow build) rejects surrogate characters:
>>> chr(0x10000)
'\U00010000'
>>> len(chr(0x10000))
2
>>> ord(0x10000)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ord() expected string of length 1, but int found
It looks like narrow builds with surrogates have some more problems...
Test with U+10000: "LINEAR B SYLLABLE B008 A", category: Letter, Other.
Correct result (Python 2.5, wide build):
$ python
Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
>>> unichr(0x10000)
u'\U00010000'
>>> unichr(0x10000).isalpha()
True
Error in Python3 (narrow build):
marge$ ./python
Python 3.1a0 (py3k:69105M, Feb 3 2009, 15:04:35)
>>> chr(0x10000).isalpha()
False
>>> list(chr(0x10000))
['\ud800', '\udc00']
>>> chr(0xd800).isalpha()
False
>>> chr(0xdc00).isalpha()
False
Unicode ranges, all in the category "Other, Surrogate":
U+D800..U+DB7F: High Surrogates
U+DB80..U+DBFF: High Private Use Surrogates
U+DC00..U+DFFF: Low Surrogates
|
On 2009-02-03 14:50, Amaury Forgeot d'Arc wrote:
Well, then try to look at such a change from a C extension author's point of view. They'd have to change all their function calls and routines to work with the new types. Supporting both the old API and the new one would add considerable complexity. Please remember that the public Python C API is not only meant for the interpreter core itself. Python has a long history of providing very stable APIs, both in C and in Python. FWIW: the last major change in the C API (the change to Py_ssize_t) required a lot of adjustment from extension authors. That said, we can of course provide additional UCS4 APIs for code that needs full code-point coverage.
I must be missing some detail, but what does the Unicode database have to do with the public C API?
There are certainly other ways to make Python deal with surrogates. |
haypo> ord() of Python3 (narrow build) rejects surrogate characters:
haypo> '\U00010000'
haypo> >>> len(chr(0x10000))
haypo> 2
haypo> >>> ord(0x10000)
haypo> TypeError: ord() expected string of length 1, but int found
ord() works fine on Py3, you probably meant to do
>>> ord('\U00010000')
65536
or
>>> ord(chr(0x10000))
65536
In Py3 it is also stated that it accepts surrogate pairs (help(ord)).
Py2 instead doesn't support them:
>>> ord(u'\U00010000')
TypeError: ord() expected a character, but string of length 2 found |
Ah, now I understand your concerns. My suggestion is to change only the 20 functions in unicodectype.c. I attach a patch so we can argue on concrete code (tests are missing). Another effect of the patch: unicodedata.numeric('\N{AEGEAN NUMBER TWO}') can return 2.0. The str.isalpha() (and other) methods did not change: they still split the surrogate pairs. |
Surrogates aren't optional features of UTF-16, we really need to get them right. We might keep the old public API for compatibility, but it should be clearly marked as broken for non-BMP scalar values. I don't see a problem with changing 2.x. The existing behaviour is broken. |
Adam Olsen wrote:
We use UCS2 on narrow Python builds, not UTF-16.
That has always been the case. UCS2 doesn't support surrogates.
However, we have been slowly moving into the direction of making the UCS2 storage appear like UTF-16 to the Python programmer.
This process is not yet complete and will likely never complete since it must still be possible to create things like lone surrogates for processing purposes, so care has to be taken when using non-BMP code points on narrow builds.
No, but changing the APIs from 16-bit integers to 32-bit integers does change the binary interface. Also, the Unicode type database itself uses Py_UNICODE, so it would have to be extended as well. So if we want to support accessing non-BMP type information, we could add a new set of Py_UCS4 APIs alongside the existing ones. With such an approach we'd not break the binary API and would still gain the new functionality. Would someone be willing to work on this? |
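A minimal sketch of such a two-level scheme, with a 16-bit entry point forwarding to a wide one; the names and the placeholder body are hypothetical, not actual CPython API:

#include <stdint.h>

/* Hypothetical wide API: classification over the full code-point space.
 * (Placeholder body; the real thing would consult the Unicode type database.) */
static int is_alpha_ucs4(uint32_t ch)
{
    return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
}

/* Existing-style 16-bit entry point kept for binary compatibility: it just
 * widens its argument and delegates, so already-compiled extensions keep
 * working while new code can pass any code point to the UCS4 variant. */
static int is_alpha_ucs2(uint16_t ch)
{
    return is_alpha_ucs4((uint32_t)ch);
}

int main(void)
{
    return !is_alpha_ucs2('A');   /* exit code 0 if the wrapper delegates correctly */
}

On a wide build the wrapper adds nothing, since Py_UNICODE is already 32 bits there.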
Is it acceptable between 3.1 and 3.2 for example? ISTM that other binary-incompatible changes have already been made between such releases.
Where, please? In unicodedata.c, getuchar and _getrecord_ex use Py_UCS4. |
Amaury Forgeot d'Arc wrote:
With the proposed approach, we'll keep binary compatibility, so this is not an issue. Note: changes to the binary interface can be done in minor releases, but extensions then have to be recompiled.
The change affects the Unicode type database which is implemented in unicodectype.c. |
This is the case with this patch: today all these functions
Are you referring to the _PyUnicode_TypeRecord structure? |
UCS2 died long ago, is there any reason why we keep using an UCS2 that "appears" like UTF-16 instead of real UTF-16?
I don't exactly know all the details of the current implementation. What are the use cases for processing lone surrogates? Wouldn't it be better to use real UTF-16? |
Amaury Forgeot d'Arc wrote:
True, but we can do better. For narrow builds, the API currently only works on the 16-bit units. For wide builds, we don't need to change anything.
I haven't checked, but it's certainly possible to have a code point whose case mapping lies outside the BMP. |
This is off-topic for the tracker item, but I'll reply anyway:
Ezio Melotti wrote:
>
> Ezio Melotti <ezio.melotti@gmail.com> added the comment:
>
>>> We might keep the old public API for compatibility, but it should be
>>> clearly marked as broken for non-BMP scalar values.
>
>> That has always been the case. UCS2 doesn't support surrogates.
>
>> However, we have been slowly moving into the direction of making
>> the UCS2 storage appear like UTF-16 to the Python programmer.
>
> UCS2 died long ago, is there any reason why we keep using an UCS2 that
> "appears" like UTF-16 instead of real UTF-16? UCS2 is how we store Unicode in Python for narrow builds internally. However, on narrow builds such as the Windows builds, you will sometimes
No, because Python is meant to be used for working on all Unicode code points, including lone surrogates. |
Why have two names for the same function? It's Python 3, after all.
OK, here is a new patch. Even if this does not happen with unicodedata today, a future version of the database could add such mappings. |
Amaury Forgeot d'Arc wrote:
It's not the same function... the UCS2 version would take a Py_UNICODE argument, the UCS4 version a Py_UCS4 one. I don't understand the comment about Python 3.x. FWIW, we're no more free to break the C API in Python 3.x than we were in 2.x.
There are generally two options for API changes within a release series: add a second set of APIs and keep the old ones, or change the existing APIs and require extensions to be recompiled.
The second option was used when transitioning from 2.4 to 2.5 due to the Py_ssize_t changes. We could do the same for 2.7/3.2, but if it's just needed for this one set of functions, it may not be worth the trouble.
Sorry, but this doesn't work: the functions have to return Py_UNICODE values. Otherwise, you'd get completely wrong values in code downcasting the result to Py_UNICODE. Another good reason to use two sets of APIs. The new set could then use Py_UCS4 for both input and output. |
On Mon, Oct 5, 2009 at 03:03, Marc-Andre Lemburg <report@bugs.python.org> wrote:
Balderdash. We expose UTF-16 code units, not UCS-2. Guido has made this quite clear.
UTF-16 was designed as an easy transition from UCS-2. Indeed, if your code only does searches or joins existing strings then it will Just Work; declare it UTF-16 and you are done. We have a lot more work to do than that (as in this bug report), and we can't reasonably prevent the user from splitting surrogate pairs via poor code, but a 95% solution doesn't mean we suddenly revert all the way back to UCS-2.
If the intent really was to use UCS-2 then a correctly functioning UTF-16 codec would join a surrogate pair into a single scalar value, then raise an error because it's outside the range representable in UCS-2. That's not very helpful though; obviously, it's much better to use UTF-16 internally.
"The alternative (no matter what the configure flag is called) is UTF-16, not UCS-2 though: there is support for surrogate pairs in various places, including the \U escape and the UTF-8 codec."
http://mail.python.org/pipermail/python-dev/2008-July/080892.html
"If you find places where the Python core or standard library is doing Unicode processing that would break when surrogates are present you should file a bug. However this does not mean that every bit of code that slices a string at an arbitrary point (and hence risks slicing in the middle of a surrogate) is incorrect -- it all depends on what is done next with the slice."
http://mail.python.org/pipermail/python-dev/2008-July/080900.html |
Adam Olsen wrote:
>
> Adam Olsen <rhamph@gmail.com> added the comment:
>
> On Mon, Oct 5, 2009 at 03:03, Marc-Andre Lemburg <report@bugs.python.org> wrote:
>> We use UCS2 on narrow Python builds, not UTF-16.
>>
>>> We might keep the old public API for compatibility, but it should be
>>> clearly marked as broken for non-BMP scalar values.
>>
>> That has always been the case. UCS2 doesn't support surrogates.
>>
>> However, we have been slowly moving into the direction of making
>> the UCS2 storage appear like UTF-16 to the Python programmer.
>>
>> This process is not yet complete and will likely never complete
> since it must still be possible to create things like lone
>> surrogates for processing purposes, so care has to be taken
>> when using non-BMP code points on narrow builds.
>
> Balderdash. We expose UTF-16 code units, not UCS-2. Guido has made
> this quite clear.
>
> UTF-16 was designed as an easy transition from UCS-2. Indeed, if your
> code only does searches or joins existing strings then it will Just
> Work; declare it UTF-16 and you are done. We have a lot more work to
> do than that (as in this bug report), and we can't reasonably prevent
> the user from splitting surrogate pairs via poor code, but a 95%
> solution doesn't mean we suddenly revert all the way back to UCS-2.
>
> If the intent really was to use UCS-2 then a correctly functioning
> UTF-16 codec would join a surrogate pair into a single scalar value,
> then raise an error because it's outside the range representable in
> UCS-2. That's not very helpful though; obviously, it's much better to
> use UTF-16 internally.
>
> "The alternative (no matter what the configure flag is called) is
> UTF-16, not UCS-2 though: there is support for surrogate pairs in
> various places, including the \U escape and the UTF-8 codec."
> http://mail.python.org/pipermail/python-dev/2008-July/080892.html
>
> "If you find places where the Python core or standard library is doing
> Unicode processing that would break when surrogates are present you
> should file a bug. However this does not mean that every bit of code
> that slices a string at an arbitrary point (and hence risks slicing in
> the middle of a surrogate) is incorrect -- it all depends on what is
> done next with the slice."
> http://mail.python.org/pipermail/python-dev/2008-July/080900.html
All this is just nitpicking, really. UCS2 is a character set, UTF-16 is an encoding of the full Unicode code space.
It so happens that when the Unicode consortium realized that 16 bits were not enough, they reserved the surrogate ranges so that non-BMP code points could be represented as pairs of 16-bit units.
The conversion of these surrogate pairs to UCS4 code points is done in some places, but not everywhere.
If we were to implement Unicode using UTF-16 as storage format, lone surrogates would not be allowed at all.
PEP-100 really says it all:
""" Note that I wrote the PEP and worked on the implementation at a time
But all that is off-topic for this ticket, so please let's just drop it here. |
On Mon, Oct 5, 2009 at 12:10, Marc-Andre Lemburg <report@bugs.python.org> wrote:
UCS is a character set, for most purposes synonymous with the Unicode character set; UCS-2 and UTF-16 are encoding forms of it.
No. Internal usage may become temporarily ill-formed, but this is a long way from saying the storage format is UCS-2. Not that I wouldn't *prefer* a system that wouldn't store lone surrogates at all, but that's not what we have.
I think you hit the nail on the head there. 10 years ago, unicode was a much simpler affair and UCS-2 was a reasonable choice.
It needs to be discussed somewhere, but it's a distraction from fixing this particular issue. |
So the discussion is now on 2 points:
"Naive" code that simply walks the Py_UNICODE* buffer will have
|
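A self-contained sketch (not CPython code) of the extra step such buffer-walking code needs on a narrow build; the surrogate ranges are those defined by Unicode:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Walk a buffer of 16-bit units (like Py_UNICODE on a narrow build) and
 * report full code points, combining well-formed surrogate pairs.  A lone
 * surrogate is passed through unchanged. */
static void walk(const uint16_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint32_t cp = buf[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < len &&
            buf[i + 1] >= 0xDC00 && buf[i + 1] <= 0xDFFF) {
            cp = 0x10000u + ((cp - 0xD800u) << 10) + (buf[i + 1] - 0xDC00u);
            i++;  /* the low surrogate was consumed as well */
        }
        printf("U+%05X\n", (unsigned)cp);
    }
}

int main(void)
{
    /* "\U00010000" followed by "A", as stored on a narrow build. */
    const uint16_t s[] = { 0xD800, 0xDC00, 0x0041 };
    walk(s, sizeof s / sizeof s[0]);   /* prints U+10000 then U+00041 */
    return 0;
}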
It's not as easy as that. The functions for case conversion are used in a way that assumes they return Py_UNICODE values. What we can do is change the input parameter to Py_UCS4, but not the return type. However, this change would not really help anyone if there are no case mappings between BMP and non-BMP code points. It appears to be better to just leave the case mapping APIs unchanged. The situation is different for the various Py_UNICODE_IS*() APIs: for those, widening the input parameter to Py_UCS4 is enough. |
Unfortunately, there is no such warning, or the initial problem we are trying to solve would have been caught long ago. gcc has a -Wconversion flag (which I tried today on python), but it is far too noisy to be useful. But the most important thing is that implicit truncation on UCS2 builds is exactly what causes the initial problem. |
I don't see the point in changing the various conversion APIs in the unicode database to return Py_UCS4 when there are no conversions that map code points between BMP and non-BMP. This needlessly increases the type database size without real benefit. In order to solve the problem in question (unicode_repr() failing), we should change the various property checking APIs to accept Py_UCS4 input data. For that to work properly we'll have to either make sure that extensions get recompiled if they use these changed APIs, or we provide an additional set of UCS2 APIs that extend the Py_UNICODE input value to a Py_UCS4 value before calling the underlying Py_UCS4 API. |
For consistency: if Py_UNICODE_ISPRINTABLE is changed to take Py_UCS4, Py_UNICODE_TOLOWER should also take Py_UCS4, and must return the same type.
Yes this increases the type database: there are 300 more "case" statements in _PyUnicode_ToNumeric(), and the PyUnicode_TypeRecords array needs 1068 more bytes.
Extensions that use these changed APIs need to be recompiled, or they won't load: existing modules link with symbols like _PyUnicodeUCS2_IsPrintable, when the future interpreter will define _PyUnicode_IsPrintable. |
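A simplified sketch of that build-specific symbol renaming; the real macros live in unicodeobject.h and cover many more names, so this is an illustration under assumptions rather than the actual header:

/* Exported Unicode symbols carry a suffix encoding the build's unit width,
 * so an extension compiled against one layout fails to load against the
 * other instead of silently misbehaving. */
#ifdef Py_UNICODE_WIDE
#  define _PyUnicode_IsPrintable _PyUnicodeUCS4_IsPrintable
#else
#  define _PyUnicode_IsPrintable _PyUnicodeUCS2_IsPrintable
#endif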
Amaury Forgeot d'Arc wrote:
The problem in question is already solved by just changing the property checking APIs.
Hmm, that's a good point. OK, you got me convinced: let's go for it then. |
Now I wonder whether it's reasonable to consider this character printable. |
Given that '\U00010000'.isprintable() returns True, I would say yes. If someone needs to print this char and has an appropriate font to do it, I don't see why it shouldn't work. |
Ezio Melotti wrote:
Note that Python3 will send printable code points as-is to the console, which may or may not be able to display them. The "printable" property is a Python invention, not a Unicode property, so we are free to define it as we see fit. In recent years the situation has just started clearing up as fonts cover more of the Unicode range.
The only font set I know of that tries to go beyond BMP is this
Most other fonts just cover small parts of the assigned Unicode ranges.
I suppose that in a few years we'll see OSes and GUIs mix and match the available fonts to get better coverage. Given the font situation, I don't think we should have repr() output non-BMP code points as-is. |
[This should probably be discussed on python-dev or in another issue, so feel free to move the conversation there.]
According to the current implementation, """all the characters except those characters defined in the Unicode character database as following categories are considered printable""": Cc (Other, Control), Cf (Other, Format), Cs (Other, Surrogate), Co (Other, Private Use), Cn (Other, Not Assigned), Zl (Separator, Line), Zp (Separator, Paragraph) and Zs (Separator, Space) other than the ASCII space.
We could also arbitrarily exclude all the non-BMP chars, but that shouldn't be based on the availability of the fonts IMHO.
If the concern is about the usefulness of repr() in the console, note that on the Windows terminal trying to display most of the characters results in an error (see bpo-5110), and that makes repr() barely usable. |
I suggest we go ahead and apply this patch; at least it correctly selects "printable" characters, whatever that means. |
Ezio Melotti wrote:
Without fonts, you can't print the code points, even if the Unicode database says they are printable.
I also find the use of Zl, Zp and Zs in the definition somewhat questionable; compare e.g. http://www.cplusplus.com/reference/clibrary/cctype/isprint/ :
"A printable character is any character that is not a control character."
That's a different problem, but indeed also related to the limitations of the Windows console. I was never a fan of the Unicode repr() change to begin with. |
Amaury Forgeot d'Arc wrote:
+1 |
Amaury, before applying the patch consider replacing the tab characters before the comments with spaces. The use of tabs is discouraged.
Marc-Andre Lemburg wrote:
I still think that bpo-5110 should be fixed (there's also a patch to fix the issue on Windows). If you agree please comment there and/or reopen that issue. |
Ezio Melotti wrote:
Let's discuss this on bpo-9198. |
|
A new patch, generated on top of r82662 |
Amaury Forgeot d'Arc wrote:
Could you explain what this bit is about?
@@ -349,7 +313,7 @@
-#if defined(HAVE_USABLE_WCHAR_T) && defined(WANT_WCTYPE_FUNCTIONS)
 #include <wctype.h> |
A new patch that doesn't remove an important check and avoids a crash when the C macro is called with a huge number. Thanks Ezio. |
On Windows at least, HAVE_USABLE_WCHAR_T is defined; this means that Py_UNICODE can be converted to wchar_t. But now that Py_UNICODE_ISSPACE() takes Py_UCS4, its argument cannot simply be converted to wchar_t anymore. Now that the unicode database functions claim to use Py_UCS4, the functions of wctype.h are usable only if they also support Py_UCS4. OTOH the symbol WANT_WCTYPE_FUNCTIONS is defined only if ./configure is called with --with-wctype-functions, and I don't expect that to be common. |
Amaury Forgeot d'Arc wrote:
Right, but you still have to check whether wchar_t is usable, don't you?
The line should probably read:
#if defined(WANT_WCTYPE_FUNCTIONS) && defined(HAVE_USABLE_WCHAR_T) && defined(Py_UNICODE_WIDE)
True. The support for the wctype functions should have been removed long ago. The comment was true before the Python type tables were changed. |
Amaury Forgeot d'Arc wrote:
Could you please be more specific on what you changed? At least visually, there don't appear to be any differences between the two patches. |
The 'if' in 'gettyperecord'. (I would also rewrite that as "if (code > 0x10FFFF)", it looks more readable to me.) The patch seems OK to me. In the NEWS message 'python' should be capitalized and I would also mention .isprintable() and possibly other functions that are affected directly -- mentioning repr() is good too, but it's only affected indirectly. |
Ezio Melotti wrote:
Ah, good catch!
|
str.isprintable() &co are not changed by this patch, because they enumerate Py_UNICODE units and do not join surrogates. See bpo-9200 |
In this 6th patch, the wctype part was changed as suggested:
-#if defined(HAVE_USABLE_WCHAR_T) && defined(WANT_WCTYPE_FUNCTIONS)
+#if defined(WANT_WCTYPE_FUNCTIONS) && defined(HAVE_USABLE_WCHAR_T) && defined(Py_UNICODE_WIDE) |
Committed with r84177. |