UTF-16 and UTF-32 codecs should reject (lone) surrogates #57101
Comments
From Chapter 03 of the Unicode Standard 6.0, D91 (the definition of UTF-16):

On Python 3, the utf-16 codec can encode all the code points from U+0000 to U+10FFFF (including lone surrogates), but it is not able to decode lone surrogates (I'm not sure if this is by design or if it just fails because it expects another, missing, surrogate).

From Chapter 03 of the Unicode Standard 6.0, D90 (the definition of UTF-32):

On Python 3, the utf-32 codec can encode and decode all the code points from U+0000 to U+10FFFF (including surrogates).

I think that both codecs should reject lone surrogates, at least in strict error mode.

Note that this has already been reported in bpo-3672, but eventually only the utf-8 codec was fixed.
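(For illustration only, not part of the original report: a quick check on a Python 3 where this issue's fix is in place, i.e. 3.4+; the exact error messages may differ slightly, and on the 3.2-era interpreter described above the encode calls would have succeeded silently.)

>>> lone = '\ud800'              # a lone (unpaired) surrogate code point
>>> lone.encode('utf-16-le')     # strict mode rejects it after the fix
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-16-le' codec can't encode character '\ud800' in position 0: surrogates not allowed
>>> lone.encode('utf-32-le')     # same for utf-32
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-32-le' codec can't encode character '\ud800' in position 0: surrogates not allowed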
Ezio Melotti wrote:
> All UTF codecs should reject lone surrogates in strict error mode [...]
Python 3.3 has a strange behaviour:

>>> '\uDBFF\uDFFF'.encode('utf-16-le').decode('utf-16-le')
'\U0010ffff'
>>> '\U0010ffff'.encode('utf-16-le').decode('utf-16-le')
'\U0010ffff'

I would expect text.decode(encoding).encode(encoding) == text, or an encode or decode error. So I agree that the encoder should reject lone surrogates.
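(Added context, not part of the original comment: in Python 3 the literal '\uDBFF\uDFFF' is a two-character string of surrogate code points, distinct from the one-character '\U0010ffff'. The pre-fix encoder passed the two surrogates straight through as UTF-16 code units, which the decoder then interpreted as a valid surrogate pair; that is why the round-trip above changed the string.)

>>> s = '\uDBFF\uDFFF'
>>> len(s), [hex(ord(c)) for c in s]
(2, ['0xdbff', '0xdfff'])
>>> len('\U0010ffff')
1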
Patch rejecting surrogates in UTF-16 and UTF-32 encoders. I don't think that Python 2.7 and 3.2 should be changed in a minor version.
Hum, my patch doesn't call the error handler.
Attached patch does the following beyond what the patch from haypo does: …

The following items are on my TODO list, although this patch doesn't depend on any of these and can be reviewed and landed separately: …

Should we really reject lone surrogates for UTF-7? There's a test in test_codecs.py that expects "\udc80" to be encoded as b"+3IA-". Given that UTF-7 is not really part of the Unicode Standard and it is more like a "data encoding" than a "text encoding" to me, I am not sure it's a good idea.

For 'replace', the patch now emits b"\x00?" instead of b"?" so that the UTF-16 stream doesn't get corrupted. It is not "usual" and does not match …
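(Illustrative session, not from the original patch discussion; it runs on a current Python 3, where UTF-7 was left unchanged by this issue and 'replace' emits a full code unit:)

>>> '\udc80'.encode('utf-7')      # UTF-7 still accepts the lone surrogate
b'+3IA-'
>>> 'a\udc80b'.encode('utf-16-be', 'replace')
b'\x00a\x00?\x00b'                # the replacement '?' occupies a whole 16-bit code unit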
Thanks for the patch!
This should probably be done on a separate patch that will be applied to 3.2/3.3 (assuming that it can go to 3.2). Rejecting surrogates will go in 3.3 only. (Note that a lot of Unicode-related code changed between 3.2 and 3.3.)
No, I meant only UTF-8/16/32; UTF-7 is fine as is.
Unfortunately this took longer than I thought but here comes the patch.
This turns out to be just a two-liner so I fixed that on the way. I can create a separate patch with a separate test for 3.2 (certainly doable) and even for 3.3, but since the test is now part of test_lone_surrogates, I feel less willing to do that for 3.3.

You might notice the codec naming inconsistency (utf-16-be and utf16be for encoding and decoding respectively). I have filed bpo-13913 for this. Also, the strcmps are quite crappy. I am working on bpo-13916 (disallow the "surrogatepass" handler for non utf-* encodings); as long as we have that we can examine individual characters instead...

In this patch, the "encoding" attribute of UnicodeDecodeError is changed to return utf16(be|le) for utf-16. This info is necessary for "surrogatepass" to work, although admittedly this is rather ugly. Any good idea? A new attribute for UnicodeDecodeError/UnicodeEncodeError might be helpful, but utf-16/32 are fairly uncommon encodings anyway and we should not add more burden to, say, utf-8.
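(A short hedged illustration of the behaviour this patch is after, as it works on Python 3.4+: the surrogatepass handler round-trips lone surrogates through utf-16 while strict mode rejects them.)

>>> '\udc80'.encode('utf-16-le', 'surrogatepass')
b'\x80\xdc'
>>> b'\x80\xdc'.decode('utf-16-le', 'surrogatepass')
'\udc80'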
Good to know.
Here is a patch which combines both Kang-Hao's patches, synchronized with tip, fixed and optimized. Unfortunately, even optimized, this patch slows down encoding/decoding of some data. Here are some benchmark results (benchmarking tools are here: https://bitbucket.org/storchaka/cpython-stuff/src/default/bench):

3.3           3.4           3.4 (patched)
969 (+12%)    978 (+11%)    1087           encode utf-16le  'A'*10000
531 (-7%)     545 (-9%)     494            encode utf-32le  'A'*10000
682 (+5%)     679 (+5%)     715            decode utf-16le  'A'*10000
103 (+272%)   374 (+2%)     383            decode utf-32le  'A'*10000

Performance of utf-16 decoding is not changed. The utf-16 encoder is 2.5 times slower for UCS2 data (it was just a memcpy), but still 3+ times faster than 2.7 and 3.2 (bpo-15026). Due to an additional optimization it is now even slightly faster for some other data. There is a patch for speeding up UTF-32 encoding (bpo-15027); it should help compensate for its performance degradation. The UTF-32 decoder is already 3-4 times faster than in 3.3 (bpo-14625). I don't see a performance objection against this patch.
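(The numbers above come from the author's own benchmarking tool linked in the comment. As a rough, hypothetical way to reproduce this kind of micro-benchmark with only the stdlib, one could do something like:)

import timeit

data = 'A' * 10000
for codec in ('utf-16-le', 'utf-32-le'):
    raw = data.encode(codec)
    enc = timeit.timeit(lambda: data.encode(codec), number=10000)
    dec = timeit.timeit(lambda: raw.decode(codec), number=10000)
    # enc/dec are total seconds for 10000 calls; report microseconds per call
    print(f'{codec}: encode {enc * 100:.2f} us/call, decode {dec * 100:.2f} us/call')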
You should be able to squeeze out some extra cycles by using #if STRINGLIB_MAX_CHAR >= 0xd800.
Oh, I was blind. Thank you Marc-Andre. Here is a corrected patch. Unfortunately it is 1.4-1.5 times slower at UTF-16 encoding of UCS2 strings than the previous, wrong patch.
On 02.09.2013 18:56, Serhiy Storchaka wrote:
I think it would be faster to do this in two steps: …

Part 1 can be further optimized by adding a simple range check.
No, it isn't faster. I tested this variant; it is 1.5x slower. And a simple range check actually is slower.
Could you please make a review, Ezio?
Updated whatsnew and Misc/ files.
utf-16 isn't that widely used, so it's probably fine if it becomes a bit slower.
On 08.10.2013 10:46, Antoine Pitrou wrote:
> utf-16 isn't that widely used, so it's probably fine if it becomes
> a bit slower.

It's the default encoding for Unicode text files and APIs on Windows, so I'd say it *is* widely used :-)
http://en.wikipedia.org/wiki/UTF-16#Use_in_major_operating_systems_and_environments
I've never seen any UTF-16 text files. Do you have other data? APIs are irrelevant. You only pass very small strings to them (e.g. …)
On 08.10.2013 11:03, Antoine Pitrou wrote:
>>> utf-16 isn't that widely used, so it's probably fine if it becomes
>>> a bit slower.
>>
>> It's the default encoding for Unicode text files and APIs on Windows,
>> so I'd say it *is* widely used :-)
>
> I've never seen any UTF-16 text files. Do you have other data?

See the link I posted. MS Notepad and MS Office save Unicode text files in UTF-16-LE:
http://msdn.microsoft.com/en-us/library/windows/desktop/dd374101%28v=vs.85%29.aspx
This is simply due to the fact that MS introduced Unicode plain text …
You are forgetting that wchar_t is UTF-16 on Windows, so UTF-16 …
http://msdn.microsoft.com/en-us/library/windows/desktop/dd374089%28v=vs.85%29.aspx
I'd be curious to know if people actually edit *text files* using …

Still, unless those APIs get passed rather large strings, the performance shouldn't matter much.
The UTF-16 codec is still fast enough. Let's first make it correct and then try to optimize it. I have an idea how to restore the 3.3 performance (if it's worth it; the code is already complicated enough). The conversion to/from wchar_t* uses different code.
On 08.10.2013 11:33, Antoine Pitrou wrote:
> The question is not so much which program they use for editing.

Antoine, I'm just pointing out that your statement that UTF-16 isn't widely used doesn't hold on Windows. The APIs on those platforms are used from Python (the interpreter and Python applications), so the UTF-8, UTF-16 and UTF-32 codecs need to be as fast as possible. The real question is: can the UTF-16/32 codecs be made fast while still rejecting lone surrogates?
I repeat myself: even with the patch, the UTF-16 codec is faster than the UTF-8 codec (except for ASCII-only data). This is the fastest Unicode codec in Python (perhaps UTF-32 can be made faster, but this is another issue).

Yes, they can. But let's defer this to other issues.
"As fast as possible" is a platonic dream. |
On 08.10.2013 11:42, Serhiy Storchaka wrote:
That's a good plan :-)
On 08.10.2013 12:30, Antoine Pitrou wrote:
No, they need to be as fast as possible, without sacrificing correctness. This has always been our guideline for codec implementations.
I don't think that performance on a microbenchmark is the right question. Please open a new performance issue after fixing this one if you have ideas for further optimizations. I didn't read the patch yet, but the strict, surrogatepass and surrogateescape error handlers all need to behave correctly.
Here is my idea: http://permalink.gmane.org/gmane.comp.python.ideas/23521. I see that the discussion about how fast the UTF-16 codec should be is already larger than the discussion about the patches. Could you please review this not-so-simple patch instead? One more thing I need help with is writing a note for the "Porting to Python 3.4" section in Doc/whatsnew/3.4.rst.
Marc-Andre: please don't confuse "use in major operating systems" with "major use in operating systems". I agree with Antoine that UTF-16 isn't widely used on Windows, despite Notepad and Office supporting it. Most users on Windows using Notepad continue to use the ANSI code page, and most users of Word use Word files (instead of plain text). Also, wchar_t on Windows isn't *really* UTF-16. Many APIs support lone surrogates just fine; they really are UCS-2 instead (e.g. the file system APIs). Only starting with Vista does MultiByteToWideChar complain about lone surrogates.
I tested utf_16_32_surrogates_4.patch: surrogateescape does not work as expected with the decoder.

>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'ignore')
'[]'
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'replace')
'[�]'
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'surrogateescape')
'[\udc80\udcdc\uffff'

=> I expected '[\udc80\udcdc]'.

With the encoder, surrogateescape does not work either:

>>> '[\uDC80]'.encode('utf-16-le', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-le' codec can't encode character '\udc80' in position 1: surrogates not allowed

Using PEP 383, I expect that data.decode(encoding, 'surrogateescape') never fails, and that data.decode(encoding, 'surrogateescape').encode(encoding, 'surrogateescape') gives back data.

--

With UTF-16, there is a corner case:

>>> b'[\x00\x00'.decode('utf-16-le', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/haypo/prog/python/default/Lib/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x00 in position 2: truncated data
>>> b'[\x00\x80'.decode('utf-16-le', 'surrogateescape')
'[\udc80'

The incomplete sequence b'\x00' raises a decode error, whereas b'\x80' does not. Should we extend PEP 383 to bytes in the range [0; 127]? Or should we keep this behaviour? Sorry, this question is unrelated to this issue.
I did a first review of your code on Rietveld.
Updated patch addresses Victor's comments on Rietveld. Thank you Victor. The "surrogatepass" error handler now works with different spellings of encodings ("utf_32le", "UTF-32-LE", etc).
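(Hedged illustration, on Python 3.4+, of the spelling normalization mentioned above; both spellings reach the same codec, so surrogatepass works for either:)

>>> '\udc80'.encode('UTF-32-LE', 'surrogatepass')
b'\x80\xdc\x00\x00'
>>> '\udc80'.encode('utf_32_le', 'surrogatepass')
b'\x80\xdc\x00\x00'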
Yes, surrogateescape doesn't work with ASCII-incompatible encodings, and it can't. For one, it can't represent the result of decoding b'\x00\xd8' from utf-16-le or b'ABCD' from utf-32*. This problem is worth a separate issue (or even a PEP) and a discussion on python-dev.
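(For contrast, a small example of how PEP 383 surrogateescape behaves with an ASCII-compatible codec such as utf-8, where the lossless round-trip is possible:)

>>> raw = b'ab\xff'
>>> s = raw.decode('utf-8', 'surrogateescape')
>>> s
'ab\udcff'
>>> s.encode('utf-8', 'surrogateescape') == raw
True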
Changed the documentation as discussed with Ezio on IRC. Ezio, do you want to commit this patch? Feel free to reword the documentation if you feel it could be better.
New changeset 0d9624f2ff43 by Serhiy Storchaka in branch 'default':
Ezio has approved the patch and I have committed it. Thank you Victor and Kang-Hao for your patches. Thanks all for the reviews.
New changeset 130597102dac by Serhiy Storchaka in branch 'default':
Thanks Ezio and Serhiy for having fixed the UTF-16 and UTF-32 codecs!
I don't understand why encoding with …
There are two causes: …
Serhiy, I understand the first reason, but https://docs.python.org/3/library/codecs.html says …

And about the second one, could you explain a bit more? I mean, I don't know how to interpret it. You say b'\xD8\x00' are invalid ASCII bytes, but of these two only 0xD8 is invalid. Also, we are talking about encoding here, str -> bytes, so who cares whether the resulting bytes are ASCII-compatible or not?