Incremental encoders of CJK codecs reset the codec at each call to encode() #56309
Comments
Stateful CJK codecs reset the codec at each call to encode() producing a valid but overlong output:

>>> import codecs
>>> encoder = codecs.getincrementalencoder('hz')()
>>> encoder.encode('\u804a') + encoder.encode('\u804a')
b'~{AD~}~{AD~}'
>>> '\u804a\u804a'.encode('hz')
b'~{ADAD~}'

Multibyte encodings: HZ and all encodings of the ISO 2022 family (e.g. iso-2022-jp).

Attached patch fixes this issue. I don't like how I added the tests, these tests may be moved somewhere else, but HZ codec doesn't have tests today (I opened issue bpo-12057 for that), and ISO 2022 codecs don't have specific tests (test_multibytecodec is "Unit test for multibytecodec itself"). We should maybe also add tests specific to ISO 2022 first?

I hesitate to reset the codec on .encode(text, final=True), but UTF-8-SIG or UTF-16 don't reset the codec if final=True. io.TextIOWrapper only calls encoder.reset() on file.seek(0). On a seek to another position, it calls encoder.setstate(0). |
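For readers who want to check their own interpreter, here is a minimal sketch of the behaviour the report asks for. It assumes a Python 3 build that already contains the fix, and that the final call passes final=True so the closing escape sequence is emitted once; the variable names are illustrative only.

```python
import codecs

text = '\u804a\u804a'
encoder = codecs.getincrementalencoder('hz')()

# Feed one character per call; only the last call passes final=True,
# which lets the encoder emit its closing sequence exactly once.
chunks = [
    encoder.encode(char, final=(index == len(text) - 1))
    for index, char in enumerate(text)
]
incremental = b''.join(chunks)

# One-shot encoding of the same text, used as the reference.
one_shot = text.encode('hz')

print(incremental == one_shot)
```

On an interpreter without the fix, the comparison would be expected to fail, because each call would open and close the GB mode separately, as in the session above.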
I think it's better to use a StringIO instance for the tests. Regarding resetting the incremental codec every time .encode() is called: Hye-Shik will have to comment. Perhaps there's an internal reason why they do this. |
For which test exactly? An encoder produces bytes; I don't see the relation with StringIO. |
STINNER Victor wrote:
Sorry, BytesIO in Python3-speak. In Python2 you'd use StringIO. |
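One way the BytesIO suggestion could look in a test is sketched below. The helper name encode_to_buffer and the iso-2022-jp sample string are assumptions made for illustration, not part of the attached patch, and the comparison presumes an interpreter that already has the fix.

```python
import codecs
import io


def encode_to_buffer(text, encoding):
    """Hypothetical helper: encode text one character at a time into a BytesIO."""
    encoder = codecs.getincrementalencoder(encoding)()
    buffer = io.BytesIO()
    for char in text:
        buffer.write(encoder.encode(char))
    # A final empty call flushes any pending closing/reset sequence.
    buffer.write(encoder.encode('', final=True))
    return buffer.getvalue()


# Sample strings are illustrative; the HZ one is taken from the report.
samples = {
    'hz': '\u804a\u804a',
    'iso-2022-jp': '\u4e16\u4e16',
}

for encoding, text in samples.items():
    same = encode_to_buffer(text, encoding) == text.encode(encoding)
    print(encoding, same)
```

Comparing against the one-shot result keeps the test independent of the exact escape sequences each codec uses.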
Does Victor Stinner have a psychic link with Armin Rigo? :) https://bitbucket.org/pypy/pypy/src/7f593e7877d4/pypy/module/_multibytecodec/app_multibytecodec.py """ The answer to Armin's theory is that they're bugs but not ones users are likely to notice? |
On Tuesday, 24 May 2011 at 18:13 +0000, Martin wrote:
Sorry, I only found one bug, and while testing HZ, not while reading the
This is a new bug that should be fixed. Armin did not report the
Ok, I will apply my fix. |
Hi :-) I did not report the two issues I found so far because I didn't finish the PyPy implementation of CJK yet, and I'm very new to anything related to codecs; additionally I didn't check Python 3.x, as I was just following the 2.7 sources. Can someone confirm that the two bugs I suspect are really bugs? And should I open another report to help tracking the 2nd bug? |
New changeset bd17396895fb by Victor Stinner in branch '3.1':
New changeset 7f2ab2f95a04 by Victor Stinner in branch '3.2':
New changeset cb9867dab15e by Victor Stinner in branch 'default': |
New changeset e789b4cda872 by Victor Stinner in branch '2.7': |
The initial problem (reset() at each call to .encode()) is fixed in Python 2.7, 3.1, 3.2 and 3.3. I opened a new issue, bpo-12171, for the second problem noticed by Armin (decreset vs encreset). |