Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incremental encoders of CJK codecs reset the codec at each call to encode() #56309

Closed
vstinner opened this issue May 17, 2011 · 10 comments
Closed
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs)

Comments

@vstinner
Copy link
Member

BPO 12100
Nosy @malemburg, @loewis, @arigo, @hyeshik, @vstinner, @bz2
Files
  • cjk_no_reset.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2011-05-25.22:01:10.524>
    created_at = <Date 2011-05-17.23:10:05.037>
    labels = ['interpreter-core']
    title = 'Incremental encoders of CJK codecs reset the codec at\teach call to encode()'
    updated_at = <Date 2012-02-01.01:25:06.893>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2012-02-01.01:25:06.893>
    actor = 'kennyluck'
    assignee = 'none'
    closed = True
    closed_date = <Date 2011-05-25.22:01:10.524>
    closer = 'vstinner'
    components = ['Interpreter Core']
    creation = <Date 2011-05-17.23:10:05.037>
    creator = 'vstinner'
    dependencies = []
    files = ['22017']
    hgrepos = []
    issue_num = 12100
    keywords = ['patch']
    message_count = 10.0
    messages = ['136194', '136534', '136727', '136730', '136777', '136780', '136788', '136791', '136792', '136910']
    nosy_count = 8.0
    nosy_names = ['lemburg', 'loewis', 'arigo', 'hyeshik.chang', 'vstinner', 'gz', 'python-dev', 'kennyluck']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue12100'
    versions = ['Python 3.3']

    @vstinner
    Copy link
    Member Author

    Stateful CJK codecs reset the codec at each call to encode() producing a valid but overlong output:

    >>> import codecs
    >>> encoder = codecs.getincrementalencoder('hz')()
    >>> encoder.encode('\u804a') + encoder.encode('\u804a')
    b'~{AD~}~{AD~}'
    >>> '\u804a\u804a'.encode('hz')
    b'~{ADAD~}'

    Multibyte encodings: HZ and all encodings of the ISO 2022 family (e.g. iso-2022-jp).

    Attached patch fixes this issue. I don't like how I added the tests, these tests may be moved somewhere else, but HZ codec doesn't have tests today (I opened issue bpo-12057 for that), and ISO 2022 codecs don't have specific tests (test_multibytecodec is "Unit test for multibytecodec itself"). We should maybe also add tests specific to ISO 2022 first?

    I hesitate to reset the codec on .encode(text, final=True), but UTF-8-SIG or UTF-16 don't reset the codec if final=True. io.TextIOWrapper only calls encoder.reset() on file.seek(0). On a seek to another position, it calls encoder.setstate(0).

    See also issues bpo-12016 and bpo-12057.

    @vstinner vstinner added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label May 17, 2011
    @malemburg
    Copy link
    Member

    I think it's better to use a StringIO instance for the tests.

    Regarding resetting the incremental codec every time .encode() is called: Hye-Shik will have to comment. Perhaps there's an internal reason why they do this.

    @vstinner
    Copy link
    Member Author

    I think it's better to use a StringIO instance for the tests.

    For which test excatly? An encoder produces bytes, I don't the relation with StringIO.

    @malemburg
    Copy link
    Member

    STINNER Victor wrote:

    STINNER Victor <victor.stinner@haypocalc.com> added the comment:

    > I think it's better to use a StringIO instance for the tests.

    For which test excatly? An encoder produces bytes, I don't the relation with StringIO.

    Sorry, BytesIO in Python3-speak. In Python2 you'd use StringIO.

    @malemburg malemburg changed the title Incremental encoders of CJK codecs reset the codec at each call to encode() Incremental encoders of CJK codecs reset the codec at each call to encode() May 24, 2011
    @bz2
    Copy link
    Mannequin

    bz2 mannequin commented May 24, 2011

    Does Victor Stinner have a psychic link with Armin Rigo? :)

    https://bitbucket.org/pypy/pypy/src/7f593e7877d4/pypy/module/_multibytecodec/app_multibytecodec.py

    """
    # My theory is that they are not widely used on CPython either, because
    # I found two bugs just by looking at their .c source: they always call
    # encreset() after a piece of data, even though I think it's wrong ---
    # it should be called only once at the end; and mbiencoder_reset() calls
    # decreset() instead of encreset().
    """

    The answer to Armin's theory is that they're bugs but not ones users are likely to notice?

    @vstinner
    Copy link
    Member Author

    Le mardi 24 mai 2011 à 18:13 +0000, Martin a écrit :

    Martin <gzlist@googlemail.com> added the comment:

    Does Victor Stinner have a psychic link with Armin Rigo? :)

    https://bitbucket.org/pypy/pypy/src/7f593e7877d4/pypy/module/_multibytecodec/app_multibytecodec.py

    """

    My theory is that they are not widely used on CPython either, because

    I found two bugs just by looking at their .c source

    Sorry, I only found one bug, and while testing HZ, not while reading the
    source code.

    ... and mbiencoder_reset() calls decreset() instead of encreset()

    This is a new bug that you should be fixed. Armin did not reported the
    bug upstream (in this bug tracker)?

    The answer to Armin's theory is that they're bugs but not ones users are likely to notice?

    Ok, I will apply my fix.

    @arigo
    Copy link
    Mannequin

    arigo mannequin commented May 24, 2011

    Hi :-) I did not report the two issues I found so far because I didn't finish the PyPy implementation of CJK yet, and I'm very new to anything related to codecs; additionally I didn't check Python 3.x, as I was just following the 2.7 sources. Can someone confirm that the two bugs I suspect are really bugs? And should I open another report to help tracking the 2nd bug?

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 24, 2011

    New changeset bd17396895fb by Victor Stinner in branch '3.1':
    Issue bpo-12100: Don't reset incremental encoders of CJK codecs at each call to
    http://hg.python.org/cpython/rev/bd17396895fb

    New changeset 7f2ab2f95a04 by Victor Stinner in branch '3.2':
    (Merge 3.1) Issue bpo-12100: Don't reset incremental encoders of CJK codecs at
    http://hg.python.org/cpython/rev/7f2ab2f95a04

    New changeset cb9867dab15e by Victor Stinner in branch 'default':
    (Merge 3.2) Issue bpo-12100: Don't reset incremental encoders of CJK codecs at
    http://hg.python.org/cpython/rev/cb9867dab15e

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 24, 2011

    New changeset e789b4cda872 by Victor Stinner in branch '2.7':
    Issue bpo-12100: Don't reset incremental encoders of CJK codecs at each call to
    http://hg.python.org/cpython/rev/e789b4cda872

    @vstinner
    Copy link
    Member Author

    The initial problem (reset() at each call to .encode()) is fixed in Python 2.7, 3.1, 3.2 and 3.3.

    I opened a new issue, bpo-12171, for the second problem noticed by Armin (decreset vs encreset).

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs)
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants