Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reject unicode in zlib #49007

Closed
vstinner opened this issue Dec 27, 2008 · 15 comments
Closed

reject unicode in zlib #49007

vstinner opened this issue Dec 27, 2008 · 15 comments
Assignees
Labels
extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error

Comments

@vstinner
Copy link
Member

BPO 4757
Nosy @malemburg, @pitrou, @vstinner, @florentx
Files
  • zlib_bytes.patch
  • issue4757_zlib_bytes_v2.diff: Patch, apply to py3k
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/pitrou'
    closed_at = <Date 2009-12-14.18:24:46.510>
    created_at = <Date 2008-12-27.12:58:17.987>
    labels = ['extension-modules', 'type-bug']
    title = 'reject unicode in zlib'
    updated_at = <Date 2010-01-10.02:31:01.289>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2010-01-10.02:31:01.289>
    actor = 'vstinner'
    assignee = 'pitrou'
    closed = True
    closed_date = <Date 2009-12-14.18:24:46.510>
    closer = 'pitrou'
    components = ['Extension Modules']
    creation = <Date 2008-12-27.12:58:17.987>
    creator = 'vstinner'
    dependencies = []
    files = ['12472', '15556']
    hgrepos = []
    issue_num = 4757
    keywords = ['patch']
    message_count = 15.0
    messages = ['78356', '78357', '78360', '78365', '79090', '79110', '79124', '79225', '96376', '96378', '96382', '96383', '96384', '96389', '97491']
    nosy_count = 5.0
    nosy_names = ['lemburg', 'pitrou', 'vstinner', 'ebfe', 'flox']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue4757'
    versions = ['Python 3.1', 'Python 3.2']

    @vstinner
    Copy link
    Member Author

    Python 2.x allows to encode any byte string (str) and ASCII unicode
    string (unicode):

    $ python
    Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
    >>> import zlib
    >>> zlib.compress('abc')
    "x\x9cKLJ\x06\x00\x02M\x01'"
    >>> zlib.compress(u'abc')
    "x\x9cKLJ\x06\x00\x02M\x01'"
    >>> zlib.compress(u'abc\xe9')
    ...
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' ...

    I'm not sure that this behaviour was really wanted become the
    decompress operation is not symetric (the result type is always byte
    string):

    $ python
    Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
    >>> import zlib
    >>> zlib.decompress("x\x9cKLJ\x06\x00\x02M\x01'")
    'abc'

    Python 3.0 accepts any string: bytes or characters. But decompress
    always produce bytes string:

    $ ./python
    Python 3.1a0 (py3k:67926M, Dec 26 2008, 23:59:07)
    >>> import zlib
    >>> zlib.compress(b'abc')
    b"x\x9cKLJ\x06\x00\x02M\x01'"
    >>> zlib.compress('abc')
    b"x\x9cKLJ\x06\x00\x02M\x01'"
    >>> zlib.compress('abc\xe9')
    b'x\x9cKLJ>\xbc\x12\x00\x06\xca\x02\x93'
    >>> zlib.compress('abc\xe9'.encode('utf-8'))
    b'x\x9cKLJ>\xbc\x12\x00\x06\xca\x02\x93'
    >>> zlib.decompress(b'x\x9cKLJ>\xbc\x12\x00\x06\xca\x02\x93')
    b'abc\xc3\xa9'

    The most strange operation is the decompression of an unicode string:

    $ ./python
    >>> zlib.decompress('x\x9cKLJ>\xbc\x12\x00\x06\xca\x02\x93')
    ...
    zlib.error: Error -3 while decompressing data: incorrect header check

    I propose to change zlib API to reject unicode string and use explicit
    conversion to/from bytes. Functions/methods:

    • compress(bytes, ...)
    • decompress(bytes, ...)
    • <compress object>.compress(bytes, ...)
    • <decompress object>.decompress(bytes, ...)
    • crc32(bytes, value=0)
    • adler(bytes, value=1)

    Note: binascii.crc32() already rejects unicode string.

    The behaviour may kept in Python 3.0.x and only changed in Python 3.1.

    @vstinner vstinner added extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error labels Dec 27, 2008
    @vstinner
    Copy link
    Member Author

    See also issue bpo-4738 (better threads support in zlib).

    @malemburg
    Copy link
    Member

    On 2008-12-27 13:58, STINNER Victor wrote:
    > Python 2.x allows to encode any byte string (str) and ASCII unicode 
    > string (unicode):
    > 
    > $ python
    > Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
    >>>> import zlib
    >>>> zlib.compress('abc')
    > "x\x9cKLJ\x06\x00\x02M\x01'"
    >>>> zlib.compress(u'abc')
    > "x\x9cKLJ\x06\x00\x02M\x01'"
    >>>> zlib.compress(u'abc\xe9')
    > ...
    > UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' ...
    > 
    > I'm not sure that this behaviour was really wanted become the 
    > decompress operation is not symetric (the result type is always byte 
    > string):
    > 
    > $ python
    > Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
    >>>> import zlib
    >>>> zlib.decompress("x\x9cKLJ\x06\x00\x02M\x01'")
    > 'abc'
    > 

    I don't see a problem with this. The fact that Python 2.x also
    accepts Unicode ASCII strings where strings are normally expected
    is intended to help with the migration to Unicode, so the above
    is expected.

    zlib itself doesn't care about whether the data to be encoded
    is text or bytes.

    In Python 3.x, it's probably better to use bytes throughout the
    API.

    @ebfe
    Copy link
    Mannequin

    ebfe mannequin commented Dec 27, 2008

    I don't think Python 2.x should be changed - but 3.0 or 3.1 should be:

    • Characters don't mean a thing in zlib-land, all operations are based
      on bytes and their (implicit) default encoding. This behaviour is hidden
      and somewhat violates the rule of least surprise.
    • type(zlib.decompress(zlib.compress('abc'))) == bytes anyway
    • Changing from s* to y* forces the programmer to use .encode() on his
      strings (e.g. zlib.compress('abc'.encode()) which very clearly shows
      what's happening. If you want to compress and decompress Python3
      strings, you *must* share the same character encoding; think of
      zlib.compress('hôńè') and str(zlib.decompress(x)) with different locales.
    • Other modules (hashlib comes to my mind...) already reject Unicode
      objects for the same argument.

    @vstinner
    Copy link
    Member Author

    vstinner commented Jan 4, 2009

    The fact that Python 2.x also accepts Unicode ASCII strings
    where strings are normally expected is intended to help with
    the migration to Unicode

    I hate this behaviour. It doesn't help migration, it's the opposite! Sometimes
    it works (ASCII), and somtimes it fails (just one non-ASCII character). And
    then we will read "Unicode sucks!" because people doesn't understand the
    error.

    In Python 3.x, it's probably better to use bytes throughout the
    API.

    I propose to reject unicode in Python 3.x and display a warning for Python
    2.x. A warning to prepare the migration... not to Unicode, but to Python3 ;-)

    @ebfe
    Copy link
    Mannequin

    ebfe mannequin commented Jan 5, 2009

    The current behaviour may help the majority by ignorance and cause weird
    errors for others. We tell people that Python distincts between Text and
    Data but actually treat it all the same by implicit encoding.

    Modules that only operate on Bytes should reject Unicode-objects in
    Python3; it's a matter of 3 lines to display a warning in Python 2.
    Those modules that usually operate on Text but have single functions
    that operate on Bytes should display a warning but not enforce explicit
    encoding.

    Also see bpo-4821 and bpo-4818 where unicode already got rejected by the
    openssl-driven classes but silently accepted by the build-in ones.

    @malemburg
    Copy link
    Member

    On 2009-01-04 23:51, STINNER Victor wrote:

    STINNER Victor <victor.stinner@haypocalc.com> added the comment:

    > The fact that Python 2.x also accepts Unicode ASCII strings
    > where strings are normally expected is intended to help with
    > the migration to Unicode

    I hate this behaviour. It doesn't help migration, it's the opposite! Sometimes
    it works (ASCII), and somtimes it fails (just one non-ASCII character). And
    then we will read "Unicode sucks!" because people doesn't understand the
    error.

    Well, that's your opinion.

    The feature was added to get people
    work with Unicode at all, since otherwise we would have had to do
    all the Unicode porting we're doing now for Python 3 at the time
    Unicode was introduced - which was in version Python 1.6, eight years
    ago.

    At the time the Python community was a lot smaller and there wasn't
    all that much interest in Unicode anyway - the Unicode support I wrote
    for Python 1.6 was partially financed by HP which needed it for an
    application they had written in Python.

    See the introduction in PEP-100 for the motivation behind the design
    decisions:

    http://www.python.org/dev/peps/pep-0100/

    > In Python 3.x, it's probably better to use bytes throughout the
    > API.

    I propose to reject unicode in Python 3.x and display a warning for Python
    2.x. A warning to prepare the migration... not to Unicode, but to Python3 ;-)

    Fair enough.

    @vstinner
    Copy link
    Member Author

    vstinner commented Jan 6, 2009

    > I propose to reject unicode in Python 3.x and display a warning for
    > Python 2.x. A warning to prepare the migration... not to Unicode, but to
    > Python3 ;-)

    Fair enough.

    The patch for Python 3.x is already attached to this issue. We might only
    apply this one and leave Python 2.x unchanged. Can someone review the patch?

    @florentx
    Copy link
    Mannequin

    florentx mannequin commented Dec 14, 2009

    Definitely, zlib.compress should raise a TypeError (like bz2 does).

    >>> import bz2, zlib
    >>> bz2.compress('abc')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: argument 1 must be bytes or buffer, not str
    >>> zlib.compress('abc')
    b"x\x9cKLJ\x06\x00\x02M\x01'"

    Someone can review the patch and merge it?

    @pitrou
    Copy link
    Member

    pitrou commented Dec 14, 2009

    The patch lacks a test that TypeError is raised on unicode input,
    otherwise it's fine.

    @florentx
    Copy link
    Mannequin

    florentx mannequin commented Dec 14, 2009

    Patch from haypo updated for r76830 .

    Additional tests for TypeError, and to check that bytearray objects are
    accepted.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 14, 2009

    The patch produces a number of errors in test_tarfile, test_distutils,
    test_gzip and test_xmlrpc.

    @florentx
    Copy link
    Mannequin

    florentx mannequin commented Dec 14, 2009

    Fixed.

    And some "bytearray" tests improved in test_zlib.

    @pitrou pitrou self-assigned this Dec 14, 2009
    @pitrou
    Copy link
    Member

    pitrou commented Dec 14, 2009

    The patch was committed to py3k and 3.1. Thank you!

    @pitrou pitrou closed this as completed Dec 14, 2009
    @vstinner
    Copy link
    Member Author

    The patch was committed to py3k and 3.1. Thank you!

    r76836 and r76838

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants