byte/unicode pickle incompatibilities between python2 and python3 #51033

Closed
RonnyPfannschmidt mannequin opened this issue Aug 26, 2009 · 32 comments

@RonnyPfannschmidt
Mannequin

RonnyPfannschmidt mannequin commented Aug 26, 2009

BPO 6784
Nosy @gvanrossum, @loewis, @birkenfeld, @jcea, @pitrou, @avassalotti, @florentx, @RonnyPfannschmidt, @serhiy-storchaka
Files
  • bytestrpickle.diff
  • bytestrpickle.diff
  • pickle_python2_str_as_bytes.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/avassalotti'
    closed_at = <Date 2013-12-07.09:12:55.777>
    created_at = <Date 2009-08-26.11:56:12.063>
    labels = ['type-bug', 'library', 'docs']
    title = 'byte/unicode pickle incompatibilities between python2 and python3'
    updated_at = <Date 2013-12-07.09:12:55.775>
    user = 'https://bugs.python.org/RonnyPfannschmidt'

    bugs.python.org fields:

    activity = <Date 2013-12-07.09:12:55.775>
    actor = 'alexandre.vassalotti'
    assignee = 'alexandre.vassalotti'
    closed = True
    closed_date = <Date 2013-12-07.09:12:55.777>
    closer = 'alexandre.vassalotti'
    components = ['Documentation', 'Library (Lib)']
    creation = <Date 2009-08-26.11:56:12.063>
    creator = 'RonnyPfannschmidt'
    dependencies = []
    files = ['33011', '33016', '33019']
    hgrepos = []
    issue_num = 6784
    keywords = ['patch']
    message_count = 32.0
    messages = ['91966', '91967', '91970', '91978', '91980', '91998', '92002', '92003', '92012', '92014', '92072', '92592', '153659', '153686', '153705', '153707', '153718', '153719', '154282', '154662', '154795', '154832', '156166', '156167', '205347', '205401', '205412', '205435', '205436', '205440', '205443', '205444']
    nosy_count = 15.0
    nosy_names = ['gvanrossum', 'loewis', 'georg.brandl', 'jcea', 'ggenellina', 'pitrou', 'alexandre.vassalotti', 'RonnyPfannschmidt', 'flox', 'valhallasw', 'jdharper', 'python-dev', 'Ronny.Pfannschmidt', 'serhiy.storchaka', 'kmike']
    pr_nums = []
    priority = 'high'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue6784'
    versions = ['Python 3.4']

    @RonnyPfannschmidt
    Mannequin Author

    RonnyPfannschmidt mannequin commented Aug 26, 2009

    I just noticed that there are some slight differences in the
    bytestring/unicodestring pickles between Python 2 and Python 3 when
    using protocols 0, 1 and 2.

    The first things I noticed are:

    A str from Python 2 is unpickled as unicode in Python 3
    (this fails for byte strings that don't fit what's expected for unicode).

    A bytes instance from Python 3 is pickled as a custom class in protocols < 3.

    I'll write a script to try all combinations of protocols, string
    variations and transfer directions.

    @RonnyPfannschmidt RonnyPfannschmidt mannequin added the type-bug An unexpected behavior, bug, or error label Aug 26, 2009
    @RonnyPfannschmidt RonnyPfannschmidt mannequin changed the title bytw/unicode string incompatibilities between python2 and and python3 byte/unicode pickle incompatibilities between python2 and and python3 Aug 26, 2009
    @loewis
    Mannequin

    loewis mannequin commented Aug 26, 2009

    Why are you reporting this here? If you think there is a bug, can you
    propose an alternative behavior that you would consider correct?

    The changes you mentioned are all deliberate.

    @RonnyPfannschmidt
    Mannequin Author

    RonnyPfannschmidt mannequin commented Aug 26, 2009

    The basic behavior I want to see for all protocols <= 2:

    1. python 2 string maps to python3 byte-string
    2. python 2 unicode maps to python3 string
    3. python 3 string map to python 2 unicode
    4. python 3 bytestring maps to python 2 string

    Anything else is confusing and may break.
    For example, one can't unpickle '\xFF' in Python 3 if it was pickled in
    Python 2.

    Note that these changes seem irrelevant for protocol 3, as Python 2.x
    doesn't support it.
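
For illustration, the '\xFF' failure described above can be reproduced in Python 3 with a hand-written protocol 2 pickle. The byte string below is an assumption: a minimal equivalent of what Python 2 would write for '\xff' (memo opcode omitted), not output captured from a real Python 2 session. The helper name `try_default_load` is hypothetical.

```python
import pickle

# Hand-written protocol 2 pickle of the Python 2 str '\xff':
# PROTO 2, SHORT_BINSTRING of length 1, STOP.
PY2_STR_PICKLE = b"\x80\x02U\x01\xff."


def try_default_load(data):
    """Return the loaded object, or the UnicodeDecodeError that was raised."""
    try:
        # Default arguments: encoding="ASCII", errors="strict".
        return pickle.loads(data)
    except UnicodeDecodeError as exc:
        return exc


default_result = try_default_load(PY2_STR_PICKLE)

# Decoding as Latin-1 never fails, but the result is text, not bytes.
latin1_result = pickle.loads(PY2_STR_PICKLE, encoding="latin1")

print(type(default_result).__name__)
print(repr(latin1_result))
```

The Latin-1 route avoids the exception, but the caller still has to re-encode the result to get the original bytes back, which is the inconvenience this report is about.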

    @loewis
    Mannequin

    loewis mannequin commented Aug 26, 2009

    > the basic behavior i want to see for all protocols <= 2
    >
    > 1. python 2 string maps to python3 byte-string

    That would not be good. Many people create pickles in 2.x where the
    string type really represents characters, more often so than they want
    it to represent bytes. Giving them bytes on unpickling will likely
    cause more problems than the current approach.

    > 2. python 2 unicode maps to python3 string

    That's the case, right?

    > 3. python 3 string map to python 2 unicode

    That's also the case, AFAICT.

    > 4. python 3 bytestring maps to python 2 string

    Hmm. This may be indeed a mistake. Until r61467, bytes were saved
    with the (BIN)STRING code; not sure why this was changed.

    @loewis loewis mannequin changed the title byte/unicode pickle incompatibilities between python2 and and python3 byte/unicode pickle incompatibilities between python2 and and python3 Aug 26, 2009
    @RonnyPfannschmidt
    Mannequin Author

    RonnyPfannschmidt mannequin commented Aug 26, 2009

    Since it breaks for anything non-ASCII, it's not that helpful after all,
    and since Python 2 strings are encoding-unaware there is no way to fix
    it.

    It might be preferable to supply unpicklers that are capable of
    coercing if the user really wants coercing.

    yup

    > 3. python 3 string map to python 2 unicode
    >
    > That's also the case, AFAICT.

    yup

    > 4. python 3 bytestring maps to python 2 string
    >
    > Hmm. This may be indeed a mistake. Until r61467, bytes were saved
    > with the (BIN)STRING code; not sure why this was changed.

    Python 3 is indeed evil there.

    b'\x80\x02c__builtin__\nbytes\nq\x00]q\x01\x85q\x02Rq\x03.'

    I'm convinced that a 1:1 mapping of Python 2 strings from/to Python 3
    byte strings is the least surprising behaviour and will keep surprising
    errors away when needing to communicate between different Python
    versions.

    It has just bitten me, and I suspect it will get others, too.
    An unpickle that completely fails in the face of encodings is not desirable
    at all.

    @RonnyPfannschmidt
    Mannequin Author

    RonnyPfannschmidt mannequin commented Aug 27, 2009

    It's even worse:

    python3:
    >>> import pickle
    >>> pickle.dumps(b'', protocol=2)
    b'\x80\x02c__builtin__\nbytes\nq\x00]q\x01\x85q\x02Rq\x03.'
    
    python2.6:
    >>> import pickle
    >>> pickle.loads('\x80\x02c__builtin__\nbytes\nq\x00]q\x01\x85q\x02Rq\x03.')
    '[]'
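
A sketch of how to inspect what Python 3 writes for bytes under protocol 2, using the standard `pickletools.genops` iterator. The exact byte output may differ between Python 3 releases (the session above is from 2009), so this only looks at opcode names; the helper name `opcode_names` is hypothetical.

```python
import pickle
import pickletools


def opcode_names(obj, protocol):
    """Pickle obj and return the names of the opcodes in the result."""
    data = pickle.dumps(obj, protocol=protocol)
    return [opcode.name for opcode, arg, pos in pickletools.genops(data)]


# Protocol 2 has no bytes opcode, so bytes are written as a call
# (GLOBAL ... REDUCE) that rebuilds the object on load.
proto2_ops = opcode_names(b"\xff", 2)

# Protocol 3 added dedicated opcodes for bytes.
proto3_ops = opcode_names(b"\xff", 3)

print(proto2_ops)
print(proto3_ops)
```

Because the protocol 2 form is a call to a Python 3 callable rather than a string opcode, Python 2 has no way to read it back as its native str.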

    @pitrou
    Member

    pitrou commented Aug 27, 2009

    The problem with trying to solve the following issue:
    "a bytes instance from python3 is pickled as custom class in
    protocols <3"
    is that if we pickle bytes from Python 3 as a 2.x str in protocol <= 2,
    unpickling it using Python 3 will yield a str (unicode), not a bytes
    object. Therefore the whole chain (pickling then unpickling) will not be
    idempotent.
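
The round-trip concern can be sketched as follows. The first pickle is hand-written and stands in for the hypothetical case where Python 3 stored bytes with the Python 2 string opcode; it is an assumption for illustration, not what any Python release emits for bytes.

```python
import pickle

# Hypothetical alternative: bytes stored with SHORT_BINSTRING, the opcode
# Python 2 uses for short str objects (hand-written, memo opcode omitted).
stored_as_py2_str = b"\x80\x02U\x03abc."

# With default settings Python 3 decodes that opcode to text,
# so the bytes object would come back as str.
loaded_from_binstring = pickle.loads(stored_as_py2_str)

# Actual Python 3 behaviour: bytes survive a protocol 2 round trip.
roundtripped = pickle.loads(pickle.dumps(b"abc", protocol=2))

print(type(loaded_from_binstring).__name__)
print(type(roundtripped).__name__)
```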

    @pitrou pitrou added the stdlib Python modules in the Lib dir label Aug 27, 2009
    @RonnyPfannschmidt
    Mannequin Author

    RonnyPfannschmidt mannequin commented Aug 27, 2009

    Unpickling any non-ASCII string from Python 2 will break.
    The only way out would be to ensure text strings and a single defined
    encoding (at that point, storing unicode strings in any case seems more
    practical).

    Also, byte strings stored as Python 2 str would break.

    And since I pass around binary strings as parts of objects, it's just
    completely broken for me.

    @RonnyPfannschmidt
    Mannequin Author

    RonnyPfannschmidt mannequin commented Aug 27, 2009

    in case the actual behavior is not supposed to change

    how about a way to declare one wants exact 1:1 mapping between py2<>py3,
    so str<>bytes and unicode<>str will work for sure

    something like load/dump(..., encoding=bytes) just crossed my mind

    @loewis
    Mannequin

    loewis mannequin commented Aug 27, 2009

    > how about a way to declare one wants exact 1:1 mapping between py2<>py3,
    > so str<>bytes and unicode<>str will work for sure

    In a sense, that's already possible. Inherit from _Pickler/_Unpickler,
    and replace the dispatch dict with a different mapping.

    I wouldn't object to supporting this with an option, though, assuming it
    was properly documented and implemented for both pickle and _pickle
    (probably along with pickletools).
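
A minimal sketch of the subclassing approach suggested above, assuming the pure-Python `pickle._Unpickler` keeps its current private layout (a class-level `dispatch` dict keyed by opcode byte value, with `self.read` and `self.append` set up by `load()`). These are undocumented internals and may change. The class name is hypothetical, and only SHORT_BINSTRING is handled here; STRING and BINSTRING would need the same treatment.

```python
import io
import pickle


class Py2StrAsBytesUnpickler(pickle._Unpickler):
    """Hypothetical subclass: load Python 2 short binary strings as bytes."""

    # Copy the class-level dispatch table so the base class is untouched.
    dispatch = dict(pickle._Unpickler.dispatch)

    def load_short_binstring(self):
        length = self.read(1)[0]        # one-byte length prefix
        self.append(self.read(length))  # push the raw bytes, no decoding

    dispatch[pickle.SHORT_BINSTRING[0]] = load_short_binstring


# Hand-written protocol 2 pickle of the Python 2 str '\xff'.
PY2_STR_PICKLE = b"\x80\x02U\x01\xff."

result = Py2StrAsBytesUnpickler(io.BytesIO(PY2_STR_PICKLE)).load()
print(repr(result))
```

This only affects the pure-Python unpickler; the C accelerator has no equivalent hook, which is part of why a documented option covering both implementations is the more attractive route.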

    @RonnyPfannschmidt RonnyPfannschmidt mannequin changed the title byte/unicode pickle incompatibilities between python2 and and python3 byte/unicode pickle incompatibilities between python2 and python3 Aug 28, 2009
    @ggenellina
    Mannequin

    ggenellina mannequin commented Aug 29, 2009

    Note that this is also a documentation issue: "The pickle
    serialization format is guaranteed to be backwards compatible across
    Python releases."

    @ggenellina ggenellina mannequin added the docs Documentation in the Doc dir label Aug 29, 2009
    @ggenellina ggenellina mannequin assigned birkenfeld Aug 29, 2009
    @RonnyPfannschmidt
    Mannequin Author

    RonnyPfannschmidt mannequin commented Sep 14, 2009

    i'll try to add some tests now

    hopefully i can get rid of the implicit badness like trying to coerce
    bytes to unicode in unpickle and storing bytes as list in pickle for
    protocol < 3

    @admin admin mannequin assigned docspython and unassigned birkenfeld Oct 29, 2010
    @valhallasw
    Mannequin

    valhallasw mannequin commented Feb 18, 2012

    Any news on this?

    Just as a note, pickletools.py also does not reflect the current behaviour; pickle types STRING, BINSTRING and SHORT_BINSTRING are all defined with stack_after=[pystring]:

    [1, line 992]
    I(name='STRING',
      code='S',
      arg=stringnl,
      stack_before=[],
      stack_after=[pystring],
      proto=0,
      doc=(...)
      )

    although the doc=... does describe it will be decoded, the object type of pystring is still defined as bytes:

    [1, line 747]
    pystring = StackObject(
        name='string',
        obtype=bytes,
        doc="A Python (8-bit) string object.")

    [1] http://hg.python.org/cpython/file/98df29d51e12/Lib/pickletools.py

    @RonnyPfannschmidt
    Mannequin Author

    RonnyPfannschmidt mannequin commented Feb 19, 2012

    I'm unlikely to find the time to try and fix pickle/cpickle myself in the next few months.

    @valhallasw
    Mannequin

    valhallasw mannequin commented Feb 19, 2012

    Last night, I hacked together a wrapper to do what loewis suggested [1]. It pickles bytes to str (for protocol <= 2), and unpickles str to bytes.

    If I (ever) get the build system and tests of python itself to work, I'll try and see if I can implement a nicer solution - at least for pickle.py.

    [1] https://github.com/valhallasw/py2/blob/master/bytestrpickle.py

    @pitrou
    Member

    pitrou commented Feb 19, 2012

    > If I (ever) get the build system and tests of python itself to work,

    If you have any problems with that, don't hesitate to ask on python-dev
    (or see http://mail.python.org/mailman/listinfo/core-mentorship )

    @valhallasw
    Mannequin

    valhallasw mannequin commented Feb 19, 2012

    OK, this is the pickle.py patch. A new parameter 'bytestr' has been added to both _Pickler and _Unpickler to toggle the pickle.string<=>bytes behaviour:

    _Pickler:
    IF protocol <= 2 AND bytestr=True
    THEN bytes are stored as STRING/SHORT_BINSTRING/BINSTRING
    ELSE (the old behaviour; obj for protocol <=2, else BINARY)

    _Unpickler:
    IF bytestr=True
    THEN STRING/SHORT_BINSTRING/BINSTRING are read as bytes
    ELSE they are read as str (old behaviour)

    I also extracted the decoding stuff from the three string reading functions to a single one.

    @valhallasw
    Mannequin

    valhallasw mannequin commented Feb 19, 2012

    P.S. (sorry for forgetting this in the original post ;-))

    Both
    ./python -m test -G -v test_pickle
    and
    ./python test_bytestrpickle.py
    pass, but I have not run the entire test suite, as that takes ~90 minutes on my laptop....

    The test script should of course be merged with test_pickle.py at some time....

    @valhallasw
    Mannequin

    valhallasw mannequin commented Feb 25, 2012

    Ok, this is my first attempt at the Pickler part of the C implementation. I'll have to adapt the python implementation to match this one.

    All BytestrPicklerTests in test_bytestrpickle.py pass, and ./python -m test -G -v test_pickle passes.

    Comments on style etc. are very welcome.

    @valhallasw
    Mannequin

    valhallasw mannequin commented Feb 29, 2012

    Added tests in Lib/test format.

    After applying pickle.py.patch and BytestrPickler_c.diff,
    ./python -m test -v -m PyPicklerBytestrTests test_pickle
    returns 12 tests, no errors, while
    ./python -m test -v -m CPicklerBytestrTests test_pickle
    only passes
    test_dump_bytes_protocol_0 (test.test_pickle.CPicklerBytestrTests) ... ok
    test_dump_bytes_protocol_1 (test.test_pickle.CPicklerBytestrTests) ... ok
    test_dump_bytes_protocol_2 (test.test_pickle.CPicklerBytestrTests) ... ok
    test_dump_bytes_protocol_3 (test.test_pickle.CPicklerBytestrTests) ... ok

    and has 8 errors (as expected).

    @valhallasw
    Mannequin

    valhallasw mannequin commented Mar 2, 2012

    And a complete patch that implements the tests, the python implementation and the C implementation. I'm not completely happy with the code duplication in read_string/read_binstring/read_short_binstring C implementation, so that might be an improvement (however, there is already a lot of code duplication there at the moment).

    Again: comments would be very welcome...

    @valhallasw
    Mannequin

    valhallasw mannequin commented Mar 3, 2012

    OK, and now a version that's not broken... I forgot to initialize self->bytestr for PicklerObject/UnpicklerObject. *puts on the you-broke-the-build-hat*

    This version passes all tests except test_packaging.test_caches, which seems to fail because I ran make install for this Python and then installed {distribute,pip,setuptools,virtualenv}.

    @valhallasw
    Mannequin

    valhallasw mannequin commented Mar 17, 2012

    Based on the discussion on python-dev [1], this is an updated implementation that uses encoding='bytes' to signal str->bytes behaviour.

    [1] http://mail.python.org/pipermail/python-dev/2012-March/117536.html
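
For reference, a short sketch of how the `encoding='bytes'` option reads from the caller's side on Python 3.4 and later, where this change landed. Both pickles are hand-written minimal equivalents of what Python 2 would produce (memo opcodes omitted), so treat the exact bytes as assumptions.

```python
import pickle

# Python 2 str '\xff': PROTO 2, SHORT_BINSTRING of length 1, STOP.
PY2_STR_PICKLE = b"\x80\x02U\x01\xff."

# Python 2 unicode u'\xe9': PROTO 2, BINUNICODE with a 4-byte
# little-endian length followed by UTF-8 data, STOP.
PY2_UNICODE_PICKLE = b"\x80\x02X\x02\x00\x00\x00\xc3\xa9."

# The special value "bytes" tells the unpickler not to decode the
# Python 2 string opcodes at all and to return bytes objects instead.
str_as_bytes = pickle.loads(PY2_STR_PICKLE, encoding="bytes")

# Unicode opcodes are unaffected and still load as str.
unicode_as_str = pickle.loads(PY2_UNICODE_PICKLE, encoding="bytes")

print(repr(str_as_bytes))
print(repr(unicode_as_str))
```

Reusing the existing `encoding` argument keeps the default behaviour unchanged for callers whose Python 2 strings really were text.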

    @valhallasw
    Mannequin

    valhallasw mannequin commented Mar 17, 2012

    ...and the tests to go with that.

    @avassalotti
    Member

    Could you provide a single patch with the implementation and the tests together? I will try to find some time this week to review this.

    @avassalotti avassalotti assigned avassalotti and unassigned docspython Dec 6, 2013
    @valhallasw
    Mannequin

    valhallasw mannequin commented Dec 6, 2013

    Hi Alexandre,

    Attached is a diff based on r87793:0c508d87f80b.

    Merlijn

    @valhallasw
    Mannequin

    valhallasw mannequin commented Dec 6, 2013

    I have fixed most of the nits in this patch, except for:

    1. the intermediate bytes object being created; inlining is an option, as storchaka suggested, but I'd rather have you decide what it should become before implementing it;

    2. make clinic gives me

    ./python -E ./Tools/clinic/clinic.py --make
    Error in file "./Modules/_pickle.c" on line 6611:
    Checksum mismatch!
    Expected: bed0d8bbe1c647960ccc6f997b33bf33935fa56f
    Computed: 58dcccb705487695fec30980f566027bc68d9c69
    make: *** [clinic] Error 255

    and I have no clue how to fix that -- the clinic docs are sparse, to say the least;

    3. The tests are still in their own test case; please decide between the two of you what is the best solution;

    4. I have grouped the test cases: test_load_python2_str_as_bytes (which checks protocols 0, 1, and 2), test_load_python2_unicode_as_str and test_load_long_python2_str_as_bytes;

    5. I have moved the commands to create the shown pickled versions from docstrings to comments. If you think they are not useful, I'll remove them, but I found them pretty useful while shortening the strings.

    @avassalotti
    Member

    I cleaned up the patch. I will submit it tonight if there are no major objections.

    @pitrou
    Member

    pitrou commented Dec 7, 2013

    How about updating the documentation as well?

    @serhiy-storchaka
    Member

    And what about an issue mentioned in msg153659?

    @python-dev
    Mannequin

    python-dev mannequin commented Dec 7, 2013

    New changeset bd71352e950f by Alexandre Vassalotti in branch 'default':
    Issue bpo-6784: Strings from Python 2 can now be unpickled as bytes objects.
    http://hg.python.org/cpython/rev/bd71352e950f

    @avassalotti
    Member

    I fixed up the last few review comments and submitted the patch. Thank you for the help!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022