Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Array pickling exposes internal memory representation of elements #46642

Closed
hniksic mannequin opened this issue Mar 18, 2008 · 21 comments
Closed

Array pickling exposes internal memory representation of elements #46642

hniksic mannequin opened this issue Mar 18, 2008 · 21 comments
Labels
extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error

Comments

@hniksic
Copy link
Mannequin

hniksic mannequin commented Mar 18, 2008

BPO 2389
Nosy @gvanrossum, @loewis, @rhettinger, @jcea, @hniksic, @avassalotti, @benjaminp
Files
  • portable_array_pickling.diff
  • portable_array_pickling-2.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2009-07-15.18:22:30.628>
    created_at = <Date 2008-03-18.14:38:08.061>
    labels = ['extension-modules', 'type-bug']
    title = 'Array pickling exposes internal memory representation of elements'
    updated_at = <Date 2009-07-15.18:22:30.626>
    user = 'https://github.com/hniksic'

    bugs.python.org fields:

    activity = <Date 2009-07-15.18:22:30.626>
    actor = 'alexandre.vassalotti'
    assignee = 'none'
    closed = True
    closed_date = <Date 2009-07-15.18:22:30.628>
    closer = 'alexandre.vassalotti'
    components = ['Extension Modules']
    creation = <Date 2008-03-18.14:38:08.061>
    creator = 'hniksic'
    dependencies = []
    files = ['14369', '14461']
    hgrepos = []
    issue_num = 2389
    keywords = ['patch']
    message_count = 21.0
    messages = ['63915', '64236', '65469', '70473', '70491', '70743', '70774', '71000', '71037', '71044', '71048', '71050', '71051', '71067', '71109', '71110', '85298', '89751', '90162', '90197', '90541']
    nosy_count = 8.0
    nosy_names = ['gvanrossum', 'loewis', 'collinwinter', 'rhettinger', 'jcea', 'hniksic', 'alexandre.vassalotti', 'benjamin.peterson']
    pr_nums = []
    priority = 'critical'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue2389'
    versions = ['Python 2.7', 'Python 3.2']

    @hniksic
    Copy link
    Mannequin Author

    hniksic mannequin commented Mar 18, 2008

    It would seem that pickling arrays directly exposes the underlying
    machine words, making the pickle non-portable to platforms with
    different layout of array elements. The guts of array.__reduce__ look
    like this:

    	if (array->ob_size > 0) {
    		result = Py_BuildValue("O(cs#)O", 
    			array->ob_type, 
    			array->ob_descr->typecode,
    			array->ob_item,
    			array->ob_size * array->ob_descr->itemsize,
    			dict);
    	}

    The byte string that is pickled is directly created from the array's
    contents. Unpickling calls array_new which in turn calls
    array_fromstring, which ends up memcpying the string data to the new array.

    As far as I can tell, array pickles created on one platform cannot be
    unpickled on a platform with different endianness (in case of integer
    arrays), wchar_t size (in case of unicode arrays) or floating-point
    representation (rare in practice, but possible). If pickles are
    supposed to be platform-independent, this should be fixed.

    Maybe the "typecode" field when used with the constructor could be
    augmented to include information about the elements, such as endianness
    and floating-point format. Or we should simply punt and pickle the
    array as a list of Python objects that comprise it...?

    @hniksic hniksic mannequin added extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error labels Mar 18, 2008
    @hniksic
    Copy link
    Mannequin Author

    hniksic mannequin commented Mar 21, 2008

    Here is an example that directly demonstrates the bug. Pickling on x86_64:

    Python 2.5.1 (r251:54863, Mar 21 2008, 13:06:31) 
    [GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import array, cPickle as pickle
    >>> pickle.dumps(array.array('l', [1, 2, 3]))
    "carray\narray\np1\n(S'l'\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x02\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x03\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp2\n."
    
    Unpickling on ia32:
    Python 2.5.1 (r251:54863, Oct  5 2007, 13:36:32) 
    [GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import cPickle as pickle
    >>>
    pickle.loads("carray\narray\np1\n(S'l'\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x02\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x03\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp2\n.")
    array('l', [1, 0, 2, 0, 3, 0])

    @gvanrossum
    Copy link
    Member

    This looks indeed wrong. Unfortunately it also looks hard to fix in a
    way that won't break unpickling arrays pickled by a previous Python
    version. We won't be able to fix this in 2.5 (it'll be a new feature)
    but we should try to fix this in 2.6 and 3.0.

    @benjaminp
    Copy link
    Contributor

    Ping.

    @rhettinger
    Copy link
    Contributor

    At this point, I think it better to wait until Py2.7/3.1. Changing it
    now would just complicate efforts to port from 2.5 to 2.6 to 3.0.

    Guido, do you agree?

    @gvanrossum
    Copy link
    Member

    Agreed, this has been broken for a long time, and few people have
    noticed or complained. Let's wait.

    @gvanrossum gvanrossum removed their assignment Aug 5, 2008
    @hniksic
    Copy link
    Mannequin Author

    hniksic mannequin commented Aug 6, 2008

    I guess it went unnoticed due to prevalence of little-endian 32-bit
    machines. With 64-bit architectures becoming more and more popular,
    this might become a bigger issue.

    Raymond, why do you think fixing this bug would complicate porting to
    2.6/3.0?

    @avassalotti
    Copy link
    Member

    I don't see why this cannot be fixed easily. All we need to do is fix
    the __reduce__ method of array objects to emit a list--i.e. with
    array.tolist()--instead of a memory string. Since the reduce protocol is
    just a fancy way to store the constructor arguments, this won't break
    unpickling of array objects pickled by previous Python versions.

    And here is a patch against the trunk.

    @gvanrossum
    Copy link
    Member

    Wouldn't that be lots and lots slower? I believe speed is one of the
    reasons why the binary representation is currently dumped.

    @avassalotti
    Copy link
    Member

    The slowdown depends of the array type. The patch makes array unpickling
    a few orders of magnitude slower (i.e. between 4 and 15 times slower
    depending of the array type). In general, pickling is about as fast as
    with the binary representation (or faster!).

    Although since most 64-bit compilers uses the LP64 model, I think we
    could make a compromise and only pickle as a list arrays of long
    integers. This would fix the problem without any visible speed penalties.

    @hniksic
    Copy link
    Mannequin Author

    hniksic mannequin commented Aug 12, 2008

    Unfortunately dumping the internal representation of non-long arrays
    won't work, for several reasons. First, it breaks when porting pickles
    between platforms of different endianness such as Intel and SPARC.
    Then, it ignores the considerable work put into correctly pickling
    floats, including the support for IEEE 754 special values. Finally, it
    will break when unpickling Unicode character arrays pickled on different
    Python versions -- wchar_t is 2 bytes wide on Windows, 4 bytes on Unix.

    I believe pickling arrays to compact strings is the right approach on
    the grounds of efficiency and I wouldn't change it. We must only be
    careful to pickle to a string with a portable representation of values.
    The straightforward way to do this is to pick a "standard" size for
    types (much like the struct module does) and endianness and use it in
    the pickled array. Ints are simple, and the code for handling floats is
    already there, for example _PyFloat_Pack8 used by cPickle.

    Pickling arrays as lists is probably a decent workaround for the pending
    release because it's backward and forward compatible (old pickles will
    work as well as before and new pickles will be correctly read by old
    Python versions), but for the next release I would prefer to handle this
    the right way. If there is agreement on this, I can start work on a
    patch in the following weeks.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 12, 2008

    I like to challenge the view what "correct" behavior is here. If I
    pickle an array of 32-bit integer values on one system, and unpickle it
    as an array of 64-bit integer values on a different system, is that
    correct, or incorrect?

    IMO, correct behavior would preserve the width as much as possible. For
    integers, this should be straight-forward, as it should be for floats
    and doubles (failing to unpickle them if the target system doesn't
    support a certain format). For Unicode, I think the array module should
    grow platform-independent width, for both 2-byte and 4-byte Unicode.

    When pickling, the pickle should always use network byte order;
    alternatively, the pickle should contain a byte order marker (presence
    of which could also be used as an indication that the new array pickle
    format is used). IOW, <i would indicate little-endian four byte
    integers, and so on.

    @hniksic
    Copy link
    Mannequin Author

    hniksic mannequin commented Aug 12, 2008

    I think preserving integer width is a good idea because it saves us from
    having to throw overflow errors when unpickling to machines with
    different width of C types. The cost is that pickling/unpickling the
    array might change the array's typecode, which can be a problem for C
    code that processes the array's buffer and expects the C type to remain
    invariant.

    Instead of sticking to network byte order, I propose to include byte
    order information in the pickle (for example as '<' or '>' like struct
    does), so that pickling/unpickling between the same-endianness
    architectures doesn't have to convert at all. Floats are always pickled
    as IEEE754, but the same optimization (not having to convert anything)
    would apply when unpickling a float array on an IEEE754 architecture.

    Preserving widths and including endianness information would allow
    pickling to be as fast as it is now (with the exception of unicode chars
    and floats on non-IEEE754 platforms). It would also allow unpickling to
    be as fast between architecture with equal endianness, and correct
    between others.

    @gvanrossum
    Copy link
    Member

    Instead of sticking to network byte order, I propose to include byte
    order information in the pickle (for example as '<' or '>' like struct
    does), so that pickling/unpickling between the same-endianness
    architectures doesn't have to convert at all. Floats are always pickled
    as IEEE754, but the same optimization (not having to convert anything)
    would apply when unpickling a float array on an IEEE754 architecture.

    Preserving widths and including endianness information would allow
    pickling to be as fast as it is now (with the exception of unicode chars
    and floats on non-IEEE754 platforms). It would also allow unpickling to
    be as fast between architecture with equal endianness, and correct
    between others.

    This sounds like the best approach yet -- it can be made backwards
    compatible (so 2.6 can read 2.5 pickles at least on the same platform)
    and can be just as fast when unpickling on the same platform, and only
    slightly slower on a different platform.

    @avassalotti
    Copy link
    Member

    I'm all in for a standardized representation of array's pickles (with
    width and endianness preserved). However to happen, we will either need
    to change array's constructor to support at least the byte-order
    specification (like struct) or add built-in support for array in the
    pickle module (which could be done without modifying the pickle protocol).

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 14, 2008

    I think changing the array constructor is fairly easy: just pick a set
    of codes that are defined to be platform-neutral (i.e. for each size two
    codes, one for each endianness). For example, the control characters
    (\0..\x1F) could be used in the following way:

    char, signed-byte, unsigned byte: c, b, B
    (big/little)
    sint16: 1,2 uint16: 3,4
    sint32: 5,6 uint32: 7,8
    sint64: 9,10 uint64: 11,12
    float: 13,14 double: 15,16
    UCS-2: 17,18 UCS-4: 19,20

    In above scheme, even codes are little-endian, odd codes are big endian.
    Converting the codes to "native" codes could be table-driven.

    @avassalotti
    Copy link
    Member

    Ah, I just remembered the smart way I had devised some time ago to
    handle this issue without changing the constructor of array.array. The
    trick would be to add a __reduce__ method to array.array. This method
    would return a special constructor function, the binary data of the
    array and a string representing the format of the array. Upon
    unpickling, the special constructor function would be called with the
    binary data and its format and then it would recreate the array.

    Now, the only thing I am not sure about is whether this would work well
    with subclasses of array.array. I guess we make __reduce__ also return
    the instance's type which could be used by special constructor to
    recreate the instance from the proper subclass.

    @avassalotti
    Copy link
    Member

    Here's a patch that implements the solution I described in msg85298.

    Please give it a good review:
    http://codereview.appspot.com/87072

    @avassalotti
    Copy link
    Member

    I would like to commit my patch later this week. So if you see any issue
    with the patch, please speak up.

    @avassalotti
    Copy link
    Member

    I know believe that arrays should be pickled as a list of values on
    Python 2.x. Doing otherwise makes it impossible to unpickle arrays
    coming from Python 2.x using Python 3.x, since pickle on Python 3
    decodes all the strings from 2.x to Unicode.

    However, we still can use the compact memory representation on Python 3.x.

    So, I propose that we change the array module on Python 2.x to emit a
    list instead of memory string and implement the portable array pickling
    mechanism only on Python 3.x.

    @avassalotti
    Copy link
    Member

    Committed fix for 3.x in r74013 and for 2.x in r74014.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants