Pickles

leorochael edited this page Apr 16, 2013 · 12 revisions

Pickle interoperability between Python 2 and Python 3

It's useful to be able to support accessing databases from both Python 2 and Python 3 because:

  • You may have multiple applications accessing a database, or multiple installations of the same application that are moved to Python 3 at different times. Supporting both Python versions will make the transition much easier.
  • Eventually, ZODB may support other languages, especially JavaScript. It would be a shame if we could support JavaScript but not Python 2.
  • Some ZODB users have massive databases which cannot easily (or even realistically) go through a migration process before moving to Python 3.

Issues

  1. Python 3 uses different pickling codes than Python 2. In particular, data pickled from Python 2 byte strings (STRING, BINSTRING, and SHORT_BINSTRING) is DWIMily interpreted by Python 3 as text in some encoding. Python 3 saves bytes with Python 3-specific opcodes (BINBYTES and SHORT_BINBYTES). "unicode" data in Python 2 and "str" data in Python 3 are saved as UNICODE and BINUNICODE.

    The BIN... prefix in these pickling codes refers to the pickle format of the data and not to the actual data they represent, and can be ignored for the purposes of this discussion.

  2. Names (attribute and global) are unicode in Python 3 but bytes in Python 2.
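The opcode difference described in item 1 can be observed from Python 3 with the standard library's pickletools module. This is a minimal sketch, not part of ZODB: the helper opcode_names and the hand-written pickle PY2_STR_PICKLE are illustrative assumptions.

```python
import pickle
import pickletools


def opcode_names(data):
    """Return the names of the opcodes found in a pickle byte string."""
    return [opcode.name for opcode, arg, pos in pickletools.genops(data)]


# Python 3, protocol 3: bytes and str are written with different opcodes.
bytes_ops = opcode_names(pickle.dumps(b"abc", protocol=3))
text_ops = opcode_names(pickle.dumps("abc", protocol=3))

# Minimal hand-written pickle using SHORT_BINSTRING, the opcode Python 2
# emits for short str values: opcode, length byte, data, then STOP.
PY2_STR_PICKLE = b"U\x03abc."

as_text = pickle.loads(PY2_STR_PICKLE)                     # decoded as ASCII text
as_bytes = pickle.loads(PY2_STR_PICKLE, encoding="bytes")  # kept as bytes
```

The default load treats the Python 2 string as text, while encoding="bytes" keeps it binary. Neither choice is right for every value in a real database, which is the core of the problem.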

Proposals

Python2 pickle with name conversion

Read and store byte data with the Python 2 opcodes, using a forked version of pickle, zodbpickle. Fix up names when necessary in Python 3.

  • When finding globals or setting instance state, convert byte names to unicode using an ascii encoding.

    We can only fix up attribute names when no custom __setstate__ is used, so this is only a partial solution. Applications with custom __setstate__ methods may not be interoperable across Python versions or may need to be modified.

  • Note that Python 2 attributes can be stored as unicode. (They can only be accessed with attribute notation if they're ASCII.)
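The name fix-up for instance state could look roughly like the sketch below. The helper fix_state_names and the Document class are hypothetical names used for illustration only. They are not zodbpickle APIs.

```python
def fix_state_names(state):
    """Return a copy of a state dict with byte-string keys decoded as ASCII."""
    fixed = {}
    for name, value in state.items():
        if isinstance(name, bytes):
            # Raises UnicodeDecodeError for non-ASCII names rather than guessing.
            name = name.decode("ascii")
        fixed[name] = value
    return fixed


class Document(object):
    """Stand-in for a persistent class without a custom __setstate__."""


doc = Document()
# State as it might arrive when Python 2 STRING data is kept as bytes.
doc.__dict__.update(fix_state_names({b"title": b"raw payload", b"size": 3}))
```

Only the keys are converted. The values stay as bytes, and a class with its own __setstate__ would receive the unconverted state, which is why this is only a partial solution.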

Issues:

  • Cookie.Morsel is a dict subclass that has unicode keys on Python 3, but byte keys in Python 2. Reading a Python 2 Morsel pickle in Python 3 requires the byte->unicode DWIM.

    This is a case we can't handle.

  • People often use "names" (aka "native strings") for dictionary keys. If we want interoperability between Python 2 and Python 3, then it would be bad if, in Python 2, a user did:

    >>> foo['bar'] = 1
    

    and then in Python 3:

    >>> foo['bar']
    

    raised a KeyError. After discussing this a bit, we think this may be fatal, especially in the presence of code like:

    >>> foo = dict(bar=1)

    or even:

    >>> foo = dict(**kw)

    which cannot be easily fixed in Python 2 by explicitly using b'bar' notation.
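The dictionary-key failure can be reproduced in Python 3 with the standard pickle module. In this sketch, encoding="bytes" stands in for an unpickler that keeps Python 2 STRING data as bytes, and PY2_DICT_PICKLE is a hand-written protocol 2 pickle of the kind Python 2 produces for {'bar': 1} (the exact memo indices may differ).

```python
import pickle

# Protocol 2 pickle of {'bar': 1} with the key stored as SHORT_BINSTRING.
PY2_DICT_PICKLE = b"\x80\x02}q\x00U\x03barq\x01K\x01s."

# Keep Python 2 str data as bytes instead of decoding it to text.
foo = pickle.loads(PY2_DICT_PICKLE, encoding="bytes")

try:
    foo["bar"]
    lookup_failed = False
except KeyError:
    # The stored key is b'bar', which never compares equal to 'bar'.
    lookup_failed = True
```

After loading, only foo[b'bar'] succeeds, so Python 3 code that uses the natural 'bar' spelling breaks.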

Explicit binary for Python 2

In this option, we create an explicit binary type for Python 2, probably as a subclass of str. We'll define the type in Python 3 as an alias for bytes.

Application authors will need to analyze their applications and replace true binary strings with instances of this new type. (This includes object ids and tids.)

In Python 2, we fork pickle and cPickle and add support for protocol 3 such that the new pickle byte codes for bytes are used for the new Python 2 binary type.

In Python 3, we'll just use bytes.
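A minimal sketch of such a type, assuming the name binary. The forked Python 2 pickler that writes instances with the bytes opcodes is not shown here.

```python
import sys

if sys.version_info[0] >= 3:
    # Python 3: bytes already is the explicit binary type.
    binary = bytes
else:
    class binary(str):
        """Python 2: marks a str as true binary data for the forked pickler."""
        __slots__ = ()


# Application code then wraps real binary values, such as an object id.
oid = binary(b"\x00\x00\x00\x00\x00\x00\x00\x01")
```

Because the Python 2 class is a str subclass, existing code that treats the value as a byte string keeps working, and only the pickler needs to notice the distinction.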

Pros:

  • We don't need to fork the Python 3 pickle code. We'll still need a noload() implementation, but not for string issues. We can probably do this via subclassing, which should be much less of a maintenance burden.
  • We don't need to worry about trying to spot names used for attribute names and dictionary keys.
  • We'll be explicit about what's binary and what's unicode.

Cons:

  • We fork Python 2's pickle and cPickle, but these are unlikely to change, so there's far less risk than forking the Python 3 versions.
  • Application developers need to use the new binary type. Some developer action will be required no matter what we do.

Implemented on a branch: see https://github.com/zopefoundation/zodbpickle/tree/py2_explicit_bytes

Explicit, but transitional, Native string for Python 3

The purpose of this option is to allow a full round trip of data between Python 2 and 3 without loss of information while preserving full functionality. This lets applications in large clusters run with a mix of both versions and permits data migration without system downtime.

This option must be combined with the option "Explicit binary for Python 2" above, but not with the option "Python2 pickle with name conversion".

In this option, STRING data is unpickled in Python 3 not as "str" (unicode), but as a str subclass called Native, after being decoded from bytes through the "latin-1" encoding.

The purpose of using "latin-1" is preserving all the byte values in the decoded unicode, so that application code can then choose to either re-decode it to pure unicode in the correct encoding, or encode it to pure bytes.
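The lossless property of "latin-1" can be checked directly. This short sketch uses only built-in str and bytes methods, and the sample value is an illustrative assumption.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# latin-1 maps byte N to code point N, so decoding never fails...
as_text = all_bytes.decode("latin-1")

# ...and encoding again recovers the original bytes exactly.
round_tripped = as_text.encode("latin-1")

# Text that Python 2 stored as UTF-8 can still be re-decoded correctly later.
stored = "caf\u00e9".encode("utf-8")
recovered = stored.decode("latin-1").encode("latin-1").decode("utf-8")
```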

This Native subclass behaves exactly like "str" (unicode) in Python 3, except that it pickles as STRING (by re-encoding it through "latin-1"), thereby preserving the same information when it is unpickled under Python 2.

This Native class would also serve as a "marker" on the respective data that a string needs to be converted to either Unicode or Bytes, while still behaving like str (unicode) for all Python 3 purposes, including acting as keys in the __dict__ state of objects, permitting normal attribute access.

It doesn't actually need to contain any methods, though it could get methods that help in conversion, like .__bytes__() (which would encode through 'latin-1', recovering the original "bytes" value) or .recode(encoding) (or perhaps .decode(encoding)), which would be used when a string in Python 2 was actually meant as text in an encoding other than 'latin-1'.
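On Python 3, the Native type with the optional helper methods might be sketched as follows. This is an illustration under the assumptions stated in the text, not an existing implementation. Pickling instances back as STRING would need a modified pickler and is not shown.

```python
class Native(str):
    """Marks text that was decoded from a Python 2 STRING pickle via latin-1."""

    def __bytes__(self):
        # Recover the original Python 2 byte string.
        return self.encode("latin-1")

    def recode(self, encoding):
        # Reinterpret the original bytes as text in the encoding actually used.
        return self.encode("latin-1").decode(encoding)


# Python 2 stored the UTF-8 bytes of "é"; the unpickler decoded them as latin-1.
value = Native(b"\xc3\xa9".decode("latin-1"))

original_bytes = bytes(value)       # back to the raw bytes
real_text = value.recode("utf-8")   # plain str holding the intended text

# Native still works as an ordinary str key, e.g. in an instance __dict__.
state = {Native("title"): 1}
```

Since Native inherits hashing and equality from str, state["title"] finds the entry, so attribute access keeps working while the marker remains visible to migration code.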

Under Python 2, STRING pickles would also be unpickled as a Native class, which extends "str" (i.e. what Python 3 calls bytes). They would behave like normal (byte) strings under Python 2, including behaving correctly as attribute names in the __dict__ of objects, but would function as markers that allow users to detect, under Python 2, that a Native string needs to be migrated to either bytes or unicode. Like their Python 3 counterparts, the Python 2 Native type could also get convenience functions .__bytes__() (returning the explicit Python 2 "bytes" type from the previous proposal) and .recode(encoding) (returning an explicit "unicode" type).

A package implementing this option would not be necessary for applications that start on Python 3, and it could be removed from applications that started in Python 2 but successfully migrated all their data and are now running under Python 3.

Removing the package (or configuration) implementing this option from an application would simply revert it to the current pickle behavior.

Pros:

  • We don't need to worry about trying to spot names used for attribute names and dictionary keys.
  • We'll be explicit about what's binary and what's unicode.
  • We don't do any DWIM guessing on the type of STRING pickles and yet obtain a fully functional "str" (unicode) subclass that is equivalent to "str" for all intents and purposes within Python 3.
  • Full round-trip between Python 2 and 3 without information loss, allowing version straddling on a live application, and allowing data migration on a live (production) system.

Cons:

  • We may have to maintain a fork of the Py3k pickle module (in order to force using the old STRING/BINSTRING/SHORT_BINSTRING opcodes).
  • Many values unpickled from Python 2 str which would convert cleanly to "normal" Unicode text (i.e., they were either ASCII or else latin-1 already) will get marked as "needing fixing".