Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pickling issues between C and Python implementations. #19

Closed
jamadden opened this issue Jan 11, 2016 · 19 comments
Closed

Pickling issues between C and Python implementations. #19

jamadden opened this issue Jan 11, 2016 · 19 comments

Comments

@jamadden
Copy link
Member

This grows out of the discussion in zopefoundation/persistent#32. The user had been using the C implementation of BTrees and had many saved pickles. Upon updating dependencies, the user discovered he was using the Python implementation. This was slow, so the user switched back to the C implementation.

This leads to two related issues. The first, most serious, issue is that a BTree pickled by the Python implementation will only ever unpickle as the Python implementation, though a BTree pickled by the C implementation will unpickle as whichever implementation is available (but see below). Because of the tree-like structure of BTree objects, this can lead to AttributeErrors when the parent object is (re)pickled as the Python implementation and still has children that are pickled as the C implementation.

The second issue I discovered when testing the first issue. An empty C BTree unpickled with the Python implementation isn't initialized correctly, leading to its own AttributeErrors.

In environments like my own organizations, that are slowly rolling out PyPy, it will be common to have developers and production/test environments sharing pickles where everyone is on some combination of PyPy and CPython. Maybe even different production (micro)services will be on PyPy and CPython at the same time. So the first AttributeError could be a real problem.

It seems to me like the two implementations should produce pickles that always result in loading the "best" available implementation, i.e., the Python implementation should not specify the Py suffix in pickle names. I think that's the only way to avoid the first issue. Though, there is an argument to be made that someone might explicitly want to pickle the Python implementation, but I'm not sure why. I'm not sure how to fix this without changing the pickling format in a backwards-incompatible way, but I'm not a pickle expert. I can try to look into that a bit.

I suspect the second issue is easier to fix.

If there's consensus that something needs to be done on this I can do the work to submit a PR.

@mgedmin
Copy link
Member

mgedmin commented Jan 11, 2016

This sounds like a serious bug that ought to be fixed.

@jamadden
Copy link
Member Author

It may be as simple as reversing the names of the Python implementations. Instead of:

class OOBTreePy(BTree): 
   ...

try:
    from ._OOBTree import OOBucket
except ImportError as e: #pragma NO COVER w/ C extensions
    OOBTree = OOBTreePy
    ...
else: #pragma NO COVER w/o C extensions
    from ._OOBTree import OOBTree
    ...

Do this:

class OOBTree(BTree): 
   ...

OOBTreePy = OOBTree # NEW; expose the Python implementation always
try:
    from ._OOBTree import OOBucket
except ImportError as e: #pragma NO COVER w/ C extensions
    pass # Nothing to do here anymore
else: #pragma NO COVER w/o C extensions
    from ._OOBTree import OOBTree
    ...

I haven't really thought this through...

@jamadden
Copy link
Member Author

That does appear to work.

The second issue (pickling empty BTrees) appears in the plain pure-python implementation, without any cross-implementation factors in play.

I'm working on a branch to fix these.

@jamadden
Copy link
Member Author

Argh. No, no that doesn't work. I get test failures as soon as ZODB is available with the C implementation.

@ml31415
Copy link

ml31415 commented Jan 11, 2016

Maybe rather something like this from the persistent module?

https://github.com/zopefoundation/persistent/blob/master/persistent/__init__.py

@jamadden
Copy link
Member Author

@ml31415 I'm not sure what you're suggesting. Persistent doesn't need to do anything special to make sure that both the C and Python implementations pickle and unpickle the same way (cross implementation) because people usually only pickle subclasses of Persistent. So the module and/or class name differences don't matter much.

@ml31415
Copy link

ml31415 commented Jan 11, 2016

Well, but it applies to PersistentList and PersistentMapping. If I remember correctly, they both got the same class location, independent of implementation.

@jamadden
Copy link
Member Author

PersistentList and PersistentMapping only have one implementation, in Python.

@ml31415
Copy link

ml31415 commented Jan 11, 2016

doh. Ok, I'm out and leave this to the experts :)

@jamadden
Copy link
Member Author

By adding a __reduce__ method to the Python base class BTrees._base.Base and setting a few attributes, it's possible to make the Python versions pickle with the same class names as the C implementation: because __reduce__ returns the class to pickle, the Python implementation can return the C class. At unpickling time, the best available implementation is used. All pickles are otherwise unchanged and backward and forward compatibility is assured.

This has the minor problem that any "legacy" pickles that still have the Python class name in them will unpickle as the Python class (so Michael's DB would still be corrupted). I don't see any way to get around that while maintaining pickle compatibility, short of doing some nasty things with stack introspection (which is slow on PyPy). But that seems livable, especially because it could be worked around during unpickling by setting find_global.

However, there's a substantial problem that turns out to be a deal breaker: the approach of returning a different class from __reduce__ is simply not allowed under pickle protocols >= 2 if the usual copy_reg.__newobj__ is used; pickling raises an exception in that case (those protocols have a special opcode to directly call cls.__new__ and they make sure that __reduce__ returns the same class as the object being pickled. For some reason.). Those protocols are the default on Python 3. This can be handled by using a custom constructor function instead of copy_reg.__newobj__.

So far, I'm at a loss for any way to make the C and Python pickles always result in the C version if it's available (and preferably being equal) that doesn't change the pickle format, meaning older versions couldn't unpickle new pickles.

Thoughts?

@jamadden
Copy link
Member Author

FWIW, the changes so far are at master...NextThought:issue_2_pickle

@ml31415
Copy link

ml31415 commented Jan 11, 2016

Fixed my db by doing lots of undos this afternoon, so as long as noone else complains ... I wouldn't bother about me or reverting any messed up objects. Besides that, would it be so bad to use a custom constructor?

@jamadden
Copy link
Member Author

A custom constructor means at least these things:

  1. The new pickles can't be read by older releases. This can be annoying for moving databases between environments. It may or may not be a show stopper, I don't know.
  2. The new constructor function effectively becomes part of the API of the package and can't be removed without breaking pickles.
  3. The higher pickle protocol optimizations don't kick in, so the pickles are slightly larger/slower than they would be using those protocols.
  4. The pickle formats produced by C and Python implementations would diverge unless someone wants to go in and mess with the C code (I don't). They're different now, so that's not a huge deal, but it would be nice if they were the same (I was this close to that, until I tested on Python 3).

@jamadden
Copy link
Member Author

Here's a subversive thought: When performing the class check for the protocol >= 2, all implementations of pickle (pickle.py, cPickle, zodbpickle) do the equivalent of obj.__class__; that is, they don't use type(obj) or PyObject*->tp_type (or whatever the C equivalent is).

This means we can lie about the __class__ with an appropriate @property definition such that we return the C implementation. In turn, this means that the Python __reduce__ can return the C implementation for all protocols: no custom constructor, and the pickles for C and Python are identical. Yay!

Downsides: calling the property getter function may add overhead to accessing __class__, which is done a lot, apparently; may need a metaclass to make issubclass work as expected (not sure).

@mgedmin
Copy link
Member

mgedmin commented Jan 12, 2016

such that we return the C implementation

What if there's no C implementation? E.g. the user didn't have the right libraries/headers when they installed BTrees?

@jamadden
Copy link
Member Author

What if there's no C implementation?

My proof-of-concept handles this by making sure that the Python implementation has the same __name__ as the C implementation would and returning that. e.g., https://github.com/NextThought/BTrees/blob/issue_2_pickle/BTrees/_base.py#L1538

@jamadden
Copy link
Member Author

With the __class__ approach, all tox environments for the branch I referenced before are green and pickles are identical.

There's no need for a metaclass to make isinstance work out. For reasons having to do with the way isinstance is implemented, it's possible to be an instance of two unrelated classes without any subclass customization:

>>> class X(object):
...   pass
...
>>> class Y(object):
...   @property
...   def __class__(self):
...     return X
...
>>> y = Y()
>>> isinstance(y, type(y)) # Y
True
>>> isinstance(y, y.__class__) # X
True

In isinstance(inst, Type), if the C object's ob_type is identical to or a subclass of Type, checking is short-circuited---that's the first case. Otherwise, inst.__class__ is checked to see if it's a subclass of Type---that's the second case.

@jamadden
Copy link
Member Author

I've opened PR #20 for the __class__ based approach.

@ml31415
Copy link

ml31415 commented Jan 13, 2016

Looks good, simpler than expected!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants