Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a higher pickle protocol for serializing objects on Python 2 #179

Merged
merged 4 commits into from Aug 29, 2017

Conversation

jamadden
Copy link
Member

Previously protocol 1 was used. The higher protocol is more efficient for new-style classes (all persistent objects are new-style), according to the docs, at the cost of being very slightly less space efficient for old-style classes.

In tests of a persistent object with two trivial numeric attributes, the higher protocol was 12 bytes smaller, and serialized and deserialized 1us faster. Introducing a reference to another new-style
class (with a small dict and list of strings for attributes) for a more realistic test made the higher protocol twice as fast to serialize (20.5 vs 10.3us), almost half the size (215 vs 142 bytes), and it deserialized 30% faster (6.5 vs 4.6us).

On Python 2, this will now allow open file objects to be pickled (loading the object will result in a closed file); previously this would result in a TypeError (as does under Python 3). We had tests that you couldn't do that with a BlobFile so I had to update it to still make that true.

I wouldn't recommend serializing arbitrary open files under Python 2 (for one thing, they can't trivially be deserialized in Python 3), but I didn't take any steps to prevent it either. Since this hasn't
been possible, there shouldn't be code in the wild that is trying to do it---and it wouldn't be forward compatible with Python 3 either.

Previously protocol 1 was used. This is more efficient for new-style
classes (all persistent objects are new-style), according to the docs,
at the cost of being very slightly less space efficient for old-style
classes.

In tests of a persistent object with two trivial numeric attributes,
the higher protocol was 12 bytes smaller, and serialized and
deserialized 1us faster. Introducing a reference to another new-style
class for a more realistic test made the higher protocol twice as fast
to serialize (20.5 vs 10.3us), almost half the size (215 vs 142
bytes), and it deserialized 30% faster (6.5 vs 4.6us).

On Python 2, this will now allow open ``file`` objects to be
pickled (loading the object will result in a closed file); previously
this would result in a ``TypeError`` (as does under Python 3). We had
tests that you couldn't do that with a BlobFile so I had to update it
to still make that true.

I wouldn't recommend serializing arbitrary open files under Python
2 (for one thing, they can't trivially be deserialized in Python 3),
but I didn't take any steps to prevent it either. Since this hasn't
been possible, there shouldn't be code in the wild that is trying to
do it---and it wouldn't be forward compatible with Python 3 either.
@jamadden
Copy link
Member Author

Another advantage would be that __getnewargs__ would be called on both Python 2 and 3. This would (probably) solve #18.

@jimfulton
Copy link
Member

Python 2 pickle/cPickle don't support protocol 3, so this change more closely ties us to zodbpickle, which is a maintenance burden that I'd like to get rid of at some point.

@jamadden
Copy link
Member Author

Protocol 2 would have the same benefit, so that could be changed if you'd like.

I just thought it was nice to be able to use the same protocol under both versions, especially since eventually Python 2 will go away. (FWIW, it seems like under Python 2, other than the version number, protocol 2 and 3 are the same...that was why zodbpickle's binary object was needed in the first place, wasn't it, to be able to make the distinction under Python 2 that Python 3's protocol 3 makes natively?)

Regarding getting rid of zodbpickle entirely, I would still eventually like to get around to porting protocol 4 back to it, which would make large objects much more efficient. And it would be useful for Python 2 and 3.3...which I now remember we just dropped, so that's a bit less important.

@jamadden
Copy link
Member Author

D'oh! I forgot the main reason we're committed to zodbpickle: Python's noload is broken, which breaks parts of ZODB (e.g., zodbdgc).

I suppose we could try to get my patch that fixed noload for zodbpickle upstreamed (I don't think it reintroduces the regression that caused its removal), but the earliest that would happen would be 3.7.

@jimfulton
Copy link
Member

jimfulton commented Aug 26, 2017 via email

@jamadden
Copy link
Member Author

I should note, however, that noload is an optimization. If using something
like relstorage or zc.zodbdgc, that optimization is less important because
the computation is performed outside of the database server and doesn't
block anything else, so I consider this optimization less important than I
once did.

True, it's an optimization. It's a substantial optimization, the last time I measured. IIRC it was something like an order of a magnitude (but that was 2014 so I may be misremembering). It also is extremely helpful if not everything can be unpickled, either because the class has gone missing or because it relies on infrastructure that's not set up without loading a full application (e.g., zc.baseregistry which unpickles by doing component.getUtility(IRegistry, name='somename').

Another way to accomplish that optimization would be cool!

So, given all that about our ties to zodbpickle, but also the fact that on Python 2, the protocols are pretty much the same, I am happy to either leave it or change it, it's up to you.

@jimfulton
Copy link
Member

I would prefer to use protocol 2 on Python 2. This means that the records would remain readable with standard pickle on Python 2, which I think has value.

This way the records typically stay readable with stdlib pickle.
@jamadden
Copy link
Member Author

Makes sense. Done.

@jamadden
Copy link
Member Author

Is there anything else I can do to get this ready to be approved?

@jmuchemb
Copy link
Member

Update the changelog ? (it still describes a change for protocol 3)

@jamadden
Copy link
Member Author

Update the changelog ? (it still describes a change for protocol 3)

Thanks! I have that locally, I just haven't pushed it yet.

@jimfulton jimfulton merged commit be5a9d5 into master Aug 29, 2017
@jimfulton jimfulton deleted the jam-pickle-prot branch August 29, 2017 14:55
@jamadden
Copy link
Member Author

Thank you!

@jimfulton
Copy link
Member

Thank you! :)

@jamadden
Copy link
Member Author

As it happens, we're just now hitting a case where protocol 2 might be useful (for __getnewargs__). Could there be a PyPI release when convenient please?

@jimfulton
Copy link
Member

Released.

@jamadden
Copy link
Member Author

jamadden commented Aug 30, 2017 via email

@jimfulton
Copy link
Member

I'm an idiot. Using protocol 2 defeats zodbpickle's binary mechanism. :( Whimper.

@icemac
Copy link
Member

icemac commented Sep 12, 2018

See #193 as followup of this ticket.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants