Use a higher pickle protocol for serializing objects on Python 2 #179
Previously protocol 1 was used. The higher protocol is more efficient for new-style classes (all persistent objects are new-style), according to the docs, at the cost of being very slightly less space efficient for old-style classes.

In tests of a persistent object with two trivial numeric attributes, the higher protocol was 12 bytes smaller, and serialized and deserialized 1us faster. Introducing a reference to another new-style class (with a small dict and a list of strings for attributes) for a more realistic test made the higher protocol twice as fast to serialize (20.5 vs 10.3us), about a third smaller (215 vs 142 bytes), and 30% faster to deserialize (6.5 vs 4.6us).

On Python 2, this will now allow open ``file`` objects to be pickled (loading the object will result in a closed file); previously this would raise a ``TypeError`` (as it does under Python 3). We had tests asserting that you couldn't do that with a BlobFile, so I updated them to keep that true. I wouldn't recommend serializing arbitrary open files under Python 2 (for one thing, they can't trivially be deserialized in Python 3), but I didn't take any steps to prevent it either. Since this hasn't been possible before, there shouldn't be code in the wild trying to do it---and it wouldn't be forward compatible with Python 3 either.
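The size difference is easy to reproduce with a quick sketch. This is plain Python 3 comparing protocols 1 and 2 (the PR itself measured persistent objects on Python 2, and the class names and attribute values here are invented for illustration), but the mechanism is the same: protocol 1 reconstructs new-style instances through a ``copyreg._reconstructor`` reduce call, while protocol 2 and up use the compact ``NEWOBJ`` opcode.

```python
import pickle

class Child(object):  # new-style class standing in for a referenced persistent object
    def __init__(self):
        self.data = {"a": 1}
        self.names = ["x", "y"]

class Parent(object):  # the object under test, with two trivial numeric attributes
    def __init__(self):
        self.a = 1
        self.b = 2.0
        self.child = Child()

obj = Parent()
for proto in (1, 2):
    data = pickle.dumps(obj, proto)
    # protocol 1 spells out copyreg._reconstructor and builtins.object by name;
    # protocol 2 replaces all of that with a single NEWOBJ opcode
    print("protocol %d: %d bytes" % (proto, len(data)))
```

Running it shows the protocol-2 pickle is noticeably smaller, and the gap grows with each additional new-style instance in the graph.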
Another advantage would be that …
Python 2's pickle/cPickle don't support protocol 3, so this change ties us more closely to zodbpickle, which is a maintenance burden that I'd like to get rid of at some point.
Protocol 2 would have the same benefit, so that could be changed if you'd like. I just thought it was nice to be able to use the same protocol under both versions, especially since eventually Python 2 will go away. (FWIW, it seems like under Python 2, other than the version number, protocols 2 and 3 are the same... that was why zodbpickle's …)

Regarding getting rid of zodbpickle entirely, I would still eventually like to get around to porting protocol 4 back to it, which would make large objects much more efficient. It would also be useful for Python 2 and 3.3... which I now remember we just dropped, so that's a bit less important.
D'oh! I forgot the main reason we're committed to zodbpickle: Python's noload is broken, which breaks parts of ZODB (e.g., zodbdgc).

I suppose we could try to get my patch that fixed noload for zodbpickle upstreamed (I don't *think* it reintroduces the regression that caused its removal), but the earliest that would happen would be 3.7.
On Sat, Aug 26, 2017 at 10:36 AM, Jason Madden ***@***.***> wrote:

> D'oh! I forgot the main reason we're committed to zodbpickle: Python's
> noload is broken, which breaks parts of ZODB (e.g., zodbdgc).

Yes. That was my fault. I regret it. :(

> I suppose we could try to get my patch that fixed noload for zodbpickle
> upstreamed (I don't *think* it reintroduces the regression that caused
> its removal), but the earliest that would happen would be 3.7.

Noload was a hack. The purpose of noload is to get persistent references. I don't see noload being accepted upstream. I *do* see a possibility for an optimized API to get references. That would probably be a useful thing to do.

It might be interesting to have a stab at an API to extract references based on pickletools.genops. It would likely allow an implementation that avoids a number of the allocations/instantiations performed by a regular pickle load.

I should note, however, that noload is an optimization. If using something like RelStorage or zc.zodbdgc, that optimization is less important because the computation is performed outside of the database server and doesn't block anything else, so I consider this optimization less important than I once did.
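A toy version of that genops idea, to make the suggestion concrete. All names here are invented, and real ZODB persistent ids are tuples written with the ``BINPERSID`` opcode, which would require a small stack machine to recover; protocol 0's text ``PERSID`` opcode carries the id inline as the opcode argument, which keeps the sketch short.

```python
import io
import pickle
import pickletools

class PersistentStub(object):
    """Stand-in for a persistent object; 'oid' is an invented attribute."""
    def __init__(self, oid):
        self.oid = oid

class RefPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Protocol 0's PERSID opcode needs a newline-free ASCII string id.
        if isinstance(obj, PersistentStub):
            return obj.oid
        return None

def dumps_with_refs(obj):
    buf = io.BytesIO()
    RefPickler(buf, protocol=0).dump(obj)
    return buf.getvalue()

def extract_refs(data):
    # Walk the opcode stream without constructing any objects at all:
    # every PERSID opcode carries its persistent id as the opcode argument.
    return [arg for op, arg, pos in pickletools.genops(data)
            if op.name == 'PERSID']

data = dumps_with_refs({'x': PersistentStub('oid-1'), 'y': PersistentStub('oid-2')})
print(sorted(extract_refs(data)))  # ['oid-1', 'oid-2']
```

The point of the exercise: ``extract_refs`` never instantiates the pickled objects or calls a ``persistent_load`` hook, which is exactly the class of allocations a genops-based reference scanner would avoid.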
True, it's an optimization. But it's a substantial one, the last time I measured; IIRC it was something like an order of magnitude (but that was 2014, so I may be misremembering). It's also extremely helpful if not everything can be unpickled, either because a class has gone missing or because it relies on infrastructure that isn't set up without loading a full application (e.g., zc.baseregistry, which unpickles by doing …).

Another way to accomplish that optimization would be cool!

So, given all that about our ties to zodbpickle, but also the fact that on Python 2 the protocols are pretty much the same, I'm happy to either leave it or change it; it's up to you.
I would prefer to use protocol 2 on Python 2. This means the records would remain readable with standard pickle on Python 2, which I think has value.
This way the records typically stay readable with stdlib pickle.
Makes sense. Done.

Is there anything else I can do to get this ready to be approved?

Update the changelog? (It still describes a change for protocol 3.)

Thanks! I have that locally, I just haven't pushed it yet.
Thank you!

Thank you! :)
As it happens, we're just now hitting a case where protocol 2 might be useful (for …)
Released. |
On Aug 30, 2017, at 07:55, Jim Fulton ***@***.***> wrote:

> Released

Thank you!
I'm an idiot. Using protocol 2 defeats zodbpickle's binary mechanism. :( Whimper.
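The underlying issue is visible from Python 3's side of the fence, with the stdlib alone: native ``bytes`` opcodes only exist from protocol 3 on, so a protocol-2 stream has to smuggle bytes through a ``codecs.encode`` reduce call instead. A quick check (stdlib only, nothing zodbpickle-specific):

```python
import pickle
import pickletools

def opcode_names(data):
    # Collect the set of opcode names used in a pickle stream.
    return {op.name for op, arg, pos in pickletools.genops(data)}

ops2 = opcode_names(pickle.dumps(b"some bytes", 2))
ops3 = opcode_names(pickle.dumps(b"some bytes", 3))

# Protocol 3 has a dedicated short-bytes opcode; protocol 2 has to fall
# back to a REDUCE call (a latin-1 round trip through codecs.encode).
print("SHORT_BINBYTES" in ops2, "SHORT_BINBYTES" in ops3)  # False True
print("REDUCE" in ops2)  # True
```

Which is presumably why a bytes-preserving scheme like zodbpickle's needs protocol 3's opcodes to work.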
See #193 as a follow-up to this ticket.