Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: fork py38 / py312 pickle C extensions #92

Draft
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

tseaver
Copy link
Member

@tseaver tseaver commented Jun 18, 2024

New features:

  • Support for new / faster protocol versions (4/5).
  • The 3.12 version should be near-ready to enable sub-interpreter support.

Remaining issues:

  • Decide whether to remove _pickle_33.c, and build only one extension module (much simpler). If so, users who require the older "preserve pickle-level compatibility for databases created under Python 2" scenario will need to pin to the last major version, forever.
  • Update setup.py to build the desired extension, based on Python version.
  • Once we've decided that, fork the 'pickletools' modules from the Python stdlib as well, and arrange for them to be imported based on Python version.
  • And update the tests to use the new 'pickletools', match the new protocol versions, etc.

We need to be able to build this on other versions of Python, and without
having the original sources available.
Attempt to conform to the new argument-clinic patterns present in
the forked modules.
Add comments annotating semantics of all 'load_*' opcodes handlers
shadowed by 'noload_' opcode handlers, and the corresponding 'noload_'
semantics.
Use current (Python 3.8) 'Unpickler.load' as a template, leaving the
'load_*' opcodes commented out for eyeball verification.
The Python '_Py_ID' function is not accessible outside its own build.
Most changes involve finding / passing the 'PickleState' pointer
into 'noload' and the 'noload_*' opcode handlers.

'Unpickler.noload' is now declared using the required convention to
be able to look up module state from the type.
@tseaver tseaver marked this pull request as draft June 18, 2024 03:32
@icemac
Copy link
Member

icemac commented Jun 18, 2024

  • Decide whether to remove _pickle_33.c, and build only one extension module (much simpler). If so, users who require the older "preserve pickle-level compatibility for databases created under Python 2" scenario will need to pin to the last major version, forever.

Databases created under Python 2 should have been migrated to Python 3 using https://pypi.org/project/zodbupdate/ – at least this is what we have promoted to do to upgrade existing applications to Python 3.
It is not even possible to open a not converted database on Python 3 – it gets rejected because of a different magic code in the first bytes of the database file. From my understanding there should be no need to keep that Python 2 compat scenario – but maybe I am misunderstanding something here.

@navytux
Copy link

navytux commented Jun 18, 2024

Hello everyone,

For the reference: in the context of ERP5 we do load pickles generated on py2 under py3 instead of migrating the database. Quoting https://lab.nexedi.com/nexedi/pygolang/-/merge_requests/21#note_205356 :

Please see pickle_py2_gpy3_demo.py and ZODB_py2_gpy3_demo.py there for explanation + demonstration of how py2/py3 pickle compatibility problem is solved.

...

# However when run under gpy3, the string is loaded ok as bstr. Since bstr has the
# same semantic as regular str on py2, working with that object produces the
# same result plain py2 would produce when adjusting the data. And then, bstr
# is also persisted ok and via the same *STRING opcodes, that py2 originally
# used for the data.
#
# This way both py2 and gpy3 can interoperate on the same database: py2 can
# produce data, gpy3 can read the data and modify it, and further py2 can load
# updated data, again, just ok.

From this point of view it would be handy to retain support for STRING opcodes.

Thanks beforehand,
Kirill

P.S. Please see https://lab.nexedi.com/nexedi/pygolang/-/merge_requests/21 for the whole context of this if you are interested.

@tseaver
Copy link
Member Author

tseaver commented Jun 18, 2024

@icemac

Databases created under Python 2 should have been migrated to Python 3 using https://pypi.org/project/zodbupdate/ – at least this is what we have promoted to do to upgrade existing applications to Python 3.

Interesting: I'd forgotten that tool existed.

It is not even possible to open a not converted database on Python 3 – it gets rejected because of a different magic code in the first bytes of the database file. From my understanding there should be no need to keep that Python 2 compat scenario – but maybe I am misunderstanding something here.

I'd be glad to just git rm src/zodbpickle/_pickle_33.c if so, and just pick the 38 or 312 versions in setup.py based on the Python version.

@tseaver
Copy link
Member Author

tseaver commented Jun 18, 2024

@navytux

From this point of view it would be handy to retain support for STRING opcodes.

I just looked, and the _pickle_38.c and _pickle_312.c modules in this draft preserve the same semantics for reading the STRING and opcode as the current_pickle_33.c`:

  • First, they strip the required outer quotes abd convert the remainder to bytes (using PyBytes_DecodeEscape)
  • If self->encoding is set to bytes (that is the Py2 compatibility option), they push the bytes object on the stack.
  • Otherwise, they covert to str (using PyUnicode_FromEncodedObject) and push that object.

For writing, the new modules lose the ability to emit the STRING opcode:

  • for protocol < 3, they fall back to the old "fake reduce" strategy, which should be consumable from Python2.

Examining the _pickle_33.c version, I don't believe that this branch emitting STRING opcode is actually reachable: if (!self->bin) implies self->protocol == 0, but the prior section is guarded by if (self->protocol < 3).

@navytux
Copy link

navytux commented Jun 18, 2024

@tseaver, thanks for feedback.

Maybe when originally replying I misunderstood a bit the following:

Decide whether to remove _pickle_33.c, and build only one extension module
(much simpler). If so, users who require the older "preserve pickle-level
compatibility for databases created under Python 2" scenario will need to
pin to the last major version, forever.

I understand and agree that all _pickle_33, _pickle_38 and _pickle_312 handle
*STRING opcodes on loading essentially the same way: out of the box the
bytestring is decoded to unicode string, or is left as bytes if Unpickler
is created with encoding='bytes'. From this point of view it should be indeed
ok to drop _pickle_33 and leave only the rest. But then could you please
clarify what did you mean with "If so, users who require the older "preserve
pickle-level compatibility for databases created under Python 2" scenario will
need to pin to the last major version, forever."
?

For the reference I remembered that my bstr already works ok with respect to
pickle loading and saving on stdlib pickle from py3.11, so dropping py3.3-based
cpickle should be likely ok.

Your outline of how loading works and how saving does not work for STRING
opcodes is correct, but it applies to unmodified stdlib pickle and zodbpickle
behaviours.

However on my side loading and saving *STRING work ok out of the box, and in
completely backward compatible way with respect to py2, because I patch pickle
modules at runtime - both py and C versions - to handle that well. For example
the loading is implemented via using Unpickler(encoding='bstr') and returning
bstr(byte-data) for *STRING opcodes:

https://lab.nexedi.com/kirr/pygolang/-/blob/58f86b12/golang/_golang_str_pickle.pyx#L374-393
https://lab.nexedi.com/kirr/pygolang/-/blob/58f86b12/golang/_golang_str_pickle.pyx#L475-497
https://lab.nexedi.com/kirr/pygolang/-/blob/58f86b12/golang/golang_str_pickle_test.py#L73-127

while for saving the code is adjusted to emit bstr objects back via those same *STRING instructions:

https://lab.nexedi.com/kirr/pygolang/-/blob/58f86b12/golang/_golang_str_pickle.pyx#L395-423
https://lab.nexedi.com/kirr/pygolang/-/blob/58f86b12/golang/_golang_str_pickle.pyx#L635-673
https://lab.nexedi.com/kirr/pygolang/-/blob/58f86b12/golang/golang_str_pickle_test.py#L129-185

This way what py2 saved as its str is loaded on py3 as bstr object which
has semantic compatible to bytestring from py2.

In total there are 3 types in this scheme:

  1. raw bytes (represented by bytes on py3 and zodbpickle.binary on py2),
  2. binary string (represented by bstr on py3 and str on py2), and
  3. unicode string (represented by str on py3 and unicode on py2)

The binary string and unicode string are just two different representations of
the same "string" concept. So being two different representations of the same
entity, contrary to "raw bytes", they are interoperable with each other without
explicit conversion: for example bstr + unicode works seamlessly out of the box
as well as all other string operations do automatic coercion as needed.

And for the reference: even though intended encoding for binary string
representation is UTF-8, it still does work ok even when binary data is invalid UTF-8.

My primary message here is that there still are users that care about
backward compatibility for py2 pickles, and there is a way to keep that
compatibility working completely and transparently without database migration.

Kirill

H/t @gforcada for the catch.
Build either '_pickle_38.c' or '_pickle_312.c', based on Python version.
Python < 3.12 uses the version from Python 3.8;  Python >= 3.12 uses the
version from 3.12.

Note:  this change does *not* yet fix up all the things in the forked
       modules:  that bit is still in progress.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants