WIP: fork py38 / py312 pickle C extensions #92

tseaver · 2024-06-18T03:32:27Z

New features:

Support for new / faster protocol versions (4/5).
The 3.12 version should be near-ready to enable sub-interpreter support.

Remaining issues:

Decide whether to remove _pickle_33.c, and build only one extension module (much simpler). If so, users who require the older "preserve pickle-level compatibility for databases created under Python 2" scenario will need to pin to the last major version, forever.
Update setup.py to build the desired extension, based on Python version.
Once we've decided that, fork the 'pickletools' modules from the Python stdlib as well, and arrange for them to be imported based on Python version.
And update the tests to use the new 'pickletools', match the new protocol versions, etc.

We need to be able to build this on other versions of Python, and without having the original sources available.

Attempt to conform to the new argument-clinic patterns present in the forked modules.

Pass the unpickler's statck.

Add comments annotating semantics of all 'load_*' opcodes handlers shadowed by 'noload_' opcode handlers, and the corresponding 'noload_' semantics.

Use current (Python 3.8) 'Unpickler.load' as a template, leaving the 'load_*' opcodes commented out for eyeball verification.

The Python '_Py_ID' function is not accessible outside its own build.

Most changes involve finding / passing the 'PickleState' pointer into 'noload' and the 'noload_*' opcode handlers. 'Unpickler.noload' is now declared using the required convention to be able to look up module state from the type.

icemac · 2024-06-18T06:17:40Z

Decide whether to remove _pickle_33.c, and build only one extension module (much simpler). If so, users who require the older "preserve pickle-level compatibility for databases created under Python 2" scenario will need to pin to the last major version, forever.

Databases created under Python 2 should have been migrated to Python 3 using https://pypi.org/project/zodbupdate/ – at least this is what we have promoted to do to upgrade existing applications to Python 3.
It is not even possible to open a not converted database on Python 3 – it gets rejected because of a different magic code in the first bytes of the database file. From my understanding there should be no need to keep that Python 2 compat scenario – but maybe I am misunderstanding something here.

navytux · 2024-06-18T08:20:42Z

Hello everyone,

For the reference: in the context of ERP5 we do load pickles generated on py2 under py3 instead of migrating the database. Quoting https://lab.nexedi.com/nexedi/pygolang/-/merge_requests/21#note_205356 :

Please see pickle_py2_gpy3_demo.py and ZODB_py2_gpy3_demo.py there for explanation + demonstration of how py2/py3 pickle compatibility problem is solved.

...

# However when run under gpy3, the string is loaded ok as bstr. Since bstr has the
# same semantic as regular str on py2, working with that object produces the
# same result plain py2 would produce when adjusting the data. And then, bstr
# is also persisted ok and via the same *STRING opcodes, that py2 originally
# used for the data.
#
# This way both py2 and gpy3 can interoperate on the same database: py2 can
# produce data, gpy3 can read the data and modify it, and further py2 can load
# updated data, again, just ok.

From this point of view it would be handy to retain support for STRING opcodes.

Thanks beforehand,
Kirill

P.S. Please see https://lab.nexedi.com/nexedi/pygolang/-/merge_requests/21 for the whole context of this if you are interested.

tseaver · 2024-06-18T14:43:24Z

@icemac

Databases created under Python 2 should have been migrated to Python 3 using https://pypi.org/project/zodbupdate/ – at least this is what we have promoted to do to upgrade existing applications to Python 3.

Interesting: I'd forgotten that tool existed.

It is not even possible to open a not converted database on Python 3 – it gets rejected because of a different magic code in the first bytes of the database file. From my understanding there should be no need to keep that Python 2 compat scenario – but maybe I am misunderstanding something here.

I'd be glad to just git rm src/zodbpickle/_pickle_33.c if so, and just pick the 38 or 312 versions in setup.py based on the Python version.

tseaver · 2024-06-18T15:18:04Z

@navytux

From this point of view it would be handy to retain support for STRING opcodes.

I just looked, and the _pickle_38.c and _pickle_312.c modules in this draft preserve the same semantics for reading the STRING and opcode as the current_pickle_33.c`:

First, they strip the required outer quotes abd convert the remainder to bytes (using PyBytes_DecodeEscape)
If self->encoding is set to bytes (that is the Py2 compatibility option), they push the bytes object on the stack.
Otherwise, they covert to str (using PyUnicode_FromEncodedObject) and push that object.

For writing, the new modules lose the ability to emit the STRING opcode:

for protocol < 3, they fall back to the old "fake reduce" strategy, which should be consumable from Python2.

Examining the _pickle_33.c version, I don't believe that this branch emitting STRING opcode is actually reachable: if (!self->bin) implies self->protocol == 0, but the prior section is guarded by if (self->protocol < 3).

src/zodbpickle/_pickle_38.c

navytux · 2024-06-18T19:44:19Z

@tseaver, thanks for feedback.

Maybe when originally replying I misunderstood a bit the following:

Decide whether to remove _pickle_33.c, and build only one extension module
(much simpler). If so, users who require the older "preserve pickle-level
compatibility for databases created under Python 2" scenario will need to
pin to the last major version, forever.

I understand and agree that all _pickle_33, _pickle_38 and _pickle_312 handle
*STRING opcodes on loading essentially the same way: out of the box the
bytestring is decoded to unicode string, or is left as bytes if Unpickler
is created with encoding='bytes'. From this point of view it should be indeed
ok to drop _pickle_33 and leave only the rest. But then could you please
clarify what did you mean with "If so, users who require the older "preserve
pickle-level compatibility for databases created under Python 2" scenario will
need to pin to the last major version, forever." ?

For the reference I remembered that my bstr already works ok with respect to
pickle loading and saving on stdlib pickle from py3.11, so dropping py3.3-based
cpickle should be likely ok.

Your outline of how loading works and how saving does not work for STRING
opcodes is correct, but it applies to unmodified stdlib pickle and zodbpickle
behaviours.

However on my side loading and saving *STRING work ok out of the box, and in
completely backward compatible way with respect to py2, because I patch pickle
modules at runtime - both py and C versions - to handle that well. For example
the loading is implemented via using Unpickler(encoding='bstr') and returning
bstr(byte-data) for *STRING opcodes:

https://lab.nexedi.com/kirr/pygolang/-/blob/58f86b12/golang/_golang_str_pickle.pyx#L374-393
https://lab.nexedi.com/kirr/pygolang/-/blob/58f86b12/golang/_golang_str_pickle.pyx#L475-497
https://lab.nexedi.com/kirr/pygolang/-/blob/58f86b12/golang/golang_str_pickle_test.py#L73-127

while for saving the code is adjusted to emit bstr objects back via those same *STRING instructions:

https://lab.nexedi.com/kirr/pygolang/-/blob/58f86b12/golang/_golang_str_pickle.pyx#L395-423
https://lab.nexedi.com/kirr/pygolang/-/blob/58f86b12/golang/_golang_str_pickle.pyx#L635-673
https://lab.nexedi.com/kirr/pygolang/-/blob/58f86b12/golang/golang_str_pickle_test.py#L129-185

This way what py2 saved as its str is loaded on py3 as bstr object which
has semantic compatible to bytestring from py2.

In total there are 3 types in this scheme:

raw bytes (represented by bytes on py3 and zodbpickle.binary on py2),
binary string (represented by bstr on py3 and str on py2), and
unicode string (represented by str on py3 and unicode on py2)

The binary string and unicode string are just two different representations of
the same "string" concept. So being two different representations of the same
entity, contrary to "raw bytes", they are interoperable with each other without
explicit conversion: for example bstr + unicode works seamlessly out of the box
as well as all other string operations do automatic coercion as needed.

And for the reference: even though intended encoding for binary string
representation is UTF-8, it still does work ok even when binary data is invalid UTF-8.

My primary message here is that there still are users that care about
backward compatibility for py2 pickles, and there is a way to keep that
compatibility working completely and transparently without database migration.

Kirill

@gforcada

H/t @gforcada for the catch.

Build either '_pickle_38.c' or '_pickle_312.c', based on Python version.

Python < 3.12 uses the version from Python 3.8; Python >= 3.12 uses the version from 3.12. Note: this change does *not* yet fix up all the things in the forked modules: that bit is still in progress.

tseaver added 13 commits June 16, 2024 16:18

chore: fork 'Modules/_pickle.c' from Python 3.8.19 and 3.12.3

a533776

chore: inline corresponding 'Modules/clinic/_pickle_c.h' headers

307b6e8

We need to be able to build this on other versions of Python, and without having the original sources available.

chore: apply / fix up 'noload' patch

1e6c515

Attempt to conform to the new argument-clinic patterns present in the forked modules.

fix: always build as module

f3f49c6

fix: 'stack_underflow' -> 'Pdata_stack_underflow'

974e8f9

Pass the unpickler's statck.

fix: use correct name for 'Unpickler.noload' docstring

494c2af

refactor: add missing 'noload' opcode handlers

8a6e397

Add comments annotating semantics of all 'load_*' opcodes handlers shadowed by 'noload_' opcode handlers, and the corresponding 'noload_' semantics.

refactor: replace guts of 'Unpickler.noload'

3d574ff

Use current (Python 3.8) 'Unpickler.load' as a template, leaving the 'load_*' opcodes commented out for eyeball verification.

refactor: intern our own static string objects

25bcdba

The Python '_Py_ID' function is not accessible outside its own build.

refactor: use public module state APIs

98506d5

refactor: use public APIs for enter/leave recursive call

b0e135e

refactor: use public APIs for thread state

53d89b1

refactor: port protoc 4/5 from '_pickle_38.c'

988b54b

Most changes involve finding / passing the 'PickleState' pointer into 'noload' and the 'noload_*' opcode handlers. 'Unpickler.noload' is now declared using the required convention to be able to look up module state from the type.

tseaver requested review from davisagli, jimfulton and jamadden June 18, 2024 03:32

tseaver marked this pull request as draft June 18, 2024 03:32

icemac mentioned this pull request Jun 18, 2024

Status of 'noload' in Python 3 stdlib ? (i.e. let's try to kill this module for Python 3) #35

Open

gforcada reviewed Jun 18, 2024

View reviewed changes

src/zodbpickle/_pickle_38.c Outdated Show resolved Hide resolved

tseaver added 6 commits June 20, 2024 22:46

fix: typo

c3ac229

H/t @gforcada for the catch.

refactor: drop '_pickle_33.c' module

6010d0c

Build either '_pickle_38.c' or '_pickle_312.c', based on Python version.

chore: rename modules forked from Python 3.3 stdlib

91f6794

chore: fork 'pickle', 'pickletools' and related tests from Python 3.8.19

6b733a6

chore: fork 'pickle', 'pickletools' and related tests from Python 3.12.4

b3cac20

refactor: select module based on Python version

71a255a

Python < 3.12 uses the version from Python 3.8; Python >= 3.12 uses the version from 3.12. Note: this change does *not* yet fix up all the things in the forked modules: that bit is still in progress.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: fork py38 / py312 pickle C extensions #92

WIP: fork py38 / py312 pickle C extensions #92

tseaver commented Jun 18, 2024

icemac commented Jun 18, 2024

navytux commented Jun 18, 2024

tseaver commented Jun 18, 2024

tseaver commented Jun 18, 2024

navytux commented Jun 18, 2024

WIP: fork py38 / py312 pickle C extensions #92

Are you sure you want to change the base?

WIP: fork py38 / py312 pickle C extensions #92

Conversation

tseaver commented Jun 18, 2024

icemac commented Jun 18, 2024

navytux commented Jun 18, 2024

tseaver commented Jun 18, 2024

tseaver commented Jun 18, 2024

navytux commented Jun 18, 2024