Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: use object.__new__(ak.Array) for pickling constructor #2113

Merged
merged 6 commits into from
Jan 12, 2023

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Jan 12, 2023

When pickling an Awkward Array, the use of pickle's __getstate__ interface means that pickle stores a reference to type(array) in the ensuing bytestream. For arrays with a behavior class, this might not be available at unpickle time. e.g. if the author adds their class in ak.behavior, it will not be found in a new session during unpickling, unless the same registration is performed. Whilst this is a somewhat sensible constraint, we should not needlessly force this upon the user.

Conventionally, arrays can have __name__s that aren't found in their respective behaviours; these aren't currently errors, so we shouldn't fail to unpickle them. We can fix this situation this by removing the reference to the type(array) in favour of object.__new__(ak.Array), so that only the behavior member references the class.

Note that arrays with local behaviors (ak.Array(..., behavior=...)) will always fail to unpickle in the event that the references inside the behaviour can't be resolved!

Fixes #2106

@jpivarski
Copy link
Member

You just started this, but remember we want to future-proof pickle-loading by making the __setstate__ unpack an open-ended number of tuple items?

form, length, container, behavior = state

and

form, length, container, behavior, at = state

with

form, length, container, behavior, *args = state   # including at for ak.Record

or

form = state[0]
length = state[1]
container = state[2]
behavior = state[3]
# including at for ak.Record

(which Python versions accept the *args syntax?)

This is not the time to introduce a version number, but making as many old versions (starting with this one) insensitive to extra arguments would make it possible to add a version number at any time.

@codecov
Copy link

codecov bot commented Jan 12, 2023

Codecov Report

Merging #2113 (4a93567) into main (6efa287) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/highlevel.py 76.21% <100.00%> (ø)
src/awkward/_do.py 84.30% <0.00%> (ø)
src/awkward/contents/content.py 73.13% <0.00%> (ø)
src/awkward/contents/listarray.py 90.08% <0.00%> (ø)
src/awkward/contents/emptyarray.py 72.22% <0.00%> (ø)
src/awkward/contents/unionarray.py 84.39% <0.00%> (ø)
src/awkward/contents/recordarray.py 84.05% <0.00%> (ø)
src/awkward/contents/indexedarray.py 77.66% <0.00%> (ø)
src/awkward/contents/regulararray.py 88.47% <0.00%> (ø)
src/awkward/contents/bitmaskedarray.py 68.02% <0.00%> (ø)
... and 5 more

@agoose77 agoose77 temporarily deployed to docs-preview January 12, 2023 20:04 — with GitHub Actions Inactive
@agoose77
Copy link
Collaborator Author

You just started this, but remember we want to future-proof pickle-loading by making the __setstate__ unpack an open-ended number of tuple items?

You read my mind! (I stopped for food)

Yes, a version number is useful for backward compatibility, whilst this is useful for forwards compatibility.

Take the first `N` state values
@agoose77 agoose77 temporarily deployed to docs-preview January 12, 2023 20:26 — with GitHub Actions Inactive
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might still be issues pickling behaviors with lambda functions in them, but that's to be expected. For that reason, the standard behaviors all have names in the Awkward package (and in Vector).

I always get confused about the differences between __reduce__ and __getstate__, and I prefer __getstate__ because it is symmetric with __setstate__. But presumably, you have a reason for switching to __reduce__ in this one, perhaps to be able to control the __new__ method that gets used.

As for future-proofing the state unpacking, I just verified that the *_ syntax was introduced with Python 3.0 (PEP 3132), so it's definitely safe to use. Although our tests would have revealed it if it wasn't, since we test every Python version we support.

I'll set this to auto-merge!

@jpivarski jpivarski merged commit 97d813a into main Jan 12, 2023
@jpivarski jpivarski deleted the agoose77/fix-test-pickle branch January 12, 2023 20:37
@agoose77
Copy link
Collaborator Author

agoose77 commented Jan 12, 2023

There might still be issues pickling behaviors with lambda functions in them, but that's to be expected. For that reason, the standard behaviors all have names in the Awkward package (and in Vector).

Good Practice™!

I always get confused about the differences between __reduce__ and __getstate__, and I prefer __getstate__ because it is symmetric with __setstate__. But presumably, you have a reason for switching to __reduce__ in this one, perhaps to be able to control the __new__ method that gets used.

__getstate__ leads to Python including type(self) in the pickle stream, which is MyCustomType for any Array with a behavior class. This means that unpickling will always fail if the custom type cannot be resolved. However, by using ___reduce__, we can tell Python to use a different constructor function, in this case object.__new__(Array, ...). This means that the aforementioned case would not fail during unpickling provided that the user used ak.behavior rather than a custom behavior object; in the former case, we don't include the behavior dict in the pickle.

We could also have written

def _array_new():
    return object.__new__(Array)

but I can't see any motivation for it.

As for future-proofing the state unpacking, I just verified that the *_ syntax was introduced with Python 3.0 (PEP 3132), so it's definitely safe to use. Although our tests would have revealed it if it wasn't, since we test every Python version we support.

Yep!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pickle of array includes array.__class__
2 participants