Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: pickle support for StreamField #10654

Closed
wants to merge 14 commits into from

Conversation

pySilver
Copy link
Contributor

@pySilver pySilver commented Jul 8, 2023

This is a rework of my original PR #5998 where I described the problem of lack of pickle support. Sadly the implementation is a little bit ugly but after many hours of thinking I didn't find any other way to make it work. I'm open to discussing any other solutions.

Please check the following:

  • Do the tests still pass?[^1]
  • Does the code comply with the style guide?
    • Run make lint from the Wagtail root.
  • For Python changes: Have you added tests to cover the new/fixed behaviour?

@squash-labs
Copy link

squash-labs bot commented Jul 8, 2023

Manage this branch in Squash

Test this branch here: https://pysilverfix-pickle-support-71fg9.squash.io

@pySilver
Copy link
Contributor Author

pySilver commented Jul 8, 2023

I will take a look if that is possible to add pickle support for Block as it is the blocker for implementation without passing field reference.

@pySilver
Copy link
Contributor Author

pySilver commented Jul 8, 2023

Ok, with some help of ChatGPT, I was able to make StreamField compatible with pickle in a proper, clean way. I assume it is now safe to merge.

@@ -530,40 +530,57 @@ def _get_callable_choices(self, choices, blank_choice=True):
choices list with the addition of a blank choice (if blank_choice=True and one does not
already exist).
"""
return self.choices_callable(choices, blank_choice)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem right... the old version of this method defined choices_callable (a function that takes no arguments and returns a list when called), and returns that callable. Here, you've made choices_callable into a method, and you're calling that method immediately - which means that _get_callable_choices will now return a list, rather than a callable.

I suspect that this change will cause a callable choices argument to be evaluated once on initialisation, rather than being evaluated each time the widget is rendered, as we want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll verify that and make proper modifications. Callables are tricky to serialize, so method has to be refactored a little bit anyway (so callable isn't nested in a method)

@@ -1800,3 +1802,40 @@ def test_post_with_comment_notifications_switched_off(self):

self.assertEqual(subscription.user, self.user)
self.assertFalse(subscription.comment_notifications)


class TestPagePickleSupport(WagtailTestUtils, TestCase):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test would be better placed in wagtail.tests.test_streamfield since it's not testing anything to do with the admin module (and the self.login() is redundant).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I was looking for a proper place. I'll move it there.

@gasman
Copy link
Collaborator

gasman commented Jul 12, 2023

Unfortunately I don't think the issue with callable choices is going to be solvable without a complete change of approach here. If the internal state of BaseChoiceBlock needs to be included as part of the pickled representation, then there will be situations where that internal state will contain a passed-in choices=lambda: ... argument that can't be pickled. If you find a solution that seems to work, it's probably because you've evaluated that callable and stored the result, instead of holding on to the callable itself.

I think the only reasonable way forward here is to somehow exclude the stream_block attribute from StreamValue's pickled representation, so that we're only keeping the stream data, not the corresponding definition. This would be in line with how pickle is intended to work, as per https://docs.python.org/3/library/pickle.html#what-can-be-pickled-and-unpickled:

Similarly, when class instances are pickled, their class’s code and data are not pickled along with them. Only the instance data are pickled. This is done on purpose, so you can fix bugs in a class or add methods to the class and still load objects that were created with an earlier version of the class.

Just as you'd expect to pickle a Python object, stop the program to make some changes to its class definition, start it again, unpickle the object and end up with an object that behaves according to the updated class definition - it seems logical that if you make any changes to a StreamField's definition, any unpickled StreamValues should then follow the new definition, not an older one that was frozen along with the data.

Unfortunately, I can't see a good way of detaching the stream_block attribute and reattaching it on un-pickling. For that to work, Django's Model class would need to provide some way for Fields to participate in the pickling process - essentially, being able to ask StreamField "here's the value I got from self.body - can you turn this into something pickle-safe?" and then later "here's the pickle-safe value I've got - can you turn this back into a usable StreamValue I can assign back to self.body?". Sadly, that mechanism doesn't seem to exist :-(

@pySilver
Copy link
Contributor Author

pySilver commented Jul 13, 2023

@gasman thanks for such a detailed review. You are right - it is impossible to pickle choice blocks in the current implementation. The only viable solution is to make a link between StreamField and StreamValue. It's actually my initial approach and the one I'm using for the last two years. I've moved tests into a proposed location and also made ugliness a little less ugly. Looking forward to your review. It might look like a hack but for ones who don't need pickle it doesn't break anything.

@pySilver pySilver requested a review from gasman July 13, 2023 02:56
@pySilver
Copy link
Contributor Author

pySilver commented Jul 13, 2023

Side note: Using dill instead of pickle solves the issue without any code modification at all. However, it requires some tooling adjustments. It may be as easy as writing a custom cache backend or providing a custom serializer for your cache app.

For example:

dill seems to be popular but I've never used this in production and I'm not sure about speed comparison vs pickle

https://github.com/uqfoundation/dill

Updated with some benchmarks:

In [1]: import dill

In [2]: import pickle

In [3]: obj=['{}'.format(n) for n in range(50000)]
   ...: obj
   ...:
   ...: import dill
   ...: %timeit len(dill.dumps(obj))
76.6 ms ± 758 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit len(pickle.dumps(obj))
2.02 ms ± 42.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

another one: uqfoundation/dill#292

I think it can be heavy on CPU ... So no. Let's support pickle.

@pySilver
Copy link
Contributor Author

@gasman have you had a chance to look at the latest changes?

@gasman
Copy link
Collaborator

gasman commented Jul 17, 2023

@pySilver I've had a quick look, but not a full review yet - there are a lot of details here that will need careful consideration, like what happens to inner StreamValues from StreamBlocks nested inside other blocks - presumably those won't have the StreamField reference, and don't need it, but in that case the pickling logic needs to guard against that reference not existing.

I'm really not keen on having that reference to StreamField there - it's creating a coupling between the Django ORM and the streams/blocks framework that's been mostly independent of that up to now - and I'd like to spend some more time investigating whether it's possible to avoid that.

Incidentally, I don't think that moving the assignment into a decorator (4cbcf66) is really an improvement - it's just adding an extra step of indirection that you have to follow to fully understand the code, when it could just be a single line assignment within to_python. I'd also suggest changing _StreamFieldRef to _stream_field - an object attribute that references some other object is totally normal in Python, and doesn't need to be called out with a non-standard name.

@pySilver
Copy link
Contributor Author

@gasman thanks for your input! I've simplified the code per your recommendations.

@pySilver I've had a quick look, but not a full review yet - there are a lot of details here that will need careful consideration, like what happens to inner StreamValues from StreamBlocks nested inside other blocks - presumably those won't have the StreamField reference, and don't need it, but in that case the pickling logic needs to guard against that reference not existing.

Well, let's start with why we are doing this. We are doing this in order to support much-wanted cache support for Django ORM. I haven't seen requests to pickle StreamValue objects on it's own but there were plenty of issues for users like myself that had problems with popular orm cache apps.

Will it fail if we try to pickle StreamValue instance without StreamField reference? Yes, indeed. As much as it will fail without this PR, due to the fact that BaseChoiceBlock is constructed in a way it builds callable in runtime and we didn't find any good way to refactor it.

Do we need some guarding code in case _stream_field reference is not found? Honestly, I think we don't. It will fail anyway. Ok. We can make a descriptive error telling the dev why it did happen. But well it is in the code anyway.

I am also not very happy about this implementation but this is probably the only way at the moment. At least if we want to continue wrapping even static choices into a callable.

@gasman
Copy link
Collaborator

gasman commented Jul 18, 2023

Thanks for the clarification! I was under the impression that pickling a StreamValue would require recursively pickling its children, which meant there would be a possibility of encountering another StreamValue further down - but now I see that the call to get_prep_value ensures that it'll only have to pickle the basic JSON-friendly data types.

@pySilver
Copy link
Contributor Author

@gasman What can I do to make this PR happen? :)

@pySilver
Copy link
Contributor Author

Hey 👋,

Just a quick nudge on this PR. I get it is not ideal solution but I don't see how we can improve while maintaining callable choices wrapper. Chances we will make it to the master? I see people are already building their images with this patch, I myself currently use it with monkey patch with no problems.

gasman pushed a commit that referenced this pull request Aug 21, 2023
@gasman
Copy link
Collaborator

gasman commented Aug 21, 2023

Sorry for the delay in getting back to this! Have merged in eadf9a6 (with a couple of tweaks - some additional comments / error-checking, and reverting the addition of the callable choices block to the test StreamPage, which I don't think is needed now that we're not attempting to pickle the StreamBlock definition).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

3 participants