Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-36694: Do not memoize temporary objects in the C implementation of pickle. #13036

Draft
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

serhiy-storchaka
Copy link
Member

@serhiy-storchaka serhiy-storchaka commented May 1, 2019

This produces more optimal pickle data and reduces memory consumption on
pickling and unpickling.

https://bugs.python.org/issue36694

…f pickle.

This produces more optimal pickle data and reduces memory consumption on
pickling and unpickling.
@@ -1601,15 +1601,15 @@ memo_get(PicklerObject *self, PyObject *key)
/* Store an object in the memo, assign it a new unique ID based on the number
of objects currently stored in the memo and generate a PUT opcode. */
static int
memo_put(PicklerObject *self, PyObject *obj)
memo_put(PicklerObject *self, PyObject *obj, int opt)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how you decide whether opt should be 0 or 1. What is the heuristic?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on this. This is critical behavior for cloudpickle :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting question. The rule is opt=0 for objects which save its content after saving itself. They are non-empty lists, sets, dicts and general objects with non-trivial elements 2-4 of the tuple returned by __reduce__(). They should be memoized to allow detecting reference loops.

@pitrou
Copy link
Member

pitrou commented May 10, 2019

Also @pierreglaser .

{
char pdata[30];
Py_ssize_t len;
Py_ssize_t idx;

const char memoize_op = MEMOIZE;

if (self->fast)
if (self->fast || (opt && Py_REFCNT(obj) == 1))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requiring Py_REFCNT(obj) to be 1 pretty strong right? Does this only affect objects created using the C API, i.e never bounded to python-level variables?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It mostly affects temporary objects. For example, if __reduce__ returns constructor, ((x, y),) then the tuple (x, y) will not be memoized. This is a case of namedtuples.

Another example: if you have a list of unique numbers, strings, tuples, etc, then items of the list will not be memoized as the only reference to the item is from the list.

@pitrou
Copy link
Member

pitrou commented May 31, 2019

I think this optimization should be restricted to well-known built-in types (tuples, etc.). Omitting arbitrary user objects risks opening regressions.

@serhiy-storchaka
Copy link
Member Author

For example?

@pitrou
Copy link
Member

pitrou commented May 31, 2019

I don't have any example, but I'm not confident that they don't exist. @pierreglaser mentioned cloudpickle, which heavily customizes pickling.

@serhiy-storchaka
Copy link
Member Author

cloudpickle customizes pickle using the Python implementation. This optimization is only for the C implementation.

@pitrou
Copy link
Member

pitrou commented May 31, 2019

No, cloudpickle will soon be subclassing from the C implementation.

@serhiy-storchaka
Copy link
Member Author

I do not see possibility of regressions.

@serhiy-storchaka
Copy link
Member Author

Well, maybe there are some differences with constructors with side effect. I'll try to write tests for this.

@igozali
Copy link

igozali commented Feb 19, 2023

I believe I'm also hitting this issue, curious about the remaining steps to push this patch forward? Is it just missing tests?

@serhiy-storchaka serhiy-storchaka marked this pull request as draft December 1, 2023 12:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants