Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Difference between pickle.py and _pickle for certain strings #113028

Closed
jeff5 opened this issue Dec 12, 2023 · 3 comments · Fixed by #113436
Closed

Difference between pickle.py and _pickle for certain strings #113028

jeff5 opened this issue Dec 12, 2023 · 3 comments · Fixed by #113436
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@jeff5
Copy link
Contributor

jeff5 commented Dec 12, 2023

Bug report

Bug description:

There is a logical error in pickle.Pickler.save_str for protocol 0, such that it repeats pickling of a string object each time it is presented. The design clearly intends to re-use the first pickled representation, and the C-implementation _pickle does that.

In an implementation that does not provide a compiled _pickle (PyPy may be one) this is inefficient, but not actually wrong. The intended behaviour occurs with a simple string:

>>> s = "hello"
>>> pickle._dumps((s,s), 0)
b'(Vhello\np0\ng0\ntp1\n.'

When read by loads() this string says:

  1. stack "hello",
  2. save a copy in memory 0,
  3. stack the contents of memory 0,
  4. make a tuple from the stack,
  5. save a copy in memory 1.

The bug emerges when the pickled string needs pre-encoding:

>>> s = "hello\n"
>>> pickle._dumps((s,s), 0)
b'(Vhello\\u000a\np0\nVhello\\u000a\np1\ntp2\n.'

Here we see identical data stacked and saved (but not used). The problem is here:

cpython/Lib/pickle.py

Lines 860 to 866 in 42a86df

obj = obj.replace("\\", "\\u005c")
obj = obj.replace("\0", "\\u0000")
obj = obj.replace("\n", "\\u000a")
obj = obj.replace("\r", "\\u000d")
obj = obj.replace("\x1a", "\\u001a") # EOF on DOS
self.write(UNICODE + obj.encode('raw-unicode-escape') +
b'\n')

where the return from obj.replace may be a different object from obj. In CPython, that is only if a replacement takes place, which is why the problem only appears in the second case above.

save_str is only called when the object has not already been memoized, but in the cases at issue, the string memoized is not the original object, and so when the original string object is presented again, save_str is called again.

Depending upon the detailed behaviour of str.replace (in particular, if you decide to return an interned value when the result is, say, a Latin-1 character) an assertion may fail in memoize():

cpython/Lib/pickle.py

Lines 504 to 507 in 42a86df

assert id(obj) not in self.memo
idx = len(self.memo)
self.write(self.put(idx))
self.memo[id(obj)] = idx, obj
I have not managed to trigger an AssertionError in CPython.

This has probably gone unnoticed so long only because pickle.py is not tested. (At least, I think it isn't. #105250 and #53350 note this coverage problem.)

CPython versions tested on:

3.11

Operating systems tested on:

Windows

Linked PRs

@jeff5 jeff5 added the type-bug An unexpected behavior, bug, or error label Dec 12, 2023
@Eclips4 Eclips4 added the stdlib Python modules in the Lib dir label Dec 12, 2023
@jeff5
Copy link
Contributor Author

jeff5 commented Dec 12, 2023

I'd happily provide a PR. It seems only necessary to introduce a temporary variable so the original obj is memoized unchanged, while the pickled text is emitted as now. (Only the id matters if I've understood correctly.)

@serhiy-storchaka
Copy link
Member

I'll be glad to review your PR.

@jeff5
Copy link
Contributor Author

jeff5 commented Dec 22, 2023

#88073 is a similar problem solved in #25501 and I propose to follow its approach to a focused test.

The alternative of adding to create_data feels like overloading a basic functional test with corner cases.

jeff5 added a commit to jeff5/cpython that referenced this issue Dec 23, 2023
This fixes a divergence between the Python and C implementations of pickle
for protocol 0, such that it pickle.py fails to re-use the first pickled
representation of strings involving characters that have to be escaped.
serhiy-storchaka pushed a commit that referenced this issue Dec 24, 2023
)

This fixes a divergence between the Python and C implementations of pickle
for protocol 0, such that it pickle.py fails to re-use the first pickled
representation of strings involving characters that have to be escaped.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Dec 24, 2023
…ythonGH-113436)

This fixes a divergence between the Python and C implementations of pickle
for protocol 0, such that it pickle.py fails to re-use the first pickled
representation of strings involving characters that have to be escaped.
(cherry picked from commit 0839863)

Co-authored-by: Jeff Allen <ja.py@farowl.co.uk>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Dec 24, 2023
…ythonGH-113436)

This fixes a divergence between the Python and C implementations of pickle
for protocol 0, such that it pickle.py fails to re-use the first pickled
representation of strings involving characters that have to be escaped.
(cherry picked from commit 0839863)

Co-authored-by: Jeff Allen <ja.py@farowl.co.uk>
serhiy-storchaka pushed a commit that referenced this issue Dec 24, 2023
…H-113436) (GH-113448)

This fixes a divergence between the Python and C implementations of pickle
for protocol 0, such that it pickle.py fails to re-use the first pickled
representation of strings involving characters that have to be escaped.
(cherry picked from commit 0839863)

Co-authored-by: Jeff Allen <ja.py@farowl.co.uk>
serhiy-storchaka pushed a commit that referenced this issue Dec 24, 2023
…H-113436) (GH-113449)

This fixes a divergence between the Python and C implementations of pickle
for protocol 0, such that it pickle.py fails to re-use the first pickled
representation of strings involving characters that have to be escaped.
(cherry picked from commit 0839863)

Co-authored-by: Jeff Allen <ja.py@farowl.co.uk>
ryan-duve pushed a commit to ryan-duve/cpython that referenced this issue Dec 26, 2023
…ythonGH-113436)

This fixes a divergence between the Python and C implementations of pickle
for protocol 0, such that it pickle.py fails to re-use the first pickled
representation of strings involving characters that have to be escaped.
kulikjak pushed a commit to kulikjak/cpython that referenced this issue Jan 22, 2024
…ythonGH-113436)

This fixes a divergence between the Python and C implementations of pickle
for protocol 0, such that it pickle.py fails to re-use the first pickled
representation of strings involving characters that have to be escaped.
aisk pushed a commit to aisk/cpython that referenced this issue Feb 11, 2024
…ythonGH-113436)

This fixes a divergence between the Python and C implementations of pickle
for protocol 0, such that it pickle.py fails to re-use the first pickled
representation of strings involving characters that have to be escaped.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants