-
-
Notifications
You must be signed in to change notification settings - Fork 31.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
datetime.__setstate__ fails decoding python2 pickle #66204
Comments
pickle.loads raises a TypeError when calling the datetime constructor, (then a UnicodeEncodeError in the load_reduce function). A short test program & the log, including dis output of both PY2 and PY3 pickles, are available in this gist; and extract on stackoverflow: I am using pickle.dumps(reply, protocol=2) in PY2 Native cPickle loads decoding fails too, I am only using pure python's _loads for debugging. Sorry if this is misguided (first time here) |
I have no idea what was done to pickle for Python3, but this line works for me to unpickle a Python2 protocol 2 datetime pickle under Python3, where P2 is the Python2 pickle string:
For example, >>> P2
'\x80\x02cdatetime\ndatetime\nq\x00U\n\x07Þ\x07\x12\r%%\x06á¸q\x01\x85q\x02Rq\x03.'
>>> pickle.loads(bytes(P2, encoding='latin1'), encoding='bytes')
datetime.datetime(2014, 7, 18, 13, 37, 37, 451000) I don't understand the Python3 loads() docs with respect to the "encoding" and "errors" arguments, and can't guess whether this is the intended way. It seems at best highly obscure. But hard to argue with something that works ;-) |
The code works when using encoding='bytes'. Thanks Tim for the suggestion. So this is not a bug, but is there any sense in having encoding='ASCII' by default in pickle ? |
@EddyGeek, I'd still call something so unintuitive "a bug" - it's hard to believe this is the _intended_ way to get it to work. So I'd keep this open until someone with better knowledge of intent chimes in. |
It is most definitely a bug. And it adds another road block to moving python applications from 2.7 to 3.x! encoding='bytes' has serious side effects and isn't useful in the general case. For instance, it will result in dict-keys being unpickled as bytes instead of as str after which hilarity ensues. I got the exception UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 1: ordinal not in range(128) when testing an application for compatibility in Python 3.5 on a pickle created by Python 2.7. The pickled data is a nested data structure and it took me quite a while to determine that the single datetime instance was the culprit. Here is a small test case that reproduces the problem:: # -*- coding: utf-8 -*-
# pickle_dump.py
import datetime, pickle, uuid
dti = datetime.datetime(2015, 10, 12, 13, 17, 42, 123456)
data = { "ascii" : "abc", "text" : u"äbc", "int" : 42, "date-time" : dti }
with open("/tmp/pickle.test", "wb") as file :
pickle.dump(data, file, protocol=2)
# pickle_load.py
# -*- coding: utf-8 -*-
import pickle
with open("/tmp/pickle.test", "rb") as file :
data = pickle.load(file)
print(data) $ python2.7 pickle_dump.py
$ python2.7 pickle_load.py
{'ascii': 'abc', 'text': u'\xe4bc', 'int': 42, 'date-time': datetime.datetime(2015, 10, 12, 13, 17, 42, 123456)}
$ python3.5 pickle_load.py
Traceback (most recent call last):
File "pickle_load.py", line 6, in <module>
data = pickle.load(file)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 1: ordinal not in range(128) That error message is spectacularly useless. |
There are two issues here.
The second issue usually hides the first one. The demonstration of the first issue: >>> pickle.loads(b'cdatetime\ndatetime\n(U\n\x07l\x01\x01\x00\x00\x00\x00\x00\x00tR.')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: an integer is required (got type str) The first issue can be solved by accepting str argument and encoding it to bytes. The second issue can be solved by changing an encoding or an error handler. Following patch uses the "surrogateescape" error handler. >>> pickle.loads(b'cdatetime\ndatetime\n(U\n\x07l\x01\x01\x00\x00\x00\x00\xc3\xa4tR.')
datetime.datetime(1900, 1, 1, 0, 0, 0, 50084) Unfortunately setting the "surrogateescape" error handler by default has a side effect. It can hide string decoding errors. In addition, unpickling datetime will not always work with different encodings. >>> pickle.loads(b'cdatetime\ndatetime\n(U\n\x07l\x01\x01\x00\x00\x00\x00\xc3\xa4tR.', encoding='latin1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-9: ordinal not in range(128)
>>> pickle.loads(b'cdatetime\ndatetime\n(U\n\x07l\x01\x01\x00\x00\x00\x00\xc3\xa4tR.', encoding='utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: an integer is required (got type str) |
A strong -1 from me. Accepting bytes objects for year in 3.x (and str in 2.x) is a gross hack. In the long run, I would like to see a public named constructor, e.g. datetime.datetime.load to be used in datetime pickles. Can someone explain succinctly what the problem is? Does it only affect pickles transferred between 2.x and 3.x Pythons? If not, how can I reproduce the problem in 3.5? |
The problem is that you can't unpickle a data that contains both datetime Yet one possible solution is to change datetime classes in 2.x to produce more |
I wonder if this can be fixed using a fix_imports hook. I agree, it would be nice to fix this issue by modifying 3.x versions only. |
If we are concerned about performance, we should definitely avoid the decode-encode roundtrip. |
Here is a patch against 2.7 that makes datetime pickling portable. It doesn't solve problem with existing pickled data, but at least it allows to convert existing pickled data with 2.7 to portable format. |
Here is alternative patch for 2.7. It makes datetime pickling produce the same data as in 3.x. The side effect of this approach: it makes datetime pickling incompatible with Unicode disabled builds of 2.x. |
IMNSHO, the problem lies in the Python 3 pickle.py and it is **not** restricted to datetime instances In Python 2, 8-bit strings are used for text and for binary data. Well designed applications will use unicode for all text, but Python 2 itself forces some text to be 8-bit string, e.g., names of attributes, classes, and functions. In other words, **any 8-bit strings explicitly created by such an application will contain binary data.** In Python 2, pickle.dump uses BINSTRING (and SHORT_BINSTRING) for 8-bit strings; Python 3 uses BINBYTES (and SHORT_BINBYTES) instead. In Python 3, pickle.load should handle BINSTRING (and SHORT_BINSTRING) like this:
It is only because of the use of 8-bit strings for Python 2 names that the mapping to I would propose to change def _decode_binstring(self, value):
# Used to allow strings from Python 2 to be decoded either as
# bytes or Unicode strings. This should be used only with the
# BINSTRING and SHORT_BINSTRING opcodes.
if self.encoding != "bytes":
try :
return value.decode("ASCII")
except UnicodeDecodeError:
pass
return value instead of the currently called |
Christian, I don't think your solution will work for date/time/datetime pickles. There are many values for which pickle payload consists of bytes within 0-127 range. IIUC, you propose to decode those to Python 3 strings using ASCII encoding. This will in turn require accepting str type in date/time/datetime constructors. |
Alexander Belopolsky wrote at Thu, 15 Oct 2015 17:56:42 +0000:
Hmmmm.
Yes. There are too many BINSTRING instances that need to be Python 3
These datetime... constructors are strange beasts already. The documentation says that three integer arguments are required for So yes, the only practical solution is to accept a single str typed The change I proposed in http://bugs.python.org/issue22005#msg253042 To summarize: IMHO the solution needs to be implemented in Python 3 — otherwise |
This issue is getting old. Is there any way to solve this for Python 3.6? |
TL;DR - Just one more example of why nobody should *ever* use pickle under any circumstances. It is useless for data that is not transient for consumption by the same exact versions of all software that created it. Patches against 2.7 are not useful here. Either we write a unpickle deserializer for python 2 datetime pickles that works for all existing previous datatime pickled data formats from Python 3. Or we close this as rejected because the data formats are rightly incompatible as the in-process object states are incompatible between the two versions. If you want to serialize something, use a language agnostic data format - ideally one with a defined schema. Never pickle. Advice for those who have stored such data in Python 2 pickles: Write a Python 2 program to read your data and rewrite it in a portable data format that has nothing to do with pickle. Anything else is a gross hack. |
NumPy starves from the same issue. In NumPy this problem was solved by requiring encoding='latin1' passed to unpickler. It makes sense to use the same approach for datetime classes. |
I'm not sure I agree with how this was resolved. We're adding complexity to the datetime unpickler to support unpickling pickles created in Python 2 in Python 3? I also don't really understand the encoding parts of it, but it smells very fishy to me. I agree with Gregory here, pickle has so many other problems when used between versions of Python that it's simply not useful for cross-version serialization. It is useful for things like inter-process communication between two trusted instances of Python programs running the same version. Also, what is the plan here for 2020+? Do we remove this hack for Python 3.9, or are we stuck with it indefinitely? |
This is the same hack as in NumPy, so we are at least consistent here. I think we have to keep it some time after 2020, maybe to 2025. |
@serhiy Any chance we can roll these back before the release so that they can have some time for discussion? I have serious concerns about having to support some Python 2/3 compatibility hack in datetime for the next 6 years. If this is worth doing at all, I think it can safely wait until the next release. |
This issue is already open for a long time. There is a problem which can not be worked around from the user side. I want to get it solved in 3.6, and today is the last chance for this. This is important for migrating from Python 2 to Python 3. You can open a discussion on Python-Dev, and if there will be significant opposition, this change can be reverted before releasing the final version of 3.6.8. |
I do not care enough about this to fight about it. The issue has been open long enough that I do not think it justified the urgency of rushing through a patch just before the release and merging without review, but now that it is in the release of multiple versions, I think we may be stuck with it. This *is* something that users can work around by not abusing pickle in this way and instead using a proper cross-platform serialization format. I realize that that makes it *more difficult* for some people to do so, but as Gregory points out, these people are doing dangerous stuff that will break in a way that we are not going to be willing or able to fix at some point *anyway*. |
Paul Ganssle wrote at Fri, 07 Dec 2018 17:22:36 +0000:
This is completely and utterly wrong, to put it mildly. The official documentation of the pickle module states (I checked 2.7
Considering that this issue is 4.5 years old, one would assume that the But my or your personal views about the usability of pickle don't I personally know about this kind of usage in applications since 1998. Have a nice day! |
It is fundamentally impossible for pickled data to magically cross the 2 and 3 language boundary unscathed. The basic str/bytes/unicode types in the language changed meaning. Code must be written manually by the data owners to fix that up based on what the types and encodings should actually be in various places given the language version the data is being read into. The code in the PRs for this bug appears to do that in the requisite required hacky manner for stored datetime instances. This fact isn't new. It happened 10 years ago with the release of Python 3.0. The documentation is not a contract. I'm fixing it to mention this. |
Serhiy: should this one be marked fixed? |
With Gregory's addition I think this issue can be considered fixed. Thank you Gregory. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: