Pickle stream for unicode object may contain non-ASCII characters. #47229
I'm not sure if this is a functionality or documentation bug. The docs say in section 13.1.2, Data stream format
I took that to mean that only ASCII characters ever appear in the pickle
>>> print [ord(c) for c in pickle.dumps(u'á')]
[86, 225, 10, 112, 48, 10, 46] |
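The observation above can also be checked on Python 3, where protocol 0 is no longer the default and has to be requested explicitly. This is a minimal sketch, assuming a Python 3 interpreter rather than the Python 2 session quoted in the report; the variable names are illustrative only.

```python
import pickle

# Protocol 0 is the "text" protocol; request it explicitly on Python 3.
data = pickle.dumps("á", protocol=0)

# Collect any byte values outside the 7-bit ASCII range.
non_ascii = [b for b in data if b > 127]

print(list(data))
print(non_ascii)

# The stream still round-trips, even though it is not pure ASCII.
assert pickle.loads(data) == "á"
```

A non-empty `non_ascii` list means the protocol 0 stream contains bytes that a strict ASCII channel would reject.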
Only pickle protocol 0 is ASCII. The other two are binary protocols. Protocol 2 is default in Python 2.5. This should probably be made clear in the documentation, so I'd consider |
Actually, I was wrong: protocol 0 is the default if you don't specify This sets the binary flag to false, which should result in ASCII-only data. The Unicode save routine uses the raw-unicode-escape codec, but this Not sure what to do about this: we can't change the protocol anymore and Perhaps we just need to remove the ASCII note from the documentation |
I think the documentation is fine as it stands. The format is ASCII - |
I can't follow you, Martin. How can a data format be printable ASCII and at the same time use |
The "format" is the frame defining the structure. In the binary |
On 2008-05-28 00:21, Martin v. Löwis wrote:
I think there's a misunderstanding there. The pickle version 0 While adding the Unicode support I must have forgotten about the That's why I think we should update the docs. |
Actually, there is a way to fix that: pickle could start |
We could add an extra step to also escape range(128, 256) code points, Note that this was the first time anyone has ever noticed the fact that |
Your reasoning shows a lack of understanding of how Python is actually used Why do you think that "noticing" a problem is the same thing as entering You are yourself reluctant to seek out the roots of this problem and fix The capability to serialize stuff into ASCII strings isn't just an The solution to change the documentation is in practice breaking Well, nobody has reported it as a bug in 8 years. How long do you think It is difficult to grasp why there is "no way to fix it now". From a Perhaps it is the raw-unicode-escape encoding that should be fixed? I
python test.py (executes silently without errors) But for raw-unicode-escape the outcome is a different thing:
python test.py File "test.py", line 1 Huh? For someone who trusts the Standard Encodings section Python |
On 2008-10-21 11:22, Dan Dibagh wrote:
Hmm, I've been using Python for almost 15 years now and do believe Note that we cannot change the pickle format retroactively, since What we could do is add a new pickle format which then escapes all Note that the common way of dealing with binary data in ASCII streams
Right.
Because the raw-unicode-escape codec won't escape the \x80 character, But this is off-topic w/r to the issue in question. |
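The codec behaviour under discussion can be seen directly. A small sketch, assuming Python 3 and using only the standard `raw-unicode-escape` and `unicode-escape` codecs: the former leaves code points in range(128, 256) as single raw bytes, while the latter escapes them. The variable names are illustrative.

```python
# U+00E1 is in range(128, 256); U+20AC is above 255.
latin1_char = "\u00e1"
euro = "\u20ac"

raw = latin1_char.encode("raw-unicode-escape")
escaped = latin1_char.encode("unicode-escape")

print(raw)      # b'\xe1'   (one non-ASCII byte)
print(escaped)  # b'\\xe1'  (four ASCII bytes: backslash, x, e, 1)

# Code points above 255 are escaped by both codecs.
euro_raw = euro.encode("raw-unicode-escape")
print(euro_raw)  # b'\\u20ac'
```

The `unicode-escape` call is shown only to illustrate what an escaped form looks like; it is not necessarily the change that was proposed for pickle.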
I am well aware why my example produces an error from a technical Or in other words: what is the high level reason why the codec won't To use a real-world term; an interface specification, in this case the What makes you think that the problem cannot be fixed without changing Note that base64 is "a common" way to deal with binary data in ascii |
Which PEP specifically? PEP-263 only mentions the unicode-escape
What code are you looking at, and where do you find it difficult to
The raw-unicode-escape codec? It was designed to support parsing of Even though the choice was arbitrary, you shouldn't change it now,
Applications might rely on what was implemented rather than what was So I personally don't see a problem with fixing this, but it appears So contributions are welcome. If you find that the patch meets |
On 2008-10-22 01:34, Martin v. Löwis wrote:
I've had a look at the implementations used in both pickle.py So +0 on adding the extra escapes for range(128,256) code Still, IMHO, all of this is not worth the effort, since protocol |
PEP-100 and PEP-263. What I looked for was a description of the
Since it is the issue with non-ASCII characters in pickle output I look I look at the function PyUnicode_EncodeRawUnicodeEscape in
I suppose you mean symmetric with decoding as long as you stick to the When PEP-263 came into the picture, wouldn't it have made sense to
Then let me ask: How far-reaching is the aim to maintain compatibility In the case of those who have implemented their own pickle readers, the At the other end of the spectrum there are correct programs which depend
I consider doing a patch. I also understand that in order for the patch |
I ran into this problem today when writing python data structures into a I use pickle+base64 now, however, this makes debugging more difficult. Anyway, I think that the docs should clearly say that protocol 8 is not |
"protocol 8" --> "protocol 0" of course. |
On 2009-01-21 16:43, Torsten Bronger wrote:
Databases can handle binary data just fine, so pickle protocol 2 If you require ASCII-only data, you can also use pickle protocol 2,
That sounds like an issue with Django - it shouldn't try to convert |
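One workaround mentioned in this thread is to wrap a binary pickle in base64 when the storage layer needs ASCII-only text. A minimal sketch, assuming Python 3; the helper names `dumps_ascii` and `loads_ascii` are hypothetical and not part of the standard library.

```python
import base64
import pickle


def dumps_ascii(obj):
    """Pickle obj with protocol 2 and return an ASCII-only str (hypothetical helper)."""
    return base64.b64encode(pickle.dumps(obj, protocol=2)).decode("ascii")


def loads_ascii(text):
    """Inverse of dumps_ascii (hypothetical helper)."""
    return pickle.loads(base64.b64decode(text.encode("ascii")))


payload = dumps_ascii({"name": "á"})

# Every character of the payload is 7-bit ASCII, and the data round-trips.
assert all(ord(c) < 128 for c in payload)
assert loads_ascii(payload) == {"name": "á"}
```

The trade-off is that the stored text is no longer readable by eye, which is the debugging drawback raised earlier in the thread.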
Well, Django doesn't store binary data at all but wants you to store a) the docs strongly suggest that protocol 0 is ASCII-only and this b) currently, there is no way in the standard lib to serialise data in a Probably b) is not important. *I* want to have it currently but this |
Same issue with Django here ;-) I wouldn't mind a protocol 3 that does <128 ascii only. If only because |
Is there any reason that prevents you from debugging your pickle using pickle |
The "problem" is the pickle result. It's not about debugging the |
If your data is simple enough, you can use JSON. It has an |
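For simple data, the JSON route can be sketched as follows, assuming Python 3. `json.dumps` escapes non-ASCII characters by default because its `ensure_ascii` parameter defaults to `True`; the sample dictionary is illustrative.

```python
import json

text = json.dumps({"name": "á"})
print(text)  # {"name": "\u00e1"}

# The serialized text is pure ASCII and decodes back to the original data.
assert all(ord(c) < 128 for c in text)
assert json.loads(text) == {"name": "á"}
```

Unlike pickle, this only covers types that JSON can represent (dicts, lists, strings, numbers, booleans and `None`).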
This can no longer be a 2.5 issue but I am not sure how to update it. OP apparently opened it as a feature request, so I did update it to 3.2. But OP then says "I'm not sure if this is a functionality or documentation bug." and indeed subsequent messages debate this issue. This would mean it could apply to earlier versions, if re-typed. On the other hand, there seems to be some opinion that there is no bug, or if there is/was, it cannot be fixed, which would mean this should be closed. Also, the docs seem to have already been changed, so if that were the issue, this is fixed and should be closed: |
Terry J. Reedy wrote:
I'd suggest to close the ticket. The main idea behind version 0 was to have a readable format. The |
Just ran into this problem using Python 2.7.3 and the issue others mention in conjunction with Django. Note the 2.7 docs still imply it's ASCII: http://docs.python.org/2/library/pickle.html#data-stream-format It has a weak caveat "(and of some other characteristics of pickle's representation)", but if you only skim-read the bullet points below you'll miss that. Yes, I will use base64 to get around this, but the point is the documentation is still unclear and should probably completely remove the reference to ASCII in favour of "human-readable"... or even better, explicitly mention what will happen with unicode. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.