
Pickle stream for unicode object may contain non-ASCII characters. #47229

Closed
mawbid mannequin opened this issue May 27, 2008 · 26 comments
Assignees
Labels
docs Documentation in the Doc dir stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments


mawbid mannequin commented May 27, 2008

BPO 2980
Nosy @malemburg, @loewis, @birkenfeld, @terryjreedy, @pitrou, @avassalotti

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = 'https://github.com/birkenfeld'
closed_at = <Date 2010-07-11.01:00:12.554>
created_at = <Date 2008-05-27.15:38:13.604>
labels = ['type-feature', 'library', 'docs']
title = 'Pickle stream for unicode object may contain non-ASCII characters.'
updated_at = <Date 2012-12-12.04:55:11.881>
user = 'https://bugs.python.org/mawbid'

bugs.python.org fields:

activity = <Date 2012-12-12.04:55:11.881>
actor = 'joelpitt'
assignee = 'georg.brandl'
closed = True
closed_date = <Date 2010-07-11.01:00:12.554>
closer = 'terry.reedy'
components = ['Documentation', 'Library (Lib)']
creation = <Date 2008-05-27.15:38:13.604>
creator = 'mawbid'
dependencies = []
files = []
hgrepos = []
issue_num = 2980
keywords = []
message_count = 26.0
messages = ['67410', '67421', '67422', '67425', '67432', '67434', '67436', '67437', '67631', '75021', '75022', '75055', '75058', '75070', '75161', '80330', '80331', '80334', '80337', '86294', '86329', '86331', '86334', '109671', '109688', '177364']
nosy_count = 11.0
nosy_names = ['lemburg', 'loewis', 'georg.brandl', 'terry.reedy', 'pitrou', 'bronger', 'alexandre.vassalotti', 'mawbid', 'dddibagh', 'wdoekes', 'joelpitt']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue2980'
versions = ['Python 3.2']


mawbid mannequin commented May 27, 2008

I'm not sure if this is a functionality or documentation bug.

The docs say in section 13.1.2, Data stream format
(http://docs.python.org/lib/node315.html):
"By default, the pickle data format uses a printable ASCII representation."

I took that to mean that only ASCII characters ever appear in the pickle
output, but that's not true.

>>> print [ord(c) for c in pickle.dumps(u'á')]
[86, 225, 10, 112, 48, 10, 46]
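The same behaviour can still be observed on Python 3, where protocol 0 must be requested explicitly (illustrative check, not part of the original report):

```python
import pickle

# Protocol 0 is the "human-readable" text protocol, yet a Latin-1
# character such as 'á' (U+00E1) passes through as the raw byte 0xE1.
data = pickle.dumps('á', protocol=0)
print(list(data))
print(any(b > 127 for b in data))  # True: the stream is not pure ASCII
```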

@mawbid mawbid mannequin assigned birkenfeld May 27, 2008
@mawbid mawbid mannequin added docs Documentation in the Doc dir stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels May 27, 2008
@malemburg (Member) commented:

Only pickle protocol 0 is ASCII. The other two are binary protocols.

Protocol 2 is default in Python 2.5.

This should probably be made clear in the documentation, so I'd consider
this a documentation bug.

@malemburg (Member) commented:

Actually, I was wrong: protocol 0 is the default if you don't specify
the protocol.

This sets the binary flag to false, which should result in ASCII-only data.

The Unicode save routine uses the raw-unicode-escape codec, but this
only escapes non-Latin-1 characters and allows non-ASCII Latin-1
characters to pass through.

Not sure what to do about this: we can't change the protocol anymore and
all higher protocol levels are binary already.

Perhaps we just need to remove the ASCII note from the documentation
altogether and only leave the "human readable form" comment?!
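The pass-through behaviour described here is easy to verify (Python 3 spelling of the codec name):

```python
# raw-unicode-escape escapes only code points at U+0100 and above;
# Latin-1 code points (U+0080..U+00FF) are emitted as raw bytes.
print('á'.encode('raw_unicode_escape'))       # b'\xe1'   -> non-ASCII byte
print('\u0100'.encode('raw_unicode_escape'))  # b'\\u0100' -> ASCII escape
```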


loewis mannequin commented May 27, 2008

I think the documentation is fine as it stands. The format is ASCII -
even though the payload might not be.

@malemburg (Member) commented:

I can't follow you, Martin.

How can a data format be printable ASCII and at the same time use
non-ASCII characters ?


loewis mannequin commented May 27, 2008

How can a data format be printable ASCII and at the same time use
non-ASCII characters ?

The "format" is the frame defining the structure. In the binary
formatter, it's a binary format. In the standard pickle format,
it's ASCII (I for int, S for string, and so on, line-separated).
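The ASCII framing is easy to see for an all-ASCII payload (illustrative, Python 3 syntax):

```python
import pickle

# In a protocol 0 stream the framing opcodes are printable ASCII:
# '(' MARK, 'l' LIST, 'I' INT, 'V' UNICODE, 'p' PUT, '.' STOP.
stream = pickle.dumps([1, 'abc'], protocol=0)
print(stream.decode('ascii'))
```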

@malemburg (Member) commented:

On 2008-05-28 00:21, Martin v. Löwis wrote:

Martin v. Löwis <martin@v.loewis.de> added the comment:

> How can a data format be printable ASCII and at the same time use
> non-ASCII characters ?

The "format" is the frame defining the structure. In the binary
formatter, it's a binary format. In the standard pickle format,
it's ASCII (I for int, S for string, and so on, line-separated).

I think there's a misunderstanding there. The pickle version 0
output used to be 7-bit only for both type code and content.

While adding the Unicode support I must have forgotten about the
fact that raw-unicode-escape does not escape range(128, 256) code
points. Unfortunately, there's no way to fix this now, since the
bug has been around since Python 1.6.

That's why I think we should update the docs.


loewis mannequin commented May 27, 2008

Unfortunately, there's no way to fix this now, since the
bug has been around since Python 1.6.

Actually, there is a way to fix that: pickle could start
emitting \u escapes for characters in the range 128..256.
Older pickle implementations would be able to read that
in just fine.
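Martin's suggestion can be sketched as a post-processing pass over a protocol 0 stream. This is purely illustrative: the function name is invented, and a real fix would escape only the payload of the V opcode, since other opcodes use different escape conventions.

```python
import re

def escape_latin1(stream: bytes) -> bytes:
    # Hypothetical extra pass: replace raw bytes 0x80-0xFF with \u00xx
    # escapes, which raw-unicode-escape decoding already understands.
    return re.sub(rb'[\x80-\xff]',
                  lambda m: b'\\u%04x' % m.group()[0],
                  stream)

# b'V\xe1\np0\n.' is the protocol 0 pickle of u'\xe1'
print(escape_latin1(b'V\xe1\np0\n.'))  # b'V\\u00e1\np0\n.'
```

Older readers decode the V payload with raw-unicode-escape, so the escaped form still loads as the same string.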

@malemburg (Member) commented:

We could add an extra step to also escape range(128, 256) code points,
but I don't think it's worth the performance loss this would cause.

Note that this is the first time in 8 years that anyone has noticed
that pickle protocol 0 is not pure ASCII. I think it's better to just
adapt the documentation and remove the "ASCII". The important feature
of protocol 0 is being human readable (to some extent), not that it's
pure ASCII.


dddibagh mannequin commented Oct 21, 2008

Your reasoning shows a lack of understanding of how Python is actually
used from a programmer's point of view.

Why do you think that "noticing" a problem is the same thing as filing
a Python bug report? In practice there are several steps between
noticing a problem in a Python program and entering it as a bug report
in the Python development system. It is very difficult to see why any of
these steps would happen automatically. Believe me, people have had real
problems due to this bug. They have just selected other solutions than
reporting it.

You are yourself reluctant to seek out the roots of this problem and fix
it. Why should other people behave differently and report it? A not so
uncommon "fix" to pickle problems out there is not to use pickle at
all. There are Python programmers who give the advice to avoid pickle
since "it's too shaky". It is a solution, but is it the solution you
desire?

The capability to serialize stuff into ASCII strings isn't just an
implementation detail that happens to be nice for human readability. It
is a feature people need for technical reasons. If the data is ASCII, it
can be dealt with in any ASCII-compatible context which might be network
protocols, file formats and database interfaces. There is the real use.
Programs depend on it to work properly.

The solution of changing the documentation is in practice breaking
compatibility (which programming language designers normally try to
avoid, or do in a very controlled manner). How is a documentation fix
going to help all the code out there written with the assumption that
pickle protocol 0 is always ASCII? Is there a better solution around
than changing pickle to meet actual expectations?

Well, nobody has reported it as a bug in 8 years. How long do you think
that code will stay around based on the ASCII assumption? 8 years? 16
years? 24 years? Maybe all the time in the world for this to become an
issue again and again and again?

It is difficult to grasp why there is "no way to fix it now". From a
programmer's point of view an obvious "fix" is to ditch pickle and use
something that delivers a consistent result rather than spend hours
debugging. When I try to see it from the Python library developer's
point of view, I see code implemented in C which produces a result with
reasonable performance. It is perfectly possible to write code which
implements the expected result with reasonable performance. What is the
problem?

Perhaps it is the raw-unicode-escape encoding that should be fixed? I
failed to find exact information about what raw-unicode-escape means. In
particular, where is the information which states that
raw-unicode-escape is always an 8-bit format? The closest I've come is
PEP-100 and PEP-263 (which I notice were written by you guys), which
describe how to decode raw unicode escape strings from Python source
and how to define encoding formats for Python source code. The sole
original purpose of both unicode-escape and raw-unicode-escape appears
to be representing unicode strings in Python source code as u' and ur'
strings respectively. It is clear that the decoding of a raw unicode
escaped or unicode escaped string depends on the actual encoding of the
Python source, but how goes the logic that when something is _encoded_
into a raw unicode string, the target source must be of some 8-bit
encoding? Especially considering that the default Python source encoding
is ASCII. For unicode-escape this makes sense:

>>> f = file("test.py", "wb")
>>> f.write('s = u"%s"\n' % u"\u0080".encode("unicode-escape"))
>>> f.close()
>>> ^Z

python test.py (executes silently without errors)

But for raw-unicode-escape the outcome is a different thing:

>>> f = file("test.py", "wb")
>>> f.write('s = ur"%s"\n' % u"\u0080".encode("raw-unicode-escape"))
>>> f.close()
>>> ^Z

python test.py

File "test.py", line 1
SyntaxError: Non-ASCII character '\x80' in file test.py on line 1, but
no encoding declared; see http://www.python.org/peps/pep-0263.html for
details

Huh? For someone who trusts the Standard Encodings section of the
Python Library Reference this isn't what one would expect. If the
documentation states "Produce a string that is suitable as raw Unicode
literal in Python source code", then why isn't it suitable?

@malemburg (Member) commented:

On 2008-10-21 11:22, Dan Dibagh wrote:

Your reasoning shows a lack of understanding how Python is actually used
from a programmers point of view.

Hmm, I've been using Python for almost 15 years now and do believe
that I have an idea of how Python is being used.

Note that we cannot change the pickle format retroactively, since
this would break pickle data exchange between different Python versions
relying on the same format (but using different pickle implementations).

What we could do is add a new pickle format which then escapes all
non-ASCII data. However, people have been more keen on getting
compact and fast loading pickles than pickles in ASCII which is why
all new versions of the pickle format are binary formats, so I don't
think it's worth the effort.

Note that the common way of dealing with binary data in ASCII streams
is to use a base64 encoding and possibly also apply compression. The
pickle 0 format is really only useful for debugging purposes.
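The base64-plus-compression recipe mentioned here might look like the following sketch (the sample object is illustrative):

```python
import base64
import pickle
import zlib

obj = {'text': 'á', 'n': 42}

# Compact, ASCII-safe wrapper around a binary protocol 2 pickle.
wire = base64.b64encode(zlib.compress(pickle.dumps(obj, protocol=2)))
print(wire)  # pure-ASCII bytes, safe for ASCII-only channels

# Unwrap in the reverse order.
restored = pickle.loads(zlib.decompress(base64.b64decode(wire)))
print(restored == obj)
```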

Perhaps it is the raw-unicode-escape encoding that should be fixed? I
failed to find exact information about what raw-unicode-escape means. In
particular, where is the information which states that
raw-unicode-escape is always an 8-bit format? The closest I've come is
PEP-100 and PEP-263 (which I notice is written by you guys), which
describes how to decode raw unicode escape strings from Python source
and how to define encoding formats for python source code. The sole
original purpose of both unicode-escape and raw-unicode-escape appears
to be representing unicode strings in Python source code as u' and ur'
strings respectively.

Right.

It is clear that the decoding of a raw unicode
escaped or unicode escaped string depends on the actual encoding of the
python source, but how goes the logic that when something is _encoded_
into a raw unicode string then the target source must be of some 8-bit
encoding. Especially considering that the default python source encoding
is ASCII. For unicode-escape this makes sense:

>>> f = file("test.py", "wb")
>>> f.write('s = u"%s"\n' % u"\u0080".encode("unicode-escape"))
>>> f.close()
>>> ^Z

python test.py (executes silently without errors)

But for raw-unicode-escape the outcome is a different thing:

>>> f = file("test.py", "wb")
>>> f.write('s = ur"%s"\n' % u"\u0080".encode("raw-unicode-escape"))
>>> f.close()
>>> ^Z

python test.py

File "test.py", line 1
SyntaxError: Non-ASCII character '\x80' in file test.py on line 1, but
no encoding declared; see http://www.python.org/peps/pep-0263.html for
details

Huh? For someone who trusts the Standard Encodings section Python
Library reference this isn't what one would expect. If the documentation
states "Produce a string that is suitable as raw Unicode literal in
Python source code" then why isn't it suitable?

Because the raw-unicode-escape codec won't escape the \x80 character,
hence the name. As a result, the generated source code is not ASCII,
which is why you see the exception.

But this is off-topic w/r to the issue in question.


dddibagh mannequin commented Oct 21, 2008

I am well aware why my example produces an error from a technical
standpoint. What I'm getting at is the decision to implement
PyUnicode_EncodeRawUnicodeEscape the way it is. Probably there is
nothing wrong with it, but how am I supposed to know? I read the PEP,
which serves as a specification of raw unicode escape (at least for the
decoding bit) and the reference documentation. Then I read the source
trying to map between specified behavior in the documentation and the
implementation in the source code. When it comes to the part which
causes the problem with non-ASCII characters, it is difficult to follow.

Or in other words: what is the high level reason why the codec won't
escape \x80 in my test program?

To use a real-world term: an interface specification, in this case the
pickle documentation, is the contract between the consumer of the
library and the provider of the library. If it states "ASCII", ASCII is
expected. If it doesn't state "for debugging only", it will be used for
non-debugging purposes. There isn't much you can do about it without
breaking the contract.

What makes you think that the problem cannot be fixed without changing
the existing pickle format 0?

Note that base64 is "a common" way to deal with binary data in ascii
streams rather than "the common". (But why should I care when my data is
already ascii?)


loewis mannequin commented Oct 21, 2008

I read the PEP,
which serves as a specification of raw unicode escape (at least for the
decoding bit) and the reference documentation.

Which PEP specifically? PEP-263 only mentions the unicode-escape
encoding in its problem statement, i.e. as a pre-existing thing.
It doesn't specify it, nor does it give a rationale for why it behaves
the way it does.

Then I read the source
trying to map between specified behavior in the documentation and the
implementation in the source code. When it comes to the part which
causes the problem with non-ASCII characters, it is difficult to follow.

What code are you looking at, and where do you find it difficult to
follow it? Maybe you get confused between the "unicode-escape" codec,
and the "raw-unicode-escape" codec, also.

Or in other words: what is the high level reason why the codec won't
escape \x80 in my test program?

The raw-unicode-escape codec? It was designed to support parsing of
Python 2.0 source code, and of "raw" unicode strings (ur"") in
particular. In Python 2.0, you only needed to escape characters above
U+0100; Latin-1 characters didn't need escaping. Python itself only
relied on the decoding direction. That the codec chooses not to escape
Latin-1 characters on encoding is an arbitrary choice (I guess); it's
still symmetric with decoding.

Even though the choice was arbitrary, you shouldn't change it now,
because people may rely on how this codec works.

What makes you think that the problem cannot be fixed without changing
the existing pickle format 0?

Applications might rely on what was implemented rather than what was
specified. If they had implemented their own pickle readers, such
readers might break if the pickle format is changed. In principle, even
the old pickle readers of Python 2.0..2.6 might break if the format
changes in 2.7 - we would have to go back and check that they don't
break (although I do believe that they would work fine).

So I personally don't see a problem with fixing this, but it appears
MAL does (for whatever reasons - I can't quite buy the performance
argument). OTOH, I don't feel that this issue deserves enough of
my time to actually implement anything.

So contributions are welcome. If you find that the patch meets
resistance, you also need to write a PEP, and ask for BDFL
pronouncement.

@malemburg (Member) commented:

On 2008-10-22 01:34, Martin v. Löwis wrote:

> What makes you think that the problem cannot be fixed without changing
> the existing pickle format 0?

Applications might rely on what was implemented rather than what was
specified. If they had implemented their own pickle readers, such
readers might break if the pickle format is changed. In principle, even
the old pickle readers of Python 2.0..2.6 might break if the format
changes in 2.7 - we would have to go back and check that they don't
break (although I do believe that they would work fine).

So I personally don't see a problem with fixing this, but it appears
MAL does (for whatever reasons - I can't quite buy the performance
argument). OTOH, I don't feel that this issue deserves enough of
my time to actually implement anything.

I've had a look at the implementations used in both pickle.py
and cPickle.c: both apply some extra escaping to the encoded
version of raw-unicode-escape in order to handle newlines
correctly, so I guess adding a few more escapes won't hurt.

So +0 on adding the extra escapes for range(128,256) code
points.

Still, IMHO, all of this is not worth the effort, since protocol
versions 1 and 2 are more efficient and there are better ways to
deal with the problem of sending binary data in some ASCII format,
e.g. using base64.


dddibagh mannequin commented Oct 24, 2008

Which PEP specifically? PEP-263 only mentions the unicode-escape
encoding in its problem statement, i.e. as a pre-existing thing.
It doesn't specify it, nor does it give a rationale for why it behaves
the way it does.

PEP-100 and PEP-263. What I looked for was a description of the
functional intention and a technical definition of raw unicode escape.
The term "raw" tends to have different meanings depending on the context
in which it appears. PEP-263 is of interest in the overall understanding
of the intention of raw unicode escape. If raw unicode escape is to
convert from python source into unicode strings then the decoding of raw
unicode escape strings depends on the source code encoding. Then perhaps
it would give an idea what the encoding part is supposed to do...

PEP-100 is of interest for the technical description. It describes the
section "unicode constructors" as the definition.

What code are you looking at, and where do you find it difficult to
follow it? Maybe you get confused between the "unicode-escape" codec,
and the "raw-unicode-escape" codec, also.

Since it is the issue with non-ASCII characters in pickle output I look
at, it is raw-unicode-escape being in focus. For the decoding bit the
distinction between unicode-escape and raw-unicode-escape is very clear.

I look at the function PyUnicode_EncodeRawUnicodeEscape in
Objects/unicodeobject.c. At the point of the comment "/* Copy everything
else as-is */", given the perceived intentions of the encoding type, I
try to figure out why there isn't a "/* Map non-printable US ASCII to
'\xhh' */" section like in the unicodeescape_string function. The
background in older pythons you explained is essentially what I guessed.

The raw-unicode-escape codec? It was designed to support parsing of
Python 2.0 source code, and of "raw" unicode strings (ur"") in
particular. In Python 2.0, you only needed to escape characters above
U+0100; Latin-1 characters didn't need escaping. Python, itself, only
relied on the decoding directory. That the codec choses not to escape
Latin-1 characters on encoding is an arbitrary choice (I guess); it's
still symmetric with decoding.

I suppose you mean symmetric with decoding as long as you stick to the
latin-1 character set, as raw unicode escaping isn't a one-to-one mapping.

When PEP-263 came into the picture, wouldn't it have made sense to
change PyUnicode_EncodeRawUnicodeEscape to produce ASCII-only output, or
perhaps output conforming to the current default encoding? Given the
intention of raw unicode escape, encoding something with it means
producing Python source code. But it is in Latin-1 while the rest of
Python has moved on to use ASCII by default, or whatever is configured
in the source. I tried to shed light on that problem in my previous
example.

Even though the choice was arbitrary, you shouldn't change it now,
because people may rely on how this codec works.

Applications might rely on what was implemented rather than what was
specified. If they had implemented their own pickle readers, such
readers might break if the pickle format is changed. In principle,
even the old pickle readers of Python 2.0..2.6 might break if the
format changes in 2.7 - we would have to go back and check that they don't
break (although I do believe that they would work fine).

Then let me ask: How far-reaching is the aim to maintain compatibility
with programs which depend on Python internals? Even if the internal
thing is a bug and the thing which depends on the bug is also a bug?
Maybe it is a provoking question, let me explain. The question(s)
applies to some extent to the workings of the codec but it is really the
pickle problem I think of. In the case of older Python releases, it is
just a matter of testing, just as you say. It is boring and perhaps
tedious but there is nothing special which prevents it from being done.
If there are many versions there ought to be a way to write a program
which does it automatically.

In the case of those who have implemented their own pickle readers, the
source and the comments in pickletools.py clearly states that unicode
strings are raw unicode escaped in format 0. Now raw unicode escape
isn't a canonical format. The letter A can be represented either as
\u0041 or as itself as A. If a hypothetical implementor gets the idea
that characters in the range 0-255 cannot be represented by \u00xx
sequences then the fact that pickle replaces \ with \u005c and \n with
\u000a should give a hint that he is wrong. So if characters in the
range 128-255 get escaped with \u00xx, any pickle reader should handle
it. I've tried to come up with some sensible way to write a pickle
implementation which fails to understand \u00xx characters without
calling it a bug. I cannot. Can you? So it seems that the worry about
changing protocol 0 is buggy programs depending on a pickle bug.
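The non-canonical nature of the encoding that this argument relies on can be checked directly (Python 3 spelling of the codec name):

```python
# Both spellings decode to the same string under raw-unicode-escape,
# so a reader that follows pickletools' description of protocol 0
# handles the escaped form automatically.
print(b'\xe1'.decode('raw_unicode_escape'))     # á (raw Latin-1 byte)
print(b'\\u00e1'.decode('raw_unicode_escape'))  # á (escaped form)
```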

At the other end of the spectrum there are correct programs which depend
on Python externals, i.e. programs depending on ASCII-conformant pickle
output (even if there are some base64 ...ehm... fundamentalists who
think it is the wrong way to do it -- I can think of at least one good
reason to do it).

So contributions are welcome. If you find that the patch meets
resistance, you also need to write a PEP, and ask for BDFL
pronouncement.

I am considering doing a patch. I also understand that in order for the
patch to get acceptance it must fit into the Python framework. That's
why I ask all these questions.


bronger mannequin commented Jan 21, 2009

I ran into this problem today when writing Python data structures into a
database. Only ASCII is safe in this situation. I understood the
Python docs to say that protocol 0 was ASCII-only.

I use pickle+base64 now; however, this makes debugging more difficult.

Anyway, I think that the docs should clearly say that protocol 8 is not
ASCII-only because this is important in the Python world. For example,
I saw this issue because Django makes an implicit unicode() conversion
with my input which fails with non-ASCII.


bronger mannequin commented Jan 21, 2009

"protocol 8" --> "protocol 0" of course.

@malemburg (Member) commented:

On 2009-01-21 16:43, Torsten Bronger wrote:

Torsten Bronger <bronger@physik.rwth-aachen.de> added the comment:

I ran into this problem today when writing python data structures into a
database. Only ASCII is safe in this situation. I understood the
Python docs that protocol 0 was ASCII-only.

I use pickle+base64 now, however, this makes debugging more difficult.

Databases can handle binary data just fine, so pickle protocol 2
should be better in your situation.

If you require ASCII-only data, you can also use pickle protocol 2,
zlib and base64 to get a compact version of a serialized Python object.

Anyway, I think that the docs should clearly say that protocol 8 is not
ASCII-only because this is important in the Python world. For example,
I saw this issue because Django makes an implicit unicode() conversion
with my input which fails with non-ASCII.

That sounds like an issue with Django - it shouldn't try to convert
binary data to Unicode (which is reserved for text data).


bronger mannequin commented Jan 21, 2009

Well, Django doesn't store binary data at all but wants you to store
image files etc. in the file system. Whether this was a good design
decision is beyond the scope of this issue. My points actually are
only these:

a) the docs strongly suggest that protocol 0 is ASCII-only and this
should be clarified (one sentence would be fully sufficient I think)

b) currently, there is no way in the standard lib to serialise data in a
debuggable, ASCII-only format

Probably b) is not important. *I* want to have it currently but this
doesn't mean much.


wdoekes mannequin commented Apr 22, 2009

Same issue with Django here ;-)

I wouldn't mind a protocol 3 that does <128 ascii only. If only because
debugging base64'd zlib'd protocol-2 data is not particularly convenient.

@avassalotti (Member) commented:

I wouldn't mind a protocol 3 that does <128 ascii only. If only because
debugging base64'd zlib'd protocol-2 data is not particularly convenient.

Is there any reason that prevents you from debugging your pickle using
the pickle disassembler tool, i.e. pickletools.dis()?
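A minimal example of the disassembler mentioned here (the sample pickle is illustrative):

```python
import io
import pickle
import pickletools

# Disassemble a binary protocol 2 pickle into a readable opcode listing,
# one annotated line per opcode (PROTO, ..., STOP).
buf = io.StringIO()
pickletools.dis(pickle.dumps({'key': 'á'}, protocol=2), out=buf)
print(buf.getvalue())
```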


bronger mannequin commented Apr 22, 2009

The "problem" is the pickle result. It's not about debugging the
pickler itself.


pitrou commented Apr 22, 2009

If your data is simple enough, you can use JSON. It has an
ensure_ascii flag when dumping data.
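For example (ensure_ascii is True by default in the json module):

```python
import json

# ensure_ascii=True escapes every non-ASCII character as \uXXXX,
# so the serialized form is safe for ASCII-only channels.
s = json.dumps({'name': 'á'}, ensure_ascii=True)
print(s)  # {"name": "\u00e1"}
```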

@terryjreedy (Member) commented:

This can no longer be a 2.5 issue but I am not sure how to update it.

OP apparently opened it as a feature request, so I did update it to 3.2.

But OP then says "I'm not sure if this is a functionality or documentation bug." and indeed subsequent messages debate this issue. This would mean it could apply to earlier versions, if re-typed.

On the other hand, there seems to be some opinion that there is no bug, or if there is/was, it cannot be fixed, which would mean this should be closed.

Also, the docs seem to have already been changed, so if that were the issue, this is fixed and should be closed:
"By default, the pickle data format uses a printable ASCII representation."
is now
"Protocol version 0 is the original human-readable protocol and is backwards compatible with earlier versions of Python. "

@malemburg (Member) commented:

Terry J. Reedy wrote:

Terry J. Reedy <tjreedy@udel.edu> added the comment:

This can no longer be a 2.5 issue but I am not sure how to update it.

OP apparently opened it as a feature request, so I did update it to 3.2.

But OP then says "I'm not sure if this is a functionality or documentation bug." and indeed subsequent messages debate this issue. This would mean it could apply to earlier versions, if re-typed.

On the other hand, there seems to be some opinion that there is no bug, or if there is/was, it cannot be fixed, which would mean this should be closed.

Also, the docs seem to have already been changed, so if that were the issue, this is fixed and should be closed:
"By default, the pickle data format uses a printable ASCII representation."
is now
"Protocol version 0 is the original human-readable protocol and is backwards compatible with earlier versions of Python. "

I'd suggest to close the ticket.

The main idea behind version 0 was to have a readable format. The
occasional UTF-8 in the stream should be readable enough nowadays,
even if it's not ASCII.


joelpitt mannequin commented Dec 12, 2012

Just ran into this problem using Python 2.7.3 and the issue others mention in conjunction with Django.

Note the 2.7 docs still imply it's ASCII: http://docs.python.org/2/library/pickle.html#data-stream-format

It has a weak caveat "(and of some other characteristics of pickle's representation)", but if you only skim the bullet points below you'll miss that.

Yes I will use base64 to get around this, but the point is the documentation is still unclear and should probably completely remove the reference to ASCII in favour of "human-readable"... or even better, explicitly mention what will happen with unicode.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022