codecs missing: base64 bz2 hex zlib hex_codec ... #51724
AFAIK these codecs were not ported to Python 3.
|
These are not encodings, in that they don't convert characters to bytes. |
Martin v. Löwis wrote:
Martin, I beg your pardon, but these codecs indeed implement valid They should be readded to Python 3.x. Note that just because a codec doesn't convert between bytes |
Reopening the ticket. |
It's not possible to add these codecs back. Bytes objects (correctly) |
I agree with Martin. gzip and bz2 convert bytes to bytes. Encodings deal |
«Everything you thought you knew about binary data and Unicode Reopening for the documentation part. This "mistake" deserves some words in the documentation: And the conversion may be automated with 2to3, maybe. |
Is it possible to add "DeprecationWarning" for these codecs
>>> {}.has_key('a')
__main__:1: DeprecationWarning: dict.has_key() not supported in 3.x;
use the in operator
False
>>> print `123`
<stdin>:1: SyntaxWarning: backquote not supported in 3.x; use repr()
123
>>> 'abc'.encode('base64')
'YWJj\n' |
Martin v. Löwis wrote:
Of course it does support these kinds of codecs. The codec All we agreed to is that unicode.encode() will only return bytes, However, you can still use codecs.encode() and codecs.decode() You can't argue that just because two methods don't support Also note that codecs allow a much more far-reaching use So your argument that just because the two methods don't |
Benjamin Peterson wrote:
Sorry, Benjamin, but that's simply not true. Codecs can work with arbitrary types, it's just that the helper codecs.encode()/.decode() provide access to all codecs, regardless |
Thinking about it, I am +1 to reimplement the codecs. We could implement new methods to replace the old one.
>>> b'abc'.encodebytes('base64')
b'YWJj\n'
>>> b'abc'.encodebytes('zlib').encodebytes('base64')
b'eJxLTEoGAAJNASc=\n'
>>> b'UHl0aG9u'.decodebytes('base64').decode('utf-8')
'Python' |
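For comparison, the same chained conversions can be spelled with the generic codecs.encode() and codecs.decode() functions instead of new methods. This is only a sketch, assuming a Python 3 interpreter where the "zlib_codec" and "base64_codec" names are registered; the encodebytes/decodebytes methods above are a proposal from this thread, not an existing bytes API.

```python
import codecs

# bytes -> bytes: compress, then base64-encode the compressed payload
packed = codecs.encode(codecs.encode(b"abc", "zlib_codec"), "base64_codec")

# Undo the two steps in the opposite order
unpacked = codecs.decode(codecs.decode(packed, "base64_codec"), "zlib_codec")

# bytes -> bytes (base64), followed by bytes -> str (a real text encoding)
text = codecs.decode(b"UHl0aG9u", "base64_codec").decode("utf-8")

print(unpacked, text)
```

The exact compressed bytes can vary with the zlib build, so the round trip is a safer thing to rely on than a literal expected value.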
2009/12/11 Marc-Andre Lemburg <report@bugs.python.org>:
Didn't you have a proposal for bytes.transform/untransform for |
Benjamin Peterson wrote:
Yes. At the time it was postponed, since I brought it up late Note that those methods are just convenient helpers to access The full machinery itself is accessible via the codecs module and |
I've ported the codecs from Py2: It's not a big deal. Basically:
Will add documentation if we agree on the feature. |
I presume that the OP didn't talk about codecs.encode, but about |
No, transform/untransform as methods are a bad idea, but these *codecs* The minimal change needed for that to be feasible is to give errors raised MAL also stated on python-dev that codecs.encode and codecs.decode already |
Okay, but I don't personally find any of these to be good ideas as "codecs", given they don't have anything to do with translating between bytes<->unicode. |
The codecs module is generic, text encodings are just the most common use |
I don't see any point in merely bringing the codecs back, without any convenience API to use them. If I need to do
import codecs
result = codecs.getencoder("base64").encode(data)
I don't think people would actually prefer this over
import base64
result = base64.encodebytes(data)
It's (IMO) only the convenience method (.encode) that made people love these codecs. |
IMHO it's also a documentation problem. Once people figure out that they can't use encode/decode anymore, it's not immediately clear what they should do instead. By reading the codecs docs it's not obvious that it can be done with codecs.getencoder("...").encode/decode, so people waste time finding a solution, get annoyed, and blame Python 3 because it removed a simple way to use these codecs without making clear what should be used instead. |
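To make the lookup-based route concrete: as far as I can tell, codecs.getencoder() and codecs.getdecoder() return the codec's stateless functions directly, so the ".encode" spelling quoted in this thread is shorthand rather than a literal call. A minimal sketch, using the full "base64_codec" name so it does not depend on the short aliases being present:

```python
import base64
import codecs

data = b"example"

# getencoder() returns the codec's stateless encoder function;
# calling it yields an (output, length_consumed) tuple.
encoder = codecs.getencoder("base64_codec")
encoded, consumed = encoder(data)

# getdecoder() is the mirror image for the reverse direction.
decoder = codecs.getdecoder("base64_codec")
decoded, _ = decoder(encoded)

# The dedicated module produces the same bytes in a single call.
same = base64.encodebytes(data)

print(encoded, consumed, decoded)
```

The tuple unpacking is the extra step that makes this route less convenient than either the base64 module or the old method call.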
It turns out MAL added the convenience API I'm looking for back in 2004, it just didn't get documented, and is hidden behind the "from _codecs import *" call in the codecs.py source code:
So, all the way from 2.4 to 2.7 you can write:
from codecs import encode
result = encode(data, "base64")
It works in 3.x as well, you just need to add the "_codec" to the end to account for the missing aliases:
>>> encode(b"example", "base64_codec")
b'ZXhhbXBsZQ==\n'
>>> decode(b"ZXhhbXBsZQ==\n", "base64_codec")
b'example'
Note that the convenience functions omit the extra checks that are part of the methods (although I admit the specific error here is rather quirky):
>>> b"ZXhhbXBsZQ==\n".decode("base64_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.2/encodings/base64_codec.py", line 20, in base64_decode
    return (base64.decodebytes(input), len(input))
  File "/usr/lib64/python3.2/base64.py", line 359, in decodebytes
    raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not memoryview
I'm going to create some additional issues, so this one can return to just being about restoring the missing aliases. |
Just copying some details here about codecs.encode() and """
import codecs
r13 = codecs.encode('hello world', 'rot-13')
These interface directly to the codec interfaces, without As Nick found, these aren't documented, which is a documentation http://hg.python.org/cpython-fullhistory/rev/8ea2cb1ec598 These API are nice for general purpose codec work and For the codecs in question, it would still be nice to have |
FTR this is because of ff1261a14573 (see bpo-10807). |
For me, the killer argument *against* a method based API is memoryview (and, equivalently, array.array). It should be possible to use those as inputs for the bytes->bytes codecs, and once you endorse codecs.encode and codecs.decode for that use case, it's hard to justify adding more exclusive methods to the already broad bytes and bytearray APIs (particularly given the problems with conveying direction of conversion unambiguously). By contrast, I think "the codecs functions are generic while the str, bytes and bytearray methods are specific to text encodings" is something we can explain fairly easily, thus allowing the aliases mentioned in this issue to be restored for use with the codecs module functions. To avoid reintroducing the quirky errors described in bpo-10807, the encoding and decoding error messages should first be improved as discussed in bpo-17828. |
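The memoryview argument can be illustrated with a short sketch. It assumes a recent Python 3 in which the bytes-to-bytes codecs accept any object exposing the buffer protocol; the variable names are purely illustrative.

```python
import array
import codecs

payload = b"example"

# Four different buffer-exporting inputs carrying the same data
inputs = [
    payload,
    bytearray(payload),
    memoryview(payload),
    array.array("B", payload),
]

# The function-based API treats them all alike; no per-type method is needed
hex_results = [codecs.encode(obj, "hex_codec") for obj in inputs]

# A round trip starting from a memoryview
restored = codecs.decode(
    codecs.encode(memoryview(payload), "zlib_codec"), "zlib_codec"
)

print(hex_results, restored)
```

A method-based API would have to be added to each of these types separately, which is the cost being argued against here.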
Also adding 17839 as a dependency, since part of the reason the base64 errors in particular are so cryptic is that the base64 module doesn't accept arbitrary PEP-3118 compliant objects as input. |
I also created bpo-17841 to cover that the 3.3 documentation incorrectly states that these aliases still exist, even though they were removed before 3.2 was released. |
With bpo-17839 fixed, the error from invoking the base64 codec through the method API is now substantially more sensible:
>>> b"ZXhhbXBsZQ==\n".decode("base64_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: decoder did not return a str object (type=bytes) |
I just wanted to note something I realised in chatting to Armin Ronacher recently: in both Python 2.x and 3.x, the encode/decode method APIs are constrained by the text model, it's just that in 2.x that model was effectively basestring<->basestring, and thus still covered every codec in the standard library. This greatly limited the use cases for the codecs.encode/decode convenience functions, which is why the fact they were undocumented went unnoticed. In 3.x, the changed text model meant the method API became limited to the Unicode codecs, making the function based API more important. |
For anyone interested, I have a patch up on bpo-17828 that produces the following output for various codec usage errors:
>>> import codecs
>>> codecs.encode(b"hello", "bz2_codec").decode("bz2_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'bz2_codec' decoder returned 'bytes' instead of 'str'; use codecs.decode to decode to arbitrary types
>>> "hello".encode("bz2_codec")
TypeError: 'str' does not support the buffer interface
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: invalid input type for 'bz2_codec' codec (TypeError: 'str' does not support the buffer interface)
>>> "hello".encode("rot_13")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'rot_13' encoder returned 'str' instead of 'bytes'; use codecs.encode to encode to arbitrary types |
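The split between the two APIs can be checked without depending on the exact wording of the messages. The sketch below is an assumption-laden illustration: try_method is a hypothetical helper, and the code catches both TypeError (as in the output above) and LookupError, which is what later interpreters appear to raise for non-text codecs used through the methods.

```python
import codecs


def try_method(call):
    """Run call() and return the exception type name, or None on success."""
    try:
        call()
    except (LookupError, TypeError) as exc:
        return type(exc).__name__
    return None


# The str/bytes methods are reserved for text encodings and reject transforms
str_method_error = try_method(lambda: "hello".encode("rot_13"))
bytes_method_error = try_method(lambda: b"68656c6c6f".decode("hex_codec"))

# The generic functions accept the same codecs without complaint
rot13 = codecs.encode("hello", "rot_13")
unhexed = codecs.decode(b"68656c6c6f", "hex_codec")

print(str_method_error, bytes_method_error, rot13, unhexed)
```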
Providing the 2to3 fixers in bpo-17823 now depends on this issue rather than the other way around (since not having to translate the names simplifies the fixer a bit). |
bpo-17823 is now closed, but not because it has been implemented. It turns out that the data driven nature of the incompatibility means it isn't really amenable to being detected and fixed automatically via 2to3. bpo-19543 is a replacement proposal for the introduction of some additional codec related Py3k warnings in Python 2.7.7. |
Attached patch restores the aliases for the binary and text transforms, adds a test to ensure they exist and restores the "Aliases" column to the relevant tables in the documentation. It also updates the relevant section in the What's New document. I also tweaked the wording in the docs to use the phrases "binary transform" and "text transform" for the affected tables and version added/changed notices. Given the discussions on python-dev, the main condition that needs to be met before I commit this is for Victor to change his current -1 to a -0 or higher. |
Victor is still -1, so to Python 3.5 it goes. |
The 3.4 portion of bpo-19619 has been addressed, so removing it as a dependency again. |
With bpo-19619 resolved for Python 3.4 (the issue itself remains open awaiting a backport to 3.3), Victor has softened his stance on this topic and given the go ahead to restore the codec aliases: http://bugs.python.org/issue19619#msg203897 I'll be committing this shortly, after adjusting the patch to account for the bpo-19619 changes to the tests and What's New. |
New changeset 5e960d2c2156 by Nick Coghlan in branch 'default': |
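A quick way to confirm that the short aliases resolve again is to compare them with the full codec names. This sketch assumes an interpreter that includes the change (the documentation records the aliases as restored in 3.4); the list of pairs is a small sample, not the complete set.

```python
import codecs

# (short alias, full codec name) pairs to compare
pairs = [
    ("base64", "base64_codec"),
    ("hex", "hex_codec"),
    ("zlib", "zlib_codec"),
    ("rot13", "rot_13"),
]

# Both spellings should resolve to the same registered codec
resolved = {
    alias: (codecs.lookup(alias).name, codecs.lookup(full).name)
    for alias, full in pairs
}

# With the alias in place, the Python 2 style spelling works via the function API
short = codecs.encode(b"example", "base64")

print(resolved, short)
```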
Note that I still plan to do a documentation-only PEP for 3.4, proposing some adjustments to the way the codecs module is documented, making binary and text transform defined terms in the glossary, etc. I'll probably aim for beta 2 for that. |
Docstrings for new codecs mention bytes.transform() and bytes.untransform() which are nonexistent. |
New changeset d7950e916f20 by R David Murray in branch '3.3': New changeset 83d54ab5c696 by R David Murray in branch 'default': |