Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

decoding functions in the base64 module could accept unicode strings #57850

Closed
pitrou opened this issue Dec 20, 2011 · 20 comments
Closed

decoding functions in the base64 module could accept unicode strings #57850

pitrou opened this issue Dec 20, 2011 · 20 comments
Labels
easy stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@pitrou
Copy link
Member

pitrou commented Dec 20, 2011

BPO 13641
Nosy @gvanrossum, @pitrou, @merwok, @bitdancer, @akheron, @berkerpeksag
Files
  • issue13641_v1.diff
  • issue13641_v2_with_tests.diff
  • issue13641_v3_with_tests.diff
  • issue13641-alternative-v1.patch
  • issue13641-alternative-v2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2012-02-25.15:41:16.375>
    created_at = <Date 2011-12-20.13:04:47.269>
    labels = ['easy', 'type-feature', 'library']
    title = 'decoding functions in the base64 module could accept unicode strings'
    updated_at = <Date 2012-02-25.15:41:16.374>
    user = 'https://github.com/pitrou'

    bugs.python.org fields:

    activity = <Date 2012-02-25.15:41:16.374>
    actor = 'r.david.murray'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-02-25.15:41:16.375>
    closer = 'r.david.murray'
    components = ['Library (Lib)']
    creation = <Date 2011-12-20.13:04:47.269>
    creator = 'pitrou'
    dependencies = []
    files = ['24119', '24141', '24252', '24533', '24565']
    hgrepos = []
    issue_num = 13641
    keywords = ['patch', 'easy']
    message_count = 20.0
    messages = ['149912', '150445', '150633', '151646', '153499', '153500', '153501', '153504', '153505', '153604', '153713', '153714', '153757', '153759', '153793', '153794', '153827', '153828', '153830', '153847']
    nosy_count = 10.0
    nosy_names = ['gvanrossum', 'pitrou', 'eric.araujo', 'r.david.murray', 'catalin.iacob', 'python-dev', 'petri.lehtinen', 'poq', 'Anthony.Kong', 'berker.peksag']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue13641'
    versions = ['Python 3.3']

    @pitrou
    Copy link
    Member Author

    pitrou commented Dec 20, 2011

    Similarly to bpo-13637 for the binascii module, the decoding functions in the base64 module could accept ASCII-only unicode strings.

    @pitrou pitrou added stdlib Python modules in the Lib dir easy type-feature A feature request or enhancement labels Dec 20, 2011
    @pitrou
    Copy link
    Member Author

    pitrou commented Jan 1, 2012

    Thanks for the patch, Berker. It seems a bit too simple, though. You should add some tests in Lib/test/test_base64.py and run them (using "./python -m test -v test_base64"), this will allow you to see if your changes are correct.

    @berkerpeksag
    Copy link
    Member

    Hi Antoine,

    I added some tests for b64decode function.

    Also, I wrote some tests for b32decode and b16decode functions and failed. I think my patch is not working for b32decode and b16decode functions. I'll dig into code and try to find a way.

    Thanks!

    @pitrou
    Copy link
    Member Author

    pitrou commented Jan 19, 2012

    Thanks for the updated patch!
    Two comments:

    • I see no tests for map01 and altchars being passed as an str, is this supported by the patch or am I reading it wrong?
    • apparently b16decode is not tackled, is it deliberate?

    Thanks again.

    @cataliniacob
    Copy link
    Mannequin

    cataliniacob mannequin commented Feb 16, 2012

    Attached alternative patch with a different approach: on input, strings are encoded as bytes and the rest of the code proceeds as before.

    All existing tests for bytes now test for strings as well and there is a new test for strings with non ASCII characters.

    Berker's patch was more intrusive and forgot to allow strings in _translate, leading to failures if altchars or map01 were used.

    @bitdancer
    Copy link
    Member

    Um. I'm inclined to think that bpo-13637 was a mistake.

    Functions that accept bytes and return bytes and also accept string and return string seem uncontroversial. However, accepting bytes or string and returning bytes is not an obviously good idea, and IMO at least merits some discussion. In fact, I thought it *had* been discussed, specifically in the context of the b2a/a2b functions, and been rejected.

    @pitrou
    Copy link
    Member Author

    pitrou commented Feb 16, 2012

    However, accepting bytes or string and returning bytes is not an
    obviously good idea, and IMO at least merits some discussion.

    Why? "a" in "a2b" means ASCII, and unicode is as valid a container for ASCII text as bytes is.

    @poq
    Copy link
    Mannequin

    poq mannequin commented Feb 16, 2012

    FWIW, I was surprised by the return type of b64encode when I first used it in Python 3. It seems to me that b64encode turns binary data into text and thus intuitively should take bytes and return str.

    Similarly it seems intuitive to me for b64decode to take str as input and return bytes, as it turns text back into binary data.

    @bitdancer
    Copy link
    Member

    OK, I skimmed the thread I was remembering, and while it was discussing str->str and bytes->bytes primarily, the only pronouncement I could find was that functions should not accept a *mix* of bytes and string. So I guess I withdraw my objection, although it still makes me a bit uncomfortable.

    @cataliniacob
    Copy link
    Mannequin

    cataliniacob mannequin commented Feb 17, 2012

    My current patch allows mixing of bytes and str for the data to be decoded and the altchars or map01 parameter. Given David's observation in msg153505 I'll update the patch to require that both the data and altchars/map01 have the same type.

    @cataliniacob
    Copy link
    Mannequin

    cataliniacob mannequin commented Feb 19, 2012

    Attached v2 of patch where mixing str and binary data for altchars or map01 raises TypeError.

    I also added a note for each of the changed functions that it also accepts strings (but didn't also update the docstrings).

    When writing the docs, the new functionality seemed hard to describe; maybe that means this issue only complicates things and is not worth it, or maybe it just means I don't have experience at writing docs.

    But, regardless of having worked at a patch, I have to admit that I'm also not 100% sure this issue is a good idea. I *do* think that either both this issue and bpo-13637 should be accepted or both rejected.

    @pitrou
    Copy link
    Member Author

    pitrou commented Feb 19, 2012

    I think trying to prevent mixed argument types is completely overkill. There's no ambiguity since they all have to be ASCII anyway.
    So I would prefer to commit bpo-13641-alternative-v1.patch

    @bitdancer
    Copy link
    Member

    OK' I'm back to being 100% on the side of rejecting both of these changes. ASCII is not unocode, it is bytes. You can decode it to unicode but it is not unicode. Those transformations operate bytes to bytes, not bytes to unicode.

    We made the bytes unicode separation to avoid the problem where you have a working program that unexpectedly gets non ASCII input and blows up with a unicode error. IMO these patches are reintroducing that problem. The programer should have to explicitly encode to ASCII if they are inadvisedly workimg with it in a string as part of a wire protocol (why else would they be using these transforms).

    @pitrou
    Copy link
    Member Author

    pitrou commented Feb 20, 2012

    OK' I'm back to being 100% on the side of rejecting both of these
    changes. ASCII is not unocode, it is bytes. You can decode it to
    unicode but it is not unicode. Those transformations operate bytes to
    bytes, not bytes to unicode.

    ASCII is just a subset of the unicode character set.

    We made the bytes unicode separation to avoid the problem where you
    have a working program that unexpectedly gets non ASCII input and
    blows up with a unicode error.

    How is blowing up with a unicode error worse than blowing up with a
    ValueError? Both indicate wrong input. At worse the code could catch
    UnicodeError and re-raise it as ValueError, but I don't see the point.

    The programer should have to explicitly encode to ASCII if they are
    inadvisedly workimg with it in a string as part of a wire protocol
    (why else would they be using these transforms).

    Inadvisedly? There are many situations where you can have base64 data in
    some unicode strings.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Feb 20, 2012

    New changeset c760bd844222 by Antoine Pitrou in branch 'default':
    Issue bpo-13641: Decoding functions in the base64 module now accept ASCII-only unicode strings.
    http://hg.python.org/cpython/rev/c760bd844222

    @pitrou
    Copy link
    Member Author

    pitrou commented Feb 20, 2012

    I've committed bpo-13641-alternative-v1.patch. I really think practicality beats purity here and, furthermore, there's no associated danger (non-ASCII data is rejected both as bytes and str).

    @pitrou pitrou closed this as completed Feb 20, 2012
    @bitdancer
    Copy link
    Member

    Non-ascii binary data should not be being rejected unless validate
    is true. So what are you going to do with non-ascii-range unicode in
    that case? Ignore it as well? That can't be right.

    I believe this should be discussed on python-dev.

    @bitdancer
    Copy link
    Member

    I disagree with this commit. Reopening pending discussion on python-dev.

    @bitdancer bitdancer reopened this Feb 21, 2012
    @pitrou
    Copy link
    Member Author

    pitrou commented Feb 21, 2012

    Non-ascii binary data should not be being rejected unless validate
    is true. So what are you going to do with non-ascii-range unicode in
    that case? Ignore it as well? That can't be right.

    It's not ignored, it raises ValueError. Since the common case it to feed
    valid (not invalid) baseXX data to these functions, that's a very benign
    limitation.

    @gvanrossum
    Copy link
    Member

    Aside: I, too, at first thought this would be a bad idea because it brings back the Python 2 issue of accepting some but not all Unicode strings. But then I realized that by their nature these functions only accepts a very specific set of characters -- so the restriction to (a subset of) ASCII is intrinsic to the functionality, and there is no possibility of confusion. If anything, accepting bytes is more likely to be confusing (they could be EBCDIC! :-). So no objection here. And a slight preference for ValueError.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    easy stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants