json.dumps has different behaviour if encoding='utf-8' or encoding='utf8' #77436

Closed
nhatcher mannequin opened this issue Apr 10, 2018 · 5 comments
Labels
topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@nhatcher
Mannequin

nhatcher mannequin commented Apr 10, 2018

BPO 33255
Nosy @etrepum, @vstinner, @benjaminp, @mcepl, @ezio-melotti, @serhiy-storchaka, @native-api, @nhatcher
PRs
  • [2.7] bpo-33255: Treats 'utf-8' and aliases equally. #6523
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2020-01-05.20:50:13.459>
    created_at = <Date 2018-04-10.09:21:14.075>
    labels = ['type-bug', 'expert-unicode']
    title = "json.dumps has different behaviour if encoding='utf-8' or encoding='utf8'"
    updated_at = <Date 2020-01-05.20:50:13.458>
    user = 'https://github.com/nhatcher'

    bugs.python.org fields:

    activity = <Date 2020-01-05.20:50:13.458>
    actor = 'cheryl.sabella'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-01-05.20:50:13.459>
    closer = 'cheryl.sabella'
    components = ['Unicode']
    creation = <Date 2018-04-10.09:21:14.075>
    creator = 'nhatcher'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 33255
    keywords = ['patch']
    message_count = 5.0
    messages = ['315164', '315270', '315478', '315890', '315895']
    nosy_count = 8.0
    nosy_names = ['bob.ippolito', 'vstinner', 'benjamin.peterson', 'mcepl', 'ezio.melotti', 'serhiy.storchaka', 'Ivan.Pozdeev', 'nhatcher']
    pr_nums = ['6523']
    priority = 'normal'
    resolution = 'wont fix'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue33255'
    versions = ['Python 2.7']

    @nhatcher
    Mannequin Author

    nhatcher mannequin commented Apr 10, 2018

    Hey I'm new here, so please let me know what incorrect things I am doing!

    I think json.dumps(o, ensure_ascii=False) is doing the wrong thing when o has both unicode and str keys/values. For instance:

    import json
    o = {u"greeting": "hi", "currency": "€"}
    json.dumps(o, ensure_ascii=False, encoding="utf8")
    json.dumps(o, ensure_ascii=False)
    

    The first call to dumps works, while the second fails. The reason is:

    https://github.com/python/cpython/blob/2.7/Lib/json/encoder.py#L198

    This decodes str values only if the encoding is not exactly 'utf-8'. With the default 'utf-8', str values are left undecoded, so the mixed case (unicode and str) blows up. A workaround is to pass any of the aliases for 'utf-8', such as 'utf8' or 'u8'.

    I would be crazy happy to provide a PR if this is really an issue.
    Let me know if extra clarification is needed.
    Nicolás
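The failure described above can be approximated in Python 3, where implicit str/unicode coercion no longer exists. This is only a sketch under stated assumptions, not the `json` module's code: it assumes a Python 2 `str` holding "€" is the UTF-8 byte string, and that mixing it with `unicode` triggers an implicit ASCII decode. The helper names are hypothetical.

```python
# The bytes a Python 2 str literal "€" would hold in a UTF-8 source file.
EURO_BYTES = "\u20ac".encode("utf-8")


def implicit_coercion(raw):
    """Mimic Python 2's implicit str -> unicode coercion, which uses ASCII."""
    return raw.decode("ascii")


def explicit_decode(raw, encoding):
    """Mimic what the encoder does when the encoding is not exactly 'utf-8'."""
    return raw.decode(encoding)


try:
    implicit_coercion(EURO_BYTES)
    coercion_failed = False
except UnicodeDecodeError:
    # Non-ASCII bytes cannot be coerced implicitly.
    coercion_failed = True

# Decoding explicitly with an alias such as 'utf8' succeeds.
decoded = explicit_decode(EURO_BYTES, "utf8")
```

Under these assumptions, the alias only helps because it takes the explicit decoding path; the bytes themselves are the same in both calls.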

    @nhatcher nhatcher mannequin added topic-unicode type-bug An unexpected behavior, bug, or error labels Apr 10, 2018
    @native-api
    Mannequin

    native-api mannequin commented Apr 13, 2018

    Treating 'utf-8' and its aliases differently (when they specifically mean Python's encoding, rather than something else's) is definitely an issue.

    You shouldn't hardcode a list of aliases though; rather use existing facilities to resolve them. From quick googling, e.g. codecs.lookup(<encoding>).name can get the canonical name.

    Make sure to follow https://devguide.python.org/pullrequest when doing the PR; a test case will likely be needed, too.
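As a rough illustration of the `codecs.lookup(<encoding>).name` suggestion above, the comparison could be done on the canonical name instead of the raw string. The function name `is_utf8` is hypothetical, and this is not the code from the linked PR.

```python
import codecs


def is_utf8(encoding):
    """Return True if `encoding` is 'utf-8' or any registered alias of it."""
    # codecs.lookup() resolves aliases; .name is the canonical codec name.
    return codecs.lookup(encoding).name == "utf-8"


aliases = ["utf-8", "utf8", "u8", "UTF-8"]
results = {name: is_utf8(name) for name in aliases}
```

Note that `codecs.lookup` raises `LookupError` for an unknown encoding name, so a caller would have to decide how to handle that case.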

    @serhiy-storchaka
    Member

    In simplejson:

    >>> simplejson.dumps({u"greeting": "hi", "currency": "€"}, ensure_ascii=False, encoding="utf8")
    u'{"currency": "\u20ac", "greeting": "hi"}'
    >>> simplejson.dumps({u"greeting": "hi", "currency": "€"}, ensure_ascii=False)
    u'{"currency": "\u20ac", "greeting": "hi"}'

    I think it makes sense to fix the case for "utf-8".

    @nhatcher
    Mannequin Author

    nhatcher mannequin commented Apr 29, 2018

    Hi Serhiy,

    I am ok with that change. I think it makes much more sense, but I also think it will break people's code. At least with the simplest fix, in which:

    >>> json.dumps("g", ensure_ascii=False)
    u'"g"'

    This is again compatible with simplejson.
    Although the documentation is not clear on this point, there might be code out there relying on this behaviour.
    Is that acceptable?

    @serhiy-storchaka
    Member

    You could decode only non-ascii strings.

    But I'm not sure that it is worth changing anything in 2.7. This could be treated as a new feature. I leave this to Benjamin, the release manager of 2.7.
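One way to read "decode only non-ascii strings" is sketched below in Python 3, with `bytes` standing in for the Python 2 `str` type. This is an assumption about what the suggestion means, not code from the `json` module or the PR, and `decode_if_non_ascii` is a hypothetical name.

```python
def decode_if_non_ascii(value, encoding="utf-8"):
    """Decode byte strings only when they contain non-ASCII bytes."""
    if not isinstance(value, bytes):
        # Already text (unicode in Python 2 terms): nothing to do.
        return value
    try:
        value.decode("ascii")
    except UnicodeDecodeError:
        # Non-ASCII content: decode with the requested encoding.
        return value.decode(encoding)
    # Pure ASCII: leave the byte string untouched, as before.
    return value


plain = decode_if_non_ascii(b"hi")
euro = decode_if_non_ascii("\u20ac".encode("utf-8"))
```

Under this reading, pure-ASCII input keeps its original type, which would preserve the existing return type for the common case.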

    @csabella csabella closed this as completed Jan 5, 2020
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022