json.dumps has different behaviour if encoding='utf-8' or encoding='utf8' #77436

nhatcher · 2018-04-10T09:21:14Z

BPO	33255
Nosy	@etrepum, @vstinner, @benjaminp, @mcepl, @ezio-melotti, @serhiy-storchaka, @native-api, @nhatcher
PRs	[2.7] bpo-33255: Treats 'utf-8' and aliases equally. #6523

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}


GitHub fields:

assignee = None
closed_at = <Date 2020-01-05.20:50:13.459>
created_at = <Date 2018-04-10.09:21:14.075>
labels = ['type-bug', 'expert-unicode']
title = "json.dumps has different behaviour if encoding='utf-8' or encoding='utf8'"
updated_at = <Date 2020-01-05.20:50:13.458>
user = 'https://github.com/nhatcher'

bugs.python.org fields:

activity = <Date 2020-01-05.20:50:13.458>
actor = 'cheryl.sabella'
assignee = 'none'
closed = True
closed_date = <Date 2020-01-05.20:50:13.459>
closer = 'cheryl.sabella'
components = ['Unicode']
creation = <Date 2018-04-10.09:21:14.075>
creator = 'nhatcher'
dependencies = []
files = []
hgrepos = []
issue_num = 33255
keywords = ['patch']
message_count = 5.0
messages = ['315164', '315270', '315478', '315890', '315895']
nosy_count = 8.0
nosy_names = ['bob.ippolito', 'vstinner', 'benjamin.peterson', 'mcepl', 'ezio.melotti', 'serhiy.storchaka', 'Ivan.Pozdeev', 'nhatcher']
pr_nums = ['6523']
priority = 'normal'
resolution = 'wont fix'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue33255'
versions = ['Python 2.7']

nhatcher · 2018-04-10T09:21:14Z

Hey I'm new here, so please let me know what incorrect things I am doing!

I think json.dumps(o, ensure_ascii=False) is doing the wrong thing when o has both unicode and str keys/values. For instance:

import json
o = {u"greeting": "hi", "currency": "€"}  # unicode key, str values
json.dumps(o, ensure_ascii=False, encoding="utf8")  # works
json.dumps(o, ensure_ascii=False)  # fails (encoding defaults to 'utf-8')

The first dumps call works while the second fails. The reason is:

https://github.com/python/cpython/blob/2.7/Lib/json/encoder.py#L198

This decodes any str only when the encoding is not literally 'utf-8'. In the mixed case (unicode and str) this will blow up. A workaround is to use any of the aliases for 'utf-8', such as 'utf8' or 'u8'.
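To illustrate the check described above, here is a minimal sketch, not the actual encoder code. It is written as a Python 3 model in which `bytes` stands in for the 2.7 `str` type, and the helper names `needs_decoding_wrapper` and `prepare` are hypothetical.

```python
# Python 3 model of the Python 2.7 behaviour; bytes plays the role of 2.7's str.

def needs_decoding_wrapper(encoding):
    # Literal string comparison: only the exact spelling 'utf-8' skips
    # the decoding step, so aliases such as 'utf8' do not.
    return encoding != 'utf-8'

def prepare(value, encoding='utf-8'):
    """Decode byte strings to text only when the decoding wrapper is active."""
    if needs_decoding_wrapper(encoding) and isinstance(value, bytes):
        return value.decode(encoding)
    return value

# What the 2.7 str literal "€" would hold in a UTF-8 encoded source file.
euro = "€".encode("utf-8")

print(repr(prepare(euro, "utf-8")))  # left as a byte string
print(repr(prepare(euro, "utf8")))   # decoded to text
```

With the exact spelling 'utf-8' the byte string passes through undecoded, which is what later collides with unicode keys or values in the same object; with the alias it is decoded first.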

I would be crazy happy to provide a PR if this is really an issue.
Let me know if extra clarification is needed.
Nicolás

native-api · 2018-04-13T22:20:39Z

Treating 'utf-8' and its aliases differently (when they specifically mean Python's encoding, rather than something else's) is definitely an issue.

You shouldn't hardcode a list of aliases though; rather use existing facilities to resolve them. From quick googling, e.g. codecs.lookup(<encoding>).name can get the canonical name.
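A minimal sketch of that suggestion follows. `codecs.lookup` and the `name` attribute of the returned `CodecInfo` are standard library features; the helper `is_utf8` is a hypothetical name used only for illustration and is not part of the json module.

```python
import codecs

def is_utf8(encoding):
    # Resolve the alias through the codec registry instead of comparing
    # against a hardcoded list of spellings.
    return codecs.lookup(encoding).name == 'utf-8'

for spelling in ('utf-8', 'utf8', 'u8', 'UTF-8', 'latin-1'):
    print(spelling, is_utf8(spelling))
```

Note that `codecs.lookup` raises `LookupError` for an unknown encoding name, so a caller would need to decide how to handle that case.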

Make sure to follow https://devguide.python.org/pullrequest when doing the PR; a test case will likely be needed, too.

serhiy-storchaka · 2018-04-19T05:47:35Z

In simplejson:

>>> simplejson.dumps({u"greeting": "hi", "currency": "€"}, ensure_ascii=False, encoding="utf8")
u'{"currency": "\u20ac", "greeting": "hi"}'
>>> simplejson.dumps({u"greeting": "hi", "currency": "€"}, ensure_ascii=False)
u'{"currency": "\u20ac", "greeting": "hi"}'

I think it makes sense to fix the case for "utf-8".

nhatcher · 2018-04-29T12:08:30Z

Hi Serhiy,

I am OK with that change. I think it makes much more sense, but I also think it will break people's code, at least with the simplest fix, in which:

>>> json.dumps("g", ensure_ascii=False)
u'"g"'

Which is again compatible with simplejson.
Although the documentation is not clear on this point, there might be code out there relying on this behaviour.
Is that acceptable?

serhiy-storchaka · 2018-04-29T13:36:40Z

You could decode only non-ascii strings.
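One possible reading of this idea, as a rough sketch only: it is a Python 3 model in which `bytes` stands in for the 2.7 `str` type, and `decode_if_non_ascii` is a hypothetical helper, not code from any patch.

```python
def decode_if_non_ascii(value, encoding='utf-8'):
    """Leave pure-ASCII byte strings untouched; decode the rest to text."""
    if isinstance(value, bytes):
        try:
            value.decode('ascii')
        except UnicodeDecodeError:
            # Contains at least one non-ASCII byte, so decode it.
            return value.decode(encoding)
    return value

print(repr(decode_if_non_ascii(b"hi")))
print(repr(decode_if_non_ascii("€".encode("utf-8"))))
```

Under this approach an all-ASCII input would keep its original type, so code relying on the current result type for ASCII data would be unaffected.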

But I'm not sure that it is worth changing anything in 2.7. This could be treated as a new feature. I leave this one to Benjamin, the release manager of 2.7.

nhatcher mannequin added topic-unicode and type-bug labels Apr 10, 2018

csabella closed this as completed Jan 5, 2020

ezio-melotti transferred this issue from another repository Apr 10, 2022
