
json.dump much slower than dumps #56343

Closed

poq mannequin opened this issue May 21, 2011 · 12 comments
Assignees
rhettinger

Labels
docs (Documentation in the Doc dir), easy, performance (Performance or resource usage), stdlib (Python modules in the Lib dir)

Comments

@poq
Mannequin

poq mannequin commented May 21, 2011

BPO 12134
Nosy @rhettinger, @terryjreedy, @pitrou, @ezio-melotti, @merwok

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = 'https://github.com/rhettinger'
closed_at = <Date 2011-06-26.14:41:50.827>
created_at = <Date 2011-05-21.13:38:06.321>
labels = ['easy', 'docs', 'library', 'performance']
title = 'json.dump much slower than dumps'
updated_at = <Date 2011-06-26.14:41:50.825>
user = 'https://bugs.python.org/poq'

bugs.python.org fields:

activity = <Date 2011-06-26.14:41:50.825>
actor = 'rhettinger'
assignee = 'rhettinger'
closed = True
closed_date = <Date 2011-06-26.14:41:50.827>
closer = 'rhettinger'
components = ['Documentation', 'Library (Lib)']
creation = <Date 2011-05-21.13:38:06.321>
creator = 'poq'
dependencies = []
files = []
hgrepos = []
issue_num = 12134
keywords = ['easy']
message_count = 12.0
messages = ['136439', '136440', '136641', '136691', '137146', '137166', '137169', '137170', '137691', '137704', '137707', '139182']
nosy_count = 7.0
nosy_names = ['rhettinger', 'terry.reedy', 'pitrou', 'ezio.melotti', 'eric.araujo', 'docs@python', 'poq']
pr_nums = []
priority = 'normal'
resolution = 'rejected'
stage = 'needs patch'
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue12134'
versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

@poq
Mannequin Author

poq mannequin commented May 21, 2011

import json, timeit

obj = [[1, 2, 3] * 10] * 10

class writable(object):
    def write(self, buf):
        pass

w = writable()

print('dumps: %.3f' % timeit.timeit(lambda: json.dumps(obj), number=10000))
print('dump:  %.3f' % timeit.timeit(lambda: json.dump(obj, w), number=10000))

On my machine this outputs:
dumps: 0.391
dump: 4.501

I believe this is mostly caused by dump using JSONEncoder.iterencode without _one_shot=True, resulting in c_make_encoder not being used.
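
A minimal sketch for checking this on CPython (the printed result is illustrative):

import json.encoder

# dumps() goes through JSONEncoder.encode(), which calls
# iterencode(obj, _one_shot=True) and can therefore hand the work to the
# _json C accelerator; dump() calls iterencode() with the default
# _one_shot=False and stays on the pure-Python path.
print(json.encoder.c_make_encoder)
# <built-in function make_encoder> on CPython; None if _json is unavailable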

@poq poq mannequin added the extension-modules (C modules in the Modules dir) and performance (Performance or resource usage) labels May 21, 2011
@ezio-melotti ezio-melotti added the stdlib (Python modules in the Lib dir) label and removed the extension-modules (C modules in the Modules dir) label May 21, 2011
@pitrou
Member

pitrou commented May 21, 2011

This is indeed the case. And the solution is obvious: call dumps() and then write() the result yourself. If dump() used _one_shot=True, it would defeat the purpose of minimizing memory consumption by not buffering the whole result.
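
A minimal sketch of that workaround (obj stands for any serializable object; the file name is made up):

import json

obj = [[1, 2, 3] * 10] * 10

# Buffer the whole document with the fast one-shot encoder, then hand it
# to the file in a single write() call.
with open('data.json', 'w') as fp:
    fp.write(json.dumps(obj))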

@pitrou pitrou added the docs (Documentation in the Doc dir) label May 21, 2011
@merwok
Member

merwok commented May 23, 2011

I believe Antoine is saying “works as intended”, i.e. not a bug. I’m not sure it’s worth a doc change.

@poq
Mannequin Author

poq mannequin commented May 23, 2011

Alright.

I wouldn't mind a little note in the docs; I certainly did not expect that these two functions would perform so differently.

Would it be very difficult, though, to add buffering support to the C encoder?

@terryjreedy
Member

From just reading the docs, it appears that json.dump(obj, fp) == fp.write(json.dumps(obj)), and it is easy to wonder why .dump even exists, as it seems a trivial abbreviation (and why not .dump and .dumpf instead). Since '_one_shot' and 'c_make_encoder' are not mentioned in the doc, there is no hint from these either. So I think a doc addition is needed.

The benchmark is not completely fair, as the .dumps timing omits the write call. For the benchmark, that would be trivial. But in real use on multitasking systems with slow (compared to CPU speed) output channels, the write time might dominate. I can even imagine .dump sometimes winning by getting chunks into a socket buffer, and possibly out on the wire, almost immediately, instead of waiting to compute the entire output, possibly interrupted by task swaps. So I presume *this* is at least part of the reason for the incremental .dump.

I changed 'pass' to 'print(buf)' in class writable and verified that .dump is *very* incremental. Even '[' and ']' are separate outputs.
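
A sketch of that experiment, collecting the chunks instead of printing them (the exact chunk boundaries are an implementation detail and may differ across versions):

import json

class ChunkLogger:
    """File-like object that records every piece dump() writes."""
    def __init__(self):
        self.chunks = []

    def write(self, buf):
        self.chunks.append(buf)

w = ChunkLogger()
json.dump([[1, 2], [3]], w)
print(w.chunks)
# e.g. ['[', '[1', ', 2', ']', ', ', '[3', ']', ']']; note the separate '[' and ']'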

DOC suggestion (limited to CPython, since the spec does not prohibit the naive implementation of .dump given above): after the current .dumps line, add

"In CPython, json.dumps(o), by itself, is faster than json.dump(o,f), at the expense of using more space, because it creates the entire string at once, instead of incrementally writing each piece of o to f. However, f.write(json.dumps(o)) may not be faster."

@ezio-melotti
Member

The names dump and dumps exist to match the same API provided by pickle and marshal.

I agree that a note marked as CPython implementation detail should be added.

@rhettinger
Contributor

Please don't add notes of this sort to the docs.

@rhettinger rhettinger assigned rhettinger and unassigned docspython May 29, 2011
@ezio-melotti
Member

Are you against the proposed wording or the note itself?

Stating that in CPython json.dump doesn't use the C acceleration is a reasonable thing to do, IMHO.

@pitrou
Member

pitrou commented Jun 5, 2011

"In CPython, json.dumps(o), by itself, is faster than json.dump(o,f),
at the expense of using more space, because it creates the entire
string at once, instead of incrementally writing each piece of o to f.
However, f.write(json.dumps(o)) may not be faster."

Uh, talking about "CPython" is not very helpful here and only muddies
the waters IMO.
Something like "typical implementations of dump() will try to write the
result in small chunks and will therefore trade lower memory usage for
higher serialization time. If you have enough memory and care about
performance, consider using dumps() and write the result yourself with a
single write() call".

@terryjreedy
Member

With 'will try to ' and the next 'will ' omitted, I agree that Antoine's version is better than mine.

@poq
Mannequin Author

poq mannequin commented Jun 5, 2011

dump() is not slower because it's incremental though. It's slower because it's pure Python. I don't think there is necessarily a memory/speed trade-off; it should be possible to write an incremental encoder in C as well.
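
For reference, the incremental path is already public API; a sketch of streaming output without buffering the whole document (file name made up):

import json

obj = [[1, 2, 3] * 10] * 10

# iterencode() yields the document piece by piece, which is what dump()
# does internally. An incremental C encoder would keep this same shape
# while producing the chunks at C speed.
with open('data.json', 'w') as fp:
    for chunk in json.JSONEncoder().iterencode(obj):
        fp.write(chunk)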

@rhettinger
Contributor

As Antoine and Eric stated, the module is working as intended and we don't document implementation details and generally stay away from talking about performance in the docs.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
mathiascode added a commit to nicotine-plus/nicotine-plus that referenced this issue Aug 12, 2023
json.dump uses the Python implementation of the json encoder instead of the C implementation. While this saves a bit of memory while dumping data, the performance impact is large: python/cpython#56343

Saving a user share with 3 million files to disk previously took 1 minute 13 seconds. It now takes 14 seconds.

The C implementation of the json encoder doesn't support the 'indent' argument, so stop using it.
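
A hedged sketch of the kind of change the commit describes (names are hypothetical; the real diff may differ):

import json

shares = {'files': ['a.mp3', 'b.mp3']}  # placeholder for the real share data

with open('shares.json', 'w') as fp:
    # Before: json.dump(shares, fp, indent=2), which forces the incremental
    # pure-Python encoder, since indent disables the C accelerator.
    # After: one-shot C encoder plus a single buffered write, no indent.
    fp.write(json.dumps(shares))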