Speed up JSON string encoding with ensure_ascii=False for long string values #150878

@gaborbernat

Feature or enhancement

Proposal

When json.dumps runs with ensure_ascii=False, it sizes each escaped string one character at a time in escape_size (Modules/_json.c), after which write_escaped_unicode copies the string verbatim when nothing needs escaping. In this mode a character needs escaping only when c == '"', c == '\\', or c < 0x20; non-ASCII is kept verbatim. For a long string with no such character, which is the common case for text values including Western-European (Latin-1) text, that per-character sizing scan is pure overhead before the verbatim copy.
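
The escaping rule above can be modelled in a few lines of Python. This is only an illustrative sketch: `escaped_size` here is a hypothetical pure-Python helper, not a transcription of the C function, and it assumes the size includes the two surrounding quotes so that it can be compared directly with the length of the `json.dumps` output.

```python
import json

# Characters that the encoder writes as a two-character escape (\" \\ \b \f \n \r \t).
_TWO_CHAR_ESCAPES = {'"', '\\', '\b', '\f', '\n', '\r', '\t'}


def escaped_size(s: str) -> int:
    """Hypothetical model: length of json.dumps(s, ensure_ascii=False)."""
    size = 2  # opening and closing quote
    for c in s:
        if c in _TWO_CHAR_ESCAPES:
            size += 2  # backslash plus one letter
        elif c < "\x20":
            size += 6  # \u00XX for the remaining control characters
        else:
            size += 1  # everything else, including non-ASCII, is kept verbatim
    return size


for sample in ["plain text", "café", 'say "hi"', "tab\there", "\x01"]:
    assert escaped_size(sample) == len(json.dumps(sample, ensure_ascii=False))
```

For a string with no escape-relevant character the loop only ever takes the last branch, which is the per-character work the proposal aims to skip.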

The proposal is to detect the no-escape case on the one-byte representation eight bytes at a time, returning the verbatim size after about one eighth of the work. A length guard keeps short strings, such as the typical dict key, on the existing per-character loop. Two-byte and four-byte strings (anything with a character above U+00FF) keep the current loop.

This is the ensure_ascii=False counterpart to the encoder change in #150875 (PR #150876); together with the decode-side scan in #150871 (PR #150872) the three cover JSON string scanning end to end. They touch different code paths and are separate changes.

How this differs from the SIMD backend in #142915

This proposal is not the SIMD parsing architecture declined in #142915. It uses no SIMD intrinsics, no runtime CPU detection, and no build configuration, only portable 64-bit integer arithmetic with the same 0x0101… / 0x8080… masks that Objects/unicodeobject.c already applies for ASCII scanning. It changes one function and adds no infrastructure, so it does not depend on #125022 and needs no PEP.

When it helps, and when it does not

Measured json.dumps(..., ensure_ascii=False) speedups against the current encoder:

| Document shape | Effect |
| --- | --- |
| One long text field (~16 KB string) | 5.8x faster |
| Long Western-European (Latin-1) text values | 4.2x faster |
| Many 200-character ASCII string values | 3.9x faster |
| Realistic mixed records (short and medium strings) | 1.4x faster |
| Short keys, strings that need escaping | no change |
| Strings with characters above U+00FF | no change (scalar path) |
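
To see which of these rows a given string falls under, a rough classifier can be sketched as follows. `classify` is a hypothetical helper written for this illustration; it relies on CPython storing a string one byte per character only when every code point is at most U+00FF (PEP 393), and it ignores the length guard, whose threshold is not given here.

```python
def classify(s: str) -> str:
    """Hypothetical helper: which row of the results a string belongs to."""
    if any(ord(c) > 0xFF for c in s):
        # Two-byte or four-byte representation: keeps the existing loop.
        return "scalar path"
    if any(c in '"\\' or c < "\x20" for c in s):
        # At least one character must be escaped, so no verbatim copy.
        return "needs escaping"
    # One-byte representation with nothing to escape.
    return "verbatim candidate"


print(classify("x" * 200))                  # verbatim candidate
print(classify("café résumé naïve " * 15))  # verbatim candidate
print(classify("中文 текст 😀 " * 30))       # scalar path
print(classify('say "hi"'))                 # needs escaping
```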

The benefit applies only to ensure_ascii=False, which is the non-default mode, so it reaches fewer callers than the default-path change in #150876. Within that mode, the speedup is comparable to that change's.

Correctness

The encoded output is byte-identical to the current encoder's. The patch was validated against test_json and a 199-case differential corpus that places each escape-relevant character at every offset across the eight-byte window, in both ensure_ascii modes. Every output matched.
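
A differential check of this kind could look roughly like the sketch below. It is not the 199-case corpus itself: the character list, the string lengths, and the use of the pure-Python encoders in Lib/json/encoder.py as the reference are all assumptions made for illustration.

```python
import json
from json.encoder import py_encode_basestring, py_encode_basestring_ascii

WINDOW = 8  # width of the word scanned per step
# Characters at or next to the escaping boundaries (illustrative choice).
PROBES = ['"', "\\", "\x00", "\x1f", "\n", " ", "\x7f", "\x80", "\xff", "\u0100"]


def differential_cases():
    """Yield strings with each probe character at every offset."""
    for ch in PROBES:
        for length in (WINDOW, 2 * WINDOW + 3):  # one full word, and words plus a tail
            filler = "a" * length
            for offset in range(length):
                yield filler[:offset] + ch + filler[offset + 1:]


def mismatches():
    """Compare json.dumps against the pure-Python encoders in both modes."""
    bad = []
    for s in differential_cases():
        if json.dumps(s, ensure_ascii=False) != py_encode_basestring(s):
            bad.append(("ensure_ascii=False", s))
        if json.dumps(s, ensure_ascii=True) != py_encode_basestring_ascii(s):
            bad.append(("ensure_ascii=True", s))
    return bad


assert mismatches() == []
```

Running the same script under the base and the patched interpreter would then show whether any position in the window changes the output.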

A draft PR follows.

Benchmark

Base and patched interpreters were built from this branch's main ancestor and from the patch. The same script was run under each, and the results were compared with pyperf compare_to (A/B by swapping Lib/json/encoder.py on the same build; macOS arm64, non-PGO).

import json

import pyperf


def dumps(obj):
    # Encode with ensure_ascii=False, the mode this proposal targets.
    return json.dumps(obj, ensure_ascii=False)


# Workloads, keyed by benchmark name.
objs = {
    "long_ascii": ["x" * 200 for _ in range(200)],
    "long_latin1": ["café résumé naïve " * 15 for _ in range(200)],  # 1-byte Latin-1, kept verbatim
    "text_blob": {"body": "lorem ipsum dolor " * 900},
    "short_keys": {f"k{i}": i for i in range(2000)},
    "nonascii": ["中文 текст 😀 " * 30 for _ in range(200)],  # UCS-2/4 scalar
    "mixed_real": [{"id": i, "name": f"user_{i}", "bio": "hello world " * 10} for i in range(300)],
}

runner = pyperf.Runner()
for name, obj in objs.items():
    # Bind obj as a default argument so each benchmark keeps its own workload.
    runner.bench_func(f"dumpsF/{name}", lambda obj=obj: dumps(obj))

Labels: extension-modules (C modules in the Modules dir), performance (Performance or resource usage), type-feature (A feature request or enhancement)