gh-148653: Refactor marshal for cycle safety and performance (#148700)
Replace the PyList-backed reference table with a raw growable
PyObject ** array, and encode REF_STATE_INCOMPLETE_HASHABLE in the low
bit of each ref pointer so the parallel state-byte allocation is gone.
Also:
- drop the allow_incomplete_hashable parameter from r_object; it lives
on RFILE now, auto-reset on entry, flipped via a wrapper at the two
list-element / dict-value sites.
- force-inline the r_ref_* helpers so the compiler can fold the
if (flag) guards into the callers as the original R_REF macro did.
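The low-bit tagging described above relies on `PyObject *` values being at least 2-byte aligned, so the least-significant bit is free to carry state. As an illustration only, here is a small Python model of that scheme; the names (`INCOMPLETE_HASHABLE`, `tag`, `untag`) are hypothetical stand-ins for the C macros and helpers, not the actual identifiers in the patch:

```python
# Python model of low-bit pointer tagging: each entry in the PyObject**
# ref array stores an aligned address whose low bit encodes the
# "incomplete hashable" state, replacing a parallel state-byte array.

INCOMPLETE_HASHABLE = 0x1  # hypothetical flag name mirroring the C side


def tag(addr: int, incomplete: bool) -> int:
    """Pack the state flag into the low bit of an aligned address."""
    assert addr & 0x1 == 0, "object pointers are at least 2-byte aligned"
    return addr | (INCOMPLETE_HASHABLE if incomplete else 0)


def untag(entry: int) -> tuple[int, bool]:
    """Recover the real address and the state flag from a tagged entry."""
    return entry & ~0x1, bool(entry & INCOMPLETE_HASHABLE)


addr = 0x7F3C2A100040  # example aligned address
assert untag(tag(addr, incomplete=True)) == (addr, True)
assert untag(tag(addr, incomplete=False)) == (addr, False)
```

The payoff is one allocation and one cache line instead of two: reading a ref entry yields both the object and its state in a single load.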
Misc/marshal-perf-diary.md records the full experiment ledger: each
idea tested in isolation, results, and the combined stack. Benchmark
harness is /tmp/marshal_bench_cpu_stable.py (200k loads x 11 repeats,
taskset -c 0, best-of-3 pinned-run median).
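The harness itself is not included in the PR; a minimal sketch of the described methodology (repeated `marshal.loads` batches, median per run) might look like the following. `bench_loads` and its parameters are illustrative, not the actual harness code:

```python
import marshal
import statistics
import time


def bench_loads(obj, n_loads=200_000, repeats=11):
    """Time `n_loads` marshal.loads calls per repeat; return the median repeat time."""
    data = marshal.dumps(obj)
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        for _ in range(n_loads):
            marshal.loads(data)
        times.append(time.perf_counter() - t0)
    return statistics.median(times)


if __name__ == "__main__":
    print(bench_loads((1, 2, "x")))
```

Per the description, each run would then be pinned to one core (e.g. `taskset -c 0 python bench.py` on Linux) and the best of three pinned runs taken.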
Combined deltas vs main on loads:
- small_tuple: 14.3% faster
- nested_dict: 6.9% faster
- code_obj: 6.8% faster

dumps is roughly flat to slightly faster. test_marshal passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Appends to Misc/marshal-perf-diary.md the results of the full test
suite rerun (48,932 tests pass, including the new RecursiveGraphTest
combinatorial cases) and a `pyperformance` comparison against main on
the same 10-benchmark marshal-adjacent slice the design doc used.
Significant results on the pyperformance slice:
- python_startup: 1.18x faster (t=59.80)
- python_startup_no_site: 1.03x faster (t=12.90)

All other slice benchmarks are within noise; no regressions.
Adds Misc/marshal-perf-data/ with the raw JSON backing every table in
the diary: all per-experiment microbench runs (exp0..exp9, expC, final)
plus the two pyperf-slice JSONs and a README describing the layout and
reproduction commands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@serhiy-storchaka @StanFromIreland - we're a long way from Kansas now, but I think the end result of this refactor is something much bigger: a 15-20% speedup in Python startup and a ~7-14% speedup in (edit: the startup speedup looks like a fixed 1-3 ms saving, so that percentage is for bare startup). I'm going to try running a few more complex tests, like some of my real apps or pipelines, through this build to see if I can find regressions, but I wanted to share. I have no experience with big CPython changes like this, but hopefully I didn't do anything too crazy here from an ABI or compatibility perspective. Update: Confirmed that
Records the outcome of an independent-library validation pass:
- dill 0.4.1 test suite (30 files): identical 29/30 pass on baseline and HEAD; the single failure is a pre-existing 3.15a8 incompatibility in dill's module-state serialization, unrelated to marshal.
- cloudpickle 3.1.2 test suite (upstream): 243/243 pass on both, identical skip/xfail breakdown.
- 1,601 marshal-adjacent stdlib tests (test_importlib, test_zipimport, test_compileall, test_py_compile, test_marshal) all pass on HEAD.
- compileall of CPython Lib/: +1.0% (within noise; dumps path untouched).
- Cold-import stress (56 stdlib modules, fresh subprocess): flat.
- Hypothesis fuzz (3,500 random round-trips including cyclic shapes through mutable bridges): zero correctness regressions; acyclic round-trip -10%, list self-cycle -24%, dict value self-cycle -40%.

Nothing in the third-party validation hints at a correctness or performance regression; several workloads that directly exercise the changed code path are measurably faster.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
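For context, the fuzzed "round-trip" property reduces to checking that marshalled bytes decode back to an equal object. A minimal acyclic example of that check (the payload shape here is invented for illustration, not taken from the fuzz corpus):

```python
import marshal

# Nested acyclic payload exercising several marshal-supported types.
payload = {
    "k": (1, 2.5, b"bytes", frozenset({3, 4})),
    "nested": [{"a": None}, "s"],
}

blob = marshal.dumps(payload)
assert marshal.loads(blob) == payload
```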
This is an experimental rewrite of `marshal` based on conversation with @serhiy-storchaka in #148652 and #148653. It includes a number of extra docs and data generators that are provided only for reference during discussion.
So far it's green for me on the test suite, but I need to dig in further, and I assume @serhiy-storchaka will have better intuition than me on any behavior or performance regressions.
In my first attempt, we hit minor single-digit performance regressions in the `loads` path, unsurprisingly concentrated in the complicated cases. A second attempt with improved performance is coming in an hour or so.
(edit: now faster than HEAD, with real performance and correctness gains)
Assisted by GPT-5.4 xhigh and Opus 4.7