-
-
Notifications
You must be signed in to change notification settings - Fork 29.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[subinterpreters] Make free lists and unicode caches per-interpreter #84701
Comments
tuple, dict and frame use free lists to optimize the creation of objects. Unicode uses "interned" strings to reduce the Python memory footprint and speedup dictionary lookups. Unicode also uses singletons for single letter Latin1 characters ([U+0000; U+00FF] range). All these optimizations are incompatible with isolated subinterpreters, since caches are currently shared by all inteprepreters. These caches should be made per-intepreter. See bpo-40512 "Meta issue: per-interpreter GIL" for the rationale. I already made small integer singletons per interpreter in bpo-38858: |
I wrote a draft PR to make interned strings per-interpreter. It does crash because it requires to make method cache and _PyUnicode_FromId() (bpo-39465) compatible with subinterpreters. |
Microbenchmark for tuple free list to measure PR 20247 overhead: microbench_tuple.py. It requires to apply bench_tuple.patch. |
This change contains an interesting fix:
Maybe "Fini" functions should disable free lists to prevent following code to add something to a free list, during Python finalization. |
bench_dict.patch: Microbenchmark on the C function PyDict_New() to measure the overhead of PR 20645. |
I'm worried about the performance impact of these changes, especially as many of the changes haven't been reviewed. Have you done any performance analysis or tests of the cumulative effect of all these changes? |
No. It would be interesting to measure that using pyperformance. |
I reopen the issue. This change caused a regression in attached interned_bug.py. Output: $ ./python interned_bug.py
Exception ignored deletion of interned string failed:
KeyError: 'out of memory'
python: Objects/unicodeobject.c:1946: unicode_dealloc: Assertion `Py_REFCNT(unicode) == 1' failed.
Abandon (core dumped) Running "import xml.parsers.expat" in a subinterpreter causes two issues when the subinterpreter completes:
The interned string is created in the main interpreter and so stored in the main interpreter interned dictionary. The string is stored in 2 dictionaries of pyexpat.errors dictionaries: >>> pyexpat.errors.messages[1]
'out of memory'
>>> pyexpat.errors.codes['out of memory']
1 When the subinterpreter clears pyexpat.errors and pyexpat.model dictionaries, the interned string is deleted: unicode_dealloc() is called. But unicode_dealloc() fails to delete the interned string in the subinterpreter interned dictionary. pyexpat.errors and pyexpat.model modules are cleared because they are stored as different names in sys.modules by Lib/xml/parsers/expat.py: sys.modules['xml.parsers.expat.model'] = model |
Fixed by: commit c8a87ad
|
Should Make dtoa bigint free list per-interpreter. static Bigint *bigint_freelist[Kmax+1]; -> _is { Bigint *bigint_freelist[Kmax+1]; } |
Hi Victor, I just noticed the change to dtoa.c in #69009. Please could you explain what the benefit of this change was? In general, we need to be very conservative with changes to dtoa.c: it's a complex, fragile, performance-critical piece of code, and ideally we'd like it not to diverge from the upstream code any more than it already has, in case we need to integrate bugfixes from upstream. It's feeling as though the normal Python development process is being bypassed here. As I understand it, this and similar changes are in aid of per-subinterpreter GILs. Has there been agreement from the core devs or steering council that this is a desirable goal? Should there be a PEP before more changes like this are made? (Or maybe there's already a PEP, that I missed? I know about PEP-554, but that PEP is explicit that GIL sharing is out of scope.) |
Mark Dickinson: "I just noticed the change to dtoa.c in #69009. Please could you explain what the benefit of this change was?" The rationale is explained in bpo-40512. The goal is to run multiple Python interpreters in parallel in the same process. dtoa.c had global variables shared by all interpreters without locking, so two intepreters could corrupt the freelist consistency. Mark Dickinson: "In general, we need to be very conservative with changes to dtoa.c: it's a complex, fragile, performance-critical piece of code, and ideally we'd like it not to diverge from the upstream code any more than it already has, in case we need to integrate bugfixes from upstream." I know that dtoa.c was copied from a third party project. But the commit 5bd1059 change change only makes sense in Python, I don't think that it would make sense to propose it upstream. dtoa.c _Py_dg_strtod() is called by:
_PyOS_ascii_strtod() is called PyOS_string_to_double() which is called by:
dtoa.c _Py_dg_dtoa() is called by:
PyOS_double_to_string() is called by:
I guess that the most important use case are float(str) and str(float). I wrote attached bench_dtoa.py to measure the effect on performance of the commit 5bd1059: $ python3 -m pyperf compare_to before.json after.json
float('0'): Mean +- std dev: [before] 80.5 ns +- 3.1 ns -> [after] 90.1 ns +- 3.6 ns: 1.12x slower
float('1.0'): Mean +- std dev: [before] 89.5 ns +- 4.3 ns -> [after] 97.2 ns +- 2.6 ns: 1.09x slower
float('340282366920938463463374607431768211455'): Mean +- std dev: [before] 480 ns +- 42 ns -> [after] 514 ns +- 13 ns: 1.07x slower
float('1044388881413152506691752710716624382579964249047383780384233483283953907971557456848826811934997558340890106714439262837987573438185793607263236087851365277945956976543709998340361590134383718314428070011855946226376318839397712745672334684344586617496807908705803704071284048740118609114467977783598029006686938976881787785946905630190260940599579453432823469303026696443059025015972399867714215541693835559885291486318237914434496734087811872639496475100189041349008417061675093668333850551032972088269550769983616369411933015213796825837188091833656751221318492846368125550225998300412344784862595674492194617023806505913245610825731835380087608622102834270197698202313169017678006675195485079921636419370285375124784014907159135459982790513399611551794271106831134090584272884279791554849782954323534517065223269061394905987693002122963395687782878948440616007412945674919823050571642377154816321380631045902916136926708342856440730447899971901781465763473223850267253059899795996090799469201774624817718449867455659250178329070473119433165550807568221846571746373296884912819520317457002440926616910874148385078411929804522981857338977648103126085903001302413467189726673216491511131602920781738033436090243804708340403154190335'): Mean +- std dev: [before] 717 ns +- 36 ns -> [after] 990 ns +- 27 ns: 1.38x slower
str(0.0): Mean +- std dev: [before] 113 ns +- 8 ns -> [after] 106 ns +- 4 ns: 1.06x faster
str(1.0): Mean +- std dev: [before] 141 ns +- 11 ns -> [after] 135 ns +- 17 ns: 1.05x faster
str(inf): Mean +- std dev: [before] 110 ns +- 11 ns -> [after] 98.9 ns +- 3.3 ns: 1.12x faster Benchmark hidden because not significant (1): str(3.402823669209385e+38) Geometric mean: 1.05x slower I built Python with "./configure --enable-optimizations --with-lto" on Fedora 33 (GCC 10.2.1). I didn't use CPU isolation. Oh, float(str) is between 1.09x slower and 1.38x slower. On the other side, str(float) is between 1.06x and 1.12x faster, I'm not sure why. I guess that the problem is that PGO+LTO build is not reproducible, GCC might prefer to optimize some functions or others depending on the PROFILE_TASK (Makefile.pre.in, command used by GCC profiler). Mark Dickinson: "It's feeling as though the normal Python development process is being bypassed here. As I understand it, this and similar changes are in aid of per-subinterpreter GILs. Has there been agreement from the core devs or steering council that this is a desirable goal? Should there be a PEP before more changes like this are made? (Or maybe there's already a PEP, that I missed? I know about PEP-554, but that PEP is explicit that GIL sharing is out of scope.)" Honestly, I didn't expect any significant impact on performance on the change. So I merged the PR as I merge other fixes for subinterpreters. It seems like I underestimated the number of Balloc/Bmalloc calls per float(str) or str(float) call. There is no PEP about running multiple Python interpreters in the same process. There is no consensus on this topic. I discussed it in private with some core devs, but that's not relevant here. My plan is to merge changes which have no significant impact on performances, and wait for a PEP for changes which have a significant impact on performances. Most changes fix bugs in subinterpreters which still share a GIL. This use case is not new and is supported for 10-20 years. For now, I will the commit 5bd1059. |
I reopen the issue to remind me that collections.deque() freelist is shared by all interpreters. |
Each deque instance now has its own free list. But dtoa.c still has a per-process cache, shared by all interpreters. |
[Victor Stinner]
FWIW, PyFloat_FromDouble() is the most performance critical function in floatobject.c. |
My commit ea25180 (interned strings) introduced bpo-46006 "[subinterpreter] _PyUnicode_EqualToASCIIId() issue with subinterpreters" regression. To unblock the Python 3.11.0a4 release, I just reverted the change. It reintroduces the issue, so I created bpo-46283: "[subinterpreters] Unicode interned strings must not be shared between interpreters". |
I don't have the bandwidth to fix the remaining issues, so I just close again the issue. |
bpo-40521 (pythonGH-84701) is now long solved.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: