Compact PyGC_Head #77778
Currently, PyGC_Head takes three words: gc_prev, gc_next, and gc_refcnt. gc_refcnt is used during collection, for trial deletion. So if we can avoid tracking/untracking during trial deletion, gc_prev and gc_refcnt can share the same memory. This idea reduces PyGC_Head to two words.
$ ./python -m perf compare_to master.json twogc.json -G --min-speed=2
Slower (3):
- scimark_monte_carlo: 268 ms +- 9 ms -> 278 ms +- 8 ms: 1.04x slower (+4%)
- fannkuch: 1.03 sec +- 0.02 sec -> 1.06 sec +- 0.02 sec: 1.03x slower (+3%)
- spectral_norm: 285 ms +- 9 ms -> 291 ms +- 6 ms: 1.02x slower (+2%)
Faster (13):
Benchmark hidden because not significant (44)
Interesting. Do you have any comparisons on memory footprint too?
This is an interesting idea. The other problem with the garbage collector is that it modifies the memory of all collectable objects. This causes virtually all memory pages to be copied (un-shared) after a fork, even if these objects are not used in the child. If gc_refcnt is used only when collecting, what if we allocated a linear array for them for that time? This would save less memory (only one word per object at the peak), but would avoid modifying the memory of objects that are not collected (except pointers at the ends of the generation lists and neighbors of collected objects).
$ ./python-gc -c 'import asyncio,sys; sys._debugmallocstats()'
master:  # bytes in allocated blocks = 4,011,368
patched: # bytes in allocated blocks = 3,852,432
PHP implemented a similar idea recently. In short, each tracked object has only an "index" into a GC struct array, not a "pointer". I tried to copy it, but there were some challenges.
This was also my first time hacking on the GC, so I gave up on the PHP way and chose the easier way. Anyway, we have gc.freeze() now, which can be used to avoid copy-on-write after fork.
On 22/05/2018 at 17:31, INADA Naoki wrote:
Thanks. You can also collect peak memory stats during the benchmark.
In the Doc folder: make clean
master:
113.15user 0.41system 1:55.46elapsed 98%CPU (0avgtext+0avgdata
111.07user 0.44system 1:51.72elapsed 99%CPU (0avgtext+0avgdata 205052maxresident)k
patched:
109.92user 0.44system 1:50.43elapsed 99%CPU (0avgtext+0avgdata 195832maxresident)k
110.70user 0.40system 1:51.50elapsed 99%CPU (0avgtext+0avgdata 195516maxresident)k
It seems to reduce the memory footprint by about 5%, and the performance difference is very small.
This is such a great idea. +1 from me.
Are you sure that all memory allocators align on at least 8 bytes (leaving 3 unused low bits)? I don't know the answer; it's an open question.
If they don't, then a simple double array will end up unaligned. It's not impossible, but extremely unlikely.
Ok, I ran a subset of the benchmarks to record their memory footprint and got these results:
master-mem.perf: Performance version 0.6.2
methane-mem.perf: Performance version 0.6.2
[results table not preserved in the migration]
I'm not sure that the code tracking the memory usage in performance works :-) It may be worth double-checking the code. By the way, perf has a --tracemalloc option, but performance doesn't have it. perf has two options, --track-memory and --tracemalloc, see the doc: http://perf.readthedocs.io/en/latest/runner.html#misc perf has different implementations to track the memory usage.
In all three cases, perf saves the *peak* of the memory usage.
Why wouldn't it? It certainly gives believable numbers (see above). And it seems to defer to your own "perf" tool anyway.
I don't think tracemalloc is a good tool to compare memory usage, as it comes with its own overhead. Also it won't account for issues such as memory fragmentation.
Well, yes, that's the point, thank you.
I wrote the code and never seriously tested it; that's why I have doubts :-) Maybe it works, I just suggest double-checking the code ;-)
As I said, the code just defers to "perf". And you should have tested that :-) I'm not really interested in checking it. All I know is that the very old code (inherited from Unladen Swallow) did memory tracking correctly. And I have no reason to disbelieve the numbers I posted. |
--track-memory and --tracemalloc have no unit tests; it's on the perf TODO list ;-) Well, it was just a remark. I'm looking for new contributors for perf!
Even with a smaller benefit, the idea looks worthwhile to me. Is the PR ready for review?
I think so. |
I asked if this change breaks the stable ABI. Steve Dower replied: "Looks like it breaks the 3.7 ABI, which is certainly not allowed at this time. But it's not a limited API structure, so no problem for 3.8." https://mail.python.org/pipermail/python-dev/2018-May/153745.html

I didn't understand the answer. It breaks the ABI but it doesn't break the API? It seems like PyObject.ob_refcnt is part of the "Py_LIMITED_API", and so an extension module using the stable API/ABI can access the field directly with no function call. For example, Py_INCREF() modifies the field directly at the ABI level.

*But* PyGC_Head is a strange thing since it's stored "before" the usual PyObject* pointer, so fields starting at the PyObject* address are not affected by this change, no? Hopefully, PyGC_Head seems to be excluded from PyGC_Head, and so it seems like the PR 7043 doesn't break the stable *ABI*. Can someone please confirm my analysis?
On Wed, May 30, 2018 at 7:14 PM STINNER Victor <report@bugs.python.org> wrote:
It breaks ABI, but it is not part of the "stable" ABI.
I think so.
s/from PyGC_Head/from PyObject/
"Hopefully, PyGC_Head seems to be excluded from PyGC_Head, and so it seems like the PR 7043 doesn't break the stable *ABI*." Oops, I mean: PyGC_Head seems to be excluded *from the Py_LIMITED_API*.
Here is a micro-benchmark of GC overhead:
$ ./python -m timeit -s "import gc, doctest, ftplib, asyncio, email, http.client, pydoc, pdb, fractions, decimal, difflib, textwrap, statistics, shutil, shelve, lzma, concurrent.futures, telnetlib, smtpd, tkinter.tix, trace, distutils, pkgutil, tabnanny, pickletools, dis, argparse" "gc.collect()"
100 loops, best of 5: 2.41 msec per loop
$ ./python -m timeit -s "import gc, doctest, ftplib, asyncio, email, http.client, pydoc, pdb, fractions, decimal, difflib, textwrap, statistics, shutil, shelve, lzma, concurrent.futures, telnetlib, smtpd, tkinter.tix, trace, distutils, pkgutil, tabnanny, pickletools, dis, argparse" "gc.collect()"
100 loops, best of 5: 2.52 msec per loop

So it's a 4% slowdown, but GC runs themselves are a minor fraction of usual programs' runtime, so I'm not sure that matters. Though it would be better to test on an actual GC-heavy application.
I added one micro-optimization.
$ ./python -m perf timeit --compare-to ./python-master -s "import gc, doctest, ftplib, asyncio, email, http.client, pydoc, pdb, fractions, decimal, difflib, textwrap, statistics, shutil, shelve, lzma, concurrent.futures, telnetlib, smtpd, trace, distutils, pkgutil, tabnanny, pickletools, dis, argparse" "gc.collect()"
python-master: ..................... 1.66 ms +- 0.08 ms
python: ..................... 1.58 ms +- 0.00 ms
Mean +- std dev: [python-master] 1.66 ms +- 0.08 ms -> [python] 1.58 ms +- 0.00 ms: 1.05x faster (-5%)
Oops, this optimization broke the trace module.
$ ./python-patched -m perf timeit --compare-to ./python-master -s "import gc, doctest, ftplib, asyncio, email, http.client, pydoc, pdb, fractions, decimal, difflib, textwrap, statistics, shutil, shelve, lzma, concurrent.futures, telnetlib, smtpd, trace, distutils, pkgutil, tabnanny, pickletools, dis, argparse" "gc.collect()"
python-master: ..................... 1.63 ms +- 0.04 ms
python-patched: ..................... 1.64 ms +- 0.01 ms
Mean +- std dev: [python-master] 1.63 ms +- 0.04 ms -> [python-patched] 1.64 ms +- 0.01 ms: 1.01x slower (+1%)
There are also problems with func_name and func_qualname. func_name can be an instance of a str subclass and have a __dict__, and it seems func_qualname can be of any type. Even if they were exact strings, excluding them from tp_traverse would break code that calculates the total memory consumed by a set of objects by calling sys.getsizeof() recursively.
You're right. I'll revert the optimization completely... |
It seems like PR 7043 has no big impact on performance. Sometimes it's a little faster, sometimes a little slower. The trend leans slightly to the "faster" side, but it's not very obvious. I approved PR 7043. It shouldn't break the world; it might only break very specialized Python extensions that touch very low-level details of the Python implementation. The performance seems ok, and it reduces the memory footprint, which is a very good thing!
Thanks for the review. Do you think the buildbots for the master branch are sound enough to commit this change, or should I wait one more week?
CIs on master are stable again. Since PyGC_Head is a key part of Python, I would suggest you wait for at least a second approval from another core dev.
Would you mind mentioning your optimization in What's New in Python 3.8? IMHO any enhancement of the memory footprint should be documented :-)
Thank you! |