
Compact PyGC_Head #77778


Closed
methane opened this issue May 22, 2018 · 34 comments
Labels
3.8 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@methane
Member

methane commented May 22, 2018

BPO 33597
Nosy @pitrou, @vstinner, @methane, @serhiy-storchaka, @1st1, @grimreaper
PRs
  • bpo-33597: Reduce PyGC_Head size #7043
  • bpo-33597: Add What's New for PyGC_Head change #8236
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2018-07-10.08:20:16.313>
    created_at = <Date 2018-05-22.07:06:22.566>
    labels = ['interpreter-core', '3.8', 'performance']
    title = 'Compact PyGC_Head'
    updated_at = <Date 2018-07-11.10:17:58.856>
    user = 'https://github.com/methane'

    bugs.python.org fields:

    activity = <Date 2018-07-11.10:17:58.856>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-07-10.08:20:16.313>
    closer = 'methane'
    components = ['Interpreter Core']
    creation = <Date 2018-05-22.07:06:22.566>
    creator = 'methane'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 33597
    keywords = ['patch']
    message_count = 34.0
    messages = ['317262', '317295', '317297', '317298', '317299', '317301', '317302', '317306', '317313', '318027', '318028', '318051', '318078', '318081', '318083', '318084', '318086', '318152', '318154', '318162', '318181', '318186', '318321', '318354', '318362', '318366', '318367', '318387', '318399', '318400', '321366', '321402', '321415', '321430']
    nosy_count = 6.0
    nosy_names = ['pitrou', 'vstinner', 'methane', 'serhiy.storchaka', 'yselivanov', 'eitan.adler']
    pr_nums = ['7043', '8236']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'resource usage'
    url = 'https://bugs.python.org/issue33597'
    versions = ['Python 3.8']

    @methane
    Member Author

    methane commented May 22, 2018

    Currently, PyGC_Head takes three words: gc_prev, gc_next, and gc_refcnt.

    gc_refcnt is used during collection, for trial deletion.
    gc_prev is used for tracking and untracking.

    So if we can avoid tracking/untracking during trial deletion, gc_prev and gc_refcnt can share the same memory space.

    This idea reduces PyGC_Head size to two words.
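The sharing trick can be sketched as a toy Python model (a hedged sketch, not CPython's actual code; the real header uses two `uintptr_t` fields and keeps flag bits in the low bits of `_gc_prev`):

```python
# Toy model of the two-word GC header: during normal operation _gc_prev
# holds the previous list node; during a collection the same slot holds
# the trial-deletion refcount, shifted past two reserved flag bits.
_PREV_SHIFT = 2  # low two bits of _gc_prev are reserved for flags

class GCHead:
    __slots__ = ("_gc_next", "_gc_prev")

    def __init__(self):
        self._gc_next = None
        self._gc_prev = None  # prev link OR packed refcount, never both at once

    # Tracking phase: _gc_prev is a link in the doubly linked generation list.
    def set_prev(self, node):
        self._gc_prev = node

    # Collection phase: tracking/untracking is forbidden, so _gc_prev can be
    # reused to store gc_refs for trial deletion.
    def set_refs(self, refs):
        self._gc_prev = refs << _PREV_SHIFT

    def get_refs(self):
        return self._gc_prev >> _PREV_SHIFT

head = GCHead()
head.set_refs(5)
assert head.get_refs() == 5
```

The key invariant is that the two uses of the slot never overlap in time, which is exactly what "avoid tracking/untracking while trial deletion" buys.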

    @methane methane added 3.8 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels May 22, 2018
    @methane
    Member Author

    methane commented May 22, 2018

    $ ./python -m perf compare_to master.json twogc.json -G  --min-speed=2
    Slower (3):
    - scimark_monte_carlo: 268 ms +- 9 ms -> 278 ms +- 8 ms: 1.04x slower (+4%)
    - fannkuch: 1.03 sec +- 0.02 sec -> 1.06 sec +- 0.02 sec: 1.03x slower (+3%)
    - spectral_norm: 285 ms +- 9 ms -> 291 ms +- 6 ms: 1.02x slower (+2%)

    Faster (13):

    - unpickle_pure_python: 980 us +- 13 us -> 886 us +- 11 us: 1.11x faster (-10%)
    - pickle_dict: 71.9 us +- 6.8 us -> 67.2 us +- 0.4 us: 1.07x faster (-7%)
    - mako: 40.5 ms +- 1.1 ms -> 38.7 ms +- 0.4 ms: 1.05x faster (-5%)
    - scimark_lu: 396 ms +- 18 ms -> 381 ms +- 16 ms: 1.04x faster (-4%)
    - genshi_text: 89.3 ms +- 1.2 ms -> 86.3 ms +- 1.2 ms: 1.03x faster (-3%)
    - xml_etree_generate: 270 ms +- 5 ms -> 262 ms +- 5 ms: 1.03x faster (-3%)
    - regex_dna: 286 ms +- 1 ms -> 279 ms +- 1 ms: 1.03x faster (-3%)
    - scimark_sor: 511 ms +- 6 ms -> 497 ms +- 10 ms: 1.03x faster (-3%)
    - xml_etree_iterparse: 220 ms +- 6 ms -> 214 ms +- 5 ms: 1.03x faster (-3%)
    - xml_etree_process: 231 ms +- 4 ms -> 225 ms +- 4 ms: 1.02x faster (-2%)
    - genshi_xml: 191 ms +- 2 ms -> 187 ms +- 2 ms: 1.02x faster (-2%)
    - unpack_sequence: 131 ns +- 2 ns -> 129 ns +- 2 ns: 1.02x faster (-2%)
    - richards: 180 ms +- 4 ms -> 176 ms +- 2 ms: 1.02x faster (-2%)

    Benchmark hidden because not significant (44)

    @pitrou
    Member

    pitrou commented May 22, 2018

    Interesting. Do you have any comparisons on memory footprint too?

    @serhiy-storchaka
    Member

    This is an interesting idea.

    The other problem with garbage collection is that it modifies the memory of all collectable objects. This leads to duplicating (un-sharing) virtually all memory pages after a fork, even if those objects are never used in the child.

    If gc_refcnt is used only during collection, what about allocating a linear array for those refcounts just for that time? This would save less memory (only one word per object at the peak), but it would avoid modifying the memory of objects that are not collected (except the pointers at the ends of the generation lists and the neighbors of collected objects).
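The side-array variant could look roughly like this (a sketch under stated assumptions: object graphs are modeled as index lists, not real PyObjects, and only the trial-deletion phase is shown):

```python
# Sketch of the side-array idea: trial-deletion refcounts live in a
# temporary array allocated only for the duration of a collection, so the
# collector never writes into the tracked objects' own headers (which is
# friendlier to copy-on-write pages after fork()).

def find_unreachable(refcount, internal_refs):
    """refcount[i]: total refcount of object i.
    internal_refs[i]: indices of objects referenced by object i.
    Returns the indices unreachable from outside the tracked set."""
    gc_refs = list(refcount)      # the temporary side array (one word/object)
    for refs in internal_refs:    # subtract references originating inside
        for j in refs:
            gc_refs[j] -= 1
    # Objects with gc_refs > 0 are referenced from outside; anything they
    # reach (transitively) is also reachable.
    stack = [i for i, r in enumerate(gc_refs) if r > 0]
    reachable = set(stack)
    while stack:
        i = stack.pop()
        for j in internal_refs[i]:
            if j not in reachable:
                reachable.add(j)
                stack.append(j)
    return set(range(len(refcount))) - reachable

# A two-object cycle with no external references is garbage:
assert find_unreachable([1, 1], [[1], [0]]) == {0, 1}
# The same cycle held by one external reference is kept:
assert find_unreachable([2, 1], [[1], [0]]) == set()
```

Note how all scratch state lives in `gc_refs`, so the objects' own memory stays untouched during the subtraction pass.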

    @methane
    Member Author

    methane commented May 22, 2018

    $ ./python-gc -c 'import asyncio,sys; sys._debugmallocstats()'

    master:

    # bytes in allocated blocks = 4,011,368
    # bytes in available blocks = 136,640
    50 unused pools * 4096 bytes = 204,800
    # bytes lost to pool headers = 49,824
    # bytes lost to quantization = 53,816
    # bytes lost to arena alignment = 0
    Total = 4,456,448

    patched:

    # bytes in allocated blocks = 3,852,432
    # bytes in available blocks = 132,664
    27 unused pools * 4096 bytes = 110,592
    # bytes lost to pool headers = 47,856
    # bytes lost to quantization = 50,760
    # bytes lost to arena alignment = 0
    Total = 4,194,304

    @methane
    Member Author

    methane commented May 22, 2018

    @serhiy

    PHP implemented a similar idea recently.
    https://react-etc.net/entry/improvements-to-garbage-collection-gc-php-7-3-boosts-performance-in-benchmark

    In short, each tracked object holds only an "index" into a GC struct array, not a "pointer" to a GC struct.
    The GC structs live in an array, which can be resized.

    I tried to copy it, but there are some challenges:

    • _PyObject_GC_TRACK() may need to resize the GC array, so it can fail with MemoryError. That's not API compatible.
    • PHP's GC is not generational. This design may slow down moving objects between generations.
    • We'd need one word (the index) in the object header and two words (refcnt and a pointer back to the object) in the GC struct. That means we could reduce the memory footprint only for untracked dicts and tuples.

    This is also my first GC hack, so I gave up on the PHP approach and chose the easier way.

    Anyway, we now have gc.freeze(), which can be used to avoid CoW after fork.
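The index-based design being described can be sketched as a toy registry (hypothetical names; PHP's real implementation differs in the details):

```python
# Toy model of PHP-style GC tracking: each object stores only an index into
# a resizable array of GC entries, not a pointer to a GC header.

class GCRegistry:
    def __init__(self):
        self.entries = []   # per-object GC state; resizable, so it can move
        self.free = []      # recycled slots from untracked objects

    def track(self, obj):
        # In C, growing this array can fail, which is why an index-based
        # _PyObject_GC_TRACK() could raise MemoryError (the API-compatibility
        # problem mentioned above).
        if self.free:
            idx = self.free.pop()
            self.entries[idx] = obj
        else:
            idx = len(self.entries)
            self.entries.append(obj)
        return idx          # the object header stores only this index

    def untrack(self, idx):
        self.entries[idx] = None
        self.free.append(idx)

reg = GCRegistry()
a = reg.track("obj-a")
b = reg.track("obj-b")
reg.untrack(a)
assert reg.track("obj-c") == a  # the freed slot is recycled
```

Because entries can be relocated on resize, everything must go through the index, which is also why moving objects between generations gets more expensive than with intrusive linked lists.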

    @pitrou
    Member

    pitrou commented May 22, 2018

    $ ./python-gc -c 'import asyncio,sys; sys._debugmallocstats()'

    Thanks. You can also collect peak memory stats during the benchmark
    suite, though the numbers may be approximate.

    @methane
    Member Author

    methane commented May 22, 2018

    In Doc folder:

    make clean
    make PYTHON=../python venv
    /usr/bin/time make html

    master:

    113.15user 0.41system 1:55.46elapsed 98%CPU (0avgtext+0avgdata 205472maxresident)k
    18800inputs+223544outputs (1major+66066minor)pagefaults 0swaps

    111.07user 0.44system 1:51.72elapsed 99%CPU (0avgtext+0avgdata 205052maxresident)k
    0inputs+223376outputs (0major+65855minor)pagefaults 0swaps

    patched:

    109.92user 0.44system 1:50.43elapsed 99%CPU (0avgtext+0avgdata 195832maxresident)k
    0inputs+223376outputs (0major+63206minor)pagefaults 0swaps

    110.70user 0.40system 1:51.50elapsed 99%CPU (0avgtext+0avgdata 195516maxresident)k
    0inputs+223376outputs (0major+62723minor)pagefaults 0swaps

    It seems the memory footprint is reduced by about 5%, and the performance difference is very small.

    @1st1
    Member

    1st1 commented May 22, 2018

    This is such a great idea. +1 from me.

    @vstinner
    Member

    Are you sure that all memory allocators align on at least 8 bytes (which would give us 3 unused low bits)? I don't know the answer; it's an open question.

    @pitrou
    Member

    pitrou commented May 29, 2018

    Are you sure that all memory allocators align at least on 8 bytes (to give up 3 unused bits)?

    If they don't then a simple double array will end up unaligned. It's not impossible but extremely unlikely.
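On CPython this assumption is easy to probe empirically, since `id()` returns the object's address (a quick sketch, not a guarantee for every allocator on every platform):

```python
# Check the alignment assumption: object addresses should be multiples of 8,
# leaving the low 3 bits of every pointer free for flags.
objs = [object(), [], {}, (1, 2), "spam", 3.14]
assert all(id(o) % 8 == 0 for o in objs)

# The tag-bit trick itself: stash flags in the low bits of an address and
# mask them off again to recover the original pointer.
FLAG_MASK = 0b111
addr = id(objs[0])
tagged = addr | 0b01                 # set a flag bit
assert tagged & ~FLAG_MASK == addr   # the pointer survives untagging
assert tagged & FLAG_MASK == 0b01    # the flag is recoverable
```

If an allocator ever returned an address with a nonzero low bit, the first assertion would fail, which is the scenario Antoine argues is extremely unlikely (it would also misalign a plain `double` array).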

    @pitrou
    Member

    pitrou commented May 29, 2018

    Ok, I ran a subset of the benchmarks to record their memory footprint and got these results:

    master-mem.perf
    ===============

    Performance version: 0.6.2
    Python version: 3.8.0a0 (64-bit) revision 73cbe7a
    Report on Linux-4.15.0-22-generic-x86_64-with-glibc2.9
    Number of logical CPUs: 16
    Start date: 2018-05-29 16:54:24.964539
    End date: 2018-05-29 17:04:05.608437

    methane-mem.perf
    ================

    Performance version: 0.6.2
    Python version: 3.8.0a0 (64-bit) revision 16c66ca
    Report on Linux-4.15.0-22-generic-x86_64-with-glibc2.9
    Number of logical CPUs: 16
    Start date: 2018-05-29 17:15:00.936857
    End date: 2018-05-29 17:24:33.755119

    +------------------------+-----------------+------------------+---------------+------------------------+
    | Benchmark | master-mem.perf | methane-mem.perf | Change | Significance |
    +========================+=================+==================+===============+========================+
    | 2to3 | 7121.8 kB | 6944.4 kB | 1.03x smaller | Significant (t=31.59) |
    +------------------------+-----------------+------------------+---------------+------------------------+
    | chameleon | 15.7 MB | 15.3 MB | 1.02x smaller | Significant (t=24.25) |
    +------------------------+-----------------+------------------+---------------+------------------------+
    | html5lib | 20.1 MB | 19.4 MB | 1.04x smaller | Significant (t=67.15) |
    +------------------------+-----------------+------------------+---------------+------------------------+
    | json_dumps | 8282.2 kB | 8021.4 kB | 1.03x smaller | Significant (t=234.01) |
    +------------------------+-----------------+------------------+---------------+------------------------+
    | json_loads | 7671.2 kB | 7494.0 kB | 1.02x smaller | Significant (t=37.43) |
    +------------------------+-----------------+------------------+---------------+------------------------+
    | pickle | 7883.4 kB | 7698.6 kB | 1.02x smaller | Significant (t=28.64) |
    +------------------------+-----------------+------------------+---------------+------------------------+
    | pickle_pure_python | 7867.4 kB | 7680.0 kB | 1.02x smaller | Significant (t=48.64) |
    +------------------------+-----------------+------------------+---------------+------------------------+
    | sqlalchemy_declarative | 18.4 MB | 17.9 MB | 1.03x smaller | Significant (t=132.47) |
    +------------------------+-----------------+------------------+---------------+------------------------+
    | sqlalchemy_imperative | 17.8 MB | 17.3 MB | 1.03x smaller | Significant (t=128.94) |
    +------------------------+-----------------+------------------+---------------+------------------------+
    | tornado_http | 24.5 MB | 24.0 MB | 1.02x smaller | Significant (t=7.99) |
    +------------------------+-----------------+------------------+---------------+------------------------+
    | unpickle | 7883.2 kB | 7694.2 kB | 1.02x smaller | Significant (t=31.19) |
    +------------------------+-----------------+------------------+---------------+------------------------+
    | unpickle_pure_python | 7891.6 kB | 7699.4 kB | 1.02x smaller | Significant (t=63.68) |
    +------------------------+-----------------+------------------+---------------+------------------------+
    | xml_etree_generate | 12.0 MB | 11.5 MB | 1.04x smaller | Significant (t=26.07) |
    +------------------------+-----------------+------------------+---------------+------------------------+
    | xml_etree_iterparse | 11.8 MB | 11.3 MB | 1.04x smaller | Significant (t=146.82) |
    +------------------------+-----------------+------------------+---------------+------------------------+
    | xml_etree_parse | 11.6 MB | 11.2 MB | 1.03x smaller | Significant (t=116.08) |
    +------------------------+-----------------+------------------+---------------+------------------------+
    | xml_etree_process | 12.3 MB | 12.0 MB | 1.02x smaller | Significant (t=49.04) |
    +------------------------+-----------------+------------------+---------------+------------------------+

    @vstinner
    Member

    Ok, I ran a subset of the benchmarks to record their memory footprint and got these results:

    I'm not sure that the code tracking memory usage in performance works :-) It may be worth double-checking the code. By the way, perf has a --tracemalloc option, but performance doesn't have it.

    perf has two options: --track-memory and --tracemalloc, see the doc:

    http://perf.readthedocs.io/en/latest/runner.html#misc

    perf has different implementations to track the memory usage:

    • resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024
    • Windows: GetProcessMemoryInfo()
    • tracemalloc: tracemalloc.get_traced_memory()[1]

    In all 3 cases, perf saves the *peak* memory usage.
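A minimal version of the first technique (peak RSS via getrusage) might look like this (a sketch; `resource` is Unix-only and the unit of `ru_maxrss` is platform-specific):

```python
import resource
import sys

def peak_rss_bytes():
    # ru_maxrss is the peak resident set size: kilobytes on Linux,
    # bytes on macOS — hence the "* 1024" in perf's Linux code path.
    usage = resource.getrusage(resource.RUSAGE_SELF)
    factor = 1 if sys.platform == "darwin" else 1024
    return usage.ru_maxrss * factor

before = peak_rss_bytes()
payload = [bytes(1024) for _ in range(10_000)]  # allocate roughly 10 MB
after = peak_rss_bytes()
assert after >= before  # the reported peak can only grow
```

Being a high-water mark, this measure captures fragmentation and allocator overhead that tracemalloc would miss, which is the point Antoine makes below.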

    @pitrou
    Member

    pitrou commented May 29, 2018

    I'm not sure that the code tracking the memory usage in performance works

    Why wouldn't it? It certainly gives believable numbers (see above).

    And it seems to defer to your own "perf" tool anyway.

    perf has two options: --track-memory and --tracemalloc, see the doc:

    I don't think tracemalloc is a good tool to compare memory usage, as it comes with its own overhead. Also it won't account for issues such as memory fragmentation.

    In the 3 cases, perf saves the *peak* of the memory usage.

    Well, yes, that's the point, thank you.

    @vstinner
    Member

    Why wouldn't it? It certainly gives believable numbers (see above).
    And it seems to defer to your own "perf" tool anyway.

    I wrote the code and I never seriously tested it, that's why I have doubts :-) Maybe it works, I just suggest to double check the code ;-)

    @pitrou
    Member

    pitrou commented May 29, 2018

    As I said, the code just defers to "perf". And you should have tested that :-)

    I'm not really interested in checking it. All I know is that the very old code (inherited from Unladen Swallow) did memory tracking correctly. And I have no reason to disbelieve the numbers I posted.

    @vstinner
    Member

    As I said, the code just defers to "perf". And you should have tested that :-)

    --track-memory and --tracemalloc have no unit tests, it's in the perf TODO list ;-) Well, it was just a remark. I'm looking for new contributors for perf!

    @serhiy-storchaka
    Member

    Even with a smaller benefit, the idea looks worthwhile to me. Is the PR already ready for review?

    @methane
    Member Author

    methane commented May 30, 2018

    Is the PR already ready for review?

    I think so.

    @vstinner
    Member

    I asked if this change breaks the stable ABI. Steve Dower replied:

    "Looks like it breaks the 3.7 ABI, which is certainly not allowed at this time. But it’s not a limited API structure, so no problem for 3.8."

    https://mail.python.org/pipermail/python-dev/2018-May/153745.html

    I didn't understand the answer. It breaks the ABI but it doesn't break the API?

    It seems like PyObject.ob_refcnt is part of the "Py_LIMITED_API", so an extension module using the stable API/ABI can access the field directly, with no function call. For example, Py_INCREF() modifies the field directly at the ABI level.

    *But* PyGC_Head is a strange thing since it's stored "before" the usual PyObject* pointer, so fields starting at PyObject* address are not affected by this change, no?

    Hopefully, PyGC_Head seems to be excluded from PyGC_Head, and so it seems like the PR 7043 doesn't break the stable *ABI*.

    Can someone please confirm my analysis?
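The layout Victor describes can be sketched with plain pointer arithmetic (made-up addresses; in C the conversions are the AS_GC/FROM_GC-style macros in Modules/gcmodule.c):

```python
# Toy model of the header placement: the PyGC_Head sits immediately *before*
# the PyObject in memory, so converting between the two is pure pointer
# arithmetic over the header size.
GC_HEAD_SIZE = 2 * 8  # two words on a 64-bit build, after this change

def as_gc(op_addr):    # PyObject* -> PyGC_Head*
    return op_addr - GC_HEAD_SIZE

def from_gc(gc_addr):  # PyGC_Head* -> PyObject*
    return gc_addr + GC_HEAD_SIZE

addr = 0x7F00DEADBEE0  # a made-up object address
assert from_gc(as_gc(addr)) == addr
# Everything at or after the PyObject* address (ob_refcnt, ob_type, ...)
# keeps its offset, which is why shrinking the header does not move any
# field that stable-ABI extensions can see.
```

Only code that does this header arithmetic itself (i.e., the GC internals and very low-level extensions) observes the size change.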

    @methane
    Member Author

    methane commented May 30, 2018

    I asked if this change breaks the stable ABI. Steve Dower replied:

    "Looks like it breaks the 3.7 ABI, which is certainly not allowed at this
    time. But it’s not a limited API structure, so no problem for 3.8."

    https://mail.python.org/pipermail/python-dev/2018-May/153745.html

    I didn't understand the answer. It breaks the ABI but it doesn't break
    the API?

    It breaks ABI, but it is not part of the "stable" ABI.
    So we can't commit it on 3.7, but we can commit it on 3.8.

    It seems like PyObject.ob_refcnt is part of the "Py_LIMITED_API" and so
    an extension module using the stable API/ABI can access directly the field
    with no function call. For example, Py_INCREF() modifies directly the field
    at the ABI level.

    *But* PyGC_Head is a strange thing since it's stored "before" the usual
    PyObject* pointer, so fields starting at PyObject* address are not affected
    by this change, no?

    I think so.

    Hopefully, PyGC_Head seems to be excluded from PyGC_Head, and so it seems
    like the PR 7043 doesn't break the stable *ABI*.

    s/from PyGC_Head/from PyObject/
    I think so.

    @vstinner
    Member

    "Hopefully, PyGC_Head seems to be excluded from PyGC_Head, and so it seems like the PR 7043 doesn't break the stable *ABI*."

    Oops, I mean: PyGC_Head seems to be excluded *from the Py_LIMITED_API*.

    @pitrou
    Member

    pitrou commented May 31, 2018

    Here is a micro-benchmark of GC overhead:

    • before:
    $ ./python -m timeit -s "import gc, doctest, ftplib, asyncio, email, http.client, pydoc, pdb, fractions, decimal, difflib, textwrap, statistics, shutil, shelve, lzma, concurrent.futures, telnetlib, smtpd, tkinter.tix, trace, distutils, pkgutil, tabnanny, pickletools, dis, argparse" "gc.collect()"
    100 loops, best of 5: 2.41 msec per loop
    • after:
    $ ./python -m timeit -s "import gc, doctest, ftplib, asyncio, email, http.client, pydoc, pdb, fractions, decimal, difflib, textwrap, statistics, shutil, shelve, lzma, concurrent.futures, telnetlib, smtpd, tkinter.tix, trace, distutils, pkgutil, tabnanny, pickletools, dis, argparse" "gc.collect()"
    100 loops, best of 5: 2.52 msec per loop

    So it's a 4% slowdown, but GC runs themselves are a minor fraction of usual programs' runtime, so I'm not sure that matters. Though it would be better to test on an actual GC-heavy application.

    @methane
    Member Author

    methane commented Jun 1, 2018

    053111f

    I added one micro-optimization.
    Although it is not directly related to the two-word GC, it makes gc.collect() faster than master on the microbenchmark (without importing tkinter.tix, because there is no GUI in my Linux environment).

    $ ./python -m perf timeit --compare-to ./python-master -s "import gc, doctest, ftplib, asyncio, email, http.client, pydoc, pdb, fractions, decimal, difflib, textwrap, statistics, shutil, shelve, lzma, concurrent.futures, telnetlib, smtpd, trace, distutils, pkgutil, tabnanny, pickletools, dis, argparse" "gc.collect()"
    python-master: ..................... 1.66 ms +- 0.08 ms
    python: ..................... 1.58 ms +- 0.00 ms

    Mean +- std dev: [python-master] 1.66 ms +- 0.08 ms -> [python] 1.58 ms +- 0.00 ms: 1.05x faster (-5%)

    @methane
    Member Author

    methane commented Jun 1, 2018

    Oops, this optimization broke the trace module.
    I reverted part of the optimization. The current benchmark is:

    $ ./python-patched -m perf timeit --compare-to ./python-master -s "import gc, doctest, ftplib, asyncio, email, http.client, pydoc, pdb, fractions, decimal, difflib, textwrap, statistics, shutil, shelve, lzma, concurrent.futures, telnetlib, smtpd, trace, distutils, pkgutil, tabnanny, pickletools, dis, argparse" "gc.collect()"
    python-master: ..................... 1.63 ms +- 0.04 ms
    python-patched: ..................... 1.64 ms +- 0.01 ms

    Mean +- std dev: [python-master] 1.63 ms +- 0.04 ms -> [python-patched] 1.64 ms +- 0.01 ms: 1.01x slower (+1%)

    @serhiy-storchaka
    Member

    There are also problems with func_name and func_qualname. func_name can be an instance of a str subclass and have a __dict__. func_qualname, it seems, can be of any type.

    Even if they were exact strings, excluding them from tp_traverse would break functions that calculate the total memory consumed by a set of objects by calling sys.getsizeof() recursively.
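The kind of recursive-sizeof tool Serhiy mentions can be sketched with gc.get_referents(), which exposes tp_traverse to Python; if tp_traverse skipped func_name/func_qualname, a tool like this would silently under-count:

```python
import gc
import sys

def total_size(*roots):
    """Approximate total memory of the objects reachable from *roots*,
    following tp_traverse via gc.get_referents()."""
    seen = set()
    stack = list(roots)
    total = 0
    while stack:
        obj = stack.pop()
        if id(obj) in seen:
            continue
        seen.add(id(obj))
        total += sys.getsizeof(obj)
        stack.extend(gc.get_referents(obj))
    return total

data = {"key": [1, 2, 3]}
# The recursive size includes the key string, the list, and its items:
assert total_size(data) > sys.getsizeof(data)
```

Objects without tp_traverse (e.g. ints, strings) simply report no referents, so the walk terminates on them.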

    @methane
    Member Author

    methane commented Jun 1, 2018

    You're right. I'll revert the optimization completely...

    @vstinner
    Member

    vstinner commented Jun 1, 2018

    It seems like PR 7043 has no big impact on performance. Sometimes it's a little bit faster, sometimes a little bit slower. The trend leans slightly to the "faster" side, but it's not very obvious.

    I approved PR 7043. It shouldn't break the world; it might only break very specialized Python extensions that touch very low-level details of the Python implementation. The performance seems OK, and it reduces the memory footprint, which is a very good thing!

    @methane
    Member Author

    methane commented Jun 1, 2018

    @victor

    Thanks for review.

    Do you think buildbots for master branch are sound enough to commit this change? Or should I wait one more week?
    http://buildbot.python.org/all/#/grid?branch=master

    @vstinner
    Member

    vstinner commented Jun 1, 2018

    Do you think buildbots for master branch are sound enough to commit this change? Or should I wait one more week?

    CIs on master are stable again. Since PyGC_Head is a key feature of Python, I would suggest waiting for at least a second approval from another core dev.

    @methane
    Member Author

    methane commented Jul 10, 2018

    New changeset 5ac9e6e by INADA Naoki in branch 'master':
    bpo-33597: Reduce PyGC_Head size (GH-7043)
    5ac9e6e

    @methane methane closed this as completed Jul 10, 2018
    @vstinner
    Member

    Would you mind mentioning your optimization in What's New in Python 3.8? IMHO any enhancement of the memory footprint should be documented :-)

    @methane
    Member Author

    methane commented Jul 11, 2018

    New changeset d5c875b by INADA Naoki in branch 'master':
    bpo-33597: Add What's New for PyGC_Head (GH-8236)
    d5c875b

    @vstinner
    Member

    bpo-33597: Add What's New for PyGC_Head (GH-8236)

    Thank you!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022