
gc.freeze() - an API to mark objects as uncollectable #75739

Closed
ambv opened this issue Sep 23, 2017 · 19 comments
Labels
3.7 interpreter-core performance

Comments

@ambv
Contributor

@ambv ambv commented Sep 23, 2017

BPO 31558
Nosy @tim-one, @warsaw, @nascheme, @rhettinger, @jcea, @pitrou, @vstinner, @benjaminp, @methane, @ambv, @ericsnowcurrently, @serhiy-storchaka, @1st1, @applio, @brainfvck
PRs
  • #3705
  • #4013
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2017-10-16.19:50:28.576>
    created_at = <Date 2017-09-23.00:21:29.490>
    labels = ['interpreter-core', '3.7', 'performance']
    title = 'gc.freeze() - an API to mark objects as uncollectable'
    updated_at = <Date 2018-06-22.23:46:56.456>
    user = 'https://github.com/ambv'

    bugs.python.org fields:

    activity = <Date 2018-06-22.23:46:56.456>
    actor = 'eric.snow'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-10-16.19:50:28.576>
    closer = 'lukasz.langa'
    components = ['Interpreter Core']
    creation = <Date 2017-09-23.00:21:29.490>
    creator = 'lukasz.langa'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 31558
    keywords = ['patch', 'needs review']
    message_count = 19.0
    messages = ['302780', '302790', '302831', '302832', '302967', '302969', '302972', '303285', '303836', '303841', '304176', '304191', '304192', '304194', '304196', '304203', '304481', '304482', '304483']
    nosy_count = 15.0
    nosy_names = ['tim.peters', 'barry', 'nascheme', 'rhettinger', 'jcea', 'pitrou', 'vstinner', 'benjamin.peterson', 'methane', 'lukasz.langa', 'eric.snow', 'serhiy.storchaka', 'yselivanov', 'davin', 'brainfvck']
    pr_nums = ['3705', '4013']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'resource usage'
    url = 'https://bugs.python.org/issue31558'
    versions = ['Python 3.7']

    @ambv
    Contributor Author

    @ambv ambv commented Sep 23, 2017

    When you're forking many worker processes off of a parent process, the resulting children are initially very cheap in memory. They share memory pages with the base process until a write happens [1]_.

    Sadly, the garbage collector in Python touches every object's PyGC_Head during a collection, even if that object stays alive, undoing all the copy-on-write wins. Instagram disabled the GC completely for this reason [2]_. This fixed the COW issue but made the processes more vulnerable to memory growth due to new cycles being silently introduced when the application code is changed by developers. While we could fix the most glaring cases, it was hard to keep the memory usage at bay. We came up with a different solution that fixes both issues. It requires a new API to be added to CPython's garbage collector.

    gc.freeze()

    As soon as possible in the lifecycle of the parent process we disable the garbage collector. Then we call a new API called gc.freeze() to move all currently tracked objects to a permanent generation. They won't be considered in further collections. This is okay since we are assuming that (almost?) all of the objects created until that point are module-level and thus useful for the entire lifecycle of the child process.

    After calling gc.freeze() we call fork. Then, the child process is free to re-enable the garbage collector.

Why do we need to disable the collector in the parent process as soon as possible? When the GC cleans up memory in the meantime, it leaves free space in pages for new objects. Those pages become shared after fork, and as soon as the child process starts creating its own objects, it will likely write into the shared pages, triggering a lot of copy-on-write activity.

    In other words, we're wasting a bit of memory in the shared pages to save a lot of memory later (that would otherwise be wasted on copying entire pages after forking).
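The workflow described above can be sketched as follows. This is a minimal illustration, not the exact Instagram setup; `warm_up()` is a hypothetical stand-in for real application start-up work. `gc.freeze()` is the API this issue adds in Python 3.7.

```python
import gc
import os

def warm_up():
    # Hypothetical stand-in for application start-up work
    # (importing modules, building caches, etc.).
    return [{"id": i} for i in range(10_000)]

# 1. Disable the collector as early as possible, so freed slots don't
#    leave holes in pages that children would later allocate into.
gc.disable()
state = warm_up()

# 2. Move every currently tracked object into the permanent generation;
#    the collector will never write to its PyGC_Head again.
gc.freeze()

# 3. Fork; each child may re-enable collection, which then only
#    considers objects created after the freeze.
pid = os.fork()
if pid == 0:
    gc.enable()
    os._exit(0)
os.waitpid(pid, 0)
```

The parent deliberately leaves the collector disabled and the baseline frozen for the rest of its life, since any collection there would dirty the shared pages.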

    Other attempts
    --------------

    We also tried moving the GC head to another place in memory. This creates some indirection but cache locality on that segment is great so performance isn't really hurt. However, this change introduces two new pointers (16 bytes) per object. This doesn't sound like a lot but given millions of objects and tens of processes per box, this alone can cost hundreds of megabytes per host. Memory that we wanted to save in the first place. So that idea was scrapped.

    Attribution
    -----------

    The original patch is by Zekun Li, with help from Jiahao Li, Matt Page, David Callahan, Carl S. Shapiro, and Chenyang Wu.

    .. [1] https://en.wikipedia.org/wiki/Copy-on-write
    .. [2] https://engineering.instagram.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172

    @ambv ambv added 3.7 performance interpreter-core labels Sep 23, 2017
    @benjaminp
    Contributor

    @benjaminp benjaminp commented Sep 23, 2017

    This is only useful if the parent process has a lot of memory that's never used by the child processes right? Otherwise, you would lose via refcounting COWs.

    @methane
    Member

    @methane methane commented Sep 24, 2017

    Nice idea!

I think it helps not only prefork applications share more memory,
but also long-running large applications.
There are many static objects tracked by the GC.
They make full collections take longer, and the CPU cache gets filled with unused data.

For example, a web worker that loads the application after fork (uWSGI's --lazy-app option)
can call gc.freeze() after loading the full application, before it starts processing requests.
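The single-process benefit can be made concrete by timing a full collection before and after freezing a large baseline. A rough sketch; the `baseline` structure is made up for the example, and absolute timings will vary:

```python
import gc
import time

# Illustrative "baseline" data a worker holds after loading its
# application (invented for this demo).
baseline = [{"k": i, "v": [i]} for i in range(200_000)]

gc.collect()                       # clean up start-up garbage first
t0 = time.perf_counter()
gc.collect()
full_with_baseline = time.perf_counter() - t0

gc.freeze()                        # baseline leaves the GC's purview
t0 = time.perf_counter()
gc.collect()
full_without_baseline = time.perf_counter() - t0
gc.unfreeze()                      # undo the freeze for this demo

# full_without_baseline is typically much smaller, since the frozen
# objects are no longer traversed (and their headers no longer written).
```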

    @methane
    Member

    @methane methane commented Sep 24, 2017

AFAIK, Python's shutdown process runs a full GC.

Not touching the permanent generation would make shutdown faster.
On the other hand, there are some downsides:

• Some objects may not be freed during shutdown. That looks like a "leak" to applications embedding the Python interpreter.
• Some __del__ methods may not be called.

Of course, collecting the permanent generation during shutdown doesn't make sense:
gc.freeze() is used to share more memory pages, and the shutdown process
shouldn't unshare them.

So I think these notable downsides should be documented.

    @nascheme
    Member

    @nascheme nascheme commented Sep 25, 2017

    I think the basic idea makes a lot of sense, i.e. have a generation that is never collected. An alternative way to implement it would be to have an extra generation, e.g. rather than just 0, 1, 2 also have generation 3. The collection would by default never collect generation 3. Generation 4 would be equivalent to the frozen generation. You could still force collection by calling gc.collect(3). Whether that generation should be collected on shutdown would still be a question.

    If this gets implemented, it will impact the memory bitmap based GC idea I have been prototyping. Currently I am thinking of using two bits for each small GC object. The bits would mean: 00 - untracked, 01 - gen 0, 10 - gen 1, 11 - gen 2. With the introduction of a frozen generation, I would have to use another bit I think.

    Another thought is maybe we don't actually need 3 generations as they are currently used. We could have gen 0 which is collected frequently and gen 1 that is collected rarely. The frozen objects could go into gen 2 which are not automatically collected or have a user adjustable collection frequency. Collection of gen 1 would not automatically move objects into gen 2.

I think bpo-31105 (https://bugs.python.org/issue31105) is also related. The current GC thresholds are not very good. I've looked at what Go does, and its GC collection is based on a relative increase in memory usage. Python could perhaps do something similar. The accounting of actual bytes allocated and deallocated is tricky because the *_Del/Free functions don't actually know how much memory is being freed, at least not in a simple way.
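The three-generation machinery this proposal would extend is already visible through the existing API; gc.collect() takes an optional generation argument, which is what an extra "gen 3" would hook into:

```python
import gc

# The collector currently has exactly three generations (0, 1, 2);
# the proposal above would add a rarely- or never-collected one on top.
print(gc.get_threshold())   # per-generation allocation thresholds, e.g. (700, 10, 10)

gc.collect(0)               # collect only the youngest generation
gc.collect(1)               # generations 0 and 1
unreachable = gc.collect(2) # full collection, same as gc.collect()
```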

    @warsaw
    Member

    @warsaw warsaw commented Sep 25, 2017

    I like the idea of a gen 4 that never gets collected. This would have been useful for the original problem that inspired me to add the generation argument to gc.collect(). The nice thing about this, is just as you suggest: you could force a collection of gen 4 by gc.collect(3).

    It's unfortunate that you'd have to add a bit to handle this, but maybe you're right that we only really need three generations.

    @pitrou
    Member

    @pitrou pitrou commented Sep 25, 2017

On 25/09/2017 at 20:55, Neil Schemenauer wrote:

> I think the basic idea makes a lot of sense, i.e. have a generation that is never collected. An alternative way to implement it would be to have an extra generation, e.g. rather than just 0, 1, 2 also have generation 3. The collection would by default never collect generation 3. Generation 4 would be equivalent to the frozen generation. You could still force collection by calling gc.collect(3).

    API-wise it would sound better to have a separate gc.collect_frozen()...

    Though I think a gc.unfreeze() that moves the frozen generation into the
    oldest non-frozen generation would be useful too, at least for testing
    and experimentation.
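An unfreeze along these lines did end up in the merged API. A quick round trip with gc.freeze(), gc.unfreeze(), and gc.get_freeze_count() (the names from the linked PR) looks like this:

```python
import gc

gc.freeze()                         # everything tracked becomes permanent
frozen = gc.get_freeze_count()      # number of objects set aside

gc.unfreeze()                       # move them back into the oldest generation
assert gc.get_freeze_count() == 0   # the permanent generation is empty again

# After unfreeze, the objects are collectable once more, which is what
# makes the call useful for testing and experimentation.
```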

> I think bpo-31105 (https://bugs.python.org/issue31105) is also related. The current GC thresholds are not very good. I've looked at what Go does, and its GC collection is based on a relative increase in memory usage. Python could perhaps do something similar. The accounting of actual bytes allocated and deallocated is tricky because the *_Del/Free functions don't actually know how much memory is being freed, at least not in a simple way.

    Yeah... It's worse than that. Take for example a bytearray object. The
    basic object (the PyByteArrayObject structure) is quite small. But it
    also has a separately-allocated payload that is deleted whenever
    tp_dealloc is called. The GC isn't aware of that payload. Worse, the
    payload can (and will) change size during the object's lifetime, without
    the GC's knowledge about it ever being updated. (*)

    IMHO, the only reliable way to use memory footprint to drive the GC
    heuristic would be to force all allocations into our own allocator, and
    reconcile the GC with that allocator (instead of having the GC be its
    own separate thing as is the case nowadays).

    (*) And let's not talk about hairier cases, such as having multiple
    memoryviews over the same very large object...

    PS: every heuristic has its flaws. As I noted on python-(dev|ideas),
    full GC runtimes such as most Java implementations are well-known for
    requiring careful tuning of GC parameters for "non-usual" workloads. At
    least reference counting makes CPython more robust in many cases.

    @vstinner
    Member

    @vstinner vstinner commented Sep 28, 2017

    (my previous msg303283 was for the bpo-11063, I removed it, sorry for the spam.)

    @ambv
    Contributor Author

    @ambv ambv commented Oct 6, 2017

    Alright Python people, I don't see anybody being against the idea on the thread. Can we get a review of the linked PR? I don't think it would be good form for me to accept it.

    @serhiy-storchaka
    Member

    @serhiy-storchaka serhiy-storchaka commented Oct 6, 2017

    What about msg302790?

    @brainfvck
    Mannequin

    @brainfvck brainfvck mannequin commented Oct 11, 2017

> This is only useful if the parent process has a lot of memory that's never used by the child processes right? Otherwise, you would lose via refcounting COWs.

What we saw in prod is that memory fragmentation caused by the GC is the main reason the shared memory shrinks.

We identified the fragmentation by doing a full collection before fork and then keeping the GC disabled; it still caused a lot of copy-on-write in the child process.

This can't solve the copy-on-write caused by reference counting, but we're thinking about freezing the refcounts on those permanent objects too.

So this is useful if you did some warm-up work in the parent process.

It could also speed up the GC if you have a large number of permanent objects.

    @methane
    Member

    @methane methane commented Oct 12, 2017

    > This is only useful if the parent process has a lot of memory that's never used by the child processes right? Otherwise, you would lose via refcounting COWs.

> What we saw in prod is that memory fragmentation caused by gc is the main reason of shared memory shrink.
>
> The memory fragmentation is figured out by doing a full collection before fork and keep it disabled, it'll make a bunch of copy-on-write in child process.

The GC doesn't cause "memory fragmentation".
The GC touches (writes) the GC header and refcount; that is what causes the shared memory to shrink.
Maybe you're misusing the term "memory fragmentation".

> This can't solve the copy-on-write caused by ref count, but we're thinking about freezing the ref count on those permanent objects too.

It may increase the cost of refcount operations, because it makes every INCREF and DECREF bigger.
Note that such a change should only help applications using gc.freeze(); it shouldn't slow down all other applications.

> So this is useful if you did some warm-up work in parent process.

I don't understand this statement.

> Also it could speedup gc if you have large amount of permanent objects.

Yes, this helps not only "prefork" applications, but also all long-running applications
with a large amount of baseline data.

    @methane
    Member

    @methane methane commented Oct 12, 2017

As Instagram's report shows, disabling the cyclic GC really helps even with refcounting in play.
All applications have some cold data: modules and functions that are imported but never used.

    @methane
    Member

    @methane methane commented Oct 12, 2017

Should gc.freeze() do a gc.collect() right before freezing?
Or should we document the gc.collect(); gc.freeze() idiom?

I don't like gc.freeze(collect=False).
So if there is at least one use case for gc.freeze() without gc.collect(), I'm +1 on the former (the current pull request's) design.

Other nitpicking: get_freeze_count() or get_frozen_count()?
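The two idioms being weighed can be written out explicitly. A sketch; gc.freeze() as merged does not collect on its own, so the choice is left to the caller:

```python
import gc

# Idiom 1: reclaim existing garbage first, then freeze what's left.
# Suits single-process apps that just want cheaper future collections.
gc.collect()
gc.freeze()

# Idiom 2 (the fork-sharing case): freeze *without* collecting, so the
# collector never punches reusable holes into soon-to-be-shared pages.
gc.unfreeze()   # reset state for this demo
gc.disable()
gc.freeze()
gc.enable()
```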

    @brainfvck
    Mannequin

    @brainfvck brainfvck mannequin commented Oct 12, 2017

So what we did is:

We keep gc **disabled** in the parent process, freeze after warmup, and enable gc in the child processes.

The reason not to do a full collection is mentioned in the previous comments/the original ticket: (what I called) memory fragmentation.

The observation is: with gc disabled in both the parent and child processes, doing a full collection before fork makes the shared memory shrink a lot compared to doing no collection, even though a disabled gc has no way to touch the GC heads and trigger copy-on-write.

Of course, enabling gc makes the shared memory shrink even more, but the former effect accounts for more than the latter.

So my understanding is that gc frees some objects and makes some memory pages available for allocation in the child process. Allocation in those shared memory pages causes copy-on-write even without gc running.

Though maybe this behavior has a better name?

    @methane
    Member

    @methane methane commented Oct 12, 2017

> So my understanding is that gc frees some objects and makes some memory pages available for allocation in the child process. Allocation in those shared memory pages causes copy-on-write even without gc running.
>
> Though maybe this behavior has a better name?

OK, now I get what you're talking about.
I don't know a proper name for it; I'll call it a "memory hole" for now.

But I don't think the "memory hole" is a big problem, because we already have refcounts.
Say there are 100 function objects in one page and 99 of them are never used. As soon as the 1 remaining function is called, the page is unshared anyway.

Solving the memory hole issue is easy: just stop allocating new objects from existing pages.
But I don't think it's worthwhile, because of the refcount issue.

Instead of trying to "share most data", I recommend the "use a small number of processes" approach.

In my company, we don't use "prefork"; we use uWSGI's "--lazy-app" option for graceful reloading (i.e. "afterfork").
But since we run nginx in front of uWSGI, the number of uWSGI workers is just 2× the CPU cores. We can serve massive numbers of clients from only 16~32 processes.

So I prefer optimizing normal memory usage. That is good for all applications, not only "prefork" applications.

In this light, I'm +1 on the gc.freeze() proposal because it can also be used by single-process applications.

    @ambv
    Contributor Author

    @ambv ambv commented Oct 16, 2017

    Based on Inadasan's, Antoine's, Neil's, and Barry's review, I'm merging the change to 3.7.

    @ambv
    Contributor Author

    @ambv ambv commented Oct 16, 2017

New changeset c75edab by Łukasz Langa (brainfvck) in branch 'master':
bpo-31558: Add gc.freeze() (#3705)
c75edab

    @ambv ambv closed this as completed Oct 16, 2017
    @ambv
    Contributor Author

    @ambv ambv commented Oct 16, 2017

New changeset c30b55b by Łukasz Langa in branch 'master':
bpo-31558: Update NEWS and ACKS (#4013)
c30b55b

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022