
Add gc.enable_object_debugger(): detect corrupted Python objects in the GC #80570

Closed
vstinner opened this issue Mar 21, 2019 · 24 comments
Labels
3.8 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments

@vstinner
Member

BPO 36389
Nosy @vstinner, @methane, @serhiy-storchaka, @pablogsal
PRs
  • bpo-36389: Add gc.enable_object_debugger() #12480
  • bpo-36389: _PyObject_IsFreed() now also detects uninitialized memory #12770
  • [3.7] bpo-36389: _PyObject_IsFreed() now also detects uninitialized memory (GH-12770) #12780
  • bpo-36389: Change PyMem_SetupDebugHooks() constants #12782
  • [3.7] bpo-36389: _PyObject_IsFreed() now also detects uninitialized memory (GH-12770) #12788
  • bpo-36389: Add _PyObject_CheckConsistency() function #12803
  • bpo-36389: Cleanup gc.set_threshold() #12844
  • bpo-36389: _PyObject_CheckConsistency() now also works in release mode #16612
  • bpo-36389: Fix _PyBytesWriter in release mode #16624
  • bpo-36389: Add newline to _PyObject_AssertFailed() #16629
  • [3.8] bpo-36389: Backport debug enhancements from master #16796
  • [3.7] bpo-38070: visit_decref() calls _PyObject_IsFreed() in debug mode #16816
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2019-07-08.14:24:04.533>
    created_at = <Date 2019-03-21.11:02:41.933>
    labels = ['interpreter-core', 'type-feature', '3.8']
    title = 'Add gc.enable_object_debugger(): detect corrupted Python objects in the GC'
    updated_at = <Date 2019-10-16.01:37:51.669>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2019-10-16.01:37:51.669>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-07-08.14:24:04.533>
    closer = 'vstinner'
    components = ['Interpreter Core']
    creation = <Date 2019-03-21.11:02:41.933>
    creator = 'vstinner'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 36389
    keywords = ['patch']
    message_count = 24.0
    messages = ['338536', '338538', '338539', '338541', '338542', '338603', '338612', '339910', '339911', '339916', '339948', '339960', '339966', '339995', '340101', '340287', '347497', '347498', '354073', '354076', '354117', '354138', '354147', '354679']
    nosy_count = 4.0
    nosy_names = ['vstinner', 'methane', 'serhiy.storchaka', 'pablogsal']
    pr_nums = ['12480', '12770', '12780', '12782', '12788', '12803', '12844', '16612', '16624', '16629', '16796', '16816']
    priority = 'normal'
    resolution = 'out of date'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue36389'
    versions = ['Python 3.8']

    @vstinner
    Member Author

    This is a follow-up to a thread that I started on python-dev in June 2018:

    [Python-Dev] Idea: reduce GC threshold in development mode (-X dev)
    https://mail.python.org/pipermail/python-dev/2018-June/153857.html

    When an application crashes during a garbage collection, we are usually clueless about the cause. The crash usually occurs in visit_decref() on a corrupted Python object. Sadly, not only are there too many possible reasons why a Python object can become corrupted, but the crash usually occurs too long after the object was corrupted. Using a smaller GC threshold can help, but it's not enough.

    It would help to be able to enable a builtin checker for corrupted objects: something triggered by the GC with a threshold specified by the user, with zero impact on performance when it's not used.

    The implementation would iterate over objects and ensure that they are consistent.
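As a rough illustration, the "iterate and check" idea can be approximated in pure Python with gc.get_objects(). The helper below is a hypothetical stand-in: the real proposal runs C-level checks (_PyObject_IsFreed(), _PyType_CheckConsistency(), ...) inside the GC, which pure Python cannot do.

```python
import gc
import sys

def check_all_objects():
    """Naive consistency pass over every object tracked by the GC.

    Pure-Python sketch of the proposal: the real checks need
    C-level access, but a basic invariant can still be asserted
    from Python.
    """
    checked = 0
    for obj in gc.get_objects():
        # A live object must report a positive reference count
        # (getrefcount itself adds one temporary reference).
        assert sys.getrefcount(obj) > 0
        checked += 1
    return checked
```

The C implementation would run this kind of pass periodically from the allocator, which is where the performance question discussed below comes from.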

    Attached PR is an implementation of this idea. It uses new API that I wrote recently:

    • _PyObject_ASSERT()
    • _PyObject_IsFreed()
    • _PyType_CheckConsistency()
    • _PyUnicode_CheckConsistency()
    • _PyDict_CheckConsistency()

    If an inconsistency is detected, _PyObject_ASSERT() will call _PyObject_Dump() to dump info about the object. This function can crash, but well, anything can crash on a memory corruption...

    @vstinner vstinner added 3.8 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Mar 21, 2019
    @vstinner
    Member Author

    Hum, _PyType_CheckConsistency() fails on the following assertion during Python finalization:

    ASSERT(type->tp_mro != NULL && PyTuple_Check(type->tp_mro));
    

    Error:
    ---
    /home/vstinner/prog/python/master/python: No module named asyncio.__main__; 'asyncio' is a package and cannot be directly executed
    Objects/typeobject.c:149: _PyType_CheckConsistency: Assertion "(type->tp_mro != ((void *)0) && ((((((PyObject*)(type->tp_mro))->ob_type))->tp_flags & ((1UL << 26))) != 0))" failed
    Enable tracemalloc to get the memory block allocation traceback

    object : <enum 'AddressFamily'>
    type : EnumMeta
    refcount: 1
    address : 0x9138f0
    Fatal Python error: _PyObject_AssertFailed

    Current thread 0x00007ffff7be8740 (most recent call first):
    ---

    gdb traceback:
    ---
    (gdb) where
    #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
    #1 0x00007ffff7c0f895 in __GI_abort () at abort.c:79
    #2 0x000000000055bf91 in fatal_error (prefix=0x0, msg=0x68a47d "_PyObject_AssertFailed", status=-1) at Python/pylifecycle.c:2088
    #3 0x000000000055bfbd in Py_FatalError (msg=0x68a47d "_PyObject_AssertFailed") at Python/pylifecycle.c:2098
    #4 0x000000000047be0f in _PyObject_AssertFailed (obj=<EnumMeta() at remote 0x9138f0>,
    expr=0x68d720 "(type->tp_mro != ((void *)0) && ((((((PyObject*)(type->tp_mro))->ob_type))->tp_flags & ((1UL << 26))) != 0))", msg=0x0,
    file=0x68d632 "Objects/typeobject.c", line=149, function=0x690ba0 <func.14343> "_PyType_CheckConsistency") at Objects/object.c:2197
    #5 0x000000000048ebd8 in _PyType_CheckConsistency (type=0x9138f0) at Objects/typeobject.c:149
    #6 0x00000000004770a5 in _PyObject_CheckConsistency (op=<AddressFamily at remote 0x7fffea7a0870>) at Objects/object.c:35
    #7 0x000000000058de3c in gc_check_object (gc=0x7fffea7a0860) at Modules/gcmodule.c:1280
    #8 0x000000000058de90 in _PyGC_CheckAllObjets () at Modules/gcmodule.c:1290
    #9 0x000000000058f4a0 in gc_check_object_debugger () at Modules/gcmodule.c:2007
    #10 0x000000000058f7c7 in PyObject_GC_Del (op=0x7fffea7a27c0) at Modules/gcmodule.c:2109
    #11 0x0000000000467188 in dict_dealloc (mp=0x7fffea7a27c0) at Objects/dictobject.c:1996
    #12 0x000000000047be44 in _Py_Dealloc (op={}) at Objects/object.c:2212
    #13 0x0000000000461cce in _Py_DECREF (filename=0x6855a0 "./Include/object.h", lineno=533, op={}) at ./Include/object.h:470
    #14 0x0000000000461d1c in _Py_XDECREF (op={}) at ./Include/object.h:533
    #15 0x0000000000462fef in free_keys_object (keys=0x9120e0) at Objects/dictobject.c:580
    #16 0x000000000046284f in dictkeys_decref (dk=0x9120e0) at Objects/dictobject.c:324
    #17 0x00000000004664f2 in PyDict_Clear (op={}) at Objects/dictobject.c:1722
    #18 0x00000000004969b8 in type_clear (type=0x911c50) at Objects/typeobject.c:3637
    #19 0x0000000000490d03 in subtype_clear (self=<EnumMeta() at remote 0x911c50>) at Objects/typeobject.c:1118
    #20 0x000000000058d390 in delete_garbage (collectable=0x7fffffffcef0, old=0x7d2560 <_PyRuntime+416>) at Modules/gcmodule.c:931
    #21 0x000000000058d833 in collect (generation=2, n_collected=0x0, n_uncollectable=0x0, nofail=1) at Modules/gcmodule.c:1100
    #22 0x000000000058f175 in _PyGC_CollectNoFail () at Modules/gcmodule.c:1915
    #23 0x000000000054759c in PyImport_Cleanup () at Python/import.c:589
    #24 0x000000000055a442 in Py_FinalizeEx () at Python/pylifecycle.c:1162
    #25 0x000000000055c1e5 in Py_Exit (sts=1) at Python/pylifecycle.c:2188
    #26 0x00000000005677fb in handle_system_exit () at Python/pythonrun.c:642
    #27 0x0000000000567821 in PyErr_PrintEx (set_sys_last_vars=1) at Python/pythonrun.c:652
    #28 0x00000000005674c5 in PyErr_Print () at Python/pythonrun.c:548
    #29 0x00000000004222f2 in pymain_run_module (modname=0x7db6d0 L"asyncio", set_argv0=1) at Modules/main.c:566
    #30 0x0000000000422ad6 in pymain_run_python (interp=0x7da350, exitcode=0x7fffffffd254) at Modules/main.c:799
    #31 0x0000000000422c6a in pymain_main (args=0x7fffffffd2a0) at Modules/main.c:877
    #32 0x0000000000422d4e in _Py_UnixMain (argc=3, argv=0x7fffffffd3c8) at Modules/main.c:922
    #33 0x0000000000420796 in main (argc=3, argv=0x7fffffffd3c8) at ./Programs/python.c:16
    ---

    Maybe my assumption on tp_mro was wrong. I will remove the assertion.

    @vstinner
    Member Author

    I'm not sure if I should include a unit test. WIP patch for that:

    diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c
    index 350ef77163..9c0d0cf41a 100644
    --- a/Modules/_testcapimodule.c
    +++ b/Modules/_testcapimodule.c
    @@ -4718,6 +4718,18 @@ negative_refcount(PyObject *self, PyObject *Py_UNUSED(args))
     #endif
     
     
    +static PyObject *
    +corrupted_object(PyObject *self, PyObject *Py_UNUSED(args))
    +{
    +    PyObject *obj = PyList_New(0);
    +    if (obj == NULL) {
    +        return NULL;
    +    }
    +    obj->ob_type = NULL;
    +    return obj;
    +}
    +
    +
     static PyMethodDef TestMethods[] = {
         {"raise_exception",         raise_exception,                 METH_VARARGS},
         {"raise_memoryerror",       raise_memoryerror,               METH_NOARGS},
    @@ -4948,6 +4960,7 @@ static PyMethodDef TestMethods[] = {
     #ifdef Py_REF_DEBUG
         {"negative_refcount", negative_refcount, METH_NOARGS},
     #endif
    +    {"corrupted_object", corrupted_object, METH_NOARGS},
         {NULL, NULL} /* sentinel */
     };
     

    Tested manually using this script:
    ---

    import gc, _testcapi, sys
    
    gc.enable_object_debugger(1)
    x = _testcapi.corrupted_object()
    y = []
    y = None
    # Debugger should trigger here
    x = None

    @serhiy-storchaka
    Member

    It is better not to use assert(foo && bar). Use two separate asserts instead: assert(foo) and assert(bar).

    @vstinner
    Member Author

    It is better not to use assert(foo && bar). Use two separate asserts instead: assert(foo) and assert(bar).

    Hum, I looked at my PR and I'm not sure that I added such a new assertion.

    Note: the assert() wrapping the call, as in assert(_PyDict_CheckConsistency(mp)), is only there to remove the call in release builds; the function always returns 1. Internally, the function uses assert() or _PyObject_ASSERT(), which report the exact failing expression with its own line number.

    Do you want me to enhance existing _PyDict_CheckConsistency() assertions in the same PR?

    @methane
    Member

    methane commented Mar 22, 2019

    I don't think calling APIs like _PyDict_CheckConsistency() is super useful. Can the PR find bugs like bpo-33803 quickly?

    I think calling tp_traverse is better.

    static int
    check_object(PyObject *obj, void *unused)
    {
        _PyObject_ASSERT(obj, Py_REFCNT(obj) > 0);
        return 0;
    }

    static void
    gc_check_object(PyGC_Head *gc)
    {
        PyObject *op = FROM_GC(gc);
        _PyObject_ASSERT(op, Py_REFCNT(op) > 0);
        _PyObject_ASSERT(op, _PyObject_GC_IS_TRACKED(op));
        Py_TYPE(op)->tp_traverse(op, (visitproc)check_object, NULL);
    }

    @vstinner
    Member Author

    I don't think calling APIs like _PyDict_CheckConsistency() is super useful.

    I looked at multiple old issues that mention "visit_decref". Most of them are really strange crashes and were closed with a message like "we don't have enough info to debug, sorry". So honestly, I'm not sure what the most "efficient" way to detect corrupted objects is.

    I guess that we need a trade-off between completeness of the checks and the performance.

    gc.enable_object_debugger(1) simply makes Python completely unusable. Maybe such very bad performance makes the feature basically useless. I'm not sure at this point.

    I tried to find an old bug which mentioned "visit_decref" and tried to reintroduce the fixed bug, but I'm not really convinced by my experimental tests so far.

    That being said, I *like* your idea of reusing tp_traverse. Not only does it fit very well into the gc module (I chose to put the new feature in the gc module on purpose), but it's also closer to the existing "visit_decref crash": if someone gets a crash in visit_decref() and the object debugger uses tp_traverse, the object debugger *will* catch the same bug. The expectation is to catch it earlier.

    --

    Oh, by the way, why not just use lower GC thresholds? I proposed that idea, but there are multiple issues with it. It can hide the bug (objects are destroyed in a different order). It can also change the behavior of the application, which is linked to my previous point (again, objects destroyed in a different order).

    That's how Serhiy Storchaka proposed the design of gc.enable_object_debugger(): traverse without touching the reference counter.
    https://mail.python.org/pipermail/python-dev/2018-June/153860.html

    Thanks Serhiy for this nice idea ;-)

    @vstinner
    Member Author

    I modified my PR to reuse tp_traverse.

    Inada-san: would you mind reviewing my change?

    @vstinner
    Member Author

    Do you think that a gc.is_object_debugger_enabled() function would be needed?

    The tracemalloc module has 3 functions:

    • start(), stop()
    • is_tracing()

    The faulthandler module has 3 functions:

    • enable() / disable()
    • is_enabled()
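For reference, both modules follow the same on/off-plus-query pattern, which a hypothetical gc.is_object_debugger_enabled() would mirror:

```python
import faulthandler
import tracemalloc

# faulthandler: enable() / disable() paired with is_enabled().
faulthandler.enable()
assert faulthandler.is_enabled()
faulthandler.disable()
assert not faulthandler.is_enabled()

# tracemalloc: start() / stop() paired with is_tracing().
tracemalloc.start()
assert tracemalloc.is_tracing()
tracemalloc.stop()
assert not tracemalloc.is_tracing()
```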

    @vstinner
    Member Author

    bpo-33803 bug can be reintroduced using the following patch:

    diff --git a/Python/hamt.c b/Python/hamt.c
    index 67af04c437..67da8ec22c 100644
    --- a/Python/hamt.c
    +++ b/Python/hamt.c
    @@ -2478,8 +2478,10 @@ hamt_alloc(void)
         if (o == NULL) {
             return NULL;
         }
    +#if 0
         o->h_count = 0;
         o->h_root = NULL;
    +#endif
         o->h_weakreflist = NULL;
         PyObject_GC_Track(o);
         return o;

    And then run:

    ./python -m test -v test_context

    The best would be to also be able to catch the bug in:

    ./python -m test -v test_asyncio

    Problem: right now, my GC object debugger implementation is way too slow to use a threshold lower than 100, whereas the bug is caught almost immediately using gc.set_threshold(5).

    Maybe my implementation should be less naive: rather than always checking *all* objects tracked by the GC, use different thresholds depending on the generation? Maybe reuse the GC thresholds?

    @vstinner
    Member Author

    New changeset 2b00db6 by Victor Stinner in branch 'master':
    bpo-36389: _PyObject_IsFreed() now also detects uninitialized memory (GH-12770)
    2b00db6

    @vstinner
    Member Author

    New changeset 4c409be by Victor Stinner in branch 'master':
    bpo-36389: Change PyMem_SetupDebugHooks() constants (GH-12782)
    4c409be

    @vstinner
    Member Author

    Currently, I'm using the following patch to try to detect bpo-33803 bug using my GC object debugger. It's still a work-in-progress.

    diff --git a/Lib/site.py b/Lib/site.py
    index ad1146332b..e184080b19 100644
    --- a/Lib/site.py
    +++ b/Lib/site.py
    @@ -638,3 +638,8 @@ def _script():

     if __name__ == '__main__':
         _script()
    +
    +if 'dev' in sys._xoptions:
    +    import gc
    +    gc.enable_object_debugger(100)
    +    #gc.set_threshold(5)
    diff --git a/Python/hamt.c b/Python/hamt.c
    index 67af04c437..67da8ec22c 100644
    --- a/Python/hamt.c
    +++ b/Python/hamt.c
    @@ -2478,8 +2478,10 @@ hamt_alloc(void)
         if (o == NULL) {
             return NULL;
         }
    +#if 0
         o->h_count = 0;
         o->h_root = NULL;
    +#endif
         o->h_weakreflist = NULL;
         PyObject_GC_Track(o);
         return o;

    @vstinner
    Member Author

    New changeset 9e23f0a by Victor Stinner in branch '3.7':
    [3.7] bpo-36389: _PyObject_IsFreed() now also detects uninitialized memory (GH-12770) (GH-12788)
    9e23f0a

    @vstinner
    Member Author

    New changeset 0fc91ee by Victor Stinner in branch 'master':
    bpo-36389: Add _PyObject_CheckConsistency() function (GH-12803)
    0fc91ee

    @vstinner
    Member Author

    New changeset 0810fa7 by Victor Stinner in branch 'master':
    bpo-36389: Cleanup gc.set_threshold() (GH-12844)
    0810fa7

    @vstinner
    Member Author

    vstinner commented Jul 8, 2019

    I opened this issue because I was convinced that it would be easy to implement checks faster than gc.set_threshold(), but I failed to write efficient checks which detect the bugs that I listed above.

    My approach was basically to check all objects tracked by the GC every N memory allocations (PyGC_Malloc): too slow. I tried using per-generation thresholds: still too slow.

    Maybe recent objects should be checked often, and old objects less often. For example, only check objects in generation 0 and scan new objects, then remember the size of generation 0; at the next check, ignore objects already checked.
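The incremental idea can be sketched in Python. The helper below is hypothetical: it bookkeeps with id(), which is only reliable while objects stay alive, whereas a real implementation would track the GC's generation lists in C.

```python
import gc
import sys

_checked_ids = set()

def check_new_objects():
    """Incremental sketch: validate only objects not seen before.

    Remember which objects were already checked and skip them on
    the next pass, so old objects are not re-validated every time.
    """
    new = 0
    for obj in gc.get_objects():
        if id(obj) in _checked_ids:
            continue
        # Stand-in for the real C-level consistency checks.
        assert sys.getrefcount(obj) > 0
        _checked_ids.add(id(obj))
        new += 1
    return new
```

The first pass pays the full cost; subsequent passes only touch objects allocated since the previous pass, which is the behavior the paragraph above asks for.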

    I failed to find the time and interest to implement this approach, so I'm abandoning this issue and my PR.

    In the meantime, gc.set_threshold(5) can be used. It isn't too slow and is quite good at finding most of the bugs listed in this issue.

    @vstinner vstinner closed this as completed Jul 8, 2019
    @vstinner
    Member Author

    vstinner commented Jul 8, 2019

    Ah, by the way, this issue was mostly motivated by a customer issue, but the bug disappeared from the customer's production environment.

    Moreover, Python 3.8 now allows using a debug build without having to recompile all C extensions:
    https://docs.python.org/dev/whatsnew/3.8.html#debug-build-uses-the-same-abi-as-release-build

    A debug build may also help to catch more bugs.

    @vstinner
    Member Author

    vstinner commented Oct 7, 2019

    Update: I added an assertion which should help to detect some kind of bugs in debug mode:

    commit d91d4de
    Author: Victor Stinner <vstinner@redhat.com>
    Date: Mon Sep 9 17:44:59 2019 +0200

    bpo-38070: visit_decref() calls _PyObject_IsFreed() (GH-15782)
    
    In debug mode, visit_decref() now calls _PyObject_IsFreed() to ensure
    that the object is not freed. If it's freed, the program fails with
    an assertion error and Python dumps information about the freed
    object.
    

    @vstinner
    Member Author

    vstinner commented Oct 7, 2019

    I created bpo-38392 "Ensure that objects entering the GC are valid".

    @vstinner
    Member Author

    vstinner commented Oct 7, 2019

    New changeset 6876257 by Victor Stinner in branch 'master':
    bpo-36389: _PyObject_CheckConsistency() available in release mode (GH-16612)
    6876257

    @vstinner
    Member Author

    vstinner commented Oct 7, 2019

    New changeset 60ec6ef by Victor Stinner in branch 'master':
    bpo-36389: Fix _PyBytesWriter in release mode (GH-16624)
    60ec6ef

    @vstinner
    Member Author

    vstinner commented Oct 7, 2019

    New changeset 7775349 by Victor Stinner in branch 'master':
    bpo-36389: Add newline to _PyObject_AssertFailed() (GH-16629)
    7775349

    @vstinner
    Member Author

    New changeset f82ce5b by Victor Stinner in branch '3.8':
    [3.8] bpo-36389: Backport debug enhancements from master (GH-16796)
    f82ce5b

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022