Add gc.enable_object_debugger(): detect corrupted Python objects in the GC #80570
This is the follow-up to a thread that I started on python-dev in June 2018: [Python-Dev] Idea: reduce GC threshold in development mode (-X dev).

When an application crashes during garbage collection, we are usually clueless about the cause of the crash. The crash usually occurs in visit_decref() on a corrupted Python object. Sadly, not only are there too many possible reasons which can explain why a Python object is corrupted, but the crash usually occurs too late, long after the object was corrupted. Using a smaller GC threshold can help, but it's not enough.

It would help to be able to enable a builtin checker for corrupted objects: something that would be triggered by the GC with a threshold specified by the user, and that would have zero impact on performance when it's not used. The implementation would be to iterate over objects and ensure that they are consistent. The attached PR is an implementation of this idea. It uses new APIs that I wrote recently.

If an inconsistency is detected, _PyObject_ASSERT() will call _PyObject_Dump() to dump info about the object. This function can crash, but well, anything can crash on a memory corruption...
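The PR's code itself isn't quoted in this thread, but the idea can be modeled with a small self-contained C program (all names below, such as toy_check_all_objects, are invented for illustration and are not CPython APIs): keep every tracked object on a list, and have a debug pass walk the list, assert basic invariants (positive reference count, non-NULL type pointer), and dump whatever it can before aborting.

#include <stdio.h>
#include <stdlib.h>

/* Toy stand-ins for PyObject and the GC's tracking list; the real
   CPython structures (PyObject, PyGC_Head) are more involved. */
typedef struct ToyObject {
    long refcnt;                 /* models ob_refcnt */
    const char *type_name;       /* models ob_type (NULL == corrupted) */
    struct ToyObject *gc_next;   /* models the GC generation list */
} ToyObject;

static ToyObject *gc_list = NULL;   /* head of the tracked-object list */

/* Models _PyObject_ASSERT(): dump what we can, then abort. */
static void
toy_object_assert(ToyObject *op, int ok, const char *expr)
{
    if (ok) {
        return;
    }
    fprintf(stderr, "object debugger: assertion failed: %s\n", expr);
    fprintf(stderr, "  refcnt=%ld type=%s\n",
            op->refcnt, op->type_name ? op->type_name : "<NULL>");
    abort();
}

/* The debug pass: walk all tracked objects and check basic invariants.
   In the real proposal this would run from the GC every N allocations. */
static void
toy_check_all_objects(void)
{
    for (ToyObject *op = gc_list; op != NULL; op = op->gc_next) {
        toy_object_assert(op, op->refcnt > 0, "refcnt > 0");
        toy_object_assert(op, op->type_name != NULL, "type != NULL");
    }
}

int
main(void)
{
    ToyObject ok = {1, "list", NULL};
    ToyObject bad = {1, NULL, &ok};   /* corrupted: NULL type pointer */
    gc_list = &bad;
    toy_check_all_objects();          /* aborts with a dump of `bad` */
    return 0;
}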
Hum, _PyType_CheckConsistency() fails on one of its assertions during Python finalization. The failure dump reports the object <enum 'AddressFamily'>, the current thread 0x00007ffff7be8740 (most recent call first) and a gdb traceback. Maybe my assumption on tp_mro was wrong. I will remove the assertion.
I'm not sure if I should include a unit test. WIP patch for that:
diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c
index 350ef77163..9c0d0cf41a 100644
--- a/Modules/_testcapimodule.c
+++ b/Modules/_testcapimodule.c
@@ -4718,6 +4718,18 @@ negative_refcount(PyObject *self, PyObject *Py_UNUSED(args))
#endif
+static PyObject *
+corrupted_object(PyObject *self, PyObject *Py_UNUSED(args))
+{
+ PyObject *obj = PyList_New(0);
+ if (obj == NULL) {
+ return NULL;
+ }
+ obj->ob_type = NULL;
+ return obj;
+}
+
+
static PyMethodDef TestMethods[] = {
{"raise_exception", raise_exception, METH_VARARGS},
{"raise_memoryerror", raise_memoryerror, METH_NOARGS},
@@ -4948,6 +4960,7 @@ static PyMethodDef TestMethods[] = {
#ifdef Py_REF_DEBUG
{"negative_refcount", negative_refcount, METH_NOARGS},
#endif
+ {"corrupted_object", corrupted_object, METH_NOARGS},
{NULL, NULL} /* sentinel */
};
Tested manually using this script:

import gc, _testcapi, sys
gc.enable_object_debugger(1)
x = _testcapi.corrupted_object()
y = []
y = None
# Debugger should trigger here
x = None
It is better not to use assert(foo && bar). Use two separate asserts instead: assert(foo) and assert(bar).
Hum, I looked at my PR and I'm not sure that I added such a new assertion. Note: the assert() wrapped around the call, as in assert(_PyDict_CheckConsistency(mp)), is only used to remove the call in release builds; the function always returns 1. Internally, the function uses assert() or _PyObject_ASSERT(), so each failure reports its own line number and the exact failing expression. Do you want me to enhance the existing _PyDict_CheckConsistency() assertions in the same PR?
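For readers unfamiliar with the idiom, here is a small self-contained sketch (toy_dict and its functions are invented for illustration): the consistency checker always returns 1 and asserts each invariant on its own line, while the outer assert() removes the whole call from release builds.

#include <assert.h>

typedef struct {
    int used;       /* number of used entries */
    int allocated;  /* number of allocated entries */
} toy_dict;

/* Always returns 1; on an inconsistency, one of the internal asserts
   fails first, pointing at the exact invariant that was violated. */
static int
toy_dict_check_consistency(toy_dict *d)
{
    assert(d->used >= 0);             /* one invariant per assert... */
    assert(d->allocated >= 0);        /* ...so each failure has its */
    assert(d->used <= d->allocated);  /* own line and expression */
    return 1;
}

static void
toy_dict_insert(toy_dict *d)
{
    /* In a release build (NDEBUG), the whole call is compiled out. */
    assert(toy_dict_check_consistency(d));
    d->used++;
}

int
main(void)
{
    toy_dict d = {0, 8};
    toy_dict_insert(&d);
    return 0;
}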
I don't think calling APIs like _PyDict_CheckConsistency() is super useful. Can the PR find bugs like bpo-33803 quickly? I think calling tp_traverse is better:

static int
check_object(PyObject *obj, void *unused)
{
    /* Called for each object referenced by the object being checked. */
    _PyObject_ASSERT(obj, Py_REFCNT(obj) > 0);
    return 0;
}

static void
gc_check_object(PyGC_Head *gc)
{
    PyObject *op = FROM_GC(gc);
    _PyObject_ASSERT(op, Py_REFCNT(op) > 0);
    _PyObject_ASSERT(op, _PyObject_GC_IS_TRACKED(op));
    /* Visit every object reachable from `op` without touching
       any reference count. */
    Py_TYPE(op)->tp_traverse(op, (visitproc)check_object, NULL);
}
I looked at multiple old issues which contain "visit_decref". Most of them are really strange crashes and were closed with a message like "we don't have enough info to debug, sorry". So honestly, I'm not sure what the most "efficient" way to detect corrupted objects is. I guess that we need a trade-off between the completeness of the checks and the performance. gc.enable_object_debugger(1) simply makes Python completely unusable. Maybe such very bad performance makes the feature basically useless; I'm not sure at this point. I tried to find an old bug which mentioned "visit_decref" and tried to reintroduce the fixed bug, but I'm not really convinced by my experimental tests so far.

That being said, I *like* your idea of reusing tp_traverse. Not only does it fit very well into the gc module (I chose to put the new feature in the gc module on purpose), but it's also closer to the existing "visit_decref crash": if someone gets a crash in visit_decref() and the object debugger uses tp_traverse, the object debugger *will* catch the same bug. The expectation is to be able to catch it early.

Oh, by the way, why not use lower GC thresholds? I proposed this idea, but there are multiple issues with it. It can hide the bug (objects destroyed in a different order). It can also change the behavior of the application, which is linked to my previous point (again, objects destroyed in a different order). That's why Serhiy Storchaka proposed the design of gc.enable_object_debugger(): traverse without touching the reference counter. Thanks Serhiy for this nice idea ;-)
I modified my PR to reuse tp_traverse. Inada-san: would you mind reviewing my change?
Do you think that a gc.is_object_debugger_enabled() function would be needed? The tracemalloc module has 3 such functions: start(), stop() and is_tracing(). The faulthandler module has 3 functions: enable(), disable() and is_enabled().
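If such a query function were added, the gc module would follow the same enable/disable/query pattern. A self-contained sketch of the state this implies, with invented names (not CPython code):

#include <stdio.h>

/* Hypothetical module state: none of these names exist in CPython;
   this only sketches the trio used by tracemalloc
   (start/stop/is_tracing) and faulthandler (enable/disable/is_enabled). */
static long object_debugger_threshold = 0;   /* 0 means disabled */

static void
toy_enable_object_debugger(long threshold)
{
    object_debugger_threshold = threshold;
}

static void
toy_disable_object_debugger(void)
{
    object_debugger_threshold = 0;
}

static int
toy_is_object_debugger_enabled(void)
{
    return object_debugger_threshold > 0;
}

int
main(void)
{
    toy_enable_object_debugger(100);
    printf("enabled? %d\n", toy_is_object_debugger_enabled());
    toy_disable_object_debugger();
    printf("enabled? %d\n", toy_is_object_debugger_enabled());
    return 0;
}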
The bpo-33803 bug can be reintroduced using the following patch:
diff --git a/Python/hamt.c b/Python/hamt.c
index 67af04c437..67da8ec22c 100644
--- a/Python/hamt.c
+++ b/Python/hamt.c
@@ -2478,8 +2478,10 @@ hamt_alloc(void)
if (o == NULL) {
return NULL;
}
+#if 0
o->h_count = 0;
o->h_root = NULL;
+#endif
o->h_weakreflist = NULL;
PyObject_GC_Track(o);
return o;

And then run:
./python -m test -v test_context

The best would be to also be able to catch the bug in:
./python -m test -v test_asyncio

Problem: right now, my GC object debugger implementation is way too slow to use a threshold lower than 100, whereas the bug is caught almost immediately using gc.set_threshold(5). Maybe my implementation should be less naive: rather than always checking *all* objects tracked by the GC, have different thresholds depending on the generation? Maybe reuse the GC thresholds?
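One way to make such a scheme concrete is sketched below; this is a self-contained toy with invented names and thresholds, not CPython's gcmodule.c: give each generation its own counter and check threshold, so young objects are verified frequently and old objects only rarely.

#include <stdio.h>

#define NUM_GENERATIONS 3

/* Invented per-generation thresholds: check generation 0 every 100
   allocations, generation 1 every 1000, generation 2 every 10000. */
static const int check_threshold[NUM_GENERATIONS] = {100, 1000, 10000};
static int alloc_count[NUM_GENERATIONS];

/* Stand-in for walking one generation's object list and validating
   every object in it (see gc_check_object earlier in the thread). */
static void
toy_check_generation(int gen)
{
    printf("checking generation %d\n", gen);
}

/* Called on every GC-tracked allocation: young generations are
   checked frequently, older ones only rarely. */
static void
toy_on_allocation(void)
{
    for (int gen = 0; gen < NUM_GENERATIONS; gen++) {
        if (++alloc_count[gen] >= check_threshold[gen]) {
            alloc_count[gen] = 0;
            toy_check_generation(gen);
        }
    }
}

int
main(void)
{
    for (int i = 0; i < 2500; i++) {
        toy_on_allocation();
    }
    return 0;
}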
Currently, I'm using the following patch to try to detect the bpo-33803 bug using my GC object debugger. It's still a work in progress.

diff --git a/Lib/site.py b/Lib/site.py
 if __name__ == '__main__':
     _script()
+
+if 'dev' in sys._xoptions:
+    import gc
+    gc.enable_object_debugger(100)
+    #gc.set_threshold(5)
diff --git a/Python/hamt.c b/Python/hamt.c
index 67af04c437..67da8ec22c 100644
--- a/Python/hamt.c
+++ b/Python/hamt.c
@@ -2478,8 +2478,10 @@ hamt_alloc(void)
if (o == NULL) {
return NULL;
}
+#if 0
o->h_count = 0;
o->h_root = NULL;
+#endif
o->h_weakreflist = NULL;
PyObject_GC_Track(o);
return o;
I opened this issue because I was convinced that it would be easy to implement checks faster than gc.set_threshold(), but I failed to write efficient checks which detect the bugs that I listed above. My approach was basically to check all objects tracked by the GC every N memory allocations (PyGC_Malloc): too slow. I tried to put thresholds per generation: it was still too slow.

Maybe recent objects should be checked often, but old objects should be checked less often. For example, only check objects in generation 0, scan new objects, and then remember the size of generation 0; at the next check, ignore the objects already checked. I failed to find the time and interest to implement this approach, so I'm abandoning this issue and my PR. In the meanwhile, gc.set_threshold(5) can be used: it isn't too slow and is quite good at finding most of the bugs listed in this issue.
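The incremental idea described above can be modeled with a simple watermark, as in this self-contained toy (names invented; real CPython would need to track this inside gcmodule.c, where survivors moving between generations and freed objects would invalidate a naive watermark):

#include <stdio.h>

#define MAX_OBJECTS 1000

/* Toy generation 0: new objects are appended at the end. */
static int gen0[MAX_OBJECTS];
static int gen0_size = 0;

/* Watermark: everything below this index was validated already. */
static int checked_size = 0;

static void
toy_check_object(int obj)
{
    printf("checking object %d\n", obj);
}

/* Incremental pass: only look at objects appended since last time,
   then remember the new size so they are not re-checked. */
static void
toy_incremental_check(void)
{
    for (int i = checked_size; i < gen0_size; i++) {
        toy_check_object(gen0[i]);
    }
    checked_size = gen0_size;
}

int
main(void)
{
    for (int i = 0; i < 5; i++) {
        gen0[gen0_size++] = i;
    }
    toy_incremental_check();   /* checks objects 0..4 */
    gen0[gen0_size++] = 5;
    toy_incremental_check();   /* checks only object 5 */
    return 0;
}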
Ah, by the way, this issue was mostly motivated by a customer issue, but the bug disappeared from the customer's production. Moreover, Python 3.8 now allows using a debug build without having to recompile all C extensions; a debug build may also help to catch more bugs.
Update: I added an assertion which should help to detect some kinds of bugs in debug mode: commit d91d4de
I created bpo-38392 "Ensure that objects entering the GC are valid". |