Crash in remove() weak reference callback of weakref.WeakValueDictionary at Python exit #82187
I have a case of a segfault on Fedora 32 (currently rawhide) with Python 3.8b4 and FreeIPA. The ipactl Python helper crashes with a segfault when the Python process exits. The segfault occurs in _PyFunction_Vectorcall() with a weakref-related local-scope function that has a NULL func_code pointer. _PyFunction_Vectorcall() does not check the code object for NULL and accesses the field without a check:

Program received signal SIGSEGV, Segmentation fault.

It looks like the func object is the local remove() function of the weakref.WeakValueDictionary() implementation (Lines 90 to 112 in 353053d).
All function object fields except func_qualname and vectorcall are already NULL.

(gdb) print ((PyFunctionObject*)func).func_annotations

The Fedora downstream bug is https://bugzilla.redhat.com/show_bug.cgi?id=1747901 |
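For illustration, here is a minimal sketch of the kind of guard the report says is missing. The helper name call_if_not_cleared is hypothetical, and PyObject_Vectorcall is the public spelling of the call API (it was _PyObject_Vectorcall in 3.8); this is not CPython's actual code or fix.

```c
#include <Python.h>

/* Hypothetical guard: refuse to call a function object whose fields were
 * already wiped by func_clear() (e.g. by delete_garbage() at interpreter
 * shutdown).  _PyFunction_Vectorcall performs no such check and
 * dereferences func_code directly, hence the SIGSEGV above. */
static PyObject *
call_if_not_cleared(PyObject *func, PyObject *const *args, size_t nargs)
{
    if (PyFunction_Check(func) && PyFunction_GetCode(func) == NULL) {
        /* func_code can only be NULL after tp_clear has run on func. */
        PyErr_SetString(PyExc_SystemError,
                        "attempt to call a function whose code object "
                        "has been cleared");
        return NULL;
    }
    return PyObject_Vectorcall(func, args, nargs, NULL);
}
```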
I don't understand how the function ended up with func_code=NULL. That shouldn't be a valid function to call, IMO. |
Not yet. My current hypothesis is the function code object is already cleaned up *somehow*. |
It doesn't seem possible to create a function with func_code=NULL, nor to set func_code to NULL, from Python code. However, func_code can be set to NULL by func_clear(), which is called by func_dealloc(). I bet that func_clear() has been called, since most func fields are set to NULL, which is consistent with:

static int
func_clear(PyFunctionObject *op)
{
Py_CLEAR(op->func_code);
Py_CLEAR(op->func_globals);
Py_CLEAR(op->func_module);
Py_CLEAR(op->func_name);
Py_CLEAR(op->func_defaults);
Py_CLEAR(op->func_kwdefaults);
Py_CLEAR(op->func_doc);
Py_CLEAR(op->func_dict);
Py_CLEAR(op->func_closure);
Py_CLEAR(op->func_annotations);
Py_CLEAR(op->func_qualname);
return 0;
}

The question is how is it possible that a deallocated function is still accessed? It smells like a borrowed reference somewhere in the call chain. |
Using gdb, I checked whether func_clear() can be called outside func_dealloc(): yes, delete_garbage() (gcmodule.c) calls type->tp_clear(). But I'm surprised that the function would be seen as "unreachable" if its reference counter was equal to 135:

(gdb) print func.ob_refcnt |
(gdb) print ((PyFunctionObject*)func).func_qualname

This is a temporary remove() function declared in the weakref.WeakValueDictionary constructor:

class WeakKeyDictionary(_collections_abc.MutableMapping):
    def __init__(self, dict=None):
        def remove(k, selfref=ref(self)):
            ...
        self._remove = remove
        ...

    def __setitem__(self, key, value):
        self.data[ref(key, self._remove)] = value
Simplified gdb traceback:

Py_FinalizeEx()
-> PyImport_Cleanup()
-> _PyGC_CollectNoFail()
-> delete_garbage()
-> func_clear(op=0x7fffe6ebfd30) at /usr/src/debug/python3-3.8.0~b4-1.fc32.x86_64/Objects/funcobject.c:584
Py_CLEAR(op->func_closure);
-> tupledealloc()
-> cell_dealloc()
-> dict_dealloc()
-> dictkeys_decref()
-> cfield_dealloc () from _cffi_backend
-> PyObject_ClearWeakRefs()
-> _PyFunction_Vectorcall (func=<function at remote 0x7fffe6ebfd30>, ...)

PyImport_Cleanup() calls delete_garbage(), which calls func_clear(op=0x7fffe6ebfd30). While the function closure is being destroyed, a cffi object stored in a dict (a namespace?) held by the function closure is cleared, and there was a weak reference to this cffi object with a callback... which is the cleared function (Python object 0x7fffe6ebfd30)...

-- I don't understand why/how remove() gets a closure. I modified weakref.py to ensure that remove.__closure__ is None: test_weakref still passes. |
Ok, I found the reason and I wrote PR 15641 to fix it. Previous fix of WeakValueDictionary remove() function: commit 9cd7e17
|
Your workaround solves the segfault issue for me. |
I think the real problem is in vectorcall. tp_clear will make sure all internal pointers are either NULL or valid (by using Py_CLEAR, which also protects against re-entrancy), and it is the responsibility of the object or its users to check that the fields that are needed are valid. From the docs on tp_clear:
Notice the "so that self knows the contained object can no longer be used". I think _PyFunction_Vectorcall is missing the check "to see of the contained object can or cannot be used". Here the contained object being everything that has being cleaned before: Py_CLEAR(op->func_code);
Py_CLEAR(op->func_globals);
Py_CLEAR(op->func_module);
Py_CLEAR(op->func_name);
Py_CLEAR(op->func_defaults);
Py_CLEAR(op->func_kwdefaults);
Py_CLEAR(op->func_doc);
Py_CLEAR(op->func_dict);
Py_CLEAR(op->func_closure); <----- Everything before this
Py_CLEAR(op->func_annotations);
Py_CLEAR(op->func_qualname);

I think it would be enough to check for func_code being NULL in _PyFunction_Vectorcall, or earlier. |
In particular, this was not happening before because the function type did not implement tp_clear: https://github.com/python/cpython/blob/3.7/Objects/funcobject.c#L615

The new tp_clear implementation, which has no such checks, is what allows this to happen. |
See also https://bugs.python.org/issue33418 as the potential source of the problem. |
I'm not sure adding a check would solve this. What should be done when a function with func_code=NULL is called? "Silently do nothing" is not really an option, and raising an exception wouldn't help much in this case. I wonder if, in a function's tp_clear, we should clear the mutable parts (like func_closure) before the immutable ones (func_code). |
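A rough sketch of that reordering, based on the func_clear() field list quoted above; this is only an illustration of the idea, not the change that was actually merged:

```c
/* Sketch: clear the parts whose teardown can trigger arbitrary Python code
 * (closure, dict, defaults, ...) first, and keep the fields needed to call
 * or identify the function until the very end. */
static int
func_clear(PyFunctionObject *op)
{
    Py_CLEAR(op->func_closure);
    Py_CLEAR(op->func_dict);
    Py_CLEAR(op->func_defaults);
    Py_CLEAR(op->func_kwdefaults);
    Py_CLEAR(op->func_annotations);
    Py_CLEAR(op->func_doc);
    Py_CLEAR(op->func_module);
    /* Cleared last: a call that happens while the fields above are being
     * released still finds code, globals and a printable name. */
    Py_CLEAR(op->func_globals);
    Py_CLEAR(op->func_name);
    Py_CLEAR(op->func_qualname);
    Py_CLEAR(op->func_code);
    return 0;
}
```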
I'm now able to reproduce the FreeIPA crash. In short:
I confirm that the patch below does fix the FreeIPA crash:

diff --git a/Objects/funcobject.c b/Objects/funcobject.c
index df5cc2d3f5..357372c565 100644
--- a/Objects/funcobject.c
+++ b/Objects/funcobject.c
@@ -669,7 +669,7 @@ PyTypeObject PyFunction_Type = {
Py_TPFLAGS_METHOD_DESCRIPTOR, /* tp_flags */
func_new__doc__, /* tp_doc */
(traverseproc)func_traverse, /* tp_traverse */
- (inquiry)func_clear, /* tp_clear */
+ (inquiry)0, /* tp_clear */
0, /* tp_richcompare */
offsetof(PyFunctionObject, func_weakreflist), /* tp_weaklistoffset */
0, /* tp_iter */ |
PyFunction_Type.tp_clear changed in bpo-33418 (previously, it was equal to 0: no clear method): commit 3c45240
|
In https://bugs.python.org/issue33418 I proposed reverting tp_clear on function objects.
We can set the error indicator at least, although I agree that it seems suboptimal. At a minimum I think we should add an assertion to check that the function being called is valid, so we can identify this situation more easily. What downsides do we see in raising an exception? Notice that this can happen without involving weak references. For example: you have objects A and B. A is a function. The destructor of B calls A. A and B are in a cycle. When tp_clear is called on A, B's reference count reaches 0, B's destructor tries to call A and.... Boom! Either we remove tp_clear or we somehow have to protect all callers of the function. Technically, any field cleared by tp_clear leaves the function in an inconsistent state. |
Yeah, on second thought, that would probably be best. We still want PR 15641 as well, so the exception doesn't actually happen in practice. Note that _PyEval_EvalCodeWithName (the main use of func_globals) already fails with an exception if globals is NULL.
tp_clear has a purpose; it prevents memory leaks. That leaves protecting access to the cleared fields. (But we don't need to clear func_name and func_qualname as these must be strings.) |
The more I think about this, the more I think there is something else at play. If the GC is able to follow the dependency chain, all weakrefs should already have been handled correctly and PyObject_ClearWeakRefs should never have been invoked. I think *something* is not informing the gc about the dependency chain. |
It's going to be hard to figure that out. FreeIPA uses a ton of C extensions. Some are hand-written; some use Cython, ctypes, and cffi. |
The traceback does have some ctypedescr_dealloc and cfield_dealloc from _cffi. Could you check what those are? |
I think the problem is that whatever is weak-referenced by the weak ref (CField_Type or similar) is not on the gc list. Because it is not on the gc list, handle_weakrefs (https://github.com/python/cpython/blob/master/Modules/gcmodule.c#L1090) does not act correctly on the weak references: they are not cleared before the tp_clear calls, and that produces the crash. |
Unless we are missing something, I think this may be caused by something in cffi that is not implementing the gc-related functions correctly, as PyObject_ClearWeakRefs should not be called from a gc run. Given how complicated this is, and even if the proper solution is to fix the problem in cffi or elsewhere, we can add an extra check in PyObject_ClearWeakRefs to NOT call callbacks that are in the process of being collected. Technically we should never reach that code, but we won't be segfaulting, and the cost is a redundant check that will be false in the normal path. |
The biggest CFFI-based dependency of FreeIPA is cryptography, but cryptography does not use weakref in its sources. CFFI, on the other hand, has a WeakValueDictionary attached to its internal typecache backend, https://bitbucket.org/cffi/cffi/src/bf80722dea36237710083315e336c81bc8571fd0/cffi/model.py#lines-572 . Perhaps that is the culprit? I checked the CTypeDescrObject from the stack trace. Three out of three times it's for "unsigned long(*)(void *)". I wasn't able to get any useful information from the other fields yet.

(gdb) up 7 |
For the weakref to be handled correctly, the ctypedescr needs to be correctly identified as part of the isolated cycle. If it is not in the isolated cycle, something may be missing tp_traverse or the GC flags. Can you check if the ctypedescr is part of gc.get_objects()? |
I investigated the FreeIPA crash.
-- PR 15641 just hides the real bug. One issue is that CFFI doesn't implement the GC protocol correctly. If an object contains another object, its type must:

- define the Py_TPFLAGS_HAVE_GC flag,
- implement tp_traverse (and usually tp_clear), and
- allocate and track its instances with PyObject_GC_New()/PyObject_GC_Track(), untracking them again in tp_dealloc.
Another issue is that the GC doesn't prevent the crash. Would it be possible to prevent the crash without changing the behavior (e.g. still call weakref callbacks)? |
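For reference, a minimal sketch of what that GC protocol looks like for a C type that owns another object; the type name and layout below are made up for illustration, not CFFI's actual types.

```c
#include <Python.h>

/* Hypothetical container type holding one strong reference to another object. */
typedef struct {
    PyObject_HEAD
    PyObject *item;     /* strong reference owned by the container */
} ContainerObject;

static int
Container_traverse(ContainerObject *self, visitproc visit, void *arg)
{
    /* Report every owned PyObject* to the collector. */
    Py_VISIT(self->item);
    return 0;
}

static int
Container_clear(ContainerObject *self)
{
    /* Drop owned references so reference cycles can be broken. */
    Py_CLEAR(self->item);
    return 0;
}

static void
Container_dealloc(ContainerObject *self)
{
    PyObject_GC_UnTrack(self);          /* stop GC tracking before teardown */
    Container_clear(self);
    Py_TYPE(self)->tp_free((PyObject *)self);
}

static PyTypeObject Container_Type = {
    PyVarObject_HEAD_INIT(NULL, 0)
    .tp_name = "example.Container",
    .tp_basicsize = sizeof(ContainerObject),
    .tp_dealloc = (destructor)Container_dealloc,
    .tp_flags = Py_TPFLAGS_DEFAULT | Py_TPFLAGS_HAVE_GC,
    .tp_traverse = (traverseproc)Container_traverse,
    .tp_clear = (inquiry)Container_clear,
};
```

Instances of such a type would additionally need to be allocated with PyObject_GC_New() and registered with PyObject_GC_Track() once their fields are initialized, so the collector can see them at all.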
Łukasz, all type objects have tp_clear slots, and always did. The patch in question put something useful in the function object's tp_clear slot instead of leaving it NULL. No interface, as such, changes either way. |
I think it should not have ABI consequences. However, I see the addition of tp_clear as a new feature. Leaking memory due to reference cycles is bad behavior, but not a bug to be fixed in a maintenance release. Unless we require the full GC protocol on every container object, we can't promise not to leak. Inada's change to add the function tp_clear has been in all the 3.8.0 pre-releases (I think). So, it should be pretty well exercised by now (at least, as well as the pre-releases are tested). |
If either of you resurrects tp_clear in functions in the next 12 hours or so, I'll merge it. |
I created #60706. I'm not exactly sure of the process for making a revert on a branch like '3.8', so hopefully it is correct. The code change is exactly what was reverted in 3.8; the net effect will be as if the revert had never been done. |
I worked some on trying to create a unit test this evening. Attached is my best result so far (gc_weakref_bug_demo.py). It requires that you enable the 'xx' module so we have a container object without tp_traverse. We should probably add one to _testcapi for the real unit test so we have it. I can make the weakref callback run from within delete_garbage(). I haven't been able to make tp_clear for the function execute before the callback and I think that is needed to replicate the crash in this bug report. |
Thanks a million everyone, this is now hopefully solved. I'm leaving it open for Neil to experiment with his unit test (which would be amazing to have!). |
Neil, about this comment:

# - ct is not yet trash (it actually is but the GC doesn't know because of

I believe gc should know ct is trash. ct is in the cf list, and the latter does have tp_traverse. What gc won't know is that

It should blow up anyway :-)

Maybe with a simpler structure it would be easier to rearrange code to nudge the callback into getting cleared before use?

Z <- Y <- A <-> B -> WZ -> C

where WZ is a weakref to Z with callback C, and Y doesn't implement tp_traverse. The only cycle is between A and B, which could just as well be the same object. All the other stuff hangs off that cycle. It's all trash, but we won't know in advance that Z is part of it. |
Łukasz, is there some reason you removed old versions (2.7, 3.6, etc.)? The bug is present on those versions of Python and it should be trivial to backport the fix. If those branches are maintained, I think we should fix it.

Attached is a small test program (gc_weakref_bug_demo2.py) that causes the same crash as in this bug report (SEGV inside _PyFunction_Vectorcall). Setting things up is quite tricky, and it depends on the order in which things are cleared in delete_garbage(). Maybe still worth making a unit test for it? Instead of using the 'dummy' function, you can also use types.SimpleNamespace(). Calling repr() on it after tp_clear will result in a crash or an assert failure. Those two crashes are not needed to confirm the bug; just the fact that 'callback' runs is proof enough. So, in a unit test, the callback could just set a global to indicate failure. |
Oh, Neil missed "bpo-38006: " prefix in his commit. Here it is: commit bcda460
|
Sorry, I did that by mistake. |
Loose ends. Telegraphic because typing is hard.
So the best optimistic guess is that only this can go wrong: we promise that a finalizer will run at most once (even if the object is resurrected), but a surprise finalizer (SF) neither records that it has been run nor recognizes that (if so) it's already been run. IOW, a finalizer will be run at most once when it's not a surprise, but will run every time it is a SF.
Must not move invalids to older generation. Move to new internal (to delete_garbage) list instead. That list should be empty at delete_garbage's end. If not, I'd be happy to die with a fatal runtime error. Else, e.g.,
|
Would it make any sense to add an opt-in option to emit a warning when a new type is created with Py_TPFLAGS_HAVE_GC but it doesn't implement tp_traverse? Maybe also emit a warning if it doesn't implement tp_clear? Maybe it could be a ResourceWarning emitted in development mode, when -X dev is used on the command line. |
Or even fatal error out? There's no reason to have Py_TPFLAGS_HAVE_GC but not implement tp_traverse, AFAIR. |
Well... I would prefer a smooth transition. Such sanity checks, which mostly concern developers, fit the -X dev mode perfectly. I mean, calling Py_FatalError() only in development mode would be acceptable, rather than blocking the use of C extensions that are working "perfectly well" on Python 3.7. |
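A sketch of what such a check could look like; the helper name, the warning text, and where it would be hooked in (e.g. when a type is readied, only under -X dev) are all assumptions here, not existing CPython behaviour.

```c
#include <Python.h>

/* Hypothetical -X dev sanity check: warn about GC-enabled types that the
 * collector cannot traverse or clear.  Not part of CPython; the integration
 * point (e.g. type initialization) is assumed. */
static int
warn_if_gc_type_is_opaque(PyTypeObject *type)
{
    if (!(type->tp_flags & Py_TPFLAGS_HAVE_GC)) {
        return 0;
    }
    if (type->tp_traverse == NULL) {
        return PyErr_WarnFormat(PyExc_ResourceWarning, 1,
                                "type '%s' sets Py_TPFLAGS_HAVE_GC but does "
                                "not implement tp_traverse", type->tp_name);
    }
    if (type->tp_clear == NULL) {
        return PyErr_WarnFormat(PyExc_ResourceWarning, 1,
                                "type '%s' sets Py_TPFLAGS_HAVE_GC but does "
                                "not implement tp_clear", type->tp_name);
    }
    return 0;
}
```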
My understanding is that the CFFI types at issue don't even have Py_TPFLAGS_HAVE_GC. They're completely invisible to gc. As Armin says in the CFFI issue report (linked to earlier), he never got the impression from the docs that he needed to implement anything related to cyclic gc. Since Armin is extremely capable, conscientious, and reasonable, that tells me our docs are lacking.

It was intended at the start that the worst that could happen if a type ignored the gc protocols was that memory may leak. That story changed radically when weakrefs with callbacks were added - but nobody knew it at the time because the weakref design gave no thought to cyclic gc. It's been driven by segfaults ever since ;-)

We're doing all we can to keep non-cooperating code "working", and will continue to do so. But who knows? The next segfault may be one we can't hack around. It's fundamentally insane to expect any gc to work perfectly when it may be blind to what the object graph _is_. |
BTW, the phrase "missing tp_traverse" is misleading. If an object with a NULL tp_traverse appears in a gc generation, gc will blow up the next time that generation is collected. That's always been so - gc doesn't check whether tp_traverse is NULL, it just calls it. It's tp_clear that it checks, because that one is optional.

I don't recall any problem we've had with extensions implementing the gc protocol incorrectly or incompletely. It's this issue's problem: containers not participating in gc _at all_. If we had a traditional mark-sweep collector, that would be massively catastrophic. Miss a pointer and you can conclude a live object is actually trash, and tear it down while it's still in use.

Our problem is the opposite: miss a pointer and we can conclude a trash object is actually live. At the start, the worst that _could_ do is leak memory. It's the later introduction of time-to-die finalizers (weakref callbacks at first) that created the opportunity for segfaults: the only 100% clearly safe way to run finalizers in cyclic trash is to force them to run _before_ anything at all is torn down by force (tp_clear). But to run them in advance, we have to know the relevant objects _are_ trash. Which we can't always know if containers don't always participate.

While Neil & I haven't thought of ways that can go wrong now beyond that a "surprise finalizer" may get run any number of times, that doesn't mean far worse things can't happen - just that they'll surprise us when they do :-) |
I'm often amazed it works at all, let alone perfectly. ;-P This did trigger me to think of another possible problem. I set up my unit test as you suggested:
But what happens if the GC doesn't see that WZ is trash? Then it will not be cleared. Hang it off Y so the GC can't find it. We can set things up so that Z is freed before WZ (e.g. WZ in a second and newer cycle). Then, the callback might still run. On further thought, this seems safe (though maybe I'm missing something) because of the handle_weakrefs() logic: the GC will think WZ is going to outlive Z, so it will invoke the callback before doing any tp_clear calls. |
Indeed! Every time I take a break from gc and come back, I burn another hour wondering why it doesn't recycle _everything_ ;-)
In that case it won't think C is trash either (a weakref holds a strong reference to its callback), so C won't be tp_clear'ed - if WZ isn't in the unreachable set, C can't be either.
Since C hasn't been damaged, and nothing reachable from C would have been considered trash either, I don't see a problem.
Don't think that applies. As above, WZ isn't in the unreachable set, so gc knows nothing about it. Z will eventually be reclaimed via refcounting side effect, C _will_ run, and then WZ, and then C, will be reclaimed by refcounting. Or am I missing something? |
Speaking of which, I no longer believe that's true. Thanks to the usual layers of baffling renaming via macro after macro, I missed object.c's PyObject_CallFinalizerFromDealloc(), which both respects and sets the "am I finalized?" flag regardless of how, e.g., an object with a __del__ ends its life (and possibly resurrects). |
Are we missing something here? |
Closing since I believe the bug is fixed and there is an appropriate unit test. |