-
-
Notifications
You must be signed in to change notification settings - Fork 31.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The garbage collector doesn't take in account that objects of heap allocated types hold a strong reference to their type #84398
Comments
The bpo-35810 modified the object allocate to hold a *strong* reference to the type in PyObject.ob_type, whereas PyObject.ob_type is a *borrowed* references if the type is statically allocated. commit 364f0b0
The problem is now in some corner cases, the GC fails to visit all referrer of a type and so considers that the type is still alive. bpo-40149 is a concrete example of such bug. I propose to modify the GC to take bpo-35810 in account. ... or maybe I just misunderstood bpo-40149 bug :-) |
What? The GC is agnostic of what it receives. It works with objects in general that implement the gc support, but it does not care about what those objects are. The only specal case are weakreferences and because those have inherit GC semantics. I am not sure about what you are proposing, could you elaborate? |
Also, heap types (created with type_new()) are already taking into account the type when visiting references: https://github.com/python/cpython/blob/master/Objects/typeobject.c#L1111 |
If object A owns a strong reference to object B, and A participates in cyclic gc, and B may be part of a cycle, then it's necessary and sufficient that A's type's tp_traverse implementation invoke Py_VISIT() passing A's pointer to B. It would be a Really Bad Idea to add special cases to the gc module to spare some specific type(s) from following that (currently) utterly uniform rule. |
It would be inconvenient to require adding Py_VISIT(Py_TYPE(self)) in all tp_visit implementations of heap allocated types (including third-party extensions). Since tp_visit is GC specific thing, I think it wold be more convenient make a special case for object types. |
I don't think that follows: The gc defers the logic of what and what should not be visited to the tp_traverse implementation of every object. The GC must be agnostic of the types it receives and heap types here are not special. -- Also, if we make an exception in the GC for this special case, all tp_traverse of all those functions will be incorrect. Someone reading those tp_traverse can say "Oh, why these functions do not visit the type if they own s reference to it, this looks incorrect" and then we need to explain that that is an exception in the GC just because we were lazy to implement them correctly. |
Interesting, this issue may be related to bpo-24379. The problem with the proposed implementation of subscript was that it created a reference loop, and not all links in this loop were visible by GC. |
Also, as I mentioned, you don't need to modify all objects tp_traverse, only it's type.tp_traverse slot. For instance, all python objects know how to traverse stuff because they share the same tp_traverse: https://github.com/python/cpython/blob/master/Objects/typeobject.c#L1082 So unless I am missing something, if you want to affect all heap types you just need to modify one tp_traverse in one place: the superclass. |
Hummm, I think we may just be missing a Py_VISIT(Py_TYPE(self))here: https://github.com/python/cpython/blob/master/Objects/typeobject.c#L3562 A class owns references to its type (type) or another metaclass Tim, could you confirm that that is the case? |
Checking it more closely, that is incorrect, so we are not missing that visitation :( |
Recently many static allocated types were converted to heap allocated types (using PyType_FromSpec). Now you need to add Py_VISIT(Py_TYPE(self)) in all corresponding tp_visit implementations. And since even official example for heap allocated types do not contain it (and it was not needed before 3.9), I am sure that all existing code do not contain it. We cannot change all user code, so we should change the interpreter code so that it will work correctly with existing user code. |
If we made a change that make all user code suddenly incorrect, that change should be reverted. The GC has clear rules about what tp_traverse should and should not do, and we should not violate those rules and make special cases in the gc just because we forced some classes to be incorrect. This will make much more difficult to reason about GC bugs, the tp_traverse implementation of classes and much difficult to maintain the GC itself. |
The problem is that we suddenly changed rules. It was not required that the object's type should be visited in tp_visit. It was incorrect to visit it, because object did not have strong reference to its type. User never created it, and it was not created implicitly. Now we changed rules. A strong reference is created implicitly. Who is responsible to manage a strong reference? Whose who created it, ant it is the interpreter, not the user. User does not know anything about it. If we pass the responsibility for the strong reference to the type on the user, we makes all user code incorrect, and we add a burden of fixing it and maintaining compatibility with incompatible Python versions on the user. I think it is very bad. |
Even if we decide that the patch that introduced the new rules should not be reverted, then what we should do is wrap the tp_traverse of the user in something that also calls Py_VISIT(Py_TYPE(self)) so basically the tp_traverse of the type created by PyType_FromSpec will do static int
PyType_FromSpec_tp_traverse(_abc_data *self, visitproc visit, void *arg)
{
Py_VISIT(Py_TYPE(self))
return self->user_provided_tp_traverse(self, visit, arg);
} That way, we can still reason about what the tp_traverse should do, we don't break more rules and we don't need to make maintaining the GC even more difficult. |
In #19414 I have attached a proof of concept of a wrapper for tp_traverse that only affects types created by PyType_FromSpec that will call Py_VISIT(Py_TYPE(self)) and then the user function. |
I have made some investigation, and I think a form of this bug was there from a long time and does not really relate to heap types. If you consider the more simple heap type: >>> class A:
... ...
...
>>> A().__class__
<class '__main__.A'> As the class instance refers to its type, it must visit it in the gc and we can see that indeed, that is the case: >>> import gc
>>> gc.get_referents(A())
[<class '__main__.A'>] But for instance, consider ast.AST, which in 3.7 is a static type created with PyType_Ready: >>> import ast
>>> x = ast.AST()
>>> x.__class__
<class '_ast.AST'> we can see that the object refers to its type as the other one but....oh no, the gc does not know about that link: >>> import gc
>>> gc.get_referents(x)
[] This is because its traverse function does not visit the type: static int
ast_traverse(AST_object *self, visitproc visit, void *arg)
{
Py_VISIT(self->dict);
return 0;
} This is not a problem if the type is *static* because __although is technically an error__ because the GC cannot resolve The problem appears when the type is not eternal and can be destroyed. For instance, to understand this consider a function
# Create a new heap type with custom traverse function
>>> x = _testcapi.construct_new_gc_type()
>>> sys.getrefcount(x)
5
>>> new_obj = x()
>>> sys.getrefcount(x)
6
# We know that now there should be a link between new_obj and x because one points to the other
>>> x in gc.get_referents(new_obj)
False
# Oh no, there is no link
>>> class A:
... def __del__(self):
... print("Ouch")
...
>>> x_w = weakref.ref(x)
# Create a reference cycle a -> new_obj -> x -> a
>>> a = A()
>>>
>>> x.a = a
>>> a.y = new_obj
>>> a.x = x
>>> gc.collect()
0
>>> del x,a
>>> gc.collect()
0
>>> sys.getrefcount(x_w())
6
>>> del new_obj
# At this point all variables are gone and the collector should clean everything
>>> gc.collect()
0
# Oh, no! The type is still alive
>>> sys.getrefcount(x_w())
6 If we do the same experiment after PR19314: >>> import sys, gc, _testcapi
>>> import weakref
>>> x = _testcapi.construct_new_gc_type()
>>> sys.getrefcount(x)
5
>>> new_obj = x()
>>> sys.getrefcount(x)
6
# We know that now there should be a link between new_obj and x because one points to the other
>>> x in gc.get_referents(new_obj)
True
# Good!
>>> class A:
... def __del__(self):
... print("Ouch")
...
>>> x_w = weakref.ref(x)
# Create a reference cycle a -> new_obj -> x -> a
>>> a = A()
>>> x.a = a
>>> a.y = new_obj
>>> a.x = x
>>> gc.collect()
Traversed!
Traversed!
36
>>> del x,a
>>> gc.collect()
Traversed!
Traversed!
0
>>> sys.getrefcount(x_w())
6
>>> del new_obj
# At this point all variables are gone and the collector should clean everything
>>> gc.collect()
Ouch
8
# Nice, the collector cleaned the cycle So the conclusion:
|
I'm not comfortable with requesting authors of all C extensions to modify their tp_traverse function. As Pablo explained, the type is already visited on subtypes instances. I approved PR 19414. |
Tests should be added, maybe based on msg365963 examples. But I would prefer to get this bug fixed in 3.9.0a6 (if possible), tests can be added later. |
I do not think it is right. Please wait, I'll submit an alternate solution. |
There are limited options to fix this issue: (A) Revert commit 364f0b0 It would reintroduced bpo-35810 bug. Moreover, C extension modules which have been modified to call Py_DECREF(Py_TYPE(self)) in tp_dealloc which suddenly crash. I don't think that anyone wants this option. (B) Require all C extension modules authors to modify their tp_traverse function. It requires to communicate well that all C extension modules which use PyType_FromSpec() need to modify their tp_traverse method to add code specific to Python 3.8 or newer. Serhiy and me are against this option. (C) Modify gcmodule.c to traverse the type. This option is not trivial since *subtype* are already traversed. Moreover, Pablo and Tim are strongly against this option. (D) Modify PyType_FromSpec() to hook into tp_traverse and transparently visit the type, as already done for subtypes. Option chosen by PR 19414. Honestly, I'm unhappy that we have to hook into tp_traverse, but this option sounds like the least bad option to me. Developers don't have to modify their C extensions code, the bug is fixed. |
There is no possible world in which the best answer is "hack gcmodule.c" ;-) I haven't tried to dig into the details here, but Pablo's approach looked spot-on to me: put the solution near the source of the problem. The problem is specific to a relatively tiny number of objects, and all gc asks is that they play by the rules. |
I asked Lukasz (Python 3.9 release manager) in private and he is ok to merge this PR. He plans to wait until Refleak buildbots turn green again before tagging Python 3.9.0alpha6. Serhiy:
Serhiy: I merged Pablo's PR to unblock Refleak buildbots, but I'm still curious to see what you have to propose. We can still change the code before 3.9.0beta1. Don't hesitate to propose your PR! (your PR will have to revert commit 0169d30). -- Pablo: would you mind to write a NEWS entry (in the C API category) for your change? IMHO the commit title is good enough to be reused as the NEWS entry. I added "skip news" label just to unblock buildbots. |
Please backport to 3.8, then it will become part of 3.8.3rc1 which I'll be releasing tomorrow. |
I propose to *not* fix this bug in Python 3.8:
-- The worst thing which can happen with this bug is that a C extension module creates many types and these types are never deleted. Well, it's annoying, but it's unlikely to happen. Usually, types are created exactly once in C extensions. -- I'm not 100% comfortable with backporting the fix. Python 3.8.0, 3.8.1 and 3.8.2 have been released with the bug, and this issue is the first report. But I only saw the bug because many C extension modules of the Python 3.9 stdlib were converted to PyModuleDef_Init() and PyType_FromSpec(). -- I'm not comfortable to change the GC behavior in a subtle way in a 3.8.x minor release. If a C extension module was impacted by this bug and decided to explicitly visit the type in its traverse function, this fix will crash the C extension crash... -- At April 23, Serhiy proposed an alternative fix. But then he didn't propose a PR. -- The bug only impacts C extension modules which use PyModuleDef_Init() and PyType_FromSpec(). Python 3.8 only uses PyModuleDef_Init() in the following modules: vstinner@apu$ grep -l PyModuleDef_Init Modules/*.c Python 3.8 only uses PyType_FromSpec() in the following modules: $ grep -l PyType_FromSpec Modules/*.c
Modules/_curses_panel.c
Modules/_ssl.c
Modules/_testcapimodule.c
Modules/_testmultiphase.c
Modules/_tkinter.c
Modules/xxlimited.c Intersection of the two sets: _testmultiphase and xxlimited, which are modules only used by the Python test suite. It means that Python 3.8 standard library is *not* impacted by this bug. Only third party C extension modules which use PyModuleDef_Init() *and* PyType_FromSpec() are impacted. Most C extension modules use statically allocated types and so are not impacted. |
The issue has been fixed in the master branch, thanks Pablo! I close the issue. |
It looks like the fix breaks types that override Py_tp_alloc but not Py_tp_traverse (as some types in PySide do.) I'll investigate more and work on fixing that if someone doesn't beat me to it. |
Re-opening the issue |
Maybe we should revert this change and modify the "how to port to ..." section as we did in bpo-35810 to tell users that they need to manually visit the parent in classes created with PyType_FromSpec |
Actually, I have been thinking about this more and I cannot really trace how the bug could happen. We control the type allocation, but Py_tp_alloc controls the instance allocation and that can be anything, the current patch should not interfere with that. Petr, do you have a reproducer? I tried to create a reproducer of a type that overrides Py_tp_alloc but not Py_tp_traverse (therefore it does not implement gc support) and I cannot make it crash. |
Indeed, it seems my initial analysis was incorrect. I can't reproduce this outside of PySide. Sorry for the noise. |
Ha! I think I found the issue in PySide now. It's different, but it's still a CPython issue. It's actually a variant of the "defining class" problem from PEP-573. It's legal to get the value of a slot, using PyType_GetSlot. Getting one's superclass is a bit tricky when dealing entirely with FromSpec Base:
Subclass:
So when the subclass is traversed:
Another issue is that What would be the downsides of reverting and documenting that tp_traverse |
Not much IMHO, I actually would prefer this solution over any automatic "hacks". The reason is that any hack will be technically violating the contract: if I read the tp_traverse of any of these classes I would say "wait a minute....this is wrong: the class owns a strong reference to its parent so...why is not visiting it?". But the answer would be that we are hacking because we didn't want to force users to migrate once we added the strong reference. Notice that the root of this problem was bpo-35810. In that issue we added a guide to the docs: https://docs.python.org/3/whatsnew/3.8.html#changes-in-the-c-api. In there it says that you should decref now the parent on tp_dealloc: static void
foo_dealloc(foo_struct *instance) {
PyObject *type = Py_TYPE(instance);
PyObject_GC_Del(instance);
#if PY_VERSION_HEX >= 0x03080000
// This was not needed before Python 3.8 (Python issue 35810)
Py_DECREF(type);
#endif
} but it forgot about that you should also visit the parent in tp_traverse. So, I propose to do the following:
What do you think? |
I have been playing with the reproducer and I am a bit confused: The reproducer crashes in the same way even after reverting PR 19414 so it does not seem related to it. This is what I get: >>> import reproducer
>>> reproducer.Modules/gcmodule.c:114: gc_decref: Assertion "gc_get_refs(g) > 0" failed: refcount is too small
Enable tracemalloc to get the memory block allocation traceback object address : 0x55cf429e4080 Fatal Python error: _PyObject_AssertFailed: _PyObject_AssertFailed Current thread 0x00007fc65d2f8740 (most recent call first): This seems to indicate that that reproducer is already implementing incorrectly tp_traverse. |
Ok, I found the problem. The problem is that the reproduced does not correctly work the reference count of base_class because when construction get tuple of bases: PyObject *bases = PyTuple_New(1);
result = PyTuple_SetItem(bases, 0, base_class);
if (result) return -1;
PyObject *subclass = PyType_FromModuleAndSpec(m, &subclass_spec, bases); "PyTuple_SetItem" steals a reference to base_class but "PyModule_AddObject" also does the same, and the refcount is incorrect. If you add a Py_INCREF before, the crash disappears. |
To be clear: the other crash is still in the reproduced: the one that Petr describes in his comment. In PR 20264 I have prepared the changes that I proposed previously (including the revert of the hack). Petr, do you mind reviewing it? |
Thank you for responding so quickly!
Yes, but I'll only get to it next week. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: