Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better deletion of items of TripleDict #13904

Closed
simon-king-jena opened this issue Jan 3, 2013 · 41 comments
Closed

Better deletion of items of TripleDict #13904

simon-king-jena opened this issue Jan 3, 2013 · 41 comments

Comments

@simon-king-jena
Copy link
Member

In #715, TripleDict has been modified, so that the key items are compared by identity rather than equality, and so that weak keys are used when possible.

However, as it turns out, the deletion of the items tracked in a TripleDict was done improperly. The aim of this ticket is to fix that.

Superseded by #13387.

Depends on #13896
Depends on #12313

CC: @nbruin @vbraun @jpflori

Component: memleak

Reviewer: Simon King, Jean-Pierre Flori

Issue created by migration from https://trac.sagemath.org/ticket/13904

@simon-king-jena
Copy link
Member Author

Attachment: trac_13904_fix_del.patch.gz

Attempt to fix TripleDict.__delitem__

@nbruin
Copy link
Contributor

nbruin commented Jan 3, 2013

comment:1

There's the more severe problem that during GC cleanup, a weakref callback could lead to the deletion of a non-weakreffable key, which could retrigger the dealloc of the object that has just generated the weakref. That would lead to a double dealloc.

The following might be a solution:

  • equip TripleDict with an attribute self.deathrow=[].
  • when TripleDictEraser finds itself removing a non-weakreffed key (other than None, int, float, etc), it appends the key to self.deathrow instead.
    This way, we don't trigger the deletion (and the whole cascade that may follow) in the callback.

Deathrow will be cleaned up eventually, as part of the regular cleanup code for deallocation of TripleDict.

The problem is that weakref callback operates in a strange environment: There is a dealloc call higher up the stack somewhere. Normally python code doesn't run under those circumstances. It shouldn't make much difference, but you should absolutely not delete that object again. Hence, I think the default should be: Don't delete objects in callback. You can modify this rule, but only on a case-by-case basis, by carefully reasoning that the deletions that you do allow won't lead to nasty unpredictable deletion cascades.

@nbruin
Copy link
Contributor

nbruin commented Jan 3, 2013

comment:2

I don't think we need this ticket for the __delitem__ optimization. That can be wrapped into #13387.

@simon-king-jena
Copy link
Member Author

comment:3

Nils, indeed it might be an idea to use the ideas from #13387. However, the ticket here is not about optimization.

Anyway.

I am starting with Sage's debug version as in #13864. With the first patch applied, one gets a crash in modules/module.pyx

With both patches applied, one gets

sage -t  "devel/sage-main/sage/modules/module.pyx"          
         [3.5 s]

So, that looks like a progress.

However, at least with a slightly different version of the first patch, make ptest still results in

        sage -t  -force_lib devel/sage/sage/schemes/elliptic_curves/ell_point.py # Killed/crashed
        sage -t  -force_lib devel/sage/sage/algebras/free_algebra_quotient.py # Killed/crashed
        sage -t  -force_lib devel/sage/sage/algebras/free_algebra_quotient_element.py # Killed/crashed
        sage -t  -force_lib devel/sage/sage/categories/modules_with_basis.py # Killed/crashed
        sage -t  -force_lib devel/sage/sage/categories/homset.py # 1 doctests failed

when both patches are applied. Note, however, that I can not reproduce these crashes when I run them independently.

Hence, one needs to dig deeper. See comment:1.

@simon-king-jena
Copy link
Member Author

comment:4

Replying to @nbruin:

There's the more severe problem that during GC cleanup, a weakref callback could lead to the deletion of a non-weakreffable key, which could retrigger the dealloc of the object that has just generated the weakref. That would lead to a double dealloc.

Ah, you say that the problem actually isn't so much the __delitem__ but the callback? OK, I'll have a look into that. And of course we should try to rebase #13387.

@simon-king-jena
Copy link
Member Author

comment:5

Oops, I just notice that the patch attempts to fix TripleDictEraser.__call__, hence, the callback, but not TripleDict.__delitem__. So, I did have a look into the callback function...

@nbruin
Copy link
Contributor

nbruin commented Jan 3, 2013

comment:6

Replying to @simon-king-jena:

Ah, you say that the problem actually isn't so much the __delitem__ but the callback? OK, I'll have a look into that. And of course we should try to rebase #13387.

The __delitem__ isn't wrong. It just happens to be slightly more memory intensive that the straight del, so it happened to trigger a GC. The crash was due to Cython not handling GC during deallocation properly. That's getting fixed with #13896. The potential issue I described above hasn't been observed. I suspect that one could engineer an example. Perhaps one should (Robert's cython test case for #13896 can serve as a template). It's just that memory management is such a sensitive business that one should really (informally) prove bad scenarios can't happen rather than wait until they do. I think the scenario I describe can happen and I don't see what measures in the python code would prevent it.

One way to solve this would be to forbid non-weakreffable container types to be used as keys. The problem is that I can't think of a reliable test for that. Since the consequences of violating might lead to incredibly insidious memory corruptions, I think we should guard against it. A consequence is that any non-weakreffable (non-whitelisted such as None or int) key automatically has its life span bounded below by the lifetime of the TripleDict, even if the entry in which it served got removed due to a weakref dying (explicit key deletions would be fine). I don't see how we could get hold of a hook where clearing deathrow would be safe.

@simon-king-jena
Copy link
Member Author

comment:7

Replying to @nbruin:

There's the more severe problem that during GC cleanup, a weakref callback could lead to the deletion of a non-weakreffable key, which could retrigger the dealloc of the object that has just generated the weakref. That would lead to a double dealloc.

Is what you describe as follows?

Let's say we have a key (A,B,C), where A is weakrefed and got deleted, so that the key (A,B,C) gets removed by means of A's callback function. Now let B be non-weakrefable. Assume further that the key triple is the last reference to B, and assume that B has an attribute that constitutes a (strong? weak?) reference to A. Then, deletion of B would trigger a second deletion of A.

Or did I misunderstand? Can you elaborate a bit more, please?

@nbruin
Copy link
Contributor

nbruin commented Jan 3, 2013

comment:8

Replying to @simon-king-jena:

Let's say we have a key (A,B,C), where A is weakrefed and got deleted, so that the key (A,B,C) gets removed by means of A's callback function. Now let B be non-weakrefable. Assume further that the key triple is the last reference to B, and assume that B has an attribute that constitutes a (strong? weak?) reference to A. Then, deletion of B would trigger a second deletion of A.

Yes, that's correct. It would have to be a strong reference from B to A, because deletion of weakrefs should not trigger deletion cascades. Of course, that means the reference count to A isn't 0, so normally A would not be eligible for deletion ... but if A and B are found to be cyclic garbage, deallocation could still occur. We'd be at the mercy of GC for how the chain gets broken up and deleted.

In fact, moving B to deathrow could be tricky, because that would resurrect B (generally frowned upon), albeit only within the cyclic garbage. OK, I'll try and see if I can make a testcase that triggers this behaviour. At this point I don't even know how to fix this. Perhaps we should just forbid non-weakreffable keys that refer to another key component. Mind you, that forbids

(None, ZZ, ZZ(1) )

as a key, which doesn't immediately look completely unlikely. We don't generate keys like that, but someone else might ...

@simon-king-jena
Copy link
Member Author

comment:9

Replying to @nbruin:

There's the more severe problem that during GC cleanup, a weakref callback could lead to the deletion of a non-weakreffable key, which could retrigger the dealloc of the object that has just generated the weakref. That would lead to a double dealloc.

PS: Why can this only happen with a non-weakreffable key? In particular, what are the keys in our crashing example? Namely, in its current use, I don't think that there are any non-weakreffable keys other than None and ints.

The following might be a solution:

  • equip TripleDict with an attribute self.deathrow=[].
  • when TripleDictEraser finds itself removing a non-weakreffed key (other than None, int, float, etc), it appends the key to self.deathrow instead.
    This way, we don't trigger the deletion (and the whole cascade that may follow) in the callback.

Deathrow will be cleaned up eventually, as part of the regular cleanup code for deallocation of TripleDict.

The problem is that weakref callback operates in a strange environment: There is a dealloc call higher up the stack somewhere. Normally python code doesn't run under those circumstances. It shouldn't make much difference, but you should absolutely not delete that object again. Hence, I think the default should be: Don't delete objects in callback.

One could provide a new method TripleDict.kill_deathrow, that will be called in any other method of TripleDict. Hence, the deletions would occur when the TripleDict will be used another time, and in particular this will only happen after the "dealloc call higher up the stack" has been completed.

Perhaps kill_deathrow could also be called at the beginning of the callback function - so that it performs the deletions that were scheduled by the preceding callback function.

Problem could be a severe slow-down.

@nbruin
Copy link
Contributor

nbruin commented Jan 3, 2013

comment:10

Replying to @simon-king-jena:

PS: Why can this only happen with a non-weakreffable key? In particular, what are the keys in our crashing example? Namely, in its current use, I don't think that there are any non-weakreffable keys other than None and ints.

The cython-induced crash has only weakreffable keys. The hypothetical behaviour has not been observed.

One could provide a new method TripleDict.kill_deathrow, that will be called in any other method of TripleDict. Hence, the deletions would occur when the TripleDict will be used another time, and in particular this will only happen after the "dealloc call higher up the stack" has been completed.

But arbitrary code can be executed during weakref callbacks, including arbitrary TripleDict operations. So you'd never be sure if it's safe to kill deathrow.

Really, this is starting to look more like a design deficiency in CPython: it should really mark objects on which dealloc is running to prevent GC, decref, whatever, from reentering dealloc. Preventing this by code protocol just seems to unduly restrict allowed data structures.

Perhaps kill_deathrow could also be called at the beginning of the callback function - so that it performs the deletions that were scheduled by the preceding callback function.

You wouldn't be sure that deathrow entries are now safe to delete ...

EXECUTIVE DECISION: we mandate that key components to TripleDict be either weakreffable or do not hold references to other objects (i.e., are non container-types). Perhaps we can relax this rule once we come up with a smart mechanism.

@simon-king-jena
Copy link
Member Author

comment:11

Replying to @nbruin:

Replying to @simon-king-jena:

PS: Why can this only happen with a non-weakreffable key? In particular, what are the keys in our crashing example? Namely, in its current use, I don't think that there are any non-weakreffable keys other than None and ints.

The cython-induced crash has only weakreffable keys...

... and is an upstream bug in Cython?

One could provide a new method TripleDict.kill_deathrow, that will be called in any other method of TripleDict. Hence, the deletions would occur when the TripleDict will be used another time, and in particular this will only happen after the "dealloc call higher up the stack" has been completed.

But arbitrary code can be executed during weakref callbacks, including arbitrary TripleDict operations. So you'd never be sure if it's safe to kill deathrow.

Not quite, because we Know exactly what is happening in our weakref callbacks (namely: In the __call__ method of TripleDictEraser). The callback function will not call any method of TripleDict.

Perhaps kill_deathrow could also be called at the beginning of the callback function - so that it performs the deletions that were scheduled by the preceding callback function.

You wouldn't be sure that deathrow entries are now safe to delete ...

Agreed.

EXECUTIVE DECISION: we mandate that key components to TripleDict be either weakreffable or do not hold references to other objects (i.e., are non container-types). Perhaps we can relax this rule once we come up with a smart mechanism.

I guess this would be the easiest solution on the short run. Do you think of a whitelist, i.e., explicitly allowing NoneType, int, long and so on?

@simon-king-jena
Copy link
Member Author

comment:12

By the way, my first attachment is wrong anyway, because I mispelled gc.isenabled().

@simon-king-jena
Copy link
Member Author

For debugging only. Increase the likelyhood that a gc occurs during a weakref callback

@simon-king-jena
Copy link
Member Author

comment:13

Attachment: trac_13904_debug_only.patch.gz

The "debug_only" patch is updated. With it, I tried the following:

sage: cython("""
....: cdef class A:
....:     cdef object x
....:     def __init__(self, x):
....:         self.x = x
....: """)
sage: import gc
sage: from sage.structure.coerce_dict import TripleDict
sage: T = TripleDict(31)
sage: while(1):
....:     b = B()
....:     a = A(b)
....:     T[b,a,None] = a
....:     del a,b
....:     foo = gc.collect()
....:     

That should be exactly the theoretical situation you were describing, isn't it?

But there is no crash. When I interrupt it after a while, with Ctrl-C, I get

sage: len(T)
1681

This I don't understand. Is TripleDict broken, in the sense of it doesn't allow its weak keys be cleared??

@simon-king-jena
Copy link
Member Author

comment:14

Replying to @simon-king-jena:

This I don't understand. Is TripleDict broken, in the sense of it doesn't allow its weak keys be cleared??

Well, certainly the weak keys can be cleared. But apparently they can not be cleared, if there is a non-weakreffable item in the key that holds a reference to a weakreffable item in the key.

Hence, on the one hand, it seems that the situation you constructed is a non-issue, because the callback function for the weak reference will not be called.

On the other hand, it is yet another reason to not allow non-weakreffable containers as key.

@nbruin
Copy link
Contributor

nbruin commented Jan 3, 2013

Script that tries to cause a double dealloc but fails because python is mature

@nbruin
Copy link
Contributor

nbruin commented Jan 3, 2013

comment:15

Attachment: test.sage.gz

HOORAY!
As Modules/gc_weakref.txt shows, Zope has already put very good stresstests on python's clearing of weakrefs in cyclic garbage. I think we're safe with TripleDict as it is, because python takes special precautions when running weakref callbacks in cleaning up cyclic garbage. In our case the weakref itself is part of the cyclic garbage. That means Python always stacks the deck such that the weakref appears to be cleared prior to the deletion of what it references. Hence, the callback doesn't happen! I think we're safe with TripleDict as it is.

@nbruin
Copy link
Contributor

nbruin commented Jan 3, 2013

comment:16

Replying to @simon-king-jena:

But there is no crash. When I interrupt it after a while, with Ctrl-C, I get

sage: len(T)
1681

This I don't understand. Is TripleDict broken, in the sense of it doesn't allow its weak keys be cleared??

No, nothing is broken. You haven't included the definition of B, but I assume it's a weakreffable type.

Although you delete a,b, your dictionary T is still holding a strong reference to a (because it can't weakref it) and a is holding a strong reference to b, so b is reachable. Nothing gets collected.

Things are collectible once you throw away T but thanks to the special handling of callbacks in cyclic garbage this is always handled properly.

@simon-king-jena
Copy link
Member Author

comment:17

Replying to @nbruin:

I think we're safe with TripleDict as it is.

OK. But I wouldn't like to resolve this ticket as "wontfix" or "duplicate".

Still, we should use "del", not "__delitem__", because it gives shorter C-code. And in addition to that, we should see whether the new Cython from #13896 fixes the problem.

@simon-king-jena
Copy link
Member Author

Dependencies: #13896

@nbruin
Copy link
Contributor

nbruin commented Jan 3, 2013

comment:19

Replying to @simon-king-jena:

Still, we should use "del", not "__delitem__", because it gives shorter C-code. And in addition to that, we should see whether the new Cython from #13896 fixes the problem.

Yes, go ahead. Since there is no urgent issue anymore, use this ticket for whatever you want (and yes, that cython does solve the problem, if you compile the whole library with it. I'm now pretty confident I know exactly what went wrong, after seeing it happen on bit-level ...). The gc_weakref.txt is a fascinating read and should be required for anybody intending to meddle with sage's memory management.

@simon-king-jena
Copy link
Member Author

comment:20

It seems that indeed #13896 fixes the problem. Hooray \<sup>.</sup>/

Hence, the priority regarding #715 is to upgrade Cython. The change from calling __delitem__ to del should be done (here) as well, but not so urgent.

@jpflori
Copy link

jpflori commented Jan 4, 2013

comment:22

Can we get the change from __delitem__ to del quickly in?
It is trivial, produce better C code, and potentially avoid horribly complicated gc problems?

From what I've really quickly read, Nils pointed out another problem with weakrefs and callbacks, but let's go one step at a time and deal with that elsewhere (or move the trivial fix on a separate ticket).

@jpflori
Copy link

jpflori commented Jan 4, 2013

comment:23

Oh in fact, I've read a little less quickly and the problem Nils thought about does not occur?
So let's just get the trivial improvement here.
I agree it is not a fix, but an imporvement, so I've changed the ticket title to reflect that as well.

And in case one really wants to let garbage collection occurs during weakref callback or whenever, there are always ways to do that.

@simon-king-jena
Copy link
Member Author

Use del rather then calling __delitem__ directly, in TripleDict and MonoDict

@simon-king-jena
Copy link
Member Author

comment:26

Attachment: trac13904_use_del.patch.gz

I think using del rather than calling __delitem__ makes sense even though the segfaults from #715 seem to be fixed by the Cython upgrade: Apparently del results in shorter C code.

The new patch uses del both on TripleDict and the new MonoDict, hence, it depends on #12313, where the latter was introduced.

Apply trac13904_use_del.patch

@simon-king-jena
Copy link
Member Author

Changed dependencies from #13896 to #13896, #12313

@simon-king-jena
Copy link
Member Author

Author: Simon King

@jpflori

This comment has been minimized.

@jpflori
Copy link

jpflori commented Jan 21, 2013

Reviewer: Jean-Pierre Flori

@jdemeyer jdemeyer modified the milestones: sage-5.6, sage-5.7 Jan 21, 2013
@jdemeyer jdemeyer modified the milestones: sage-5.7, sage-5.8 Feb 1, 2013
@jdemeyer
Copy link

Merged: sage-5.8.beta0

@jdemeyer
Copy link

Changed merged from sage-5.8.beta0 to none

@jdemeyer
Copy link

comment:31

Unmerging because of trouble with #12313.

@jdemeyer jdemeyer reopened this Feb 20, 2013
@nbruin
Copy link
Contributor

nbruin commented Feb 20, 2013

comment:32

Perhaps try #13387 instead? With the solution there, this ticket is superseded there. It improves the performance of these dictionaries in some other ways as well.

@jpflori
Copy link

jpflori commented Feb 28, 2013

comment:33

So this one should get closed as #12313 and #13387 will get merged together?

@jdemeyer
Copy link

Changed reviewer from Jean-Pierre Flori to Simon King, Jean-Pierre Flori

@jdemeyer
Copy link

Changed author from Simon King to none

@jdemeyer

This comment has been minimized.

@jdemeyer jdemeyer removed this from the sage-5.8 milestone Feb 28, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants