Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix cython's gc_track and gc_untrack #13896

Closed
nbruin opened this issue Jan 1, 2013 · 38 comments
Closed

Fix cython's gc_track and gc_untrack #13896

nbruin opened this issue Jan 1, 2013 · 38 comments
Assignees
Milestone

Comments

@nbruin
Copy link
Contributor

nbruin commented Jan 1, 2013

In a long sage-devel thread we eventually found in this message that a GC during a weakref callback on a Cython class can lead to double deallocation of that class. In Python's Objects/typeobject.c, line 1024 and onwards, there are some comments that indicate that earlier version of Python were bitten by this problem too. The solution is to insert the appropriate PyObject_GC_Untrack and PyObject_GC_Track in cython's deallocation code. This is best fixed in cython itself.

Install only the new spkg at http://boxen.math.washington.edu/home/jdemeyer/spkg/cython-0.17.4.spkg

Upstream: Completely fixed; Fix reported upstream

CC: @simon-king-jena @jpflori

Component: memleak

Author: Robert Bradshaw

Reviewer: Jeroen Demeyer

Merged: sage-5.6.beta3

Issue created by migration from https://trac.sagemath.org/ticket/13896

@nbruin nbruin added this to the sage-5.6 milestone Jan 1, 2013
@nbruin
Copy link
Contributor Author

nbruin commented Jan 1, 2013

Patch to more reliably produce crash

@nbruin
Copy link
Contributor Author

nbruin commented Jan 1, 2013

comment:1

Attachment: double-free-crash.patch.gz

With attached patch applied to 5.6.beta2 (and probably also other versions close to it),

sage -t devel/sage/sage/modules/module.pyx

will crash relatively reliably on several machines (including sage.math)

@jpflori
Copy link

jpflori commented Jan 2, 2013

comment:3

I'd like to see this ticket as a blocker, anyone against this idea?

@nbruin
Copy link
Contributor Author

nbruin commented Jan 2, 2013

comment:4

Replying to @jpflori:

I'd like to see this ticket as a blocker, anyone against this idea?

Since this is the ultimate "can generate segfaults anywhere", it's a prime candidate for blocker status. However, we're fully at the mercy of cython developers as to when this gets fixed. Also, if we release with this bug unfixed, we might as well leave #715 in too, since this one has a much wider possible impact :-).

@jpflori
Copy link

jpflori commented Jan 2, 2013

comment:5

Ok, Ive put it as blocker.

For those who want to play while waiting for upstream, I've posted a p0 Cython spkg which does "something" with PyObject_GC_[Un]Track.
Not sure it makes any sense, but it seems to make our bug disappear.
It's at
http://boxen.math.washington.edu/home/jpflori/cython-0.17.3.p0.spkg

@nbruin

This comment has been minimized.

@nbruin
Copy link
Contributor Author

nbruin commented Jan 2, 2013

comment:6

Apologies. I saw I linked to the wrong file. Include/object.h also has some interesting information, but it looks like it is a bit out-of-date on some bits. In particular, if you look at the actual use of the TRASHCAN macros:

    PyObject_GC_UnTrack(self);
    ++_PyTrash_delete_nesting;
    Py_TRASHCAN_SAFE_BEGIN(self);
    --_PyTrash_delete_nesting;
...
  endlabel:
    ++_PyTrash_delete_nesting;
    Py_TRASHCAN_SAFE_END(self);
    --_PyTrash_delete_nesting;

with the explanation a little lower:

       Q. Why the bizarre (net-zero) manipulation of
          _PyTrash_delete_nesting around the trashcan macros?

       A. Some base classes (e.g. list) also use the trashcan mechanism.
          The following scenario used to be possible:

          - suppose the trashcan level is one below the trashcan limit

          - subtype_dealloc() is called

          - the trashcan limit is not yet reached, so the trashcan level
        is incremented and the code between trashcan begin and end is
        executed

          - this destroys much of the object's contents, including its
        slots and __dict__

          - basedealloc() is called; this is really list_dealloc(), or
        some other type which also uses the trashcan macros

          - the trashcan limit is now reached, so the object is put on the
        trashcan's to-be-deleted-later list

          - basedealloc() returns

          - subtype_dealloc() decrefs the object's type

          - subtype_dealloc() returns

          - later, the trashcan code starts deleting the objects from its
        to-be-deleted-later list

          - subtype_dealloc() is called *AGAIN* for the same object

          - at the very least (if the destroyed slots and __dict__ don't
        cause problems) the object's type gets decref'ed a second
        time, which is *BAD*!!!

          The remedy is to make sure that if the code between trashcan
          begin and end in subtype_dealloc() is called, the code between
          trashcan begin and end in basedealloc() will also be called.
          This is done by decrementing the level after passing into the
          trashcan block, and incrementing it just before leaving the
          block.

          But now it's possible that a chain of objects consisting solely
          of objects whose deallocator is subtype_dealloc() will defeat
          the trashcan mechanism completely: the decremented level means
          that the effective level never reaches the limit.      Therefore, we
          *increment* the level *before* entering the trashcan block, and
          matchingly decrement it after leaving.  This means the trashcan
          code will trigger a little early, but that's no big deal.

It's probably better to leave out the trashcan for now. It seems like rather tricky code and I'm not sure it's part of the official Python C-API (it might be something internal, just like they use some macros themselves they find unsafe for use in extension modules)

@jpflori
Copy link

jpflori commented Jan 2, 2013

comment:7

I saw and read about this additional steps in addition to the macro, but I was not sure it was also needed here.

Anyway I agree it is a better take to leave that out for now, and anyway, upstream will decide what is the best.

So I've updated the spkg to not include the trashcan parts.

@jpflori
Copy link

jpflori commented Jan 2, 2013

Attachment: cython-0.17.3.p0.diff.gz

@nbruin
Copy link
Contributor Author

nbruin commented Jan 2, 2013

comment:8

Replying to @jpflori:

I saw and read about this additional steps in addition to the macro, but I was not sure it was also needed here.

In fact, I think the precautions taken are not enough for general cython classes. With the little

    ++_PyTrash_delete_nesting;
    Py_TRASHCAN_SAFE_BEGIN(self);
    --_PyTrash_delete_nesting;
    ...
    ++_PyTrash_delete_nesting;
    Py_TRASHCAN_SAFE_END(self);
    --_PyTrash_delete_nesting;

dance they are making sure there is room for one extra trashcan nesting provided that that call doesn't use the same trick. However, a cython class could have a whole inheritance hierarchy going here (that would all use this trick too!), so I'm pretty sure that the exact scenario they describe could still happen. You'd need to know the depth of the inheritance line (for deallocs, multiple inheritance can't happen, right?) and ensure there's enough room for all those.

@robertwb
Copy link
Contributor

robertwb commented Jan 2, 2013

comment:9

cython/cython@9a08ff2

Coming up with a nice clean test was...interesting.

@jpflori
Copy link

jpflori commented Jan 2, 2013

comment:10

Just one potentially naive question:
shouldn't the object get retracked iff you're going to call another dealloc method?
or conversely, if the type does not extend a previous type, shouldn't the object stay untracked when you call tp_free?
I'm not sure it would really matter if the object is still tracked in this latter case, but I got this feeling when staring at CPython's code today.

Anyway, it just made me think of what will happen if your extension class is GC tracked, but the base class is not? In this case you're lost because if you track your object before calling the base dealloc, then you will not untrack it there. Is that even possible? And anyway if a class is not gc tracked, or is not a container I guess it cannot be weakrefed...

@robertwb
Copy link
Contributor

robertwb commented Jan 2, 2013

comment:11

The final call to the (generic) tp_free calls PyObject_GC_Untrack iff the GC flags are set in the type flags. If the base class is not GC tracked then its dealloc method won't touch these bits.

@jpflori
Copy link

jpflori commented Jan 2, 2013

comment:12

Thanks for pointing that out.

@robertwb
Copy link
Contributor

robertwb commented Jan 3, 2013

comment:13

Spkg up at http://sage.math.washington.edu/home/robertwb/patches/cython-0.17.4pre.spkg , if this looks good I'll cut a release and make an actual spkg based on that.

@nbruin
Copy link
Contributor Author

nbruin commented Jan 3, 2013

comment:14

trashcan issues now tracked on #13901 (yes, you can easily crash cython because it's not using the trashcan)

@nbruin
Copy link
Contributor Author

nbruin commented Jan 3, 2013

comment:15

Replying to @robertwb:

Spkg up at http://sage.math.washington.edu/home/robertwb/patches/cython-0.17.4pre.spkg , if this looks good I'll cut a release and make an actual spkg based on that.

This does look good to me. JP has already confirmed that this fixed the issue (as does your elegant test in the cython suite). Your pre.spkg has some different files in it, but I guess that's why you don't consider it an actual spkg.

@jpflori
Copy link

jpflori commented Jan 3, 2013

comment:16

Replying to @robertwb:

The final call to the (generic) tp_free calls PyObject_GC_Untrack iff the GC flags are set in the type flags. If the base class is not GC tracked then its dealloc method won't touch these bits.

Sorry to insist a little bit, but while looking at the trashcan stuff, I thought again about it and in fact what I was worried about was rather the converse.

If the base type does not have the GC_FLAG, and youve retracked it in the subclass, then final tp_free will indeed not touch anything related to gc, but won't that leave an invalid object in the gc tracked object list?
In particular won't a call to gc_list_remove(o) be missing?

@robertwb
Copy link
Contributor

robertwb commented Jan 3, 2013

comment:17

Replying to @jpflori:

Replying to @robertwb:

The final call to the (generic) tp_free calls PyObject_GC_Untrack iff the GC flags are set in the type flags. If the base class is not GC tracked then its dealloc method won't touch these bits.

Sorry to insist a little bit, but while looking at the trashcan stuff, I thought again about it and in fact what I was worried about was rather the converse.

If the base type does not have the GC_FLAG, and youve retracked it in the subclass, then final tp_free will indeed not touch anything related to gc, but won't that leave an invalid object in the gc tracked object list?
In particular won't a call to gc_list_remove(o) be missing?

The base tp_free looks at the actual type's flags (which will have GC_FLAG set) to determine what gc (un)tracking to do. Any intermediate superclasses will either leave this alone or do the untrack/track dance.

@robertwb
Copy link
Contributor

robertwb commented Jan 3, 2013

comment:18

Replying to @nbruin:

trashcan issues now tracked on #13901 (yes, you can easily crash cython because it's not using the trashcan)

Yeah, this is a separate (and more complicated to resolve) issue.

@nbruin
Copy link
Contributor Author

nbruin commented Jan 3, 2013

comment:19

Replying to @robertwb:

The base tp_free looks at the actual type's flags (which will have GC_FLAG set) to determine what gc (un)tracking to do. Any intermediate superclasses will either leave this alone or do the untrack/track dance.

... so suppose we have a superclass that doesn't do the untrack/track dance (so this must be a non-container superclass of a container class. We're entering rather hypothetical territory here). We'll be entering its dealloc with tracking SET. I guess the actual memory free happens by our class, so I guess the list of GC-tracked objects will be properly amended eventually. Can we prove that no GC or trashcan-shelving of this intermediate object will happen in between? I guess it's unlikely because non-container types should be easy to deallocate ... unless some callous person writes an extension class that does hold references to other objects but is convinced that those will never lead to cycles and hence makes it non-GC-tracked. Some weakref callbacks and a GC could then find a partially torn down object tracked by the GC. Multithreaded stuff could make this even worse, but I guess we're protected by the GIL here.

It should probably be mandated that any container type has to participate in GC. For a non-container type it's hard to see how a dealloc could ever be interrupted or interleaved by a GC. So this note is probably more a request for clarification (addition to documentation somewhere?) why this is not a problem than a diagnosis of a bug.

@robertwb
Copy link
Contributor

robertwb commented Jan 3, 2013

comment:20

I think it helps to look at the generated code. Suppose one has

cdef class A: ...
cdef class B(A): ...
cdef class C(B): ...
...

In this case one has, roughly,

tp_dealloc_A(self) {
   [optional untrack]
   bodyA
   [optional track]
   PY_TYPE(self)->tp_free(self)
}

tp_dealloc_B(self) {
   [optional untrack]
   bodyB
   [optional track]
   tp_dealloc_A(self)
}

tp_dealloc_C(self) {
   [optional untrack]
   bodyC
   [optional track]
   tp_dealloc_B(self)
}

...

bodyX consists of decrefing Python members, traversing weakrefs, and (if present)

PyRef(self)++;
X.__dealloc__(self);
PyRef(self)--;

The track/untrack markers are added exactly when Python/weakref members are present, which is where a garbage collection might happen. (When executing __dealloc__ the refcount is incremented, also preventing garbage collection.)

What could be an issue is a non-gc-tracked container class that is subclassed by a gc-tracked class, but we don't have those in Cython.

@jpflori
Copy link

jpflori commented Jan 3, 2013

comment:21

What could be an issue is a non-gc-tracked container class that is subclassed by a gc-tracked class, but we don't have those in Cython.

That is exactly what I was thinking about, and IIRC what is looked for in the CPython subtype_dealloc when looking for the base type.

If you say it cannot happy in Cython, I'm very happy with that!

@jpflori
Copy link

jpflori commented Jan 3, 2013

comment:22

Are you sure this is the case, e.g., for category_object and sage_object?
I see a TPFLAGS_HAVE_GC on the former but not on the latter.

@nbruin
Copy link
Contributor Author

nbruin commented Jan 3, 2013

Robert's cython test case (I spent quite some time twice to find it, so I'm storing it here for future reference)

@vbraun
Copy link
Member

vbraun commented Jan 3, 2013

comment:23

Attachment: double_dealloc_T796.pyx.gz

And Robert just released Cython 0.17.4, see https://groups.google.com/d/topic/cython-users/s3ycj83Yctw/discussion

@robertwb

This comment has been minimized.

@robertwb
Copy link
Contributor

robertwb commented Jan 3, 2013

@jdemeyer
Copy link

jdemeyer commented Jan 4, 2013

comment:25

Typo in the version number:

=== cython-0.17.3 (Robert Bradshaw, 3 January 2013) ===

should be

=== cython-0.17.4 (Robert Bradshaw, 3 January 2013) ===

@jdemeyer
Copy link

jdemeyer commented Jan 4, 2013

Author: Robert Bradshaw

@jdemeyer

This comment has been minimized.

@jdemeyer
Copy link

jdemeyer commented Jan 4, 2013

comment:26

Fixed SPKG.txt.

@jdemeyer
Copy link

jdemeyer commented Jan 4, 2013

Reviewer: Jeroen Demeyer

@jdemeyer
Copy link

jdemeyer commented Jan 4, 2013

Changed upstream from Reported upstream. Developers acknowledge bug. to Completely fixed; Fix reported upstream

@robertwb
Copy link
Contributor

robertwb commented Jan 4, 2013

comment:28

D'oh. Thanks.

@jdemeyer
Copy link

jdemeyer commented Jan 7, 2013

Merged: sage-5.6.beta3

@jdemeyer
Copy link

comment:30

I have not seen anymore segmentation faults regarding #715, so this might have fixed it.

@vbraun
Copy link
Member

vbraun commented Jan 10, 2013

comment:31

Yay! Congratulations to everybody and a special thanks to Simon for pushing the weak caches!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants