Category refinement sometimes changes the hash of parents #14471

vbraun · 2013-04-21T15:33:16Z

Some objects with dynamical classes use hash(self.__class__). Since this is not an invariant of dynamical classes bad things will happen.

One such instance is that, under some circumstances, a WeakValueDictionary remove callback is being called but cannot find the object in the dictionary since the hash changed.

Apply

CC: @simon-king-jena @nbruin @nthiery @hivert

Component: memleak

Author: Volker Braun

Reviewer: Simon King

Merged: sage-5.12.beta0

Issue created by migration from https://trac.sagemath.org/ticket/14471

The text was updated successfully, but these errors were encountered:

vbraun · 2013-04-21T15:34:18Z

Attachment: trac_14471_demonstrate.patch.gz

Initial patch

vbraun · 2013-04-21T15:35:43Z

Initial patch

vbraun · 2013-04-21T15:38:47Z

comment:2

Attachment: debug_ignored_exceptions.patch.gz

The attachment: debug_ignored_exceptions.patch is a backport from Python-3.3 to give better diagnosis of ignored exceptions. With it, I get

    Exception ignored in: <function remove at 0x2472938>
    Traceback (most recent call last):
      File "/home/vbraun/opt/sage-5.9.beta5/local/lib/python/weakref.py", line 51, in remove
        del self.data[wr.key]
    KeyError: (((<class 'sage.categories.algebras.Algebras'>, Callable function ring with arguments (x, y)), ()),)

so the KeyError really comes from the WeakValueDictionary remove callback.

vbraun · 2013-04-21T15:43:31Z

comment:3

Adding some debugging in remove() shows that the wr.key is still in self.dict.keys() but del self.dict[wr.key] fails with a KeyError. Presumably because the whole thing is running during the garbage collection cycle. I tried #14159 but it didn't change anything.

I haven't figured out exactly which cache in the Sage library is the culprit, maybe Simon knows?

nbruin · 2013-04-21T19:12:21Z

comment:4

WeakValueDict is a standard python library class, not one of our own brew. So once we can put the finger on the bug we should likely report upstream. See also #13394, although I doubt the issue pointed out there is what is at play here.

My first guess would be that the cache at play is a "cached_method" or a "cached_function" cache, probably the one from UniqueRepresentation, but I'm not entirely sure. The entries in:

super(MatrixSpace,sage.matrix.matrix_space.MatrixSpace).__classcall__.get_cache().keys()

aren't quite of the shape you're finding.

A quick analysis; This is WeakValueDictionary.__init__:

    def __init__(self, *args, **kw):
        def remove(wr, selfref=ref(self)):
            self = selfref()
            if self is not None:
                del self.data[wr.key]
        self._remove = remove
        UserDict.UserDict.__init__(self, *args, **kw)

As you can see, remove only stores a link to the dictionary via a weak reference, so that the dictionary can get deallocated before the callback (and the callback finds that the dict doesn't exist anymore and simply does nothing). So that's correct.

We also see that remove only removes the key. It doesn't check whether the value is still what it's supposed to be and it also doesn't check if the key is still there in the first place. That would be safe if there's a guarantee that as soon as a value gets removed from the dictionary, the KeyedRef gets removed as well (in which case the callback dies without getting executed).

The bug we're seeing indicates that we end up with:

weakvaluedict[key]=value
key:KeyedRef(value) pair gets removed from dict - this should more or less atomically destroy the KeyedRef(value).
somehow, the callback from the KeyedRef(value) stored gets called before it gets destroyed.

Since the KeyedRef(value) should really only be referenced from the weakvaluedict, it's a bit hard to let it survive: As soon it gets decreffed it should get deallocated.

My guess is that this happens IN cyclic garbage removal, where callbacks do get separated from deletions. Python tries hard to avoid calling such callbacks unnecessarily or in situations where it's unsafe to do so, but perhaps cleaning up extremely complicated cyclic structures (which we do create) could fool the detection.

If this is right, then these keyerrors should be relatively harmless: They would be happening on a dictionary that is slated for demolition anyway. That would suggest we could just catch and ignore the keyerror in the WeakValueDictionary removal function. It would be good to identify the scenario that justifies doing so, however.

vbraun · 2013-04-21T19:46:57Z

comment:5

I agree and verified with gc.get_referrers that it is sage.structure.unique_representation.CachedRepresentation.__classcall__.cache

If I change the remove() callback to

                import sys
                print >> sys.stderr, 'removing', wr.key in self.data.keys()
                del self.data[wr.key]
                print >> sys.stderr, 'finished removing'

then I get

    ...
    removing True
    finished removing
    removing True
    Exception KeyError: (((<class 'sage.categories.algebras.Algebras'>, Callable function ring with arguments (x, y)), ()),) in <function remove at 0x1acf758> ignored
    removing True
    finished removing
    removing True
    finished removing
    ...

So the weakvaluedict (self.data) is in a bad state...

nbruin · 2013-04-21T20:50:07Z

comment:6

Replying to @vbraun:

So the weakvaluedict (self.data) is in a bad state...

self.data is a normal dict. Your example doesn't quite show the dict is in a "bad state". It shows that between testing whether the key is in the dict and doing the delete, the key has been removed. That's not quite the same as self.data beng corrupted.

Something is going wrong of course, though. Something seems to be mutating self.data in between. The only methods that seem to be mutating the underlying self.data for a WeakValueDictionary are __setitem__, popitem, pop, setdefault, update (and of course __delitem__ and clear it inherits from UserDict). Perhaps I don't have enough imagination, but indeed the only scenario in which I could see one of those happening during a callback is when a GC gets triggered and basically the whole dict gets cleaned up.

Some of the "iterator" methods on WeakValueDict don't quite seem safe to me, by the way. The iteritems method, for instance, iterates over self.data.itervalues(), but self.data can be mutated outside of your control due to GC callbacks, and iterators over a mutating dict are generally not safe.
I doubt that's what's hurting us here, though.

vbraun · 2013-04-22T08:58:22Z

comment:7

Replying to @nbruin:

Replying to @vbraun:

So the weakvaluedict (self.data) is in a bad state...

self.data is a normal dict. Your example doesn't quite show the dict is in a "bad state".

Fair enough, but even after I catch the KeyError the wr.key in self.data.keys() is still True. So the self.data dict is in a weird place for sure.

I agree that this smells like something else is mutating self.data in-between. It is possible that deallocation somewhere triggers a call to CachedRepresentation, but I don't see where that would be the case.

vbraun · 2013-04-22T09:14:20Z

gdb/cython backtrace

vbraun · 2013-04-22T09:27:39Z

Attachment: sage_crash_NCNONO.log

Attachment: sage_crash_IfVSEK.log

gdb/cython backtrace in a debug build

vbraun · 2013-04-22T09:53:29Z

comment:8

I've added an unguarded_abort to force a gdb backtrace but it looks as it should to me: A object is being deallocated and triggers the WeakValueDictionary remove callback. The only dict that is being cleaned at that time are the globals of the doctest runner.

nbruin · 2013-04-22T16:20:33Z

comment:9

Replying to @vbraun:

Fair enough, but even after I catch the KeyError the wr.key in self.data.keys() is still True. So the self.data dict is in a weird place for sure.

That is weird, and sounds to me like self.data is indeed corrupted. Given that key/value pairs are stored together in a dictionary, it's very hard for this to be true too. Caching the result from data.keys wouldn't make sense at all, so I don't think python will be doing that. Perhaps check that something like wr.key in self.data or self.data.__contains__(wr.key) returns true and/or if value=self.data[wr.key] succeeds (and what kind of value you get back: whether it's wr)

That might give you some indication of what exactly is wrong with the entry and hence where the problem might be coming from.

One thing your traceback is confirming:

There is no GC going on, so callbacks are happening when you think they should.
Hence there's also not a circular data structure directly involved
Things are happening during cleanup, so these callbacks are probably indeed happening on a dict that will be deleted too. It's still bad they don't properly execute, of course, but we've seen before that order in cleanup can sometimes lead to subtle problems (i.e., whether things are stored in a cdef attribute or in a conventional dict).

One possible scenario:

self.data is in the process of being cleared. The dict is actually emptied then before any entries are decreffed (to prevent dict reshaping during deallocation).
One of the weakreffed values also occurs as a key in the dict (and that's the only strong ref)
decreffing that key causes the callback, but the callback finds a dict that is now empty, so of course we get a key error.

Your diagnosis that wr.key in self.data is still true is not consistent with that, but the scenario indeed leads to ignored keyerrors:

import weakref

class A(object):
  def __init__(self,n):
    self.n=n
  def __repr__(self):
    return "A(%d)"%self.n

def mess(n):
  D=weakref.WeakValueDictionary()
  L=[A(i) for i in range(n)]
  for i in range(n-1):
      j=(i+10)%n
      D[L[i]]=L[j]
  return D

D=mess(10000)
D.clear()

(how many you ignored keyerrors you get depends a bit on memory layout etc.)

This is definitely a bug in the python library. The problem is that remove does check if the dict still exists (via the weakref, which is mainly to ensure that presence of callbacks doesn't extend the life of the dict), but doesn't check if the dict is otherwise in a funny state, such as "in the process of a clear".

This is now Python issue 17816. Python3 does not seem to have this problem and it may not be the problem we're running into here.

vbraun · 2013-04-22T17:00:47Z

comment:10

I've looked at the Python 3 code and there weakref keeps a separate list of keys to remove to work around this issue.

Any further non-Sage related discussion should probably move to the Python bugtracker.

As for Sage, should we patch Python for this? I hit this bug in #14469 and its likely that we'll trip over it again.

nbruin · 2013-04-22T18:39:33Z

comment:11

Replying to @vbraun:

As for Sage, should we patch Python for this? I hit this bug in #14469 and its likely that we'll trip over it again.

I would suggest a paranoid remove function for this:

        def remove(wr, selfref=ref(self)):
            self = selfref()
            if self is not None and self.data.get(wr.key) is wr:
                del self.data[wr.key]

The get method avoids exception handling and hopefully avoids allocation as much as possible (thus avoiding triggering GC) and will return None if the reference is not present. Furthermore, we verify that the correct value is still stored in the dictionary, in case a callback gets delayed so much that another value makes it into that dict under the same key (that shouldn't really happen, so that's the paranoid bit).

I tried. Making the test

            if self is not None and wr.key in self.data:

also makes the error go away (but is less paranoid than the option above), but indeed, making the test

            if self is not None and wr.key in self.data.keys():

does not!! So there is something wrong with self.data.keys(). Indeed, putting

        def remove(wr, selfref=ref(self)):
            self = selfref()
            if self is not None:
                if wr.key in self.data.keys() and wr.key not in self.data:
                    raise RuntimeError("BAD KEY:%s"%(wr.key,))
                del self.data[wr.key]

in there makes for "ignored runtime error". On the other hand, if I run the example above with the BAD KEY test, I do not get these problems. Therefore, the discrepancy between in data.keys() and in data does not occur in the example above.

So:

there is a fix that apparently solves both problems
there is something different in the problem originally reported here, since keys() seems to report an erroneous list.

One difference between data.keys() and __contains__ is that dictobject.c/dict_keys just iterates over the table storing the dict and extracts the keys of all dictentries that have a non-NULL value field, whereas __contains__ will go through a dictlookup.
Given that some of the key components may well be in the process of being torn down, could it be that their hash is broken? One would still be able to find the corresponding key by iterating through all of them (as wr.key in data.keys() does), but finding it by hash would not work. Both __contains__ and __delitem__ would suffer from that.

This could happen if the MRO of some class changed to insert/remove WithEqualityById in it. Changing the hash on a key component for which a callback happens would have exactly this effect. So my bet is that it's an artifact of the mro voodoo the dynamical category stuff does or it's an unfortunate clearing/changing of a cached hash value somewhere due to a deletion-in-progress.

nbruin · 2013-04-23T01:27:06Z

comment:12

Replying to @vbraun:

    Exception KeyError: (((<class 'sage.categories.algebras.Algebras'>, Callable function ring with arguments (x, y)), ()),) in <function remove at 0x1acf758> ignored

This entry is produced by the __call__ doctest in matrix0.pyx:

            sage: f(x,y) = x^2+y
            sage: m = matrix([[f,f*f],[f^3,f^4]]); m
            [    (x, y) |--> x^2 + y (x, y) |--> (x^2 + y)^2]
            [(x, y) |--> (x^2 + y)^3 (x, y) |--> (x^2 + y)^4]
            sage: m(1,2)
            [ 3  9]
            [27 81]
            sage: m(y=2,x=1)
            [ 3  9]
            [27 81]
            sage: m(2,1)
            [  5  25]
            [125 625]
+           sage: D=UniqueRepresentation.__classcall__.get_cache()
+           sage: [k for k in D.data.keys() if k not in D.data]
+           []

With the added doctest, we get different behaviour depending on whether
attachment: trac_14471_demonstrate.patch is applied or not: Without the patch, the test passes; with the patch the test fails. That means that the line

hasattr(obj, '_foobar_')

changes the hash of one of the key components involved; probably the "callable function ring", i.e., parent(f). Indeed:

sage: f(x,y) = x^2+y
sage: H=hash(parent(f))
sage: m = matrix([[f,f*f],[f^3,f^4]]);
sage: hasattr(m,'a')
False
sage: H == hash(parent(f))
False

that last one's really bad! The hasattr on that matrix apparently clobbers the hash of its base ring!

nbruin · 2013-04-23T07:36:39Z

comment:13

I think this illustrates the problem rather well:

sage: f(x,y) = x^2+y
sage: P=parent(f)
sage: P.__class__
sage.symbolic.callable.CallableSymbolicExpressionRing_class
sage: H=hash(P.__class__)
sage: m = matrix([[f,f*f],[f^3,f^4]])
sage: H == hash(P.__class__)
True
sage: hasattr(m,'a')
False
sage: P.__class__
sage.symbolic.callable.CallableSymbolicExpressionRing_class_with_category
sage: H == hash(P.__class__)
False

We have that P.__hash__ looks like:

    def __hash__(self):
        return hash((self.__class__, self._arguments))

Obviously, when class is not fixed, it's a bad ingredient for a hash. Given that

sage: T=type(P)
sage: [t for t in T.mro() if '__hash__' in t.__dict__]
[sage.symbolic.callable.CallableSymbolicExpressionRing_class,
 sage.symbolic.ring.SymbolicRing,
 sage.structure.category_object.CategoryObject,
 sage.structure.sage_object.SageObject,
 object]

we do need to do something about hashing, since sage.symbolic.ring.SymbolicRing.__hash__ is just a fixed value and these CallableExpression rings do come in different flavours. The __cmp__ method there also depends on self.__class__ which should obviously not be relied upon.

Given that this code stems from #5930, it predates these dynamic classes voodoo by quite a bit, so the fact that it doesn't operate well with it is a bug in the dynamic classes stuff. A job for Simon, Nicholas and the gang!

Furthermore, note that type(parent(m)) inherits from UniqueRepresentation but that type(parent(f)) does not. It doesn't seem easy to do, but it might be possible to make two equal-but-not-identical callable expression rings. Building matrix rings over them will wreak havoc, because the constructor will use equality testing on the ingredients (and hence return identical matrix rings), but the coercion framework will not honour "equality" between the non-identical base rings.

EDIT: The last bit is not the case. CallableSymbolicExpressionRing is a UniqueFactory. Furthermore, it seems the coercion framework does detect "equal but nonidentical" parent and can sometimes figure out coercions between them.

nbruin · 2013-04-23T15:38:31Z

comment:15

I did a very quick search for lines in source containing both hash and __class__. Lines 73 and 75 in sage/combinat/species/recursive_species.py are also guilty, so whoever is going to fix this might want to look at those lines too.

vbraun · 2013-05-02T18:01:08Z

comment:16

Isn't this super-dangerous, changing the hash can potentially give you wrong results.

Also, I thought it is already wrong to make the bare class (not _with_category) accessible through parent()? Of course that ought to be spelled out in the documentation.

nbruin · 2013-05-03T00:09:44Z

comment:17

Replying to @vbraun:

Isn't this super-dangerous, changing the hash can potentially give you wrong results.

Experience shows you don't have to change hashes for that, but yes, feel free to up the severity of the ticket if you feel strongly about this.

Also, I thought it is already wrong to make the bare class (not _with_category) accessible through parent()? Of course that ought to be spelled out in the documentation.

I think the system just gets tricked into that here and I think it's unavoidable: If I remember correctly, the category initialization gets delayed for efficiency reasons. Fixing that would probably cause unacceptable loss of performance in other places.

vbraun · 2013-05-03T09:22:12Z

comment:18

Replying to @nbruin:

I think the system just gets tricked into that here and I think it's unavoidable: If I remember correctly, the category initialization gets delayed for efficiency reasons. Fixing that would probably cause unacceptable loss of performance in other places.

Is that actually true? This is about constructing a parent, after all. Though maybe somebody who actually wrote the category code could chime in and tell us what he/she envisioned and where it is documented (j/k).

simon-king-jena · 2013-05-03T09:28:57Z

comment:19

See sage.matrix.matrix_space.MatrixSpace.full_category_initialisation for documentation. See #11900 for timings.

The difference of MatrixSpace with respect to other parents is that some code (namely on elliptic curves) creates many different matrix spaces, but uses them only as a container, but not as a fully grown parent.

simon-king-jena · 2013-05-03T09:41:36Z

comment:20

PS: A further difference is that finding the correct category is a bit more involved than in other examples. It may be an algebra or a module, depending on whether we talk about general or square matrices. And concerning module, this involves checking whether the base ring is a field, which also tends to be slow. And all this effort would be in vain, because the category framework isn't used for matrix spaces in the elliptic curve code.

vbraun · 2013-05-03T10:01:19Z

comment:21

But the problem is CallableSymbolicExpressionRing_class, not MatrixSpace? The matrices only trigger the problem by initializing the CallableSymbolicExpressionRing_class_with_category. Presumably matrices are smart enough to not use __class__ for hashes and comparison if they delay category initialization.

simon-king-jena · 2013-05-03T12:12:16Z

comment:22

Replying to @vbraun:

But the problem is CallableSymbolicExpressionRing_class, not MatrixSpace? The matrices only trigger the problem by initializing the CallableSymbolicExpressionRing_class_with_category.

As much as I know, I never touched CallableSymbolicExpressionRing.

Presumably matrices are smart enough to not use __class__ for hashes and comparison if they delay category initialization.

I hope so.

simon-king-jena · 2013-06-27T12:51:15Z

Attachment: trac_14471-review.patch.gz

simon-king-jena · 2013-06-27T12:52:42Z

comment:33

I think it would be a good idea to show that the hash actually did change. I extended the new test accordingly, in a review patch.

Positive review, then!

simon-king-jena · 2013-06-27T12:53:53Z

comment:34

No, I spoke to soon. Now, as the debug flag is set, it makes sense to run the full doc tests. After all, it could be that the debug flag uncovers a bug.

vbraun · 2013-06-27T13:17:20Z

comment:35

Doctests pass on my machine, of course

simon-king-jena · 2013-06-27T14:59:29Z

comment:36

Replying to @vbraun:

Doctests pass on my machine, of course

Confirmed!

jdemeyer · 2013-07-25T06:17:14Z

comment:37

The PDF documentation doesn't build due to the use of single instead of double backticks:

! Missing { inserted.
<to be read again> 
                   _
l.14636 when the constructor calls $Parent.__
                                             init__()$ directly or
?
! Emergency stop.
<to be read again> 
                   _
l.14636 when the constructor calls $Parent.__
                                             init__()$ directly or
!  ==> Fatal error occurred, no output PDF file produced!
Transcript written on geometry.log.
 [180[combinat ] /mazur/release/merger/sage-5.11.rc0/devel/sage/doc/en/reference/combinat/sage/combinat/crystals/alcove_path.rst:: WARNING: unusable reference target fo
und: ../../../../../../../html/en/reference/structure/sage/structure/unique_representation.html#sage.structure.unique_representation.UniqueRepresentation
[combinat ] /mazur/release/merger/sage-5.11.rc0/devel/sage/doc/en/reference/combinat/sage/combinat/crystals/alcove_path.rst:: WARNING: unusable reference target found:
../../../../../../../html/en/reference/structure/sage/structure/parent.html#sage.structure.parent.Parent
make[1]: *** [geometry.pdf] Error 1

vbraun · 2013-07-25T14:44:15Z

Attachment: trac_14471-pdf-fix.patch.gz

Initial patch

vbraun · 2013-07-25T14:47:44Z

comment:38

Fixed.

jdemeyer · 2013-08-02T14:13:55Z

Merged: sage-5.12.beta0

vbraun added this to the sage-5.11 milestone Apr 21, 2013

vbraun added t: bug labels Apr 21, 2013

vbraun assigned rlmill Apr 21, 2013

This comment has been minimized.

Sign in to view

simon-king-jena added the s: needs review label Jun 27, 2013

This comment has been minimized.

Sign in to view

simon-king-jena added s: positive review and removed s: needs review labels Jun 27, 2013

simon-king-jena added s: needs info and removed s: positive review labels Jun 27, 2013

vbraun added s: needs review and removed s: needs info labels Jun 27, 2013

simon-king-jena added s: positive review and removed s: needs review labels Jun 27, 2013

jdemeyer modified the milestones: sage-5.11, sage-5.12 Jul 25, 2013

jdemeyer added s: needs work and removed s: positive review labels Jul 25, 2013

This comment has been minimized.

Sign in to view

vbraun added s: positive review and removed s: needs work labels Jul 25, 2013

jdemeyer removed the s: positive review label Aug 2, 2013

jdemeyer closed this as completed Aug 2, 2013

nthiery mentioned this issue May 22, 2014

Axioms and more functorial constructions #10963

Closed

vbraun mentioned this issue Sep 6, 2013

Fix _repr_ of graphics objects #14469

Closed

Category refinement sometimes changes the hash of parents #14471

Category refinement sometimes changes the hash of parents #14471

Comments

vbraun commented Apr 21, 2013

vbraun commented Apr 21, 2013

This comment has been minimized.

vbraun commented Apr 21, 2013

vbraun commented Apr 21, 2013

vbraun commented Apr 21, 2013

nbruin commented Apr 21, 2013

vbraun commented Apr 21, 2013

nbruin commented Apr 21, 2013

vbraun commented Apr 22, 2013

vbraun commented Apr 22, 2013

vbraun commented Apr 22, 2013

vbraun commented Apr 22, 2013

nbruin commented Apr 22, 2013

vbraun commented Apr 22, 2013

nbruin commented Apr 22, 2013

nbruin commented Apr 23, 2013

nbruin commented Apr 23, 2013

nbruin commented Apr 23, 2013

vbraun commented May 2, 2013

nbruin commented May 3, 2013

vbraun commented May 3, 2013

simon-king-jena commented May 3, 2013

simon-king-jena commented May 3, 2013

vbraun commented May 3, 2013

simon-king-jena commented May 3, 2013

simon-king-jena commented Jun 27, 2013

simon-king-jena commented Jun 27, 2013

This comment has been minimized.

simon-king-jena commented Jun 27, 2013

vbraun commented Jun 27, 2013

simon-king-jena commented Jun 27, 2013

jdemeyer commented Jul 25, 2013

vbraun commented Jul 25, 2013

This comment has been minimized.

vbraun commented Jul 25, 2013

jdemeyer commented Aug 2, 2013