Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Category refinement sometimes changes the hash of parents #14471

Closed
vbraun opened this issue Apr 21, 2013 · 49 comments
Closed

Category refinement sometimes changes the hash of parents #14471

vbraun opened this issue Apr 21, 2013 · 49 comments

Comments

@vbraun
Copy link
Member

vbraun commented Apr 21, 2013

Some objects with dynamical classes use hash(self.__class__). Since this is not an invariant of dynamical classes bad things will happen.

One such instance is that, under some circumstances, a WeakValueDictionary remove callback is being called but cannot find the object in the dictionary since the hash changed.

Apply

CC: @simon-king-jena @nbruin @nthiery @hivert

Component: memleak

Author: Volker Braun

Reviewer: Simon King

Merged: sage-5.12.beta0

Issue created by migration from https://trac.sagemath.org/ticket/14471

@vbraun vbraun added this to the sage-5.11 milestone Apr 21, 2013
@vbraun
Copy link
Member Author

vbraun commented Apr 21, 2013

Attachment: trac_14471_demonstrate.patch.gz

Initial patch

@vbraun

This comment has been minimized.

@vbraun
Copy link
Member Author

vbraun commented Apr 21, 2013

Initial patch

@vbraun
Copy link
Member Author

vbraun commented Apr 21, 2013

comment:2

Attachment: debug_ignored_exceptions.patch.gz

The attachment: debug_ignored_exceptions.patch is a backport from Python-3.3 to give better diagnosis of ignored exceptions. With it, I get

    Exception ignored in: <function remove at 0x2472938>
    Traceback (most recent call last):
      File "/home/vbraun/opt/sage-5.9.beta5/local/lib/python/weakref.py", line 51, in remove
        del self.data[wr.key]
    KeyError: (((<class 'sage.categories.algebras.Algebras'>, Callable function ring with arguments (x, y)), ()),)

so the KeyError really comes from the WeakValueDictionary remove callback.

@vbraun
Copy link
Member Author

vbraun commented Apr 21, 2013

comment:3

Adding some debugging in remove() shows that the wr.key is still in self.dict.keys() but del self.dict[wr.key] fails with a KeyError. Presumably because the whole thing is running during the garbage collection cycle. I tried #14159 but it didn't change anything.

I haven't figured out exactly which cache in the Sage library is the culprit, maybe Simon knows?

@nbruin
Copy link
Contributor

nbruin commented Apr 21, 2013

comment:4

WeakValueDict is a standard python library class, not one of our own brew. So once we can put the finger on the bug we should likely report upstream. See also #13394, although I doubt the issue pointed out there is what is at play here.

My first guess would be that the cache at play is a "cached_method" or a "cached_function" cache, probably the one from UniqueRepresentation, but I'm not entirely sure. The entries in:

super(MatrixSpace,sage.matrix.matrix_space.MatrixSpace).__classcall__.get_cache().keys()

aren't quite of the shape you're finding.

A quick analysis; This is WeakValueDictionary.__init__:

    def __init__(self, *args, **kw):
        def remove(wr, selfref=ref(self)):
            self = selfref()
            if self is not None:
                del self.data[wr.key]
        self._remove = remove
        UserDict.UserDict.__init__(self, *args, **kw)

As you can see, remove only stores a link to the dictionary via a weak reference, so that the dictionary can get deallocated before the callback (and the callback finds that the dict doesn't exist anymore and simply does nothing). So that's correct.

We also see that remove only removes the key. It doesn't check whether the value is still what it's supposed to be and it also doesn't check if the key is still there in the first place. That would be safe if there's a guarantee that as soon as a value gets removed from the dictionary, the KeyedRef gets removed as well (in which case the callback dies without getting executed).

The bug we're seeing indicates that we end up with:

  • weakvaluedict[key]=value
  • key:KeyedRef(value) pair gets removed from dict - this should more or less atomically destroy the KeyedRef(value).
  • somehow, the callback from the KeyedRef(value) stored gets called before it gets destroyed.

Since the KeyedRef(value) should really only be referenced from the weakvaluedict, it's a bit hard to let it survive: As soon it gets decreffed it should get deallocated.

My guess is that this happens IN cyclic garbage removal, where callbacks do get separated from deletions. Python tries hard to avoid calling such callbacks unnecessarily or in situations where it's unsafe to do so, but perhaps cleaning up extremely complicated cyclic structures (which we do create) could fool the detection.

If this is right, then these keyerrors should be relatively harmless: They would be happening on a dictionary that is slated for demolition anyway. That would suggest we could just catch and ignore the keyerror in the WeakValueDictionary removal function. It would be good to identify the scenario that justifies doing so, however.

@vbraun
Copy link
Member Author

vbraun commented Apr 21, 2013

comment:5

I agree and verified with gc.get_referrers that it is sage.structure.unique_representation.CachedRepresentation.__classcall__.cache

If I change the remove() callback to

                import sys
                print >> sys.stderr, 'removing', wr.key in self.data.keys()
                del self.data[wr.key]
                print >> sys.stderr, 'finished removing'

then I get

    ...
    removing True
    finished removing
    removing True
    Exception KeyError: (((<class 'sage.categories.algebras.Algebras'>, Callable function ring with arguments (x, y)), ()),) in <function remove at 0x1acf758> ignored
    removing True
    finished removing
    removing True
    finished removing
    ...

So the weakvaluedict (self.data) is in a bad state...

@nbruin
Copy link
Contributor

nbruin commented Apr 21, 2013

comment:6

Replying to @vbraun:

So the weakvaluedict (self.data) is in a bad state...

self.data is a normal dict. Your example doesn't quite show the dict is in a "bad state". It shows that between testing whether the key is in the dict and doing the delete, the key has been removed. That's not quite the same as self.data beng corrupted.

Something is going wrong of course, though. Something seems to be mutating self.data in between. The only methods that seem to be mutating the underlying self.data for a WeakValueDictionary are __setitem__, popitem, pop, setdefault, update (and of course __delitem__ and clear it inherits from UserDict). Perhaps I don't have enough imagination, but indeed the only scenario in which I could see one of those happening during a callback is when a GC gets triggered and basically the whole dict gets cleaned up.

Some of the "iterator" methods on WeakValueDict don't quite seem safe to me, by the way. The iteritems method, for instance, iterates over self.data.itervalues(), but self.data can be mutated outside of your control due to GC callbacks, and iterators over a mutating dict are generally not safe.
I doubt that's what's hurting us here, though.

@vbraun
Copy link
Member Author

vbraun commented Apr 22, 2013

comment:7

Replying to @nbruin:

Replying to @vbraun:

So the weakvaluedict (self.data) is in a bad state...

self.data is a normal dict. Your example doesn't quite show the dict is in a "bad state".

Fair enough, but even after I catch the KeyError the wr.key in self.data.keys() is still True. So the self.data dict is in a weird place for sure.

I agree that this smells like something else is mutating self.data in-between. It is possible that deallocation somewhere triggers a call to CachedRepresentation, but I don't see where that would be the case.

@vbraun
Copy link
Member Author

vbraun commented Apr 22, 2013

gdb/cython backtrace

@vbraun
Copy link
Member Author

vbraun commented Apr 22, 2013

Attachment: sage_crash_NCNONO.log

Attachment: sage_crash_IfVSEK.log

gdb/cython backtrace in a debug build

@vbraun
Copy link
Member Author

vbraun commented Apr 22, 2013

comment:8

I've added an unguarded_abort to force a gdb backtrace but it looks as it should to me: A object is being deallocated and triggers the WeakValueDictionary remove callback. The only dict that is being cleaned at that time are the globals of the doctest runner.

@nbruin
Copy link
Contributor

nbruin commented Apr 22, 2013

comment:9

Replying to @vbraun:

Fair enough, but even after I catch the KeyError the wr.key in self.data.keys() is still True. So the self.data dict is in a weird place for sure.

That is weird, and sounds to me like self.data is indeed corrupted. Given that key/value pairs are stored together in a dictionary, it's very hard for this to be true too. Caching the result from data.keys wouldn't make sense at all, so I don't think python will be doing that. Perhaps check that something like wr.key in self.data or self.data.__contains__(wr.key) returns true and/or if value=self.data[wr.key] succeeds (and what kind of value you get back: whether it's wr)

That might give you some indication of what exactly is wrong with the entry and hence where the problem might be coming from.

One thing your traceback is confirming:

  • There is no GC going on, so callbacks are happening when you think they should.
  • Hence there's also not a circular data structure directly involved
  • Things are happening during cleanup, so these callbacks are probably indeed happening on a dict that will be deleted too. It's still bad they don't properly execute, of course, but we've seen before that order in cleanup can sometimes lead to subtle problems (i.e., whether things are stored in a cdef attribute or in a conventional dict).

One possible scenario:

  • self.data is in the process of being cleared. The dict is actually emptied then before any entries are decreffed (to prevent dict reshaping during deallocation).
  • One of the weakreffed values also occurs as a key in the dict (and that's the only strong ref)
  • decreffing that key causes the callback, but the callback finds a dict that is now empty, so of course we get a key error.

Your diagnosis that wr.key in self.data is still true is not consistent with that, but the scenario indeed leads to ignored keyerrors:

import weakref

class A(object):
  def __init__(self,n):
    self.n=n
  def __repr__(self):
    return "A(%d)"%self.n

def mess(n):
  D=weakref.WeakValueDictionary()
  L=[A(i) for i in range(n)]
  for i in range(n-1):
      j=(i+10)%n
      D[L[i]]=L[j]
  return D

D=mess(10000)
D.clear()

(how many you ignored keyerrors you get depends a bit on memory layout etc.)

This is definitely a bug in the python library. The problem is that remove does check if the dict still exists (via the weakref, which is mainly to ensure that presence of callbacks doesn't extend the life of the dict), but doesn't check if the dict is otherwise in a funny state, such as "in the process of a clear".

This is now Python issue 17816. Python3 does not seem to have this problem and it may not be the problem we're running into here.

@vbraun
Copy link
Member Author

vbraun commented Apr 22, 2013

comment:10

I've looked at the Python 3 code and there weakref keeps a separate list of keys to remove to work around this issue.

Any further non-Sage related discussion should probably move to the Python bugtracker.

As for Sage, should we patch Python for this? I hit this bug in #14469 and its likely that we'll trip over it again.

@nbruin
Copy link
Contributor

nbruin commented Apr 22, 2013

comment:11

Replying to @vbraun:

As for Sage, should we patch Python for this? I hit this bug in #14469 and its likely that we'll trip over it again.

I would suggest a paranoid remove function for this:

        def remove(wr, selfref=ref(self)):
            self = selfref()
            if self is not None and self.data.get(wr.key) is wr:
                del self.data[wr.key]

The get method avoids exception handling and hopefully avoids allocation as much as possible (thus avoiding triggering GC) and will return None if the reference is not present. Furthermore, we verify that the correct value is still stored in the dictionary, in case a callback gets delayed so much that another value makes it into that dict under the same key (that shouldn't really happen, so that's the paranoid bit).

I tried. Making the test

            if self is not None and wr.key in self.data:

also makes the error go away (but is less paranoid than the option above), but indeed, making the test

            if self is not None and wr.key in self.data.keys():

does not!! So there is something wrong with self.data.keys(). Indeed, putting

        def remove(wr, selfref=ref(self)):
            self = selfref()
            if self is not None:
                if wr.key in self.data.keys() and wr.key not in self.data:
                    raise RuntimeError("BAD KEY:%s"%(wr.key,))
                del self.data[wr.key]

in there makes for "ignored runtime error". On the other hand, if I run the example above with the BAD KEY test, I do not get these problems. Therefore, the discrepancy between in data.keys() and in data does not occur in the example above.

So:

  • there is a fix that apparently solves both problems
  • there is something different in the problem originally reported here, since keys() seems to report an erroneous list.

One difference between data.keys() and __contains__ is that dictobject.c/dict_keys just iterates over the table storing the dict and extracts the keys of all dictentries that have a non-NULL value field, whereas __contains__ will go through a dictlookup.
Given that some of the key components may well be in the process of being torn down, could it be that their hash is broken? One would still be able to find the corresponding key by iterating through all of them (as wr.key in data.keys() does), but finding it by hash would not work. Both __contains__ and __delitem__ would suffer from that.

This could happen if the MRO of some class changed to insert/remove WithEqualityById in it. Changing the hash on a key component for which a callback happens would have exactly this effect. So my bet is that it's an artifact of the mro voodoo the dynamical category stuff does or it's an unfortunate clearing/changing of a cached hash value somewhere due to a deletion-in-progress.

@nbruin
Copy link
Contributor

nbruin commented Apr 23, 2013

comment:12

Replying to @vbraun:

    Exception KeyError: (((<class 'sage.categories.algebras.Algebras'>, Callable function ring with arguments (x, y)), ()),) in <function remove at 0x1acf758> ignored

This entry is produced by the __call__ doctest in matrix0.pyx:

            sage: f(x,y) = x^2+y
            sage: m = matrix([[f,f*f],[f^3,f^4]]); m
            [    (x, y) |--> x^2 + y (x, y) |--> (x^2 + y)^2]
            [(x, y) |--> (x^2 + y)^3 (x, y) |--> (x^2 + y)^4]
            sage: m(1,2)
            [ 3  9]
            [27 81]
            sage: m(y=2,x=1)
            [ 3  9]
            [27 81]
            sage: m(2,1)
            [  5  25]
            [125 625]
+           sage: D=UniqueRepresentation.__classcall__.get_cache()
+           sage: [k for k in D.data.keys() if k not in D.data]
+           []

With the added doctest, we get different behaviour depending on whether
attachment: trac_14471_demonstrate.patch is applied or not: Without the patch, the test passes; with the patch the test fails. That means that the line

hasattr(obj, '_foobar_')

changes the hash of one of the key components involved; probably the "callable function ring", i.e., parent(f). Indeed:

sage: f(x,y) = x^2+y
sage: H=hash(parent(f))
sage: m = matrix([[f,f*f],[f^3,f^4]]);
sage: hasattr(m,'a')
False
sage: H == hash(parent(f))
False

that last one's really bad! The hasattr on that matrix apparently clobbers the hash of its base ring!

@nbruin
Copy link
Contributor

nbruin commented Apr 23, 2013

comment:13

I think this illustrates the problem rather well:

sage: f(x,y) = x^2+y
sage: P=parent(f)
sage: P.__class__
sage.symbolic.callable.CallableSymbolicExpressionRing_class
sage: H=hash(P.__class__)
sage: m = matrix([[f,f*f],[f^3,f^4]])
sage: H == hash(P.__class__)
True
sage: hasattr(m,'a')
False
sage: P.__class__
sage.symbolic.callable.CallableSymbolicExpressionRing_class_with_category
sage: H == hash(P.__class__)
False

We have that P.__hash__ looks like:

    def __hash__(self):
        return hash((self.__class__, self._arguments))

Obviously, when class is not fixed, it's a bad ingredient for a hash. Given that

sage: T=type(P)
sage: [t for t in T.mro() if '__hash__' in t.__dict__]
[sage.symbolic.callable.CallableSymbolicExpressionRing_class,
 sage.symbolic.ring.SymbolicRing,
 sage.structure.category_object.CategoryObject,
 sage.structure.sage_object.SageObject,
 object]

we do need to do something about hashing, since sage.symbolic.ring.SymbolicRing.__hash__ is just a fixed value and these CallableExpression rings do come in different flavours. The __cmp__ method there also depends on self.__class__ which should obviously not be relied upon.

Given that this code stems from #5930, it predates these dynamic classes voodoo by quite a bit, so the fact that it doesn't operate well with it is a bug in the dynamic classes stuff. A job for Simon, Nicholas and the gang!

Furthermore, note that type(parent(m)) inherits from UniqueRepresentation but that type(parent(f)) does not. It doesn't seem easy to do, but it might be possible to make two equal-but-not-identical callable expression rings. Building matrix rings over them will wreak havoc, because the constructor will use equality testing on the ingredients (and hence return identical matrix rings), but the coercion framework will not honour "equality" between the non-identical base rings.

EDIT: The last bit is not the case. CallableSymbolicExpressionRing is a UniqueFactory. Furthermore, it seems the coercion framework does detect "equal but nonidentical" parent and can sometimes figure out coercions between them.

@nbruin
Copy link
Contributor

nbruin commented Apr 23, 2013

comment:15

I did a very quick search for lines in source containing both hash and __class__. Lines 73 and 75 in sage/combinat/species/recursive_species.py are also guilty, so whoever is going to fix this might want to look at those lines too.

@vbraun
Copy link
Member Author

vbraun commented May 2, 2013

comment:16

Isn't this super-dangerous, changing the hash can potentially give you wrong results.

Also, I thought it is already wrong to make the bare class (not _with_category) accessible through parent()? Of course that ought to be spelled out in the documentation.

@nbruin
Copy link
Contributor

nbruin commented May 3, 2013

comment:17

Replying to @vbraun:

Isn't this super-dangerous, changing the hash can potentially give you wrong results.

Experience shows you don't have to change hashes for that, but yes, feel free to up the severity of the ticket if you feel strongly about this.

Also, I thought it is already wrong to make the bare class (not _with_category) accessible through parent()? Of course that ought to be spelled out in the documentation.

I think the system just gets tricked into that here and I think it's unavoidable: If I remember correctly, the category initialization gets delayed for efficiency reasons. Fixing that would probably cause unacceptable loss of performance in other places.

@vbraun
Copy link
Member Author

vbraun commented May 3, 2013

comment:18

Replying to @nbruin:

I think the system just gets tricked into that here and I think it's unavoidable: If I remember correctly, the category initialization gets delayed for efficiency reasons. Fixing that would probably cause unacceptable loss of performance in other places.

Is that actually true? This is about constructing a parent, after all. Though maybe somebody who actually wrote the category code could chime in and tell us what he/she envisioned and where it is documented (j/k).

@simon-king-jena
Copy link
Member

comment:19

See sage.matrix.matrix_space.MatrixSpace.full_category_initialisation for documentation. See #11900 for timings.

The difference of MatrixSpace with respect to other parents is that some code (namely on elliptic curves) creates many different matrix spaces, but uses them only as a container, but not as a fully grown parent.

@simon-king-jena
Copy link
Member

comment:20

PS: A further difference is that finding the correct category is a bit more involved than in other examples. It may be an algebra or a module, depending on whether we talk about general or square matrices. And concerning module, this involves checking whether the base ring is a field, which also tends to be slow. And all this effort would be in vain, because the category framework isn't used for matrix spaces in the elliptic curve code.

@vbraun
Copy link
Member Author

vbraun commented May 3, 2013

comment:21

But the problem is CallableSymbolicExpressionRing_class, not MatrixSpace? The matrices only trigger the problem by initializing the CallableSymbolicExpressionRing_class_with_category. Presumably matrices are smart enough to not use __class__ for hashes and comparison if they delay category initialization.

@simon-king-jena
Copy link
Member

comment:22

Replying to @vbraun:

But the problem is CallableSymbolicExpressionRing_class, not MatrixSpace? The matrices only trigger the problem by initializing the CallableSymbolicExpressionRing_class_with_category.

As much as I know, I never touched CallableSymbolicExpressionRing.

Presumably matrices are smart enough to not use __class__ for hashes and comparison if they delay category initialization.

I hope so.

@simon-king-jena
Copy link
Member

Attachment: trac_14471-review.patch.gz

@simon-king-jena
Copy link
Member

comment:33

I think it would be a good idea to show that the hash actually did change. I extended the new test accordingly, in a review patch.

Positive review, then!

@simon-king-jena

This comment has been minimized.

@simon-king-jena
Copy link
Member

comment:34

No, I spoke to soon. Now, as the debug flag is set, it makes sense to run the full doc tests. After all, it could be that the debug flag uncovers a bug.

@vbraun
Copy link
Member Author

vbraun commented Jun 27, 2013

comment:35

Doctests pass on my machine, of course

@simon-king-jena
Copy link
Member

comment:36

Replying to @vbraun:

Doctests pass on my machine, of course

Confirmed!

@jdemeyer
Copy link

comment:37

The PDF documentation doesn't build due to the use of single instead of double backticks:

! Missing { inserted.
<to be read again> 
                   _
l.14636 when the constructor calls $Parent.__
                                             init__()$ directly or
?
! Emergency stop.
<to be read again> 
                   _
l.14636 when the constructor calls $Parent.__
                                             init__()$ directly or
!  ==> Fatal error occurred, no output PDF file produced!
Transcript written on geometry.log.
 [180[combinat ] /mazur/release/merger/sage-5.11.rc0/devel/sage/doc/en/reference/combinat/sage/combinat/crystals/alcove_path.rst:: WARNING: unusable reference target fo
und: ../../../../../../../html/en/reference/structure/sage/structure/unique_representation.html#sage.structure.unique_representation.UniqueRepresentation
[combinat ] /mazur/release/merger/sage-5.11.rc0/devel/sage/doc/en/reference/combinat/sage/combinat/crystals/alcove_path.rst:: WARNING: unusable reference target found:
../../../../../../../html/en/reference/structure/sage/structure/parent.html#sage.structure.parent.Parent
make[1]: *** [geometry.pdf] Error 1

@vbraun
Copy link
Member Author

vbraun commented Jul 25, 2013

Attachment: trac_14471-pdf-fix.patch.gz

Initial patch

@vbraun

This comment has been minimized.

@vbraun
Copy link
Member Author

vbraun commented Jul 25, 2013

comment:38

Fixed.

@jdemeyer
Copy link

jdemeyer commented Aug 2, 2013

Merged: sage-5.12.beta0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants