What happens when you mess with hashing

asmeurer edited this page Aug 6, 2012 · 4 revisions
Clone this wiki locally

This is based on a message from Aaron Meurer on the mailing list

Python expects a few invariants with regard to hashing, but it does not enforce them. In particular, it expects that:

  • The hash of an object does not change across the object's lifetime (in other words, a hashable object should be immutable).

  • a == b implies hash(a) == hash(b) (note that the reverse might not hold in the case of a hash collision).

So let's look at what happens when these rules are broken. First, let's create an object with __hash__, and try changing the hash across the object's lifetime.

>>> class Bad(object): 
...     def __init__(self, arg): 
...         self.arg = arg 
...     def __hash__(self): 
...         return hash(self.arg) 
... 
>>> Bad(1) 
<__main__.Bad object at ...> 
>>> hash(Bad(1)) 
1 
>>> a = Bad(1) 
>>> b = {a:1} 
>>> a.arg = 2 
>>> hash(a) 
2 
>>> b[a] 
Traceback (most recent call last):
...
KeyError: <__main__.Bad object at ...>

Here, we implicitly changed the hash of a by mutating the argument of a that is used to compute the hash. As a result, the object is no longer found in a dictionary, which uses the hash to find the object.

Note that Python doesn't prevent me from doing this. I could make it if I want, by making __setattr__ raise AttributeError, but even then I could forcibly change it by modifying the object's __dict__. This is what is meant when we say that Python is a "consenting adults" language.

Now let's look at the second point. Bad things also happen if we change the object, but keep the hash the same:

class Bad2(object): 
    def __init__(self, arg): 
        self.arg = arg 
        self.hash_cache = None 
    def __hash__(self): 
        if self.hash_cache is None: 
            self.hash_cache = hash(self.arg) 
        return self.hash_cache 
    def __eq__(self, other): 
        if not isinstance(other, Bad2): 
            return False 
        return self.arg == other.arg 

(this is the pattern used by SymPy's Basic, by the way)

Here we see that finding the key in a dictionary still works:

>>> a = Bad2(1) 
>>> b = Bad2(2) 
>>> d = {b:2} 
>>> b.arg = 1 
>>> hash(b) 
2 
>>> d[b] 
2 

but it no longer keeps the invariant that sets (and dicts) keep unique objects (keys):

>>> a = Bad2(1) 
>>> b = Bad2(2) 
>>> hash(a) 
1 
>>> hash(b) 
2 
>>> b.arg = 1 
>>> a == b 
True 
>>> hash(a) == hash(b) 
False 
>>> set([a, b]) 
set([<__main__.Bad2 object at ...>, <__main__.Bad2 object at ...>]) 

and this can lead to very subtle issues. For example, look what happens when I don't compute the hash before changing the object:

>>> a = Bad2(1) 
>>> b = Bad2(2) 
>>> b.arg = 1 
>>> a == b 
True 
>>> hash(a) == hash(b) 
True 
>>> set([a, b]) 
set([<__main__.Bad2 object at ...>]) 

This time, the set contains one element!

So, yes, Python won't enforce it, but if you don't follow the rule that hash(object) remains the same for the lifetime of object and that a == b implies hash(a) == hash(b), then you will run into very subtle issues, because very basic Python operations expect these invariants.

If you want to be able to mutate an object's properties, you have two options. First, make the object unhashable (make __hash__ raise TypeError). You won't be able to use it in sets or as keys to a dictionary, but you will be free to change the object in place however you want.

A second option is to make all mutable properties non-dependent on hashing or equality testing. This option works well if you just want to cache some internal state that doesn't inherently change the object. Both __eq__ and __hash__ should remain unchanged by changes to this state. You may also want to make sure you use proper getters and setters to prevent modification of internal state that equality testing and hashing does depend on.