# Hashing

----

Contents
 - [Hash table basics](#Hash-table-basics)
 - [Python dictionaries](#Python-dictionaries)
  - [Dictionaries](#dictionaries)
  - [Dictionary keys](#dictionary-keys)

----

In [52]:
import numpy as np

## Hash table basics

You're probably already familiar with hash tables, but for completeness' sake, let's review the basics.

Hash tables consist of four parts:
1. Keys
2. Values
3. A hash function
4. A storage array

![Hash Table Diagram](hash_overview.png)

The basic idea is that a hash function provides a constant-time approximation of where to find or store a particular key-value pair. In the common case, the key-value pair can be found or stored at that location. In the uncommon case, the function returned the same value for some other key, so some other location must be searched or used to complete the operation. When a hash function returns the same value for two distinct keys, it is called a collision, and the mechanism to recover is called collision resolution.
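
To make the four parts concrete, here is a minimal sketch of a hash table that resolves collisions by chaining (each slot holds a small list of pairs). The class name `ToyHashTable` and its `put`/`get` methods are made up for illustration, and chaining is chosen only for brevity; it is not a description of how Python's `dict` is implemented.

```python
class ToyHashTable:
    """Toy hash table using separate chaining (illustrative only)."""

    def __init__(self, size=8):
        # The storage array: each slot holds a list ("bucket") of (key, value) pairs.
        self._slots = [[] for _ in range(size)]

    def _bucket(self, key):
        # The hash function picks a slot in constant time.
        return self._slots[hash(key) % len(self._slots)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing, _) in enumerate(bucket):
            if existing == key:          # same key: overwrite the old value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key (possibly a collision)

    def get(self, key):
        # Collision resolution: scan the bucket for an equal key.
        for existing, value in self._bucket(key):
            if existing == key:
                return value
        raise KeyError(key)


table = ToyHashTable(size=8)
table.put(1, 'one')
table.put(9, 'nine')   # hash(9) % 8 == hash(1) % 8, so this collides with key 1
print(table.get(1), table.get(9))
```

Both keys land in the same slot, yet both lookups succeed because the equality check inside the bucket tells them apart.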

## Python dictionaries

Python `dict`s are hash tables. We're going to understand how they operate internally.

As we've already discussed, Python data structures are largely defined by the interfaces they present. There are two interfaces we care about here:

 - Objects that act like dictionaries
 - Objects that can act as keys for dictionaries

### Dictionaries

Python dictionaries map keys to values.

Just like sequences and iterables, the interface for Python dictionaries can be described by a small set of special methods. The easiest way to understand this is to match Python statements involving dictionaries with the special methods they call.

In [181]:
dictionary = {}
### All of these pairs of statements are equivalent.

# Setting an item
dictionary['key1'] = 'value1'
dictionary.__setitem__('key2', 'value2')

# Getting an item
x = dictionary['key1']
x = dictionary.__getitem__('key2')

# Getting the size of the dictionary
len(dictionary)
dictionary.__len__()

# Iterating over a dictionary
[key for key in dictionary]
[key for key in dictionary.__iter__()]

# Deleting an item
del dictionary['key1']
dictionary.__delitem__('key2')

# Checking for an item (OPTIONAL)
'key1' in dictionary
dictionary.__contains__('key2')
for key in dictionary: # Fallback behavior, if __contains__ is not defined
    if key=='key1':
        True
        break
else: # for-else: this block runs only if the loop finishes without a break
    False

# Default values (OPTIONAL)
# Normal dicts do not implement this feature
# x = dictionary['missing-key']
# x = dictionary.__missing__('missing-key')
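
As a small sketch of the optional `__missing__` hook, the subclass below returns a default instead of raising `KeyError`. The class name `CountingDict` is hypothetical; the only behavior relied on is that `dict.__getitem__` calls `__missing__` on subclasses that define it.

```python
class CountingDict(dict):
    """Hypothetical dict subclass: missing keys read as 0."""

    def __missing__(self, key):
        # Called by dict.__getitem__ when the key is absent.
        return 0


counts = CountingDict()
for letter in 'hello':
    counts[letter] = counts[letter] + 1
print(counts)

# Only the d[key] syntax goes through __missing__.
print('z' in counts, counts.get('z'))
```

Note that `in` and `get` do not consult `__missing__`, so a missing key still looks missing to them.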

**Dictionaries that are not dictionaries**

One side-effect of duck-typed interfaces is that *anything* can be a duck, so long as it quacks.

As an example, let's abuse this interface to create a dictionary which never contains more than 10 values.

In [215]:
class ForgetfulDict(dict):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._memory = [None]*10
        for k, v in list(self.items()):  # copy first: __setitem__ may delete keys mid-loop
            self[k] = v
    # Inherit __getitem__, __iter__, __contains__, and __len__
    def __setitem__(self, key, value):
        if key is None:
            return None # disallow None
        elif key in self._memory:
            super().__setitem__(key, value)
        else:
            lostkey = self._memory.pop(-1)
            self._memory.insert(0, key)
            super().__setitem__(key,value)
            if lostkey is not None:
                print('What was "'+str(lostkey)+'" again?')
                super().__delitem__(lostkey)
    def __delitem__(self, key):
        if key in self._memory:
            self._memory.remove(key)
            self._memory.append(None)
            super().__delitem__(key)
        else:
            self.__missing__(key)
    def __missing__(self, key):
        print('Did I forget something?')
        return None

In [222]:
forgetful = ForgetfulDict(zip('abcdefghij',range(0,10)))
print(len(forgetful),forgetful)
print('a is',forgetful['a'])

forgetful['z'] = -1
forgetful['y'] = -2
print(len(forgetful),forgetful)
print('a is',forgetful['a'])

10 {'a': 0, 'i': 8, 'h': 7, 'c': 2, 'e': 4, 'b': 1, 'g': 6, 'd': 3, 'j': 9, 'f': 5}
a is 0
What was "a" again?
What was "i" again?
10 {'g': 6, 'd': 3, 'f': 5, 'y': -2, 'z': -1, 'h': 7, 'c': 2, 'e': 4, 'b': 1, 'j': 9}
Did I forget something?
a is None


While the utility of a forgetful dictionary is questionable, the ability to manipulate the interface in this way is pretty powerful.

### Dictionary keys

Dictionary keys in Python can be any object that satisfies the interface convention.

A common misconception is that dictionary keys must be immutable.

In actuality, Python dicts define an interface just like everything else. There are two relevant special methods:

1. `__hash__`: returns an integer. This will eventually get converted by Python to an index.
2. `__eq__`: compares two objects for equality. This is how collisions are detected.
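
A rough sketch of how the two methods cooperate: the hash is reduced to a slot index, and equality decides whether two keys that reach the same slot are really the same key. The helper `slot_for` and the table size of 8 are assumptions made for illustration; the real index computation and probing inside `dict` are more involved.

```python
def slot_for(key, table_size=8):
    # Assumes a power-of-two table size, so masking keeps the low bits of the hash.
    return hash(key) & (table_size - 1)


# Distinct keys can share a slot: that is a collision.
for key in (1, 9, 17, (1, 2)):
    print(repr(key), hash(key), slot_for(key))

# Equal keys with equal hashes are treated as one key.
d = {1: 'int'}
d[1.0] = 'float'   # 1 == 1.0 and hash(1) == hash(1.0), so this overwrites
print(d)
```

The second half shows `__eq__` at work: `1` and `1.0` are different objects of different types, but the dictionary ends up with a single entry.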

In general, immutable built-in types (like numbers and tuples) have `__hash__` methods defined, and built-in mutable types (like lists and dicts) have `__hash__` set to `None`.

Let's see some examples.

In [170]:
# VALID KEYS

valid = {
    0: 'numbers',
    'abc': 'strings',
    (1,2): 'tuples',
    complex(1,2): 'certain classes',
    (1,('a','b'),((),(),),): 'ridiculous combinations of the above',
}

[hash(k) for k in valid.keys()]

[0, 3713081631934410656, -5968570388787316237, -1903593707857203530, 2000007]

In [81]:
# INVALID KEYS

try:
    invalid = { [1,2,3]: 'lists' }
except TypeError: # unhashable type: 'list'
    print('lists are mutable')

try:
    invalid = { {1:2}: 'other dicts' }
except TypeError: # unhashable type: 'dict'
    print('dicts are mutable, so no recursive dicts')

print( getattr( [1,2], '__hash__') )
print( getattr( {1:2}, '__hash__') )

lists are mutable
dicts are mutable, so no recursive dicts
None
None


Perhaps surprisingly to some of you, you can *absolutely* have mutable keys.

In [89]:
class MutableClass:
    def __init__(self, init):
        self.v = init
    def set(self, new):
        self.v = new
    def __repr__(self):
        return '<MutableClass: '+str(self.v)+'>'

m = MutableClass(0)
dictionary = { m: 'value' }
print(dictionary, dictionary[m])

m.set(-1)
print(dictionary, dictionary[m])

{<MutableClass: 0>: 'value'} value
{<MutableClass: -1>: 'value'} value


So what's going on here? Two things:

 - User-defined classes inherit a `__hash__` from `object` which returns a hash based on the identity of the object.
 - The implementation of `__eq__` inherited from `object` compares the identity of two objects.

In [106]:
A = MutableClass(0)
B = MutableClass(0)
C = B
print([A,B,C])
print([hash(o) for o in [A,B,C]])
print(A==B, B==C)

C.set(1)
print([A,B,C])
print([hash(o) for o in [A,B,C]])
print(A==B, B==C)

[<MutableClass: 0>, <MutableClass: 0>, <MutableClass: 0>]
[-9223372036579556110, 275219709, 275219709]
False True
[<MutableClass: 0>, <MutableClass: 1>, <MutableClass: 1>]
[-9223372036579556110, 275219709, 275219709]
False True


In general, these two methods should follow a few rules:

 - `hash(A)==hash(A)`
 - Normally, `hash(A)!=hash(B)` if `A!=B`
 - `A==B` implies `hash(A)==hash(B)`
 
But we're in Python, so we can happily break any or all of these to see what happens.

In [124]:
class UtilityMethods(): 
    def __init__(self, v):
        self.set(v)
    def set(self, v):
        self.v = v
    def __repr__(self):
        return '<OBJ|id:'+str(id(self))+'|v:'+str(self.v)+'>'

First up, `hash(A)==hash(A)`. Hashes should be consistent.

In [142]:
import random
class InconsistentHashable(UtilityMethods):
    def __hash__(self):
        return random.randint(0,1000)

In [143]:
A = InconsistentHashable(v=0)
print(A)
print(hash(A), hash(A), hash(A))

dictionary = {}
dictionary[A] = 'x'
dictionary[A] = 'y'
dictionary[A] = 'z'
print(dictionary)
print( A in dictionary )

<OBJ|id:4411763904|v:0>
705 773 241
{<OBJ|id:4411763904|v:0>: 'z', <OBJ|id:4411763904|v:0>: 'y'}
True


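Fun. Because every call to `hash(A)` can return a different value, assigning to the same key `A` can create multiple entries, and lookup is effectively broken: `A in dictionary` printed `True` above only by luck.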

Next: Normally, `hash(A)!=hash(B)` if `A!=B`. We normally want to *avoid* collisions in hash tables.

In [148]:
class NormalHashable(UtilityMethods):
    def __hash__(self):
        return hash(id(self))
    def __eq__(self, other):
        return self.v==other.v
class SlowHashable(UtilityMethods):
    def __hash__(self):
        return 0
    def __eq__(self, other):
        return self.v==other.v

In [151]:
%%timeit
dictionary = {}
for i in range(1000):
    A = NormalHashable(i)
    dictionary[A] = i

1000 loops, best of 3: 965 µs per loop


In [152]:
%%timeit
dictionary = {}
for i in range(1000):
    A = SlowHashable(i)
    dictionary[A] = i

10 loops, best of 3: 127 ms per loop


Constantly causing collisions *really* slows things down.
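
One way to see where the time goes is to count equality comparisons instead of measuring wall-clock time. The sketch below uses a hypothetical `CountedKey` class and `count_eq_calls` helper (neither appears above); exact counts depend on the interpreter, so none are quoted here.

```python
class CountedKey:
    """Hypothetical key type that counts how often __eq__ runs."""
    eq_calls = 0

    def __init__(self, v, constant_hash):
        self.v = v
        self.constant_hash = constant_hash

    def __hash__(self):
        # Either every key collides (hash 0) or the hash follows the value.
        return 0 if self.constant_hash else hash(self.v)

    def __eq__(self, other):
        CountedKey.eq_calls += 1
        return self.v == other.v


def count_eq_calls(n, constant_hash):
    """Insert n distinct keys and report how many __eq__ calls that took."""
    CountedKey.eq_calls = 0
    d = {}
    for i in range(n):
        d[CountedKey(i, constant_hash)] = i
    return CountedKey.eq_calls


spread = count_eq_calls(200, constant_hash=False)
colliding = count_eq_calls(200, constant_hash=True)
print('distinct hashes:', spread, '__eq__ calls')
print('constant hash:  ', colliding, '__eq__ calls')
```

With a constant hash, each insert has to be compared against the keys already stored, so the number of comparisons grows much faster than the number of keys.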

Finally, the strange one: `A==B` implies `hash(A)==hash(B)`. If two objects are equal, they should always hash to the same value.

In [122]:
class StrangeHashable(UtilityMethods):
    def __hash__(self):
        return id(self)%10
    def __eq__(self, other):
        return self.v==other.v

In [169]:
A = [A for A in [StrangeHashable(v=0) for _ in range(0,50)] if hash(A)==0][0] # Find an instance which hashes to 0
B = [B for B in [StrangeHashable(v=0) for _ in range(0,50)] if hash(B)==0][0] # Find an instance which hashes to 0
C = [C for C in [StrangeHashable(v=0) for _ in range(0,50)] if hash(C)==2][0] # Find an instance which hashes to 2

print(A, B, C)
print('(A==B)',A==B, ' (hash(A)==hash(B))',hash(A)==hash(B))
print('(B==C)',B==C, ' (hash(B)==hash(C))',hash(B)==hash(C))

dictionary = {}
dictionary[A] = 'a'
print(dictionary)
dictionary[B] = 'b'
print(dictionary)
dictionary[C] = 'c'
print(dictionary)


<OBJ|id:4415029600|v:0> <OBJ|id:4415029880|v:0> <OBJ|id:4415030832|v:0>
(A==B) True  (hash(A)==hash(B)) True
(B==C) True  (hash(B)==hash(C)) False
{<OBJ|id:4415029600|v:0>: 'a'}
{<OBJ|id:4415029600|v:0>: 'b'}
{<OBJ|id:4415029600|v:0>: 'b', <OBJ|id:4415030832|v:0>: 'c'}


If equality doesn't necessarily imply hash equality, then correct lookup behavior will depend entirely on random collisions in the hash space. Yuck.

**TL;DR:**

 - `__hash__` and `__eq__` govern `dict` behavior in Python.
 - If you're going to implement a class which can act like a dictionary key:
  - `hash(A)==hash(A)`
  - Normally, `hash(A)!=hash(B)` if `A!=B`
  - `A==B` implies `hash(A)==hash(B)`
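
A minimal sketch of a key class that follows all three rules, assuming its fields are not changed after construction. The class name `Point` and its `_key` helper are made up for illustration; the idea is simply to derive both `__eq__` and `__hash__` from the same tuple of fields.

```python
class Point:
    """Hypothetical key type whose hash and equality use the same fields."""

    def __init__(self, x, y):
        self._x = x
        self._y = y

    def _key(self):
        # Single source of truth for both equality and hashing.
        return (self._x, self._y)

    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


a, b = Point(1, 2), Point(1, 2)
lookup = {a: 'found'}
print(a == b, hash(a) == hash(b), lookup[b])
```

Defining both methods together matters in Python 3: a class that defines `__eq__` without `__hash__` has its `__hash__` set to `None` and can no longer be used as a key.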