# Hash Table - open addressing
Yesterday I spent some time on a [hash table](https://en.wikipedia.org/wiki/Hash_table) implementation and talked a bit about the two essential concepts behind it: the [hash function](https://en.wikipedia.org/wiki/Hash_function) and collision resolution.

The way I chose to resolve collisions is called linked-list chaining, and the approach has both advantages and disadvantages. While the implementation was simple, each entry required a bunch of additional memory, and I also had to support a second data structure.

Today I decided to implement a completely different approach called [open addressing](https://en.wikipedia.org/wiki/Hash_table#Open_addressing).

The idea is pretty intuitive. Find a bucket based on the hash code of the key. If the bucket is occupied, simply probe another bucket. If that one is also occupied, probe another one, and so on until you find a free slot.

Which bucket should you probe? There are various strategies. Linear probing iteratively searches `[hash(key) + i]; for i=0..N`. Quadratic probing searches `[hash(key) + i**2]; for i=0..N`. You can also use a secondary hash function to search at `[hash(key) + hash2(key, i)]; for i=0..N`, which is known as double hashing. In every case the resulting index is taken modulo the table size.
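The three strategies can be sketched as small generators. This is only an illustration: the function names are made up for this example, and for double hashing I assume the common form `hash2(key, i) = i * step`, where `step` stands in for the value of the secondary hash.

```python
def linear_probe(h, size):
    """Yield bucket indices h, h+1, h+2, ... wrapped around the table."""
    for i in range(size):
        yield (h + i) % size


def quadratic_probe(h, size):
    """Yield bucket indices h, h+1, h+4, h+9, ... wrapped around the table."""
    for i in range(size):
        yield (h + i * i) % size


def double_hash_probe(h, size, step):
    """Yield bucket indices h, h+step, h+2*step, ... wrapped around the table.

    `step` is an assumed stand-in for the secondary hash and must be non-zero.
    """
    for i in range(size):
        yield (h + i * step) % size


size = 11
print(list(linear_probe(7, size))[:5])                # [7, 8, 9, 10, 0]
print(list(quadratic_probe(7, size))[:5])             # [7, 8, 0, 5, 1]
print(list(double_hash_probe(7, size, step=3))[:5])   # [7, 10, 2, 5, 8]
```

Note that only linear probing is guaranteed to visit every bucket for any table size; the other two may revisit buckets depending on the size and the step.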

And here is the catch in the implementation. If you attempt to remove a key, you must not simply clear the bucket. Instead, the bucket has to be marked as deleted (the code below uses a special `empty` marker for this), so that the probing sequence of the entries stored behind it doesn’t get broken.
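To see why, here is a tiny sketch with three keys that share the same home bucket. The names `DELETED`, `home` and `find` are hypothetical and exist only for this example; the toy hash is simply `key % SIZE`.

```python
SIZE = 8
DELETED = object()  # marker for a removed entry (hypothetical name)


def home(key):
    """Toy hash function: the home bucket of an integer key."""
    return key % SIZE


def find(buckets, key):
    """Return the index holding `key`, or None when the probe reaches a never-used bucket."""
    for i in range(SIZE):
        idx = (home(key) + i) % SIZE
        entry = buckets[idx]
        if entry is None:  # a never-used bucket ends the probe sequence
            return None
        if entry is not DELETED and entry == key:
            return idx
    return None


# keys 1, 9 and 17 all have home bucket 1, so they occupy buckets 1, 2, 3
buckets = [None] * SIZE
buckets[1], buckets[2], buckets[3] = 1, 9, 17

naive = list(buckets)
naive[2] = None        # removing 9 by clearing the bucket leaves a hole

marked = list(buckets)
marked[2] = DELETED    # removing 9 by marking the bucket keeps the chain intact

print(find(naive, 17))   # None: the probe stops at the hole and never reaches 17
print(find(marked, 17))  # 3
```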

I chose to implement linear probing, which is probably the worst of the open addressing techniques, yet it’s the best one for showing their disadvantage. After a while, open addressing tends to create long consecutive sequences of occupied buckets. This effect is called clustering and may notably degrade hash table performance.
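A rough way to observe the effect is to count how many buckets an insert has to touch as the table fills up. The sketch below is a simplified simulation, not the implementation from this post: `average_probes` is a made-up helper, and random integers stand in for hash codes.

```python
import random


def average_probes(size, n_keys, seed=0):
    """Insert n_keys random hash codes using linear probing.

    Returns the mean number of buckets touched per insert.
    Requires n_keys < size so that a free bucket always exists.
    """
    rng = random.Random(seed)
    occupied = [False] * size
    total = 0
    for _ in range(n_keys):
        idx = rng.randrange(size)
        probes = 1
        while occupied[idx]:          # walk through the cluster
            idx = (idx + 1) % size
            probes += 1
        occupied[idx] = True
        total += probes
    return total / n_keys


for n_keys in (300, 600, 900):
    print(n_keys / 1000, average_probes(1000, n_keys))
```

The average should grow noticeably as the load approaches 1, because new keys increasingly land inside existing clusters and extend them.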

![table slots](resource/day72-hashtable.png)
x: filled slot, o: slot marked as empty after a removal

Removing a key-value pair doesn’t help since the bucket is still considered to be occupied, only marked as empty (denoted as `o` in the picture). The only way to get rid of clusters is to reset the table and re-hash all the entries.
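The point can be checked on a slot string that uses the same symbols as the `slots()` method further down (`x` filled, `o` marked as empty, `-` never used). The helper `cluster_lengths` is a hypothetical name, and for brevity it ignores clusters that wrap around the end of the table.

```python
import re


def cluster_lengths(slots):
    """Lengths of consecutive runs of occupied buckets (filled 'x' or marked 'o')."""
    return [len(run) for run in re.findall(r'[xo]+', slots)]


before = 'xxxxx--xx-'
after = 'xxoxx--xx-'   # third entry removed, its bucket is only marked

print(cluster_lengths(before))  # [5, 2]
print(cluster_lengths(after))   # [5, 2]
```

A lookup still has to walk across the `o` bucket, so the cluster is just as long as before the removal.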

Other probing methods, like double hashing or quadratic probing, were proposed to solve this problem. However, even these techniques tend to create clusters; they are just much harder to notice.

In [1]:
import numpy as np

## algorithm

In [2]:
class HashTable:

    ratio_expand = .7
    ratio_shrink = .2
    min_size = 11
    empty = (None,)

    def __init__(self, size=None):
        self._size = size or self.min_size
        self._buckets = [None] * self._size
        self._count = 0

    def _entry(self, key):
        # get hash
        hash_ = hash(key)
        idx1 = None

        for i in range(self._size):
            # linear probing
            idx = (hash_ + i) % self._size
            entry = self._buckets[idx]

            # end of chain
            if not entry:
                break
            # remember first empty bucket
            elif entry is self.empty:
                if idx1 is None:
                    idx1 = idx
            # test key
            elif entry[0] == key:
                return idx, entry

        else:
            # out of space
            if idx1 is None:
                raise IndexError()

        # return first empty bucket
        return (idx, None) if idx1 is None else (idx1, None)

    def _ensure_capacity(self):
        fill = self._count / self._size
        
        # expand or shrink?
        if fill > self.ratio_expand:
            self._size = self._size * 2 + 1
        elif fill < self.ratio_shrink and self._size > self.min_size:
            self._size = (self._size - 1) // 2
        else:
            return

        # reallocate buckets
        entries = self._buckets
        self._buckets = [None] * self._size

        # store entries into new buckets
        for entry in entries:
            if entry and entry is not self.empty:
                idx, _ = self._entry(entry[0])
                self._buckets[idx] = entry

    def __len__(self):
        return self._count

    def __contains__(self, key):
        _, entry = self._entry(key)
        return bool(entry)

    def __getitem__(self, key):
        _, entry = self._entry(key)
        return entry and entry[1]

    def __setitem__(self, key, value):
        idx, entry = self._entry(key)

        # set value
        self._buckets[idx] = key, value

        # expand
        self._count += bool(not entry or entry is self.empty)
        self._ensure_capacity()

    def __delitem__(self, key):
        idx, entry = self._entry(key)

        # delete key and value
        if entry:
            self._buckets[idx] = self.empty

        # shrink
        self._count -= bool(entry and entry is not self.empty)
        self._ensure_capacity()

    def __iter__(self):
        for entry in self._buckets:
            if entry and entry is not self.empty:
                yield entry[0]

    def slots(self):
        return ''.join('-' if not p else 'o' if p is self.empty else 'x' for p in self._buckets)


## run

In [3]:
table = HashTable()

In [4]:
# add random values
for _ in range(1000):
    key, value = np.random.randint(1000), np.random.rand()
    if np.random.rand() >= .5:
        table[key] = value
    else:
        del table[key]

In [5]:
len(table), table._size

(309, 767)

In [6]:
table.slots()

'oxxxx-xx---x-xxxxxxxxxxxx-x-x-x--xx-xxxxxoxx-xxxxxxxxxxxxxxx----x--xxxxxxx--xxo-x-x----xx-xx-xxxx-oxxxx-xx-x-o----xxxxx---xxxxxxxoxxxxxxx--x--x-xxox-xxxxxx-----x-xxxx-x-xxxxxoxxx-xx-xxxx-xx-----xxxx-xxxxxxx-x-xx-xoxo-x-xx-oxx-xxxx------xx----o--xo--xxox----x-x-o---x----xx---xx--------xxx----x----x-x--xx-xxxo----xx-x-xx--ox--x-xxx-----x----------x-x---xxxx-x-x---x--xxxo--x----xx--x-x-x---x--x-xxx-x-----xxxx-xx--------xxxxx---------------------x-xx-----xx-------oxxx----x--x----xo----x-------x-x--x------xx--x-xxx-x----x------xxx---o---x---xx--xx------x--x--xx----ox----x-----x--oxx----------x--xx-x--x---x------xxx--x-xx-x-------x--x--x--x--xx--x---x-x-xo-----xxx---------xxo--x---xx----x----x-x-x--x----------------o-x---xxx-o-x--x----------xx----x--x-----x--xx-o'

In [7]:
# print some values
for key in list(table)[:5]:
    print(key, table[key])

1 0.5616508934358246
768 0.02918759614070654
3 0.43531050838505947
770 0.24205659708634175
773 0.13063569064816627


In [8]:
# delete all the values
for key in list(table):
    del table[key]

In [9]:
len(table), table._size

(0, 11)