# Hash Tables

Hash tables are an effective way to implement dictionaries. Under reasonable assumptions, the average time to search for an element in a hash table is $O(1)$.<br>

- Generalizes direct addressing into an ordinary array.
- A hash table typically uses an array of size proportional to the number of keys actually stored.

<br> 

Direct addressing is a simple technique that works well if the universe U of keys is reasonably small. We can represent the direct-address table by an array T[0..m-1] in which each slot corresponds to a key in the universe U. <br>

The downside of direct addressing is that if the universe U is large, storing a table T with one slot per possible key can be impractical. The set of keys actually stored may also be small relative to U, so most of the table is wasted space. A hash table requires much less storage while keeping search at $O(1)$ on average. 
- The difference from direct addressing is that an element with key $k$ is stored in slot $h(k)$ instead of slot $k$. Here $h$ is a hash function that computes the slot from the key, mapping the universe U of keys to the slots of the hash table T. 
- We call $h(k)$ the hash value of key $k$.
- If two keys hash to the same slot, they collide. Collisions can be resolved by chaining, where slot $j$ contains a linked list of all keys whose hash value is $j$. 

In [5]:
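def directAddressSearch(T, k):
    # return the element stored under key k, or None if slot k is empty
    return T[k]

def directAddressInsert(T, x):
    # x is assumed to expose its key as x.key
    T[x.key] = x

def directAddressDelete(T, x):
    # None plays the role of NIL
    T[x.key] = None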

In [6]:
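# slot j contains a list (the chain) of all stored elements that hash to j
# assumes every element x exposes its key as x.key

def h(k, m):
    # placeholder hash function: map key k to one of m slots
    return hash(k) % m

def chainhashInsert(T, x):
    # insert x at the head of list T[h(x.key)]
    T[h(x.key, len(T))].insert(0, x)

def chainhashSearch(T, k):
    # search for an element with key k in list T[h(k)]
    for x in T[h(k, len(T))]:
        if x.key == k:
            return x
    return None

def chainhashDel(T, x):
    # delete x from list T[h(x.key)]
    T[h(x.key, len(T))].remove(x)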
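
To make the chaining idea concrete, here is a minimal sketch that chains plain integer keys in Python lists. The helper name `build_chained_table` and the choice $h(k) = k$ mod $m$ are assumptions made only for this illustration. It shows two colliding keys sharing one slot, and that the average chain length equals the load factor n/m.

```python
def build_chained_table(keys, m):
    """Hypothetical helper: chain integer keys into m slots using h(k) = k mod m."""
    table = [[] for _ in range(m)]
    for k in keys:
        table[k % m].insert(0, k)  # insert at the head of the chain
    return table

keys = [26, 54, 94, 17, 31, 77, 44, 51]
table = build_chained_table(keys, 13)

chain_lengths = [len(chain) for chain in table]
load_factor = len(keys) / len(table)      # alpha = n/m
average_chain = sum(chain_lengths) / len(table)

print(table[5])                       # 31 and 44 both hash to slot 5
print(load_factor == average_chain)   # average chain length is n/m
```

A search only has to scan one chain, so as long as n/m stays bounded by a constant the expected search time stays constant as well.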

Creating Hash Functions<br>

- Heuristic Hashing
    * Hash by division
    * Hash by multiplication
- Universal hashing
    * Randomized 
    
<br>

A good hash function comes as close as possible to simple uniform hashing: each key is equally likely to hash to any of the m slots, independently of where any other key has hashed to. 

## Division Method
Map a key $k$ into one of $m$ slots by taking the remainder of $k$ divided by $m$.<br>
$h(k) = k \bmod m$<br>
$m$ should not be a power of 2: if $m=2^p$, then $h(k)$ is just the $p$ lowest-order bits of $k$. <br>
A prime number not too close to an exact power of 2 is often a good choice for $m$. 
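
A small sketch of why a power of 2 is a poor choice for $m$. The function name `h_division` and the three sample keys are made up for the illustration; the keys share their four low-order bits and differ only in the high-order bits.

```python
def h_division(k, m):
    """Division method: h(k) = k mod m."""
    return k % m

# Three keys that differ only in their high-order bits.
keys = [0b10110100, 0b01010100, 0b11110100]

# m = 2**4 keeps only the 4 low-order bits, so all three keys collide.
print([h_division(k, 16) for k in keys])   # [4, 4, 4]

# A prime m lets the high-order bits influence the slot as well.
print([h_division(k, 13) for k in keys])   # [11, 6, 10]
```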

## Multiplication Method
Multiply the key $k$ by a constant $A$, where $0 < A < 1$, and extract the fractional part of $kA$. Then multiply this fraction by $m$ and take the floor of the result. <br>
$h(k) = \lfloor m \, (kA \bmod 1) \rfloor$
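
A minimal sketch of the multiplication method. The default $A = (\sqrt{5} - 1)/2 \approx 0.618$ is an assumption that is not stated in the notes above; it is the value Knuth suggests as likely to work reasonably well. Because the code uses floating point, very large keys would lose precision in the fractional part, so treat this as an illustration rather than a production hash.

```python
import math

def h_multiplication(k, m, A=(math.sqrt(5) - 1) / 2):
    """Multiplication method: h(k) = floor(m * (k*A mod 1))."""
    frac = (k * A) % 1.0          # fractional part of k*A
    return math.floor(m * frac)

# With this method the value of m is not critical, so a power of 2 is fine.
print(h_multiplication(123456, 2**14))   # 67
```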

## Universal Hashing
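Universal hashing chooses the hash function at random, independently of the keys that will be stored. At the start of execution we select the hash function at random from a carefully designed class of functions, so the algorithm can behave differently on each execution, even for the same input. This guarantees good average-case performance no matter which keys are chosen. <br>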
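
One commonly used class of this kind is $h_{a,b}(k) = ((ak + b) \bmod p) \bmod m$, where $p$ is a prime larger than every key and $a$, $b$ are drawn at random. The sketch below assumes integer keys smaller than $p = 2^{31} - 1$; the helper name `make_universal_hash` is hypothetical.

```python
import random

def make_universal_hash(m, p=2_147_483_647, rng=random):
    """Draw h(k) = ((a*k + b) mod p) mod m at random from the family.

    p must be a prime larger than any key; 2**31 - 1 is prime.
    """
    a = rng.randrange(1, p)   # a in 1..p-1
    b = rng.randrange(0, p)   # b in 0..p-1
    return lambda k: ((a * k + b) % p) % m

h = make_universal_hash(13)
slots = [h(k) for k in [26, 54, 94, 17, 31, 77, 44, 51]]
print(slots)   # varies from run to run, since a and b are random
```

Because `a` and `b` are picked only when the table is created, no fixed set of keys can be bad for every run.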

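- djb2 
    * djb2 is a popular hash function for strings. It is a fixed heuristic function, not a randomly chosen one. 
    * The multiplier 33 reportedly works better than most other constants, prime or not. 
    * `hash = (hash << 5) + hash + c`, which is the same as `hash * 33 + c` 
- uniform distribution of hash values
- avoids collisions
- fast to compute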

In [20]:
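def hashdbj2(key):
    # djb2: start from 5381 and fold in each character with multiplier 33
    h = 5381
    for c in key:
        h = (h * 33) + ord(c)
    return h  # unbounded Python int; reduce mod the table size before use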



### Open Addressing
- Used if we don't want an additional data structure (a linked list) to hold colliding keys.

- Linear probing -> 
    * We store key-value pairs within the array itself. On a collision, check the slot directly next to it, and keep going until an empty slot is found. 
- Double hashing -> 
    * On a collision, a second hash function picks a step size c, and we check every c-th slot for an empty open address. 
    * Less likely to form clusters than linear probing. 
    * i = $h_1$(key) mod m
    * (i + c) mod m 
    * (i + 2c) mod m 
    * The greatest common divisor of c and m must be 1 (always true when m is prime and 0 < c < m).
    * c <- {$h_2$(key) mod (m-1)} + 1
<br>
All elements occupy the hash table itself. Each table entry contains either an element or NIL. The hash table can fill up so that no further insertions can be made, so the load factor can never exceed 1. This way we avoid pointers altogether. Instead of following pointers, we compute the sequence of slots to be examined (probed). <br>

A good second hash function (the one that gives the step size) must:
- Never evaluate to zero
- Ensure all cells can be probed
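
Both requirements can be checked on a small example. The sketch below assumes integer keys, $h_1(k) = k$ mod $m$ and $h_2(k) = 1 + (k$ mod $(m-1))$; the function name `probe_sequence` is made up for the illustration.

```python
def probe_sequence(k, m):
    """Slots examined for key k: (h1(k) + i*h2(k)) mod m for i = 0..m-1."""
    h1 = k % m
    h2 = 1 + (k % (m - 1))   # always in 1..m-1, so never zero
    return [(h1 + i * h2) % m for i in range(m)]

seq = probe_sequence(44, 13)
print(seq)                             # starts at slot 5 and steps by 9
print(sorted(seq) == list(range(13)))  # True: every slot is probed once
```

With a prime $m$, every step size between 1 and $m-1$ is coprime to $m$, which is why the sequence reaches all 13 slots.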

In [24]:
def double_hashing(keys, hashtable_size, double_hash_value):
    hashtable_list = [None] * hashtable_size
    for key in keys:
        hashkey = key % hashtable_size
        # second hash: step size in 1..double_hash_value, never zero
        steps = double_hash_value - (key % double_hash_value)
        # on a collision, jump ahead by `steps` until a free slot turns up
        # (assumes the table has room and steps is coprime to hashtable_size)
        while hashtable_list[hashkey] is not None:
            hashkey = (hashkey + steps) % hashtable_size
        hashtable_list[hashkey] = key
    return hashtable_list


values = [26, 54, 94, 17, 31, 77, 44, 51]
print( double_hashing(values, 13, 5) )

[26, None, 54, 94, 17, 31, 44, 51, None, None, None, None, 77]


In [27]:
def linear_probe(keys, hashtable_size):
    hashtable_list = [None] * hashtable_size
    for key in keys:
        hashkey = key % hashtable_size
        # on a collision, try the next slot (wrapping around) until one is free
        # (assumes the table has room for every key)
        while hashtable_list[hashkey] is not None:
            hashkey = (hashkey + 1) % hashtable_size
        hashtable_list[hashkey] = key
    return hashtable_list

values = [26, 54, 94, 17, 31, 77, 44, 51]
print( linear_probe(values, 13) )

#can use a better hash func.

[26, 51, 54, 94, 17, 31, 44, None, None, None, None, None, 77]
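
Searching an open-addressed table has to follow the same probe sequence that insertion used. The following is a minimal sketch for linear probing with integer keys; the function names are hypothetical, and the insert assumes the table still has a free slot.

```python
def linear_probe_insert(table, key):
    """Place key in the first free slot at or after key mod m."""
    m = len(table)
    slot = key % m
    while table[slot] is not None:
        slot = (slot + 1) % m
    table[slot] = key

def linear_probe_search(table, key):
    """Return the slot holding key, or None if the key is absent."""
    m = len(table)
    slot = key % m
    for _ in range(m):               # probe each slot at most once
        if table[slot] is None:      # an empty slot ends the probe sequence
            return None
        if table[slot] == key:
            return slot
        slot = (slot + 1) % m
    return None

table = [None] * 13
for k in [26, 54, 94, 17, 31, 77, 44, 51]:
    linear_probe_insert(table, k)

print(linear_probe_search(table, 51))  # 1: hashes to 12, wraps past 0
print(linear_probe_search(table, 5))   # None: probes 5, 6, then empty slot 7
```

The search stops at the first empty slot, which is why deleting an entry by simply clearing its slot can hide keys stored further along the same probe sequence.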


In [21]:
class HashTable(object):
    def __init__(self):
        self.max_length = 8
        #this is alpha which is n/m
        self.max_load_factor =0.75
        self.length = 0
        self.table = [None] * self.max_length
        
    def __len__(self):
        return self.length
    
    def __setitem__(self, key, value):
        self.length += 1
        hashed_key = self._hash(key)
        while self.table[hashed_key] is not None:
            if self.table[hashed_key][0] == key:
                self.length -= 1
                break
            hashed_key = self._increment_key(hashed_key)
        self.table[hashed_key] = (key, value)
        if self.length / float(self.max_length) >= self.max_load_factor:
            self._resize()

    def __getitem__(self, key):
        index = self._find_item(key)
        return self.table[index][1]

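    def __delitem__(self, key):
        index = self._find_item(key)
        self.table[index] = None
        self.length -= 1
        # Re-insert the rest of the cluster: an emptied slot would otherwise
        # cut off keys that were probed past it when they were inserted.
        index = self._increment_key(index)
        while self.table[index] is not None:
            moved_key, moved_value = self.table[index]
            self.table[index] = None
            self.length -= 1
            self[moved_key] = moved_value
            index = self._increment_key(index)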

    def _hash(self, key):
        # djb2 over the characters of key (keys are assumed to be strings)
        h = 5381
        for c in key:
            h = (h * 33) + ord(c)
        return h % self.max_length

    def _increment_key(self, key):
        return (key + 1) % self.max_length

    def _find_item(self, key):
        hashed_key = self._hash(key)
        if self.table[hashed_key] is None:
            raise KeyError
        if self.table[hashed_key][0] != key:
            original_key = hashed_key
            while self.table[hashed_key][0] != key:
                hashed_key = self._increment_key(hashed_key)
                if self.table[hashed_key] is None:
                    raise KeyError
                if hashed_key == original_key:
                    raise KeyError
        return hashed_key

    def _resize(self):
        self.max_length *= 2
        self.length = 0
        old_table = self.table
        self.table = [None] * self.max_length
        for entry in old_table:
            if entry is not None:
                self[entry[0]] = entry[1]