# Hash Table

## 1. Introduction

### Recap - Dictionary

A dictionary is a built-in data structure that stores data as key-value pairs.
* Values are accessed by key.
* Keys must be unique in a dictionary.

#### Construct a Dictionary

In [1]:
d = {'name':'Mark', 'gender':'Male', 'address':'Singapore'}

#### Access an Item

In [2]:
d['name']

#### Update an Item

In [3]:
d['name'] = 'Kelvin'

#### Add an Item

In [4]:
d['age'] = 18

#### Hashable Key

If you try to use a list as a key, Python raises a `TypeError` exception, because a list is not hashable.

This is a hint that a dictionary internally uses a **hash table** structure to store its data.

In [5]:
d1 = {['weight', 'height']: (75, 170)}

TypeError: unhashable type: 'list'
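By contrast, an immutable tuple is hashable and is accepted as a key. A minimal sketch (the names `d2` and `key_error` are illustrative and not part of the original notebook):

```python
# A tuple is immutable and hashable, so it works as a dictionary key.
d2 = {('weight', 'height'): (75, 170)}
print(d2[('weight', 'height')])

# A list is mutable and unhashable, so the same construction fails.
try:
    d1 = {['weight', 'height']: (75, 170)}
except TypeError as err:
    key_error = str(err)
    print('TypeError:', key_error)
```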

### Hash Table

A **hash table** is a data structure that maps **keys** to **values (data)**. This is similar to a dictionary.

<u>For example</u>, to store a phone book, you use a person's name as the key, and the phone number is the data to be looked up.

A **hash function** is a function that takes in a value and generates another value.
* The key is passed into the **hash function** to generate an index value, which points to the location where the data is stored.

A **bucket** is the place where data is stored.
* Multiple data items may be stored in the same bucket, i.e. multiple keys may point to the same bucket.


<img src="images/hash_table_wiki.png" width=300 />
<center>https://en.wikipedia.org/wiki/Hash_table</center>

## 2. Basic Hash Table

Let's implement a hash table for a phone book. Each entry in the phone book is a pair of `Name` and `Phone`.
* `Name` is used as the key.
* `(Name, Phone)` tuple is saved as the data.

### Hash Table

We will define a class `HashTable` to store the data.
* It has a list attribute `buckets` which keeps all data.
* The input parameter `size` sets the length of the list, i.e. the number of buckets.
* It has a <u>static</u> function `_hash()` which returns an `index` value based on the input parameter `key`.
* The `index` value specifies which bucket the data goes into.

**Hash Function**

The logic of the `_hash()` function is straightforward: it simply returns the length of `key` as the `index` value.


In [None]:
class HashTable:
    
    def __init__(self, size):
        self.buckets = [None]*size
    
    @staticmethod
    def _hash(key):
        return len(key)
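
One limitation of this hash function is worth noting: because the index is the raw key length, any key whose length is at least `size` falls outside `buckets`. A minimal sketch of one common remedy, reducing the index modulo the number of buckets, is shown here. The function name `bounded_hash` and its `size` parameter are hypothetical and not part of the class above.

```python
def bounded_hash(key, size):
    # Wrap the key length into the valid index range 0 .. size-1
    return len(key) % size


buckets = [None] * 10
name = 'Bartholomew'            # 11 characters, too long for raw len() indexing

try:
    buckets[len(name)] = name   # index 11 is out of range for 10 buckets
except IndexError:
    print('len(key) overflowed the table')

idx = bounded_hash(name, len(buckets))
buckets[idx] = name
print(idx)                      # 11 % 10 = 1
```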
    

### Test

#### Test - Add Items

Let's add the following items to the `HashTable`.
* Create a hash table of 10 buckets.
* For each contact, 
    * Use `_hash()` function to find out which bucket it belongs to;
    * Put the contact in the bucket.
* Print out the `buckets` to view how contacts are stored.

```python
contacts = [
    ('Ben', '357-0394'),
    ('Alan', '558-9171'),
    ('Freddi', '760-2466'),
    ('Stephanie', '299-5109')]
```

In [None]:
ht = HashTable(10)
print(ht.buckets)

contacts = [
    ('Ben', '357-0394'),
    ('Alan', '558-9171'),
    ('Freddi', '760-2466'),
    ('Stephanie', '299-5109')]

for c in contacts:
    idx = HashTable._hash(c[0])
#     print(c[0], idx)
    ht.buckets[idx] = c

print(ht.buckets)

<img src="images/hash_function_good.png" width=160/>

In this case, each contact sits in its own bucket, so the time spent in finding an item is **O(1)**.

#### Test - Find an Item

With the populated hash table, how do you retrieve the data for a name, e.g. `'Freddi'`?
* Use `_hash()` function to find `index` value.
* Locate the bucket by index.
* Return the bucket.

In [None]:
idx = HashTable._hash('Freddi')
ht.buckets[idx] 

#### Test - Remove an Item

We may need to remove an item, e.g. `'Freddi'`, from the hash table.
* Use `_hash()` function to find `index` value.
* Locate the bucket by index and set it to `None`.

In [None]:
idx = HashTable._hash('Freddi')
ht.buckets[idx] = None

ht.buckets

### Support Basic Operations

A hash table class commonly implements methods to support the **add**, **find** and **remove** operations.

Using what we did in the previous section, enhance the `HashTable` class by implementing `add(key, data)`, `find(key)` and `remove(key)` methods.

In [None]:
class HashTable:
    
    def __init__(self, size):
        self.buckets = [None]*size
    
    @staticmethod
    def _hash(key):
        return len(key)
    
    def add(self, key, data):
        idx = self._hash(key)
        self.buckets[idx] = data
    
    def find(self, key):
        idx = self._hash(key)
        return self.buckets[idx]
    
    def remove(self, key):
        idx = self._hash(key)
        self.buckets[idx] = None
    

<u>Test:</u>

In [None]:
contacts = [
    ('Ben', '357-0394'),
    ('Alan', '558-9171'),
    ('Freddi', '760-2466'),
    ('Stephanie', '299-5109')]

table = HashTable(10)

for c in contacts:
    table.add(c[0], c)

print(table.buckets)

In [None]:
table.find('Freddi')

In [None]:
table.remove('Freddi')
print(table.buckets)

## 3. Better Hash Table

### Support Multiple-Items Bucket

What if we need to store the following data in the hash table?

```python
contacts = [
    ('Amanda', '357-0394'),
    ('Christ', '558-9171'),
    ('Freddi', '760-2466'),
    ('Steven', '299-5109')]
```

Since every contact's name is 6 characters long, all the hashed indexes point to the same bucket. Thus the bucket at index 6 needs to be able to hold multiple contacts.
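
To see why, the following sketch feeds these contacts to the basic table from Section 2. The class name `BasicHashTable` is only an illustrative rename of that earlier class. Each `add()` overwrites the previous occupant of the bucket at index 6, so only the last contact survives.

```python
class BasicHashTable:
    """Single-item buckets, as in the basic hash table of Section 2."""

    def __init__(self, size):
        self.buckets = [None] * size

    @staticmethod
    def _hash(key):
        return len(key)

    def add(self, key, data):
        # A second key with the same index silently replaces the first
        self.buckets[self._hash(key)] = data

    def find(self, key):
        return self.buckets[self._hash(key)]


contacts = [
    ('Amanda', '357-0394'),
    ('Christ', '558-9171'),
    ('Freddi', '760-2466'),
    ('Steven', '299-5109')]

basic = BasicHashTable(10)
for c in contacts:
    basic.add(c[0], c)

print(basic.buckets[6])       # ('Steven', '299-5109')
print(basic.find('Amanda'))   # also ('Steven', '299-5109'): Amanda's entry is lost
```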

Whichever structure a bucket uses, we still need to scan all items in it to find a key. A **linked-list** implementation is a common choice, as it can be more memory efficient.

For simplicity, we will implement a bucket as a Python list.
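
For comparison, a linked-list bucket might look like the sketch below. The names `ListNode` and `LinkedBucket` are hypothetical and are not used elsewhere in this notebook. The point is only that a lookup still walks the chain one node at a time.

```python
class ListNode:
    def __init__(self, key, data, next_node=None):
        self.key = key
        self.data = data
        self.next = next_node


class LinkedBucket:
    def __init__(self):
        self.head = None

    def add(self, key, data):
        # Reject duplicate keys, otherwise prepend the new node in O(1)
        if self.find(key) is not None:
            return False
        self.head = ListNode(key, data, self.head)
        return True

    def find(self, key):
        # Walk the chain until the key matches or the chain ends
        node = self.head
        while node is not None:
            if node.key == key:
                return node.data
            node = node.next
        return None


bucket = LinkedBucket()
bucket.add('Amanda', '357-0394')
bucket.add('Freddi', '760-2466')
print(bucket.find('Freddi'))   # 760-2466
print(bucket.find('Steven'))   # None
```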

### Hash Node

The data can be of any type. To make the hash table methods usable with any data type, it is better to use one common type for every item in the hash table.

We will create a class `HashNode` where each data element is stored in a `HashNode` object. 

Define a `HashNode` class with instance attributes `key` and `data`.
* Implement its `__init__()` function to initialize `key` & `data`.
* Implement its `__str__()` function to return `data` in string format.
* Implement its `__repr__()` function to return same value as `__str__()`.
* Implement its `__eq__()` function to compare 2 nodes by their `key`.

In [None]:
class HashNode:
    
    def __init__(self, key, data):
        self.key = key
        self.data = data
        
    def __str__(self):
        return str(self.data)
        
    def __repr__(self):
        return self.__str__()
        
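    def __eq__(self, other):
        # Nodes are equal when their keys match; other types are not comparable
        if not isinstance(other, HashNode):
            return NotImplemented
        return self.key == other.key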
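
Defining `__eq__()` on the key is what lets Python's list operations locate a node from the key alone. A short sketch, repeating a trimmed copy of the class so that it runs on its own (the variable names `bucket` and `probe` are illustrative):

```python
class HashNode:

    def __init__(self, key, data):
        self.key = key
        self.data = data

    def __repr__(self):
        return str(self.data)

    def __eq__(self, other):
        return self.key == other.key


bucket = [HashNode('Amanda', ('Amanda', '357-0394')),
          HashNode('Freddi', ('Freddi', '760-2466'))]

# A probe node carries only the key; its data plays no part in equality
probe = HashNode('Freddi', None)

print(probe in bucket)                    # True: `in` relies on __eq__
print(bucket.index(probe))                # 1
print(bucket[bucket.index(probe)].data)   # ('Freddi', '760-2466')
```

This is the same pattern that `find()` and `remove()` in the next section rely on when they build `HashNode(key, None)`.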


### Hash Table with Hash Node

Modify the `HashTable` class with the following enhancements:
* Implement each bucket as a list.
* Use `HashNode` to hold data.

In [None]:
class HashTable:
        
    def __init__(self, size):
        self.buckets = [None]*size
    
    @staticmethod
    def _hash(key):
        return len(key)
    
    def add(self, key, data):
        node = HashNode(key, data)
        idx = HashTable._hash(key)
        
        if self.buckets[idx] is None:
            self.buckets[idx] = [node]
            return True
        elif node in self.buckets[idx]:
            return False
        else:
            self.buckets[idx].append(node)
            return True
    
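    def find(self, key):
        idx = HashTable._hash(key)
        bucket = self.buckets[idx]
        
        # An empty bucket is stored as None and cannot be searched
        if bucket is None:
            return None
        
        node = HashNode(key, None)
        if node in bucket:
            i = bucket.index(node)
            return bucket[i].data
        else:
            return None
        
    def remove(self, key):
        idx = HashTable._hash(key)
        bucket = self.buckets[idx]
        
        # Nothing to remove from an empty bucket
        if bucket is None:
            return False
        
        node = HashNode(key, None)
        if node in bucket:
            bucket.remove(node)
            # Reset an emptied bucket to None, matching its initial state
            if len(bucket) == 0:
                self.buckets[idx] = None
            return True
        else:
            return False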
    

#### <u>Test:</u>

Test the basic `add()`, `find()` and `remove()` functions.

In [None]:
contacts = [
    ('Amanda', '357-0394'),
    ('Christ', '558-9171'),
    ('Freddi', '760-2466'),
    ('Steven', '299-5109')]

table = HashTable(10)

for c in contacts:
    table.add(c[0], c)

table.buckets

In [None]:
table.find('Freddi')

In [None]:
table.remove('Freddi')
table.remove('Amanda')
table.remove('Christ')
table.remove('Steven')

table.buckets

## 4. Importance of Hash Function

### Hash Table Collision

Ideally, the hash function assigns each key to a unique bucket. But since a hash function maps a large set of possible keys onto a small range of index values, there is a possibility that two keys result in the same value. This is a **hash table collision**.

In the previous example, the hash function generates the same index value for all entries, so all data is stored in the same bucket.
<img src="images/hash_function_bad.png" width=350/>

This is the worst case, where a hash table acts like a list and searching takes **O(n)** time. To improve efficiency, we need a better hash function.

### Good Hash Function

To achieve a good hashing mechanism, it is important to have a good hash function that meets the following basic requirements:

**Easy to Compute**
* The hash value of a key should be quick and easy to compute.

**Less Collision**
* A collision occurs when different keys hash to the same index. The hash function should produce as few collisions as possible. Since collisions are bound to occur, we also need an appropriate collision resolution technique to handle them.

**Uniform Distribution**
* The hash function should distribute data uniformly across the hash table and thereby prevent clustering.
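
Uniformity can be checked empirically by counting how many keys land in each bucket. The sketch below does this for the length-based hash from Section 2. The helper names `length_hash` and `bucket_counts` are assumptions made for illustration.

```python
from collections import Counter


def length_hash(key):
    # The hash function used so far: index = length of the key
    return len(key)


def bucket_counts(keys, hash_func):
    # Map each bucket index to the number of keys that landed there
    return Counter(hash_func(k) for k in keys)


names = ['Amanda', 'Christ', 'Freddi', 'Steven']
counts = bucket_counts(names, length_hash)

print(dict(counts))           # {6: 4}: every key collides in one bucket
print(max(counts.values()))   # 4, the longest chain a lookup may have to scan
```

Passing a different hash function to `bucket_counts()` gives a quick way to compare candidates on the same set of keys.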


### Hash Function v2

Python provides a `hashlib` module that implements several cryptographic hashing algorithms. These hash functions take a byte sequence of any length and convert it into a fixed-length sequence. They include:

* md5
* sha1
* sha224
* sha256
* sha384
* sha512



The following code converts the string `hello World` to an integer value, then reduces it modulo 1000.

In [None]:
import hashlib

i = int(hashlib.md5('hello World'.encode('utf-8')).hexdigest(),16)
i % 1000

We can enhance the `_hash()` function in the `HashTable` class.

In [None]:

def _hash(key):
    bins = 10
    i = int(hashlib.md5(key.encode('utf-8')).hexdigest(),16)
    return i % bins
    

The above `_hash()` function gives a better result than using the length of the string.

In [None]:
print(_hash('Amanda'), _hash('Christ'), _hash('Freddi'),_hash('Steven'))
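
Note that the function above hard-codes `bins = 10`. One way to fold it into the class is sketched here, under the assumption that the hash becomes an instance method so it can read the real table size. The class name `HashTableV2` is hypothetical, and buckets hold plain `(key, data)` tuples instead of `HashNode` objects to keep the sketch short.

```python
import hashlib


class HashTableV2:

    def __init__(self, size):
        self.buckets = [None] * size

    def _hash(self, key):
        # MD5 digest of the key, read as an integer, wrapped to the table size
        digest = hashlib.md5(key.encode('utf-8')).hexdigest()
        return int(digest, 16) % len(self.buckets)

    def add(self, key, data):
        idx = self._hash(key)
        if self.buckets[idx] is None:
            self.buckets[idx] = []
        # Skip keys that are already present in the bucket
        if any(k == key for k, _ in self.buckets[idx]):
            return False
        self.buckets[idx].append((key, data))
        return True

    def find(self, key):
        bucket = self.buckets[self._hash(key)] or []
        for k, data in bucket:
            if k == key:
                return data
        return None


table = HashTableV2(10)
for name, phone in [('Amanda', '357-0394'), ('Christ', '558-9171'),
                    ('Freddi', '760-2466'), ('Steven', '299-5109')]:
    table.add(name, phone)

print(table.find('Freddi'))   # 760-2466
print(table.find('Nobody'))   # None
```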
    

### Don't Use `hash()`

Python has a built-in hashing function `hash()` which can be applied to any hashable object. It returns an integer in the range `-2**31` to `2**31 - 1` on a 32-bit system, and `-2**63` to `2**63 - 1` on a 64-bit system.

But starting from Python version 3.3, "for security reasons", `hash()` by default generates different values for `str` and `bytes` objects in different Python sessions.


In [None]:
hash('hello')
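
The per-session variation comes from hash randomization, which the `PYTHONHASHSEED` environment variable controls. The sketch below launches fresh interpreters with a fixed seed to show that the value then repeats. The helper name `hash_in_new_session` is hypothetical.

```python
import os
import subprocess
import sys


def hash_in_new_session(text, seed):
    # Run hash(text) in a separate Python process with a chosen hash seed
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, '-c', f'print(hash({text!r}))'],
        env=env, capture_output=True, text=True, check=True)
    return int(result.stdout)


first = hash_in_new_session('hello', 0)
second = hash_in_new_session('hello', 0)
print(first == second)   # True: seed 0 disables randomization
```

Without a fixed seed, separate sessions will almost always print different values, which is why `hash()` is a poor fit when bucket indexes must be reproducible across runs.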

### Summary

The performance of a hash table depends on the following factors:
1. How evenly the hash function distributes the keys over the hash table
1. The size of the hash table

In this example, we store all the data inside the hash table. In practice, we store pointers to the actual records, which could be in memory or in permanent storage (such as disk).