### Hash Table

A **hash table** (also called a **hash map**) is a data structure that implements the associative array abstract data type: a structure that maps **keys** to **values** (data). A hash table must come with a **hash function** that computes an index value (also called a **hash code**) pointing into an array of **hash buckets**, the place where the data is stored. This is similar to a dictionary.

<center>
<img src="hash-table.png" width="400" align="center"/>
</center>

<u>For example</u>, to store a phone book, you use a person's name as the key, and their phone number is the value (data) to be looked up. Note that **BOTH** the key and the value have to be stored.
- The key is passed into the hash function to generate an index value, which points to the location where the data is stored.
- Several items may end up in the same bucket, i.e. multiple keys may map to the same index.

#### Hash Table Operations
The basic operations of a hash table are adding, finding and removing items.
- `insert(key, value)`: add the `value` data associated with the key `key`
- `find(key)`: return the `value` data stored under the provided `key`
- `remove(key)`: remove `key` and its associated `value` data from the table
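
Python's built-in `dict` is itself a hash table, so the three operations can be tried out before implementing anything. The snippet below is only a sketch of how they map onto `dict`; the names and phone numbers are made up for illustration.

```python
# Illustrative data only: the names and numbers are made up.
phone_book = {}

# insert(key, value)
phone_book["John"] = 98765431
phone_book["Mary"] = 91234567

# find(key): .get() returns None when the key is absent
found = phone_book.get("John")
missing = phone_book.get("Peter")

# remove(key): pop() with a default avoids a KeyError for unknown keys
removed = phone_book.pop("Mary", None)

print(found, missing, removed, len(phone_book))
```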


#### Hash Function (DIFFERENT from a Cryptographic Hash)

A hash function is a function that takes an input (key) and maps it to an address or index in a fixed-size array. It should
-   be deterministic (a given input must always give the same output)
-   be fast to compute
-   distribute keys uniformly to minimize collisions.

A **collision** occurs when two different inputs (keys) produce the same output (hash value). Collisions cannot be avoided entirely, because the set of possible inputs is larger than the range of possible outputs (the indices of a fixed-size array).
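
As a small illustration, the sketch below uses a toy hash (the sum of the character codes modulo the table size) to produce a collision on purpose. The function name `toy_hash` and the table size of 10 are assumptions chosen for this example, not part of the exercise.

```python
def toy_hash(key, table_size):
    """Sum of character codes modulo the table size (a toy hash, not a good one)."""
    return sum(ord(c) for c in key) % table_size

TABLE_SIZE = 10  # hypothetical size chosen for illustration

a = toy_hash("listen", TABLE_SIZE)
b = toy_hash("silent", TABLE_SIZE)
print(a == b)  # anagrams have the same character sum, so they collide

# Pigeonhole: 11 distinct keys cannot occupy 10 indices without sharing one
keys = [f"key{i}" for i in range(TABLE_SIZE + 1)]
indices = {toy_hash(k, TABLE_SIZE) for k in keys}
print(len(indices) < len(keys))
```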

#### Collision Resolution
- **Linear Probing** : When the hash function causes a collision by mapping a new key to a bucket of the hash table that is already occupied by another key, linear probing searches the table for the closest following free location and inserts the new key there. Lookups are performed in the same way, by searching the table sequentially starting at the position given by the hash function, until finding a bucket with a matching key or an empty bucket.

- **Separate Chaining** : each index of the array holds a linked list of key-value pairs. Items that collide are chained together in that bucket's singly linked list, which is traversed to reach the item with the matching key.
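
Separate chaining could be sketched as follows. This is a simplified illustration under stated assumptions, not a reference implementation: a Python list stands in for each linked list, the class name `ChainedTable` is hypothetical, and the hash is a toy sum of character codes.

```python
class ChainedTable:
    """Separate chaining sketch: each bucket is a Python list acting as the chain."""

    def __init__(self, size):
        self._buckets = [[] for _ in range(size)]

    def _index(self, key):
        # toy hash (assumption): sum of character codes modulo the table size
        return sum(ord(c) for c in key) % len(self._buckets)

    def insert(self, key, value):
        chain = self._buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:            # key already present: overwrite its value
                chain[i] = (key, value)
                return
        chain.append((key, value))  # otherwise extend the chain

    def find(self, key):
        # walk the chain of the key's bucket until the key matches
        for k, v in self._buckets[self._index(key)]:
            if k == key:
                return v
        return None


table = ChainedTable(10)
table.insert("listen", 1)
table.insert("silent", 2)   # same character sum, so same bucket
print(table.find("listen"), table.find("silent"), table.find("absent"))
```

Because colliding items live in the same chain, the table never becomes "full"; the cost is that long chains make lookups slower.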



##### Python's built-in hash function
- returns an integer; on 64-bit CPython builds it lies between $-2^{63}$ and $2^{63}-1$
- to use the built-in function, take its result modulo the size of the hash table

> By default, every Python interpreter process salts the hashes of `str` and `bytes` objects with a different random seed, so `hash()` can give a different value for the same string from one run to the next. Within a single run the value does not change.

In [None]:
data = ("John", 98765431) # store (key, value)
hash(data[0])%200

20
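
The index printed by the cell above can therefore differ between interpreter runs. If a reproducible index is needed, one option is to hash the key's bytes with a checksum from the standard library. The sketch below uses `zlib.crc32`; the helper name `stable_index` is an assumption made for this example.

```python
import zlib

def stable_index(key, table_size):
    """Index that is the same in every interpreter run (CRC-32 of the UTF-8 bytes)."""
    return zlib.crc32(key.encode("utf-8")) % table_size

print(stable_index("John", 200))  # same value on every run
print(hash("John") % 200)         # may change between runs
```

Setting the `PYTHONHASHSEED` environment variable to a fixed integer before starting Python is another way to make `hash()` on strings repeatable. Note also that Python's `%` returns a non-negative result for a positive table size, even when `hash()` is negative.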

---
#### Exercise 1

Implement the following array-based `HashTable` encapsulated using OOP.


<center>

|`HashTable`|
|------------------------|
|------------------------|
|`-array: ARRAY OF OBJECT`|
|`-size: INTEGER`|
|------------------------|
|`+constructor(INTEGER)`|
|`+insert(OBJECT): BOOLEAN`|
|`+find(key): OBJECT`|
|`+remove(key): BOOLEAN`|
|`+hash(key): INTEGER` |
|`+__repr__(): STRING`|
|`+get_array(): ARRAY`|

</center>
<br />
<center>

|Attribute/Method descriptions: |  |
|-|-|
| `HashTable.constructor(INTEGER)` | Initialises a `HashTable` with the given size. |
| `HashTable.insert(OBJECT): BOOLEAN` | Inserts an OBJECT into `array` at the index generated by the hash function from the object's key. When a collision occurs, the open addressing strategy of linear probing is used. Returns True on success, otherwise False. |
| `HashTable.find(key): OBJECT` | Returns the object with the given key, or None if it is NOT FOUND. |
| `HashTable.remove(key): BOOLEAN` | Attempts to delete the OBJECT with the given key. Returns True if an object was deleted, otherwise False. |
| `HashTable.hash(key): INTEGER` | See below. |
| `HashTable.get_array(): ARRAY` | Returns the `array` object used by the HashTable. This is used for debugging purposes only. |

</center>
<br/>

**NOTE:**

-   OBJECT is usually implemented as a data structure consisting of two components, a key and a value (data). For example:
    - a `tuple, (key, value)`
    - a `list, [key, value]`
    - a custom object where one of its attributes will be used as a key

##### Task 1

- Implement the class HashTable.
    - implement the following methods first:
        - `constructor`
        - `insert`
        - `hash`
        - `__repr__`
        - `get_array`


- For the initial implementation we shall use a naive hash function that will cause a lot of collisions:

>```
>  // size_of_table is the size of the array used
>  // LEN() returns the length of the string
>  FUNCTION hash(key: STRING) RETURNS INTEGER
>       RETURN LEN(key) MOD size_of_table
>  ENDFUNCTION
>```
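
To see why this hash clusters, the pseudocode can be translated directly into Python and applied to a few names. This is a sketch only: the names below are hypothetical and the standalone function `naive_hash` is not the class method asked for in the task.

```python
from collections import Counter

def naive_hash(key, size_of_table):
    """Direct translation of the pseudocode: string length modulo table size."""
    return len(key) % size_of_table

# Hypothetical names, used only to illustrate the clustering
names = ["Ann Lee", "Tom Lim", "Joe Tan", "Amy Ong", "Benjamin Goh"]
counts = Counter(naive_hash(n, 200) for n in names)
print(counts)
```

Four of the five names have seven characters, so they all compete for index 7, and linear probing has to place them in neighbouring slots.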




In [1]:
## Code for the Class HashTable
class HashTable:
    def __init__(self, size):
        self.__size = size
        self.__array = [None for _ in range(size)]

    def get_array(self):
        return self.__array  ## for debugging

    def hash(self, key):
        # Alternative hash functions to experiment with:
        #1 return len(str(key)) % self.__size   (naive, many collisions)
        #3 return hash(key) % self.__size       (built-in, salted per run)
        #4 return 0                             (worst case, every key collides)
        return sum(ord(c) for c in key) % self.__size

    def insert(self, obj):
        key, value = obj
        index = self.hash(key)
        ## linear probing: i = 0 is the home slot, then step forward
        for i in range(self.__size):
            cur_index = (index + i) % self.__size  # wrap around
            if self.__array[cur_index] is None:
                self.__array[cur_index] = obj
                return True
        print("Hash table full")
        return False

    def find(self, key):
        index = self.hash(key)
        for i in range(self.__size):  # start at the home slot (i = 0)
            cur_index = (index + i) % self.__size
            entry = self.__array[cur_index]
            if entry is None:
                # an empty slot ends the probe sequence; this assumes remove()
                # marks deleted slots instead of resetting them to None
                return None
            if entry[0] == key:
                return entry
        return None

    def __repr__(self):
        # one "index: entry" line per occupied slot
        return "\n".join(f"{i}: {e}" for i, e in enumerate(self.__array) if e is not None)


##### Task 2

- Create a hash table of size 200 to store phone book entries, where each entry in the phone book is a pair of `Name` and `Phone`.

-   `Name` is used as the key.
-   `(Name, Phone)` tuple is saved as the data.

- Read the contents of the file `contacts_50.txt`, which has 50 contacts, and store them in the hash table.

- Print the hash table to verify that the contacts are being stored in the array.


##### Task 3

We shall now observe the clustering produced by the naive hash function.

-   Import the plotting utility provided in `ht_plot.py`.
-   Make sure you have a method `get_array()` implemented in your hash table.
-   Use the following code to visualise the clustering in your hash table.
```python
import ht_plot
contacts = HashTable(201)
## use the code in task 2 to insert records into the contacts hash table
ht_plot.show_cluster(contacts)

```

In [None]:
# Task 2 and Task 3
import csv, ht_plot

contacts = HashTable(201)
# each row of contacts_50.txt is expected to hold a name and a phone number
with open("contacts_50.txt", newline="") as f:
    for name, contact in csv.reader(f):
        contacts.insert((name, contact))
#print(contacts)
ht_plot.show_cluster(contacts)

##### Task 4
You should see a plot like this:

![alt text](image.png)

Modify the naive hash function so that the contents of the hash table are more uniformly distributed.

Run the code to plot the distribution of indices again to verify that your new hash function works.
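
One possible direction, shown here only as a sketch, is a polynomial rolling hash, in which the position of each character affects the result. The function name `poly_hash`, the base 31 and the sample names are assumptions for this example; the contents of `contacts_50.txt` are not reproduced here.

```python
def naive_hash(key, size):
    """Length-based hash from Task 1."""
    return len(key) % size

def poly_hash(key, size, base=31):
    """Polynomial rolling hash: character order affects the result."""
    h = 0
    for c in key:
        h = (h * base + ord(c)) % size
    return h

# Hypothetical sample names, all seven characters long
names = ["Ann Lee", "Tom Lim", "Joe Tan", "Amy Ong", "Lee Ann"]
size = 201
naive_slots = {naive_hash(n, size) for n in names}
poly_slots = {poly_hash(n, size) for n in names}
print(len(naive_slots), len(poly_slots))
```

The naive hash sends every one of these names to the same index, whereas the polynomial hash depends on the characters and their order, so it tends to spread such names over more indices. The plot from Task 3 is the way to check this on the real data.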


##### Task 5

Implement the rest of the methods in `HashTable`:
- `find`
- `remove`

Test your code with:

- `find('Batisah Wong')` -> should return the tuple
- `find('')` -> should return None
- `find('Mr Kwek')` -> should return None
- `remove('Ng Jeremy')` -> should return True
- `remove('Mr Lee')` -> should return False
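
Removal needs some care with linear probing: resetting a slot to `None` can cut the probe chain and hide items stored after it. A common remedy is to leave a "tombstone" marker in the slot. The sketch below shows the idea in a small standalone class; `ProbingTable`, `_DELETED` and the toy hash are hypothetical names and choices for this illustration, and duplicate keys are not handled.

```python
_DELETED = object()  # tombstone marker (a hypothetical name for this sketch)

class ProbingTable:
    def __init__(self, size):
        self._size = size
        self._slots = [None] * size

    def _hash(self, key):
        return sum(ord(c) for c in key) % self._size  # toy hash (assumption)

    def _probe(self, key):
        """Yield slot indices in linear-probing order, starting at the home slot."""
        start = self._hash(key)
        for i in range(self._size):
            yield (start + i) % self._size

    def insert(self, obj):
        for idx in self._probe(obj[0]):
            if self._slots[idx] is None or self._slots[idx] is _DELETED:
                self._slots[idx] = obj   # tombstones can be reused
                return True
        return False

    def find(self, key):
        for idx in self._probe(key):
            entry = self._slots[idx]
            if entry is None:
                return None              # truly empty: key cannot be further on
            if entry is not _DELETED and entry[0] == key:
                return entry
        return None

    def remove(self, key):
        for idx in self._probe(key):
            entry = self._slots[idx]
            if entry is None:
                return False
            if entry is not _DELETED and entry[0] == key:
                self._slots[idx] = _DELETED  # keep the probe chain intact
                return True
        return False


t = ProbingTable(5)
t.insert(("ab", 1))
t.insert(("ba", 2))    # collides with "ab" and lands in the next slot
r1 = t.remove("ab")
f1 = t.find("ba")      # still reachable past the tombstone
r2 = t.remove("zz")
print(r1, f1, r2)
```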

#### Conclusion: Open Addressing Hash Table performance is determined by 3 factors:
1.  Hash function algorithm
2.  Probing algorithm
3.  Load factor (ratio of occupied slots to capacity)
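
The effect of the load factor can be explored with a small experiment. The sketch below inserts random integer keys into a linear-probing table and reports the average number of slots examined per insert. The function name `average_probes`, the modular hash and the fixed seed are assumptions for this illustration, and the printed numbers depend on them.

```python
import random

def average_probes(capacity, n_items, seed=0):
    """Insert n_items random keys with linear probing.

    Returns (load factor, mean number of slots examined per insert).
    """
    if n_items > capacity:
        raise ValueError("cannot store more items than slots")
    rng = random.Random(seed)
    slots = [None] * capacity
    total_probes = 0
    for key in rng.sample(range(10**9), n_items):
        idx = key % capacity            # simple modular hash (assumption)
        probes = 1
        while slots[idx] is not None:   # step forward until a free slot appears
            idx = (idx + 1) % capacity
            probes += 1
        slots[idx] = key
        total_probes += probes
    return n_items / capacity, total_probes / n_items

for n in (50, 100, 150, 190):
    load, probes = average_probes(200, n)
    print(f"load factor {load:.2f}: {probes:.2f} probes per insert")
```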

***Recent research: improved probe-complexity bounds for open-addressing hash tables***

https://arxiv.org/html/2501.02305v1

##### Task 6
Implement a hash table that resolves collisions with separate chaining, using a BST (instead of a linked list) to hold the entries of each bucket.