### Hash Tables

- Used in many areas when we want fast lookup (e.g. phone contacts, search number by name)

- We'll discuss several implementations in this section

### Direct addressing

- Assume we want a hash table for (Singapore) phone numbers, which have 8 digits
    - This gives us 10^8 possible numbers

- In direct addressing, this means that we create an array of size 10^8 to store these numbers. Approach:
    - `def hash(phone_number): ...`: Takes in a phone number and hashes it, returning an index between 0 and 10^8 - 1. We will use this to look up the contact information
    - `def GetName(phone_number): ...` Hash the phone number to get index `i`, and return the contact at index `i` of the phonebook array
    - `def SetName(phone_number, name): ...` Hash the phone number to get index `i`, and assign the `name` at position `i` of the phonebook array

- Asymptotics
    - Time
        - `hash():` O(1)
        - `SetName():` O(1)
        - `GetName():` O(1)
    - Space
        - O(U), where U is the size of the phone number universe
    - Very inefficient! You need to maintain an array of 10^8 size even if you only have 1 phone number
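
A minimal sketch of direct addressing, assuming a toy universe of 4-digit numbers (U = 10^4) so the array stays small. The names `UNIVERSE_SIZE`, `phonebook`, and `direct_index` are illustrative and not part of the notes above.

```python
from typing import Optional

# Toy universe: 4-digit numbers instead of 8-digit ones (an assumption to keep the demo small)
UNIVERSE_SIZE = 10**4

# One slot per possible phone number, whether or not it is in use: O(U) space
phonebook: list[Optional[str]] = [None] * UNIVERSE_SIZE

def direct_index(phone_number: str) -> int:
    """Map a phone number to its own array slot (the number itself is the index)."""
    return int(phone_number)

def SetName(phone_number: str, name: str) -> None:
    phonebook[direct_index(phone_number)] = name  # O(1)

def GetName(phone_number: str) -> Optional[str]:
    return phonebook[direct_index(phone_number)]  # O(1)

SetName("0042", "Alice")
print(GetName("0042"))   # Alice
print(GetName("9999"))   # None (the slot exists but is unused)
print(len(phonebook))    # 10000 slots allocated for a single contact
```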
    

### Other data structures for hashmaps

- Linked lists
    - To resolve the space issue, can we use a different data structure, such as a linked list?
        - Put pairs (Phone number, Name) into doubly linked lists, maintaining `head` and `tail` pointers
        - Adding a contact is now O(1), just append to tail
        - But `GetName` becomes O(N), because you search through the whole linked list!

- Dynamic sorted array
    - Since retrieval from a linked list is slow, what if we put the (phone number, name) pairs into a dynamic array sorted by phone number instead?
    - `GetName` becomes O(log N) by binary search
    - But adding a new contact becomes O(N)! Finding the right spot takes only O(log N), but every element after that spot has to be shifted to make room
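
The trade-off between the two structures can be sketched as follows. This assumes a plain Python `list` stands in for the linked list (append at the tail, linear scan) and that the standard `bisect` module handles the sorted array. All function and variable names are illustrative.

```python
from bisect import bisect_left, insort

# Unsorted sequence of (phone_number, name) pairs, standing in for the linked list
unsorted_contacts: list[tuple[str, str]] = []

def add_unsorted(number: str, name: str) -> None:
    unsorted_contacts.append((number, name))         # O(1): append at the tail

def get_unsorted(number: str):
    for contact_number, name in unsorted_contacts:   # O(N): scan pair by pair
        if contact_number == number:
            return name
    return None

# Dynamic array of (phone_number, name) pairs, kept sorted by phone number
sorted_contacts: list[tuple[str, str]] = []

def add_sorted(number: str, name: str) -> None:
    # Locating the position is O(log N), but shifting later elements is O(N)
    insort(sorted_contacts, (number, name))

def get_sorted(number: str):
    i = bisect_left(sorted_contacts, (number,))      # O(log N) binary search
    if i < len(sorted_contacts) and sorted_contacts[i][0] == number:
        return sorted_contacts[i][1]
    return None

for number, name in [("91234567", "Alice"), ("81234567", "Bob")]:
    add_unsorted(number, name)
    add_sorted(number, name)

print(get_unsorted("81234567"), get_sorted("81234567"))  # Bob Bob
```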

### Hash functions

- Since neither linked lists nor dynamic arrays will suffice, let's see how a **hash function** helps us here!
    - Definition: For any set of objects $S$ and integer $m \gt 0$, a function $h: S \rightarrow \{0, 1, \dots, m-1\}$ is called a hash function
    - $m$ is the **cardinality** of the hash function $h$

- What makes a good hash function?
    - Fast to compute
    - Different values for different objects, as far as possible
    - Small cardinality $m$: we want $m$ small enough that we can direct-address an array of size $m$ using $O(m)$ memory
        - Think of the hash value as a "postal code" of sorts, so we know which region to look in for the desired value
    - Note that it is impossible for all objects to have different hash values if the number of objects $|S|$ exceeds $m$

- What is a collision?
    - When you hash 2 different things and get the same hash, i.e. when $h(o_1) = h(o_2)$ and $o_1 \neq o_2$, this is a collision!
    - If $m$ is small enough, collisions are bound to happen. If $m$ is so large that you have no collisions, this is just direct addressing
    - So some small probability of collision is ok!
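
As a rough illustration of a hash function with cardinality $m$ and of why collisions are unavoidable once $|S| > m$, here is a sketch using a simple polynomial string hash. The function name `string_hash`, the multiplier 31, and the sample names are all assumptions made for the example.

```python
def string_hash(s: str, m: int) -> int:
    """Polynomial hash of a string, reduced to the range 0..m-1."""
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % m   # 31 is an arbitrary illustrative multiplier
    return h

m = 10
names = [f"contact{i}" for i in range(m + 1)]   # m + 1 distinct objects

seen: dict[int, str] = {}
collisions = []
for name in names:
    h = string_hash(name, m)
    if h in seen:
        collisions.append((seen[h], name, h))   # two different names, same hash
    else:
        seen[h] = name

print(all(0 <= string_hash(n, m) < m for n in names))  # True
print(len(collisions) >= 1)                            # True: m + 1 objects cannot fit in m values
```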

### Chaining

- We already know we want a hash function with cardinality $m$ to store data, such that $m \ll |U|$, where $|U|$ is the size of the universe of things to be hashed
- But we also stated that collisions are inevitable in such a case

- How should we deal with this? Set up a **map**
    - Definition: A map from a set $S$ of objects to a set $V$ of values is a data structure with methods `HasKey(object)`, `Get(object)`, and `Set(object, value)`, where $\text{object} \in S$ and $\text{value} \in V$
    - In a `Map` from `S` to `V`, objects from `S` are usually called **keys** of the `Map`. Objects from `V` are the values of the map.

- A map can be implemented with an idea called **chaining**
    - The idea is that you set up an array of $m$ `Chains`, one per possible hash value, where each chain is a doubly linked list
    - The hash points you to the right chain, and you traverse the chain to find the value you want

### Implementation of chaining

In [11]:
from dataclasses import dataclass, field

@dataclass
class PhoneContact:
    # A mutable record (unlike a namedtuple), so Set can update the number in place
    name: str
    phone_number: str

@dataclass
class Chain:
    data: list[PhoneContact] = field(default_factory=list)

# One chain per hash value; must hold m chains, where m is the cardinality of hash()
AllChains: list[Chain] = []

def hash(contact_name):
    # Left abstract: must return an index into AllChains
    ...

def HasKey(contact_name):
    chain = AllChains[hash(contact_name)]
    for contact in chain.data:
        if contact.name == contact_name:
            return True
    return False

def Get(contact_name):
    chain = AllChains[hash(contact_name)]
    for contact in chain.data:
        if contact.name == contact_name:
            return contact
    return None

def Set(contact_name, number):
    chain = AllChains[hash(contact_name)]
    for contact in chain.data:
        if contact.name == contact_name:
            # Key already present: overwrite its value
            contact.phone_number = number
            return
    # Key not found in its chain: add a new entry
    chain.data.append(PhoneContact(contact_name, number))
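
For a version that runs end to end, here is a self-contained sketch of the same idea. It makes several assumptions that are not in the notes: the class name `ChainedMap`, a default of `m = 8` chains, Python lists in place of doubly linked lists, and a deliberately naive hash (sum of code points modulo `m`) chosen so that a collision is easy to produce.

```python
class ChainedMap:
    """Map from contact name to phone number, using chaining."""

    def __init__(self, m: int = 8):
        self.m = m
        # m empty chains; each entry in a chain is a [name, number] pair
        self.chains: list[list[list[str]]] = [[] for _ in range(m)]

    def _hash(self, name: str) -> int:
        # Illustrative hash only: sum of code points modulo m
        return sum(ord(ch) for ch in name) % self.m

    def has_key(self, name: str) -> bool:
        return any(entry[0] == name for entry in self.chains[self._hash(name)])

    def get(self, name: str):
        for entry in self.chains[self._hash(name)]:
            if entry[0] == name:
                return entry[1]
        return None

    def set(self, name: str, number: str) -> None:
        chain = self.chains[self._hash(name)]
        for entry in chain:
            if entry[0] == name:
                entry[1] = number      # key already present: overwrite
                return
        chain.append([name, number])   # new key: extend the chain

book = ChainedMap(m=8)
book.set("ab", "111")
book.set("ba", "222")   # same code-point sum as "ab", so it lands in the same chain
book.set("ab", "333")   # overwrites the earlier value for "ab"
print(book.get("ab"), book.get("ba"), book.has_key("zz"))  # 333 222 False
```

Both colliding keys remain retrievable because the lookup compares the stored key, not just the hash.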


#### Asymptotics

- Time complexity:
    - If $c$ is the length of the longest chain in `Chains`, then the worst-case running time of `HasKey`, `Get`, and `Set` is $\Theta(c+1)$
        - The $+1$ accounts for computing the hash and locating the chain, so the cost is still $O(1)$ when $c=0$
    - Intuition: if the chain corresponding to an object is not empty, we may need to scan it fully to check whether the item is in it before completing any of the 3 operations.

- Space complexity:
    - $\Theta(n)$ to store $n$ pairs of (object, value)
    - $\Theta(m)$ to store $m$ chains
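
To make the role of $c$ concrete, the sketch below counts chain lengths for two hypothetical hash functions over integer keys: one that spreads keys evenly and one that sends every key to the same chain. The helper `longest_chain` and the chosen values of `n` and `m` are assumptions for illustration only.

```python
from math import ceil

def longest_chain(keys, m, h):
    """Length c of the longest chain after hashing every key with h into m chains."""
    counts = [0] * m
    for key in keys:
        counts[h(key)] += 1
    return max(counts)

n, m = 1000, 100
keys = range(n)

spread_out = longest_chain(keys, m, lambda k: k % m)   # keys spread evenly
all_in_one = longest_chain(keys, m, lambda k: 0)       # every key collides

print(spread_out)   # 10, which equals ceil(n / m)
print(all_in_one)   # 1000, i.e. c = n and every lookup is a linear scan
```

No hash function can do better than $c \geq \lceil n/m \rceil$, since $n$ objects are spread over only $m$ chains.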
    

### HashSet

- HashSet
    - The hash set structure is similar to a hash map, except that instead of storing a key-value pair, it stores just a key
    - You should implement the methods `Add`, `Remove`, and `Find`
    - The set can be built on top of a hash map implementation, with every value in `V` simply set to `True`
    - Or each chain can just store a list of keys

In [None]:
from dataclasses import dataclass, field

@dataclass
class Chain:
    data: list = field(default_factory=list)

def hash(obj) -> int:
    # Left abstract: must return an index into Chains
    ...

# One chain per hash value; must hold m chains, where m is the cardinality of hash()
Chains: list[Chain] = []

def Find(obj):
    chain = Chains[hash(obj)]
    for key in chain.data:
        if key == obj:
            return True
    return False

def Add(obj):
    chain = Chains[hash(obj)]
    for key in chain.data:
        if key == obj:
            return  # already present, nothing to add
    chain.data.append(obj)

def Remove(obj):
    if not Find(obj):
        return
    chain = Chains[hash(obj)]
    chain.data.remove(obj)
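
The other option mentioned above, a set built on top of a map whose values are all `True`, might look like the sketch below. It assumes Python's built-in `dict` as the underlying hash map; the class name `MapBackedSet` is made up for the example.

```python
class MapBackedSet:
    """Set of keys stored as a map from key to True."""

    def __init__(self):
        self._map: dict = {}   # Python's built-in dict stands in for the hash map

    def Add(self, key) -> None:
        self._map[key] = True          # adding an existing key changes nothing

    def Remove(self, key) -> None:
        self._map.pop(key, None)       # no error if the key is absent

    def Find(self, key) -> bool:
        return key in self._map

contacts = MapBackedSet()
contacts.Add("Alice")
contacts.Add("Alice")      # duplicate add is a no-op
contacts.Add("Bob")
contacts.Remove("Bob")
contacts.Remove("Carol")   # removing a missing key is a no-op
print(contacts.Find("Alice"), contacts.Find("Bob"))  # True False
```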

### TLDR

- A hash table is an implementation of either a Set or a Map
- Recall jargon
    - $n$: Number of objects currently stored
    - $m$: Cardinality of hash function
    - $c$: Longest chain length
- Asymptotics
    - $\Theta(n+m)$ memory
    - $\Theta(c+1)$ worst-case time per operation