### Phone book implementation with Hashmap

- Having discussed hash tables, maps, and sets, how do we implement a phone book?

- Requirements:
    1. Add and delete contacts fast
    2. Call person by name
    3. Determine who is calling given their phone number

- Quite clearly, requirements 2 and 3 are different lookups: one looks up a number using the name as the key, the other looks up a name using the number as the key
    - So we need 2 maps; `number -> name` and `name -> number`
    - Both of these are hash tables
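
As a rough illustration of the two-map idea, here is a minimal sketch that uses Python's built-in `dict` (itself a hash table) for both directions. The class name `PhoneBook` and its method names are made up for this example and are not part of the notes.

```python
class PhoneBook:
    """Hypothetical phone book backed by two hash tables (Python dicts)."""

    def __init__(self):
        self.name_to_number = {}
        self.number_to_name = {}

    def add(self, name, number):
        # Drop stale entries first so the two maps stay consistent
        self.delete(name)
        old_owner = self.number_to_name.pop(number, None)
        if old_owner is not None:
            self.name_to_number.pop(old_owner, None)
        self.name_to_number[name] = number
        self.number_to_name[number] = name

    def delete(self, name):
        number = self.name_to_number.pop(name, None)
        if number is not None:
            self.number_to_name.pop(number, None)

    def number_of(self, name):
        # Requirement 2: look up a number by name
        return self.name_to_number.get(name)

    def who_is_calling(self, number):
        # Requirement 3: look up a name by number
        return self.number_to_name.get(number)


book = PhoneBook()
book.add("Alice", 1482567)
print(book.number_of("Alice"))       # 1482567
print(book.who_is_calling(1482567))  # Alice
```

Every update has to touch both maps, otherwise a deleted contact could still show up as a caller.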

- Recall the following from previous notes:
    - Jargon
        - $n$: Number of objects in universe to store
        - $m$: Cardinality of hash function
        - $c$: Longest chain length
    - Asymptotics
        - $\Theta(n+m)$  memory
        - $\Theta(c+1)$  time

- We want to keep $m$ and $c$ as small as possible!
- We further know that $c \ge \frac{n}{m}$
    - The smallest $c$ you can get is if you evenly divide all $n$ objects between the $m$ chains

### What is a good hash function for phone number?

- Options
    - First 3 digits? Bad, because area code is often the same, so you get large $c$
    - Last 3 digits? Might be bad, if many numbers end with `000`
    - Random? Good distribution guaranteed, but the hash is not repeatable, so we could never find a stored key again!
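
To make the first two options concrete, the sketch below measures the longest chain each of them produces on two made-up sets of 1000 phone numbers. The data sets and the helper `longest_chain` are assumptions chosen purely for illustration.

```python
from collections import Counter

def longest_chain(keys, hash_fn):
    """Length of the longest chain when every key is stored at hash_fn(key)."""
    return max(Counter(hash_fn(k) for k in keys).values())

first_three = lambda x: int(str(x)[:3])  # first 3 digits
last_three = lambda x: x % 1000          # last 3 digits

# Made-up data set 1: 1000 numbers that all share the prefix 555
numbers = [5550000 + 7 * i for i in range(1000)]
# Made-up data set 2: 1000 numbers that all end in 000
round_numbers = [1000 * i for i in range(1000, 2000)]

print(longest_chain(numbers, first_three))        # 1000
print(longest_chain(numbers, last_three))         # 1
print(longest_chain(round_numbers, first_three))  # 10
print(longest_chain(round_numbers, last_three))   # 1000
```

Each fixed function looks fine on one data set and collapses into a single chain on the other.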

- Remember, we want our hash function to be 
    - Deterministic (i.e. for a given value the computed hash is always the same)
    - Fast to compute
    - Distributes keys well into different cells
    - Few collisions

- **Problem**: if the number of possible keys is much bigger than the cardinality of the hash function, $|S| \gg m$, then for any fixed hash function $h$ there is a bad input on which many keys collide

### Universal family

- So if no single hash function that exists can give us the desired case of few collisions, we rely instead on a `universal family` of hash functions
    - It is similar to the quicksort idea, where choosing pivot randomly helps us get better performance asymptotically!

- We will choose a family (set) of hash functions, then choose a random one from the family

- Formally
    - Let $U$ be the **universe** i.e. set of all possible keys that we want to hash
    - Let $\mathbb{H} = \{h: U \rightarrow \{0,1,...m-1\} \}$ be a set of hash functions
    - $\mathbb{H}$ is a **universal family** if for any two keys $x,y \in U \text{ and } x \neq y$, the probability of collision is at most $\frac{1}{m}$
    $$Pr[h(x) = h(y)] \le \frac{1}{m}$$
    
- Intuitively, it just means that if I randomly pick some hash function from this set, and compute $h(x)$ and $h(y)$ for a specific pair $(x, y)$, I have at most $\frac{1}{m}$ probability of collision
    - Just as an example, if I uniformly pick a random hash function for x, and another for y, this gives us collision with probability $\frac{1}{m}$
    - Of course this doesn't work, because then the hashing isn't deterministic. It is just to illustrate the idea
    - In actual implementation, we will use the same $h$ throughout the algorithm

### Load Factor

- Let's discuss 1 more concept, called the load factor $\alpha$
    - $\alpha = \frac{n}{m}$
    - It is simply the ratio between the number of objects and cardinality of the hash 

- Theorem: If we choose $h$ randomly from universal family, the average length of the longest chain $c$ is $O(1 + \alpha)$, where $\alpha=\frac{n}{m}$ is the load factor of the table
    - That is; if $h$ is from the universal family, operations with hash table run on average in time $O(1 + \alpha)$

- So effectively, our problem reduces to choosing a good $\alpha$, which we can do by choosing a good $m$
    - Ideally, we want $0.5 \lt \alpha \lt 1$
    - Once alpha is chosen, the memory we need is automatically $O(m) = O(\frac{n}{\alpha}) = O(n)$
    - Operations run in $O(1+\alpha) = O(1)$ time 

### Dynamic Hash Table

- We often don't know the size of $n$ in advance
- So instead of wasting space by starting with a big hash table, we can use the idea of dynamic arrays
    - Start with hash table of some size, and resize when $\alpha$ becomes too large
    - Then choose a new hash function from universal set, and rehash all objects

- `Rehash` is technically $O(n)$ when it actually rebuilds the table, but since that happens rarely, its cost averages out to $O(1)$ per operation
    - This kind of analysis, which spreads the cost of expensive but infrequent operations over the whole sequence of operations, is called **amortized time complexity**

In [1]:
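def Rehash(hash_table, hash_fn, choose_hash, max_load=0.9):
    '''
    hash_table: list of chains, each chain a list of (key, value) pairs
    choose_hash(m): returns a random hash function from the universal family with cardinality m
    Returns the table and hash function to use from now on (unchanged if no resize was needed)
    '''
    num_items = sum(len(chain) for chain in hash_table)
    load_factor = num_items / len(hash_table)

    # Nothing to do while the table is not too full
    if load_factor <= max_load:
        return hash_table, hash_fn

    # Double the number of chains and draw a fresh hash function
    new_size = 2 * len(hash_table)
    hash_table_new = [[] for _ in range(new_size)]
    hash_new = choose_hash(new_size)
    for chain in hash_table:
        for key, value in chain:
            hash_table_new[hash_new(key)].append((key, value))

    return hash_table_new, hash_new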
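
To see why doubling keeps the amortized cost constant, the sketch below only counts how many items get moved by rehashing over a run of insertions. It does not build a real table. The starting size of 8 and the function name `total_rehash_moves` are assumptions made for this example; the 0.9 threshold is the one used above.

```python
def total_rehash_moves(n_inserts, initial_size=8, max_load=0.9):
    """Count how many items are moved by rehashing over n_inserts insertions."""
    size, items, moves = initial_size, 0, 0
    for _ in range(n_inserts):
        items += 1
        if items / size > max_load:
            moves += items  # every stored item is rehashed into the new table
            size *= 2       # the table doubles, so the next resize is far away
    return moves

for n in (10, 100, 1000, 10000):
    print(n, total_rehash_moves(n))
```

The resize points roughly double each time, so the total work is a geometric sum: the number of moves stays below $3n$ in these runs, which is $O(1)$ per insertion on average.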

### So what is a universal family?

- No single hash function is universal, but there is a **universal family** of hash functions for the integers in any finite range. Let's see what it looks like in our phone book example

- Assume phone numbers have up to 15 digits. So any universal family we define must accept integers up to $10^{15} - 1$

- Universal family
    - Let $\mathbb{H}_p = \{h_p^{a,b}(x) = ((ax+b) \mod p) \mod m \}$ be a set of hash functions
    - Assume $a,b: 1 \le a \le p-1, 0 \le b \le p-1$
    - $\mathbb{H}_p$ is a **universal family** for the set of integers between 0 and $p-1$ for any prime number $p$

    - You choose some values $a, b$ to generate the hash function. Since there are $p$ choices of $b$, and $p-1$ choices of $a$, there are $p(p-1)$ total hash functions that can be generated
    - You choose a prime number $p$, which has to be larger than the value you wish to hash
    - $m$ is the cardinality of the hash function

    - This is super fast to compute, and scales quite easily!

    - Example: Let $a=34, b=2, p=10,000,019, m=1000$
        - Assume we want to hash $x=1482567$
        - $(34 * 1482567 + 2) \mod 10,000,019 = 407,185$
        - $407185 \mod 1000 = 185$
        - $h(x) = 185$

- Why does this work?
    - Just assume it works, the proof is too involved
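
Short of a proof, the claim can at least be checked by brute force for a small case. The sketch below tries every pair $(a, b)$ for every pair of distinct keys and confirms that the collision fraction never exceeds $\frac{1}{m}$. The values $p = 31$ and $m = 5$ are arbitrary small choices for illustration, and `collision_fraction` is a hypothetical helper.

```python
from itertools import combinations

def collision_fraction(x, y, p, m):
    """Fraction of (a, b) pairs for which h(x) == h(y), where h(k) = ((a*k + b) % p) % m."""
    collisions = sum(
        ((a * x + b) % p) % m == ((a * y + b) % p) % m
        for a in range(1, p)   # 1 <= a <= p-1
        for b in range(p)      # 0 <= b <= p-1
    )
    return collisions / (p * (p - 1))

p, m = 31, 5  # small illustrative values; p must be prime
worst = max(collision_fraction(x, y, p, m) for x, y in combinations(range(p), 2))
print(worst <= 1 / m)  # True
```

This is only a sanity check for one small prime, not a substitute for the proof.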

- In general
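    - Define the maximum length $L$ of the input (e.g. $L = 15$ digits for the phone numbers above; the worked example uses 7-digit numbers, which is why $p = 10,000,019 \gt 10^7$ is enough there)
    - Convert input to integers between 0 and $10^L - 1$. The upper end is $10^L - 1$ because that is the largest number with $L$ digits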
    - Choose a prime number $p \gt 10^L$
    - Choose a hash table cardinality $m$
    - Choose random hash function from universal family (i.e. choose random values of $a$ and $b$)
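
The recipe above could be sketched as follows. The function name `make_int_hash` and its optional `a`, `b` arguments are assumptions for illustration; fixing `a` and `b` by hand is only useful for reproducing the worked example, since in practice they are drawn at random.

```python
import random

def make_int_hash(p, m, a=None, b=None, rng=random):
    """Return h(x) = ((a*x + b) % p) % m, drawing a and b at random if not given."""
    a = rng.randint(1, p - 1) if a is None else a  # 1 <= a <= p-1
    b = rng.randint(0, p - 1) if b is None else b  # 0 <= b <= p-1

    def h(x):
        if not 0 <= x < p:
            raise ValueError("key must lie in [0, p)")
        return ((a * x + b) % p) % m

    return h

# Reproduce the worked example with a=34, b=2, p=10,000,019, m=1000
h = make_int_hash(p=10_000_019, m=1000, a=34, b=2)
print(h(1482567))  # 185

# Normal use: pick one random member of the family and keep using it
h_random = make_int_hash(p=10_000_019, m=1000)
print(h_random(1482567) == h_random(1482567))  # True, a fixed h is deterministic
```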

### Hashing non-integers

- We stated above a method to hash numbers. But clearly, the universal family doesn't work for non-integer inputs. What if we want to implement a hash table lookup `name` to `number`?

- Definitions
    - Let $|S|$ be the length of a string
    - Let $S = S[0]S[1]...S[|S|-1]$ where $S[i]$ are individual characters

- To hash the string:
    - We convert each $S[i]$ to an integer using ASCII, Unicode etc
    - Choose a big prime number $p$
    - Then the universal family is $\mathbb{P}_p = \{h_p^x(S) = (\sum_{i=0}^{|S|-1} S[i] x^i) \mod p \}$
        - $p$ is a fixed prime
        - $x$ is an integer with $1 \le x \le p-1$, chosen at random; it is the point at which the *polynomial* with coefficients $S[i]$ is evaluated

- What is the idea here?
    - Basically, we want to generate the hash by taking the polynomial sum of each character in the string
    - Imagine we have some string $S$ where $|S| = 3$
        - Start: Hash = 0
        - Iteration 1: Hash = $S[2] \mod p$
        - Iteration 2: Hash = $((S[2] \mod p) \cdot x + S[1]) \mod p$
        - Iteration 3: Hash = $((((S[2] \mod p) \cdot x + S[1]) \mod p) \cdot x + S[0]) \mod p = (S[2]x^2 + S[1]x + S[0]) \mod p$
        
- Let's implement this

In [None]:
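def PolyHash(string, prime, polynomial):
    '''
    Computes sum(ord(S[i]) * polynomial**i) mod prime using Horner's rule
    Time complexity: O(N) in the length of the string to be hashed
    '''
    hashvalue = 0
    # Walk from the last character so that S[i] ends up multiplied by polynomial**i
    for character in reversed(string):
        hashvalue = ((hashvalue * polynomial) + ord(character)) % prime
    return hashvalue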
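
As a sanity check on the iteration scheme, the sketch below compares a term-by-term evaluation of $\sum_i S[i] x^i \mod p$ with the iterative (Horner) evaluation that starts from the last character. The function names and the values $p = 1,000,000,007$ and $x = 263$ are assumptions picked for illustration.

```python
def poly_hash_direct(s, p, x):
    """Evaluate sum(S[i] * x**i) mod p term by term."""
    return sum(ord(ch) * pow(x, i, p) for i, ch in enumerate(s)) % p

def poly_hash_horner(s, p, x):
    """Same value via Horner's rule, starting from the last character."""
    h = 0
    for ch in reversed(s):
        h = (h * x + ord(ch)) % p
    return h

p, x = 1_000_000_007, 263  # illustrative prime and evaluation point
for name in ["", "a", "Alice", "Bob"]:
    assert poly_hash_direct(name, p, x) == poly_hash_horner(name, p, x)

print(poly_hash_horner("ab", p, x))  # 97 + 98*263 = 25871
```

Horner's rule needs one multiplication per character, whereas the direct form recomputes a power of $x$ for every term.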

### Asymptotics of polyhash

- Assume we have two different strings $s_1$ and $s_2$ of length at most $L + 1$
- If we choose $h$ from $\mathbb{P}_p$ at random (i.e. choose a random value of $x \in [1, p-1]$), the probability of collision $\text{Pr}[h(s_1) = h(s_2)]$ is at most $\frac{L}{p}$
    - Proof idea (the full proof is too involved): $h(s_1) = h(s_2)$ means $x$ is a root of the difference of the two polynomials, and an equation $a_0 + a_1 x + a_2 x^2 + ... + a_L x^L \equiv 0 \pmod p$ with prime $p$ and not all coefficients zero has at most $L$ different solutions $x$

- Of course, the procedure above for polyhash doesn't give us the desired cardinality $m$ yet
    - As you can see, $m$ does not appear in the definition of $\mathbb{P}_p$
    - To fix this, notice that the output of the polyhash $h_x(S)$ is just an integer between 0 and $p-1$
    - So we can feed this output into our earlier integer hash function, giving $h_m(S) = h_{a,b}(h_x(S)) = ((a \cdot h_x(S) + b) \mod p) \mod m$ for any desired cardinality $m$

- Result stated without proof
    - For any 2 strings $s_1, s_2$ with length at most $L+1$ and cardinality $m$, the probability of collision $Pr[h_m(s_1) = h_m(s_2)]$ is at most $\frac{1}{m} + \frac{L}{p}$
    - Hence, if $p \gt m \cdot L$, then $\frac{L}{p} \lt \frac{1}{m}$, so the probability of collision is at most $\frac{2}{m} = O(\frac{1}{m})$

- Running time
    - For $p \gt m \cdot L$, the longest chain length is $c = O(1 + \frac{n}{m}) = O(1 + \alpha)$
    - `PolyHash(S)` will run in $O(|S|)$ time
    - If lengths of names in the phone book are bounded by constant $L$, computing $h(S)$ takes $O(L) = O(1)$
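
Putting the two stages together, a string hash with cardinality $m$ might look like the sketch below. The name `make_string_hash` and the default prime $1,000,000,007$ are assumptions; that prime comfortably satisfies $p \gt m \cdot L$ for a table of 1000 chains and names of ordinary length.

```python
import random

def make_string_hash(m, p=1_000_000_007, rng=random):
    """Return h_m(S) = ((a * h_x(S) + b) % p) % m with random x, a, b."""
    x = rng.randint(1, p - 1)  # evaluation point for the polynomial hash
    a = rng.randint(1, p - 1)  # parameters of the integer hash
    b = rng.randint(0, p - 1)

    def h(s):
        poly = 0
        for ch in reversed(s):  # Horner's rule for sum(S[i] * x**i) mod p
            poly = (poly * x + ord(ch)) % p
        return ((a * poly + b) % p) % m

    return h

# Usage: store a contact in the name -> number table, using chaining
m = 1000
h = make_string_hash(m)
chains = [[] for _ in range(m)]
chains[h("Alice")].append(("Alice", 1482567))
print(dict(chains[h("Alice")])["Alice"])  # 1482567
```

The same `h` must be kept for the lifetime of the table (or until the next rehash), otherwise stored names can no longer be found.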

### TLDR

- Phone book implemented in 2 hash tables: names -> numbers, and numbers -> names
- Both strings and integers can be hashed, with the appropriate hash function
- Search and modification run in $O(1)$ on average with the appropriate hashing!