# Hash Tables

### WEB SERVICE EXAMPLE
**Goal** Count the number of times different IPs access my servers during an hour window. 
<div style="column-count: 2;">

  <div align="center">
    <img src="images\web_service.png" width="600" style="display: block; margin-bottom: 20px;">
  </div>

  <div align="center">
    <img src="images\ip_logs.png" width="600" style="display: block; margin-bottom: 20px;">
  </div>
</div>


## Hash Tables
**Use-cases:**
* Python dictionaries
* file systems
* password verification
* storage optimisation


**Definition:** A hash table is an implementation of a set or a map that uses a hash function to decide where each key is stored, so that insertion, deletion and lookup take $O(1)$ time on average.

#### Operations for Web Address Problem
| Operation | Output | Time (array: direct addressing) | Time (list) | Time (hash) |
| :- | :- | :-: | :-: | :-: |
| `UpdateAccessList(p)` <br> • $c = \text{length of longest chain}$ | ... | $O(1)$ | $O(1)$ | $O(c+1)$ |
| `AccessedLastHour()` <br> • $c = \text{length of longest chain}$ | ... | $O(1)$ | $O(n)$ | $O(c+1)$ |
| **Required Memory** <br> • $n = \text{\#active IPs}$ <br> • $N = \text{\#all possible IPs}$ <br> • $m = \text{cardinality of hash function}$ | ... | $O(N)$ <br> • Need $2^{32}$ memory for few IPs <br> • IPv6: $2^{128}$ won't fit in memory | $O(n)$ | $O(n+m)$ |

We want to make $m$ and $c$ as small as possible.
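As a rough illustration of the hash-table column, the sketch below counts accesses per IP with Python's built-in `dict`. The class name `AccessCounter`, the method names, the `(timestamp, ip)` log format and the one-hour window expressed in seconds are all assumptions made for this example, not part of the original notes.

```python
from collections import deque

WINDOW_SECONDS = 3600  # assumed: one-hour window, timestamps in seconds


class AccessCounter:
    def __init__(self):
        self.counts = {}    # ip -> number of accesses inside the window
        self.log = deque()  # (timestamp, ip) pairs, oldest first

    def update_access_list(self, timestamp, ip):
        """Record one access and forget accesses older than the window."""
        self.log.append((timestamp, ip))
        self.counts[ip] = self.counts.get(ip, 0) + 1
        while self.log and self.log[0][0] <= timestamp - WINDOW_SECONDS:
            _, old_ip = self.log.popleft()
            self.counts[old_ip] -= 1
            if self.counts[old_ip] == 0:
                del self.counts[old_ip]

    def accessed_last_hour(self, ip):
        """Average O(1) membership test in the hash table."""
        return ip in self.counts
```

Memory stays proportional to the number of accesses inside the window, rather than to the number of all possible IPs.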

## Hash Function
**Definition:** For any set of objects $S$ and any integer $m>0$, a function $h : S \to \{0, 1, \dots , m-1\}$ is called a **hash function**.  
**Definition:** $m$ is called the **cardinality** of hash function $h$.  
**Definition:** When $h(o_1) = h(o_2)$ and $o_1 \neq o_2$, this is called a **collision**.

**Desirable properties:**
* $h$ should be fast to compute
* $h$ should give different values for different objects (few collisions)
* the table uses direct addressing with $O(m)$ memory, so we want small cardinality $m$
* Note: it is impossible to have all different values if the number of objects $|S|$ is more than $m$ (pigeonhole principle).

<p align="center">
    <img src="_.png" width="450" style="display: inline-block; margin-right: 0px;">
</p>

### Map
**Definition:** A **Map** from $S$ to $V$ is a data structure with methods `HasKey(O)`, `Get(O)`, `Set(O, v)`, where $O \in S, v \in V$.

Store mapping from objects to other objects:
* Filename $\to$ file on disk
* Student ID $\to$ student name
* Contact name $\to$ contact phone number


### Chaining
Chaining is a technique to implement a hash table.  

$h : S \to \{0, 1, \dots , m-1\}$  
$O, O' \in S$  
$v, v' \in V$  
$A \gets$ array of $m$ lists (chains) of pairs $(O, v)$  
<p align="left">
    <img src="images\chaining.png" width="450" style="display: inline-block; margin-right: 0px;">
</p>
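A minimal sketch of a map implemented with chaining, assuming Python's built-in `hash()` stands in for $h$ and that `get` returns `None` for a missing key (both are choices made for this example). The class name `ChainedMap` is hypothetical.

```python
class ChainedMap:
    def __init__(self, m=8):
        self.m = m
        self.chains = [[] for _ in range(m)]  # A: array of m chains of (O, v) pairs

    def _chain(self, key):
        # Python's % is non-negative for positive m, so this is a valid index
        return self.chains[hash(key) % self.m]

    def has_key(self, key):
        return any(k == key for k, _ in self._chain(key))

    def get(self, key):
        for k, v in self._chain(key):
            if k == key:
                return v
        return None  # assumed convention for "not found"

    def set(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)  # overwrite the existing pair
                return
        chain.append((key, value))
```

Every operation scans a single chain, which is where the $O(c+1)$ bound comes from.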

### Set

**Definition:** A **Set** is a data structure with methods `Add(O)`, `Remove(O)`, `Find(O)`.  

$h : S \to \{0, 1, \dots , m-1\}$  
$O, O' \in S$  
$A \gets$ array of $m$ lists (chains) of objects $O$ 

**EXAMPLES**
* IPs accessed during last hour
* Students on campus
* Keywords in a programming language
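The set variant differs from the map only in that each chain stores bare objects. A short sketch under the same assumptions (built-in `hash()` as $h$, hypothetical class name `ChainedSet`):

```python
class ChainedSet:
    def __init__(self, m=8):
        self.m = m
        self.chains = [[] for _ in range(m)]  # A: array of m chains of objects

    def _chain(self, obj):
        return self.chains[hash(obj) % self.m]

    def find(self, obj):
        return obj in self._chain(obj)

    def add(self, obj):
        chain = self._chain(obj)
        if obj not in chain:  # keep each object at most once
            chain.append(obj)

    def remove(self, obj):
        chain = self._chain(obj)
        if obj in chain:
            chain.remove(obj)
```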

## Hash Table

**Definition:** An implementation of a **set** or a **map** using hashing is called a hash table.

**Set:**
* `unordered_set` in C++
* `HashSet` in Java
* `set` in Python

**Map:**
* `unordered_map` in C++
* `HashMap` in Java
* `dict` in Python




### Phone Book Problem
Design a data structure to store your contacts: names of people along with their phone numbers. The data structure should be able to do the following quickly:
* Add and delete contacts,
* Lookup the phone number by name,
* Determine who is calling given their phone number.

We need to implement two Maps as hash tables.  
1. $\text{phone number} \to \text{name}$, with the phone number converted to an integer, e.g. `123-45-67` $\to$ `1234567`

2. $\text{name} \to \text{phone number}$
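One way to picture the two maps is with two Python `dict`s kept in sync. This is only a sketch: the class `PhoneBook`, its method names, and the assumption that each number belongs to exactly one contact are illustrative choices.

```python
class PhoneBook:
    def __init__(self):
        self.name_by_number = {}  # phone number (int) -> name
        self.number_by_name = {}  # name -> phone number (int)

    @staticmethod
    def to_int(phone):
        """Convert e.g. '123-45-67' to 1234567."""
        return int(phone.replace("-", ""))

    def add(self, name, phone):
        self.delete(name)  # drop any previous number stored for this name
        number = self.to_int(phone)
        # Assumes each number belongs to exactly one contact
        self.name_by_number[number] = name
        self.number_by_name[name] = number

    def delete(self, name):
        number = self.number_by_name.pop(name, None)
        if number is not None:
            self.name_by_number.pop(number, None)

    def number_of(self, name):
        return self.number_by_name.get(name)

    def who_calls(self, phone):
        return self.name_by_number.get(self.to_int(phone))
```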

#### Parameters
* $n$: #phone numbers stored
* $m$: cardinality of the hash function
* $c$: length of the longest chain
* $O(n+m)$: memory used
* $\alpha = \frac{n}{m}$: load factor
* $O(c+1)$: run time of operations  

Want to make $m$ and $c$ small!

### Hash Function
Let $U$ be the **universe**, the set of all possible keys.  
A set of hash functions 
$$
\mathcal{H} = \{h:U \to \{0, 1, 2, \dots , m-1 \} \}
$$

is called a **universal family** if for any two keys $x, y \in U$, $x \neq y$, the probability of **collision** (over $h$ chosen uniformly at random from $\mathcal{H}$) satisfies

$$
\mathbb{P}[h(x) = h(y)] \leq \frac{1}{m}
$$

**LEMMA**  
If $h$ is chosen randomly from a universal family, the average length of the longest chain $c$ is $O(1+\alpha)$, where $\alpha = \frac{n}{m}$ is the load factor of the hash table.

**COROLLARY**  
If $h$ is from a universal family, operations with the hash table run on average in time $O(1 + \alpha)$.

#### Choosing Hash Table Size
* Ideally, $0.5 < \alpha < 1$
* Use $O(m) = O(\frac{n}{\alpha}) = O(n)$ memory to store $n$ keys
* Operations run in time $O(1 + \alpha) = O(1)$ on average

What if $n$ is **unknown**?  
Copy the idea of dynamic arrays:
1. Resize the hash table when $\alpha$ becomes too large.
2. Choose new hash function and **rehash** all the objects.

Similarly to dynamic arrays, a single rehashing takes $O(n)$ time, but the amortized running time of each operation with the hash table is still $O(1)$ on average, because rehashing will be rare.
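The resizing idea can be sketched as follows. The threshold `MAX_LOAD = 0.9`, the doubling policy and the class name `ResizingSet` are assumptions for this example. For brevity it keeps Python's built-in `hash()` and only changes $m$, whereas the notes choose a new hash function at each rehash.

```python
class ResizingSet:
    MAX_LOAD = 0.9  # assumed threshold for the load factor alpha = n / m

    def __init__(self):
        self.m = 4
        self.n = 0
        self.chains = [[] for _ in range(self.m)]

    @staticmethod
    def _index(key, m):
        return hash(key) % m

    def find(self, key):
        return key in self.chains[self._index(key, self.m)]

    def add(self, key):
        if self.find(key):
            return
        self.chains[self._index(key, self.m)].append(key)
        self.n += 1
        if self.n / self.m > self.MAX_LOAD:
            self._rehash(2 * self.m)  # rare O(n) step

    def _rehash(self, new_m):
        new_chains = [[] for _ in range(new_m)]
        for chain in self.chains:
            for key in chain:
                new_chains[self._index(key, new_m)].append(key)
        self.m, self.chains = new_m, new_chains
```

Because the table doubles each time, the total rehashing work over $n$ insertions is $O(n)$, which gives the amortized $O(1)$ bound.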

### Hashing Integers - 1. $\text{phone number} \to \text{name}$  
1. Take phone numbers up to length $L=7$, e.g. `148-25-67`.
2. Convert phone numbers to integers from $0$ to $10^L-1 = 9 \; 999 \; 999$:  
`148-25-67` $\to$ `1482567`
3. Choose a prime number bigger than $10^L$, e.g. `p = 10000019`.
4. Choose the hash table size, e.g. `m = 1000`.

**LEMMA**  
The family of hash functions 
$$
\mathcal{H}_p = \left\{ h_p^{a,b}(x) = ((ax+b) \bmod p) \bmod m \right\}, \quad a \in [1, p-1], \; b \in [0, p-1]
$$
is a **universal family**.
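Picking a random member of $\mathcal{H}_p$ could look like the sketch below, using the example values `p = 10000019` and `m = 1000` from the steps above. The helper name `make_int_hash` and the optional `rng` parameter are assumptions introduced here.

```python
import random


def make_int_hash(p, m, rng=random):
    """Return a random h_p^{a,b} from the family H_p."""
    a = rng.randint(1, p - 1)
    b = rng.randint(0, p - 1)

    def h(x):
        return ((a * x + b) % p) % m

    return h


p = 10_000_019
m = 1000
h = make_int_hash(p, m)
bucket = h(1482567)  # some index in 0..m-1; the value depends on the random a, b
```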

### Hashing Strings - 2. $\text{name} \to \text{phone number}$  
1. Given a string $S$, convert each character $S[i]$ to an integer code: ASCII code, Unicode, etc.
2. Choose a big prime number $p$.
3. Randomly select a hash function $h_p^x$ from $\mathcal{P}_p$.
4. Randomly select a hash function $h_p^{a,b}$ from $\mathcal{H}_p$.
5. Compose the two functions to get the hash function $h_m(S) = h_p^{a,b}(h_p^x(S))$.

**LEMMA**  
The family of polynomial hash functions 
$$
\mathcal{P}_p = \left\{h_p^x(S) = \left(\sum_{i=0}^{|S|-1} S[i]x^i\right) \bmod p \right\}, \quad p \; \text{prime}, \; x \in [1,p-1]
$$
has probability of **collision** $\leq \frac{L}{p}$ for any two different strings of length at most $L+1$, when the hash function is selected randomly from the family.

**LEMMA**  
For any two different strings $s_1$ and $s_2$ of length at most $L+1$, if $h_m$ is built as above with cardinality $m$, the probability of **collision** is $\mathbb{P}[h_m(s_1) = h_m(s_2)] \leq \frac{1}{m} + \frac{L}{p}$.

**COROLLARY**  
If $p > mL$, then for any two different strings of length at most $L+1$, the probability of **collision** is $\mathbb{P}[h_m(s_1) = h_m(s_2)] = O(\frac{1}{m})$.


**RUNNING TIME**  
* For big enough $p$, again $c = O(1 + \alpha)$
* Computing `PolyHash(S)` runs in time $O(|S|)$.
* If lengths of the names in the phone book are bounded by constant $L$, computing $h(S)$ takes $O(L) = O(1)$ time.
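A sketch of `PolyHash` and of the composed string hash $h_m$. The function names, the use of `ord()` for character codes, and the prime `1_000_000_007` in the usage lines are assumptions for illustration.

```python
import random


def poly_hash(s, p, x):
    """h_p^x(S) = (sum of S[i] * x^i) mod p, evaluated with Horner's rule."""
    h = 0
    for ch in reversed(s):
        h = (h * x + ord(ch)) % p
    return h


def make_string_hash(p, m, rng=random):
    """Compose a random polynomial hash with a random integer hash."""
    x = rng.randint(1, p - 1)
    a = rng.randint(1, p - 1)
    b = rng.randint(0, p - 1)

    def h_m(s):
        return ((a * poly_hash(s, p, x) + b) % p) % m

    return h_m


h_m = make_string_hash(p=1_000_000_007, m=1000)
bucket = h_m("Alice")  # some index in 0..999; the value depends on x, a, b
```

Horner's rule keeps every intermediate value below $p \cdot x$, so the hash is computed in $O(|S|)$ arithmetic operations without ever forming large powers of $x$.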

## Searching for Patterns

Given a text $T$ (book, website, Facebook profile) and a pattern $P$ (word, phrase, sentence), find all occurrences of $P$ in $T$.

**Examples**  
* Your name on a website
* Twitter messages about your company
* Detect files infected by virus - code patterns

**Definition:** Denote by $S[i..j]$ the substring of $S$ starting at position $i$ and ending at position $j$ (inclusive).

### Problem: Find Pattern in Text

**Input:** Strings $T$ and $P$.

**Output:** All positions $i \in [0, \: |T| - |P|]$ such that $T[i \: .. \: i + |P|-1] = P$.

| Algorithm      | Running Time
| :-            | :- 
| `FindPatternNaive(T,P)` | $\Theta(\|T\|\|P\|)$


```python
def AreEqual(S1, S2):
    """Running time = O(m)"""
    if len(S1) != len(S2):
        return False
    for i in range(len(S1)):
        if S1[i] != S2[i]:
            return False
    return True

def FindPatternNaive(T, P):
    """Running time = O((n-m+1)*m) = O(n*m)"""
    result = []
    n = len(T)
    m = len(P)
    for i in range(n - m + 1):
        if AreEqual(T[i:i+m], P):
            result.append(i)
    return result

FindPatternNaive("a cat and a dog", "cat")  # returns [2]
```

* Need to compare $P$ with all substrings $S$ of $T$ of length $|P|$.
* Idea: use hashing to quickly compare $P$ with substrings of $T$.


### Rabin-Karp's Algorithm
1. Use polynomial hash family $\mathcal{P}_p$ with big prime $p$
2. If $h(P) \neq h(S)$, then definitely $P \neq S$
3. If $h(P) = h(S)$, call `AreEqual(P,S)`
4. If equal, append index to return array.


* If $P \neq S$, the probability $\mathbb{P}[h(P) = h(S)]$ is at most $\frac{|P|}{p}$ for polynomial hashing.
* On average, the total number of false alarms will be $(|T|-|P|+1)\frac{|P|}{p}$, which can be made small by selecting $p \gg |T||P|$.

**Running time**
* $h(P)$ is computed in $O(|P|)$
* $h(T[i \: .. \: i + |P|-1])$ is computed in $O(|P|)$, $|T| - |P| +1$ times
* `AreEqual` is computed in $O(|P|)$
* If $P$ is found $q$ times in $T$, then `AreEqual` is called $q + \frac{(|T|-|P|+1)|P|}{p}$ times on average ($q$ true matches plus the false alarms)
* Total time spent in `AreEqual` is $O((q + \frac{(|T|-|P|+1)|P|}{p})|P|) = O(q|P|)$ for $p \gg |T||P|$
* Total running time is $O(|P|) + O((|T| - |P| +1)|P|) + O(q|P|) = O(|T||P|)$

**Improving running time by precomputing hashes**  
* $h(S) = \sum_{i=0}^{|S| - 1} S[i]x^i \mod p$
* $H[i] = h(T[i \: .. \: i + |P|-1]) = \sum_{j=i}^{i+|P| - 1} T[j]x^{j-i} \mod p$

We can rewrite

$$
H[i] = xH[i+1] + (T[i] - T[i + |P|]x^{|P|}) \mod p
$$
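For completeness, a short derivation of this recurrence from the definition of $H[i]$ (all equalities are modulo $p$). It only splits off the term $T[i]$ and adds and subtracts the last term of $H[i+1]$:

$$
\begin{aligned}
H[i] &= \sum_{j=i}^{i+|P|-1} T[j]x^{j-i} \\
     &= T[i] + x \sum_{j=i+1}^{i+|P|-1} T[j]x^{j-i-1} \\
     &= T[i] + x \left( \sum_{j=i+1}^{i+|P|} T[j]x^{j-i-1} - T[i+|P|]x^{|P|-1} \right) \\
     &= xH[i+1] + T[i] - T[i+|P|]x^{|P|}
\end{aligned}
$$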

So, we can
1. Compute $H[|T|-|P|]$ in time $O(|P|)$ using `PolyHash`
2. Compute $x^{|P|}$ in time $O(|P|)$
3. Use the recursive formula to compute $H[i]$ for all $i \in [0, \; |T|-|P|-1]$ in time $O(|T|-|P|)$

Total time to precompute $H$ is $O(|T|+|P|)$

**Improved running time**
1. Compute $h(P)$ in time $O(|P|)$
2. Precompute hashes in time $O(|T|+|P|)$
3. Total time in `AreEqual` is $O(q|P|)$ on average
4. Total average running time is $O(|T| + (q+1)|P|)$, which is typically much smaller than $O(|T||P|)$ because $q$ is usually small.

```python
def RabinKarp(T, P):
    ...
```
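The `RabinKarp` body is left unfilled in these notes. The following is one possible sketch that follows the precomputation steps described above, not the author's implementation. The names `rabin_karp_sketch`, `precompute_hashes` and `poly_hash`, the default prime `1_000_000_007`, returning an empty list for an empty pattern, and the use of Python's three-argument `pow` for $x^{|P|} \bmod p$ are all assumptions.

```python
import random


def poly_hash(s, p, x):
    """(sum of S[i] * x^i) mod p, evaluated with Horner's rule."""
    h = 0
    for ch in reversed(s):
        h = (h * x + ord(ch)) % p
    return h


def precompute_hashes(T, m, p, x):
    """H[i] = hash of T[i..i+m-1], filled from right to left."""
    n = len(T)
    H = [0] * (n - m + 1)
    H[n - m] = poly_hash(T[n - m:], p, x)
    y = pow(x, m, p)  # x^{|P|} mod p
    for i in range(n - m - 1, -1, -1):
        # Python's % always returns a non-negative result here
        H[i] = (x * H[i + 1] + ord(T[i]) - y * ord(T[i + m])) % p
    return H


def rabin_karp_sketch(T, P, p=1_000_000_007, rng=random):
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []  # assumed convention for degenerate inputs
    x = rng.randint(1, p - 1)
    p_hash = poly_hash(P, p, x)
    H = precompute_hashes(T, m, p, x)
    result = []
    for i in range(n - m + 1):
        # Compare strings only when the hashes agree (rules out false alarms)
        if H[i] == p_hash and T[i:i + m] == P:
            result.append(i)
    return result


rabin_karp_sketch("a cat and a dog", "cat")  # returns [2]
```

Because every hash match is confirmed by a direct comparison, the result does not depend on the random choice of $x$; only the running time does.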

## Instant Uploads and Storage Optimization in Dropbox

**STORAGE OPTIMIZATION**  

Consider a storage platform, e.g. OneDrive, Google Drive or Dropbox.

When three different users upload the same file (e.g. Nyan Cat video), store only 1 copy, linking user files to that one copy.

#### A. Naive Comparison
1. Upload new file
2. Go through all stored files
3. Compare each stored file with new file byte-by-byte
4. If there's the same file, store a link to it instead of the new file

*Drawbacks*
* Have to upload the file first anyway
* $O(NS)$ to compare file of size $S$ with $N$ other files
* $N$ grows, so total running time of uploads grows as $O(N^2)$

#### B. Hash Comparison
1. Upload new file and compute its hash
2. Compare hashes (as in Rabin-Karp's algorithm)
    - If hashes are different, files are different
    - If there's a file with the same hash, upload and compare directly

*Drawbacks*
* There can be collisions
* Still have to upload the file to compare directly
* Still have to compare with all $N$ stored files

#### C. Several Hashes (Final Solution)
1. Choose several (3 to 5) different hash functions (e.g. polynomial hashing with different $p$ or $x$) and compute all hashes for all files
2. Compute the hashes of the new file locally, before upload
3. Compare hashes
    - If any of the hashes differ, the files are different
    - If there's a stored file with all the same hashes, don't upload the new file; link to the stored copy instead

*Drawbacks*
* Collisions can happen even for several hashes simultaneously. However,
    * There are algorithms to find collisions for known hash functions
    * Even for one hash function, collisions are extremely rare
    * Using 3 or 5 hashes, you probably won't see a collision in a lifetime

<br>

* Still have to compare with $N$ already stored files
    * When a file is submitted for upload, hashes are computed anyway, so store file addresses in a hash table.
    * Also store all the hashes there.
    * Only need the hashes to search in the table (we do not need the file itself)
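A toy sketch of this scheme: the tuple of several polynomial hashes serves as the key of a hash table that maps to the storage address. The `(p, x)` pairs in `HASH_PARAMS`, the in-memory `blobs` list standing in for real storage, and the names `fingerprint` and `DedupStore` are all assumptions for illustration.

```python
def poly_hash_bytes(data, p, x):
    """Polynomial hash of a bytes object, evaluated with Horner's rule."""
    h = 0
    for byte in reversed(data):
        h = (h * x + byte) % p
    return h


# Assumed (p, x) pairs; a real service would choose its own parameters
HASH_PARAMS = [(1_000_000_007, 263), (1_000_000_009, 911), (998_244_353, 131)]


def fingerprint(data):
    return tuple(poly_hash_bytes(data, p, x) for p, x in HASH_PARAMS)


class DedupStore:
    def __init__(self):
        self.address_by_fingerprint = {}  # hashes -> storage address
        self.blobs = []                   # stand-in for real file storage

    def upload(self, data):
        """Return (address, uploaded); uploaded is False for a known file."""
        fp = fingerprint(data)  # only the hashes are needed for the lookup
        if fp in self.address_by_fingerprint:
            return self.address_by_fingerprint[fp], False
        self.blobs.append(data)
        address = len(self.blobs) - 1
        self.address_by_fingerprint[fp] = address
        return address, True
```

The lookup costs $O(1)$ on average and never touches the stored files, so a duplicate is detected without transferring its contents.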

**MORE Problems**
* Billions of files are uploaded daily
* Trillions stored already
* Too big for a simple hash table
* Millions of users upload simultaneously
* Too many requests for a single table

## Big Data

* Need to store trillions or more objects, e.g. file addresses, user profiles, emails
* Need fast search/access
* Hash tables provide $O(1)$ search/access on average, but for $n=10^{12}$, $O(n+m)$ memory becomes too big to store on one computer.

Solution: **distributed hash tables**.

### Distributed Hash Tables
1. Get 1000 computers
2. Create a hash table on each of them
3. Determine which computer "owns" object $O$ by using another hash: number $h(O) \mod 1000$
4. Send the request to that computer, then search/modify in its local hash table

**Problems**
* Computers sometimes break. For example, if each computer breaks once in 2 years, then in a cluster of 1000 computers one breaks roughly every day.
* Therefore, store several copies of the data.
    * Need to relocate the data from the broken computer
    * Service grows, and new computers are added to the cluster
    * $h(O) \mod 1000$ no longer works! Therefore use **Consistent Hashing**


**Consistent Hashing**
* Choose hash function $h$ with cardinality $m$ and put numbers from $0$ to $m-1$ on a circle clockwise
* Each object $O$ is then mapped to a point on the circle with number $h(O)$
* Map computer IDs to the same circle: $\text{compID} \to$ point number $h(\text{compID})$  
* Each object is stored on the "closest" computer
* Each computer stores all objects falling on some arc of the circle
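The circle can be sketched with a sorted list of points and binary search. In this example "closest" is taken to mean the first computer clockwise from the object's point, $h$ is derived from `hashlib.md5`, and $m = 2^{32}$; these choices, the class name `ConsistentHashRing`, and the assumption that computer IDs hash to distinct points are made only for illustration.

```python
import bisect
import hashlib

M = 2 ** 32  # assumed cardinality of the hash function (size of the circle)


def h(value):
    """Stable hash of any value onto the circle 0..M-1."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % M


class ConsistentHashRing:
    def __init__(self, computer_ids=()):
        self.points = []  # sorted points of the computers on the circle
        self.owner = {}   # point -> computer ID (assumes distinct points)
        for cid in computer_ids:
            self.add_computer(cid)

    def add_computer(self, cid):
        point = h(cid)
        bisect.insort(self.points, point)
        self.owner[point] = cid

    def remove_computer(self, cid):
        point = h(cid)
        self.points.remove(point)
        del self.owner[point]

    def computer_for(self, obj):
        # First computer at or after h(obj), wrapping around the circle
        i = bisect.bisect_left(self.points, h(obj))
        return self.owner[self.points[i % len(self.points)]]
```

With this layout, adding a computer only takes over keys from one arc of the circle; every other key keeps its owner, unlike with $h(O) \mod 1000$.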

<div style="column-count: 2;">

  <div align="center">
    <img src="images\consistent_hashing_3.png" width="600" style="display: block; margin-bottom: 20px;">
    When a computer goes off, "neighbours" take its data
  </div>

  <div align="center">
    <img src="images\consistent_hashing_4.png" width="600" style="display: block; margin-bottom: 20px;">
    When a new computer is added, it takes data from the "neighbours"
  </div>
</div>

**Overlay Network**
* Need to copy/relocate data
* How will a node know where to send the data?
    * Each node will know a few neighbours
    * For each key, each node will either store it or know some node "closer" to this key
    * E.g., each node knows its neighbours at distances $\pm 1, \pm 2, \pm 4, \pm 8, \dots$ on the circle and can get/send any key in $O(\log n)$ steps.
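To see why power-of-two neighbours give $O(\log n)$ routing, here is a toy simulation. It assumes $n$ nodes numbered $0, \dots, n-1$ placed evenly on the circle, all of them alive, and greedy forwarding to the known neighbour that gets closest to the destination without overshooting. The function name `route` is hypothetical, and a real overlay network has to handle many more cases.

```python
def route(src, dst, n):
    """Greedy routing on a ring of n nodes; returns the list of visited nodes."""
    path = [src]
    cur = src
    while cur != dst:
        clockwise = (dst - cur) % n
        counter = (cur - dst) % n
        if clockwise <= counter:
            dist, sign = clockwise, 1
        else:
            dist, sign = counter, -1
        step = 1 << (dist.bit_length() - 1)  # largest power of two <= dist
        cur = (cur + sign * step) % n
        path.append(cur)
    return path


route(0, 5, 16)  # returns [0, 4, 5]
```

Each hop at least halves the remaining distance, so the number of hops is at most about $\log_2 n$.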


**CONCLUSION**
* Distributed Hash Tables (DHT) store Big Data on many computers
* Consistent Hashing (CH) is one way to determine which computer stores which data
* CH uses mapping of keys and computer IDs on a circle
* Each computer stores a range of keys
* Overlay Network is used to route the data to/from the right computer