# Hash Table

* Given a key k, use a hash function to compute a number (called a hash value) that represents k. 
* keys are not sorted => hash table = unordered set 

### Hash Function

1. Property of Equality (if two keys are equal, they must have the same hash value)
2. Property of Inequality (if two keys are not equal, it would be nice (but not necessary) for them to have different hash values).

**Perfect** hash function = returns different values for different keys (has both properties)

* check a file for validity: [MD5](https://en.wikipedia.org/wiki/MD5), [SHA-1](https://en.wikipedia.org/wiki/SHA-1), [CRC](https://en.wikipedia.org/wiki/Cyclic_redundancy_check)
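
As a minimal sketch of the file-validity use case, Python's standard `hashlib` and `zlib` modules can compute these checksums. The helper name `digests` and the sample byte strings are only illustrative.

```python
import hashlib
import zlib

def digests(data: bytes) -> dict:
    """Return the MD5, SHA-1, and CRC-32 checksums of a byte string."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "crc32": zlib.crc32(data),
    }

original = b"hello world"
tampered = b"hello world!"

# Property of Equality: equal inputs always produce equal hash values.
print(digests(original) == digests(b"hello world"))
# A modified file is expected (but not strictly guaranteed) to hash differently.
print(digests(original)["sha1"] != digests(tampered)["sha1"])
```

To validate a download, you would compare the digest of the received bytes against the digest published by the source.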

### Hash Table

* average-case time complexity - O(1)
* the average-case performance of a Hash Table is independent of the number of elements it stores.
* Hash Table's capacity = M = size of the backing array
* a key k is passed through the hash function to compute an index in the array at which to store k

<center><img src="https://ucarecdn.com/21dc92e4-1186-4cb7-90ea-c4f3743866d5/" width=600 /></center>

<center><img src="https://ucarecdn.com/bef8b50c-2623-4c36-96df-88472fa443bc//" width=400 /></center>


> Find the key A

<center><img src="https://ucarecdn.com/7730aec4-bd4d-48e8-a2e2-c109915c6988/" width=600 /></center>

> Find the key E

<center><img src="https://ucarecdn.com/5394e27f-5c65-4f6e-a0b8-41167bd9303b/" width=600 /></center>

In general, a Hash Table needs to have the following set of functions:

* insert(key): Insert key into the Hash Table (no duplicates)
* find(key): Return true if key exists in the Hash Table, otherwise return false
* remove(key): Remove key from the Hash Table (if it exists)
* hashFunction(key): Produce a hash value for key to use to map to a valid index
* key_equality(key1, key2): Check if two keys key1 and key2 are equal

```
insert(key): // Insert key into the Hash Table
    index = H(key)
    if arr[index] is empty:
        arr[index] = key
```

```
find(key): // Return true if key exists in the Hash Table, otherwise return False
    index = H(key)
    return arr[index] == key
```
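
A direct Python translation of the two routines above, as a sketch only. The class name `NaiveHashTable` and the hash function (ASCII value of a one-character key, mod the capacity) are assumptions made for illustration. It shows what goes wrong when two keys map to the same index.

```python
class NaiveHashTable:
    """Hash Table with no collision handling, mirroring the pseudocode above."""

    def __init__(self, capacity):
        self.arr = [None] * capacity

    def _index(self, key):
        # Illustrative hash function: ASCII value of the key, mod the capacity.
        return ord(key) % len(self.arr)

    def insert(self, key):
        index = self._index(key)
        if self.arr[index] is None:
            self.arr[index] = key

    def find(self, key):
        return self.arr[self._index(key)] == key


table = NaiveHashTable(5)
table.insert("A")  # ord("A") = 65, 65 % 5 = 0
table.insert("F")  # ord("F") = 70, 70 % 5 = 0 -> slot is taken, key is silently dropped
print(table.find("A"), table.find("F"))  # True False
```

The second key is lost because both keys index to slot 0, which is why a collision resolution strategy is needed.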

### Collisions

$$ P(event) = 1 - P(no\ event) $$

Probability of encountering a collision. 

$$ P_{N,M}(\ge 1\ collision) = 1 - P_{N,M}(no\ collisions) $$

$$ P(A\ and\ B) = P(A)P(B | A) = P(B)P(A | B) $$

#### Birthday Paradox 

* The goal: to figure out how likely it is that at least 2 people share a birthday given a pool of N people.

For example, suppose there are 365 slots in a Hash Table: M = 365 (because there are 365 days in a non-leap year). Using the equations from the previous steps, we see the following for various values of N individuals:

* For $ N=10, P_{N,M}(\ge 1\ collision) = 12\% $

* For $ N=20, P_{N,M}(\ge 1\ collision) = 41\% $ 

* For $ N=30, P_{N,M}(\ge 1\ collision) = 71\% $ 

* For $ N=40, P_{N,M}(\ge 1\ collision) = 89\% $  

* For $ N=50, P_{N,M}(\ge 1\ collision) = 97\% $ 

* For $ N=60, P_{N,M}(\ge 1\ collision) = 99+\% $

So, among 60 randomly-selected people, it is almost certain that at least one pair of them will have the same birthday. That is crazy! The fraction $ \frac{60}{365} $ is a mere $ 16.44\% $, which is how full our Hash Table of birthdays needed to be in order to have almost guaranteed a collision.
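
The percentages above can be reproduced with a short sketch. It assumes every key is equally likely to hash to each of the M slots, and uses the standard product form for "no collisions": the (i+1)-th key must avoid the i slots already taken. The function name `collision_probability` is illustrative.

```python
def collision_probability(n, m):
    """P(at least one collision) when n keys hash uniformly into m slots."""
    p_no_collision = 1.0
    for i in range(n):
        # The (i+1)-th key must miss the i slots that are already occupied.
        p_no_collision *= (m - i) / m
    return 1.0 - p_no_collision


for n in (10, 20, 30, 40, 50, 60):
    print(n, f"{collision_probability(n, 365):.0%}")
```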

<center><img src="https://ucarecdn.com/712649f2-0b3c-40e6-8308-4ddafc27171a/" width=600 /></center>

* If the Load Factor of the Hash Table is more than 0.75 -> resize the array to keep it low 

* otherwise, there will be a lot of collisions 

* Capacity of Hash Table should be larger than the amount of keys!

* Capacity of Hash Table should be a prime number

#### How to design Hash Table

* To be able to maintain an average-case time complexity of O(1), we need our Hash Table's backing array to have free space

* If we expect to be inserting N keys into our Hash Table, we should allocate an array roughly of size M = 1.3N

* If we end up inserting more keys than we expected, or if we didn't have a good estimate of N, we should make sure that our load factor $ \alpha=\frac{N}{M} $ never exceeds 0.75 or so

* Our analysis was based on the assumption that every slot of our array is equally likely to be chosen by our indexing hash function. As a result, we need a hash function that, on average, spreads keys across the array randomly

* In order to help spread keys across the array randomly, the size of our Hash Table's backing array should be prime
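
These sizing rules can be sketched as follows. The helper names (`is_prime`, `choose_capacity`, `needs_resize`) are hypothetical, and rounding 1.3N up to the next prime is one reasonable reading of "roughly of size M = 1.3N", not the only one.

```python
def is_prime(x):
    """Trial-division primality test (fine for table-sized numbers)."""
    if x < 2:
        return False
    i = 2
    while i * i <= x:
        if x % i == 0:
            return False
        i += 1
    return True


def choose_capacity(expected_keys):
    """Smallest prime M with M >= 1.3 * expected_keys."""
    m = (13 * expected_keys + 9) // 10  # ceil(1.3 * N) using integer arithmetic
    while not is_prime(m):
        m += 1
    return m


def needs_resize(num_keys, capacity, threshold=0.75):
    """True when the load factor N / M exceeds the threshold."""
    return num_keys / capacity > threshold


print(choose_capacity(100))  # 131
print(needs_resize(4, 5))    # True (load factor 0.8)
```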


#### Linear Probing

* collision resolution strategy 

* The idea: if an object key maps to an index that is already occupied, simply shift over and try the next available index.

* worst-case time complexity O(N)

> Example, $ H(k) = (k + 3)\ \%\ m $ (k - ASCII value of the key, m - size of the backing array)

#### Inserting 

<center><img src="https://ucarecdn.com/737d90c0-ddf4-4cdc-b4f4-33ac9b6bf8ea/" width=400 /></center>

The core of Linear Probing 

$$ index = (index + 1)\ \%\ M $$

#### Pseudocode 
```
insert_LinearProbe(k): // Insert k into the Hash Table using Linear Probing
    index = H(k)

    Loop infinitely:
        if arr[index] == k:         // check for duplicate insertions
            return false

        else if arr[index] == NULL: // insert if slot is empty
            arr[index] = k
            return true

        else:                       // there is a collision, so recalculate index
            index = (index + 1) % M

        if index == H(k):           // we went full circle (no empty slots)
            enlarge arr and rehash all existing elements
            index = H(k)            // H(k) will index differently now that arr is a different size
```

* closed hashing = the actual key is only ever inserted at an address within the bounds of our Hash Table 

* open addressing = the keys are open to move to an address other than the address to which they initially hashed 


#### Finding 

<center><img src="https://ucarecdn.com/97338f43-7e3f-47bc-8ed6-328c39892e52/" width=400 /></center>

> if an empty slot is found -> the search terminates (the key is not in the table)

#### Deleting

<center><img src="https://ucarecdn.com/b8dcbc5e-fead-4174-928e-04f8101e56b6/" width=400 /></center>

<center><img src="https://ucarecdn.com/c8b54442-ed71-4f3c-8785-c0d3f5bb4702/" width=400 /></center>

> to not break the find algorithm -> mark the slot with a "deleted" flag instead of emptying it

<center><img src="https://ucarecdn.com/a2ab89c8-8ac8-4b87-8c72-78f7cfe780bd/" width=400 /></center>
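
Insert, find, and delete with a deleted flag can be sketched together in Python. This is a minimal illustration, not a full implementation: it assumes one-character keys with $ H(k) = (k + 3)\ \%\ m $ as in the example, the names `LinearProbingTable` and `DELETED` are made up here, and it raises an error instead of resizing when the table is full.

```python
EMPTY = None
DELETED = object()  # sentinel playing the role of the "deleted" flag


class LinearProbingTable:
    def __init__(self, capacity):
        self.arr = [EMPTY] * capacity

    def _hash(self, key):
        # H(k) = (k + 3) % m, where k is the ASCII value of the key
        return (ord(key) + 3) % len(self.arr)

    def _locate(self, key):
        """Return the index holding key, or None if key is absent."""
        index = self._hash(key)
        for _ in range(len(self.arr)):
            if self.arr[index] is EMPTY:  # a truly empty slot ends the search
                return None
            if self.arr[index] is not DELETED and self.arr[index] == key:
                return index
            index = (index + 1) % len(self.arr)  # keep probing past other keys and flags
        return None

    def find(self, key):
        return self._locate(key) is not None

    def insert(self, key):
        if self.find(key):
            return False  # no duplicates
        index = self._hash(key)
        for _ in range(len(self.arr)):
            if self.arr[index] is EMPTY or self.arr[index] is DELETED:
                self.arr[index] = key  # flagged slots can be reused
                return True
            index = (index + 1) % len(self.arr)
        raise RuntimeError("table is full; this sketch does not resize")

    def remove(self, key):
        index = self._locate(key)
        if index is None:
            return False
        self.arr[index] = DELETED  # flag the slot so later probes do not stop here
        return True


table = LinearProbingTable(5)
for key in "AFK":            # all three keys hash to index 3
    table.insert(key)
table.remove("F")            # slot 4 becomes DELETED, not EMPTY
print(table.find("K"))       # True: the probe passes over the deleted slot
```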

#### Negative quality about Linear Probing 

* it results in clusters, or "clumps," of keys in the Hash Table.

> It turns out that clumps are not just "bad luck": probabilistically, they are actually more likely to appear than not! 

<center><img src="https://ucarecdn.com/e0a1db6a-4d7b-4a21-ae7d-d81373117459/" width=400 /></center>

> If we insert a new key, it has a $ \frac{1}{5} $ chance of landing in any of the 5 slots. Let's say that, by chance, it landed in slot 0:

<center><img src="https://ucarecdn.com/b4eb73bc-8f73-43d0-9eed-de5610f6bcbb/" width=400 /></center>

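> Insert a new key again. Its initial index is still equally likely to be any of the 5 slots, each with probability $ \frac{1}{5} $ (it can still index to slot 0)! It lands in slot 2, 3, or 4 only if it indexes to 2, 3, or 4, respectively, so each of those slots has a $ \frac{1}{5} $ chance of receiving it. 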

> What about slot 1? If the new key indexes to slot 1, it will be inserted into slot 1. However, remember, if the element indexes to slot 0, because of Linear Probing, we would shift over and insert it into the next open slot, which is slot 1. As a result, there are two ways for the new key to land in slot 1, making the probability of it landing in slot 1 become $ \frac{2}{5} $, which is twice as likely as any of the other three open slots! Because this probability is twice as large as the others, let's say we ended up inserting into slot 1:

<center><img src="https://ucarecdn.com/9485d40e-8539-4349-8006-e4c04631f94f/" width=400 /></center>

* we could just choose a larger "skip" instead of 1

Let's start with the following Hash Table that contains a single element, but this time, let's use a skip of 3:

<center><img src="https://ucarecdn.com/1a0f102f-777c-4445-bd7a-2e650a8f9a3b/" width=400 /></center>

What would happen if we were to try to insert a new element? Like before, it has a $ \frac{1}{5} $ chance of indexing to 0, 1, 2, 3, or 4. If it happens to land in slots 1, 2, 3, or 4 (each with $ \frac{1}{5} $ probability), we would simply perform the insertion. 

> However, if it were to land in slot 0 (with probability $ \frac{1}{5} $), we would have a collision. Before, in traditional Linear Probing, we had a skip of 1, so we shifted over to slot 1, so the probability of inserting in slot 1 was elevated (to $ \frac{2}{5} $). Now, if we index to 0, slot 3 has an elevated probability of insertion (to $ \frac{2}{5} $, like before). 

In other words, by using a skip of 3, we have the exact same predicament as before: one slot has a higher probability of insertion than the others! All we've done is changed which slot it is that has a higher probability!

<center><img src="https://ucarecdn.com/548b142b-87af-4f5b-9b3e-53a9bd19a63a/" width=400 /></center>
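
The landing probabilities in this walkthrough can be checked exactly with a small sketch. It assumes the new key's initial index is uniform over all slots. The function name `landing_probabilities` is illustrative.

```python
from fractions import Fraction


def landing_probabilities(occupied, capacity, skip=1):
    """Probability that the next inserted key ends up in each empty slot."""
    probs = {s: Fraction(0) for s in range(capacity) if s not in occupied}
    for start in range(capacity):  # each initial index has probability 1/capacity
        index = start
        for _ in range(capacity):  # follow the probe sequence, at most capacity steps
            if index not in occupied:
                probs[index] += Fraction(1, capacity)
                break
            index = (index + skip) % capacity
    return probs


print(landing_probabilities({0}, 5, skip=1))     # slot 1 gets 2/5, the rest 1/5
print(landing_probabilities({0}, 5, skip=3))     # slot 3 gets 2/5, the rest 1/5
print(landing_probabilities({0, 1}, 5, skip=1))  # slot 2 gets 3/5: the clump grows
```

The last call shows why clumps feed themselves: once slots 0 and 1 are both taken, the slot right after the clump collects the probability of every index inside it.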


#### How to avoid clumps 

* designate a different offset for each particular key. 

<center><img src="https://ucarecdn.com/1a0f102f-777c-4445-bd7a-2e650a8f9a3b/" width=400 /></center>

#### Double Hashing 

* to use two hash functions: $ H_1(k) $ to calculate the hashing index and $ H_2(k) $ to calculate the offset in the probing sequence.

* $ H_2(k) $ should return an integer value between 1 and M-1, where M is the size of the Hash Table. 

> Linear Probing  -> $ H_2(k)=1 $ (i.e., we always moved the key 1 index away from its original location). 

* A common choice in Double Hashing is to set $ H_2(K)=1+\left(\frac{K}{M}\ \%\ (M-1)\right) $, where $ \frac{K}{M} $ is integer division

#### Pseudocode 

```
insert_DoubleHash(k): // Insert k using Double Hashing as the collision resolution strategy
    index = H1(k) 
    offset = H2(k)

    Loop infinitely:
        // check for duplicate insertions (not allowed)
        if arr[index] == k:
            return false

        // check if the slot of the array is empty (i.e., it is safe to insert)
        else if arr[index] == NULL:
            arr[index] = k
            return true

        // there is a collision, so re-calculate index
        else:
            index = (index + offset) % M

        // we have tried all possible indices and we are now going in a circle
        if index == H1(k):
            throw an exception OR enlarge table 
```
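
A runnable sketch of the same idea in Python, assuming integer keys, $ H_1(k) = k\ \%\ M $, and the common choice of $ H_2 $ given above with integer division. The class name is illustrative, and the sketch raises an error rather than enlarging the table.

```python
class DoubleHashingTable:
    def __init__(self, capacity):
        self.m = capacity  # ideally prime, so every offset visits every slot
        self.arr = [None] * capacity

    def h1(self, k):
        return k % self.m

    def h2(self, k):
        # Offset between 1 and m-1 (never 0, or the probe would not move).
        return 1 + (k // self.m) % (self.m - 1)

    def insert(self, k):
        index, offset = self.h1(k), self.h2(k)
        for _ in range(self.m):  # at most m probes
            if self.arr[index] == k:
                return False  # duplicate
            if self.arr[index] is None:
                self.arr[index] = k
                return True
            index = (index + offset) % self.m  # collision: jump by this key's offset
        raise RuntimeError("no free slot found; enlarge the table and rehash")


table = DoubleHashingTable(7)
for key in (10, 17, 24):  # all three have H1 = 3, but different offsets
    table.insert(key)
print(table.arr)  # [24, None, None, 10, None, None, 17]
```

Keys that collide at the same first index use different offsets (3 for 17, 4 for 24), so they scatter instead of forming a clump.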

#### Random Hashing 

* collision resolution strategy 

* The idea: use a pseudorandom number generator seeded by the key to produce a sequence of hash values.

> Once an individual hash value is returned, the algorithm just mods it by the capacity of the Hash Table, M, to produce a valid index that the key can map to.

> If there is a collision, the algorithm just chooses the next hash value in the pseudorandomly-produced sequence of hash values.

#### Pseudocode 

```
insert_RandomHash(k): // Insert k using Random Hashing as the collision resolution strategy
    RNG = new Pseudorandom Number Generator seeded with k
    nextNumber = next pseudorandom number generated by RNG
    index = nextNumber % M

    Loop infinitely:
        // check for duplicate insertions (not allowed)
        if arr[index] == k:
            return false

        // check if the slot of the array is empty (i.e., it is safe to insert)
        else if arr[index] == NULL:
            arr[index] = k
            return true

        // there is a collision, so re-calculate index
        else:
            nextNumber = next pseudorandom number generated by RNG
            index = nextNumber % M

        // we have tried all possible indices and we are now going in a circle
        if all M locations have been probed:
            throw an exception OR enlarge arr and rehash all existing elements
```

* An important nuance: we must seed the pseudorandom number generator by the key. 

> Because we need to make sure that our hash function is deterministic (i.e., that it will always produce the same hash value sequence for the same key). Therefore, the only way we can do that is to guarantee to always use the same seed: the key we are currently hashing.

* In practice, Random Hashing is considered to work just as well as Double Hashing. However, very good pseudorandom number generation can be an inefficient procedure, and as a result, it is more common to just stick with Double Hashing.
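
A minimal Python sketch of Random Hashing, assuming integer keys and using the standard library's `random.Random` as the pseudorandom number generator (the function names are illustrative). Seeding with the key makes the probe sequence repeatable.

```python
import random


def probe_sequence(key, capacity, count):
    """First `count` indices probed for a key (deterministic for a given key)."""
    rng = random.Random(key)  # seeded with the key
    return [rng.getrandbits(32) % capacity for _ in range(count)]


def insert_random_hash(arr, key):
    """Insert key into arr using Random Hashing; return False on a duplicate."""
    rng = random.Random(key)
    probed = set()  # occupied indices seen so far
    while len(probed) < len(arr):
        index = rng.getrandbits(32) % len(arr)
        if arr[index] == key:
            return False
        if arr[index] is None:
            arr[index] = key
            return True
        probed.add(index)  # collision: take the next number in the sequence
    raise RuntimeError("all slots probed; enlarge the table and rehash")


arr = [None] * 7
print(insert_random_hash(arr, 42))  # True
print(insert_random_hash(arr, 42))  # False: same seed, same sequence, key is found
```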

### Separate Chaining

* main idea: each slot of our Hash Table keeps a pointer to a Linked List of the keys that hash to that slot.

* closed addressing collision resolution strategy

> Hash function: H(k) = (k + 3) % m, k - ASCII value of the key, m - size of the backing array 

<center><img src="https://ucarecdn.com/e906d940-460f-4fc1-a9c4-8ea4ecdaa7dd/" width=600 /></center>


#### Inserting the key 

> Hash function H(k)
 
```
insert_SeparateChaining(k): // Insert k using Separate Chaining for collision resolution

    index = H(k)

    // check for duplicate insertions (not allowed) and perform insertion
    if Linked List in arr[index] does not contain k:
        insert k into Linked List at arr[index]; n = n+1

        // resize backing array (if necessary)
        if n/m > loadFactorThreshold:
           arr2 = new array with a size ~2 times the size of arr that is prime  // new backing array
           insert all elements from arr into arr2 using normal insert algorithm // rehash all elements
           arr = arr2; m = length of arr2                                       // replace arr with arr2

        return true  // successful insertion

    else:
        return false // unsuccessful insertion
```

* The core of the algorithm:

```
insert k into Linked List at arr[index]
```

* key is inserted strictly within the single index to which it originally hashed.

* closed addressing collision resolution strategy (i.e. the key must be located in the original address). 

* since we are now allowing multiple keys to be at a single index, we like to say that Separate Chaining is an open hashing collision resolution strategy (i.e., the keys do not necessarily need to be physically inside the Hash Table itself).

#### Find and Remove Algorithms

* hash to the correct index of Hash Table 
* search for the item in the respective Linked List 

> The buckets do not have to be Linked Lists; they can be Trees, etc. 
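
Insert, find, and remove with Separate Chaining can be sketched as below. Assumptions: Python lists stand in for the Linked Lists, keys are single characters hashed with H(k) = (k + 3) % m, and the resize step grows to 2m + 1 rather than searching for a prime. The class name is illustrative.

```python
class ChainedHashTable:
    def __init__(self, capacity=5, load_factor_threshold=0.75):
        self.buckets = [[] for _ in range(capacity)]
        self.n = 0
        self.threshold = load_factor_threshold

    def _index(self, key):
        return (ord(key) + 3) % len(self.buckets)

    def find(self, key):
        # Hash to the bucket, then search only that bucket.
        return key in self.buckets[self._index(key)]

    def insert(self, key):
        bucket = self.buckets[self._index(key)]
        if key in bucket:
            return False  # duplicate
        bucket.append(key)
        self.n += 1
        if self.n / len(self.buckets) > self.threshold:
            self._resize()
        return True

    def remove(self, key):
        bucket = self.buckets[self._index(key)]
        if key not in bucket:
            return False
        bucket.remove(key)
        self.n -= 1
        return True

    def _resize(self):
        old_keys = [k for bucket in self.buckets for k in bucket]
        # Roughly double the capacity; a fuller version would round up to a prime.
        self.buckets = [[] for _ in range(2 * len(self.buckets) + 1)]
        for k in old_keys:  # rehash every key into the new array
            self.buckets[self._index(k)].append(k)


table = ChainedHashTable()
for key in "AFK":          # all three hash to index 3
    table.insert(key)
print(table.buckets[3])    # ['A', 'F', 'K']
```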

#### Advantages of Separate Chaining 

* the average-case performance is much better than Linear Probing and Double Hashing as the number of keys approaches M, the capacity of the Hash Table. 

> This is because the probability of future collisions does not increase each time an inserting key faces a collision with the use of Separate Chaining. It is also important to note that we could never have exceeded M in Linear Probing or Double Hashing (not that we would want to in the first place) without having to resize the backing array and reinsert the elements from scratch.

#### Disadvantage of Separate Chaining 

* dealing with a bunch of pointers.

* lose some optimization in regards to memory for two reasons:

    * extra storage now for pointers (when storing primitive data types).
    * All the data in our Hash Table is no longer huddled near a single memory location (since pointers can point to memory anywhere), and as a result, this poor locality causes poor cache performance (i.e., it often takes the computer longer to find data that isn't located near previously accessed data)

### Cuckoo Hashing 

> Cuckoo Hashing—and its weird name—comes from the concept of actual Cuckoo chicks pushing each other out of their nests in order to have more space to live. 

* if an inserting key collides with a key already in the Hash Table, the inserting key pushes out the key in its way and takes its place. 

* [visualization](http://www.lkozma.net/cuckoo_hashing_visualization/)

<center><img src="https://ucarecdn.com/e087eee6-7deb-40c4-97d2-5314e5af1dd0/" width=400 /></center>

* Cuckoo Hashing is defined as having two hash functions, $ H_1(k) $ and $ H_2(k) $, both of which return a position in the Hash Table. 

* As a result, one key has strictly two different hashing locations, where $ H_1(k) $ is the first location that a key always maps to (but doesn't necessarily always stay at). 

* the hash function $ H_1(k) $ hashes keys exclusively to the first Hash Table $ T_1 $, and the hash function $ H_2(k) $ hashes keys exclusively to the second Hash Table $ T_2 $.

* A key k starts by hashing to $ T_1 $, and if another arbitrary key j collides with key k at some point in the future, key k then hashes to $ T_2 $.

* However, a key can also get kicked out of $ T_2 $, in which case it hashes back to $ T_1 $ and potentially kicks out another key.



#### Pseudocode of Cuckoo Hashing 

```
insert_CuckooHash(k): // return true upon successful insertion

    index1 = H1(k)
    index2 = H2(k) 

    // check for duplicate insertions (not allowed)
    if arr1[index1] == k or arr2[index2] == k:
        return False

    current = k

    // loop for a limited amount of time (we will discuss details in the next step) 
    while looping less than MAX times: // MAX is commonly set to 10

        // insert until the slot inserted in is empty
        oldValue = arr1[H1(current)]   // save the value currently in the slot
        arr1[H1(current)] = current    // insert the new key

        if oldValue == NULL:           // if slot was empty, we are done inserting
            return True

        current = oldValue             // time to re-insert what was kicked out

        oldValue = arr2[H2(current)]   // save the value currently in the slot
        arr2[H2(current)] = current    // insert the new key

        if oldValue == NULL:           // if slot was empty, we are done inserting
            return True

        // repeat loop, but with the key displaced from arr2
        current = oldValue

    // loop ended, so insertion failed (need to rehash the table)
    // rehash is commonly done by introducing two new hash functions
    return False
```

#### Infinite loop 

* no empty slots in both tables 

<center><img src="https://ucarecdn.com/9c4a42e7-d3d3-4d9d-b680-e489138eb84f/" width=400 /></center>

> Note: It is important to make sure that the second hash function used returns different indices for keys that originally hashed to the same index. This is because, if a key collides with another key in the first Hash Table, we want to make sure that it will not collide with the same key again in the second Hash Table. Otherwise, we risk hitting a cycle the moment we insert two keys that hash to the same first index.



#### Worst-case constant time complexity

* "find": if the key is not in either $ index_1 = H_1(k) $ or $ index_2 = H_2(k) $, then it is not in the table; this is a constant time operation

* "delete": if the key exists in our table, we know that it is either in $ index_1 = H_1(k) $ or $ index_2 = H_2(k) $, and all we have to do is remove the key from its current index; this is a constant time operation


> For the "insert" operation in Cuckoo Hashing, however, we only get an average-case constant time complexity because, in the worst case, we would have to rehash the entire table, which has an O(n) time complexity. 
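
The three operations can be sketched in Python as follows. This assumes integer keys and two illustrative hash functions, $ H_1(k) = k\ \%\ M $ and $ H_2(k) = \lfloor k / M \rfloor\ \%\ M $, chosen only so that keys colliding in the first table usually separate in the second. The class name is made up, and rehashing after a failed insertion is left to the caller.

```python
class CuckooHashTable:
    MAX_LOOPS = 10  # give up after this many rounds of evictions

    def __init__(self, capacity):
        self.m = capacity
        self.t1 = [None] * capacity
        self.t2 = [None] * capacity

    def h1(self, k):
        return k % self.m

    def h2(self, k):
        return (k // self.m) % self.m

    def find(self, k):
        # Only two slots can ever hold k, so this is worst-case constant time.
        return self.t1[self.h1(k)] == k or self.t2[self.h2(k)] == k

    def remove(self, k):
        if self.t1[self.h1(k)] == k:
            self.t1[self.h1(k)] = None
            return True
        if self.t2[self.h2(k)] == k:
            self.t2[self.h2(k)] = None
            return True
        return False

    def insert(self, k):
        if self.find(k):
            return False  # duplicate
        current = k
        for _ in range(self.MAX_LOOPS):
            i = self.h1(current)
            current, self.t1[i] = self.t1[i], current  # place key, pick up the evicted one
            if current is None:
                return True
            i = self.h2(current)
            current, self.t2[i] = self.t2[i], current  # evicted key moves to table 2
            if current is None:
                return True
        # Insertion failed: `current` is not stored anywhere at this point.
        # A fuller version would rehash the tables and reinsert it.
        return False


table = CuckooHashTable(5)
for key in (3, 8, 13):     # all three have H1 = 3
    table.insert(key)
print(table.t1, table.t2)  # [None, None, None, 13, None] [3, 8, None, None, None]
```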

Fun Fact: A lot of proofs about cycles in Cuckoo Hashing are solved by converting the keys within the two Hash Tables to nodes and their two possible hashing locations to edges to create a graph theory problem!

Note: Cuckoo Hashing is not necessarily restricted to using just two Hash Tables; it is not uncommon to use more than two Hash Tables, and generally, for d Hash Tables, each Hash Table would have a capacity of $ \frac{M}{d} $, where M is the calculated capacity of a single Hash Table (i.e., the capacity that we would have calculated had we decided to use a different collision resolution strategy that required only one Hash Table).

### Hash Table

<center><img src="https://ucarecdn.com/23f88623-1ffd-46d2-9e28-97739e0a9804/" width=500 /></center>

### Map Abstract Data Type (= Map ADT)

* map keys to their corresponding values 
* associative array

Map ADT is defined by the following set of functions:

* put(key,value): perform the insertion, and return the previous value if overwriting, otherwise NULL
* get(key): return the value associated with key if key is in the Map, otherwise fail
* remove(key): remove the (key, value) pair associated with key, and return value upon success or NULL on failure
* size(): return the number of (key, value) pairs currently stored in the Map
* isEmpty(): return true if the Map does not contain any (key, value) pairs, otherwise return false

> can be implemented in multiple ways (e.g. Binary Search Tree, Hash Table)

### Hash Map

* insert(key,value): perform the insertion, and return the previous value if overwriting, otherwise NULL
* find(key): return the value associated with the key
* remove(key): remove the (key, value) pair associated with key, and return value upon success or NULL on failure
* hashFunction(key): return a hash value for key, which will then be used to map to an index of the backing array
* key_equality(key1, key2): return true if key1 is equal to key2, otherwise return false
* size(): return the number of (key, value) pairs currently stored in the Hash Map
* isEmpty(): return true if the Hash Map does not contain any (key, value) pairs, otherwise return false


* keys for accessing the addresses
* keys must be hashable
* be able to check for uniqueness

When we find, insert, or remove (key, value) pairs in a Hash Map, we do everything exactly like we did with a Hash Table, but with respect to the key.

#### Insertion

* given a (key, value) pair
* hash only the key but store the key and the value together

<center><img src="https://ucarecdn.com/e9108a34-1aad-4123-b52d-be6e9d695028/" width=600 /></center>

```
insert(key,value): // insert <key,value>, replacing old value with new value if key exists
    index = hashFunction(key)
    returnVal = NULL

    // if key already exists, save the old value
    if arr[index].key == key:
        returnVal = arr[index].value // we want to return the old value instead of NULL

    // perform the insertion
    arr[index] = <key,value>
    return returnVal
```

> if a key that is being inserted already exists, its value is overwritten by the new one. (In a Hash Table, the duplicate insertion would be rejected instead.)

#### Finding 

* given a key 
* locate the stored (key, value) pair by hashing the key 
* return the value (once the (key, value) pair is found)

```
find(key): // return value associated with key if key exists, otherwise return NULL
    index = hashFunction(key)
    if arr[index].key == key:
        return arr[index].value
    else:
        return NULL
```

#### Removing 

* find (key, value) pair (with respect to the key) 
* remove the pair 

```
remove(key): // remove <key,value> if key exists and return value, otherwise return NULL
    index = hashFunction(key)
    returnVal = NULL

    // if key already exists, save the old value
    if arr[index].key == key:
        returnVal = arr[index].value // we want to return the value instead of NULL
        delete arr[index]            // perform the removal

    // return the appropriate value
    return returnVal
```

* often use a Hash Map to implement "one-to-many" relationships. 
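
The pseudocode above ignores collisions, so here is a hedged Python sketch of the same Hash Map operations with Separate Chaining added as an assumption. The class name `HashMap` is illustrative, and Python's built-in `hash` stands in for `hashFunction`.

```python
class HashMap:
    def __init__(self, capacity=11):
        self.buckets = [[] for _ in range(capacity)]
        self.count = 0

    def _bucket(self, key):
        # Only the key is hashed; the value is stored alongside it.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, old_value) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite and return the previous value
                return old_value
        bucket.append((key, value))
        self.count += 1
        return None

    def find(self, key):
        for k, value in self._bucket(key):
            if k == key:
                return value
        return None

    def remove(self, key):
        bucket = self._bucket(key)
        for i, (k, value) in enumerate(bucket):
            if k == key:
                del bucket[i]
                self.count -= 1
                return value
        return None

    def size(self):
        return self.count

    def is_empty(self):
        return self.count == 0


# "One-to-many": the value for a key can itself be a collection.
courses = HashMap()
courses.insert("alice", ["algorithms", "databases"])
courses.find("alice").append("networks")
print(courses.find("alice"))  # ['algorithms', 'databases', 'networks']
```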


### Bloom Filter 

* space-efficient probabilistic modification of Hash Table 
* instead of storing actual elements -> stores boolean values (in the form of a bit array)
* The array is initialized with m bits, all set to false (or 0).
* instead of having a single hash function for a given element type, a Bloom filter requires us to define k different hash functions, each mapping to any of the m bits uniformly.
* To insert an element, you use each of the k hash functions to compute k indices, and you set the bits at each of those k indices to true (or 1).

#### Example of a Bloom filter (m = 5 bits and k = 3 hash functions)

* h₁(x) returns the ASCII value of character x
* h₂(x) returns 2 plus the ASCII value of character x
* h₃(x) returns 4 times the ASCII value of character x

<center><img src="https://ucarecdn.com/a7b9b3ce-8986-4f61-98f4-915918732f67/" width=500 /></center>

#### Add new element 

To add the character 'B' to our Bloom filter, we would set the bits at the following indices to true (or 1):

* h₁(B) = 66 → 66 % m = 66 % 5 = 1
* h₂(B) = 2 + 66 = 68 → 68 % m = 68 % 5 = 3
* h₃(B) = 4 * 66 = 264 → 264 % m = 264 % 5 = 4

<center><img src="https://ucarecdn.com/30a679d3-29c3-4bd7-a610-fc603265a1b7/" width=500 /></center>

```
insert(x): // Insert x into this Bloom filter
    for each hash function h:
        index = h(x)
        arr[index] = true
```

#### Checking if an element exists

To check whether an element exists, compute its k indices with the same k hash functions. If the bit at any of those indices is false, the query element definitely does not exist. 
> Using the same example as before, where we only inserted the character 'B':

* h₁(x) returns the ASCII value of character x
* h₂(x) returns 2 plus the ASCII value of character x
* h₃(x) returns 4 times the ASCII value of character x

<center><img src="https://ucarecdn.com/30a679d3-29c3-4bd7-a610-fc603265a1b7/" width=500/></center>

> search for the character 'D'

* h₁(D) = 68 → 68 % m = 68 % 5 = 3
* h₂(D) = 2 + 68 = 70 → 70 % m = 70 % 5 = 0
* h₃(D) = 4 * 68 = 272 → 272 % m = 272 % 5 = 2

Because at least one of the hash functions yielded an index with a false bit, we are guaranteed that 'D' definitely does not exist in this Bloom filter.

> search for the character 'L'

* h₁(L) = 76 → 76 % m = 76 % 5 = 1
* h₂(L) = 2 + 76 = 78 → 78 % m = 78 % 5 = 3
* h₃(L) = 4 * 76 = 304 → 304 % m = 304 % 5 = 4

Even though the only letter we inserted was 'B', when we searched for 'L', we ended up with the exact same indices, so as far as we're concerned, 'L' looks like it exists in this Bloom filter as well! This is an example of a False Positive (FP): our Bloom filter returned true for an element that it didn't actually contain!

```
find(x):  // Return false if x definitely doesn't exist, or true if it MIGHT exist
    for each hash function h:
        index = h(x)
        if arr[index] is false:
            return false // x definitely does NOT exist
    return true // all of x's bits were found, so it MIGHT exist (but not sure)
```

This is why we call the Bloom filter a probabilistic data structure

* when the find algorithm returns false, we are guaranteed that the query does not exist
* but when the find algorithm returns true, there is some probability that we have encountered a False Positive (FP), meaning the query actually does not exist even though we returned true. 

* This is precisely the trade-off between a Bloom filter and a Hash Table: **we are sacrificing precision in order to gain memory efficiency**.
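
The worked example (m = 5 bits, the three ASCII-based hash functions) can be reproduced with a short Python sketch. The class name `BloomFilter` is illustrative, and each hash value is reduced mod m as in the example.

```python
class BloomFilter:
    def __init__(self, m, hash_functions):
        self.bits = [False] * m
        self.hash_functions = hash_functions

    def _indices(self, x):
        return [h(x) % len(self.bits) for h in self.hash_functions]

    def insert(self, x):
        for index in self._indices(x):
            self.bits[index] = True

    def find(self, x):
        # False -> x is definitely absent; True -> x MIGHT be present.
        return all(self.bits[index] for index in self._indices(x))


hashes = [
    lambda x: ord(x),      # h1: ASCII value
    lambda x: 2 + ord(x),  # h2: 2 plus the ASCII value
    lambda x: 4 * ord(x),  # h3: 4 times the ASCII value
]

bf = BloomFilter(5, hashes)
bf.insert("B")
print(bf.find("B"), bf.find("D"), bf.find("L"))  # True False True
```

The final `True` is the False Positive from the example: 'L' was never inserted, but it maps to the same three bits as 'B'.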

#### What exactly is the probability of encountering a FP?

First, we need to make two simplifying assumptions:

* Each of our k hash functions uniformly distributes across the m bits of our array
* Each of our n insertions is independent

The probability of a specific bit being set to true by a specific hash function hᵢ(x) during the insertion of a single element: 
$$ \frac{1}{m} $$ 
​	
Therefore, the probability that the same specific bit is not set to true by that hash function:
$$ 1-\frac{1}{m}=\frac{m-1}{m} $$ 

That is, the hash function selected one of the other $ m-1 $ bits to set to true.

If we have k (independent) hash functions, the probability of a specific bit not being set to true by any of the hash functions: 
$$ (\frac{m-1}{m})(\frac{m-1}{m})\ldots(\frac{m-1}{m})=(\frac{m-1}{m})^k $$  

We can use the following well-known identity:
$$ \lim_{m\to\infty}(\frac{m-1}{m})^m=e^{-1} $$  

Specifically, we can do the following (assuming large values of m):
$$ (\frac{m-1}{m})^k=((\frac{m-1}{m})^m)^\frac{k}{m}\approx e^{-\frac{k}{m}} $$  

If we have n (independent) element insertions in total, the probability that a specific bit is not set true by any of the k hash functions during any of the n insertions would be the following:
$$ (\frac{m-1}{m})^k(\frac{m-1}{m})^k\ldots(\frac{m-1}{m})^k=(\frac{m-1}{m})^{kn}\approx e^{-\frac{kn}{m}} $$  

Therefore, the probability that the specific bit is set to true by at least one of the k hash functions during any of the n insertions would be the complement: 
$$ 1-(\frac{m-1}{m})^{kn}\approx 1-e^{-\frac{kn}{m}} $$  

Given a single new element that does not exist in the Bloom filter, the probability of a FP (each of the k hash functions happens to go to an index that is set to true) is the following: 
$$ \epsilon=(1-(\frac{m-1}{m})^{kn})^k\approx (1-e^{-\frac{kn}{m}})^k $$  

We have now derived that the probability of a False Positive (FP) is
$$ \epsilon\approx \left(1-e^{-\frac{kn}{m}}\right)^k $$ 

which we of course want to minimize if we are designing our own Bloom filter:

* We have no power over n: this is the number of elements that are inserted, which is entirely up to the user
* As m increases, the FP probability decreases

Given n and m, the value of k that minimizes the FP probability is the following:
$$ k=\frac{m}{n}\ln(2) $$  

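Substituting this value of k into the FP probability gives $ \epsilon\approx\left(\frac{1}{2}\right)^k $ (because $ e^{-\frac{kn}{m}}=e^{-\ln(2)}=\frac{1}{2} $), and taking the natural logarithm yields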
$$ \ln(\epsilon)=-\frac{m}{n}\left(\ln(2)\right)^2 $$

which results in the following:
$$ m=-\frac{n\ln(\epsilon)}{\left(\ln(2)\right)^2} $$ 

Therefore, the optimal number of bits per element is the following:
$$ \frac{m}{n}=-\frac{\log_2(\epsilon)}{\ln(2)}\approx -1.44\log_2(\epsilon) $$ 

This corresponds to the following optimal number of hash functions:
$$ k=-\frac{\ln(\epsilon)}{\ln(2)}=-\log_2(\epsilon) $$

How does this help us? As the developers of a Bloom filter, we can do the following to inform our design:

1. Guess roughly how many elements (n) the user will insert (we can over-estimate if we are unsure)
2. Pick a FP probability (ε) that we feel is appropriate (smaller = fewer FPs but more memory)
3. Determine the optimal number of hash functions: $ k=-\log_2(\epsilon) $ 
4. Determine the optimal size of the backing array: $ m=-\frac{n\ln(\epsilon)}{\left(\ln(2)\right)^2} $ 
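
The four design steps can be sketched as a small calculator. The function names are illustrative, and rounding m up and k to the nearest integer is an assumption, since the formulas give real numbers.

```python
import math


def bloom_parameters(n, epsilon):
    """Return (m, k): bits in the array and number of hash functions."""
    m = math.ceil(-n * math.log(epsilon) / (math.log(2) ** 2))
    k = max(1, round(-math.log2(epsilon)))
    return m, k


def false_positive_rate(n, m, k):
    """Approximate FP probability: (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k


m, k = bloom_parameters(n=1000, epsilon=0.01)
print(m, k)                           # 9586 7
print(false_positive_rate(1000, m, k))
```

For 1,000 expected elements and a 1% target, this comes to about 9.6 bits per element, which matches $ -1.44\log_2(\epsilon) $.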

### Count-Min Sketch

* space-efficient probabilistic data structure
* can be used as a frequency table of events in a stream of data. 
* store array of counts (unsigned integers)

> a Count-Min Sketch is typically designed to be much smaller than the number of elements for which it wants to count frequencies, whereas a Bloom filter is typically designed to be similar in size to the number of elements it wishes to store.

Much like a Hash Map, a Count-Min Sketch relies on a backing 2D array structure with m columns and k rows, with every cell initialized to 0, to represent the stored counts.

Also, much like a Bloom filter, instead of defining a single hash function for the keys, a Count-Min Sketch requires us to define k different hash functions (one per row of our 2D array), each mapping to any of m cells of its row uniformly. 

To increment the count of an element, you use each of the k hash functions to compute k indices, and you increment the counts at each of those indices.

> Example of a Count-Min Sketch (m = 5 cells and k = 3 hash functions):

* h₁(x) returns the ASCII value of character x
* h₂(x) returns 2 plus the ASCII value of character x
* h₃(x) returns 4 times the ASCII value of character x

<center><img src="https://ucarecdn.com/e87f5bf8-118d-4740-a022-5448353006a9/" width=500/></center>

If we were to increment the count of the character 'B', we would increment the following indices:

* Row 1: h₁(B) = 66 → 66 % m = 66 % 5 = 1
* Row 2: h₂(B) = 2 + 66 = 68 → 68 % m = 68 % 5 = 3
* Row 3: h₃(B) = 4 * 66 = 264 → 264 % m = 264 % 5 = 4

<center><img src="https://ucarecdn.com/2ac4ec61-e929-4965-8923-766a773aaf17/" width=500/></center>

What happens if we now increment the count of 'G'?

* Row 1: h₁(G) = 71 → 71 % m = 71 % 5 = 1
* Row 2: h₂(G) = 2 + 71 = 73 → 73 % m = 73 % 5 = 3
* Row 3: h₃(G) = 4 * 71 = 284 → 284 % m = 284 % 5 = 4

<center><img src="https://ucarecdn.com/f01a9bc9-5b15-40b2-98be-85b167169755/" width=500/></center>

'G' is completely indistinguishable from 'B' in this poorly-designed Count-Min Sketch! 

Therefore, if we try to check the count of 'B', we retrieve a count of 2 in each of the 3 indices we check, and the same occurs for 'G', even though we only saw 1 instance of either of them.

* A Count-Min Sketch cannot actually tell the exact count of a given query! 

* it can only give me an upper limit on the count of a query.

> Specifically, given a query element x, if I use all k of my hash functions on x and look up the count at each corresponding index, the minimum of these counts is the maximum possible count of x. The actual count of x may be less than this value, or it may be equal to this value, but it will never be greater than this value.

```
increment(x): // Increment the count of x
    for each hash function h:
        index = h(x)
        mat[h][index] += 1
```

```
find(x): // Return an (over-)estimate of the count of x
    est = infinity
    for each hash function h:
        index = h(x)
        curr = mat[h][index]
        if curr < est:
            est = curr
    return est // this is greater than or equal to the true count of x
```
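
A Python sketch of the two routines, using the example's m = 5 columns and three ASCII-based hash functions (one per row). The class name `CountMinSketch` is illustrative, and each hash value is reduced mod m as in the example.

```python
class CountMinSketch:
    def __init__(self, m, hash_functions):
        self.m = m
        self.hash_functions = hash_functions
        self.mat = [[0] * m for _ in hash_functions]  # one row per hash function

    def increment(self, x):
        for row, h in enumerate(self.hash_functions):
            self.mat[row][h(x) % self.m] += 1

    def find(self, x):
        # The minimum over the rows is an upper bound on the true count of x.
        return min(self.mat[row][h(x) % self.m]
                   for row, h in enumerate(self.hash_functions))


hashes = [
    lambda x: ord(x),      # h1: ASCII value
    lambda x: 2 + ord(x),  # h2: 2 plus the ASCII value
    lambda x: 4 * ord(x),  # h3: 4 times the ASCII value
]

cms = CountMinSketch(5, hashes)
cms.increment("B")
cms.increment("G")
print(cms.find("B"), cms.find("G"), cms.find("D"))  # 2 2 0
```

Both 'B' and 'G' report 2 even though each was seen once, because they share every cell in this deliberately poor design.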

We want to design a Count-Min Sketch that provides counts that are as accurate as possible without excessive memory consumption.

To do so, we simply need to select the number of columns (m) and the number of rows / hash functions (k). 

* Let $ n $ denote the total number of elements that will be seen by the Count-Min Sketch
* Let $ c_x $ denote the true count of x, and let $ \hat{c}_x $ denote the estimated count of x (i.e., the smallest count after using each hash function on x and going to the corresponding cell), where we are guaranteed that $ c_x\le\hat{c}_x $ 
* Let ε denote an additive factor and δ denote a probability such that $ \hat{c}_x\le c_x+\epsilon n $ with probability $ 1-\delta $ 
* In other words, the smaller the values of ε and δ, the closer we expect $ \hat{c}_x $ and $ c_x $ to be

1. Guess roughly how many elements (n) we will encounter (we can over-estimate if we are unsure)
2. Pick a reasonable upper-bound $ \hat{c}_x-c_x\le \epsilon n $ we would like to see (smaller ε = better estimates but more memory)
3. Pick a reasonable probability of being in this bound $ 1-\delta $ (smaller δ = more likely to be in that upper-bound but more memory)
4. Determine the optimal number of columns: $ m=\lceil\frac{e}{\epsilon}\rceil $, where e is Euler's number
5. Determine the optimal number of rows / hash functions: $ k=\lceil\ln\left(\frac{1}{\delta}\right)\rceil $  
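
Steps 4 and 5 translate directly into a small helper. The function name `count_min_parameters` is illustrative.

```python
import math


def count_min_parameters(epsilon, delta):
    """Return (m, k): columns and rows / hash functions of the sketch."""
    m = math.ceil(math.e / epsilon)      # columns
    k = math.ceil(math.log(1 / delta))   # rows (natural logarithm)
    return m, k


print(count_min_parameters(epsilon=0.01, delta=0.01))  # (272, 5)
```

With these values the sketch needs 272 × 5 = 1,360 counters, independent of how many distinct elements appear in the stream.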
