In [51]:
import cProfile
import math

from collections import namedtuple

# Hash tables and hashing functions

## Hashing functions

We want a function $h(k)$ that maps every key $k$ in the universe $U$ to some index $i$, $0 \le i < m$. We call this function a **hash function**. Given a key $k$, we compute $i = h(k)$ and then use $i$ as the basis for insertion, search, or deletion. Some functions are better than others. A good hash function has these properties:
- The indices $i$ should be uniformly distributed over all $m$ slots.
- It should not hash similar or related keys to the same slot.
- It should be fast to compute: $O(1)$ time.
- It should not change over time. We want a key $k$ to always hash to the same $h(k)$.


### Simple Uniform Hashing Assumption
We want our hash functions to distribute the hashes uniformly across all $m$ slots. We usually assume something called simple uniform hashing. It means that, given any two randomly chosen keys $k_1$ and $k_2$, the probability that they have equal hashes is $\frac{1}{m}$:

$$ 
P[h(k_1) = h(k_2)] = \frac{1}{m}
$$
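As a sanity check on this definition, the sketch below computes the collision probability exactly for a small universe. It assumes both keys are drawn independently and uniformly from the universe; `collision_probability` is a hypothetical helper, not part of the lecture code.

```python
from collections import Counter
from fractions import Fraction


def collision_probability(hash_fn, universe):
    """Exact P[h(k1) = h(k2)] for k1, k2 drawn independently and
    uniformly from `universe` (an assumption of this sketch)."""
    keys = list(universe)
    slot_counts = Counter(hash_fn(k) for k in keys)
    # A pair collides when both keys land in the same slot.
    colliding_pairs = sum(c * c for c in slot_counts.values())
    return Fraction(colliding_pairs, len(keys) ** 2)


m = 8
# k mod m spreads 0..63 evenly over 8 slots.
print(collision_probability(lambda k: k % m, range(64)))        # 1/8
# (4 * k) mod m only ever produces the slots 0 and 4.
print(collision_probability(lambda k: (4 * k) % m, range(64)))  # 1/2
```

The first function meets the $\frac{1}{m}$ target on this universe; the second collides four times as often.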

Let's see some examples of hash functions and analyze them.

### Examples of hash functions

#### Division Method

The division method is one way to create hash functions. The functions take the form

$$
h(k) = k \mod m
$$

Since we’re taking a value $\mod m$, $h(k)$ does indeed map the universe of keys to a slot in the hash table. 

It’s important to note that if we’re using this method to create hash functions, $m$ should not be a power of 2. If $m = 2^p$, then $h(k)$ only looks at the $p$ lowest bits of $k$, completely ignoring the rest of the bits in $k$.
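A small sketch of this effect: when $m$ is a power of two, `k % m` is the same as masking off the low bits, so keys that differ only in their high bits collide. The name `division_hash` and the sample keys are illustrative choices.

```python
def division_hash(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


p = 3
m = 2 ** p
mask = m - 1  # binary 111: keeps only the p lowest bits

# Three keys with different high bits but the same low bits (011).
keys = [0b10101011, 0b00000011, 0b11111011]
for k in keys:
    # For non-negative k, k mod 2^p equals k AND (2^p - 1).
    assert division_hash(k, m) == k & mask
    print(format(k, "08b"), "->", division_hash(k, m))
```

All three keys land in slot 3, no matter what their upper five bits are.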

In [9]:
def hashtest(b, mod):
    for i in range(2*mod):
        print(" %d * %d (mod %d) = %d" % (b, i, mod, (b*i) % mod))

In [10]:
hashtest(4, 8)

 4 * 0 (mod 8) = 0
 4 * 1 (mod 8) = 4
 4 * 2 (mod 8) = 0
 4 * 3 (mod 8) = 4
 4 * 4 (mod 8) = 0
 4 * 5 (mod 8) = 4
 4 * 6 (mod 8) = 0
 4 * 7 (mod 8) = 4
 4 * 8 (mod 8) = 0
 4 * 9 (mod 8) = 4
 4 * 10 (mod 8) = 0
 4 * 11 (mod 8) = 4
 4 * 12 (mod 8) = 0
 4 * 13 (mod 8) = 4
 4 * 14 (mod 8) = 0
 4 * 15 (mod 8) = 4


Is it enough to take odd numbers then? 

With a composite number you have the additional issue where you get poor performance if a disproportionate number of keys share factors with $m$.

In [12]:
hashtest(15, 12)

 15 * 0 (mod 12) = 0
 15 * 1 (mod 12) = 3
 15 * 2 (mod 12) = 6
 15 * 3 (mod 12) = 9
 15 * 4 (mod 12) = 0
 15 * 5 (mod 12) = 3
 15 * 6 (mod 12) = 6
 15 * 7 (mod 12) = 9
 15 * 8 (mod 12) = 0
 15 * 9 (mod 12) = 3
 15 * 10 (mod 12) = 6
 15 * 11 (mod 12) = 9
 15 * 12 (mod 12) = 0
 15 * 13 (mod 12) = 3
 15 * 14 (mod 12) = 6
 15 * 15 (mod 12) = 9
 15 * 16 (mod 12) = 0
 15 * 17 (mod 12) = 3
 15 * 18 (mod 12) = 6
 15 * 19 (mod 12) = 9
 15 * 20 (mod 12) = 0
 15 * 21 (mod 12) = 3
 15 * 22 (mod 12) = 6
 15 * 23 (mod 12) = 9


A good choice for $m$ with the division method is a prime number.
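The experiments above can be summarized in one count: how many distinct slots do the multiples of $b$ reach? The sketch below, using a hypothetical helper `slots_used`, checks that the answer is $m / \gcd(b, m)$, which equals $m$ for a prime $m$ unless $b$ is itself a multiple of $m$.

```python
import math


def slots_used(b, m):
    """Slots hit by the keys 0, b, 2b, ... under h(k) = k mod m."""
    return {(b * i) % m for i in range(2 * m)}


for b, m in [(4, 8), (15, 12), (15, 13), (26, 13)]:
    used = slots_used(b, m)
    # Multiples of b reach exactly m / gcd(b, m) distinct slots.
    assert len(used) == m // math.gcd(b, m)
    print("b=%d, m=%d: %d of %d slots used" % (b, m, len(used), m))
```

The last pair, $b = 26$ and $m = 13$, is the remaining bad case for a prime modulus: every key is congruent to $0 \mod m$.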

With a prime number you will still find poor performance if you have a disproportionate number of keys which are congruent $\mod m$. However

#### Multiplication Method

The multiplication method is another way to create hash functions. The functions take the form

$$
h(k) = \lfloor m(kA \mod 1) \rfloor
$$

where $0 < A < 1$ and $(kA \mod 1)$ refers to the fractional part of $kA$. Since $0 \le (kA \mod 1) < 1$, the range of $h(k)$ is from $0$ to $m-1$. The advantage of the multiplication method is that it works equally well with any table size $m$. $A$ should be chosen carefully. Rational numbers should not be chosen for $A$.


In [43]:
def hashtest_multiplication(A, m):
    for k in range(2 * m):
        fraction = (k * A) % 1.0 
        res = math.floor(m * fraction)
        print("floor(%d * (%d * %f mod 1)) = %d" % \
              (m, k, A, res))

In [49]:
hashtest_multiplication(3.0 / 4.0, 10)

floor(10 * (0 * 0.750000 mod 1)) = 0
floor(10 * (1 * 0.750000 mod 1)) = 7
floor(10 * (2 * 0.750000 mod 1)) = 5
floor(10 * (3 * 0.750000 mod 1)) = 2
floor(10 * (4 * 0.750000 mod 1)) = 0
floor(10 * (5 * 0.750000 mod 1)) = 7
floor(10 * (6 * 0.750000 mod 1)) = 5
floor(10 * (7 * 0.750000 mod 1)) = 2
floor(10 * (8 * 0.750000 mod 1)) = 0
floor(10 * (9 * 0.750000 mod 1)) = 7
floor(10 * (10 * 0.750000 mod 1)) = 5
floor(10 * (11 * 0.750000 mod 1)) = 2
floor(10 * (12 * 0.750000 mod 1)) = 0
floor(10 * (13 * 0.750000 mod 1)) = 7
floor(10 * (14 * 0.750000 mod 1)) = 5
floor(10 * (15 * 0.750000 mod 1)) = 2
floor(10 * (16 * 0.750000 mod 1)) = 0
floor(10 * (17 * 0.750000 mod 1)) = 7
floor(10 * (18 * 0.750000 mod 1)) = 5
floor(10 * (19 * 0.750000 mod 1)) = 2



Every rational number can be written as $\frac{a}{b}$ for some integers $a, b$. Note that $h(k) = \lfloor m(k\frac{a}{b} \mod 1) \rfloor$ permits at most $b$ possible values for $h(k)$. For example, if $b = 4$ then the only possible fractional parts are $0, .25, .5, .75$. In theory that's a bit of a pickle because all IEEE floating point numbers are rational.
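To see the limit on the number of slots without floating point noise, here is a sketch that repeats the experiment with exact rational arithmetic from the standard `fractions` module. `multiplication_slots` is a hypothetical helper that collects the slots reached by the keys `0 .. n_keys - 1`.

```python
import math
from fractions import Fraction


def multiplication_slots(A, m, n_keys):
    """Set of slots floor(m * (k * A mod 1)) for k in range(n_keys)."""
    return {math.floor(m * ((k * A) % 1)) for k in range(n_keys)}


A = Fraction(3, 4)
# With denominator 4 there are only four fractional parts, so at
# most four slots are ever used, however large the table is.
print(sorted(multiplication_slots(A, 10, 1000)))    # [0, 2, 5, 7]
print(sorted(multiplication_slots(A, 100, 1000)))   # [0, 25, 50, 75]
```

Growing the table from 10 to 100 slots does not help: 96 of the 100 slots stay empty.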

$\frac{\sqrt{5} - 1}{2}$ is irrational and related to the golden ratio. Choosing this value for $A$ is a special form of hashing known as Fibonacci hashing. This has the property that when hashing consecutive keys, each subsequent key falls in between the two most widely spaced hash values already computed.

In [50]:
golden_ratio = (math.sqrt(5) - 1.0) / 2.0
hashtest_multiplication(golden_ratio, 10)

floor(10 * (0 * 0.618034 mod 1)) = 0
floor(10 * (1 * 0.618034 mod 1)) = 6
floor(10 * (2 * 0.618034 mod 1)) = 2
floor(10 * (3 * 0.618034 mod 1)) = 8
floor(10 * (4 * 0.618034 mod 1)) = 4
floor(10 * (5 * 0.618034 mod 1)) = 0
floor(10 * (6 * 0.618034 mod 1)) = 7
floor(10 * (7 * 0.618034 mod 1)) = 3
floor(10 * (8 * 0.618034 mod 1)) = 9
floor(10 * (9 * 0.618034 mod 1)) = 5
floor(10 * (10 * 0.618034 mod 1)) = 1
floor(10 * (11 * 0.618034 mod 1)) = 7
floor(10 * (12 * 0.618034 mod 1)) = 4
floor(10 * (13 * 0.618034 mod 1)) = 0
floor(10 * (14 * 0.618034 mod 1)) = 6
floor(10 * (15 * 0.618034 mod 1)) = 2
floor(10 * (16 * 0.618034 mod 1)) = 8
floor(10 * (17 * 0.618034 mod 1)) = 5
floor(10 * (18 * 0.618034 mod 1)) = 1
floor(10 * (19 * 0.618034 mod 1)) = 7
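Since floating point values are rational anyway, a common variant does the same computation in integer arithmetic. The sketch below is one such variant under stated assumptions, not the lecture's code: it assumes a 64-bit word, a table size $m = 2^p$, and a fixed-point approximation `A_FIXED` of $A \cdot 2^{64}$. The names `W`, `A_FIXED` and `fib_hash` are made up for illustration, and `math.isqrt` needs Python 3.8 or newer.

```python
import math

W = 64  # assumed word size in bits

# Fixed-point approximation of 2^W * (sqrt(5) - 1) / 2, computed with
# exact integer arithmetic and forced odd.
A_FIXED = (math.isqrt(5 << (2 * W)) - (1 << W)) // 2 | 1


def fib_hash(k, p):
    """Multiplication method for a table of 2^p slots.

    (k * A_FIXED) mod 2^W is the fractional part of k * A in fixed
    point; its top p bits are floor(2^p * fractional part).
    """
    return ((k * A_FIXED) & ((1 << W) - 1)) >> (W - p)


for k in range(8):
    print(k, "->", fib_hash(k, 4))
```

The first few consecutive keys land in slots 0, 9, 3 and 13 of a 16-slot table, showing the same spreading behaviour as the floating point version.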


## How does Python implement its hashing?

Which idea does Python employ? Surely not the division method. The multiplication method? Well, we managed to show its drawbacks. Some clever cryptographic algorithm?

...

... It uses the division method. Which prime number are they using?

...

... They are using powers of two. Ba dum tsh.

In [27]:
def test_dict(base):
    d = {}
    for i in range(1000000):
        d[base*i] = True
    print(len(d))

In [39]:
cProfile.run("test_dict(2*32-1)")

1000000
         20 function calls in 0.257 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.240    0.240    0.240    0.240 <ipython-input-27-5f4d266ba1d0>:1(test_dict)
        1    0.017    0.017    0.257    0.257 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 iostream.py:102(_check_mp_mode)
        2    0.000    0.000    0.000    0.000 iostream.py:207(write)
        2    0.000    0.000    0.000    0.000 iostream.py:93(_is_master_process)
        1    0.000    0.000    0.257    0.257 {built-in method exec}
        2    0.000    0.000    0.000    0.000 {built-in method getpid}
        2    0.000    0.000    0.000    0.000 {built-in method isinstance}
        1    0.000    0.000    0.000    0.000 {built-in method len}
        1    0.000    0.000    0.000    0.000 {built-in method print}
        2    0.000    0.000    0.000    0.000 {built-in method time}
        1    0.000    0.000    0.000 

In [40]:
cProfile.run("test_dict(2**32)")

1000000
         20 function calls in 0.930 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.879    0.879    0.879    0.879 <ipython-input-27-5f4d266ba1d0>:1(test_dict)
        1    0.051    0.051    0.930    0.930 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 iostream.py:102(_check_mp_mode)
        2    0.000    0.000    0.000    0.000 iostream.py:207(write)
        2    0.000    0.000    0.000    0.000 iostream.py:93(_is_master_process)
        1    0.000    0.000    0.930    0.930 {built-in method exec}
        2    0.000    0.000    0.000    0.000 {built-in method getpid}
        2    0.000    0.000    0.000    0.000 {built-in method isinstance}
        1    0.000    0.000    0.000    0.000 {built-in method len}
        1    0.000    0.000    0.000    0.000 {built-in method print}
        2    0.000    0.000    0.000    0.000 {built-in method time}
        1    0.000    0.000    0.000 

Yeah, again we managed to exploit our knowledge of Python to achieve a significant runtime discrepancy. It is not exactly the difference between $O(n)$ and $O(n^2)$ as one might expect, because Python has quite clever mechanisms in place to recover from this silliness. More details during the next lecture and here:
http://stackoverflow.com/questions/9010222/how-can-python-dict-have-multiple-keys-with-same-hash
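A simplified model of why the second run is slower. It assumes 64-bit CPython, where `hash()` of a small non-negative integer is the integer itself, and it models the initial slot as the low bits of the hash. `slots_touched` is a hypothetical helper; the real dict also mixes higher bits of the hash into its probe sequence, which is part of how it recovers.

```python
def slots_touched(base, table_size=8, n_keys=1000):
    """Initial slots for the keys base * i in a table whose size is a
    power of two, modelled as the low bits of hash(key)."""
    mask = table_size - 1
    return {hash(base * i) & mask for i in range(n_keys)}


# Multiples of 2**32 have all-zero low bits: every key starts in slot 0.
print(len(slots_touched(2 ** 32)))
# Multiples of the odd number 2**32 - 1 spread over every slot.
print(len(slots_touched(2 ** 32 - 1)))
```

Under these assumptions the first call reports a single slot and the second reports all eight.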

## Collisions
As we mentioned, we do not want our hash functions to have a lot of collisions. But collisions are unavoidable. Think of the **pigeonhole principle**. Usually the size of the universe $U$ is much larger than our table size $m$. What do we do if we find that two keys hash to the same value? Two ways which we learn in this class are **chaining** and **open addressing**.

### Chaining
Instead of just storing the elements in the slots in the table, let every slot be a linked list which contains all the elements which are in the table and map to that slot. Our operations now become:

- `Insert` $(k,v)$: hash $k$ to an index $i$ in the table; add $k$ along with $v$ to the linked list at that location.
- `Search` $(k)$: hash $k$ to an index $i$ in the table; search for $k$ in the linked list at that location by iterating through the list.
- `Delete` $(k)$: search for $k$ as above and then remove it from the list.

These operations no longer take $O(1)$ time. Lookup in a linked list takes $O(l)$ time, where $l$ is the length of the list. We define $\alpha = \frac{n}{m}$ as the **load factor**. If we assume simple uniform hashing, then each element is equally likely to go into any slot. So after $n$ independent elements have been inserted, each chain has an expected length of $\frac{n}{m} = \alpha$ by linearity of expectation. So the running time of each of the above operations is the time to hash plus the time to traverse the chain, which is $O(1 + \alpha)$.
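A quick simulation of the load factor, as a sketch: it models simple uniform hashing by sending each of $n$ keys to a uniformly random slot (seeded for repeatability) and records the chain lengths. `chain_lengths` is a hypothetical helper.

```python
import random


def chain_lengths(n, m, seed=0):
    """Chain lengths after inserting n keys into m slots, with each key
    assigned to a uniformly random slot (a model of SUHA)."""
    rng = random.Random(seed)
    lengths = [0] * m
    for _ in range(n):
        lengths[rng.randrange(m)] += 1
    return lengths


n, m = 3000, 500
lengths = chain_lengths(n, m)
print("load factor alpha:   ", n / m)
print("average chain length:", sum(lengths) / m)
print("longest chain:       ", max(lengths))
```

The average chain length always equals $\alpha$, since the $n$ keys are shared among $m$ chains. Individual chains can be longer, which is why the $O(1 + \alpha)$ bound is an expected time rather than a worst case.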

**Note:** It is possible to have expected $O(1 + \alpha)$ runtime for these operations on any given input (i.e. input chosen to make our algorithm perform poorly). This requires more sophisticated hash functions (See "Universal Hashing" in CLRS 11.3.3)

If we assume that $m = O(n)$, then $\alpha = O(1)$ and we get constant time operations. But what if we want to insert more elements into the hash table and we don't know the number of elements to be inserted beforehand? Stay tuned...

In [72]:
class LLNode(object):
    def __init__(self, key, value, next_node):
        self.key, self.value, self.next_node = \
                key, value, next_node
            
class LinkedList(object):
    def __init__(self):
        """Key-value storage over Linked List
        
        unique values per key"""
        self.root = None
        
    def find(self, key):
        """Returns node given key"""
        # find a node with a given key,
        # by following links: O(n)
        node = self.root
        while node is not None:
            if node.key == key:
                return node
            node = node.next_node
        
    def __getitem__(self, key):
        """Returns value for a given key"""
        node = self.find(key)
        if node is not None:
            return node.value
        else:
            return None
            
    def __setitem__(self, key, value):
        """Sets key given value"""
        node = self.find(key)
        if node is not None:
            # modify existing node
            node.value = value
        else:
            # append front
            self.root = LLNode(key, value, self.root)

In [73]:
ll = LinkedList()
ll[5] = "fiveo"
ll[4] = "four"
ll[5] = "five"
print(ll[5], ll[6])

five None
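The `Delete` operation described earlier is not implemented in the class above. A minimal standalone sketch of it follows; `Node`, `delete` and `keys` are hypothetical names chosen so that the block runs on its own, with `Node` mirroring the fields of `LLNode`.

```python
class Node:
    """Singly linked node with the same fields as LLNode."""
    def __init__(self, key, value, next_node=None):
        self.key, self.value, self.next_node = key, value, next_node


def delete(root, key):
    """Unlink the first node holding `key`; return the new root."""
    if root is None:
        return None
    if root.key == key:
        # Deleting the head: the second node becomes the root.
        return root.next_node
    prev = root
    while prev.next_node is not None:
        if prev.next_node.key == key:
            # Bypass the matching node.
            prev.next_node = prev.next_node.next_node
            break
        prev = prev.next_node
    return root


def keys(root):
    """Keys in list order, for inspection."""
    out = []
    while root is not None:
        out.append(root.key)
        root = root.next_node
    return out


root = Node(3, "three", Node(2, "two", Node(1, "one")))
root = delete(root, 2)
print(keys(root))  # [3, 1]
```

Like `find`, this walks the chain once, so it costs $O(l)$ for a chain of length $l$.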


In [65]:
def test_datastructure(ds, size=3000):
    for i in range(size):
        ds[i] = str(i)
    for i in range(size):
        temp = ds[i]

In [81]:
ll = LinkedList()

cProfile.run("test_datastructure(ll)")

         15004 function calls in 1.328 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.007    0.007    1.328    1.328 <ipython-input-65-22a1ad701f4a>:1(test_datastructure)
     6000    1.311    0.000    1.311    0.000 <ipython-input-72-629b29942276>:12(find)
     3000    0.003    0.000    0.003    0.000 <ipython-input-72-629b29942276>:2(__init__)
     3000    0.001    0.000    0.633    0.000 <ipython-input-72-629b29942276>:22(__getitem__)
     3000    0.006    0.000    0.688    0.000 <ipython-input-72-629b29942276>:30(__setitem__)
        1    0.000    0.000    1.328    1.328 <string>:1(<module>)
        1    0.000    0.000    1.328    1.328 {built-in method exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}




In [77]:
class HashTable(object):
    def __init__(self, num_slots):
        """Initializes chained hash table with num_slots slots"""
        self.num_slots = num_slots
        self.table = [ LinkedList() for _ in range(self.num_slots)]
        self.golden = (math.sqrt(5) - 1.0) / 2.0
        
    def multiplication_address(self, key):
        """Multiplication addressing described above"""
        # hash(k) maps any Python object that implements it
        # to a 64-bit integer. It is not necessarily in the 
        # range of our hashtable.
        fraction = (hash(key) * self.golden) % 1.0 
        return math.floor(self.num_slots * fraction)
        
    def __getitem__(self, key):
        relevant_list = self.table[self.multiplication_address(key)]
        return relevant_list[key]

    def __setitem__(self, key, value):
        relevant_list = self.table[self.multiplication_address(key)]
        relevant_list[key] = value

In [78]:
ht = HashTable(2)
ht[5] = "fiveo"
ht[4] = "four"
ht[5] = "five"
print(ht[5], ht[6])

five None


In [79]:
ht = HashTable(500)

cProfile.run("test_datastructure(ht)")

         39004 function calls in 0.066 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.011    0.011    0.066    0.066 <ipython-input-65-22a1ad701f4a>:1(test_datastructure)
     6000    0.009    0.000    0.009    0.000 <ipython-input-72-629b29942276>:12(find)
     3000    0.004    0.000    0.004    0.000 <ipython-input-72-629b29942276>:2(__init__)
     3000    0.003    0.000    0.007    0.000 <ipython-input-72-629b29942276>:22(__getitem__)
     3000    0.009    0.000    0.018    0.000 <ipython-input-72-629b29942276>:30(__setitem__)
     3000    0.005    0.000    0.019    0.000 <ipython-input-77-fd11d219214a>:16(__getitem__)
     3000    0.008    0.000    0.037    0.000 <ipython-input-77-fd11d219214a>:20(__setitem__)
     6000    0.012    0.000    0.017    0.000 <ipython-input-77-fd11d219214a>:8(multiplication_address)
        1    0.000    0.000    0.066    0.066 <string>:1(<module>)
        1    0.000    0.00

### Hash table with one slot (just like linked list + overhead)


In [80]:
ht = HashTable(1)

cProfile.run("test_datastructure(ht)")

         39004 function calls in 1.426 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.007    0.007    1.426    1.426 <ipython-input-65-22a1ad701f4a>:1(test_datastructure)
     6000    1.375    0.000    1.375    0.000 <ipython-input-72-629b29942276>:12(find)
     3000    0.015    0.000    0.015    0.000 <ipython-input-72-629b29942276>:2(__init__)
     3000    0.002    0.000    0.710    0.000 <ipython-input-72-629b29942276>:22(__getitem__)
     3000    0.006    0.000    0.688    0.000 <ipython-input-72-629b29942276>:30(__setitem__)
     3000    0.004    0.000    0.719    0.000 <ipython-input-77-fd11d219214a>:16(__getitem__)
     3000    0.005    0.000    0.699    0.000 <ipython-input-77-fd11d219214a>:20(__setitem__)
     6000    0.008    0.000    0.012    0.000 <ipython-input-77-fd11d219214a>:8(multiplication_address)
        1    0.000    0.000    1.426    1.426 <string>:1(<module>)
        1    0.000    0.00

## Super efficient Python hashtable

In [83]:
# Implemented in C...
cProfile.run("test_datastructure({})")

         4 function calls in 0.004 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.003    0.003    0.003    0.003 <ipython-input-65-22a1ad701f4a>:1(test_datastructure)
        1    0.000    0.000    0.004    0.004 <string>:1(<module>)
        1    0.000    0.000    0.004    0.004 {built-in method exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}




## A fun anecdote from Professor Indyk

As a student in 1999, Professor Indyk was trying to cluster a large set of webpages (i.e. group pages with similar content). He hashed all the words in the sites to reduce space usage. There were two issues.

- His algorithm clustered one of his advisor's home page with some rather shady websites.
- His algorithm was provably correct, with probability of failure less than $10^{-6}$

The implementation was: for a word $x$, compute $h(x) = (ax \mod P) \mod 2^{8}$, with $P = 2^{64}-59$ and a randomly chosen $a$. Additionally, they only used words divisible by $8$ for speed purposes. It turns out that the language they were using computed $ax$ as $ax \mod 2^{64}$ because of the word size. This means the $\mod P$ operation essentially did nothing. So $8 \mid x \implies 8 \mid ax \implies 8 \mid (ax \mod 2^8)$. All $h(x)$ had the 3 lowest-order bits equal to zero! The total range for $h(x)$ was $2^{5}$ values instead of $2^{8}$, causing word collisions.
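The failure can be reproduced in a few lines. This is a sketch under assumptions: Python integers do not overflow, so the 64-bit wraparound is simulated with an explicit `% 2**64`, and the names `h_intended`, `h_buggy` and the seed are made up for illustration.

```python
import random

WORD = 2 ** 64
P = 2 ** 64 - 59


def h_intended(a, x):
    """h(x) = (a*x mod P) mod 2^8 with exact arithmetic."""
    return ((a * x) % P) % 2 ** 8


def h_buggy(a, x):
    """Same formula, but a*x is first truncated to a 64-bit word."""
    wrapped = (a * x) % WORD  # what 64-bit multiplication produces
    return (wrapped % P) % 2 ** 8


a = random.Random(1999).getrandbits(64) | 1  # arbitrary odd multiplier
words = range(8, 8 * 2001, 8)                # only multiples of 8

buggy = {h_buggy(a, x) for x in words}
intended = {h_intended(a, x) for x in words}
print("distinct buggy hashes:   ", len(buggy))
print("distinct intended hashes:", len(intended))
```

The buggy version can produce at most 32 distinct values because every result is a multiple of 8, while the exact version is free to use all 256.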


# Practice



## Duplicate Detection

Given an array $A$ of $n$ integers and an integer $k$, detect if there is an entry $A[i]$ that is equal to one of the $k$ previous entries $A[i-1] \ldots A[i-k]$.
Your algorithm should run in time $O(n)$.
You can assume you have access to a hash function which satisfies the simple uniform hashing assumption (SUHA).

**Example:** Given an array `A=[1,3,5,7,6,5,2]` and $k=4$, the algorithm should output YES since `A[3]=A[6]=5` (using 1-based indexing), and these two entries are only 3 positions apart.
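One possible approach, sketched here rather than given as the official solution: keep the previous $k$ entries in a hash set and slide that window along the array. Python's built-in `set` stands in for the hash table, so each step takes expected $O(1)$ time; `has_near_duplicate` is a hypothetical name and uses 0-based indexing.

```python
def has_near_duplicate(A, k):
    """True if some A[i] equals one of the k entries just before it."""
    window = set()  # holds the (up to) k previous entries
    for i, value in enumerate(A):
        if value in window:
            return True
        window.add(value)
        if i >= k:
            # A[i - k] is now more than k positions behind the next
            # entry. The window has no duplicates, so this removes
            # exactly that entry.
            window.remove(A[i - k])
    return False


print(has_near_duplicate([1, 3, 5, 7, 6, 5, 2], 4))  # True
print(has_near_duplicate([1, 3, 5, 7, 6, 5, 2], 2))  # False
```

Each entry is added and removed at most once, giving expected $O(n)$ time overall and $O(k)$ extra space.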

## Point Lookup
Design a data structure to support the following operations on points in a plane.
You can assume you have access to a hash function which satisfies the simple uniform hashing assumption (SUHA).
Additionally, you can assume you know $n$, an upper bound on the total number of elements to ever be inserted into the structure.
Runtimes can be worst-case or expected time.
Your data structure should use $O(n)$ space.


- **Query($x$)** Of all the points with $x$-coordinate equal to $x$, return the one with the lowest $y$ coordinate. This should run in $O(1)$ time.

- **Insert($x, y$)** Insert the point $(x, y)$ into the structure. This should run in $O(\log{n})$ time.

- **Delete($x, y$)** Remove the point $(x, y)$ from the structure. This should run in $O(\log{n})$ time.