# Hash functions
A good hash function approximately satisfies the assumption of **simple uniform hashing**: each key is **equally likely** to hash to any of the $m$ slots, independently of where any other key has hashed to. The book describes three schemes:
* The division method
* The multiplication method
* Universal hashing, which picks a hash function at random from a family of hash functions

## Interpreting keys as natural numbers
Because most hash functions assume that the universe of keys is the set $\mathbb{N} = \{0,1,2,\dots\}$ of natural numbers, we need a way to interpret keys as natural numbers when they are not. Given a character string:
1. We can represent a single character as an integer according to the [ASCII character set](https://en.wikipedia.org/wiki/ASCII)
2. Using [radix notation](https://en.wikipedia.org/wiki/Radix), we can interpret a string of several characters as a single integer whose digits are the ASCII codes
3. The **base** of the radix notation is ours to choose. For example, the pair $(112,116)$ (the ASCII codes of `p` and `t`) read as a **radix-128** integer equals $112\times128^1+116\times128^0=14452$

### In Python:
* We can convert a character to its ASCII integer with the built-in `ord()` function
* The **radix-128** integer can then be accumulated in a loop

Example code:

In [1]:
def radix_p(s, p=128):  # default base p=128
    """Interpret the string s as a radix-p integer built from its character codes."""
    # Step 1: convert each character to its ASCII code
    to_ascii = [ord(c) for c in s]

    # Step 2: combine the codes into one integer, most significant digit first
    to_radix = 0
    for code in to_ascii:
        to_radix = to_radix * p + code
    return to_radix

example='pt'
radix_p(example)

14452

## The division method: $h(k)=k\: mod\: m$
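A straightforward way to map a key $k$ into one of $m$ slots is to take the remainder of $k$ divided by $m$. Some values of $m$ are nonetheless better than others:
* $m$ should not be a power of 2. If $m=2^p$, then $h(k)$ is just the $p$ lowest-order bits of $k$; unless we know that all low-order $p$-bit patterns are equally likely, the hash should depend on all the bits of the key
* For a similar reason, $m=2^p-1$ should also be avoided when keys are character strings interpreted in radix $2^p$, because permuting the characters of a string does not change its hash value (*Exc. 11.3*)
* Recommendation: **a prime not too close to an exact power of 2**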
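
Both pitfalls in the list above can be checked directly. The sketch below is only an illustration: the helper name `radix128` and the sample keys are assumptions made for this example, not taken from the book.

```python
def radix128(s):
    """Interpret a string as a radix-128 integer (illustrative helper)."""
    value = 0
    for ch in s:
        value = value * 128 + ord(ch)
    return value

# Pitfall 1: with m = 2**p, k % m keeps only the p lowest-order bits of k.
p = 4
m = 2 ** p
keys = [0b10110101, 0b00000101, 0b11110101]  # these differ only in their high bits
low_bits = [k % m for k in keys]
print(low_bits)  # [5, 5, 5]: all three keys land in the same slot

# Pitfall 2: with m = 2**p - 1 and keys read in radix 2**p,
# permuting the characters of a string does not change its hash.
m = 2 ** 7 - 1  # 127, paired with radix 128
print(radix128("pt") % m, radix128("tp") % m)  # 101 101
```

The second collision follows from $128\equiv1 \pmod{127}$: every digit contributes its own value regardless of position, so the hash is just the sum of the character codes modulo 127.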

## The multiplication method: $h(k)=\lfloor m(kA\:mod\: 1)\rfloor$
1. We multiply the key $k$ by a constant $A$ in the range $0<A<1$
2. We extract the **fractional part** of $kA$
3. We multiply the value from step 2 by $m$ and take the floor of the result (i.e. the **largest integer not exceeding it**)

The multiplication method works for any $A$, but some values work better than others. For example, Knuth suggests that $A\approx(\sqrt{5}-1)/2\approx0.618$ is a good choice.
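
As a worked instance of the three steps, take $k=123456$ and $m=16384$ with Knuth's constant (the decimals are rounded, so treat the intermediate values as approximate):

$$kA\approx76300.0041,\qquad kA\:mod\:1\approx0.0041,\qquad h(k)=\lfloor16384\times0.0041\ldots\rfloor=67$$
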
### In Python:
The implementation of these two methods is trivial:

In [11]:
# The division method
def h_div(k, m):
    return k % m  # % is the modulo (remainder) operator
h_div(21, 10)

# The multiplication method
import math
A_knuth = (math.sqrt(5) - 1) / 2  # Knuth's suggested constant, about 0.618
def h_mul(k, m, A=A_knuth):  # A defaults to Knuth's constant
    frac = k * A - int(k * A)  # fractional part of k*A; int(x) equals floor(x) for x >= 0
    return int(frac * m)
h_mul(123456, 16384)

67
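
Floating-point arithmetic limits the precision of `k*A` for large keys. One common alternative, sketched here under assumptions that the notes above do not state, works purely in integers: pick a word size $w$, take $m=2^p$, approximate $A$ by $s/2^w$ for an integer $s$, and read the hash from the top $p$ bits of the low $w$ bits of $k\cdot s$. The name `h_mul_fixed` and the default values ($w=32$, $s=2654435769$, the integer part of $A\cdot2^{32}$) are illustrative choices.

```python
def h_mul_fixed(k, p=14, w=32, s=2654435769):
    """Multiplication method in integer arithmetic, for m = 2**p slots.

    s / 2**w approximates A; the default s is the integer part of A * 2**32
    for A = (sqrt(5) - 1) / 2. All defaults are assumptions of this sketch.
    """
    low_word = (k * s) & ((1 << w) - 1)  # (k*s) mod 2**w: the fractional part of k*A, scaled by 2**w
    return low_word >> (w - p)           # top p bits of that word, i.e. floor(m * fractional part)

print(h_mul_fixed(123456))  # 67, the same slot as h_mul(123456, 16384)
```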

## Universal hashing
The key idea is to avoid the worst-case behavior that any single fixed hash function can be forced into, by **randomising** the choice among a class of hash functions. In universal hashing, the hash function is chosen **at random** at the start of each execution, in a way that is **independent** of the keys that are actually going to be stored. This yields provably good *average-case* performance no matter which keys are chosen.

### By definition:
* $\mathscr{H}$ is a finite collection of hash functions that map a universe $U$ of keys into the range $\{0,1,\dots,m-1\}$
* A hash function $h$ is chosen uniformly at random from $\mathscr{H}$ and applied to two distinct keys $k$ and $l$
* $\mathscr{H}$ is **universal** if, for every such pair, the chance that $h(k)=h(l)$ is **no more than $1/m$**

### Performance:
We can prove that:
* If the key $k$ is not in the table, the expected length of the list $T[h(k)]$ that it hashes to is at most the load factor $\alpha=n/m$
* If the key $k$ is in the table, the expected length of the list $T[h(k)]$ is at most $1+\alpha$

which implies that, with universal hashing and chaining (to resolve collisions), starting from an empty table with $m$ slots, it takes expected time $\Theta(n)$ to handle any sequence of $n$ INSERT, SEARCH, and DELETE operations containing $O(m)$ INSERT operations.

### Designing a universal class of hash functions
1. We choose a prime number $p$ large enough that every possible key $k$ is in the range $0$ to $p-1$.
2. Let $\mathbb{Z}_p$ denote the set $\{0,1,\dots,p-1\}$ and $\mathbb{Z}_p^*$ denote the set $\{1,\dots,p-1\}$.
3. The hash function $h_{ab}$ is then defined as:

<center>
$h_{ab}(k)=((ak+b)\:mod\:p)\:mod\:m$
<br>where $a\in\mathbb{Z}_p^*$, $b\in\mathbb{Z}_p$
</center>

The family of all such hash functions is $\mathscr{H}_{ab}=\{h_{ab}:a\in\mathbb{Z}_p^* \text{ and } b\in\mathbb{Z}_p\}$, which contains $p(p-1)$ hash functions.
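
For a small prime the whole family can be enumerated, which gives a brute-force check of the $1/m$ collision bound. This is only an illustrative sketch: the values $p=17$, $m=5$ and the helper name `collision_fraction` are assumptions, not taken from the book.

```python
from itertools import combinations

def collision_fraction(k, l, p, m):
    """Fraction of the p*(p-1) functions h_ab under which keys k and l collide."""
    collisions = 0
    for a in range(1, p):        # a ranges over Z_p^* = {1, ..., p-1}
        for b in range(p):       # b ranges over Z_p   = {0, ..., p-1}
            if ((a * k + b) % p) % m == ((a * l + b) % p) % m:
                collisions += 1
    return collisions / (p * (p - 1))

p, m = 17, 5
# Worst case over every pair of distinct keys in {0, ..., p-1}
worst = max(collision_fraction(k, l, p, m) for k, l in combinations(range(p), 2))
print(worst <= 1 / m)  # True: no pair collides under more than a 1/m share of the family
```
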
#### In Python:
* Universal hashing can be implemented by drawing $a$ and $b$ with [numpy.random.choice](https://het.as.utexas.edu/HET/Software/Numpy/reference/generated/numpy.random.choice.html) as follows:

In [23]:
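import numpy as np
def h_ab(k, p, m):
    # Draw a uniformly from {1,...,p-1} and b uniformly from {0,...,p-1}.
    # Note: a and b are redrawn on every call, so this shows a single random
    # member of the family; a hash table must draw them once and reuse them.
    a = np.random.choice(np.arange(1, p))
    b = np.random.choice(p)
    print('a:', a)
    print('b:', b)
    return ((a * k + b) % p) % m
h_ab(k=8, p=17, m=5)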

a: 7
b: 4


4
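
In a hash table the pair $(a,b)$ has to be drawn once and then reused for every key; otherwise a key inserted under one function could not be found again under another. The sketch below shows one possible way to combine the family $h_{ab}$ with chaining. The class name `UniversalChainedTable` and its methods are hypothetical, and it uses the standard-library `random` module in place of `numpy.random.choice`.

```python
import random

class UniversalChainedTable:
    """Chained hash table whose hash function h_ab is drawn once at construction."""

    def __init__(self, p, m, rng=random):
        self.p, self.m = p, m
        self.a = rng.randrange(1, p)   # a in {1, ..., p-1}
        self.b = rng.randrange(0, p)   # b in {0, ..., p-1}
        self.slots = [[] for _ in range(m)]  # one chain (list) per slot

    def _h(self, k):
        return ((self.a * k + self.b) % self.p) % self.m

    def insert(self, k):
        chain = self.slots[self._h(k)]
        if k not in chain:
            chain.append(k)

    def search(self, k):
        return k in self.slots[self._h(k)]

    def delete(self, k):
        chain = self.slots[self._h(k)]
        if k in chain:
            chain.remove(k)

table = UniversalChainedTable(p=17, m=5)
for key in (3, 8, 11, 16):
    table.insert(key)
print(table.search(8), table.search(9))  # True False
table.delete(8)
print(table.search(8))                   # False
```

Because every operation goes through the same `_h`, the results of `search` do not depend on which $(a,b)$ was drawn; only the chain lengths do.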