- Worst case = O(n)
- Average case = O(1)
- Hashing = mapping keys to memory locations
- This is typically achieved by converting the key to a number and taking the remainder of the table size
- Hashing is mixing and recombining the value in some way such that it creates as unique a value as possible
- Universal hashing
- h(k) = [ ( ak + b ) mod p ] mod m
- m is table size
- a and b are random integers [0, p)
- p is prime number > universe of k ==> large prime
- Hash Table
- A generalized array whose indexes can be selected based on a hash of the value stored in that index
- Direct addressing = Direct access table = using an array’s index as the key
- Applicable when we can afford to allocate an array with one position for every possible key
- Hashing allows us to map keys to values
- Useful when the number of possible keys is large compared to the actual number of keys
- When table is full, double table size, like array
- When number of elements shrinks to size/4, shrink table to size/2 ==> avoids repeated O(n) operations when adding/removing around the doubling/halving number of elements
- Hash Functions
- Used to transform the key into the index
- Ideally should map each possible key to a unique slot index, but difficult to achieve in practice
- An efficient hash function should be designed so that it distributes the index values of inserted objects uniformly across the table
- Must be calculated quickly, minimize collisions, and efficiently resolve collisions
- Collision = when to records are stored in the same location
- Collision Resolution Techniques
- Chaining = the value at each index is a linked list, which allows collisions to be resolved by having several elements in a single index
- Open Addressing
- in notes below,
m
indicates the table size - If a collision occurs, the value is stored at a different index
- Insert - If initial hash index is filled, probe for a new index until empty or DeleteMe index is found, then insert value in table
- Search - If value at hash index is not desired key, probe for a new index until desired key or None is found
- Delete - If value at hash index is not desired key, probe for a new index until desired key is found, then delete the value at the table and replace with a DeleteMe flag, which indicates that it is empty at present
- Linear probing = stored at the next open index
- h(k, i) = (h'(k) + i) % m
- may result in poor distribution throughout the table (cluster)
- Quadratic probing = stored in a subsequent index, progressing in a quadratic manner
- h(k, i) = (h'(k) + c1i + c2i^2) % m
- may result in a better distribution throughout the table as keys get spread out
- Double hashing = use a primary hash function for initial index, then combine with a second hash function for the probing sequence
- h(k, i) = (h1(k) + i*h2(k)) % m
- h2(k) and m must be relatively prime ==> this can be achieved by making m a power of 2 and making h2 always return an odd number
- Expected number of probes = 1 / (1 - load factor)
- in notes below,
- Load factor = ( # elements in hash table ) / ( size of hash table)
- If average number of elements in an index > load factor, rehash elements with a bigger hash table size
- Bloom Filter
- Probabilistic data structure which was designed to check whether an element is present in a set with memory and time efficiency
- Tells us that element is either NOT in set or MAY BE in set
- Initialize an array of all zeros
- When an element is added, use k hash functions to hash the element itself
- Set the array index to 1 for each of the k indexes yielded by the hash functions
- If an element is queried and the indexes yielded by the hash functions are not all 1, then the element DEFINITELY HAS NOT been added
- If they ARE all, then it is likely that it has been added but they all may have had collisions
- Implements rolling hash ADT with O(1) operations
r
maintains an internal stringx
- Treat
x
as multi digit numberu
in base a (a = size of alphabet — ASCII)- This can be done because the string
x
is an array of its characters, which can be represented as an array of ASCII integers - For example, for
a = 10
,x = [ 1, 2, 3, 4, 5 ] = 1*10^4 + 2*10^3 + 3*10^2 + 4*10^1 + 5*10^0
- This is the number that will then be used as input to the hash function
- This can be done because the string
- This is advantageous because when rolling, only two O(1) operations must be performed: append new element to end & remove first element
- To achieve this:
- Add element to end: multiply
x[ i+1 : ]
bya
and add the new element - Remove element from front:
x_i - a ^ (L-1) * x[i]
- Generalized:
h(x_i+1) = a * ( h(x_i) - a ^ ( L-1 ) * x[i] ) + x[ i + L ] mod m
, where L is the length of s
- Add element to end: multiply
- To achieve this:
- Treat
r.append(char)
= add char to end ofx
r.append(char)
= delete first char fromx
(assuming it ischar
)r()
= hash value of x = h(x)
# looking for substring s in corpus t
# initialize rolling hash for s = rs
for c in s:
rs.append(c)
# initialize rolling hash for t = rt
for c in t[:len(s)]:
rt.append(t)
# check equality
if rs() == rt():
# substring of t MAY match s
# check whether s == t[ i-len(s)+1 : i+1 ]
if s == t[ i-len(s)+1 : i+1 ]:
return True
# rs and rt not initially equal
# roll through and look for equality
for i in range(len(s), len(t)):
rt.skip(t[i-len(s)])
rt.append(t[i])
if rs() == rt():
# substring of t MAY match s
# check whether s == t[ i-len(s)+1 : i+1 ]
if s == t[ i-len(s)+1 : i+1 ]:
return True
else:
# happens with probability <= 1/s
pass