### Applying Hashing: Substring Search

- In every browser, you can do a `Find` to get to the text you want very quickly, even if the webpage is large. We will explore how we can implement an algorithm that does this!

- Problem: Given a text $T$ and string $P$, find all occurrences of $P$ in $T$
    - Substring notation: Let $S[i...j]$ be the substring of $S$ starting at position $i$ and ending at position $j$ (both inclusive)
        - S = 'hashing'
        - S[0...3] = 'hash'
        - S[4...6] = 'ing'
        - S[2...5] = 'shin'
    - Inputs: Strings $T$, $P$
    - Output: All positions $i$ in $T$ with $0 \le i \le |T| - |P|$ such that $T[i ... (i + |P| - 1)] = P$
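
The inclusive notation $S[i...j]$ maps onto Python's half-open slices as `S[i:j+1]`. A quick sketch checking the three examples above (the helper name `substring` is made up purely for illustration):

```python
def substring(s, i, j):
    # Inclusive S[i...j] corresponds to the half-open Python slice s[i:j+1]
    return s[i:j + 1]

S = 'hashing'
print(substring(S, 0, 3))  # hash
print(substring(S, 4, 6))  # ing
print(substring(S, 2, 5))  # shin
```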

In [None]:
def naive_find_string(string, substring):
    '''
    Time complexity: O(N * M), 
        - You search every position in string with length N
        - At each position, you go through every character to check for the substring of length M 
    '''

    substring_len = len(substring)
    positions_found = []
    for i in range(len(string) - substring_len + 1):
        if string[i:i+substring_len] == substring:
            positions_found.append(i)
    return positions_found
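
As a sanity check for any implementation, the same positions can be collected with Python's built-in `str.find`, advancing one character at a time so that overlapping occurrences are kept. This is a sketch using a different technique from the cell above, and the name `find_all_builtin` is hypothetical.

```python
def find_all_builtin(string, substring):
    # Collect every start index of `substring` in `string`, including overlaps
    positions = []
    start = string.find(substring)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping matches are not skipped
        start = string.find(substring, start + 1)
    return positions

print(find_all_builtin('abracadabra', 'abra'))  # [0, 7]
print(find_all_builtin('aaaa', 'aa'))           # [0, 1, 2]
```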

### Lousy Rabin-Karp Algorithm

- A more efficient implementation of this substring search using hashing!
- Recall how we previously solved it
    - We want to compare some pattern $P$ against all substrings $S$ of a string $T$, where every substring has length $|P|$ 

- Notes
    - Assume a hash function $h()$
    - If $h(P) \neq h(S)$, then $P \neq S$
    - If $h(P) = h(S)$, then check whether $P = S$
    - Use polynomial hash family `Polyhash` discussed in section 2, with some large prime number $\mathbb{p}$
    - The idea is that if $P \neq S$, then the probability of a hash collision is small: $\Pr[h(P) = h(S)] \le \frac{|P|}{\mathbb{p}}$
        - That is, the collision probability is bounded by the length of the pattern divided by the value of the prime number chosen
    - So if we choose a large prime number $\mathbb{p}$, we will almost never get a false match between two different strings!

    - For a given string with length $|T|$ and pattern with length $|P|$, we search $|T| - |P| + 1$ positions
    - Hence, by the union bound, the probability of at least one collision (and also the expected number of collisions) is at most $(|T| - |P| + 1) \cdot \frac{|P|}{\mathbb{p}}$
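
To get a feel for the size of this bound, the sketch below plugs in some illustrative numbers. The text length, pattern length, and the helper name `false_positive_bound` are assumptions chosen for the example. The two primes are the default used by `PolyHash` and the smaller one hard-coded in `RabinKarpEfficient`.

```python
def false_positive_bound(text_len, pattern_len, prime):
    # Upper bound (|T| - |P| + 1) * |P| / p on the expected number of collisions
    windows = text_len - pattern_len + 1
    return windows * pattern_len / prime

# A text of one million characters and a pattern of 100 characters
print(false_positive_bound(10**6, 100, 1_000_000_007))  # roughly 0.1
print(false_positive_bound(10**6, 100, 32_452_843))     # roughly 3.1
```

With the smaller prime the bound exceeds 1, so it no longer says anything useful as a probability, although it still bounds the expected number of false positives. This is why a large prime matters for long texts.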

In [1]:
import numpy as np
big_prime_number = 2147462143

def PolyHash(string, polynomial=10, prime=1000000007):
    string_list = list(string)[::-1]
    hash_val = 0
    for char in string_list:
        hash_val = ((hash_val * polynomial) + ord(char)) % prime
    return hash_val

def RabinKarp(string, pattern, prime):
    '''
    This is a poorly implemented RabinKarp algorithm, and serves only to illustrate the idea. It is poor, because it does not actually improve 
    on the naive approach! We see that it is still O(N*M) time complexity

    Time complexity: O(M) + O((N-M+1) * M) + O(q * M) ~ O(N * M)
        - O(M) from hashing pattern
        - O(N-M+1) for the number of loops we need to run, and multiply this by M because in each loop, we hash a substring with length M
            - This simplifies to O(N*M) on the basis that N*M >= M*M >= M 
        - There is the strange term O(q * M)
            - This comes about because we must account for the situation where the hash computes to the same value, and we need to compare the actual 
            pattern with the substring. Remember, this comparison takes O(M) time, where M is the length of the pattern
            - Let's suppose there are q occurrences of the pattern in the string
            - And we know from the analysis above that hash collisions happen at most (|T| - |P| + 1) . |P|/p times in expectation
            - If p is large, this term is almost 0 (i.e. collisions unlikely)
            - So the total times we incur this cost reduces to q
            - Hence, the q * M term
            - Since q is probably going to be smaller than N, we assume it drops out in the comparison
    Space complexity: O(N) for `positions` array
    '''
    polynomial = np.random.randint(1, prime)
    positions = []

    ## Compute hash for pattern: O(M)
    pattern_hash = PolyHash(pattern, polynomial, prime) 
    pattern_len = len(pattern)

    ## Looping over length of string - length pattern + 1: O(N - M + 1)
    ## In each loop, compute the hash for substring with the same length as pattern: O(M)
    ## Total: O((N-M+1) * M) ~ O(N*M)
    for i in range(len(string)-pattern_len+1):
        string_hash = PolyHash(string[i:(i+pattern_len)], polynomial, prime)
        if string_hash != pattern_hash:
            continue
        else:
            if string[i:(i+pattern_len)] == pattern:
                positions.append(i)
            continue
    return positions
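
The verification step is what keeps the answer correct even when hashes collide. The sketch below is a self-contained illustration with hypothetical helper names, and it assumes a fixed multiplier instead of a random one so that the run is repeatable. It separates true matches from collisions, first with a deliberately tiny prime and then with a large one.

```python
def poly_hash(s, x, p):
    # Same polynomial hash as above: sum of ord(s[k]) * x**k, modulo p
    h = 0
    for ch in reversed(s):
        h = (h * x + ord(ch)) % p
    return h

def classify_hash_hits(text, pattern, x, p):
    # Split the positions whose hash equals the pattern's hash into
    # genuine matches and false positives (hash collisions)
    m = len(pattern)
    target = poly_hash(pattern, x, p)
    matches, collisions = [], []
    for i in range(len(text) - m + 1):
        window = text[i:i + m]
        if poly_hash(window, x, p) != target:
            continue
        if window == pattern:
            matches.append(i)
        else:
            collisions.append(i)
    return matches, collisions

text, pattern = 'abracadabra', 'abra'
small = classify_hash_hits(text, pattern, x=31, p=2)
large = classify_hash_hits(text, pattern, x=31, p=1_000_000_007)
print(small)
print(large)
```

Both runs report the same matches. Only the number of wasted O(M) string comparisons differs.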

### Better Rabin Karp

- In the previous implementation of Rabin-Karp algorithm, we incur a huge cost from computing the hash of each substring. But there is actually a way to optimise this!!

- The trick here is that `PolyHash` can actually be written as a recurrence relation. Let's show this:
    - Let the string to be hashed be $T$
    - We want to check all substrings in $T$ with lengths matching the pattern $P$; that is, check $T[i:i+|P|]$ for every value $0 \le i \le |T| - |P|$
    - For each position $i$, representing string $T[i: i+|P|]$, we store the hash value in the i-th value of an array, $H[i]$ 
    - For polyhash, $H[i] = \sum_{z=i}^{i+|P|-1} (T[z] \cdot x^{z-i}) \mod p$
    - For polyhash, $H[i+1] = \sum_{z=i+1}^{i+|P|} (T[z] \cdot x^{z-i-1}) \mod p$
    - Rewriting $H[i]$...

$$\begin{aligned}
    H[i] &= \sum_{z=i}^{i+|P|-1} (T[z] \cdot x^{z-i}) \mod p \\
    &= [\sum_{z=i+1}^{i+|P|} (T[z] \cdot x^{z-i}) + T[i] - T[i + |P|] \cdot x^{|P|}] \mod p \\
    &= [x \cdot \sum_{z=i+1}^{i+|P|} (T[z] \cdot x^{z-i-1}) + T[i] - T[i + |P|] \cdot x^{|P|}] \mod p \\
    &= [x \cdot H[i+1] + T[i] - T[i + |P|] \cdot x^{|P|}] \mod p \\
\end{aligned}$$

- By rewriting $H[i]$ using $H[i+1]$, notice that:
    - $x^{|P|}$ is computed once
    - $x$, $T[i]$, $T[i+ |P|]$ are all known values
    - And if $H[i+1]$ is known, then $H[i]$ can be computed in $O(1)$!!
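
A small numeric check of the recurrence, under the illustrative assumptions $x = 10$, $p = 1000000007$, window length 3, and a made-up text. `direct_hash` is a hypothetical helper that evaluates the sum straight from the definition.

```python
def direct_hash(text, i, m, x, p):
    # H[i] computed straight from the definition: sum of T[z] * x**(z - i)
    return sum(ord(text[z]) * pow(x, z - i, p) for z in range(i, i + m)) % p

text, m, x, p = 'rolling', 3, 10, 1_000_000_007
x_pow_m = pow(x, m, p)  # x^{|P|}, computed once

for i in range(len(text) - m - 1, -1, -1):
    h_next = direct_hash(text, i + 1, m, x, p)
    # Recurrence: H[i] = x * H[i+1] + T[i] - T[i+|P|] * x^{|P|}  (mod p)
    h_rec = (x * h_next + ord(text[i]) - ord(text[i + m]) * x_pow_m) % p
    assert h_rec == direct_hash(text, i, m, x, p)
print('recurrence holds for every window')
```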

- Let's implement this efficient Rabin-Karp algorithm
    - We simply make use of the recurrence relation to compute all values of $H[i]$
    - Once we have the hash in every position, comparing a substring's hash against the pattern's hash takes $O(1)$ time, since no hash needs to be recomputed; the full $O(|P|)$ string comparison is only needed when the two hashes match

In [6]:
def polyhash(string, polynomial=10, prime=1000000007):
    '''
    Time complexity: O(N) where N is length of `string`
    '''
    reversed_string_list = list(string)[::-1]
    hash_val = 0
    for char in reversed_string_list:
        hash_val = ((hash_val * polynomial) + ord(char)) % prime
    return hash_val

def precompute_hash(string, pattern, prime, polynomial):
    '''
    Time complexity: O(M + M + N - M) = O(M+N) 
        - O(M) from computing the hash of the last valid substring, which has the same length as the pattern
        - O(M) from computing x^{length of pattern}
        - O(N - M) for looping over all remaining valid substrings
        - Therefore, the total is O(M+N)
    '''
    pattern_len = len(pattern)
    string_len = len(string)
    
    hash_arr = [None] * (string_len - pattern_len + 1)
    last_valid_substring = string[(len(string)-pattern_len):(len(string))]

    ## Compute hash of last substring: O(M)
    hash_arr[string_len - pattern_len] = polyhash(last_valid_substring, polynomial, prime)

    ## Compute x^{pattern length}: O(M)
    x_power_p = 1
    for _ in range(pattern_len):
        x_power_p = (x_power_p * polynomial) % prime

    ## Loop over remaining substrings: O(N - M)
    for i in range(string_len - pattern_len - 1, -1, -1):
        hash_arr[i] = (polynomial * hash_arr[i+1] + ord(string[i]) - x_power_p * ord(string[i+pattern_len])) % prime
 
    return hash_arr

def RabinKarpEfficient(string, pattern):
    '''
    Time complexity: 
        - O(M) from computing pattern's hash
        - O(N+M) for running precompute hash
        - O(N-M+1) for looping over all possible substrings
            - For each iteration of the loop, if the hashes match, we incur a cost to compare the substring and pattern: O(M) 
            - If hashes don't match, then it is just O(1)
            - Let's suppose we have `q` matches in total. Then the total work done is O((N-M+1) + q*M). 
                - If we assume q is small, then the typical running time is O(N-M+1)
                - q has maximum value N-M+1, so it is possible that work done becomes O((N-M+1) + (N-M+1)*M) = O(N*M) in the worst case
    Space complexity:
        - O(N) from storing positions, and storing array of precomputed hashes
    '''
    prime = 32452843
    polynomial = np.random.randint(1, prime)
    positions = []

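    ## A pattern longer than the string cannot occur in it
    if len(pattern) > len(string):
        return positions

    ## Compute pattern hash: O(M)
    pattern_hash = polyhash(pattern, polynomial, prime)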

    ## Precompute all substring hash: O(M+N)
    all_substring_hash = precompute_hash(string, pattern, prime, polynomial)

    ## Loop over N-M+1 entries in substring hashes, and check for hash equality: O(N-M+1)
    for i in range(len(string)-len(pattern)+1):
        if all_substring_hash[i] != pattern_hash:
            continue
        if string[i:(i+len(pattern))] == pattern:
            positions.append(i)
    
    return positions
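
Since each $H[i]$ depends only on $H[i+1]$, the array of precomputed hashes is not strictly needed. The sketch below is a variation on the cell above, not a replacement for it: it keeps only the current window's hash, uses the standard library's `random` instead of NumPy, and returns an empty list for an empty pattern (a design choice made here for simplicity). The function name and the default prime are assumptions.

```python
import random

def rabin_karp_rolling(string, pattern, prime=1_000_000_007):
    # Keep only the current window hash instead of an array of all hashes
    n, m = len(string), len(pattern)
    if m == 0 or m > n:
        return []
    x = random.randint(1, prime - 1)

    def poly_hash(s):
        h = 0
        for ch in reversed(s):
            h = (h * x + ord(ch)) % prime
        return h

    pattern_hash = poly_hash(pattern)
    window_hash = poly_hash(string[n - m:])  # hash of the last window
    x_pow_m = pow(x, m, prime)

    positions = []
    for i in range(n - m, -1, -1):
        if i < n - m:
            # Roll the window one step left using the recurrence for H[i]
            window_hash = (x * window_hash + ord(string[i])
                           - x_pow_m * ord(string[i + m])) % prime
        # Verify with a real comparison so collisions never reach the output
        if window_hash == pattern_hash and string[i:i + m] == pattern:
            positions.append(i)
    positions.reverse()  # windows were visited right to left
    return positions

print(rabin_karp_rolling('abracadabra', 'abra'))  # [0, 7]
print(rabin_karp_rolling('aaaa', 'aa'))           # [0, 1, 2]
```

Apart from the returned positions, this uses O(1) extra space. The time analysis is the same as for `RabinKarpEfficient`.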