## Main

- There are 3 ways to find the first occurrence of `needle` in `haystack`: the naive "iterate through the string" algorithm, the Knuth-Morris-Pratt (KMP) algorithm, and the Rabin-Karp algorithm

- We will implement all of these in this example:

- Brute force
    - Iterate through the `haystack` in $O(N)$ time
    - If we find a matching first character of `needle` in `haystack`, then iterate through `needle` in $O(M)$
    - In total, we take $O(N \cdot M)$ time, and $O(1)$ space

    - Let's first try to understand where the "duplicative" work occurs in this brute force approach.
    - Let's take `haystack = 'aaabaaa'` and `needle = 'aaaa'`
    - Start at index 0
        - We find that indices 0 to 2 of `haystack` match `needle`
        - At index 3, there is a mismatch (`'b' != 'a'`)
    - Using the brute force algorithm, the inner loop exits, and we move on to index 1 and check against `needle` again
        - But wait: we already know that `haystack[3] = 'b'`, and `needle` contains no `'b'`, so a match starting at index 1 can never get past index 3!
        - This check is wasted work, because it re-reads characters we have already examined

    - Removing this duplicated work is where the efficiency gains come from
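
To make the wasted work concrete, here is a minimal sketch that counts the character comparisons the brute-force scan makes on the example above. The helper name `count_comparisons` is hypothetical and exists only for this illustration. Unlike the implementation further down, it stops at the last start position that still leaves room for the whole `needle`.

```python
def count_comparisons(haystack: str, needle: str) -> tuple:
    """Brute-force search that also counts character comparisons.

    Returns (index of the first match or -1, number of comparisons made).
    """
    comparisons = 0
    n, m = len(haystack), len(needle)
    for start in range(n - m + 1):
        matched = 0
        while matched < m:
            comparisons += 1
            if haystack[start + matched] != needle[matched]:
                break  # mismatch: abandon this start position
            matched += 1
        if matched == m:
            return start, comparisons
    return -1, comparisons


print(count_comparisons("aaabaaa", "aaaa"))  # (-1, 10)
```

Start positions 0 to 3 cost 4, 3, 2, and 1 comparisons respectively, and every one of them ends at the same `'b'` at index 3. Only the first of those attempts tells us anything new.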

- KMP
    - KMP reduces the work done by implementing a **preprocessing** step on the pattern

    - Imagine I have a pattern `p = ABABC`, and I want to find this pattern in string `s = ABABABC`
        - We compare from the starting index, similar to the brute force method
        - The characters at indices 0 to 3 all match. But now, `p[4] = "C" != s[4] = "A"`
        - Now, the brute force method would require us to reset the pointers to the start of the pattern, and try matching from index 1 of the string. 
            - So now we will try to match `p[0]` against `s[1]` etc...
        
        - But wait. I already know that I successfully matched `p[2:4] = "AB"` with `s[2:4] = "AB"`
            - From the failure to match `p[4]` and `s[4]`, we know that the current alignment, where `s[2:4]` lines up with `p[2:4]`, cannot produce a match
            - But since "AB" is repeated, `s[2:4]` can match `p[0:2]`!
            - So the idea is:
                - at each point of `p`, either we find a match and we can return, or at some point we run into a mismatch
                - When we run into a mismatch, we don't necessarily need to restart from the start of `p`, because there could be patterns in `p` that are repeated
            - Case in point here: 
                - At `p[4] = "C"`, we find a mismatch with `s[4] = "A"`
                - But instead of backtracking to `p[0]`, make use of the fact that `s[2], s[3]` matches `p[0], p[1]`
                - So compare `s[4]` and `p[2]`, and find that they match!
                - Then `s[5]` and `p[3]` match
                - This continues until the end of `p`, and we find that the pattern matches successfully
        - Notice, in this way, I never actually backtrack in the main string `s`

    - I'd like a way to know where I can "backtrack" to when I fail to match a particular index in `p`
        - Let's take the same example: `p = ABABC`, `s = ABABABC`
        - Let's start matching from index 0 of both `p` and `s`
            - `A == A`, `B == B`, ...
            - Up till index 4, where `C != A`
            - Since I fail to match at index 4, I want to know, for the part matched so far (up to and including index 3, i.e. `ABAB`), what is the longest prefix of `p` that is also a suffix of `ABAB`, excluding `ABAB` itself
                - **THAT IS: WHAT IS THE LONGEST PROPER SUFFIX THAT ALSO MATCHES A PREFIX?**
            - Clearly, it is `AB` (i.e. the `AB` in positions 0 and 1 matches the `AB` in positions 2 and 3)
            - So the longest prefix-suffix match has length 2
            - So index 4's match has failed, and we know that the preceding `AB` is not the `AB` in positions 2 and 3 of `p`
            - BUT from the longest prefix-suffix, we know that the `AB` that we just saw can be in positions 0 and 1 of `p`
        - So we continue to match index 4 of `s`, against index 2 of `p`
        - And we are able to find the full match!

    - For the above strategy to work, we need a way to find out the `longest_prefix_suffix` (**LPS**) at each possible index of `p`
        - How do we find LPS?
        - Simple! 
            - Let `p = AABAAAC`
            - Init an array of size `len(p)`, which will be our LPS array. In this case, `lps = [0,0,0,0,0,0,0]`
            - Since there is no "prefix-suffix" when looking at only the first character, we start iterating from index 1. So loop over `p` from index 1, with pointer `i`
            - Let's init another pointer `j` that points to the start of `p` at index 0. 
            - Procedure
                - Now, `i=1` and `j=0`
                    - `p[1] == p[0]`, so `lps[1] = j + 1 = 1` (the matched prefix now has length `j + 1`)
                    - `lps = [0,1,0,0,0,0,0]`
                    - `i += 1`, `j += 1`
                - Now, `i=2` and `j=1`
                    - `p[2] != p[1]`, and `j != 0` 
                    - Set `j = lps[j-1] = lps[0] = 0`
                    - `lps = [0,1,0,0,0,0,0]`
                - Now, `i=2` and `j=0`
                    - `p[2] != p[0]`, and `j == 0`
                    - Set `lps[2] = 0` 
                    - `lps = [0,1,0,0,0,0,0]`
                    - `i += 1`
                - Now, `i=3` and `j=0`
                    - `p[3] == p[0]`, so `lps[3] = j + 1 = 1`
                    - `lps = [0,1,0,1,0,0,0]`
                    - `i += 1`, `j += 1`                
                - Now, `i=4` and `j=1`
                    - `p[4] == p[1]`, so `lps[4] = j + 1 = 2`
                    - `lps = [0,1,0,1,2,0,0]`
                    - `i += 1`, `j += 1`
                - Now, `i=5` and `j=2`
                    - `p[5] != p[2]`, and `j != 0`
                    - Set `j = lps[j-1] = lps[1] = 1`
                    - `lps = [0,1,0,1,2,0,0]`
                - Now, `i=5` and `j=1`
                    - `p[5] == p[1]`, so `lps[5] = j + 1 = 2`
                    - `lps = [0,1,0,1,2,2,0]`
                    - `i += 1`, `j += 1`
                - Now, `i=6` and `j=2`
                    - `p[6] != p[2]`, and `j != 0`
                    - Set `j = lps[j-1] = lps[1] = 1`
                    - `lps = [0,1,0,1,2,2,0]`
                - Now, `i=6` and `j=1`
                    - `p[6] != p[1]`, and `j != 0`
                    - Set `j = lps[j-1] = lps[0] = 0`
                    - `lps = [0,1,0,1,2,2,0]`
                - Now, `i=6` and `j=0`
                    - `p[6] != p[0]`, and `j == 0`
                    - So `lps[6] = 0`
                    - `lps = [0,1,0,1,2,2,0]`
            - Why does this procedure work?
                - Let `p = AABAAAC`
                - Now let's fast-forward to the step where `i=5` and `j=2`
                - To get to `j=2`, it implies that `p[4]` matches `p[1]` and `p[3]` matches `p[0]`
                - So if a mismatch occurs at `i=5` and `j=2`, the prefix-suffix of length 2 cannot be extended, i.e. `p[4]` cannot play the role of `p[1]`
                - BUT `p[4]` can certainly play the role of `p[0]`, because we already established that `p[4]` matches `p[1]`, and from the earlier step (`lps[1] = 1`) we established that `p[1]` matches `p[0]`!!
                - So if `p[4]` matches `p[0]`, then we next check whether `p[5]` matches `p[1]`
                - This is the same comparison we make by following the procedure of setting `j = lps[j-1] = lps[1] = 1`
                
    - Time complexity of KMP: $O(N+M)$ 
        - $N$ from iterating over string
        - $M$ from building the LPS array
    - Space complexity of KMP: $O(M)$ for the LPS array, which has one entry per character of the pattern (length $M$)
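
As a cross-check on the KMP notes above, here is a small sketch that computes the LPS array straight from its definition (slow, but easy to trust) and then uses it in a search that records which index of the main string is examined at each comparison. The names `lps_by_definition` and `kmp_trace` are hypothetical helpers for this illustration, and the search assumes a non-empty pattern.

```python
def lps_by_definition(pattern: str) -> list:
    """lps[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of pattern[:i+1]. Deliberately brute force."""
    lps = []
    for end in range(1, len(pattern) + 1):
        prefix = pattern[:end]
        best = 0
        for length in range(1, end):  # "proper": strictly shorter than prefix
            if prefix[:length] == prefix[-length:]:
                best = length
        lps.append(best)
    return lps


def kmp_trace(s: str, p: str) -> tuple:
    """KMP search (non-empty p) that records the index of s at each comparison.

    Returns (index of the first match or -1, list of examined indices of s).
    """
    lps = lps_by_definition(p)
    examined = []
    i = j = 0
    while i < len(s):
        examined.append(i)
        if s[i] == p[j]:
            i += 1
            j += 1
            if j == len(p):
                return i - j, examined
        elif j == 0:
            i += 1  # nothing matched yet, so move on in s
        else:
            j = lps[j - 1]  # fall back in p only; i stays where it is
    return -1, examined


print(lps_by_definition("AABAAAC"))  # [0, 1, 0, 1, 2, 2, 0]
print(kmp_trace("ABABABC", "ABABC"))
```

The first line reproduces the LPS array from the walkthrough. In the trace for `s = ABABABC`, index 4 is examined twice (once against `p[4]`, once against `p[2]` after the fallback), but the examined indices never decrease, which is the "no backtracking in `s`" property.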

- Rabin Karp
    - This is the "rolling hash" matching algorithm
    - Basically instead of matching character by character, we hash the pattern 
    - Then, slide a rolling window across `haystack` to compute the hash at each window
    - Let's assume we can compute the hash of each window in $O(1)$ time and that hash collisions are rare; then the whole operation becomes $O(N+M)$ on average
        - $N$ to slide the window across the whole of the `haystack`
        - $M$ to hash the whole of the pattern
    
    - How can we hash in $O(1)$ time? (I'm assuming no collisions)
        - Use `Polyhash`
            - Assume letters 'A' to 'Z' map to 1 - 26
            - At each position `i` (0-indexed), multiply the value of the letter by $10^{\text{pattern length} - 1 - i}$
            - For example, $\text{Polyhash("ABC")} = 1 \cdot 10^{2} + 2 \cdot 10^{1} + 3 \cdot 10^{0} = 123$
            - To avoid overflow, take the result modulo a large prime such as `1e9+7` or `1e9+9`

        - How does this help us compute hash in $O(1)$?
            - Imagine string "ABCD"
            - We computed $\text{Polyhash("ABC")} = 123$ above
            - To slide the window 1 to the right (from "ABC" to "BCD"), drop the leading `A` term, shift by one digit, and add `D`: $(123 - 1 \cdot 10^2) \cdot 10 + 4 = 234$
            - This is a constant time operation!!
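
The rolling update can be sketched in isolation as follows. This is a minimal illustration under the assumptions of the notes above (uppercase letters mapped to 1 to 26, base 10, modulus `1e9+7`). The names `letter_value`, `polyhash`, and `roll` are hypothetical and are not the ones used in the implementation below.

```python
MOD = 1_000_000_007  # large prime, as suggested in the notes
BASE = 10            # base 10 keeps the numbers readable


def letter_value(ch: str) -> int:
    """Map 'A'..'Z' to 1..26 (assumption taken from the notes)."""
    return ord(ch) - ord("A") + 1


def polyhash(window: str) -> int:
    """Polynomial hash: the first character gets the highest power of BASE."""
    h = 0
    for ch in window:
        h = (h * BASE + letter_value(ch)) % MOD
    return h


def roll(prev_hash: int, outgoing: str, incoming: str, window_len: int) -> int:
    """Slide the window one step right using only arithmetic on the old hash."""
    high = pow(BASE, window_len - 1, MOD)  # weight of the outgoing character
    h = (prev_hash - letter_value(outgoing) * high) % MOD
    return (h * BASE + letter_value(incoming)) % MOD


h_abc = polyhash("ABC")
h_bcd = roll(h_abc, outgoing="A", incoming="D", window_len=3)
print(h_abc, h_bcd, polyhash("BCD"))  # 123 234 234
```

In a real search loop, `high` would be computed once outside the loop so that each slide is a handful of arithmetic operations. Python's `%` always returns a non-negative result for a positive modulus, so the subtraction needs no extra correction here.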

```python
class Solution:
    def strStr_bruteforce(self, haystack: str, needle: str) -> int:
        # Try every start position that leaves room for the whole needle.
        for haystack_index in range(len(haystack) - len(needle) + 1):
            needle_curr = 0
            # Advance while the characters keep matching.
            while (needle_curr < len(needle)
                   and haystack[haystack_index + needle_curr] == needle[needle_curr]):
                needle_curr += 1
            if needle_curr == len(needle):
                return haystack_index

        return -1

    def strStr_kmp(self, haystack: str, needle: str) -> int:
        def make_lps(needle: str) -> list:
            # lps[i] = length of the longest proper prefix of needle[:i+1]
            # that is also a suffix of needle[:i+1].
            lps = [0] * len(needle)
            pattern_index, lps_index = 0, 1
            while lps_index < len(needle):
                if needle[pattern_index] == needle[lps_index]:
                    # Extend the current prefix-suffix match by one character.
                    pattern_index += 1
                    lps[lps_index] = pattern_index
                    lps_index += 1
                else:
                    if pattern_index == 0:
                        # No shorter candidate left; lps stays 0 here.
                        lps[lps_index] = 0
                        lps_index += 1
                    else:
                        # Fall back to the next-longest prefix-suffix candidate.
                        pattern_index = lps[pattern_index - 1]
            return lps

        # An empty needle would make needle[0] fail below; return 0,
        # the same convention as str.find.
        if not needle:
            return 0

        lps = make_lps(needle)
        needle_index, haystack_index = 0, 0
        while haystack_index < len(haystack):
            if needle[needle_index] == haystack[haystack_index]:
                needle_index += 1
                haystack_index += 1
            else:
                if needle_index == 0:
                    haystack_index += 1
                else:
                    # Mismatch: move only the needle pointer, never the haystack pointer.
                    needle_index = lps[needle_index - 1]
            if needle_index == len(needle):
                return haystack_index - needle_index

        return -1

    def strStr_rabinkarp(self, haystack: str, needle: str) -> int:
        def polyhash(string: str) -> int:
            # Base-10 polynomial hash; the first character gets the highest power.
            hashval = 0
            for char in string:
                hashval = hashval * 10 + ord(char)
            return hashval

        # Python ints do not overflow, so no modulo is taken here.
        high_power = 10 ** max(len(needle) - 1, 0)
        pattern_hash = polyhash(needle)
        string_hash = 0
        for index in range(len(haystack) - len(needle) + 1):
            if index == 0:
                string_hash = polyhash(haystack[:len(needle)])
            else:
                # Drop the outgoing character, shift by the base, add the incoming one.
                string_hash -= ord(haystack[index - 1]) * high_power
                string_hash = string_hash * 10 + ord(haystack[index + len(needle) - 1])

            # Equal hashes could still be a collision, so confirm the actual characters.
            if string_hash == pattern_hash and haystack[index:index + len(needle)] == needle:
                return index

        return -1


soln = Solution()
# soln.strStr_bruteforce('leetcode', 'efg')
soln.strStr_kmp('aabaaabaaac', 'aabaaac')  # returns 4
# soln.strStr_rabinkarp('leetcode', 'e')
```

## Review

- KMP took me forever to implement and understand
- Even now, I'm not sure I fully understand it
- Better review it at some point