# String Matching: Brute Force
---
### Time Complexity
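- Worst case $O(nm)$
    - `n` is the length of the text and `m` is the length of the pattern
    - Example: `t = aaa..a` and `p = aaab` gives the worst case, because the left-to-right scan matches `m-1` characters at every position before failing on the last one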

In [8]:
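# t: text, p: pattern
# Returns the list of starting indices in t where p occurs
def stringmatch(t,p):
    poslist = []
    # Try every alignment of p against t
    for i in range(len(t)-len(p)+1):
        matched = True
        j = 0
        # Compare left to right, stopping at the first mismatch
        while j < len(p) and matched:
            if t[i+j] != p[j]:
                matched = False
            j = j+1
        if matched:
            poslist.append(i)
    return(poslist)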

stringmatch("which finally halts.  at that point", "at that")

[22]
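
To see how much the choice of pattern matters for the worst case, the sketch below instruments the same left-to-right scan so that it also counts character comparisons. `count_comparisons` is a hypothetical helper introduced here for illustration, and the text of twenty `a`s is an arbitrary choice.

```python
def count_comparisons(t, p):
    """Brute-force left-to-right scan returning (matches, comparisons)."""
    poslist = []
    comparisons = 0
    for i in range(len(t) - len(p) + 1):
        j = 0
        # Stop at the first mismatch, counting every character comparison
        while j < len(p):
            comparisons += 1
            if t[i + j] != p[j]:
                break
            j += 1
        if j == len(p):
            poslist.append(i)
    return poslist, comparisons

text = "a" * 20
print(count_comparisons(text, "aaab"))  # ([], 68): 4 comparisons at each of 17 positions
print(count_comparisons(text, "baaa"))  # ([], 17): fails on the first character every time
```

With this scan direction, a pattern that fails only on its last character is the costly one; a pattern that fails on its first character costs a single comparison per position.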

# String Matching: Boyer-Moore Algorithm
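- See lecture PDF 10.2 for more details on the algorithm.
- Suppose we find a mismatch at `t[i+j]`, where `last[c]` is the index of the rightmost occurrence of `c` in `p`
    - If `j > last[t[i+j]]`, shift by `j - last[t[i+j]]`
    - If `last[t[i+j]] > j`, shift by 1
        - We should never shift `p` to the left!
    - Both cases are handled by `max(j - last[t[i+j]], 1)`, because `j - last[t[i+j]]` is negative in the second case
    - If `t[i+j]` does not occur in `p` at all, shift by `j+1`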
---

### Time Complexity
- Worst case remains $O(nm)$
    - `n` is the length of the text and `m` is the length of the pattern
    - Example: `t = aaa..a` and `p = baaa`
    - The right-to-left scan fails only at `p[0]`, and every shift is by one
    - So there are $O(n)$ positions in `t`, each needing $O(m)$ comparisons
- If the programming language does not have a dictionary
    - Computing `last` becomes a bottleneck, since initializing it takes $O(|Σ|)$ time
    - Here $|Σ|$ is the size of the alphabet
    - This is because every letter not present in `p` has to be initialized with a default value
- In practice, performance improves as the pattern length grows
    - This is because a mismatch can skip more characters at once
    - The longer `p` is, the bigger the skips
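
When no dictionary is available, `last` can be kept in an array indexed by character code. The following is a minimal sketch, assuming an alphabet of 256 character codes; `ALPHABET_SIZE`, `build_last_array` and `shift_on_mismatch` are hypothetical names used only here. Choosing `-1` as the default value makes `j - last[c]` equal to `j + 1` for characters that do not occur in `p`, so the "not in pattern" case needs no separate branch.

```python
ALPHABET_SIZE = 256  # assumption: characters with codes 0..255

def build_last_array(p):
    # O(alphabet size) initialisation: every character starts as "not in p"
    last = [-1] * ALPHABET_SIZE
    # O(m) pass: later occurrences overwrite earlier ones
    for i, c in enumerate(p):
        last[ord(c)] = i
    return last

def shift_on_mismatch(last, j, c):
    # c is the text character that failed to match p[j];
    # never shift left, and shift by j + 1 when c is absent (last is -1)
    return max(j - last[ord(c)], 1)

last = build_last_array("at that")
print(last[ord("t")], last[ord("a")], last[ord("z")])  # 6 5 -1
print(shift_on_mismatch(last, 6, "z"))  # 7: skip past the whole pattern
print(shift_on_mismatch(last, 6, "a"))  # 1: align with the "a" at index 5
```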

In [9]:
# t: text
# p: pattern
# Goal is to get the list of all starting indices in t where the pattern is found
def boyermoore(t,p):
    # last, key: character, value: index of rightmost occurrence of the character in p
    # Example "tht": {t:2, h:1}
    last = {}
    # Traversing p to fill last
    for i in range(len(p)):
        last[p[i]] = i
    

    # List contains indexes where the pattern is matched
    poslist = []
    # Unlike brute force, we do not visit every index with a for loop
    # We advance i ourselves, because a mismatch may let us skip several indices
    i = 0
    
    # len(t)-len(p): where len(p) is size of pattern
    # Going beyond this will never match p as len(p) > len(t[i:])
    while i <= (len(t)-len(p)):
        # We assume we found a match at i
        matched = True
        # We start j at the last index of the pattern
        # This is because we compare from right to left
        j = len(p)-1
        
        # j: goes from len(p)-1 to 0
        # In other words from last index of p to first character of p
        while j >= 0 and matched:
            # Found a mis-match
            if t[i+j] != p[j]:
                matched = False
            # Decrease j for moving from right to left
            # NOTE: This decrements j even after mis-match found
            j = j - 1
    
        # Match found
        if matched:
            poslist.append(i)
            # Shift one position right
            i = i + 1
        else:
            # j was decremented once more after the mismatch was found, so add 1 back
            # Now j is again the index in p where the mismatch occurred
            j = j + 1
            # Do this when mis-match letter is in p
            if t[i+j] in last.keys():
                # See PDF 10.2, slide 5 for a visualization
                # Example: t = "..axa" and p = "bxa"; the mismatch is "a" (in t) against "b" (in p)
                # Aligning the last "a" of p with this "a" in t would shift p backward
                # We never want that, since those positions have already been checked
                # So whenever aligning would shift p backward, we simply shift by 1
                # Otherwise we shift by j-last[t[i+j]] to align the two characters
                i += max(j-last[t[i+j]],1)
            
            # Do this when mis-match letter not in p
            else:
                # Shift i by j+1
                # This ensures next slice do not include mis-match letter
                i += j + 1
    return(poslist)

boyermoore( "which finally halts.  at that point", "at that")

[22]
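
The claim that longer patterns lead to bigger skips can be checked by counting how many alignments the scan examines. This is a sketch only: `bm_alignments` is a hypothetical helper that restates the loop above compactly, and the text of one hundred `x`s is a best-case input chosen for illustration (no pattern character occurs in it).

```python
def bm_alignments(t, p):
    """Count the alignments of p against t examined by the Boyer-Moore scan."""
    last = {c: i for i, c in enumerate(p)}
    i, alignments = 0, 0
    while i <= len(t) - len(p):
        alignments += 1
        j = len(p) - 1
        # Scan right to left until a mismatch or until the whole pattern matched
        while j >= 0 and t[i + j] == p[j]:
            j -= 1
        if j < 0:
            i += 1  # full match: move one position right
        else:
            # A default of -1 gives a shift of j + 1 when the character is not in p
            i += max(j - last.get(t[i + j], -1), 1)
    return alignments

text = "x" * 100
for pattern in ["ab", "abcde", "abcdefghij"]:
    print(len(pattern), bm_alignments(text, pattern))
# 2 50
# 5 20
# 10 10
```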

# Rabin-Karp Algorithm
- The implementation below works only in base 10, that is, when `t` and `p` are strings of the digits `0`-`9`
- The number of characters we can type on a keyboard is roughly 80
- So for general text we would need a base 80 system to convert each character to a digit
- But a large base means doing arithmetic on large (if not very large) numbers
    - To append a ones digit `x` to `n`, we multiply `n` by 80 and add `x`
    - To append two digits (tens and ones), we have to multiply `n` by $80^2$
    - You can see the numbers quickly get very big
- This reduces efficiency.
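
The rolling update is easiest to follow in base 10. The sketch below uses an illustrative digit string (the values are chosen here, not taken from the lecture) and checks that the rolled number always equals the number read directly from the window. The last line shows how many decimal digits a 50-character window would need in base 80.

```python
t = "31415"
m = 3  # window length

# Number for the first window "314"
num = int(t[0:m])

for i in range(1, len(t) - m + 1):
    # Drop the leading digit, shift one place left, append the new digit
    num = num - int(t[i - 1]) * 10 ** (m - 1)
    num = 10 * num + int(t[i + m - 1])
    # The rolled value equals the number read directly from the window
    assert num == int(t[i:i + m])
    print(t[i:i + m], num)

# Number of decimal digits in the weight of the leading character, base 80, window 50
print(len(str(80 ** 49)))
```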

In [17]:
# t: text
# p: pattern
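# Goal is the same: find the starting indices of matches in t
def rabinkarp(t, p): # O(m+n) as two loops
    poslist = []
    # A pattern longer than the text cannot match (and t[i] below would go out of range)
    if len(p) > len(t):
        return poslist
    # Instead of matching strings, we now compare numbers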
    # numt is the number representing the current slice (or block) string
    numt = 0
    # nump is the number representing the pattern string
    nump = 0
    
    # Converting string to number
    for i in range(len(p)): # O(m)
        # int(t[i]) is the number representation of character t[i]
        numt = 10 * numt + int(t[i])
        nump = 10 * nump + int(p[i])
    
    # We handle first case outside loop
    if numt == nump:
        poslist.append(0)
    
    for i in range(1, len(t) - len(p) + 1): # O(n)
        # On each iteration the slice shifts
        # Thus we need to update numt as well
        numt = numt - int(t[i - 1]) * (10 ** (len(p) - 1))
        numt = 10 * numt + int(t[i + len(p) - 1])
        
        if numt == nump:
            poslist.append(i)
    return poslist
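
A common way around the large-number problem is to keep the window value modulo a prime and to confirm every candidate with a direct string comparison. The sketch below is a swapped-in variant, not the implementation above: the name `rabinkarp_mod`, the base 256 and the modulus are all assumptions chosen for illustration, and an empty pattern is assumed to report no matches.

```python
def rabinkarp_mod(t, p, base=256, mod=1_000_000_007):
    n, m = len(t), len(p)
    poslist = []
    if m == 0 or m > n:
        return poslist
    # Weight of the leading character of a window: base^(m-1) modulo mod
    high = pow(base, m - 1, mod)
    numt = nump = 0
    for i in range(m):
        numt = (base * numt + ord(t[i])) % mod
        nump = (base * nump + ord(p[i])) % mod
    for i in range(n - m + 1):
        # Equal values only suggest a match, so verify the actual characters
        if numt == nump and t[i:i + m] == p:
            poslist.append(i)
        if i < n - m:
            # Remove t[i], shift one place, and append t[i+m]
            numt = (numt - ord(t[i]) * high) % mod
            numt = (base * numt + ord(t[i + m])) % mod
    return poslist

print(rabinkarp_mod("which finally halts.  at that point", "at that"))  # [22]
```

Because all values stay below the modulus, each update is constant-time arithmetic; the price is that two different windows can share a value, which is why the string comparison is kept.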

# Knuth-Morris-Pratt Algorithm
- See this video https://www.youtube.com/watch?v=M9azY7YyMqI
---

### Time Complexity: `kmp_fail(p)`
- `j` is incremented `m-1` times in the `while` loop
    - Note that `j` never decrements, it only moves forward
    - But does this mean the `while` loop runs only `m-1` times?
- We also have iterations where `j` is unchanged and `k` decreased
    - ```python
        elif k > 0:
            k = fail[k-1]
    ```
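    - Because of this, the `while` loop may run more than `m-1` times
- Total number of decreases to `k` cannot exceed total number of increments to `k`
    - `k` starts at 0 and never becomes negative; each increment adds exactly 1, and each decrease removes at least 1 (since `fail[k-1] < k`)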
- Overall `k` is incremented at most `m−1` times
    - This is because `k` is incremented only when `j` is incremented *(first `if` block in `while`)*
    - This means `k` cannot decrease more than `m-1` times
- So over the whole run `k` is decreased at most `m-1` times in total -- **Amortised Analysis**
    - In other words, at most `m-1` iterations occur in which `j` is not increased
- Hence the loop runs at most `2(m-1)` times and the overall complexity is $O(m)$
---
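
The amortised bound can be checked empirically. The sketch below repeats the loop of `kmp_fail` with an added iteration counter; `kmp_fail_counted` is a hypothetical name and the sample patterns are arbitrary.

```python
def kmp_fail_counted(p):
    """Same loop as kmp_fail, also returning the number of while iterations."""
    m = len(p)
    fail = [0] * m
    j, k = 1, 0
    iterations = 0
    while j < m:
        iterations += 1
        if p[j] == p[k]:
            fail[j] = k + 1
            j, k = j + 1, k + 1
        elif k > 0:
            k = fail[k - 1]  # j stays put, k strictly decreases
        else:
            j = j + 1
    return fail, iterations

for p in ["abababca", "aaaab", "abcde"]:
    fail, iterations = kmp_fail_counted(p)
    # The iteration count never exceeds 2(m-1)
    assert iterations <= 2 * (len(p) - 1)
    print(p, fail, iterations)
```

For `"aaaab"` the final `b` triggers three successive decreases of `k`, yet the total is 7 iterations against a bound of 8, because those decreases were "paid for" by the three earlier increments.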

### Time Complexity: `find_kmp(t, p)`
- The `while` loop runs $O(n)$ times, by the same amortised argument as for `kmp_fail`
- We also compute `fail`, which takes $O(m)$ time
- Overall, the KMP algorithm runs in time $O(m+n)$

In [27]:
# p: the pattern
# Returns an array fail where fail[j] is the length of the longest proper prefix
# of p[:j+1] that is also a suffix of it (lps)
def kmp_fail(p):
    m = len(p)
    # Initializing lps of all to zero
    fail = [0 for i in range(m)]
    
    # j: Pointer, never comes backward, goes from 1 to m-1
    # k: Pointer, moves back and forth, k+1 signifies LENGTH of current lps
    j,k = 1,0
    
    # Iterate until j reaches end of p
    while j < m:
        if p[j] == p[k]: # k+1 chars match
            # Remember k+1 is the length of the lps ending at index j
            # Record that length in fail
            fail[j] = k+1
            # Both pointers increase
            j,k = j+1,k+1

        # When characters not match and k > 0
        elif k > 0: # find shorter prefix (suffix)
            # We do not increase j in this case
            # We decrease k, new k is lps of k-1
            k = fail[k-1]
        
        # When k = 0 and characters not match
        # Note k = 0 implies an empty string
        else:
            j = j+1
    return(fail)

In [28]:
# t: text, p: pattern
# Returns the starting index in t of the first match, or -1 if there is no match
def find_kmp(t, p):
    # n: length of text
    # m: length of pattern
    n,m = len(t),len(p)
    
    # If pattern is empty return 0
    if m == 0:
        return 0
    
    # Find the fail (lps) array
    fail = kmp_fail(p) # preprocessing
    # index into text
    j = 0 
    # index into pattern
    k = 0 
    
    # Run until j reaches end of text (t)
    while j < n:
        # Characters in p and t match
        if t[j] == p[k]:
            # check if match is complete
            if k == m - 1: 
                # return index of match in text (t)
                return(j - m + 1)
            # extend match
            j,k = j+1,k+1
        
        elif k > 0:
            # decrease k
            k = fail[k-1]
        
        # When k = 0, implies empty match
        else:
            # Go to next index in t
            j = j+1
    
    # reached end without match, no matching pattern found in t
    return(-1) 
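
`find_kmp` stops at the first match, whereas `stringmatch` and `boyermoore` return every position. A possible variant that collects all matches is sketched below; `find_all_kmp` is a hypothetical name, `kmp_fail` is restated so that the block runs on its own, and an empty pattern is assumed to report no matches.

```python
def kmp_fail(p):
    m = len(p)
    fail = [0] * m
    j, k = 1, 0
    while j < m:
        if p[j] == p[k]:
            fail[j] = k + 1
            j, k = j + 1, k + 1
        elif k > 0:
            k = fail[k - 1]
        else:
            j += 1
    return fail

def find_all_kmp(t, p):
    """Return every starting index of p in t, including overlapping matches."""
    n, m = len(t), len(p)
    poslist = []
    if m == 0:
        return poslist
    fail = kmp_fail(p)
    j = k = 0
    while j < n:
        if t[j] == p[k]:
            if k == m - 1:
                poslist.append(j - m + 1)
                # Fall back as after a mismatch, so that overlapping
                # occurrences are still found
                k = fail[k]
                j += 1
            else:
                j, k = j + 1, k + 1
        elif k > 0:
            k = fail[k - 1]
        else:
            j += 1
    return poslist

print(find_all_kmp("which finally halts.  at that point", "at that"))  # [22]
print(find_all_kmp("aaaa", "aa"))  # [0, 1, 2]
```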

# Trie
---

### Code
- Constructor
    - Accepts list of words `S`
    - Adds all the words in `S` to the trie
```python
class Trie:
    def __init__(self,S=[]):
        self.root = {}
        for s in S:
            self.add(s)
```

- `add()` inserts a new word into the trie
    - Accepts a string `s` to add to the Trie
```python
def add(self,s):
    curr = self.root
    s = s + "$"
    for c in s:
        if c not in curr.keys():
            curr[c] = {}
        curr = curr[c]
```

- `query` checks for a complete word
    - True — s is a complete word in T
    - False — s is not found in T
    - None — s is a prefix of some word in T
```python
def query(self,s):
    curr = self.root
    for c in s:
        if c not in curr.keys():
            return(False)
        curr = curr[c]
    if "$" in curr.keys():
        return(True)
    else:
        return(None)
```

In [32]:
class Trie:
    # S is the list of words
    def __init__(self,S=[]):
        # root is a dictionary containing key: character value: another dictionary
        # Basically root is dictionary of children of the current node
        self.root = {}
        # Constructor should initialize the trie by building it
        # Each word in S is added to trie
        for s in S:
            self.add(s)
    
    # How to add the word s to trie?
    def add(self,s):
        # curr is dictionary of children of current node
        # We update curr regularly to traverse the trie
        curr = self.root
        # Append "$" to s to mark the end of word
        # s is the word we want to add
        s = s + "$"
        # We start from the root node and move down to add new node if needed
        for c in s:
            # If c is not a child of current node then we need to add a node with label c
            # Now this node is a child of current node
            if c not in curr.keys():
                # The child node also has children, so initialize it to a dictionary
                curr[c] = {}
            # Now we need to move down by changing to current node to a matching child node which has label c
            curr = curr[c]

    # Helps searching string s in trie
    def query(self,s):
        # We start from the root node
        # curr will help us to traverse the trie
        curr = self.root
        # We start from 1st character in s
        for c in s:
            if c not in curr.keys():
                # c not present, thus word s is not in the trie
                return(False)
            # If c is a child of parent node then update curr to child node with label c
            # This helps to move down (traverse)
            curr = curr[c]
        # We reach this line only if s is present as a unique path in trie
        # curr is the node with label s[-1] (last character in the word s)
        if "$" in curr.keys():
            # If s[-1] has "$" as its child then we have found a match
            return(True)
        else:
            # The path for s exists but does not end a word
            # So s is only a prefix of some longer word in the trie: return None
            return(None)
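
The three possible results of `query` are easiest to see on a small word list. The class is restated compactly below so that the example runs on its own; the words are made up for illustration, and `setdefault` is used here as a shorthand for the "create the child if missing, then move down" step.

```python
class Trie:
    def __init__(self, S=()):
        self.root = {}
        for s in S:
            self.add(s)

    def add(self, s):
        curr = self.root
        for c in s + "$":  # "$" marks the end of a word
            curr = curr.setdefault(c, {})

    def query(self, s):
        curr = self.root
        for c in s:
            if c not in curr:
                return False
            curr = curr[c]
        return True if "$" in curr else None

T = Trie(["car", "card", "care", "dog"])
print(T.query("car"))  # True: a complete word
print(T.query("ca"))   # None: only a prefix of stored words
print(T.query("cat"))  # False: the path breaks at "t"
```

Since both `False` and `None` are falsy, a caller that needs to tell "prefix" from "absent" should test with `is None` or `is False` rather than with `not`.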

# Suffix trie
---

### Code

- Constructor
    - Constructor builds a trie with every suffix of s
    
```python
class SuffixTrie:        
    def __init__(self,s):                
        self.root = {}        
        s = s + "$"        
        for i in range(len(s)):            
            curr = self.root                        
            for c in s[i:]:
                if c not in curr.keys():                    
                    curr[c] = {}                
                curr = curr[c]
```

- `followPath()` follows the path dictated by s
    - Return None if path fails
    - Return last node in the path if it succeeds

```python
def followPath(self,s):        
    curr = self.root        
    for c in s:            
        if c not in curr.keys():                
            return(None)            
        curr = curr[c]                
    return(curr)
```

- `hasSubstring()` returns True if `s` is a substring of the original string
    - Every substring is a prefix of some suffix, so `s` is a substring exactly when `followPath` finds a path
   
```python
def hasSubstring(self,s):        
    return(self.followPath(s) is not None)
```

- `hasSuffix()` returns True if `s` is a suffix of the original string
    - If the node reached by `followPath` has a `$` child, `s` is a suffix

```python
def hasSuffix(self,s):        
    node = self.followPath(s)        
    return(node is not None and "$" in node.keys())

```

In [None]:
class SuffixTrie:
    # Build a suffix trie by adding all possible suffixes to Trie
    # s is a word
    def __init__(self,s):
        # Each node has a dictionary containing its children
        # key: character, value: another dictionary which contains children of key
        self.root = {}
        # Appending "$" to mark the end of string
        s = s + "$"
        # Traverse s from start to end to find suffixes and build the suffix trie
        for i in range(len(s)):
            # curr is the current node, helps to traverse the trie
            curr = self.root
            # take each sub string of s from i to end
            # This gives us all possible suffixes
            for c in s[i:]:
                if c not in curr.keys():
                    # Make a new child node if c not a child of curr
                    curr[c] = {}
                # Move down to next child
                curr = curr[c]

    # Accepts string s
    # Returns None if s is not a path in the suffix trie
    # Returns the last node on the path if s is present in the suffix trie
    def followPath(self,s):
        # We start from the root node
        curr = self.root
        # Visit each character of s to find match in the trie
        for c in s:
            # What if c is not a child of curr (current node)
            if c not in curr.keys():
                # Return None as string is not present in trie
                return(None)
            # Move to next child node
            curr = curr[c]
        # Return the last node
        # The last node will have a label s[-1]
        return(curr)
    
    # hasSubstring returns True if substring s is present in trie
    # If followPath finds a path, s is a valid substring
    def hasSubstring(self,s):
        # False if followPath(s) returns None
        return(self.followPath(s) is not None)
    
    # Returns True if substring s is a suffix
    def hasSuffix(self,s):
        # node is the last node labelled s[-1]
        node = self.followPath(s)
        # If node has a child node labelled "$", implies it's a suffix
        return(node is not None and "$" in node.keys())
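
A short usage sketch on the word `"banana"` shows the difference between the two queries. To keep the block self-contained, the same logic is restated as plain functions over the nested dictionaries; the function names (`build_suffix_trie`, `follow_path`, `has_substring`, `has_suffix`) are hypothetical.

```python
def build_suffix_trie(s):
    root = {}
    s = s + "$"
    for i in range(len(s)):
        curr = root
        # Insert the suffix s[i:], creating nodes as needed
        for c in s[i:]:
            curr = curr.setdefault(c, {})
    return root

def follow_path(root, s):
    curr = root
    for c in s:
        if c not in curr:
            return None
        curr = curr[c]
    return curr

def has_substring(root, s):
    return follow_path(root, s) is not None

def has_suffix(root, s):
    node = follow_path(root, s)
    return node is not None and "$" in node

root = build_suffix_trie("banana")
print(has_substring(root, "nan"))  # True
print(has_substring(root, "nab"))  # False
print(has_suffix(root, "ana"))     # True: "banana" ends with "ana"
print(has_suffix(root, "ban"))     # False: a substring, but not a suffix
```

The comparison with `is not None` matters here: a node without children is an empty dictionary, which is falsy, so a plain truthiness test would wrongly reject a path that ends at such a node.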

# Regular Expressions
- A finite automaton defines a set of words: exactly those it accepts.
- Conversely, for every regular expression an automaton can be constructed that accepts exactly the words matching the pattern.
- Pattern matching then amounts to processing the text with that automaton.
- Python's standard library provides the `re` module for regular expression matching.
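
A brief sketch of the `re` module on the same text used in the earlier cells. The second pattern is an illustrative choice, not one from the lecture.

```python
import re

text = "which finally halts.  at that point"

# re.escape treats the pattern as a literal string, like the matchers above
print([m.start() for m in re.finditer(re.escape("at that"), text)])  # [22]

# A genuine pattern: words that start with "th"
print(re.findall(r"\bth\w*", text))  # ['that']
```

Note that `re.finditer` reports non-overlapping matches only, so for a text such as `"aaaa"` and pattern `"aa"` it yields positions 0 and 2, unlike the matchers above, which also report position 1.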

# Notes

In [38]:
print((int is not None))
print((None is not None))

True
False
