# Pattern matching algorithms
Given a text string T and a pattern string P, determine whether P occurs in T. If it does, return the (0-based) position of the first match; otherwise return -1.

Examples:
1. T = 'this is an example', P = 'is', return 2
2. T = 'aaaaaaaaaa', P = 'aaab', return -1

# Brute force
Try every possible starting position in T and compare P against T character by character from there. The worst-case time complexity is O(m*(n-m+1)), where m is the length of P and n is the length of T.

In [10]:
def brute_force_match(P, T):
    start_pos = 0
    while start_pos < len(T):
        flag = True
        for pos in range(start_pos, start_pos+len(P)):
            if pos>=len(T):
                return -1
            elif T[pos] != P[pos-start_pos]:
                flag = False
                break
        if flag:
            return start_pos
        else:
            start_pos += 1
    return -1

In [11]:
T = 'this is an example'
P = 'is'

In [12]:
brute_force_match(P,T)

2

In [13]:
T = 'ababababababababababababa'
P = 'abababc'

In [14]:
brute_force_match(P,T)

-1

In [17]:
T = 'abababababababababababababc'
P = 'abababc'

In [18]:
brute_force_match(P,T)

20
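To make the O(m*(n-m+1)) bound concrete, here is a small sketch that counts character comparisons. The function name `count_brute_force_comparisons` is made up for this illustration and is not part of the notebook above; it follows the same idea as `brute_force_match` but only tries the n-m+1 alignments where P fits.

```python
def count_brute_force_comparisons(P, T):
    """Return (match position, number of character comparisons)."""
    m, n = len(P), len(T)
    comparisons = 0
    for start in range(n - m + 1):      # n-m+1 candidate alignments
        matched = True
        for k in range(m):
            comparisons += 1
            if T[start + k] != P[k]:
                matched = False
                break
        if matched:
            return start, comparisons
    return -1, comparisons

# Worst case: every alignment matches m-1 characters and fails on the last one.
n, m = 20, 5
T = 'a' * n
P = 'a' * (m - 1) + 'b'
pos, comparisons = count_brute_force_comparisons(P, T)
print(pos, comparisons, m * (n - m + 1))   # -1 80 80
```

With this input each of the 16 alignments costs exactly 5 comparisons, so the count hits the bound.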

# Rabin-Karp algorithm

Idea: the same as the brute-force algorithm, but instead of comparing every character at each position, we hash each length-m substring of T to a number and compare it with the hash of P. The hash of the next substring is updated in O(1) from the previous one. Only when the hashes match do we check whether the strings match exactly.  
The worst-case complexity is still O(m*(n-m+1)), since a hash collision could occur at every substring.  
However, the average complexity is O(n), provided the hash function produces few collisions.

In [38]:
def Rabin_Karp_match(P, T):
    # uses hash function sum(ord(ch)) for characters in substring with length m
    
    if len(P)>len(T):
        return -1
    
    hash_P = 0
    for i in range(len(P)):
        hash_P += ord(P[i])
        
    hash_T = 0
    
    for i in range(len(P)):
        hash_T += ord(T[i])
        
    start_pos = 0
    end_pos = len(P)
    while True:
        if hash_T == hash_P:
            flag = True
            for i in range(len(P)):
                if P[i] != T[start_pos+i]:
                    flag = False
                    break
            if flag:
                return start_pos
        if end_pos<len(T):
            hash_T = hash_T - ord(T[start_pos]) + ord(T[end_pos])
            start_pos += 1
            end_pos += 1
        else:
            break
        
    return -1

In [44]:
Rabin_Karp_match('is','this is an example')

2

In [41]:
T = 'abababababababababababababc'
P = 'abababc'
Rabin_Karp_match(P,T)

20

In [46]:
Rabin_Karp_match('t','This is an example to show it.')

19
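The additive hash `sum(ord(ch))` used above gives the same value for any two substrings with the same characters in a different order (for example 'ab' and 'ba'), so collisions can be frequent. A common alternative is a polynomial rolling hash. The sketch below is one possible variant, not the notebook's implementation; the name `rabin_karp_poly` and the default `base` and `mod` values are assumptions chosen for illustration.

```python
def rabin_karp_poly(P, T, base=256, mod=1_000_000_007):
    """Rabin-Karp with a polynomial rolling hash (illustrative sketch)."""
    m, n = len(P), len(T)
    if m > n:
        return -1
    # Weight of the leading character in a window of length m.
    high = pow(base, max(m - 1, 0), mod)
    hash_P = hash_T = 0
    for i in range(m):
        hash_P = (hash_P * base + ord(P[i])) % mod
        hash_T = (hash_T * base + ord(T[i])) % mod
    for start in range(n - m + 1):
        # Compare hashes first; confirm with a real string comparison.
        if hash_P == hash_T and T[start:start + m] == P:
            return start
        if start + m < n:
            # Drop T[start], shift the window, append T[start + m].
            hash_T = ((hash_T - ord(T[start]) * high) * base
                      + ord(T[start + m])) % mod
    return -1

print(rabin_karp_poly('is', 'this is an example'))              # 2
print(rabin_karp_poly('t', 'This is an example to show it.'))   # 19
```

Because the position of each character now affects the hash, reordered substrings no longer collide systematically, although collisions remain possible and the explicit string comparison is still required.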

# Knuth-Morris-Pratt (KMP) 

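For each i, compute the length of the longest proper prefix of P[0:i] (the first i+1 characters of P) that is also a suffix of it. "Proper" means the prefix must be shorter than the string itself. For example,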
1. P = 'abcde', prefix = 00000
2. P = 'abcabbcab', prefix = 000120012, at position 4, the string is 'abcab' and 'ab' is the longest suffix and prefix
3. P = 'aaaaaaaab', prefix = 012345670, at position 3, the string is 'aaaa' and 'aaa' is the longest suffix and prefix
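The definition can be checked directly with a deliberately naive sketch that tries every candidate length. The helper name `prefix_by_definition` is hypothetical; it is far slower than the linear-time construction and only serves to make the definition concrete.

```python
def prefix_by_definition(P):
    """For each i, length of the longest proper prefix of P[:i+1]
    that is also a suffix of P[:i+1] (naive check of every length)."""
    table = []
    for i in range(len(P)):
        s = P[:i + 1]
        best = 0
        for k in range(1, len(s)):      # proper prefixes only: k < len(s)
            if s[:k] == s[-k:]:
                best = k
        table.append(best)
    return table

print(prefix_by_definition('abcde'))      # [0, 0, 0, 0, 0]
print(prefix_by_definition('abcabbcab'))  # [0, 0, 0, 1, 2, 0, 0, 1, 2]
print(prefix_by_definition('aaaaaaaab'))  # [0, 1, 2, 3, 4, 5, 6, 7, 0]
```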

We keep an index prepos equal to the length of the longest prefix/suffix of P[0:pos-1]. To extend to P[0:pos], we only need to compare P[prepos] with P[pos]:
* if the characters are the same, the longest prefix/suffix of P[0:pos] is one longer than that of P[0:pos-1]
* if they are not the same, we fall back to the longest prefix/suffix of P[0:prepos-1], i.e. set prepos = prefix[prepos-1], and compare again; if prepos is already 0, the length at pos is 0

Time complexity is O(m).

In [53]:
def find_prefix(P):
    # prefix[i] = length of the longest proper prefix of P[0..i] that is also its suffix
    prefix = [0] * len(P)
    pos = 1
    prepos = 0
    while pos<len(P):
        if P[pos]==P[prepos]: # the next character matches: extend the prefix by 1 and move to the next position
            prefix[pos] = prepos + 1
            prepos += 1
            pos += 1
        else:                 # the next character does not match
            if prepos == 0:   # nothing left to fall back to, so the prefix/suffix length stays zero
                pos += 1
            else:             # fall back to the prefix of the previous prefix string and compare again
                prepos = prefix[prepos-1]
    return prefix

In [54]:
find_prefix('abcabbcab')

[0, 0, 0, 1, 2, 0, 0, 1, 2]

In [55]:
find_prefix('ababcababcabababc')

[0, 0, 1, 2, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5]

Next, we use the precomputed prefix/suffix lengths to search for the pattern. At each step we compare P[posP] with T[posT]. On a match, we advance both indices by 1 and compare the next pair. On a mismatch, if posP is 0 we advance posT; otherwise we set posP to the length of the longest prefix/suffix of P[0:posP-1], i.e. prefix[posP-1], and compare again without moving posT.

The time complexity is O(n), as the while loop is executed at most 2n times. The reasons are: 
* on a match (or a mismatch with posP equal to 0), posT increases, and this can happen at most n times. 
* on any other mismatch, posP rolls back while posT stays put. Each rollback decreases posP by at least 1, so the total number of rollbacks is ***at most*** the number of previous matches. Amortized over the whole run, each position posT therefore takes part in at most 2 comparisons.
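The 2n bound can be checked empirically with an instrumented version of the search loop. This is a sketch under assumptions: the name `kmp_count_iterations` is made up here, the prefix table is rebuilt inline with a compact loop so the block is self-contained, and both strings are assumed to be non-empty.

```python
def kmp_count_iterations(P, T):
    """Return (match position, loop iterations). Assumes P and T are non-empty."""
    # Build the prefix table (longest proper prefix that is also a suffix).
    prefix = [0] * len(P)
    k = 0
    for i in range(1, len(P)):
        while k > 0 and P[i] != P[k]:
            k = prefix[k - 1]
        if P[i] == P[k]:
            k += 1
        prefix[i] = k

    posT = posP = iterations = 0
    while posT < len(T):
        iterations += 1
        if P[posP] == T[posT]:          # match: advance both indices
            posP += 1
            posT += 1
            if posP == len(P):
                return posT - posP, iterations
        elif posP == 0:                 # mismatch at the start of P: advance posT
            posT += 1
        else:                           # mismatch: roll posP back, keep posT
            posP = prefix[posP - 1]
    return -1, iterations

for P, T in [('aaab', 'aaaaaaaaaa'), ('abababc', 'ab' * 13 + 'c'), ('bba', 'aaaaa')]:
    pos, iterations = kmp_count_iterations(P, T)
    print(P, pos, iterations, iterations <= 2 * len(T))
```

One way to see the bound: the quantity 2*posT - posP grows by at least 1 in every iteration and never exceeds 2n.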

In [70]:
def KMP_match(P,T):
    prefix = find_prefix(P)
    
    if P=='' or T=='':
        return -1
    posT = 0
    posP = 0
    while True:
        if P[posP] == T[posT]:   # Current positions match
            posP += 1
            posT += 1
            if posP == len(P):
                return posT-posP
        else:                    # Current positions don't match
            if posP == 0:        
                posT += 1
            else:
                posP = prefix[posP-1]
        if posT>=len(T):
            break
    return -1

In [57]:
KMP_match('is','this is an example')

2

In [58]:
KMP_match('t','This is an example to show it.')

19

In [68]:
T = 'abababababababababababababc'
P = 'abababc'
KMP_match(P,T)

20

In [69]:
find_prefix(P)

[0, 0, 1, 2, 3, 4, 0]

In [67]:
KMP_match("bba","aaaaa")

-1

# Boyer-Moore algorithm